# Data Analysis for Software Engineers

# 1. Pandas Tutorial

<img src="img/pandas.png" width="600">

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Series

`Pandas` helps you work with tabular data in Python. Its basic structures are `Series` and `DataFrame`. <br/>
A `Series` is an indexed one-dimensional array of values.

A `Series` can be created from an array:

In [2]:
arr = np.random.rand(5)
ser = pd.Series(arr)

In [3]:
ser

0    0.857896
1    0.890765
2    0.851902
3    0.174322
4    0.406879
dtype: float64

The left column is the index.

In [4]:
ser.index

RangeIndex(start=0, stop=5, step=1)

In [5]:
ser.values

array([0.85789588, 0.89076459, 0.85190222, 0.17432244, 0.40687928])

In [6]:
# Select elements by index label (the end label is included)
ser.loc[3:5]

3    0.174322
4    0.406879
dtype: float64

In [7]:
# Select elements by position (the end position is excluded)
ser.iloc[3:5]

3    0.174322
4    0.406879
dtype: float64
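The two selections return the same rows here only because the default index stops at label 4. In general `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end position. A minimal sketch with illustrative values (not the random numbers used in the cells):

```python
import numpy as np
import pandas as pd

# Five values with the default RangeIndex 0..4
ser = pd.Series(np.arange(10, 15))

by_label = ser.loc[1:3]      # labels 1, 2 and 3 (end label included)
by_position = ser.iloc[1:3]  # positions 1 and 2 (end position excluded)

print(by_label.tolist())     # [11, 12, 13]
print(by_position.tolist())  # [11, 12]
```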

Another example, this time with string labels as the index:

In [8]:
ser = pd.Series(np.random.rand(8), index=['s', 'o', 'f', 't', 'w', 'a', 'r', 'e'])

In [9]:
ser

s    0.679546
o    0.778898
f    0.538662
t    0.430176
w    0.910613
a    0.262670
r    0.871588
e    0.441173
dtype: float64

In [10]:
# Select elements by index label (the end label is included)
ser.loc['o':'t']

o    0.778898
f    0.538662
t    0.430176
dtype: float64

In [11]:
# Select elements by position (the end position is excluded)
ser.iloc[3:5]

t    0.430176
w    0.910613
dtype: float64
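With string labels, a label slice follows the order in which the labels appear in the index, not alphabetical order. The sketch below shows one way to translate a label slice into the equivalent positional slice; the integer values and the names `start` and `stop` are illustrative only.

```python
import pandas as pd

# Same labels as in the example, with simple integer values
ser = pd.Series(range(8), index=list('software'))

# Label slice: from 'o' to 't' in index order, both ends included
by_label = ser.loc['o':'t']

# Equivalent positional slice: look up positions, add 1 to include the end
start = ser.index.get_loc('o')
stop = ser.index.get_loc('t') + 1
by_position = ser.iloc[start:stop]

print(by_label.index.tolist())       # ['o', 'f', 't']
print(by_label.equals(by_position))  # True
```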

## DataFrame

A `DataFrame` is an indexed two-dimensional table in which each column is a `Series`.

In [12]:
# Create a data frame
df = pd.DataFrame(np.random.randn(10, 3),
                  index=range(10),
                  columns=['A', 'B', 'C'])

In [13]:
# Show the first 5 rows
df.head(5)

| | A | B | C |
|---|---|---|---|
| 0 | -0.67169 | 1.009903 | -1.747954 |
| 1 | 1.244333 | -0.21899 | 0.375456 |
| 2 | 0.526325 | 0.477482 | 0.789586 |
| 3 | -1.556308 | -0.572554 | -0.282444 |
| 4 | -0.314908 | -1.725073 | 0.134216 |


In [14]:
# Print the index and columns of the data frame
print("Index: ", df.index)
print("Columns: ", df.columns)

Index:  RangeIndex(start=0, stop=10, step=1)
Columns:  Index(['A', 'B', 'C'], dtype='object')
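To make the relationship between the two structures concrete, the sketch below pulls one column out of a data frame and checks that it is a `Series` sharing the frame's row index. The names `col` and `rebuilt` are illustrative, and the values are random.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 3),
                  index=range(10),
                  columns=['A', 'B', 'C'])

# A single column is a Series that shares the row index of the frame
col = df['A']
print(type(col).__name__)          # Series
print(col.name)                    # A
print(col.index.equals(df.index))  # True

# A frame can also be assembled from a dict of Series
rebuilt = pd.DataFrame({name: df[name] for name in df.columns})
print(rebuilt.shape)               # (10, 3)
```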


In [15]:
# Select elements by index label and column name (end label included)
df.loc[1:3, ['A', 'B']]

| | A | B |
|---|---|---|
| 1 | 1.244333 | -0.21899 |
| 2 | 0.526325 | 0.477482 |
| 3 | -1.556308 | -0.572554 |


In [16]:
# Select elements by row and column position (end positions excluded)
df.iloc[1:3, 0:2]

| | A | B |
|---|---|---|
| 1 | 1.244333 | -0.21899 |
| 2 | 0.526325 | 0.477482 |


In [17]:
# Transpose the data frame
df.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| A | -0.67169 | 1.244333 | 0.526325 | -1.556308 | -0.314908 | 2.413872 | 2.016924 | 0.324842 | 0.007645 | 0.995691 |
| B | 1.009903 | -0.21899 | 0.477482 | -0.572554 | -1.725073 | 1.427378 | -0.455745 | 0.383101 | -0.693051 | -0.361286 |
| C | -1.747954 | 0.375456 | 0.789586 | -0.282444 | 0.134216 | -0.719024 | -0.132391 | 1.754869 | -0.401459 | 1.033772 |


Aggregations are computed per column by default.

In [18]:
# Aggregation on all columns
df.mean()

A    0.498673
B   -0.072883
C    0.080463
dtype: float64

In [19]:
# Aggregation on 'A' column only
df.A.mean()

0.49867269072337095
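The direction of an aggregation is controlled by the `axis` argument. The sketch below uses a small hand-made frame (illustrative values, not the random data from the cells) to contrast the default per-column behaviour with a per-row aggregation, and to compute several statistics at once with `agg`.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [10.0, 20.0, 30.0]})

col_means = df.mean()        # axis=0 (default): one value per column
row_means = df.mean(axis=1)  # axis=1: one value per row

print(col_means.to_dict())   # {'A': 2.0, 'B': 20.0}
print(row_means.tolist())    # [5.5, 11.0, 16.5]

# Several aggregations at once; the result is indexed by function name
summary = df.agg(['mean', 'min', 'max'])
print(summary.loc['max', 'B'])  # 30.0
```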

# 2. Pandas with Real Data

### LSD and Academic Performance

The [article](http://www.ncbi.nlm.nih.gov/pubmed/5676802) "Correlation of Performance Test Scores with Tissue Concentration of Lysergic Acid Diethylamide in Human Subjects" was published in 1968.

It provides a [sample](https://www.dropbox.com/s/ui14yeeckbc6z7c/drugs-and-math.csv?dl=0) of 7 observations.

In [20]:
# Run it if you are in Colab
# !wget 'https://raw.githubusercontent.com/hushchyn-mikhail/hse_se_ml/s01/2020/s01-intro-to-python/drugs-and-math.csv'

In [21]:
# Read data from a .csv file 
df = pd.read_csv('drugs-and-math.csv', index_col=0, sep=',')
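The `index_col=0` argument tells `read_csv` to use the first column of the file as the row index instead of loading it as data. The sketch below illustrates the difference on an in-memory stand-in for the file. It assumes the first header field of the CSV is empty, which is a guess about the file layout; the values are the first five rows of the sample.

```python
import io
import pandas as pd

# Stand-in for the file contents (assumed layout: empty first header field)
csv_text = """,Drugs,Score
0,1.17,78.93
1,2.97,58.2
2,3.26,67.47
3,4.69,37.47
4,5.83,45.65
"""

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.columns.tolist())     # ['Drugs', 'Score']
print(without_index.columns.tolist())  # ['Unnamed: 0', 'Drugs', 'Score']
```

Without `index_col`, pandas keeps the first column as ordinary data and gives it a placeholder name.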

In [22]:
# Show the first 5 rows
df.head()

| | Drugs | Score |
|---|---|---|
| 0 | 1.17 | 78.93 |
| 1 | 2.97 | 58.2 |
| 2 | 3.26 | 67.47 |
| 3 | 4.69 | 37.47 |
| 4 | 5.83 | 45.65 |


In [23]:
print(df.shape)    # Size of the data frame
print(df.columns)  # List of the columns
print(df.index)    # Index of rows in the data frame

(7, 2)
Index(['Drugs', 'Score'], dtype='object')
Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')


Let's sort the `DataFrame` by `Score` in descending order:

In [24]:
# Sorting by 'Score' column
df = df.sort_values('Score', ascending=False)
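Sorting reorders the rows but keeps their original index labels, so label-based and position-based lookups no longer agree afterwards. A small sketch using the first five rows of the sample; the names `ranked` and `renumbered` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Drugs': [1.17, 2.97, 3.26, 4.69, 5.83],
                   'Score': [78.93, 58.20, 67.47, 37.47, 45.65]})

ranked = df.sort_values('Score', ascending=False)
print(ranked.index.tolist())      # [0, 2, 1, 4, 3]

# .loc looks up the row labelled 1, .iloc the second row after sorting
print(ranked.loc[1, 'Score'])     # 58.2
print(ranked['Score'].iloc[1])    # 67.47

# Optionally renumber the rows to match the new order
renumbered = ranked.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```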

In [25]:
df.head()

| | Drugs | Score |
|---|---|---|
| 0 | 1.17 | 78.93 |
| 2 | 3.26 | 67.47 |
| 1 | 2.97 | 58.2 |
| 4 | 5.83 | 45.65 |
| 3 | 4.69 | 37.47 |


In [26]:
# Show a report with several statistics
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Drugs | 7.0 | 4.332857 | 1.935413 | 1.17 | 3.115 | 4.69 | 5.915 | 6.41 |
| Score | 7.0 | 50.087143 | 18.610854 | 29.97 | 35.195 | 45.65 | 62.835 | 78.93 |
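
Each entry of the `describe()` report corresponds to an individual aggregation method. The sketch below checks this on the five `Score` values from the first rows of the sample, so the numbers differ from the seven-row report; note that `std` is the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

score = pd.Series([78.93, 58.20, 67.47, 37.47, 45.65], name='Score')
report = score.describe()

print(report.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# Each statistic matches the corresponding Series method
print(np.isclose(report['mean'], score.mean()))         # True
print(np.isclose(report['std'], score.std(ddof=1)))     # True
print(np.isclose(report['25%'], score.quantile(0.25)))  # True
print(np.isclose(report['50%'], score.median()))        # True
print(report['min'], report['max'])                     # 37.47 78.93
```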
