# Data Analysis for Software Engineers

# 1. Pandas Tutorial

<img src="img/pandas.png" width="600">

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Series

`Pandas` is a library for working with tabular data in Python. Its basic structures are `Series` and `DataFrame`. <br/>
A `Series` is an indexed one-dimensional array of values.

A `Series` can be created from an array:

In [2]:
arr = np.random.rand(5)
ser = pd.Series(arr)

In [3]:
ser

0    0.506989
1    0.028978
2    0.551114
3    0.512980
4    0.366621
dtype: float64

The left column is the index; the right column holds the values.

In [4]:
ser.index

RangeIndex(start=0, stop=5, step=1)

In [5]:
ser.values

array([0.50698924, 0.02897829, 0.55111371, 0.5129798 , 0.36662078])

In [6]:
# Take several elements by index label (the end label is included)
ser.loc[3:5]

3    0.512980
4    0.366621
dtype: float64

In [7]:
# Take several elements by position (the end position is excluded)
ser.iloc[3:5]

3    0.512980
4    0.366621
dtype: float64
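The two cells above return the same rows only because the default index labels coincide with the positions. A minimal sketch of the difference, using a hypothetical series `s` (not part of the original notebook) whose integer labels differ from its positions:

```python
import pandas as pd

# Hypothetical series: labels 10..50 sit at positions 0..4
s = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5], index=[10, 20, 30, 40, 50])

by_label = s.loc[20:40]    # label-based slice, end label included
by_position = s.iloc[1:3]  # position-based slice, end position excluded

print(by_label.index.tolist())     # [20, 30, 40]
print(by_position.index.tolist())  # [20, 30]
```

With `loc` the slice bounds are labels and both ends are included; with `iloc` they are positions and follow the usual Python half-open convention.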

One more example:

In [8]:
ser = pd.Series(np.random.rand(8), index=['s', 'o', 'f', 't', 'w', 'a', 'r', 'e'])

In [9]:
ser

s    0.088484
o    0.508483
f    0.345642
t    0.234989
w    0.192007
a    0.967440
r    0.644168
e    0.397441
dtype: float64

In [10]:
# Take several elements by index label (the end label is included)
ser.loc['o':'t']

o    0.508483
f    0.345642
t    0.234989
dtype: float64

In [11]:
# Take several elements by position (the end position is excluded)
ser.iloc[3:5]

t    0.234989
w    0.192007
dtype: float64

## DataFrame

A `DataFrame` is an indexed two-dimensional table in which each column is a `Series`.
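As a small illustration of that relationship, the sketch below builds a frame from two hypothetical `Series` (`a` and `b`, invented for this example) and reads one column back out:

```python
import pandas as pd

# Two hypothetical Series sharing the same index
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# Each dictionary key becomes a column name
frame = pd.DataFrame({'A': a, 'B': b})

col = frame['A']
print(type(col).__name__)             # Series
print(col.index.equals(frame.index))  # True
```

Every column shares the row index of the frame, while each column keeps its own dtype.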

In [12]:
# Create a data frame
df = pd.DataFrame(np.random.randn(10, 3),
                  index=range(10),
                  columns=['A', 'B', 'C'])

In [13]:
# Show the first 5 rows
df.head(5)

| | A | B | C |
|---|---|---|---|
| 0 | 3.049501 | 0.113386 | -0.475713 |
| 1 | -0.827165 | -0.012722 | 0.664108 |
| 2 | 1.486224 | 0.276726 | 0.634813 |
| 3 | -0.967201 | 0.915769 | 0.240875 |
| 4 | 0.538395 | 0.221779 | -0.288014 |


In [14]:
# Print the index and columns of the data frame
print("Index: ", df.index)
print("Columns: ", df.columns)

Index:  RangeIndex(start=0, stop=10, step=1)
Columns:  Index(['A', 'B', 'C'], dtype='object')


In [15]:
# Select elements by index label and column name
df.loc[1:3, ['A', 'B']]

| | A | B |
|---|---|---|
| 1 | -0.827165 | -0.012722 |
| 2 | 1.486224 | 0.276726 |
| 3 | -0.967201 | 0.915769 |


In [16]:
# Select elements by their position in the data frame
df.iloc[1:3, 0:2]

| | A | B |
|---|---|---|
| 1 | -0.827165 | -0.012722 |
| 2 | 1.486224 | 0.276726 |


In [17]:
# Transpose the data frame
df.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 3.049501 | -0.827165 | 1.486224 | -0.967201 | 0.538395 | -0.010515 | -0.101982 | -0.365163 | -1.754301 | -1.647303 |
| B | 0.113386 | -0.012722 | 0.276726 | 0.915769 | 0.221779 | -0.340152 | -0.657546 | 1.668015 | 0.775294 | 0.332427 |
| C | -0.475713 | 0.664108 | 0.634813 | 0.240875 | -0.288014 | -1.614617 | 0.573497 | -1.470215 | 0.224301 | 0.004327 |


Aggregation functions are applied to each column by default (`axis=0`).

In [18]:
# Aggregation on all columns
df.mean()

A   -0.059951
B    0.329298
C   -0.150664
dtype: float64

In [19]:
# Aggregation on 'A' column only
df.A.mean()

-0.05995082728831571
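To make the default direction of aggregation concrete, here is a minimal sketch on a hypothetical 3×2 frame `small` (invented for this example), comparing the default with `axis=1`:

```python
import pandas as pd

# Hypothetical frame with easy-to-check values
small = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]})

col_means = small.mean()        # default axis=0: one value per column
row_means = small.mean(axis=1)  # axis=1: one value per row

print(col_means.to_dict())  # {'A': 2.0, 'B': 5.0}
print(row_means.tolist())   # [2.5, 3.5, 4.5]
```

The attribute form `small.A` is shorthand for `small['A']`; the bracket form also works for column names that are not valid Python identifiers.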

# 2. Pandas with Real Data

### LSD and Academic Performance

The [article](http://www.ncbi.nlm.nih.gov/pubmed/5676802) "Correlation of Performance Test Scores with Tissue Concentration of Lysergic Acid Diethylamide in Human Subjects" was published in 1968.

Its [sample](https://www.dropbox.com/s/ui14yeeckbc6z7c/drugs-and-math.csv?dl=0) contains 7 observations.

In [20]:
# Read data from a .csv file 
df = pd.read_csv('drugs-and-math.csv', index_col=0, sep=',')

In [21]:
# Show the first 5 rows
df.head()

| | Drugs | Score |
|---|---|---|
| 0 | 1.17 | 78.93 |
| 1 | 2.97 | 58.2 |
| 2 | 3.26 | 67.47 |
| 3 | 4.69 | 37.47 |
| 4 | 5.83 | 45.65 |
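The effect of `index_col=0` can be checked without the file. The sketch below retypes the five rows shown above as CSV text; the exact layout of the real file, including the unnamed leading column, is an assumption:

```python
import io
import pandas as pd

# The five rows displayed above, written as CSV text (assumed file layout)
csv_text = """,Drugs,Score
0,1.17,78.93
1,2.97,58.2
2,3.26,67.47
3,4.69,37.47
4,5.83,45.65
"""

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.columns.tolist())     # ['Drugs', 'Score']
print(without_index.columns.tolist())  # ['Unnamed: 0', 'Drugs', 'Score']
```

Without `index_col=0`, the first column is read as ordinary data and receives the placeholder name `Unnamed: 0`.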


In [22]:
print(df.shape)    # Size of the data frame
print(df.columns)  # List of the columns
print(df.index)    # Index of rows in the data frame

(7, 2)
Index(['Drugs', 'Score'], dtype='object')
Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')


Let's sort the `DataFrame` by `Score` in descending order.

In [23]:
# Sort by the 'Score' column in descending order
df = df.sort_values('Score', ascending=False)

In [24]:
df.head()

| | Drugs | Score |
|---|---|---|
| 0 | 1.17 | 78.93 |
| 2 | 3.26 | 67.47 |
| 1 | 2.97 | 58.2 |
| 4 | 5.83 | 45.65 |
| 3 | 4.69 | 37.47 |
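Note that sorting reorders the rows but keeps their original index labels, so `loc` and `iloc` no longer agree. A minimal sketch on the first three rows of the data (the names `scores`, `ranked` and `renumbered` are introduced here for illustration):

```python
import pandas as pd

# First three rows of the data shown above
scores = pd.DataFrame({'Drugs': [1.17, 2.97, 3.26],
                       'Score': [78.93, 58.2, 67.47]})

ranked = scores.sort_values('Score', ascending=False)
print(ranked.index.tolist())     # [0, 2, 1]
print(ranked.loc[1, 'Score'])    # 58.2  (row labelled 1)
print(ranked['Score'].iloc[1])   # 67.47 (second row after sorting)

# Optionally renumber the rows after sorting
renumbered = ranked.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Whether to call `reset_index(drop=True)` depends on whether the original labels still carry meaning for later steps.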


In [25]:
# Show a report with several statistics
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Drugs | 7.0 | 4.332857 | 1.935413 | 1.17 | 3.115 | 4.69 | 5.915 | 6.41 |
| Score | 7.0 | 50.087143 | 18.610854 | 29.97 | 35.195 | 45.65 | 62.835 | 78.93 |
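Each entry of the report corresponds to an ordinary `Series` method. The sketch below checks this on the five `Score` values displayed earlier; because it uses only that excerpt rather than all 7 rows, its numbers differ from the table above:

```python
import math
import pandas as pd

# Five Score values from the rows displayed earlier (an excerpt, not the full sample)
score = pd.Series([78.93, 58.2, 67.47, 37.47, 45.65], name='Score')
report = score.describe()

print(report.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# 'std' is the sample standard deviation (ddof=1); the percentages are quantiles
same_std = math.isclose(report['std'], score.std(ddof=1))
same_median = math.isclose(report['50%'], score.median())
same_q25 = math.isclose(report['25%'], score.quantile(0.25))
print(same_std, same_median, same_q25)  # True True True
```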
