# Getting started with pandas

In [1]:
import numpy as np
import pandas as pd

## Playing around with Series

To start, create a Series from the following dict:

In [2]:
d = {'a': 'Poisson', 'b': 'Binomial', 'c': 'Geometric', 'd': 'Logistic'}
d

{'a': 'Poisson', 'b': 'Binomial', 'c': 'Geometric', 'd': 'Logistic'}

In [3]:
s = 
s

a      Poisson
b     Binomial
c    Geometric
d     Logistic
dtype: object
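
One possible way to fill in the cell above, given as a sketch rather than the intended solution: passing a dict to `pd.Series` uses the keys as the index and the values as the data.

```python
import pandas as pd

d = {'a': 'Poisson', 'b': 'Binomial', 'c': 'Geometric', 'd': 'Logistic'}

# Dict keys become the index labels, dict values become the data.
s = pd.Series(d)
print(s)
```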

Convert the distribution names to lower case:

In [4]:
s = 

Now sort the series by value:

In [5]:
s.

In [6]:
s

b     binomial
c    geometric
d     logistic
a      poisson
dtype: object
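
A sketch of how the two steps above might be done, assuming the `.str` accessor for the lower-casing and `sort_values` for the sort. The Series is rebuilt here so the example runs on its own.

```python
import pandas as pd

s = pd.Series({'a': 'Poisson', 'b': 'Binomial', 'c': 'Geometric', 'd': 'Logistic'})

# .str exposes vectorized string methods on every element.
s = s.str.lower()

# sort_values returns a new, sorted Series, so reassign it
# (s.sort_values(inplace=True) is the in-place variant).
s = s.sort_values()
print(s)
```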

## DataFrame: Indexing

Create a DataFrame with 6 rows and 3 columns, filled with values drawn from a standard normal distribution. Name the columns 'col1', 'col2' and 'col3', and index the rows by month names.

In [8]:
df = 
df

,col1,col2,col3
jan,0.947332,0.186477,-2.871911
feb,0.286528,-1.501146,-0.089162
march,-0.233333,0.542328,-0.76331
april,-0.288128,0.048626,0.41252
mai,-0.077549,1.768271,-0.018231
june,-0.360088,-0.646813,0.703988
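
One way such a frame could be built is sketched below. The seed is an arbitrary assumption, so the numbers will not match the table above; the month labels are copied from it.

```python
import numpy as np
import pandas as pd

# The seed is arbitrary; the original notebook's random state is unknown.
rng = np.random.default_rng(0)

months = ['jan', 'feb', 'march', 'april', 'mai', 'june']
df = pd.DataFrame(rng.standard_normal((6, 3)),   # 6 rows x 3 columns of N(0, 1) draws
                  index=months,
                  columns=['col1', 'col2', 'col3'])
print(df)
```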


Now display:
- col1 only
- the march row only
- col1 of the march row only
- the second row only (position 1 when counting from 0)
- col2 of the second row only

In [9]:
# display col1

jan      0.947332
feb      0.286528
march   -0.233333
april   -0.288128
mai     -0.077549
june    -0.360088
Name: col1, dtype: float64

In [10]:
# display 'march' row

col1   -0.233333
col2    0.542328
col3   -0.763310
Name: march, dtype: float64

In [11]:
# display col1 of 'march' row

-0.23333329284822824

In [12]:
# display second row

col1    0.286528
col2   -1.501146
col3   -0.089162
Name: feb, dtype: float64

In [13]:
# display second column of second row

-1.5011459184858036
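
The five selections above can be sketched with label-based (`loc`) and position-based (`iloc`) indexing. The frame is rebuilt from the rounded values printed earlier so the example is self-contained; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'col1': [0.947332, 0.286528, -0.233333, -0.288128, -0.077549, -0.360088],
     'col2': [0.186477, -1.501146, 0.542328, 0.048626, 1.768271, -0.646813],
     'col3': [-2.871911, -0.089162, -0.763310, 0.412520, -0.018231, 0.703988]},
    index=['jan', 'feb', 'march', 'april', 'mai', 'june'])

col1 = df['col1']                      # one column, selected by label -> Series
march = df.loc['march']                # one row, selected by index label -> Series
march_col1 = df.loc['march', 'col1']   # single value by row and column label
second_row = df.iloc[1]                # one row, selected by 0-based position
second_row_col2 = df.iloc[1, 1]        # single value by row and column position

print(col1, march, march_col1, second_row, second_row_col2, sep='\n\n')
```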

## Vectorization

Multiply the values in the 'march' row by 7:

,col1,col2,col3
jan,0.947332,0.186477,-2.871911
feb,0.286528,-1.501146,-0.089162
march,-1.633333,3.796299,-5.343173
april,-0.288128,0.048626,0.41252
mai,-0.077549,1.768271,-0.018231
june,-0.360088,-0.646813,0.703988


Now add 10 to all values in col2:

,col1,col2,col3
jan,0.947332,10.186477,-2.871911
feb,0.286528,8.498854,-0.089162
march,-1.633333,13.796299,-5.343173
april,-0.288128,10.048626,0.41252
mai,-0.077549,11.768271,-0.018231
june,-0.360088,9.353187,0.703988


Now display statistical summary values:

,col1,col2,col3
count,6.0,6.0,6.0
mean,-0.18754,10.608619,-1.200995
std,0.854983,1.89869,2.400712
min,-1.633333,8.498854,-5.343173
25%,-0.342098,9.527047,-2.176224
50%,-0.182838,10.117551,-0.053696
75%,0.195509,11.372822,0.304832
max,0.947332,13.796299,0.703988
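
The three steps in this section can be sketched as follows, again starting from the rounded values printed earlier. Arithmetic on a row or column applies to every element at once, and `describe()` produces the summary table.

```python
import pandas as pd

df = pd.DataFrame(
    {'col1': [0.947332, 0.286528, -0.233333, -0.288128, -0.077549, -0.360088],
     'col2': [0.186477, -1.501146, 0.542328, 0.048626, 1.768271, -0.646813],
     'col3': [-2.871911, -0.089162, -0.763310, 0.412520, -0.018231, 0.703988]},
    index=['jan', 'feb', 'march', 'april', 'mai', 'june'])

# Scale a whole row: the multiplication is applied element-wise.
df.loc['march'] = df.loc['march'] * 7

# Shift a whole column.
df['col2'] = df['col2'] + 10

# count, mean, std, min, quartiles and max for each numeric column.
summary = df.describe()
print(summary)
```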


## DataFrame: Joins

Let's briefly explore joins.

In [17]:
df2 = pd.DataFrame(np.random.randn(18).reshape(6,3), 
                  index = ['jan', 'feb', 'march', 'mai', 'june', 'july'],
                  columns = ['col2', 'col3', 'col4'])
df2

,col2,col3,col4
jan,0.042804,-1.021198,-0.78563
feb,-1.01349,-0.132451,-1.088217
march,-0.499,0.147573,-0.236867
mai,0.53777,-0.928538,0.39276
june,-0.122828,2.36794,0.641117
july,0.350646,0.538053,0.780259


Use join() to join df and df2 on the index:

In [18]:
# specify how=<...> to change the default join mode from 'left' to e.g. 'inner'
# as we have identically named columns in both dataframes, we need to specify suffixes
j = 
j

,col1,col2orig,col3orig,col2other,col3other,col4
jan,0.947332,10.186477,-2.871911,0.042804,-1.021198,-0.78563
feb,0.286528,8.498854,-0.089162,-1.01349,-0.132451,-1.088217
march,-1.633333,13.796299,-5.343173,-0.499,0.147573,-0.236867
mai,-0.077549,11.768271,-0.018231,0.53777,-0.928538,0.39276
june,-0.360088,9.353187,0.703988,-0.122828,2.36794,0.641117
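
A sketch of a `join()` call that yields a result of this shape. The suffixes `orig` and `other` are inferred from the column names in the output above, and `how='inner'` from the fact that only months present in both frames appear. Small integer stand-in values are used instead of the random data.

```python
import pandas as pd

# Stand-in frames with the same labels as df and df2; the values are illustrative.
df = pd.DataFrame({'col1': [1, 0, -2, 0, 0, 0],
                   'col2': [10, 8, 14, 10, 12, 9],
                   'col3': [-3, 0, -5, 0, 0, 1]},
                  index=['jan', 'feb', 'march', 'april', 'mai', 'june'])
df2 = pd.DataFrame({'col2': [0, -1, 0, 1, 0, 0],
                    'col3': [-1, 0, 0, -1, 2, 1],
                    'col4': [-1, -1, 0, 0, 1, 1]},
                   index=['jan', 'feb', 'march', 'mai', 'june', 'july'])

# Join on the index; the suffixes disambiguate the shared names col2 and col3.
j = df.join(df2, how='inner', lsuffix='orig', rsuffix='other')
print(j)
```

With the default `how='left'`, the 'april' row would be kept as well, with missing values in the columns coming from `df2`.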


Now use merge() to join on columns, doing a database-style join.
To get a nonempty result, we first make some modifications to our dataframes.

In [19]:
# note: DataFrame.applymap is deprecated as of pandas 2.1; DataFrame.map is its replacement there
df = df.applymap(round)
df

,col1,col2,col3
jan,1,10,-3
feb,0,8,0
march,-2,14,-5
april,0,10,0
mai,0,12,0
june,0,9,1


In [20]:
df2 = df2.applymap(round)
df2

,col2,col3,col4
jan,0,-1,-1
feb,-1,0,-1
march,0,0,0
mai,1,-1,0
june,0,2,1
july,0,1,1


Now join the modified dataframes on df.col1 == df2.col4.

In [21]:
m = 
m

,col1,col2_x,col3_x,col2_y,col3_y,col4
0,1,10,-3,0,2,1
1,1,10,-3,0,1,1
2,0,8,0,0,0,0
3,0,8,0,1,-1,0
4,0,10,0,0,0,0
5,0,10,0,1,-1,0
6,0,12,0,0,0,0
7,0,12,0,1,-1,0
8,0,9,1,0,0,0
9,0,9,1,1,-1,0
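
A sketch of a `merge()` call matching the condition df.col1 == df2.col4, using the rounded frames shown above. `left_on` and `right_on` name the key column on each side; the `_x` and `_y` suffixes in the output are the pandas defaults for overlapping column names.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, -2, 0, 0, 0],
                   'col2': [10, 8, 14, 10, 12, 9],
                   'col3': [-3, 0, -5, 0, 0, 1]},
                  index=['jan', 'feb', 'march', 'april', 'mai', 'june'])
df2 = pd.DataFrame({'col2': [0, -1, 0, 1, 0, 0],
                    'col3': [-1, 0, 0, -1, 2, 1],
                    'col4': [-1, -1, 0, 0, 1, 1]},
                   index=['jan', 'feb', 'march', 'mai', 'june', 'july'])

# Inner join (the default) on differently named key columns.
# The month index is not used as a key, so the result gets a fresh integer index.
m = pd.merge(df, df2, left_on='col1', right_on='col4')
print(m)
```

Each left row is paired with every right row that has a matching key, which is why the four rows with col1 == 0 produce eight result rows.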


## DataFrame: Group by

In [22]:
df3 = pd.DataFrame({'col1': ['blue', 'green', 'blue', 'cyan', 'yellow', 'cyan'],
                    'col2': ['yes', 'no', 'no', 'yes', 'no', 'no'],
                    'col3': np.random.normal(100, 20, 6)})
df3

,col1,col2,col3
0,blue,yes,85.14416
1,green,no,111.871226
2,blue,no,72.592945
3,cyan,yes,100.975036
4,yellow,no,66.084287
5,cyan,no,112.966054


Compute col3 means for different values of col2:

col2,col3
no,90.878628
yes,93.059598


Now, compute col3 maxima for all combinations of col1 and col2:

col1,col2,col3
blue,no,72.592945
blue,yes,85.14416
cyan,no,112.966054
cyan,yes,100.975036
green,no,111.871226
yellow,no,66.084287
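
Both summaries can be sketched with `groupby`. The first table shown above has one row per value of col2, so the sketch groups by col2 for the means and by both columns for the maxima. The col3 values are copied from the printed frame.

```python
import pandas as pd

df3 = pd.DataFrame({'col1': ['blue', 'green', 'blue', 'cyan', 'yellow', 'cyan'],
                    'col2': ['yes', 'no', 'no', 'yes', 'no', 'no'],
                    'col3': [85.14416, 111.871226, 72.592945,
                             100.975036, 66.084287, 112.966054]})

# Mean of col3 within each col2 group; [['col3']] keeps the result a DataFrame.
means = df3.groupby('col2')[['col3']].mean()

# Maximum of col3 for each (col1, col2) combination that occurs in the data.
maxima = df3.groupby(['col1', 'col2'])[['col3']].max()

print(means)
print(maxima)
```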


## Computing statistics

In [25]:
df

,col1,col2,col3
jan,1,10,-3
feb,0,8,0
march,-2,14,-5
april,0,10,0
mai,0,12,0
june,0,9,1


Compute the column means:

col1    -0.166667
col2    10.500000
col3    -1.166667
dtype: float64

Now compute the row means (one per month):

jan      2.666667
feb      2.666667
march    2.333333
april    3.333333
mai      4.000000
june     3.333333
dtype: float64
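
A sketch of both computations: `mean()` aggregates down each column by default, and `axis=1` switches it to aggregate across each row.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, -2, 0, 0, 0],
                   'col2': [10, 8, 14, 10, 12, 9],
                   'col3': [-3, 0, -5, 0, 0, 1]},
                  index=['jan', 'feb', 'march', 'april', 'mai', 'june'])

col_means = df.mean()          # default axis=0: one mean per column
row_means = df.mean(axis=1)    # axis=1: one mean per row (month)

print(col_means)
print(row_means)
```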

## Missing data

In [30]:
df = pd.DataFrame({'one': pd.Series(['a', 'a', np.nan, 'b']),
                   'two': pd.Series(['a', np.nan, 'c', np.nan])})
df

,one,two
0,a,a
1,a,
2,,c
3,b,


Remove all rows that have missing data:

,one,two
0,a,a


Instead, replace all missing values with "unknown":

,one,two
0,a,a
1,a,unknown
2,unknown,c
3,b,unknown
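
The two treatments of missing data above can be sketched with `dropna` and `fillna`. Both return new frames and leave `df` unchanged; the names `complete` and `filled` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': ['a', 'a', np.nan, 'b'],
                   'two': ['a', np.nan, 'c', np.nan]})

# Keep only the rows without any missing value.
complete = df.dropna()

# Alternatively, keep every row and substitute a placeholder.
filled = df.fillna('unknown')

print(complete)
print(filled)
```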


## Discretization and binarization

Assume we have a continuous variable, but we want to look at the effects of levels (ranges) of that variable.
We need to group the values into levels and then create an indicator variable for every level.

In [37]:
df = pd.DataFrame({'cat': ['one', 'one', 'one', 'two', 'three', 'one', 'three', 'two', 'one'],
                   'val': [2.7, 31.0, 23.1,1.1, 5.5, 3.3, 45.0,7.0,8.0]})
df

,cat,val
0,one,2.7
1,one,31.0
2,one,23.1
3,two,1.1
4,three,5.5
5,one,3.3
6,three,45.0
7,two,7.0
8,one,8.0


First, use qcut to split the val data into 3 bins that each hold the same number of observations, labeled low, medium and high:

In [47]:
df['val_levels'] = 
df

,cat,val,val_levels
0,one,2.7,low
1,one,31.0,high
2,one,23.1,high
3,two,1.1,low
4,three,5.5,medium
5,one,3.3,low
6,three,45.0,high
7,two,7.0,medium
8,one,8.0,medium
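
A sketch of the binning step, assuming the labels `low`, `medium` and `high` seen in the output above. `pd.qcut` places its cut points at quantiles, so each bin receives (roughly) the same number of rows rather than covering the same width.

```python
import pandas as pd

df = pd.DataFrame({'cat': ['one', 'one', 'one', 'two', 'three', 'one', 'three', 'two', 'one'],
                   'val': [2.7, 31.0, 23.1, 1.1, 5.5, 3.3, 45.0, 7.0, 8.0]})

# Three quantile-based bins; the labels are applied from the lowest to the highest bin.
df['val_levels'] = pd.qcut(df['val'], 3, labels=['low', 'medium', 'high'])
print(df)
```

For bins of equal width instead of equal frequency, `pd.cut` would be the counterpart.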


Now create dummy (indicator) variables for the categorical columns:

In [50]:
df = 
df

,val,cat_one,cat_three,cat_two,val_levels_low,val_levels_medium,val_levels_high
0,2.7,1,0,0,1,0,0
1,31.0,1,0,0,0,0,1
2,23.1,1,0,0,0,0,1
3,1.1,0,0,1,1,0,0
4,5.5,0,1,0,0,1,0
5,3.3,1,0,0,1,0,0
6,45.0,0,1,0,0,0,1
7,7.0,0,0,1,0,1,0
8,8.0,1,0,0,0,1,0


Finally, drop the original 'val' column, since we no longer need it:

In [55]:
df = 
df
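
A sketch of the final step with `DataFrame.drop`. A small stand-in holding the first three rows and a few of the dummy columns is used so the example runs on its own.

```python
import pandas as pd

# Stand-in for the dummy-encoded frame (subset of rows and columns, for illustration).
df = pd.DataFrame({'val': [2.7, 31.0, 23.1],
                   'cat_one': [1, 1, 1],
                   'val_levels_low': [1, 0, 0],
                   'val_levels_high': [0, 1, 1]})

# drop returns a new frame without the named column; reassign to keep the result.
df = df.drop(columns='val')
print(df)
```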