Brief Introduction to Pandas
==

We will go through the basic rationale and operations of the pandas data frame. This tutorial mainly focuses on:

1. Create
2. View
3. Select
4. Set
5. Apply
6. Group
7. Join (merge)


Let's import the packages first.

In [43]:
import numpy as np
import pandas as pd

Create
--
We usually create a data frame from a dict. Each key becomes a column name and each value becomes that column's data; scalar values such as 'A' and 'F' below are repeated for every row.

In [44]:
df2 = pd.DataFrame({'A': 1.,    
                    'B': pd.to_datetime('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo
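
A dict of columns is not the only possible input. As a minimal sketch, a data frame can also be built from a list of row dicts; the `records` data below is made up for illustration and is not part of the tutorial's data set.

```python
import pandas as pd

# Hypothetical records, invented for this example.
records = [
    {"name": "a", "score": 1.5},
    {"name": "b", "score": 2.0},
    {"name": "c"},  # no "score" key, so this cell becomes NaN
]

# Each dict is one row; the union of the keys gives the columns.
people = pd.DataFrame(records)
print(people)
```

Missing keys are filled with NaN, which is worth checking for right after loading row-oriented data.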


View
--
The first step is always to look at the input data to get a general idea of its contents, especially when it is very large. head() and tail() show the first and last rows.

In [45]:
df2.head(2)

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo


In [46]:
df2.tail(3)

Unnamed: 0,A,B,C,D,E,F
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


Understanding each column's data type is important. The numerical, string, and categorical types matter most; string columns appear as object in the output below.

In [47]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
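
When a column arrives with the wrong type, it can be converted. The sketch below assumes a small made-up frame (`demo`, not the tutorial's `df2`) in which numbers were read as strings, and uses astype() to fix both columns.

```python
import pandas as pd

# Hypothetical frame: "n" holds digits stored as strings.
demo = pd.DataFrame({"n": ["1", "2", "3"],
                     "label": ["test", "train", "test"]})

# Convert the strings to integers and the labels to a categorical type.
demo["n"] = demo["n"].astype("int64")
demo["label"] = demo["label"].astype("category")
print(demo.dtypes)
```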

For numerical columns, we can view summary statistics of their distributions.

In [48]:
df2.describe()

Unnamed: 0,A,C,D
count,4.0,4.0,4.0
mean,1.0,1.0,3.0
std,0.0,0.0,0.0
min,1.0,1.0,3.0
25%,1.0,1.0,3.0
50%,1.0,1.0,3.0
75%,1.0,1.0,3.0
max,1.0,1.0,3.0
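
describe() only summarised the numerical columns above. As a rough sketch, select_dtypes() makes that choice explicit and value_counts() gives a comparable summary for a categorical column; the `demo` frame is a trimmed, hypothetical copy of the one used in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with numerical, categorical, and string columns.
demo = pd.DataFrame({"A": 1.0,
                     "D": np.array([3] * 4, dtype="int32"),
                     "E": pd.Categorical(["test", "train", "test", "train"]),
                     "F": "foo"},
                    index=range(4))

# Keep only the numerical columns.
numeric = demo.select_dtypes(include="number")
print(numeric.columns.tolist())

# Count how often each category occurs.
counts = demo["E"].value_counts()
print(counts)
```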


Select
--

In [49]:
df = pd.DataFrame(np.random.randn(8, 3), 
                  index=pd.date_range('1/1/2000', periods=8),
                  columns=['A', 'B', 'C'])
df


Unnamed: 0,A,B,C
2000-01-01,-1.19639,-1.037491,0.877998
2000-01-02,-1.482742,0.18983,0.670487
2000-01-03,-0.190781,-1.746123,-0.398068
2000-01-04,-0.193676,0.878935,0.65031
2000-01-05,1.477564,1.106824,1.296083
2000-01-06,-0.642507,-1.119144,2.014241
2000-01-07,1.29225,0.388352,-1.07115
2000-01-08,-1.422954,-0.234342,-0.479789


Three selection methods are very popular:

1. By Label
2. By Position
3. By Boolean Condition

###### By Label

.loc[row selection, column selection] returns the subset of the original table whose row and column labels match the selection. ':' selects everything along that axis.

In [51]:
df.loc[[pd.to_datetime('20000106'), pd.to_datetime('20000107')], ['A', 'B']]

Unnamed: 0,A,B
2000-01-06,-0.642507,-1.119144
2000-01-07,1.29225,0.388352


In [52]:
df.loc[:, ['A', 'B']]

Unnamed: 0,A,B
2000-01-01,-1.19639,-1.037491
2000-01-02,-1.482742,0.18983
2000-01-03,-0.190781,-1.746123
2000-01-04,-0.193676,0.878935
2000-01-05,1.477564,1.106824
2000-01-06,-0.642507,-1.119144
2000-01-07,1.29225,0.388352
2000-01-08,-1.422954,-0.234342


######  By Position

Similar to .loc, .iloc[row positions, column positions] returns the subset of the original table at the given integer positions, counted from 0. ':' selects everything along that axis.

In [53]:
df.iloc[3:5, 0:2]

Unnamed: 0,A,B
2000-01-04,-0.193676,0.878935
2000-01-05,1.477564,1.106824


In [54]:
df.iloc[1:3, :]

Unnamed: 0,A,B,C
2000-01-02,-1.482742,0.18983,0.670487
2000-01-03,-0.190781,-1.746123,-0.398068
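
One difference between the two selectors is easy to miss: a label slice in .loc includes its end point, while a position slice in .iloc excludes it. The sketch below shows this on a hypothetical frame (`demo`) with the same date index but deterministic values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same shape and index as df above.
dates = pd.date_range("2000-01-01", periods=8)
demo = pd.DataFrame(np.arange(24).reshape(8, 3),
                    index=dates, columns=["A", "B", "C"])

# Label slices include both end points: three rows, columns A and B.
by_label = demo.loc["2000-01-02":"2000-01-04", "A":"B"]

# Position slices exclude the end point: rows 1 and 2, columns 0 and 1.
by_pos = demo.iloc[1:3, 0:2]

print(by_label.shape, by_pos.shape)
```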


######  By Boolean Condition

Sometimes we need to select the rows that satisfy a condition, such as all rows with a positive value in column A.

In [55]:
df[df.A > 0]

Unnamed: 0,A,B,C
2000-01-05,1.477564,1.106824,1.296083
2000-01-07,1.29225,0.388352,-1.07115


We can also combine label-based selection with a boolean condition.

In [56]:
df.loc[df.A > 0,:]

Unnamed: 0,A,B,C
2000-01-05,1.477564,1.106824,1.296083
2000-01-07,1.29225,0.388352,-1.07115
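
Several conditions can be combined in one selection. A minimal sketch, on a made-up frame (`demo`): use & for "and", | for "or", and ~ for "not", with each condition wrapped in parentheses, and isin() to test membership in a set of values.

```python
import pandas as pd

# Hypothetical data, invented for this example.
demo = pd.DataFrame({"A": [-1.0, 2.0, 3.0, -4.0],
                     "B": [1.0, -2.0, 3.0, -4.0],
                     "C": ["x", "y", "x", "z"]})

# Parentheses are required because & and | bind tighter than comparisons.
both_positive = demo[(demo["A"] > 0) & (demo["B"] > 0)]
either_positive = demo[(demo["A"] > 0) | (demo["B"] > 0)]
a_not_positive = demo[~(demo["A"] > 0)]

# Keep rows whose C value is one of the listed labels.
in_set = demo[demo["C"].isin(["x", "z"])]

print(both_positive.index.tolist())
```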


Set
--
We introduce four ways to set new values in an existing data frame. They are closely related to the selection methods discussed above:

1. By Indexes
2. By Label
3. By Position
4. By Numpy Array

###### By Indexes
Assigning a Series aligns it with the data frame's index. Notice that two NaNs appear, on 2000-01-01 and 2000-01-08, because s1 has no values for those dates.

In [57]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20000102', periods=6))
df['F'] = s1
df

Unnamed: 0,A,B,C,F
2000-01-01,-1.19639,-1.037491,0.877998,
2000-01-02,-1.482742,0.18983,0.670487,1.0
2000-01-03,-0.190781,-1.746123,-0.398068,2.0
2000-01-04,-0.193676,0.878935,0.65031,3.0
2000-01-05,1.477564,1.106824,1.296083,4.0
2000-01-06,-0.642507,-1.119144,2.014241,5.0
2000-01-07,1.29225,0.388352,-1.07115,6.0
2000-01-08,-1.422954,-0.234342,-0.479789,
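
The alignment works in both directions: labels missing from the Series become NaN, and labels that exist only in the Series are dropped. The sketch below uses a small hypothetical frame (`demo`) with string labels, and also shows that a plain NumPy array is assigned by position instead.

```python
import pandas as pd

# Hypothetical frame and Series with partly overlapping labels.
demo = pd.DataFrame({"A": [10, 20, 30]}, index=["x", "y", "z"])
s = pd.Series([1, 2], index=["y", "w"])

# Aligned on labels: only "y" matches, and "w" is silently ignored.
demo["F"] = s

# A NumPy array carries no labels, so values are assigned in row order.
demo["G"] = pd.Series([7, 8, 9]).to_numpy()

print(demo)
```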


###### By Label

In [58]:
df.at[pd.to_datetime('20000102'), 'A'] = 0
df

Unnamed: 0,A,B,C,F
2000-01-01,-1.19639,-1.037491,0.877998,
2000-01-02,0.0,0.18983,0.670487,1.0
2000-01-03,-0.190781,-1.746123,-0.398068,2.0
2000-01-04,-0.193676,0.878935,0.65031,3.0
2000-01-05,1.477564,1.106824,1.296083,4.0
2000-01-06,-0.642507,-1.119144,2.014241,5.0
2000-01-07,1.29225,0.388352,-1.07115,6.0
2000-01-08,-1.422954,-0.234342,-0.479789,


###### By Position

In [59]:
df.iat[0, 1] = 0
df

Unnamed: 0,A,B,C,F
2000-01-01,-1.19639,0.0,0.877998,
2000-01-02,0.0,0.18983,0.670487,1.0
2000-01-03,-0.190781,-1.746123,-0.398068,2.0
2000-01-04,-0.193676,0.878935,0.65031,3.0
2000-01-05,1.477564,1.106824,1.296083,4.0
2000-01-06,-0.642507,-1.119144,2.014241,5.0
2000-01-07,1.29225,0.388352,-1.07115,6.0
2000-01-08,-1.422954,-0.234342,-0.479789,


###### By Numpy Array

In [60]:
df.loc[:, 'D'] = np.array([5] * len(df))
df

Unnamed: 0,A,B,C,F,D
2000-01-01,-1.19639,0.0,0.877998,,5
2000-01-02,0.0,0.18983,0.670487,1.0,5
2000-01-03,-0.190781,-1.746123,-0.398068,2.0,5
2000-01-04,-0.193676,0.878935,0.65031,3.0,5
2000-01-05,1.477564,1.106824,1.296083,4.0,5
2000-01-06,-0.642507,-1.119144,2.014241,5.0,5
2000-01-07,1.29225,0.388352,-1.07115,6.0,5
2000-01-08,-1.422954,-0.234342,-0.479789,,5
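
Setting values can also be driven by a boolean condition, reusing the third selection method from the previous section. A minimal sketch on a made-up frame (`demo`), replacing every negative value in column A with zero:

```python
import pandas as pd

# Hypothetical data, invented for this example.
demo = pd.DataFrame({"A": [-1.0, 2.0, -3.0],
                     "B": [0.5, 0.5, 0.5]})

# Select rows with a boolean mask and one column by label, then assign.
demo.loc[demo["A"] < 0, "A"] = 0.0

print(demo)
```

Doing the row and column selection in a single .loc call, as here, writes to the original frame and not to a temporary copy.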


Apply
--
apply() runs the same function on every column (or, with axis=1, on every row). Here np.cumsum() returns the cumulative sum of each column.

In [61]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C,F,D
2000-01-01,-1.19639,0.0,0.877998,,5
2000-01-02,-1.19639,0.18983,1.548485,1.0,10
2000-01-03,-1.387172,-1.556293,1.150418,3.0,15
2000-01-04,-1.580847,-0.677358,1.800727,6.0,20
2000-01-05,-0.103284,0.429466,3.09681,10.0,25
2000-01-06,-0.74579,-0.689678,5.111051,15.0,30
2000-01-07,0.546459,-0.301326,4.039901,21.0,35
2000-01-08,-0.876494,-0.535668,3.560113,,40


We can also pass our own function, for example as a lambda.

In [62]:
df.apply(lambda x: x.max() - x.min())

A    2.900517
B    2.852946
C    3.085391
F    5.000000
D    0.000000
dtype: float64
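
By default the function receives one column at a time. As a short sketch on a hypothetical frame (`demo`), passing axis=1 applies the same lambda to each row instead:

```python
import pandas as pd

# Hypothetical data, invented for this example.
demo = pd.DataFrame({"A": [1.0, 4.0],
                     "B": [3.0, 10.0]})

# Default (axis=0): one result per column.
col_range = demo.apply(lambda col: col.max() - col.min())

# axis=1: one result per row.
row_range = demo.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)
print(row_range)
```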

Group
--

In [63]:
df2 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

df2

Unnamed: 0,A,B,C,D
0,foo,one,-1.862232,0.257821
1,bar,one,-0.572357,1.515196
2,foo,two,-0.994933,0.444104
3,bar,three,0.441701,-0.480375
4,foo,two,-1.567959,-0.861752
5,bar,two,0.051634,-0.245915
6,foo,one,0.080736,1.606136
7,foo,three,-0.468119,0.812452


We can compute statistics for each group, here the sum of C and D for every combination of A and B.

In [64]:
df2.groupby(['A', 'B']).sum()

A,B,C,D
bar,one,-0.572357,1.515196
bar,three,0.441701,-0.480375
bar,two,0.051634,-0.245915
foo,one,-1.781496,1.863957
foo,three,-0.468119,0.812452
foo,two,-2.562892,-0.417648
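
sum() is only one of the available statistics. The sketch below, on a small hypothetical frame (`demo`) with fixed values, computes several statistics at once with agg() and uses as_index=False to keep the group keys as ordinary columns.

```python
import pandas as pd

# Hypothetical data with fixed values so the results are reproducible.
demo = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                     "B": ["one", "one", "two", "one"],
                     "C": [1.0, 2.0, 3.0, 4.0]})

# Sum of C for each (A, B) pair; the keys form a two-level index.
sums = demo.groupby(["A", "B"])["C"].sum()

# Several statistics per group in one call.
stats = demo.groupby("A")["C"].agg(["mean", "count"])

# as_index=False returns the keys as regular columns.
flat = demo.groupby(["A", "B"], as_index=False)["C"].sum()

print(stats)
```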


Join
--
A join has four types: left, right, inner, and outer. In pd.merge(x, y), x is the left data frame and y is the right. The type is chosen with the how argument, which defaults to inner.

1. left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
2. right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
3. outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
4. inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

In [65]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
left

Unnamed: 0,key,lval
0,foo,1
1,foo,2


In [66]:
right

Unnamed: 0,key,rval
0,foo,4
1,foo,5


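Now we merge left and right on column 'key'. Because 'foo' appears twice in each frame, every left row pairs with every right row, giving four rows.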

In [67]:
pd.merge(left, right, on='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5
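
The four join types only differ when the two frames hold different keys. As a minimal sketch, the hypothetical frames below (`left_df` and `right_df`, not the ones above) share the keys 'b' and 'c' only, so each value of how returns a different set of rows.

```python
import pandas as pd

# Hypothetical frames with partly overlapping keys.
left_df = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right_df = pd.DataFrame({"key": ["b", "c", "d"], "rval": [4, 5, 6]})

# Intersection of keys (the default).
inner = pd.merge(left_df, right_df, on="key", how="inner")

# All left keys; rval is NaN where the right frame has no match.
left_join = pd.merge(left_df, right_df, on="key", how="left")

# All right keys; lval is NaN where the left frame has no match.
right_join = pd.merge(left_df, right_df, on="key", how="right")

# Union of keys from both frames.
outer = pd.merge(left_df, right_df, on="key", how="outer")

for name, frame in [("inner", inner), ("left", left_join),
                    ("right", right_join), ("outer", outer)]:
    print(name, frame["key"].tolist())
```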


Summary
==
We should always be aware of data types, understand the internal structures, and be clear about the relations between different tables.

Most of the information above can be found in [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#min).

Finally, practice makes perfect.