
## Pandas Tutorial:
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

One fundamental difference between pandas and NumPy: a NumPy array has a single dtype for the entire array, while a pandas DataFrame has one dtype per column.
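The dtype difference can be seen in a small sketch. The names `arr` and `frame` and the column labels `count` and `ratio` are made up for illustration only.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one dtype: mixing ints and floats upcasts every element to float64.
arr = np.array([[1, 2.5], [3, 4.5]])
print(arr.dtype)

# A DataFrame keeps a separate dtype for each column.
frame = pd.DataFrame({'count': [1, 3], 'ratio': [2.5, 4.5]})
print(frame.dtypes)
```

Here the integer column stays an integer column in the DataFrame, while the same values in the array are stored as floats.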


Import numpy and pandas 

In [1]:
import numpy as np
import pandas as pd

Create a pandas series

In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
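The `float64` dtype above follows from the `np.nan` entry: NaN is a floating-point value, so integers stored alongside it are upcast. A minimal sketch of that behavior, with the illustrative names `ints` and `with_nan`:

```python
import numpy as np
import pandas as pd

# Without missing values, integer input stays an integer dtype.
ints = pd.Series([1, 3, 5])

# Adding np.nan forces the whole Series to a floating-point dtype.
with_nan = pd.Series([1, 3, 5, np.nan])

print(ints.dtype, with_nan.dtype)
print(with_nan.isna().sum())  # number of missing entries
```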


Create a pandas DataFrame from a NumPy array, adding an index (dates, in this case) and headings for the columns. This is a very common format for representing feature data in ML.

In [3]:
dates = pd.date_range('20200101', periods=6)
print(dates)

pretend_data = np.random.randn(6, 4)
df = pd.DataFrame(pretend_data, index=dates, columns=['FeatureA', 'FeatureB', 'FeatureC', 'FeatureD'])
print(df)

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')
            FeatureA  FeatureB  FeatureC  FeatureD
2020-01-01  0.109174  1.057742  2.330293 -1.894381
2020-01-02  0.464629 -0.680121  0.251704 -1.298829
2020-01-03 -0.832654 -2.375967  0.320228 -0.778832
2020-01-04 -1.210329  2.775254 -0.846846  0.779661
2020-01-05  0.843720  0.753930  0.813752 -0.331170
2020-01-06 -0.173002 -0.322687  0.099733 -0.305327


Create a DataFrame by passing a dict of objects that can be converted to Series-like structures. Scalar values are repeated for every row.

In [4]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
print(df2)

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


What are the dtypes of the columns in the DataFrame?

In [5]:
print(df2.dtypes)

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object


### Viewing DataFrames

View the top and bottom rows (head and tail show 5 rows by default). A notebook cell displays only its last expression, so the output below is that of df2.tail(3).

In [6]:
df.head()
df2.tail(3)

     A          B    C  D      E    F
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
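To see both views from a single cell, each result can be printed explicitly. This is a small sketch on a made-up frame named `demo`, not the tutorial's data:

```python
import numpy as np
import pandas as pd

# Ten rows numbered 0..9 so the selected rows are easy to recognize.
demo = pd.DataFrame({'x': np.arange(10)})

print(demo.head())   # first 5 rows (the default)
print(demo.tail(3))  # last 3 rows
```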


Display the index and the columns separately

In [7]:
print(df.index)
print(df2.columns)

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')


Convert from pandas DataFrame to NumPy array. <br>
Note 1: This will cause pandas to find a single NumPy dtype that can hold all dtypes in the DataFrame. <br>
Note 2: Conversion to NumPy will lose the index as well as the column labels.

df is a DataFrame consisting entirely of float64 columns, so the resulting NumPy array preserves this dtype

In [8]:
a = df.to_numpy()
print(a)
a.dtype

[[ 0.10917417  1.05774194  2.33029347 -1.89438085]
 [ 0.46462866 -0.68012069  0.25170406 -1.29882949]
 [-0.83265376 -2.3759669   0.32022814 -0.77883189]
 [-1.21032883  2.77525355 -0.8468457   0.77966124]
 [ 0.84371972  0.75392987  0.81375216 -0.3311704 ]
 [-0.17300231 -0.32268733  0.09973338 -0.3053274 ]]


dtype('float64')

df2 is a DataFrame consisting of multiple data types, so the resulting NumPy array will consist of elements that have been cast to Python objects

In [9]:
b = df2.to_numpy()
print(b)
print(b.dtype)

[[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
 [1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']
 [1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
 [1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']]
object
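An object array is rarely what a numerical routine expects. One possible approach, sketched here on a hypothetical frame `mixed` modeled on df2, is to keep only the numeric columns with `select_dtypes` before converting:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame, modeled on df2 above.
mixed = pd.DataFrame({'A': 1.,
                      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                      'D': np.array([3] * 4, dtype='int32'),
                      'F': 'foo'})

# Keep only the numeric columns, then convert.
numeric = mixed.select_dtypes(include='number')
arr = numeric.to_numpy()

print(arr.dtype)  # one numeric dtype that can hold float64, float32 and int32
print(arr.shape)  # the string column F has been dropped
```

Whether dropping the non-numeric columns is acceptable depends on the task; categorical features usually need to be encoded rather than discarded.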


Generate a quick statistical summary of your numerical column data using describe(). By default, non-numeric columns are left out of the summary when numeric columns are present.

In [10]:
df.describe()

       FeatureA  FeatureB  FeatureC  FeatureD
count  6.000000  6.000000  6.000000  6.000000
mean  -0.133077  0.201358  0.494811 -0.638146
std    0.777373  1.752592  1.050374  0.921911
min   -1.210329 -2.375967 -0.846846 -1.894381
25%   -0.667741 -0.590762  0.137726 -1.168830
50%   -0.031914  0.215621  0.285966 -0.555001
75%    0.375765  0.981789  0.690371 -0.311788
max    0.843720  2.775254  2.330293  0.779661


In [11]:
df2.describe()

         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0
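The summary of df2 covers only columns A, C and D because describe() skips non-numeric columns by default. Passing `include='all'` asks for every column. A minimal sketch on a made-up frame with one numeric and one categorical column:

```python
import pandas as pd

# Illustrative frame: one numeric column, one categorical column.
frame = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0],
                      'split': pd.Categorical(['test', 'train', 'test', 'train'])})

numeric_only = frame.describe()             # default: numeric columns only
everything = frame.describe(include='all')  # adds rows such as unique, top and freq

print(numeric_only)
print(everything)
```

Statistics that do not apply to a column (for example the mean of a categorical column) appear as NaN in the combined summary.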


Transpose the DataFrame (swap rows and columns)

In [12]:
df.T

          2020-01-01  2020-01-02  2020-01-03  2020-01-04  2020-01-05  2020-01-06
FeatureA    0.109174    0.464629   -0.832654   -1.210329    0.843720   -0.173002
FeatureB    1.057742   -0.680121   -2.375967    2.775254    0.753930   -0.322687
FeatureC    2.330293    0.251704    0.320228   -0.846846    0.813752    0.099733
FeatureD   -1.894381   -1.298829   -0.778832    0.779661   -0.331170   -0.305327


Sorting by an axis: <br>
Note: In this case axis 0 is our rows (dates) and axis 1 is our columns (feature labels). sort_index orders by the labels along the chosen axis, not by the data values. <br>
Experiment with different axes and with True/False for ascending

In [13]:
df.sort_index(axis=1, ascending=False)

            FeatureD  FeatureC  FeatureB  FeatureA
2020-01-01 -1.894381  2.330293  1.057742  0.109174
2020-01-02 -1.298829  0.251704 -0.680121  0.464629
2020-01-03 -0.778832  0.320228 -2.375967 -0.832654
2020-01-04  0.779661 -0.846846  2.775254 -1.210329
2020-01-05 -0.331170  0.813752  0.753930  0.843720
2020-01-06 -0.305327  0.099733 -0.322687 -0.173002
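As a starting point for experimenting with the other axis, the sketch below sorts a small made-up frame by its row labels and by its column labels. The values are arbitrary and chosen only to keep the result easy to read.

```python
import pandas as pd

dates = pd.date_range('20200101', periods=3)
frame = pd.DataFrame({'FeatureB': [1.0, 2.0, 3.0],
                      'FeatureA': [4.0, 5.0, 6.0]}, index=dates)

# axis=0 sorts by the row labels (dates); descending puts the latest date first.
by_rows = frame.sort_index(axis=0, ascending=False)

# axis=1 sorts by the column labels; ascending gives FeatureA before FeatureB.
by_cols = frame.sort_index(axis=1, ascending=True)

print(by_rows)
print(by_cols)
```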


Sort by the values in a particular column

In [14]:
df.sort_values(by='FeatureB', ascending=False)

            FeatureA  FeatureB  FeatureC  FeatureD
2020-01-04 -1.210329  2.775254 -0.846846  0.779661
2020-01-01  0.109174  1.057742  2.330293 -1.894381
2020-01-05  0.843720  0.753930  0.813752 -0.331170
2020-01-06 -0.173002 -0.322687  0.099733 -0.305327
2020-01-02  0.464629 -0.680121  0.251704 -1.298829
2020-01-03 -0.832654 -2.375967  0.320228 -0.778832


### Selection

Select a single column

In [15]:
df['FeatureA']

2020-01-01    0.109174
2020-01-02    0.464629
2020-01-03   -0.832654
2020-01-04   -1.210329
2020-01-05    0.843720
2020-01-06   -0.173002
Freq: D, Name: FeatureA, dtype: float64

Select rows by position with a slice (the end position is excluded)

In [16]:
df[0:3]

            FeatureA  FeatureB  FeatureC  FeatureD
2020-01-01  0.109174  1.057742  2.330293 -1.894381
2020-01-02  0.464629 -0.680121  0.251704 -1.298829
2020-01-03 -0.832654 -2.375967  0.320228 -0.778832


Rows can also be sliced by index label instead of by position. Unlike a positional slice, a label slice includes both endpoints.

In [17]:
df['2020-01-02':'2020-01-04']

            FeatureA  FeatureB  FeatureC  FeatureD
2020-01-02  0.464629 -0.680121  0.251704 -1.298829
2020-01-03 -0.832654 -2.375967  0.320228 -0.778832
2020-01-04 -1.210329  2.775254 -0.846846  0.779661
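The two kinds of row slicing can also be written with the explicit indexers `.iloc` (position) and `.loc` (label), which makes the intent clearer. A minimal sketch on a made-up frame with the same date index:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20200101', periods=6)
frame = pd.DataFrame({'FeatureA': np.arange(6.0)}, index=dates)

# Positional slice: rows 0, 1 and 2; the end position is excluded.
by_position = frame.iloc[0:3]

# Label slice: both endpoint dates are included.
by_label = frame.loc['2020-01-02':'2020-01-04']

print(by_position)
print(by_label)
```

Both slices return three rows here, but for different reasons: the positional slice stops before position 3, while the label slice runs through 2020-01-04 inclusive.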
