# Python BaseCamp (Chennai)

### By Karthikeyan Sankaran, 17th June, 2018

# Introduction to Python

- Start with Jake Vanderplas video: https://www.youtube.com/watch?v=DifMYH3iuFw
- Python docs are at: https://www.python.org/doc/

## Jupyter Notebook Basics

- Code cell & Markdown cell
- Edit mode & Command mode
- Shift + Enter - To run the cell and move to next cell
- Ctrl + Enter - To run the cell and stay in that cell
- Alt + Enter - To run the cell, create a new empty cell and move there
- Tab - Autocomplete a method or function name
- Shift + Tab - Show the help (docstring) for a function

### Simple Exercises to start

**1. Add 1 + 2 and store in result**

In [1]:
result = 1 + 2
result

3

**2. What is 7 to the power of 4?**

In [2]:
7 ** 4

2401

**3. Split this string:**

    s = "Hi there Sam!"
    
**into a list.**

In [3]:
s = "Hi there Sam!"
s.split()

['Hi', 'there', 'Sam!']

**4. Create a list and grab items from it**

In [4]:
lst = [100,101,102,103]
lst[2]

102

In [5]:
lst=[100,101,['a','b','c',['first','second','third']]]
lst[2][3][1]

'second'

**5. Create a dictionary and grab items**

In [6]:
dt = {'k1':10,'k2':20,'k3':30}
dt['k2']

20

**6. Control Flow (if, for, while)**

In [7]:
# 'IF' statement
choice = 'a'

if choice == 'a':
    print("You chose 'a'.")
elif choice == 'b':
    print("You chose 'b'.")
elif choice == 'c':
    print("You chose 'c'.")
else:
    print("Invalid choice.")

You chose 'a'.


In [8]:
# 'For' statement
for friend in ['Margot', 'Kathryn', 'Prisila']:
    invitation = "Hi " + friend + ".  Please come to my party on Saturday!"
    print(invitation)
    
for i in range(5):
    print('i is now:', i)

Hi Margot.  Please come to my party on Saturday!
Hi Kathryn.  Please come to my party on Saturday!
Hi Prisila.  Please come to my party on Saturday!
i is now: 0
i is now: 1
i is now: 2
i is now: 3
i is now: 4


In [9]:
# 'while' statement
number = 0

while number != 5:
    print(number)
    number = number+1

0
1
2
3
4


# Introduction to Numpy

NumPy provides the n-dimensional array, the underlying data structure that makes numerical computing in Python fast and efficient.
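
As a rough illustration of why arrays are convenient, the sketch below doubles every element of a plain Python list and of a NumPy array. The variable names are made up for this example. The array version needs no explicit loop because the operation is applied to all elements at once.

```python
import numpy as np

# A plain Python list needs an explicit loop (here a comprehension).
values = [1, 2, 3, 4]
doubled_list = [v * 2 for v in values]

# A NumPy array applies the operation to every element at once.
arr = np.array(values)
doubled_arr = arr * 2

print(doubled_list)          # [2, 4, 6, 8]
print(doubled_arr.tolist())  # [2, 4, 6, 8]
```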

In [10]:
import numpy as np

In [11]:
# Create an array of zeros
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [12]:
# Create an array of integers from 10 to 50
np.arange(10,51,1)

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
       44, 45, 46, 47, 48, 49, 50])

In [13]:
# Create a 2x3 array of random numbers drawn uniformly from [0, 1)
np.random.rand(2,3)

array([[0.23712626, 0.9129753 , 0.39959021],
       [0.49295914, 0.62536189, 0.21662924]])

In [14]:
#Create a matrix
mat = np.arange(1,26).reshape(5,5)
mat

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])

In [15]:
# Numpy indexing and selection
#mat[3,4]
#mat[2:,1:]
mat[3:5,:]

array([[16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])

In [16]:
# Perform aggregation
#mat.sum()
mat.sum(axis=0)

array([55, 60, 65, 70, 75])
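
The `axis` argument can be confusing at first. The short sketch below rebuilds the same 5x5 matrix and compares the three common cases: `axis=0` gives one total per column, `axis=1` gives one total per row, and leaving `axis` out gives a single grand total.

```python
import numpy as np

mat = np.arange(1, 26).reshape(5, 5)

col_totals = mat.sum(axis=0)   # collapse the rows: one total per column
row_totals = mat.sum(axis=1)   # collapse the columns: one total per row
grand_total = mat.sum()        # no axis: one total over every element

print(col_totals.tolist())     # [55, 60, 65, 70, 75]
print(row_totals.tolist())     # [15, 40, 65, 90, 115]
print(grand_total)             # 325
```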

# Introduction to Pandas

In this section of the course we will learn how to use pandas for data analysis. The pandas DataFrame is the Python equivalent of R's data frame and is one of the most widely used data structures for preparing data for Machine Learning. We will cover:

* Series
* DataFrames
* Missing Data
* GroupBy
* Merging, Joining, and Concatenating
* Operations
* Data Input and Output

### Series

The first main data type we will learn about for pandas is the Series data type. Let's import Pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

In [17]:
import numpy as np
import pandas as pd

### Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [18]:
labels = ['USA','Germany','India']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

In [19]:
ser1 = pd.Series(data=arr,index=labels)
ser1

USA        10
Germany    20
India      30
dtype: int32
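
The cell above only converts the NumPy array. As a small sketch, the list and the dictionary defined earlier can be converted the same way. Note that a dictionary brings its own labels, since its keys become the index. The names `ser_from_list` and `ser_from_dict` are chosen here just for illustration.

```python
import pandas as pd

labels = ['USA', 'Germany', 'India']
my_list = [10, 20, 30]
d = {'a': 10, 'b': 20, 'c': 30}

ser_from_list = pd.Series(data=my_list, index=labels)  # labels passed explicitly
ser_from_dict = pd.Series(d)                           # dict keys become the index

print(ser_from_list['Germany'])    # 20
print(list(ser_from_dict.index))   # ['a', 'b', 'c']
```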

### Using an Index

The key to using a Series is understanding its index. Pandas uses these index names or numbers for fast lookups of information (the index works like a hash table or dictionary).

In [20]:
ser1['USA']

10
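
One consequence of label-based indexing is worth seeing once: arithmetic between two Series matches values by label, not by position. The sketch below uses a second, made-up Series (`ser2`) whose labels only partly overlap with the first. Labels that appear in only one Series produce a missing value (NaN).

```python
import pandas as pd

ser1 = pd.Series([10, 20, 30], index=['USA', 'Germany', 'India'])
ser2 = pd.Series([1, 2, 3], index=['USA', 'Germany', 'Japan'])  # hypothetical data

total = ser1 + ser2  # values are matched by label, not by position
print(total)

print(total['USA'])      # 11.0
print(total['Germany'])  # 22.0
```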

## DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
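
To make the "bunch of Series sharing one index" idea concrete, here is a minimal sketch that builds a small DataFrame from two Series. The names `col_a`, `col_b` and `df_demo` are invented for this example.

```python
import pandas as pd

index = ['row1', 'row2', 'row3']
col_a = pd.Series([1, 2, 3], index=index)
col_b = pd.Series([10.0, 20.0, 30.0], index=index)

# Each dictionary key becomes a column name; the Series share one index.
df_demo = pd.DataFrame({'col_a': col_a, 'col_b': col_b})

print(type(df_demo['col_a']))             # each column is itself a Series
print(df_demo.index.equals(col_a.index))  # True
```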

**Creating a dataframe**

In [21]:
from numpy.random import randn
np.random.seed(101)

mat = randn(5,4)

In [22]:
df = pd.DataFrame(mat,index=['row1','row2','row3','row4','row5'],columns=['col1','col2','col3','col4'])
df

| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| row1 | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| row2 | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| row3 | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| row4 | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| row5 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


**Selection and Indexing**

Let's learn the various methods to grab data from a DataFrame.

In [23]:
# Selecting a column
df['col1']

row1    2.706850
row2    0.651118
row3   -2.018168
row4    0.188695
row5    0.190794
Name: col1, dtype: float64

In [24]:
# Selecting multiple columns
df[['col1','col2']]

| | col1 | col2 |
|---|---|---|
| row1 | 2.70685 | 0.628133 |
| row2 | 0.651118 | -0.319318 |
| row3 | -2.018168 | 0.740122 |
| row4 | 0.188695 | -0.758872 |
| row5 | 0.190794 | 1.978757 |


In [25]:
# Selecting a column using dot (attribute) notation (not recommended: it fails for names with spaces or names that clash with DataFrame methods)
df.col3

row1    0.907969
row2   -0.848077
row3    0.528813
row4   -0.933237
row5    2.605967
Name: col3, dtype: float64

In [26]:
type(df['col1'])

pandas.core.series.Series

In [27]:
# Select rows using index names
df.loc['row1']

col1    2.706850
col2    0.628133
col3    0.907969
col4    0.503826
Name: row1, dtype: float64

In [28]:
df.loc[['row1','row2']]

| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| row1 | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| row2 | 0.651118 | -0.319318 | -0.848077 | 0.605965 |


In [29]:
df.loc[['row1','row2'],['col1','col4']]

| | col1 | col4 |
|---|---|---|
| row1 | 2.70685 | 0.503826 |
| row2 | 0.651118 | 0.605965 |


In [30]:
# Select rows using index positions
df.iloc[2]

col1   -2.018168
col2    0.740122
col3    0.528813
col4   -0.589001
Name: row3, dtype: float64

In [31]:
# Select portions of a dataframe - subset of rows & columns
df.loc['row1','col2']

0.6281327087844596

In [32]:
df.loc['row1':'row3','col1':'col2']

| | col1 | col2 |
|---|---|---|
| row1 | 2.70685 | 0.628133 |
| row2 | 0.651118 | -0.319318 |
| row3 | -2.018168 | 0.740122 |
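
Notice that the label slice `'row1':'row3'` includes `row3`. This differs from position-based slicing with `.iloc`, where the end position is excluded, as in ordinary Python slicing. The sketch below uses a small stand-in DataFrame (`df_demo`, filled with consecutive integers rather than the random values above) to show that the two selections are equivalent.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(20).reshape(5, 4),
                       index=['row1', 'row2', 'row3', 'row4', 'row5'],
                       columns=['col1', 'col2', 'col3', 'col4'])

by_label = df_demo.loc['row1':'row3', 'col1':'col2']  # end labels are included
by_position = df_demo.iloc[0:3, 0:2]                  # end positions are excluded

print(by_label.equals(by_position))  # True
```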


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [33]:
df

| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| row1 | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| row2 | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| row3 | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| row4 | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| row5 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [34]:
# df > 0 evaluates to a DataFrame of booleans; values where the condition is False are shown as NaN
df[df>0]

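| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| row1 | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| row2 | 0.651118 | NaN | NaN | 0.605965 |
| row3 | NaN | 0.740122 | 0.528813 | NaN |
| row4 | 0.188695 | NaN | NaN | 0.955057 |
| row5 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |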


In [35]:
df[df['col1']>0]

| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| row1 | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| row2 | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| row4 | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| row5 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [36]:
df[df['col1']>0]['col3']

row1    0.907969
row2   -0.848077
row4   -0.933237
row5    2.605967
Name: col3, dtype: float64

In [37]:
df[df['col1']>0][['col3','col4']]

| | col3 | col4 |
|---|---|---|
| row1 | 0.907969 | 0.503826 |
| row2 | -0.848077 | 0.605965 |
| row4 | -0.933237 | 0.955057 |
| row5 | 2.605967 | 0.683509 |


In [38]:
# To combine two conditions, use & (and) or | (or), and wrap each condition in parentheses
df[(df['col1']>0) & (df['col2']>0)]

| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| row1 | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| row5 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
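
The cells above show `&` and chained brackets. As a complementary sketch, the example below uses `|` (or) and also shows how a row condition and a column list can be passed to `.loc` in a single step. The data in `df_demo` is made up for illustration and is not the random matrix used above.

```python
import pandas as pd

df_demo = pd.DataFrame({'col1': [2.7, 0.6, -2.0, -1.0],
                        'col2': [0.6, -0.3, 0.7, -0.5],
                        'col3': [0.9, -0.8, 0.5, 0.1]},
                       index=['row1', 'row2', 'row3', 'row4'])

# Rows where at least one of the two conditions holds.
either_positive = df_demo[(df_demo['col1'] > 0) | (df_demo['col2'] > 0)]

# Row condition and column selection in one .loc call.
both_positive = df_demo.loc[(df_demo['col1'] > 0) & (df_demo['col2'] > 0), ['col3']]

print(list(either_positive.index))  # ['row1', 'row2', 'row3']
print(list(both_positive.index))    # ['row1']
```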


### Some More index details

In [39]:
df

| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| row1 | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| row2 | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| row3 | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| row4 | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| row5 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [40]:
df.reset_index()

| | index | col1 | col2 | col3 | col4 |
|---|---|---|---|---|---|
| 0 | row1 | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| 1 | row2 | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| 2 | row3 | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| 3 | row4 | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| 4 | row5 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [41]:
new_index = ['A','B','C','D','E']
df['col5'] = new_index
df

| | col1 | col2 | col3 | col4 | col5 |
|---|---|---|---|---|---|
| row1 | 2.70685 | 0.628133 | 0.907969 | 0.503826 | A |
| row2 | 0.651118 | -0.319318 | -0.848077 | 0.605965 | B |
| row3 | -2.018168 | 0.740122 | 0.528813 | -0.589001 | C |
| row4 | 0.188695 | -0.758872 | -0.933237 | 0.955057 | D |
| row5 | 0.190794 | 1.978757 | 2.605967 | 0.683509 | E |


In [42]:
df.set_index('col5')

| col5 | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [43]:
df

| | col1 | col2 | col3 | col4 | col5 |
|---|---|---|---|---|---|
| row1 | 2.70685 | 0.628133 | 0.907969 | 0.503826 | A |
| row2 | 0.651118 | -0.319318 | -0.848077 | 0.605965 | B |
| row3 | -2.018168 | 0.740122 | 0.528813 | -0.589001 | C |
| row4 | 0.188695 | -0.758872 | -0.933237 | 0.955057 | D |
| row5 | 0.190794 | 1.978757 | 2.605967 | 0.683509 | E |


In [44]:
df.set_index('col5',inplace=True)  # set_index returns a new DataFrame by default; inplace=True modifies df itself

In [45]:
df

| col5 | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [46]:
# Homework: Read about Hierarchical Indexing
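
As a starting point for the homework, here is a minimal sketch of a hierarchical (multi-level) index. The group labels, level names and values are invented for illustration. The outer level can be selected on its own, or a tuple can be used to select one row by both levels.

```python
import numpy as np
import pandas as pd

outer = ['G1', 'G1', 'G2', 'G2']
inner = [1, 2, 1, 2]
hier_index = pd.MultiIndex.from_arrays([outer, inner], names=['group', 'num'])

df_multi = pd.DataFrame(np.arange(8).reshape(4, 2),
                        index=hier_index, columns=['A', 'B'])

g1_rows = df_multi.loc['G1']          # all rows under the outer label G1
single_row = df_multi.loc[('G2', 1)]  # one row, chosen by (outer, inner)

print(g1_rows)
print(single_row['A'], single_row['B'])  # 4 5
```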

### Missing data in pandas

In [47]:
# Creating a dataframe using a dictionary
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})
df

| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
| 2 | NaN | NaN | 3 |


In [48]:
df.isnull()

| | A | B | C |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | True | False |
| 2 | True | True | False |


In [49]:
df.dropna()  # By default axis=0, which drops every row that contains a missing value

| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |


In [50]:
df.dropna(axis=1)  # axis=1 drops every column that contains a missing value

| | C |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |


In [51]:
df.fillna(value='Fill Value')

| | A | B | C |
|---|---|---|---|
| 0 | 1 | 5 | 1 |
| 1 | 2 | Fill Value | 2 |
| 2 | Fill Value | Fill Value | 3 |


In [52]:
df['A'].fillna(value=df['A'].mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64

In [53]:
df

| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
| 2 | NaN | NaN | 3 |
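
As the last cell shows, `df` still contains its missing values: `dropna` and `fillna` return a new object and leave the original untouched. A minimal sketch of keeping the result, using a copy of the same data under the made-up name `df_demo`, is to assign it back to the column.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, np.nan],
                        'B': [5, np.nan, np.nan],
                        'C': [1, 2, 3]})

filled = df_demo['A'].fillna(value=df_demo['A'].mean())  # returns a new Series
print(df_demo['A'].isnull().sum())  # 1: the original column is unchanged

df_demo['A'] = filled               # assign back to keep the change
print(df_demo['A'].isnull().sum())  # 0
```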


### Groupby operations

The groupby method allows you to group rows of data together and call aggregate functions (such as mean, min, max or count) on each group.

In [54]:
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df

| | Company | Person | Sales |
|---|---|---|---|
| 0 | GOOG | Sam | 200 |
| 1 | GOOG | Charlie | 120 |
| 2 | MSFT | Amy | 340 |
| 3 | MSFT | Vanessa | 124 |
| 4 | FB | Carl | 243 |
| 5 | FB | Sarah | 350 |


**Now you can use the .groupby() method to group rows together based on a column name. For instance, let's group by Company. This creates a DataFrameGroupBy object:**

In [55]:
by_comp = df.groupby('Company')
type(by_comp)

pandas.core.groupby.DataFrameGroupBy

In [56]:
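by_comp.mean(numeric_only=True)  # numeric_only=True skips the text column 'Person'; pandas 2.0 and later raise an error without it

| Company | Sales |
|---|---|
| FB | 296.5 |
| GOOG | 160.0 |
| MSFT | 232.0 |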


In [57]:
df.groupby('Company').mean(numeric_only=True)  # numeric_only=True averages only numeric columns (needed in pandas 2.0+)

| Company | Sales |
|---|---|
| FB | 296.5 |
| GOOG | 160.0 |
| MSFT | 232.0 |


In [58]:
df.groupby('Company').min()

| Company | Person | Sales |
|---|---|---|
| FB | Carl | 243 |
| GOOG | Charlie | 120 |
| MSFT | Amy | 124 |


In [59]:
df.groupby('Company').describe()

| Company | Sales: count | Sales: mean | Sales: std | Sales: min | Sales: 25% | Sales: 50% | Sales: 75% | Sales: max |
|---|---|---|---|---|---|---|---|---|
| FB | 2.0 | 296.5 | 75.660426 | 243.0 | 269.75 | 296.5 | 323.25 | 350.0 |
| GOOG | 2.0 | 160.0 | 56.568542 | 120.0 | 140.0 | 160.0 | 180.0 | 200.0 |
| MSFT | 2.0 | 232.0 | 152.735065 | 124.0 | 178.0 | 232.0 | 286.0 | 340.0 |


In [60]:
df.groupby('Company').mean(numeric_only=True)  # numeric_only=True averages only numeric columns (needed in pandas 2.0+)

| Company | Sales |
|---|---|
| FB | 296.5 |
| GOOG | 160.0 |
| MSFT | 232.0 |


In [61]:
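df.groupby('Company').mean(numeric_only=True).reset_index()  # reset_index turns the group labels back into an ordinary column

| | Company | Sales |
|---|---|---|
| 0 | FB | 296.5 |
| 1 | GOOG | 160.0 |
| 2 | MSFT | 232.0 |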


In [62]:
df.groupby('Company').max()[['Person','Sales']]

| Company | Person | Sales |
|---|---|---|
| FB | Sarah | 350 |
| GOOG | Sam | 200 |
| MSFT | Vanessa | 340 |
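
Note that `min()` and `max()` are computed separately for each column, so the `Person` shown is the alphabetical minimum or maximum, not necessarily the person who made that sale. When several statistics are needed for one column, a possible approach is to select the column first and call `.agg` with a list of function names, as sketched below on the same sales data (`df_demo` and `summary` are names chosen for this example).

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df_demo = pd.DataFrame(data)

# Select the numeric column, then compute several aggregates per company.
summary = df_demo.groupby('Company')['Sales'].agg(['count', 'mean', 'max'])
print(summary)

print(summary.loc['FB', 'mean'])   # 296.5
print(summary.loc['GOOG', 'max'])  # 200
```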


### Merging, Joining, and Concatenating

There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating.

In [63]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3']},index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7']},index=[4, 5, 6, 7]) 

In [64]:
df1

| | A | B | C |
|---|---|---|---|
| 0 | A0 | B0 | C0 |
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
| 3 | A3 | B3 | C3 |


In [65]:
df2

| | A | B | C |
|---|---|---|---|
| 4 | A4 | B4 | C4 |
| 5 | A5 | B5 | C5 |
| 6 | A6 | B6 | C6 |
| 7 | A7 | B7 | C7 |


### Concatenation

Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use **pd.concat** and pass in a list of DataFrames to concatenate together:

In [66]:
pd.concat([df1,df2])

| | A | B | C |
|---|---|---|---|
| 0 | A0 | B0 | C0 |
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
| 3 | A3 | B3 | C3 |
| 4 | A4 | B4 | C4 |
| 5 | A5 | B5 | C5 |
| 6 | A6 | B6 | C6 |
| 7 | A7 | B7 | C7 |


In [67]:
pd.concat([df1,df2],axis=1)

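| | A | B | C | A | B | C |
|---|---|---|---|---|---|---|
| 0 | A0 | B0 | C0 | NaN | NaN | NaN |
| 1 | A1 | B1 | C1 | NaN | NaN | NaN |
| 2 | A2 | B2 | C2 | NaN | NaN | NaN |
| 3 | A3 | B3 | C3 | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | A4 | B4 | C4 |
| 5 | NaN | NaN | NaN | A5 | B5 | C5 |
| 6 | NaN | NaN | NaN | A6 | B6 | C6 |
| 7 | NaN | NaN | NaN | A7 | B7 | C7 |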
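
The missing values in the `axis=1` result appear because `pd.concat` lines rows up by index label, and `df1` and `df2` have no labels in common. The sketch below uses three small made-up frames to contrast the two situations: matching indexes give a clean side-by-side result, while non-overlapping indexes leave gaps.

```python
import pandas as pd

df_a = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
df_b = pd.DataFrame({'B': ['B0', 'B1']}, index=[0, 1])  # same index as df_a
df_c = pd.DataFrame({'B': ['B2', 'B3']}, index=[2, 3])  # no labels in common

aligned = pd.concat([df_a, df_b], axis=1)     # rows match up, nothing missing
misaligned = pd.concat([df_a, df_c], axis=1)  # rows do not match, gaps appear

print(aligned.isnull().sum().sum())     # 0
print(misaligned.isnull().sum().sum())  # 4
```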


### Merging

The **merge** function allows you to combine DataFrames using logic similar to joining SQL tables. Rows are matched on the values of one or more key columns.

In [68]:
left = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key2': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})    

In [69]:
left

| | A | B | key1 |
|---|---|---|---|
| 0 | A0 | B0 | K0 |
| 1 | A1 | B1 | K1 |
| 2 | A2 | B2 | K2 |
| 3 | A3 | B3 | K3 |


In [70]:
right

| | C | D | key2 |
|---|---|---|---|
| 0 | C0 | D0 | K0 |
| 1 | C1 | D1 | K1 |
| 2 | C2 | D2 | K2 |
| 3 | C3 | D3 | K3 |


In [71]:
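pd.merge(left,right,how='inner',left_on='key1',right_on ='key2') # how can be 'inner', 'outer', 'left' or 'right'

| | A | B | key1 | C | D | key2 |
|---|---|---|---|---|---|---|
| 0 | A0 | B0 | K0 | C0 | D0 | K0 |
| 1 | A1 | B1 | K1 | C1 | D1 | K1 |
| 2 | A2 | B2 | K2 | C2 | D2 | K2 |
| 3 | A3 | B3 | K3 | C3 | D3 | K3 |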
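
In the example above every key appears in both tables, so all four values of `how` would give the same rows. The sketch below uses two small made-up tables whose keys only partly overlap, which makes the difference between `inner`, `outer` and `left` visible. The frame and column names are chosen for this illustration.

```python
import pandas as pd

left_demo = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                          'A': ['A0', 'A1', 'A2']})
right_demo = pd.DataFrame({'key': ['K0', 'K2', 'K3'],
                           'C': ['C0', 'C2', 'C3']})

inner = pd.merge(left_demo, right_demo, how='inner', on='key')     # keys in both tables
outer = pd.merge(left_demo, right_demo, how='outer', on='key')     # keys in either table
left_join = pd.merge(left_demo, right_demo, how='left', on='key')  # every key from the left table

print(inner['key'].tolist())      # ['K0', 'K2']
print(sorted(outer['key']))       # ['K0', 'K1', 'K2', 'K3']
print(left_join['key'].tolist())  # ['K0', 'K1', 'K2']
```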


### Joining
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

In [72]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

In [73]:
left.join(right)

| | A | B | C | D |
|---|---|---|---|---|
| K0 | A0 | B0 | C0 | D0 |
| K1 | A1 | B1 | NaN | NaN |
| K2 | A2 | B2 | C2 | D2 |


In [74]:
left.join(right, how='outer')

| | A | B | C | D |
|---|---|---|---|---|
| K0 | A0 | B0 | C0 | D0 |
| K1 | A1 | B1 | NaN | NaN |
| K2 | A2 | B2 | C2 | D2 |
| K3 | NaN | NaN | C3 | D3 |


### Operations

There are lots of operations with pandas that will be really useful, but don't fall into any distinct category.

In [75]:
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 1 | 444 | abc |
| 1 | 2 | 555 | def |
| 2 | 3 | 666 | ghi |
| 3 | 4 | 444 | xyz |


### Info on Unique Values

In [76]:
df['col2'].unique()

array([444, 555, 666], dtype=int64)

In [77]:
df['col2'].value_counts()

444    2
555    1
666    1
Name: col2, dtype: int64

In [78]:
df['col2'].nunique()

3

### Applying Functions

In [79]:
# Standard functions
df['col1'].sum()

10

In [80]:
# Use apply to call a function on each element of a column
df['col3'].apply(len)

0    3
1    3
2    3
3    3
Name: col3, dtype: int64

In [81]:
# Using lambda functions
df['col1'].apply(lambda x: x*2)

0    2
1    4
2    6
3    8
Name: col1, dtype: int64

In [82]:
# User defined functions
def times2(x):
    return x*2

df['col1'].apply(times2)

0    2
1    4
2    6
3    8
Name: col1, dtype: int64
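
For simple arithmetic or string operations, `apply` is not the only option. The sketch below, on a copy of the same data named `df_demo` for illustration, shows that multiplying the column directly and using the `.str` accessor give the same results as the `apply` calls above.

```python
import pandas as pd

df_demo = pd.DataFrame({'col1': [1, 2, 3, 4],
                        'col3': ['abc', 'def', 'ghi', 'xyz']})

via_apply = df_demo['col1'].apply(lambda x: x * 2)  # calls the function once per element
vectorized = df_demo['col1'] * 2                    # operates on the whole column at once
lengths = df_demo['col3'].str.len()                 # string methods via the .str accessor

print(via_apply.tolist())   # [2, 4, 6, 8]
print(vectorized.tolist())  # [2, 4, 6, 8]
print(lengths.tolist())     # [3, 3, 3, 3]
```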

**Get column and index names:**

In [83]:
df.columns

Index(['col1', 'col2', 'col3'], dtype='object')

In [84]:
df.index

RangeIndex(start=0, stop=4, step=1)

**Sorting a dataframe**

In [85]:
df.sort_values(by='col2') # inplace=False by default, so df itself is not reordered

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 1 | 444 | abc |
| 3 | 4 | 444 | xyz |
| 1 | 2 | 555 | def |
| 2 | 3 | 666 | ghi |


**Creating a pivot table**

In [86]:
df.pivot_table(values='col2',index=['col3'],columns=['col1'])

| col3 \ col1 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| abc | 444.0 | NaN | NaN | NaN |
| def | NaN | 555.0 | NaN | NaN |
| ghi | NaN | NaN | 666.0 | NaN |
| xyz | NaN | NaN | NaN | 444.0 |
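
In the table above each (col3, col1) pair occurs only once, so no aggregation is visible. When several rows fall into the same cell, `pivot_table` combines them (with the mean by default). The sketch below uses a small made-up sales table, including a hypothetical `Quarter` column, and asks for the sum instead.

```python
import pandas as pd

# Hypothetical data: MSFT has two sales rows in Q1.
df_demo = pd.DataFrame({'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT'],
                        'Quarter': ['Q1', 'Q2', 'Q1', 'Q1'],
                        'Sales': [200, 120, 340, 124]})

pivot = df_demo.pivot_table(values='Sales', index='Company',
                            columns='Quarter', aggfunc='sum')
print(pivot)

print(pivot.loc['MSFT', 'Q1'])  # 464: the two Q1 rows were added together
```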


### Data Input & Output
 
- Pandas provides a family of pd.read_xxx functions (for example pd.read_csv, pd.read_excel and pd.read_json) for reading different file formats
- We will use the pd.read_csv function to read data from csv files into a dataframe
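
A minimal sketch of reading and writing CSV data follows. To keep it self-contained it reads from an in-memory string instead of a file on disk; with a real file you would pass its path to `pd.read_csv` and `to_csv` instead. The sample rows are taken from the sales data used earlier.

```python
import io
import pandas as pd

csv_text = "Company,Person,Sales\nGOOG,Sam,200\nMSFT,Amy,340\n"

# Read: the first line is used as the column header.
df_in = pd.read_csv(io.StringIO(csv_text))
print(df_in)

# Write: index=False leaves the row numbers out of the output.
buffer = io.StringIO()
df_in.to_csv(buffer, index=False)
print(buffer.getvalue())
```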