## Introduction to Pandas: Series and DataFrames

Series: a one-dimensional labelled array. It is not restricted to numeric types and is optimised for vectorised, element-wise operations. Built on top of NumPy.

DataFrame: a two-dimensional table with a row index and column names. Each column can hold a different data type.
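
As a rough illustration of the two structures described above, the sketch below builds one of each. The names `cities` and `people` and their values are made up for this example.

```python
import pandas as pd

# A Series can hold non-numeric values and carries an index of labels.
cities = pd.Series(["Pune", "Delhi", "Goa"], index=["a", "b", "c"])

# A DataFrame is a set of columns (each one a Series) sharing one row index.
people = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [31, 28]})

print(cities["b"])      # label-based lookup on the Series
print(people.dtypes)    # one dtype per column
```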

Series:

In [11]:
# Create a Series: pd.Series() accepts a list or a NumPy array
import pandas as pd 
import numpy as np

m = pd.Series([1,2,3,4,5])

# print m
m

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [12]:
# Accessing elements of a Series: like a 1-D list
# access a single index
m[1]

2

In [13]:
# select a range of indices (rows): all rows from index 1 onwards
m[1:]

1    2
2    3
3    4
4    5
dtype: int64

In [14]:
# we can select arbitrary rows by index
# NOTE: m[1,3] will not select indices 1 and 3; pass a list such as [1,3] inside the outer []
m[[1,3]]


1    2
3    4
dtype: int64

In [15]:
# In contrast to a NumPy array, a pandas Series has an apply() method that applies a function to each element. Calling apply() on a NumPy array raises an AttributeError.
np.arange(10).apply(lambda x:x+1)

AttributeError: 'numpy.ndarray' object has no attribute 'apply'

In [None]:
pd.Series([1,2,3,4]).apply(lambda x:x+1)
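
For simple arithmetic such as adding 1, both NumPy arrays and Series also support vectorised operators, so `apply()` is mainly useful when the function cannot be written that way. A minimal sketch comparing the approaches (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

arr = np.arange(5)
ser = pd.Series(arr)

# Element-wise function call: one Python call per value.
via_apply = ser.apply(lambda x: x + 1)

# Vectorised arithmetic works on both NumPy arrays and Series.
via_numpy = arr + 1
via_series = ser + 1

print(via_apply.tolist(), via_numpy.tolist(), via_series.tolist())
```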

Dataframes: Real-world data is usually stored in this format. Every row is an object (a record) and every column is an attribute.

In [None]:
Creating DataFrames: there are many ways
1) from Dictionary
2) from csv file
3) from json file
4) from text file
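
The JSON and text-file routes listed above are not demonstrated in the cells that follow, so here is a small sketch of both. It uses in-memory strings via `io.StringIO` in place of real files; the contents of `json_text` and `tsv_text` are hypothetical stand-ins.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
json_text = '[{"Team": "CSK", "Won_times": 3}, {"Team": "MI", "Won_times": 5}]'
tsv_text = "Team\tWon_times\nCSK\t3\nMI\t5\n"

# 3) from JSON: a list of records, one dict per row
from_json = pd.read_json(io.StringIO(json_text), orient="records")

# 4) from delimited text: read_csv with an explicit separator
from_text = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_json)
print(from_text)
```

With real files, the same calls take a file path instead of the `StringIO` object.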

In [None]:
# 1) from Dictionary:
df = pd.DataFrame({'Name':['Santoshkumar vagga', 'Suraj Chauhan', 'Satish Biradar', 'Uday Poddar'],
                    'Age':[25,26,24,27],
                    'Education':['M.Sc','M.D', 'B.E', 'M.E']})
df

In [None]:
# 2) from a CSV file. NOTE: when exporting from a spreadsheet, save it as CSV (comma delimited)
df = pd.read_csv("sample_book.csv")
df

Reading and Summarising Dataframes

In [None]:
# Print top 5 rows
df.head()

In [None]:
# print last 5 rows
df.tail()

In [None]:
# to see the data type and non-null count of each column
df.info()

In [None]:
# to know total rows and columns
df.shape

In [None]:
# to get summary statistics such as mean, min and max (only for numeric columns)
df.describe()

In [None]:
# get all column names of dataframe
df.columns

In [None]:
# get the underlying data as a 2-D NumPy array (one inner array per row)
df.values

Set custom index column: using set_index()

In [None]:
df = pd.read_csv("sample_book.csv")
df.set_index('Team', inplace=True)
df
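
A small sketch of how `set_index()` behaves without `inplace=True`, using a made-up two-row frame (`df_idx` is a hypothetical name): the call returns a new DataFrame and leaves the original unchanged, and `reset_index()` reverses it.

```python
import pandas as pd

df_idx = pd.DataFrame({"Team": ["CSK", "MI"], "Captain": ["MS Dhoni", "Rohit Sharma"]})

indexed = df_idx.set_index("Team")    # new frame; df_idx still has a Team column
restored = indexed.reset_index()      # Team becomes an ordinary column again

print(indexed.loc["MI", "Captain"])
```

When reading from a file, `pd.read_csv(path, index_col="Team")` sets the index in one step.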

Sorting Dataframes:


1) Sorting Index:

In [None]:
# 1) Sort Index: using sort_index() 
df = pd.read_csv("sample_book.csv")
df.set_index('Team', inplace=True)
df.sort_index(ascending=True, inplace = True)
df

2) Sorting Values: We can also sort by any custom column(s)

In [None]:
# using sort_values(by, axis=0, ascending=True, inplace=True)
# Note: with axis=0 (the default) the rows are reordered by the values in the given column(s); with axis=1 the columns are reordered by the values in the given row(s).
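
To make the `axis` argument concrete, here is a minimal sketch on a made-up 2x2 frame (`scores` and its labels are hypothetical):

```python
import pandas as pd

scores = pd.DataFrame({"b": [3, 1], "a": [2, 4]}, index=["x", "y"])

# axis=0 (default): reorder the rows by the values in column "b".
rows_sorted = scores.sort_values(by="b", axis=0)

# axis=1: reorder the columns by the values in row "x".
cols_sorted = scores.sort_values(by="x", axis=1)

print(rows_sorted.index.tolist())    # row order after sorting
print(cols_sorted.columns.tolist())  # column order after sorting
```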

In [None]:
df.sort_values('Squad_team', ascending=False, inplace=True)
df

In [None]:
# We can also sort by more than one column: rows are ordered by the first column given, and ties are broken by the second column.
df.sort_values(by=['Total_Seasons', 'Squad_team'], ascending=True, inplace=True)
df
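
The sketch below checks the ordering rule for multi-column sorts on four rows taken from the table used in this notebook (the frame name `teams` is an assumption). It also shows that `ascending` accepts a list, one direction per column.

```python
import pandas as pd

teams = pd.DataFrame(
    {"Total_Seasons": [12, 8, 12, 8], "Squad_team": [40, 39, 38, 38]},
    index=["DC", "RR", "MI", "KKR"],
)

# Primary key: Total_Seasons; ties are broken by Squad_team.
ordered = teams.sort_values(by=["Total_Seasons", "Squad_team"])

# A list for `ascending` sets the direction per column.
mixed = teams.sort_values(by=["Total_Seasons", "Squad_team"], ascending=[False, True])

print(ordered.index.tolist())
print(mixed.index.tolist())
```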

Indexing and Selecting data:

1) Selecting rows from a dataframe
2) Selecting columns from a dataframe

In [None]:
# 1) Selecting rows:
df[2:6] 

In [None]:
# Selecting alternate rows (every second row, starting from the 4th row, i.e. position 3)
df[3::2]

In [None]:
# 2) Selecting columns: each column is a pandas Series. Two ways: a) using [] b) using dot notation. NOTE: a column can be extracted either as a Series or as a DataFrame.
# a) using []: returns a Series
df['Captain']


In [None]:
type(df['Captain'])

In [None]:
# b) using dot notation: returns a Series
df.Captain


In [None]:
type(df.Captain)

In [None]:
# as a DataFrame: wrap the column name in a list (double brackets)
df[["Captain"]]

In [None]:
type(df[["Captain"]])

In [None]:
# Selecting multiple columns: always returns a DataFrame
# NOTE: don't include the index in the list; it is shown for every row by default
df[['Captain', 'Total_Seasons']]

In [None]:
type(df[['Captain', 'Total_Seasons']])

In [None]:
# Unlike a Series (1-D array), a DataFrame cannot be accessed by a single row position such as df[2]; pandas looks for a column labelled 2 and raises a KeyError.
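
A minimal sketch of that behaviour on a made-up two-row frame (`df_demo` is a hypothetical name): plain `[]` with an integer is treated as a column lookup, while `.iloc` selects a row by position.

```python
import pandas as pd

df_demo = pd.DataFrame({"Captain": ["MS Dhoni", "S Iyer"], "Won_times": [3, 2]})

try:
    df_demo[2]                 # looks for a column labelled 2
    lookup_failed = False
except KeyError:
    lookup_failed = True

second_row = df_demo.iloc[1]   # a single row by position needs .iloc

print(lookup_failed)
print(second_row["Captain"])
```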

In [34]:
df

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
SRH,Dwarner,M Pandey,1.0,7,39
KKR,D Kartik,A Russel,2.0,8,38
RR,S Smith,A Rahane,1.0,8,39
CSK,MS Dhoni,A Jadeja,3.0,10,41
KXP,KL Rahul,M Agarwal,,11,42
MI,Rohit Sharma,K Pollard,5.0,12,38
DC,S Iyer,R Pant,2.0,12,40
RCB,Virat Kohli,AB de villers,,12,40


In [33]:
# Select every second row starting at position 2 (the 3rd, 5th and 7th rows)
df[2::2]

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
RR,S Smith,A Rahane,1.0,8,39
KXP,KL Rahul,M Agarwal,,11,42
DC,S Iyer,R Pant,2.0,12,40


## Pandas recommends the two approaches below for indexing and subsetting, since they are more explicit.
#1) Position-based indexing: using df.iloc
#2) Label-based indexing: using df.loc

In [45]:
# 1) Position-based indexing (see help(pd.DataFrame.iloc) for details)
# df.iloc[a, b]
# a = row positions: a single integer, a list of integers, or a slice
# b = column positions: a single integer, a list of integers, or a slice

# Slice form for a or b:
# m:n
# here, m is the starting position and n-1 is the last position included
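
The forms listed above can be tried on a small numeric frame. This is only a sketch; `grid` and its values are invented for the example.

```python
import pandas as pd

grid = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    columns=["a", "b", "c"],
)

single = grid.iloc[1, 2]          # scalar: row 1, column 2
block = grid.iloc[1:3, 0:2]       # rows 1-2, columns 0-1 (end excluded)
picked = grid.iloc[[0, 3], [2]]   # rows 0 and 3, column 2, as a DataFrame
last_row = grid.iloc[-1]          # negative positions count from the end

print(single)
print(block)
```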

In [51]:
df.iloc[2] # Series output: third row, all columns

Captain           S Smith
Vice_Captain     A Rahane
Won_times               1
Total_Seasons           8
Squad_team             39
Name: RR, dtype: object

In [50]:
df.iloc[[2]] # DataFrame output: third row, all columns

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
RR,S Smith,A Rahane,1.0,8,39


In [57]:
df.iloc[[2,3,4],[1,2,3]] # 3,4,5th rows and 2,3,4th columns

Team,Vice_Captain,Won_times,Total_Seasons
RR,A Rahane,1.0,8
CSK,A Jadeja,3.0,10
KXP,M Agarwal,,11


In [58]:
df.iloc[2:6,3:5] # 3,4,5,6th row and 4, 5th column

Team,Total_Seasons,Squad_team
RR,8,39
CSK,10,41
KXP,11,42
MI,12,38


In [60]:
# A slice cannot be written inside a list, so this raises a SyntaxError; use df.iloc[2:4] instead
df.iloc[[2:4]]

SyntaxError: invalid syntax (<ipython-input-60-d9dc4857998e>, line 1)
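
Some ways to get the intended rows without the syntax error, sketched on a made-up frame (`frame` is a hypothetical name). `np.r_` is a NumPy helper that concatenates integers and slices into one array of positions, which is handy when a selection mixes both.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"v": [10, 20, 30, 40, 50]})

by_slice = frame.iloc[2:4]                 # a bare slice is fine
by_list = frame.iloc[list(range(2, 4))]    # expand the range into a list
combined = frame.iloc[np.r_[0, 2:4]]       # positions 0, 2 and 3

print(combined["v"].tolist())
```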

In [96]:
# Using a boolean array: selects only the rows corresponding to True.
df.iloc[[True, True,False, True, False, True, False,True]]

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
SRH,Dwarner,M Pandey,1.0,7,39
KKR,D Kartik,A Russel,2.0,8,38
CSK,MS Dhoni,A Jadeja,3.0,10,41
MI,Rohit Sharma,K Pollard,5.0,12,38
RCB,Virat Kohli,AB de villers,,12,40


In [97]:
# 2) Label-based selection: using df.loc
# We know that a DataFrame can have its own custom index.

df

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
SRH,Dwarner,M Pandey,1.0,7,39
KKR,D Kartik,A Russel,2.0,8,38
RR,S Smith,A Rahane,1.0,8,39
CSK,MS Dhoni,A Jadeja,3.0,10,41
KXP,KL Rahul,M Agarwal,,11,42
MI,Rohit Sharma,K Pollard,5.0,12,38
DC,S Iyer,R Pant,2.0,12,40
RCB,Virat Kohli,AB de villers,,12,40


In [103]:
df.loc[['SRH', 'KXP', 'MI'], :]

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
SRH,Dwarner,M Pandey,1.0,7,39
KXP,KL Rahul,M Agarwal,,11,42
MI,Rohit Sharma,K Pollard,5.0,12,38


In [106]:
df.loc[['SRH', 'KXP', 'MI'], 'Vice_Captain':'Squad_team']

Team,Vice_Captain,Won_times,Total_Seasons,Squad_team
SRH,M Pandey,1.0,7,39
KXP,M Agarwal,,11,42
MI,K Pollard,5.0,12,38
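
Note that the label slice in the cell above includes its end label (`Squad_team`), unlike a positional slice. A minimal sketch of the difference, using three rows from the same table (the frame name `ipl` is an assumption):

```python
import pandas as pd

ipl = pd.DataFrame(
    {
        "Captain": ["Dwarner", "D Kartik", "S Smith"],
        "Won_times": [1.0, 2.0, 1.0],
        "Total_Seasons": [7, 8, 8],
    },
    index=["SRH", "KKR", "RR"],
)

by_label = ipl.loc["SRH":"KKR", "Captain":"Won_times"]   # both ends included
by_position = ipl.iloc[0:1, 0:1]                          # end excluded

print(by_label.shape, by_position.shape)
```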


In [None]:
Subsetting Dataframes based on Conditions:

In [107]:
df

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
SRH,Dwarner,M Pandey,1.0,7,39
KKR,D Kartik,A Russel,2.0,8,38
RR,S Smith,A Rahane,1.0,8,39
CSK,MS Dhoni,A Jadeja,3.0,10,41
KXP,KL Rahul,M Agarwal,,11,42
MI,Rohit Sharma,K Pollard,5.0,12,38
DC,S Iyer,R Pant,2.0,12,40
RCB,Virat Kohli,AB de villers,,12,40


In [112]:
# using a boolean array: df['Won_times']>2.0
df.loc[df['Won_times']>2.0]

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
CSK,MS Dhoni,A Jadeja,3.0,10,41
MI,Rohit Sharma,K Pollard,5.0,12,38


In [113]:
# Equivalent to the one above
df.loc[df['Won_times']>2.0, ]

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
CSK,MS Dhoni,A Jadeja,3.0,10,41
MI,Rohit Sharma,K Pollard,5.0,12,38


In [114]:
df

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
SRH,Dwarner,M Pandey,1.0,7,39
KKR,D Kartik,A Russel,2.0,8,38
RR,S Smith,A Rahane,1.0,8,39
CSK,MS Dhoni,A Jadeja,3.0,10,41
KXP,KL Rahul,M Agarwal,,11,42
MI,Rohit Sharma,K Pollard,5.0,12,38
DC,S Iyer,R Pant,2.0,12,40
RCB,Virat Kohli,AB de villers,,12,40


In [116]:
df.loc[(df.Total_Seasons>8) & (df.Squad_team>40)]

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
CSK,MS Dhoni,A Jadeja,3.0,10,41
KXP,KL Rahul,M Agarwal,,11,42


In [118]:
df.loc[(df.Total_Seasons>8) | (df.Squad_team>40)]

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
CSK,MS Dhoni,A Jadeja,3.0,10,41
KXP,KL Rahul,M Agarwal,,11,42
MI,Rohit Sharma,K Pollard,5.0,12,38
DC,S Iyer,R Pant,2.0,12,40
RCB,Virat Kohli,AB de villers,,12,40
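
Two details of the conditions above are easy to miss: each comparison needs its own parentheses because `&` and `|` bind more tightly than `>`, and a comparison against NaN evaluates to False. The sketch below uses four rows from the table (the names `df_cond`, `won_more_than_two` and `long_running` are illustrative) and also shows `~` for negating a mask.

```python
import numpy as np
import pandas as pd

df_cond = pd.DataFrame(
    {"Won_times": [3.0, np.nan, 5.0, 2.0], "Total_Seasons": [10, 11, 12, 12]},
    index=["CSK", "KXP", "MI", "DC"],
)

won_more_than_two = df_cond["Won_times"] > 2.0    # NaN compares as False
long_running = df_cond["Total_Seasons"] > 10

both = df_cond.loc[won_more_than_two & long_running]
either = df_cond.loc[won_more_than_two | long_running]
not_winner = df_cond.loc[~won_more_than_two]      # includes the NaN row

print(both.index.tolist())
print(not_winner.index.tolist())
```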


In [125]:
# We can keep only the rows whose column value is in a given list, using isin(). NOTE: items in the list that do not occur in the column simply match nothing.
valid_rows = ['A Jadeja', 'R Pant']
df.loc[df.Vice_Captain.isin(valid_rows)]

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
CSK,MS Dhoni,A Jadeja,3.0,10,41
DC,S Iyer,R Pant,2.0,12,40


## Merge and Append(Concat):

##1) Merging two or more dataframes using pd.merge()

In [18]:
df_1 = df
df_2 = pd.read_csv("sample_workbook_2.csv")
df_2

,Captain,ODI_RUNS,Catches
0,R Pointing,14000,700
1,S Ganguly,12000,710
2,R Dravid,11000,800


In [20]:
# how="inner" keeps only the rows whose "Captain" value appears in both DataFrames. The result has all columns from both DataFrames, with the common key column appearing once.
pd.merge(df_1, df_2, how="inner", on="Captain")

,Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team,Type,ODI_RUNS,Catches
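
An inner merge returns no rows when the key columns share no values, which is what the empty result above shows. The sketch below contrasts `how="inner"`, `"left"` and `"outer"` on two tiny hypothetical frames (`left` and `right`); `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"Captain": ["MS Dhoni", "S Iyer"], "Won_times": [3, 2]})
right = pd.DataFrame({"Captain": ["MS Dhoni", "R Dravid"], "ODI_RUNS": [12000, 11000]})

inner = pd.merge(left, right, how="inner", on="Captain")   # keys present in both
left_join = pd.merge(left, right, how="left", on="Captain")  # every row of `left`
outer = pd.merge(left, right, how="outer", on="Captain", indicator=True)

print(inner)
print(outer[["Captain", "_merge"]])
```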


Concatenate: using pd.concat()

In [None]:
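# 2 ways:
# 1) Concatenate along rows: use axis=0 (default) [NOTE: the columns and their labels should ideally match; columns missing from one DataFrame are filled with NaN]
# 2) Concatenate along columns: use axis=1 [NOTE: rows are aligned on the index; rows missing from one DataFrame are filled with NaN]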
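
A minimal sketch of both directions on two one-row frames with partly different columns (`a` and `b` are hypothetical names). It also shows `ignore_index=True`, which renumbers the rows instead of keeping the original index values.

```python
import pandas as pd

a = pd.DataFrame({"Captain": ["Virat Kohli"], "ODI_RUNS": [25000]})
b = pd.DataFrame({"Captain": ["R Dravid"], "Catches": [800]})

stacked = pd.concat([a, b], axis=0)                          # index 0, 0; gaps become NaN
renumbered = pd.concat([a, b], axis=0, ignore_index=True)    # index 0, 1
side_by_side = pd.concat([a, b], axis=1)                     # aligned on the row index

print(stacked)
print(side_by_side)
```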

In [133]:
df_3 = pd.read_csv("sample_workbook_3.csv")
df_3

,Captain,ODI_RUNS,Catches
0,R Pointing,14000,700
1,S Ganguly,12000,710
2,R Dravid,11000,800


In [136]:
df_2

,Captain,ODI_RUNS,Catches
0,Virat Kohli,25000,840
1,Rohit Sharma,13000,490
2,MS Dhoni,12000,1280
3,KL Rahul,3000,390
4,D Kartik,3500,320
5,Dwarner,8000,790
6,S Iyer,1000,321
7,S Smith,8000,800


In [138]:
# Let's concatenate df_2 and df_3 along rows (axis=0)
pd.concat([df_2, df_3], axis=0)

,Captain,ODI_RUNS,Catches
0,Virat Kohli,25000,840
1,Rohit Sharma,13000,490
2,MS Dhoni,12000,1280
3,KL Rahul,3000,390
4,D Kartik,3500,320
5,Dwarner,8000,790
6,S Iyer,1000,321
7,S Smith,8000,800
0,R Pointing,14000,700
1,S Ganguly,12000,710


In [139]:
# Let's concatenate along columns (axis=1)
pd.concat([df_2, df_3], axis =1)

,Captain,ODI_RUNS,Catches,Captain,ODI_RUNS,Catches
0,Virat Kohli,25000,840,R Pointing,14000.0,700.0
1,Rohit Sharma,13000,490,S Ganguly,12000.0,710.0
2,MS Dhoni,12000,1280,R Dravid,11000.0,800.0
3,KL Rahul,3000,390,,,
4,D Kartik,3500,320,,,
5,Dwarner,8000,790,,,
6,S Iyer,1000,321,,,
7,S Smith,8000,800,,,


In [140]:
# NOTE: Older pandas versions also allowed concatenation with the DataFrame method append() (so far we used the pandas function pd.concat). append() was deprecated in pandas 1.4 and removed in 2.0.
df_1

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
SRH,Dwarner,M Pandey,1.0,7,39
KKR,D Kartik,A Russel,2.0,8,38
RR,S Smith,A Rahane,1.0,8,39
CSK,MS Dhoni,A Jadeja,3.0,10,41
KXP,KL Rahul,M Agarwal,,11,42
MI,Rohit Sharma,K Pollard,5.0,12,38
DC,S Iyer,R Pant,2.0,12,40
RCB,Virat Kohli,AB de villers,,12,40


In [141]:
df_2

,Captain,ODI_RUNS,Catches
0,Virat Kohli,25000,840
1,Rohit Sharma,13000,490
2,MS Dhoni,12000,1280
3,KL Rahul,3000,390
4,D Kartik,3500,320
5,Dwarner,8000,790
6,S Iyer,1000,321
7,S Smith,8000,800


In [142]:
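# df_1.append(df_2) only works on pandas < 2.0; pd.concat([df_1, df_2]) gives the same result
pd.concat([df_1, df_2])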

,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team,ODI_RUNS,Catches
SRH,Dwarner,M Pandey,1.0,7.0,39.0,,
KKR,D Kartik,A Russel,2.0,8.0,38.0,,
RR,S Smith,A Rahane,1.0,8.0,39.0,,
CSK,MS Dhoni,A Jadeja,3.0,10.0,41.0,,
KXP,KL Rahul,M Agarwal,,11.0,42.0,,
MI,Rohit Sharma,K Pollard,5.0,12.0,38.0,,
DC,S Iyer,R Pant,2.0,12.0,40.0,,
RCB,Virat Kohli,AB de villers,,12.0,40.0,,
0,Virat Kohli,,,,,25000.0,840.0
1,Rohit Sharma,,,,,13000.0,490.0


Grouping and Summarizing Dataframes

GroupBy:

In [23]:
df = pd.read_csv("sample_book.csv")
df

,Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team,Type,Age
0,RCB,Virat Kohli,AB de villers,,12,40,IPL,30-35
1,MI,Rohit Sharma,K Pollard,5.0,12,38,IPL,30-35
2,CSK,MS Dhoni,A Jadeja,3.0,10,41,ODI,35-40
3,KXP,KL Rahul,M Agarwal,,11,42,ODI,25-30
4,KKR,D Kartik,A Russel,2.0,8,38,ODI,30-35
5,SRH,Dwarner,M Pandey,1.0,7,39,IPL,20-25
6,DC,S Iyer,R Pant,2.0,12,40,Domestic,20-25
7,RR,S Smith,A Rahane,1.0,8,39,Domestic,25-30


In [24]:
# groupby needs a column in which several records share the same value; here we group by "Type"
df_gp_ob = df.groupby("Type")
df_gp_ob["Total_Seasons"].sum()

Type
Domestic    20
IPL         31
ODI         29
Name: Total_Seasons, dtype: int64

In [25]:
# convert to Dataframe
pd.DataFrame(df_gp_ob["Total_Seasons"].sum())

Type,Total_Seasons
Domestic,20
IPL,31
ODI,29
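
Besides wrapping the result in `pd.DataFrame(...)`, a grouped Series can be turned into a table with `to_frame()` or `reset_index()`. A sketch on a made-up four-row frame (`matches` is a hypothetical name); the difference is whether the group key stays in the index or becomes a normal column.

```python
import pandas as pd

matches = pd.DataFrame(
    {"Type": ["IPL", "ODI", "IPL", "Domestic"], "Total_Seasons": [12, 10, 7, 8]}
)

as_series = matches.groupby("Type")["Total_Seasons"].sum()
as_frame = as_series.to_frame()     # Type stays as the index
flat = as_series.reset_index()      # Type becomes a regular column

print(flat)
```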


Aggregation : using group by object

In [26]:
# Aggregation operations: mean, median, min, max (usually describe() gives all of these at once, along with the 25%, 50% and 75% quantiles)

df_gp_ob["Total_Seasons"].mean()

Type
Domestic    10.000000
IPL         10.333333
ODI          9.666667
Name: Total_Seasons, dtype: float64

In [27]:
df_gp_ob["Total_Seasons"].median()

Type
Domestic    10
IPL         12
ODI         10
Name: Total_Seasons, dtype: int64

In [29]:
df_gp_ob["Total_Seasons"].min()

Type
Domestic    8
IPL         7
ODI         8
Name: Total_Seasons, dtype: int64

In [30]:
df_gp_ob["Total_Seasons"].max()

Type
Domestic    12
IPL         12
ODI         11
Name: Total_Seasons, dtype: int64

In [31]:
df_gp_ob["Total_Seasons"].describe()

Type,count,mean,std,min,25%,50%,75%,max
Domestic,2.0,10.0,2.828427,8.0,9.0,10.0,11.0,12.0
IPL,3.0,10.333333,2.886751,7.0,9.5,12.0,12.0,12.0
ODI,3.0,9.666667,1.527525,8.0,9.0,10.0,10.5,11.0
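
When only a few of these statistics are needed, `agg()` with a list of function names computes them in one call. A minimal sketch using the `Type` and `Total_Seasons` values from the table above (the frame name `teams` is an assumption):

```python
import pandas as pd

teams = pd.DataFrame(
    {
        "Type": ["IPL", "IPL", "ODI", "ODI", "ODI", "IPL", "Domestic", "Domestic"],
        "Total_Seasons": [12, 12, 10, 11, 8, 7, 12, 8],
    }
)

# One column per requested statistic, one row per group.
stats = teams.groupby("Type")["Total_Seasons"].agg(["min", "max", "mean"])

print(stats)
```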


In [32]:
# we can groupby using multiple columns
df

,Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team,Type,Age
0,RCB,Virat Kohli,AB de villers,,12,40,IPL,30-35
1,MI,Rohit Sharma,K Pollard,5.0,12,38,IPL,30-35
2,CSK,MS Dhoni,A Jadeja,3.0,10,41,ODI,35-40
3,KXP,KL Rahul,M Agarwal,,11,42,ODI,25-30
4,KKR,D Kartik,A Russel,2.0,8,38,ODI,30-35
5,SRH,Dwarner,M Pandey,1.0,7,39,IPL,20-25
6,DC,S Iyer,R Pant,2.0,12,40,Domestic,20-25
7,RR,S Smith,A Rahane,1.0,8,39,Domestic,25-30


In [34]:
df_gp_obj = df.groupby(['Total_Seasons', 'Age'])
# here, grouping is done on the first column, then on the second column within each group, and so on.
pd.DataFrame(df_gp_obj['Won_times'])

,0,1
0,"(7, 20-25)","5 1.0 Name: Won_times, dtype: float64"
1,"(8, 25-30)","7 1.0 Name: Won_times, dtype: float64"
2,"(8, 30-35)","4 2.0 Name: Won_times, dtype: float64"
3,"(10, 35-40)","2 3.0 Name: Won_times, dtype: float64"
4,"(11, 25-30)","3 NaN Name: Won_times, dtype: float64"
5,"(12, 20-25)","6 2.0 Name: Won_times, dtype: float64"
6,"(12, 30-35)","0 NaN 1 5.0 Name: Won_times, dtype: float64"
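
The output above lists each group key next to the raw sub-Series, because the grouped object was passed to `pd.DataFrame` without an aggregation. A more common pattern is to aggregate first, which gives one value per key pair. The sketch below does that on four rows taken from the table (`df_multi`, `wins` and `wins_table` are hypothetical names).

```python
import numpy as np
import pandas as pd

df_multi = pd.DataFrame(
    {
        "Total_Seasons": [12, 12, 12, 8],
        "Age": ["30-35", "30-35", "20-25", "25-30"],
        "Won_times": [np.nan, 5.0, 2.0, 1.0],
    }
)

grouped = df_multi.groupby(["Total_Seasons", "Age"])

# One aggregated value per (Total_Seasons, Age) pair; sum() skips NaN.
wins = grouped["Won_times"].sum()

# reset_index() turns the two grouping keys back into ordinary columns.
wins_table = wins.reset_index()

print(wins_table)
```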
