## Introduction to Pandas: Series and DataFrames

Series: like a one-dimensional array with an index, not restricted to numeric types. Built on top of NumPy, so element-wise operations on its values are fast.

DataFrame: like a two-dimensional array with row indices and column names. Its columns can hold different types.

In [10]:
# Create a Series using pd.Series(data); data can be a list or a NumPy array
import pandas as pd 
import numpy as np

m = pd.Series([1,2,3,4,5])

# print m
m

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [11]:
# Accessing elements of a Series: like a 1-D list
# print the value at a single index
m[1]

2

In [12]:
# print a range of indices (rows): all rows from index 1 onwards
m[1:]

1    2
2    3
3    4
4    5
dtype: int64

In [13]:
# We can also select arbitrary (non-contiguous) rows by index
# NOTE: m[1,3] will not work for selecting indices 1 and 3; pass a list such as [1,3] inside the outer []
m[[1,3]]


1    2
3    4
dtype: int64
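
Plain `[]` indexing on a Series with an integer index looks values up by label, which can be confusing once labels and positions differ. A minimal sketch, using a made-up Series `s` (not part of the notebook above), of the explicit `.loc` (label) and `.iloc` (position) accessors:

```python
import pandas as pd

# Hypothetical Series whose index labels are deliberately out of order.
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # value stored under label 0
by_position = s.iloc[0]  # value at the first position
subset = s.loc[[2, 1]]   # a list of labels selects several rows

print(by_label, by_position)
print(subset)
```

With the default 0..n-1 index used for `m`, labels and positions coincide, so the distinction only matters after the index has been changed or reordered.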

In [14]:
# In contrast to a NumPy array, a pandas Series has an apply() method that applies a function to each element.
# A NumPy array has no such method, so the call below raises an error.
np.arange(10).apply(lambda x:x+1)

AttributeError: 'numpy.ndarray' object has no attribute 'apply'

In [32]:
pd.Series([1,2,3,4]).apply(lambda x:x+1)

0    2
1    3
2    4
3    5
dtype: int64
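
For simple arithmetic like this, both NumPy arrays and Series also support element-wise operators directly, so `apply()` is not strictly needed. A small sketch (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

arr = np.arange(10)
s = pd.Series([1, 2, 3, 4])

arr_plus_one = arr + 1                # element-wise on the ndarray
s_plus_one = s + 1                    # element-wise on the Series
s_applied = s.apply(lambda x: x + 1)  # same values, one Python call per element

print(arr_plus_one)
print(s_plus_one.tolist(), s_applied.tolist())
```

`apply()` remains useful when the per-element logic cannot be written as a vectorised expression.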

DataFrames: Real-world data is usually stored in this format. Every row is an object (a record) and every column is an attribute.

In [None]:
Creating DataFrames: there are many ways
1) from Dictionary
2) from csv file
3) from json file
4) from text file
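
Options 3 and 4 are not demonstrated in this notebook, so here is a hedged sketch of what they might look like. The JSON and tab-delimited contents below are invented for illustration and are held in memory with `io.StringIO` so that the example runs without any files on disk; with real files, a path would be passed instead.

```python
import io
import pandas as pd

# Hypothetical in-memory "files" (assumed contents, not the notebook's data files).
json_text = '[{"Team": "MI", "Won_times": 5}, {"Team": "CSK", "Won_times": 3}]'
tsv_text = "Team\tWon_times\nMI\t5\nCSK\t3\n"

df_json = pd.read_json(io.StringIO(json_text))          # list of records -> one row each
df_text = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # tab-delimited text file

print(df_json)
print(df_text)
```

`read_csv` handles most delimited text files; the `sep` argument is set to whatever delimiter the file uses.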

In [38]:
# 1) from Dictionary:
df = pd.DataFrame({'Name':['Santoshkumar vagga', 'Suraj Chauhan', 'Satish Biradar', 'Uday Poddar'],
                    'Age':[25,26,24,27],
                    'Education':['M.Sc','M.D', 'B.E', 'M.E']})
df

,Name,Age,Education
0,Santoshkumar vagga,25,M.Sc
1,Suraj Chauhan,26,M.D
2,Satish Biradar,24,B.E
3,Uday Poddar,27,M.E


In [8]:
# 2) from a CSV file. NOTE: when exporting from a spreadsheet, save it as CSV (comma delimited)
df = pd.read_csv("sample_book.csv")
df

,Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
0,RCB,Virat Kohli,AB de villers,,12,40
1,MI,Rohit Sharma,K Pollard,5.0,12,38
2,CSK,MS Dhoni,A Jadeja,3.0,10,41
3,KXP,KL Rahul,M Agarwal,,11,42
4,KKR,D Kartik,A Russel,2.0,8,39
5,SRH,Dwarner,M Pandey,1.0,7,39
6,DC,S Iyer,R Pant,,12,40
7,RR,S Smith,A Rahane,1.0,8,38


Reading and Summarising Dataframes

In [9]:
# Print top 5 rows
df.head()

,Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
0,RCB,Virat Kohli,AB de villers,,12,40
1,MI,Rohit Sharma,K Pollard,5.0,12,38
2,CSK,MS Dhoni,A Jadeja,3.0,10,41
3,KXP,KL Rahul,M Agarwal,,11,42
4,KKR,D Kartik,A Russel,2.0,8,39


In [10]:
# print last 5 rows
df.tail()

,Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
3,KXP,KL Rahul,M Agarwal,,11,42
4,KKR,D Kartik,A Russel,2.0,8,39
5,SRH,Dwarner,M Pandey,1.0,7,39
6,DC,S Iyer,R Pant,,12,40
7,RR,S Smith,A Rahane,1.0,8,38


In [11]:
# show the data type and non-null count of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Team           8 non-null      object 
 1   Captain        8 non-null      object 
 2   Vice_Captain   8 non-null      object 
 3   Won_times      5 non-null      float64
 4   Total_Seasons  8 non-null      int64  
 5   Squad_team     8 non-null      int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 512.0+ bytes
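
The "Non-Null Count" column above shows that `Won_times` has missing entries. A minimal sketch of counting missing values per column directly, using a small made-up frame `df_demo` rather than the CSV file:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value, for illustration only.
df_demo = pd.DataFrame({
    "Team": ["RCB", "MI", "CSK"],
    "Won_times": [np.nan, 5.0, 3.0],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = df_demo.isna().sum()
print(missing_per_column)
```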


In [12]:
# to know total rows and columns
df.shape

(8, 6)

In [13]:
# get summary statistics (count, mean, std, min, quartiles, max) for each column; by default only numeric columns are included
df.describe()

,Won_times,Total_Seasons,Squad_team
count,5.0,8.0,8.0
mean,2.4,10.0,39.625
std,1.67332,2.070197,1.407886
min,1.0,7.0,38.0
25%,1.0,8.0,38.75
50%,2.0,10.5,39.5
75%,3.0,12.0,40.25
max,5.0,12.0,42.0
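
By default `describe()` skips the text columns. As a sketch, passing `include="all"` asks for every column to be summarised; the frame `df_demo` below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing a text column and a numeric column.
df_demo = pd.DataFrame({
    "Team": ["MI", "CSK", "RCB"],
    "Won_times": [5.0, 3.0, np.nan],
})

# include="all" adds rows such as unique/top/freq for non-numeric columns;
# statistics that do not apply to a column are reported as NaN.
summary = df_demo.describe(include="all")
print(summary)
```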


In [14]:
# get all column names of dataframe
df.columns

Index(['Team', 'Captain', 'Vice_Captain', 'Won_times', 'Total_Seasons',
       'Squad_team'],
      dtype='object')

In [15]:
# get the data as a 2-D NumPy array (one inner array per row)
df.values

array([['RCB', 'Virat Kohli', 'AB de villers', nan, 12, 40],
       ['MI', 'Rohit Sharma', 'K Pollard', 5.0, 12, 38],
       ['CSK', 'MS Dhoni', 'A Jadeja', 3.0, 10, 41],
       ['KXP', 'KL Rahul', 'M Agarwal', nan, 11, 42],
       ['KKR', 'D Kartik', 'A Russel', 2.0, 8, 39],
       ['SRH', 'Dwarner', 'M Pandey', 1.0, 7, 39],
       ['DC', 'S Iyer', 'R Pant', nan, 12, 40],
       ['RR', 'S Smith', 'A Rahane', 1.0, 8, 38]], dtype=object)
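
The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for this conversion. A short sketch on a made-up numeric frame `df_demo`:

```python
import pandas as pd

# Hypothetical all-integer frame, for illustration only.
df_demo = pd.DataFrame({"Total_Seasons": [12, 10], "Squad_team": [40, 41]})

arr = df_demo.to_numpy()  # 2-D ndarray, one row per DataFrame row
first_row = arr[0]        # the first row as a 1-D array

print(arr.shape)
print(first_row)
```

When all columns share one type, the resulting array has that type; with mixed columns, as in the output above, it falls back to `dtype=object`.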

Set custom index column: using set_index()

In [21]:
df = pd.read_csv("sample_book.csv")
df.set_index('Team', inplace=True)
df

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
RCB,Virat Kohli,AB de villers,,12,40
MI,Rohit Sharma,K Pollard,5.0,12,38
CSK,MS Dhoni,A Jadeja,3.0,10,41
KXP,KL Rahul,M Agarwal,,11,42
KKR,D Kartik,A Russel,2.0,8,39
SRH,Dwarner,M Pandey,1.0,7,39
DC,S Iyer,R Pant,,12,40
RR,S Smith,A Rahane,1.0,8,38
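
Once a column is the index, rows can be looked up by its labels with `.loc`, and `reset_index()` turns the index back into an ordinary column. A minimal sketch on a made-up two-row frame (here `set_index` is used without `inplace=True`, so it returns a new DataFrame):

```python
import pandas as pd

# Hypothetical two-row frame, for illustration only.
df_demo = pd.DataFrame({
    "Team": ["MI", "CSK"],
    "Captain": ["Rohit Sharma", "MS Dhoni"],
})

indexed = df_demo.set_index("Team")      # 'Team' becomes the row index
captain = indexed.loc["CSK", "Captain"]  # look up by row label and column name
restored = indexed.reset_index()         # 'Team' becomes a regular column again

print(captain)
print(restored.columns.tolist())
```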


Sorting Dataframes:


1) Sorting Index:

In [15]:
# 1) Sort Index: using sort_index() 
df = pd.read_csv("sample_book.csv")
df.set_index('Team', inplace=True)
df.sort_index(ascending=True, inplace = True)
df

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
CSK,MS Dhoni,A Jadeja,3.0,10,41
DC,S Iyer,R Pant,2.0,12,40
KKR,D Kartik,A Russel,2.0,8,38
KXP,KL Rahul,M Agarwal,,11,42
MI,Rohit Sharma,K Pollard,5.0,12,38
RCB,Virat Kohli,AB de villers,,12,40
RR,S Smith,A Rahane,1.0,8,39
SRH,Dwarner,M Pandey,1.0,7,39


2) Sorting Values: We can also sort by any custom column(s)

In [16]:
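# using sort_values(by, axis=0, ascending=True, inplace=True)
# Note: with axis=0 (the default), rows are reordered according to the values in the given column(s).
# With axis=1, columns are reordered according to the values in the given row label(s).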

In [19]:
df.sort_values('Squad_team', ascending=False, inplace=True)
df

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
KXP,KL Rahul,M Agarwal,,11,42
CSK,MS Dhoni,A Jadeja,3.0,10,41
DC,S Iyer,R Pant,2.0,12,40
RCB,Virat Kohli,AB de villers,,12,40
SRH,Dwarner,M Pandey,1.0,7,39
RR,S Smith,A Rahane,1.0,8,39
KKR,D Kartik,A Russel,2.0,8,38
MI,Rohit Sharma,K Pollard,5.0,12,38


In [21]:
# We can also sort by more than one column: rows are ordered by the first column given, and ties are broken by the second column.
# (This gives the same result as sorting by the second column first and then stably sorting by the first.)
df.sort_values(by=['Total_Seasons', 'Squad_team'], ascending=True, inplace=True)
df

Team,Captain,Vice_Captain,Won_times,Total_Seasons,Squad_team
SRH,Dwarner,M Pandey,1.0,7,39
KKR,D Kartik,A Russel,2.0,8,38
RR,S Smith,A Rahane,1.0,8,39
CSK,MS Dhoni,A Jadeja,3.0,10,41
KXP,KL Rahul,M Agarwal,,11,42
MI,Rohit Sharma,K Pollard,5.0,12,38
DC,S Iyer,R Pant,2.0,12,40
RCB,Virat Kohli,AB de villers,,12,40
