## Data Exploration using Pandas

#### How to install a package

In [None]:
!pip install pandas

#### How to import a package in Python

In [1]:
import math

In [2]:
math.sqrt(4)

2.0

In [3]:
import math as m

In [4]:
m.sqrt(4)

2.0

In [5]:
from math import sqrt

In [6]:
sqrt(4)

2.0

In [7]:
import pandas as pd

#### Series

A pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index. You can think of a Series as a single column in an Excel sheet.

In [None]:
a = [1,2,3,4]

In [8]:
s = pd.Series([1,3,5,7,9])
s

0    1
1    3
2    5
3    7
4    9
dtype: int64

In [9]:
s[2]

5

In [11]:
s = pd.Series([1,3,5,7,9], index = [1,1,2,3,4])
s

1    1
1    3
2    5
3    7
4    9
dtype: int64

In [12]:
s[1]

1    1
1    3
dtype: int64
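
The result above contains two rows because `s[1]` is looked up by label, and the label `1` appears twice in the index. A minimal sketch of the difference between label-based and position-based access, using the same Series (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 9], index=[1, 1, 2, 3, 4])

# .loc looks up by label: every row labelled 1 is returned, as a Series
by_label = s.loc[1]

# .iloc looks up by position: the element at position 1 (the second one)
by_position = s.iloc[1]

print(by_label.tolist())
print(by_position)
print(s.index.is_unique)  # the label 1 is repeated, so the index is not unique
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about which kind of lookup is meant when the index holds integers.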

#### DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or a SQL table.
![DataFrame](images/Pandas_df.png)

In [13]:
df = pd.DataFrame()

In [14]:
type(df)

pandas.core.frame.DataFrame

In [17]:
pd.DataFrame([1,2,3,4], columns = ['Numbers'])

Unnamed: 0,Numbers
0,1
1,2
2,3
3,4


In [18]:
# How to create dataframe from lists

df = pd.DataFrame(['a','b','c','e'])
df

Unnamed: 0,0
0,a
1,b
2,c
3,e


In [20]:
import numpy as np

In [25]:
# How to create a dataframe from a NumPy array

df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]).T, columns=['a', 'b', 'c'])
df2

Unnamed: 0,a,b,c
0,1,4,7
1,2,5,8
2,3,6,9


In [23]:
# How to create a dataframe from Dictionaries

df = pd.DataFrame({'Id': [1, 2, 3, 4],
                   'Name': ['A', 'B', 'C', 'D']})
df

Unnamed: 0,Id,Name
0,1,A
1,2,B
2,3,C
3,4,D


#### Reading files using pandas read functions

In [26]:
df = pd.read_csv('titanic.csv')

In [10]:
# How to check top rows of your data

df.head()

Unnamed: 0,PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked
1|0|3|Braund,Mr. Owen Harris|male|22.0|1|0|A/5 21171|7.25||S
2|1|1|Cumings,Mrs. John Bradley (Florence Briggs Thayer)|fe...
3|1|3|Heikkinen,Miss. Laina|female|26.0|0|0|STON/O2. 3101282|...
4|1|1|Futrelle,Mrs. Jacques Heath (Lily May Peel)|female|35....
5|0|3|Allen,Mr. William Henry|male|35.0|0|0|373450|8.05||S


In [11]:
# How to check bottom rows of your data

df.tail()

Unnamed: 0,PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked
887|0|2|Montvila,Rev. Juozas|male|27.0|0|0|211536|13.0||S
888|1|1|Graham,Miss. Margaret Edith|female|19.0|0|0|112053|3...
"889|0|3|""Johnston","Miss. Catherine Helen """"Carrie""""""|female||1|2..."
890|1|1|Behr,Mr. Karl Howell|male|26.0|0|0|111369|30.0|C148|C
891|0|3|Dooley,Mr. Patrick|male|32.0|0|0|370376|7.75||Q


In [12]:
# How to check number of rows and columns in your data

df.shape

(891, 1)
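
A shape of `(891, 1)` for a file that clearly has many fields is a hint that the delimiter was not recognised. The sketch below reproduces the effect on a small in-memory sample (the three-row text is made up for illustration and is not the Titanic file):

```python
import io
import pandas as pd

# Hypothetical pipe-delimited sample, standing in for a real file
raw = "PassengerId|Survived|Sex\n1|0|male\n2|1|female\n3|1|female\n"

# The default separator is ',', so each line is read as a single field
wrong = pd.read_csv(io.StringIO(raw))

# Passing the actual separator splits each line into columns
right = pd.read_csv(io.StringIO(raw), sep='|')

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```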

In [37]:
# Defining delimiter/separator other than ','

df = pd.read_csv('titanic.csv', sep = '|')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**Data Dictionary**

**PassengerId** - Id of the Passenger<br>
**Survived** - Survival (0 = No, 1 = Yes)<br>
**Pclass** - Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)<br>
**Name** - Name of the Passenger<br>
**Sex** - Sex<br>
**Age** - Age in years<br>
**SibSp** - # of siblings / spouses aboard the Titanic<br>
**Parch** - # of parents / children aboard the Titanic<br>
**Ticket** - Ticket number<br>
**Fare** - Passenger fare<br>
**Cabin** - Cabin number<br>
**Embarked** - Port of Embarkation

In [42]:
# How to change index

df.set_index("PassengerId", inplace =True)

In [43]:
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [44]:
# How to rename axis

df.rename_axis("ID", inplace=True)

In [46]:
df.head(3)

ID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [47]:
df.index

Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            882, 883, 884, 885, 886, 887, 888, 889, 890, 891],
           dtype='int64', name='ID', length=891)

In [48]:
# index name is optional. Another way to rename index
df.index.name = 'ID'
df.head()

ID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
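
If you later need the index back as an ordinary column, `reset_index` reverses `set_index`. A small sketch on a made-up frame (the data below is illustrative, not taken from the Titanic file):

```python
import pandas as pd

df_demo = pd.DataFrame({'PassengerId': [1, 2, 3],
                        'Sex': ['male', 'female', 'female']})

# Move a column into the index and give the index a new name
indexed = df_demo.set_index('PassengerId').rename_axis('ID')

# reset_index turns the index back into a regular column named 'ID'
restored = indexed.reset_index()

print(indexed.index.name)
print(list(restored.columns))
```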


In [49]:
# Defining index column

df = pd.read_csv('titanic.csv', sep = '|', index_col = "PassengerId") #index_col
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [52]:
# How to read Selected columns and specific number of rows

df = pd.read_csv('titanic.csv', sep = '|',usecols = ['PassengerId','Survived','Sex'], nrows = 100)
df.head()

Unnamed: 0,PassengerId,Survived,Sex
0,1,0,male
1,2,1,female
2,3,1,female
3,4,1,female
4,5,0,male


In [53]:
df.shape

(100, 3)

In [55]:
# How to skip reading the header (the first line is then treated as data)

df = pd.read_csv('titanic.csv', sep='|', header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


In [20]:
# skiprows and skipfooter
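
`skiprows` and `skipfooter` are `read_csv` parameters for ignoring lines at the top and bottom of a file. A minimal sketch, assuming a file with one junk line before the header and one summary line at the end (the sample text is hypothetical). `skipfooter` is not supported by the default C parser, so `engine='python'` is passed explicitly:

```python
import io
import pandas as pd

# Hypothetical export: one note line on top, one summary line at the bottom
raw = (
    "exported by a hypothetical tool\n"
    "PassengerId|Survived|Sex\n"
    "1|0|male\n"
    "2|1|female\n"
    "3|1|female\n"
    "TOTAL ROWS: 3\n"
)

df_demo = pd.read_csv(
    io.StringIO(raw),
    sep='|',
    skiprows=1,       # ignore the first line, so the header is the second line
    skipfooter=1,     # ignore the last line
    engine='python',  # required for skipfooter
)

print(df_demo.shape)
print(list(df_demo.columns))
```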

#### Exploring type and content of dataframe

In [56]:
df = pd.read_csv('titanic.csv', sep = '|')

In [57]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [58]:
df.shape

(891, 12)

In [59]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [68]:
# How to check quick statistics of the variables

df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [69]:
df.describe(include = 'object')

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Hendekovic, Mr. Ignjac",male,347082,C23 C25 C27,S
freq,1,577,7,4,644


In [70]:
df.describe(include = 'all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Hendekovic, Mr. Ignjac",male,,,,347082.0,,C23 C25 C27,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,
max,891.0,1.0,3.0,,,80.0,8.0,6.0,,512.3292,,


In [29]:
# How to check quick stats of Individual columns - Numerical and Categorical

#df['Age'].min(),df['Age'].max(), df['Age'].mean(), df['Age'].median()

In [75]:
# What is average age

df.Age.mean(), df.Age.max(), df.Age.min()

(29.69911764705882, 80.0, 0.42)

In [77]:
df['Age'].agg(['min','max','mean','median','count'])

min         0.420000
max        80.000000
mean       29.699118
median     28.000000
count     714.000000
Name: Age, dtype: float64

In [78]:
df['PassengerId'].agg(['min','max','mean','median','count'])

min         1.0
max       891.0
mean      446.0
median    446.0
count     891.0
Name: PassengerId, dtype: float64

In [79]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [32]:
# What is Survival rate

In [84]:
df['Survived'].unique()

array([0, 1], dtype=int64)

In [34]:
df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [86]:
# display percentages instead of row counts

df['Survived'].value_counts(normalize = True)

# Out of 891 people in the sample, 342 survived. That is a survival rate of about 38%.

0    0.616162
1    0.383838
Name: Survived, dtype: float64
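
Because `Survived` is coded as 0/1, the share of survivors is also the mean of the column. A quick sketch on a made-up five-row Series (not the Titanic data):

```python
import pandas as pd

survived = pd.Series([0, 1, 1, 0, 0], name='Survived')

# Share of rows with value 1, read off the normalized counts
rate_from_counts = survived.value_counts(normalize=True).loc[1]

# The mean of a 0/1 column gives the same share
rate_from_mean = survived.mean()

print(rate_from_counts, rate_from_mean)
```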

In [88]:
df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [89]:
# To check unique values in all the columns

df.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

In [90]:
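# To get mean of all numeric columns (numeric_only=True skips text columns, which recent pandas versions no longer drop silently)

df.mean(numeric_only=True)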

PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64

In [91]:
df.mean(axis=1, numeric_only=True)

0        4.892857
1       16.326186
2        5.846429
3       13.585714
4        7.292857
          ...    
886    132.714286
887    134.142857
888    153.075000
889    135.428571
890    133.392857
Length: 891, dtype: float64
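
The `axis` argument controls the direction of the calculation: `axis=0` (the default) gives one value per column, `axis=1` one value per row. A small sketch on a made-up frame; `numeric_only=True` is passed so that the text column is skipped explicitly:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'label': ['x', 'y']})

# One mean per numeric column
col_means = df_demo.mean(numeric_only=True)

# One mean per row, computed across the numeric columns only
row_means = df_demo.mean(axis=1, numeric_only=True)

print(col_means.to_dict())
print(row_means.tolist())
```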

In [94]:
# How to change the type of column

df['Pclass'] = df['Pclass'].astype('object')

In [95]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass          object
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [97]:
# How to access column names

df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [99]:
# How to rename column names

df.rename(columns = {'Embarked':'Embarked_renamed' }, inplace=True)

In [101]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [111]:
# How to drop columns

df.drop(columns = ['Cabin','Embarked_renamed']).head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925


In [114]:
# How to drop rows

df.drop(2, axis = 0).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
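
Note that `drop` returns a new DataFrame and leaves `df` itself unchanged, which is why the later cells still show every row and column. A sketch on a made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'Id': [1, 2, 3],
                        'Name': ['A', 'B', 'C'],
                        'Cabin': [None, 'C85', None]})

# Each call returns a modified copy
without_cabin = df_demo.drop(columns=['Cabin'])
without_row = df_demo.drop(1, axis=0)  # drops the row whose index label is 1

# The original frame still has all rows and columns
print(df_demo.shape)
print(without_cabin.shape)
print(without_row.index.tolist())
```

To make the change stick, assign the result back to a variable, for example `df_demo = df_demo.drop(columns=['Cabin'])`.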


#### Accessing rows, columns and cells using loc and iloc

With **loc** and **iloc** you can do practically any data selection operation on DataFrames you can think of. **loc** is label-based: you specify rows and columns by their row and column labels, and slices include both endpoints. **iloc** is integer-position based: you specify rows and columns by their integer position, and slices exclude the end position.
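
The difference is easiest to see when the row labels are not simply 0, 1, 2, and so on. A minimal sketch on a made-up frame whose index labels are 10, 20, 30, 40:

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                        'Age': [22, 38, 26, 35]},
                       index=[10, 20, 30, 40])

# loc: slice by label, both endpoints included (labels 10, 20 and 30)
by_label = df_demo.loc[10:30, 'Name']

# iloc: slice by position, end excluded (positions 0 and 1)
by_position = df_demo.iloc[0:2, 0]

print(by_label.tolist())
print(by_position.tolist())
```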

In [123]:
df.loc[0:5,:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [135]:
df.loc[0:10,'Pclass':'Parch'].head(3)

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0


In [136]:
df.loc[3, 'Name']

'Futrelle, Mrs. Jacques Heath (Lily May Peel)'

In [137]:
# Accessing specific columns using loc

df.loc[:,'Survived']

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

In [138]:
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [139]:
# rows 0 through 10 (inclusive), columns 'Survived' through 'Parch' (inclusive)

df.loc[0:10, 'Survived':'Parch']

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0
5,0,3,"Moran, Mr. James",male,,0,0
6,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0
7,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1
8,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2
9,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0
10,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1


In [140]:
df.shape

(891, 12)

In [141]:
# Accessing specific rows and columns using iloc

df.iloc[[0,1,3],0:5]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex
0,1,0,3,"Braund, Mr. Owen Harris",male
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female


#### Filtering for rows and columns

In [147]:
df[['Sex','PassengerId','Pclass']].head(3)

Unnamed: 0,Sex,PassengerId,Pclass
0,male,1,3
1,female,2,1
2,female,3,3


In [53]:
# How to select subset of columns

df[['PassengerId','Survived']].head()

Unnamed: 0,PassengerId,Survived
0,1,0
1,2,1
2,3,1
3,4,1
4,5,0


In [148]:
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [54]:
# How to subset for rows and columns

In [154]:
df.loc[df['Age'] > 30, ['Survived','Pclass']].head()

Unnamed: 0,Survived,Pclass
1,1,1
3,1,1
4,0,3
6,0,1
11,1,1


In [157]:
# Subset the data for Sex is male

df.loc[df.Sex == 'male']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5000,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [160]:
# Subset the dataframe for sex is male and Age is greater than 30

df.loc[(df.Sex == 'male') & (df.Age>30)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.2750,,S
20,21,0,2,"Fynney, Mr. Joseph J",male,35.0,0,0,239865,26.0000,,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0000,D56,S
...,...,...,...,...,...,...,...,...,...,...,...,...
867,868,0,1,"Roebling, Mr. Washington Augustus II",male,31.0,0,0,PC 17590,50.4958,A24,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
873,874,0,3,"Vander Cruyssen, Mr. Victor",male,47.0,0,0,345765,9.0000,,S
881,882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S


In [161]:
# Subsetting the dataframe for rows and columns

df.loc[(df.Sex == 'male') & (df.Age>30),['Survived','Pclass']]

Unnamed: 0,Survived,Pclass
4,0,3
6,0,1
13,0,3
20,0,2
21,1,2
...,...,...
867,0,1
872,0,1
873,0,3
881,0,3
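
Each condition in these filters produces a boolean Series, and the conditions are combined element-wise with `&` (and), `|` (or) and `~` (not). The parentheses are required because these operators bind more tightly than comparisons such as `==` and `>`. A sketch on a made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                        'Age': [35.0, 38.0, 22.0, 26.0]})

# Naming the masks keeps longer filters readable
is_male = df_demo['Sex'] == 'male'
over_30 = df_demo['Age'] > 30

both = df_demo.loc[is_male & over_30]    # male AND older than 30
either = df_demo.loc[is_male | over_30]  # male OR older than 30
not_male = df_demo.loc[~is_male]         # everyone who is not male

print(both.index.tolist())
print(either.index.tolist())
print(not_male.index.tolist())
```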


In [58]:
# Use of isin command

In [164]:
df['Embarked_renamed'].value_counts()

S    644
C    168
Q     77
Name: Embarked_renamed, dtype: int64

In [165]:
list_out = ['S','C']

In [168]:
df[df.Embarked_renamed.isin(list_out)].head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
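
`isin` is a compact way of writing several `==` comparisons joined with `|`, and `~` in front of it selects everything that is not in the list. A sketch on a made-up Series of port codes:

```python
import pandas as pd

ports = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')
list_out = ['S', 'C']

# True where the value is one of the listed codes
in_list = ports.isin(list_out)

# The same mask written out by hand
same_as_or = (ports == 'S') | (ports == 'C')

print(in_list.tolist())
print(ports[~in_list].tolist())  # values NOT in the list
```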


In [61]:
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [62]:
# Exercise Question - Select rows where Fare is less than or equal to 20 and Survived is 1

In [171]:
df.loc[(df.Fare <= 20) & (df.Survived == 1)].head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S


#### Sorting data in dataframe

In [173]:
df.sort_values(by = 'Age', ascending = False)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0000,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.7500,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S
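
In the output above the rows with a missing `Age` come last even though the sort is descending: `sort_values` places `NaN` at the end by default. The `na_position` parameter changes that. A sketch on a made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['A', 'B', 'C'],
                        'Age': [22.0, None, 80.0]})

# Default: missing values go to the end, whatever the sort direction
default_order = df_demo.sort_values(by='Age', ascending=False)

# na_position='first' moves them to the top instead
nan_first = df_demo.sort_values(by='Age', ascending=False, na_position='first')

print(default_order['Name'].tolist())
print(nan_first['Name'].tolist())
```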


In [63]:
# Sorting dataframe on one variable

df.sort_values(by  = 'Age', ascending = False).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q


In [64]:
# Select top 10 records with highest Fare

In [174]:
df.sort_values(by = 'Fare', ascending = False).head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
679,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
341,342,1,1,"Fortune, Miss. Alice Elizabeth",female,24.0,3,2,19950,263.0,C23 C25 C27,S
438,439,0,1,"Fortune, Mr. Mark",male,64.0,1,4,19950,263.0,C23 C25 C27,S
311,312,1,1,"Ryerson, Miss. Emily Borie",female,18.0,2,2,PC 17608,262.375,B57 B59 B63 B66,C
742,743,1,1,"Ryerson, Miss. Susan Parker ""Suzette""",female,21.0,2,2,PC 17608,262.375,B57 B59 B63 B66,C
118,119,0,1,"Baxter, Mr. Quigg Edmond",male,24.0,0,1,PC 17558,247.5208,B58 B60,C
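
For a "top N" question like this one, `nlargest` is a shorter alternative to sorting and then taking `head`. A sketch on a made-up frame (when values are tied, the two approaches may return the tied rows in a different order):

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                        'Fare': [7.25, 512.33, 53.1, 8.05]})

# Sort descending, then keep the first two rows
top2_sorted = df_demo.sort_values(by='Fare', ascending=False).head(2)

# Ask directly for the two rows with the largest Fare
top2_nlargest = df_demo.nlargest(2, 'Fare')

print(top2_sorted['Name'].tolist())
print(top2_nlargest['Name'].tolist())
```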


In [177]:
pd.options.display.max_rows = 1000
pd.options.display.max_columns = 100

In [179]:
# Sorting dataframe on multiple variables

df.sort_values(by = ['Pclass','Fare'], ascending = [True, False]).head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_renamed
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
679,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C
737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C


#### Adding, Subtracting, Concatenating columns

In [180]:
df['Age']+df['Fare']

0       29.2500
1      109.2833
2       33.9250
3       88.1000
4       43.0500
5           NaN
6      105.8625
7       23.0750
8       38.1333
9       44.0708
10      20.7000
11      84.5500
12      28.0500
13      70.2750
14      21.8542
15      71.0000
16      31.1250
17          NaN
18      49.0000
19          NaN
20      61.0000
21      47.0000
22      23.0292
23      63.5000
24      29.0750
25      69.3875
26          NaN
27     282.0000
28          NaN
29          NaN
30      67.7208
31          NaN
32          NaN
33      76.5000
34     110.1708
35      94.0000
36          NaN
37      29.0500
38      36.0000
39      25.2417
40      49.4750
41      48.0000
42          NaN
43      44.5792
44      26.8792
45          NaN
46          NaN
47          NaN
48          NaN
49      35.8000
50      46.6875
51      28.8000
52     125.7292
53      55.0000
54     126.9792
55          NaN
56      31.5000
57      35.7292
58      32.7500
59      57.9000
60      29.2292
61     118.0000
62     1
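
The `NaN` entries appear wherever `Age` is missing, because arithmetic with a missing value yields a missing value. If a missing value should count as zero instead, the `add` method accepts a `fill_value`. A sketch on two made-up Series; whether treating a missing age as 0 makes sense depends on the analysis:

```python
import pandas as pd

age = pd.Series([22.0, None, 54.0])
fare = pd.Series([7.25, 8.4583, 51.8625])

# The + operator propagates missing values
plain_sum = age + fare

# add() with fill_value substitutes 0 for a value missing on one side
filled_sum = age.add(fare, fill_value=0)

print(plain_sum.tolist())
print(filled_sum.tolist())
```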

In [182]:
# Concatenating String Columns

df['Name'] + ', ' + df['Sex']

0                          Braund, Mr. Owen Harris, male
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                         Heikkinen, Miss. Laina, female
3      Futrelle, Mrs. Jacques Heath (Lily May Peel), ...
4                         Allen, Mr. William Henry, male
5                                 Moran, Mr. James, male
6                          McCarthy, Mr. Timothy J, male
7                   Palsson, Master. Gosta Leonard, male
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Be...
9            Nasser, Mrs. Nicholas (Adele Achem), female
10               Sandstrom, Miss. Marguerite Rut, female
11                      Bonnell, Miss. Elizabeth, female
12                  Saundercock, Mr. William Henry, male
13                     Andersson, Mr. Anders Johan, male
14          Vestrom, Miss. Hulda Amanda Adolfina, female
15              Hewlett, Mrs. (Mary D Kingcome) , female
16                            Rice, Master. Eugene, male
17                    Williams,