# Titanic Case Study
Perform the following on the dataset:
- Analyse the data, identify the columns that will not be part of further analysis, and give the reason for excluding them
- Find out the data type of each column
- Find out the number of entries in each column
- Find out which columns have the most missing values
- Drop the columns that need to be dropped
- Replace missing values in each column and give the reason for choosing the replacement function
- Find out the total number of Male/Female passengers
- Find out the total number of passengers in each passenger class
- Find out the total number of Survived/Not-survived passengers
- Find out the total number of passengers in various age groups (0-30, 31-60 and >60)


In [2]:
import numpy as np
import pandas as pd

In [8]:
df = pd.read_csv("titanic_dataset.csv")

In [9]:
df.head(2)

Unnamed: 0,pclass,survived,name,Gender,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"


# Analyse the data and drop the columns which will not be part of further data analysis

In [11]:
df.drop(["name","ticket","cabin","boat","body","home.dest"],axis=1,inplace=True)
# inplace=True modifies the original df instead of returning a new DataFrame

In [12]:
df.head(2)

Unnamed: 0,pclass,survived,Gender,age,sibsp,parch,fare,embarked
0,1,1,female,29.0,0,0,211.3375,S
1,1,1,male,0.9167,1,2,151.55,S
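
The notebook does not state why these columns were dropped. Two common reasons for excluding a column are that it is nearly unique per row (identifier-like, as `name` appears to be) or that it is mostly empty (as `body` is in the two rows shown). A minimal sketch of how both properties could be measured before dropping, using a small made-up frame (`toy` and its values are assumptions for illustration, not the Titanic file):

```python
import pandas as pd

# Toy frame standing in for the raw data (hypothetical values)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "cabin": ["B5", None, None, None],
    "pclass": [1, 1, 3, 3],
})

profile = pd.DataFrame({
    # share of rows where the column is null
    "missing_frac": toy.isnull().mean(),
    # distinct non-null values relative to the number of rows
    "unique_frac": toy.nunique() / len(toy),
})
print(profile)
```

A column with `unique_frac` close to 1 behaves like an identifier, and one with a high `missing_frac` carries little usable information; the cut-off for either is a judgment call.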


# Find out the number of entries in each column

In [13]:
df.age.size # total number of records in the column (nulls included)

1309

In [14]:
df.age.count() # number of records that have a value (nulls are ignored)

1046
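
The two calls above cover only `age`. To get the same figures for every column at once, `df.shape[0]` and `df.count()` are one option. A minimal sketch on a small made-up frame (`toy` and its values are assumptions for illustration, not the Titanic file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (hypothetical values)
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.3375, 151.55, np.nan, 7.25],
    "embarked": ["S", "S", None, "C"],
})

total_rows = toy.shape[0]   # rows including nulls, identical for every column
non_null = toy.count()      # non-null entries, one value per column
print(total_rows)
print(non_null)
```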

# Statistics of the dataset (count, mean, std, min, max, quantile values)

In [15]:
df.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0
mean,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479
std,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668
min,1.0,0.0,0.1667,0.0,0.0,0.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958
50%,3.0,0.0,28.0,0.0,0.0,14.4542
75%,3.0,1.0,39.0,1.0,0.0,31.275
max,3.0,1.0,80.0,8.0,9.0,512.3292


In [16]:
df.describe(include="all")

Unnamed: 0,pclass,survived,Gender,age,sibsp,parch,fare,embarked
count,1309.0,1309.0,1309,1046.0,1309.0,1309.0,1308.0,1307
unique,,,2,,,,,3
top,,,male,,,,,S
freq,,,843,,,,,914
mean,2.294882,0.381971,,29.881135,0.498854,0.385027,33.295479,
std,0.837836,0.486055,,14.4135,1.041658,0.86556,51.758668,
min,1.0,0.0,,0.1667,0.0,0.0,0.0,
25%,2.0,0.0,,21.0,0.0,0.0,7.8958,
50%,3.0,0.0,,28.0,0.0,0.0,14.4542,
75%,3.0,1.0,,39.0,1.0,0.0,31.275,
max,3.0,1.0,,80.0,8.0,9.0,512.3292,


# List columns which have missing values

In [34]:
df.isnull().any()

pclass      False
survived    False
Gender      False
age          True
sibsp       False
parch       False
fare         True
embarked     True
dtype: bool
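
`any()` only says whether a column has nulls, not how many. From the `count` row of `describe(include="all")`, `age` lacks 263 values (1309 - 1046), `embarked` 2 and `fare` 1. The same ranking can be obtained directly with `isnull().sum()`; a minimal sketch on a small made-up frame (`toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (hypothetical values)
toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3],
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.3375, 151.55, np.nan, 7.25],
})

# Null count per column, largest first
missing = toy.isnull().sum().sort_values(ascending=False)
print(missing)
```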

# Count the null and non-null records in the age column

In [18]:
df.age.isnull().value_counts()

False    1046
True      263
Name: age, dtype: int64
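
The task list asks for missing values to be replaced, a step the notebook does not show. One possible approach, offered as an assumption rather than the author's method: fill the numeric columns (`age`, `fare`) with the median, which is less affected by outliers than the mean (the `fare` mean of 33.30 sits well above its median of 14.45), and fill the categorical `embarked` with its most frequent value. A minimal sketch on a small made-up frame (`toy`, `filled` and the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (hypothetical values)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 40.0, 30.0],
    "fare": [7.25, 512.33, np.nan, 8.05],
    "embarked": ["S", None, "S", "C"],
})

filled = toy.copy()
# Numeric columns: median is robust to skewed values such as very high fares
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["fare"] = filled["fare"].fillna(filled["fare"].median())
# Categorical column: most frequent value (mode() returns a Series, take the first)
filled["embarked"] = filled["embarked"].fillna(filled["embarked"].mode()[0])
print(filled)
```

Working on a copy keeps the original frame available for comparing distributions before and after filling.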

# Find out the data type of each column

In [19]:
df.dtypes

pclass        int64
survived      int64
Gender       object
age         float64
sibsp         int64
parch         int64
fare        float64
embarked     object
dtype: object

# Find out the total number of Male/Female passengers

In [21]:
df.Gender.value_counts()

male      843
female    466
Name: Gender, dtype: int64

# Find out the total number of passengers in each passenger class

In [35]:
df.pclass.value_counts()

3    709
1    323
2    277
Name: pclass, dtype: int64

# Find out the total number of Survived/Not-survived passengers

In [41]:
df.survived.value_counts()

0    809
1    500
Name: survived, dtype: int64

# Find out the total number of passengers in various age groups (0-30, 31-60 and >60)

In [31]:
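# age is fractional: with >=31 the 2 records between 30 and 31 fall in no group
# (609 + 402 + 33 = 1044, not 1046), so the second group starts just above 30
group1 = df.age.where((df.age>=0) & (df.age<=30))
group2 = df.age.where((df.age>30) & (df.age<=60))
group3 = df.age.where(df.age>60)
print("Group 1 : ",group1.count())
print("Group 2 : ",group2.count())
print("Group 3 : ",group3.count())

Group 1 :  609
Group 2 :  404
Group 3 :  33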
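
Since `age` holds fractional values, the group boundaries need care. An alternative that assigns every non-null age to exactly one group is `pd.cut`. A minimal sketch on a small made-up series (`ages` and its values are assumptions for illustration, not the Titanic file):

```python
import numpy as np
import pandas as pd

# Toy ages standing in for df.age (hypothetical values, one null)
ages = pd.Series([0.9167, 29.0, 30.0, 30.5, 45.0, 60.0, 71.0, np.nan])

# Right-closed bins: [0, 30], (30, 60], (60, inf)
bins = [0, 30, 60, np.inf]
labels = ["0-30", "31-60", ">60"]
groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

# Nulls are left unassigned and are not counted
counts = groups.value_counts(sort=False)
print(counts)
```

Here an age of 30.5 lands in the "31-60" group rather than being skipped, and the counts add up to the number of non-null ages.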
