### Data Exploration and Preparation

#### The Titanic Case - Prediction on Survival

- The sinking of the Titanic is one of the most infamous shipwrecks in history

- Factors commonly associated with survival
    - Women, Children
    - Upper-class / Social status
    - ...

- In this first project, we analyze **what types of people were likely to survive**

- For more about the project, see https://www.kaggle.com/c/titanic

#### Data Exploration

- Before starting the analysis, it is good practice to inspect the data first

In [1]:
# import libraries used for data management
import numpy as np
import pandas as pd

In [2]:
# load datasets
train = pd.read_csv('train.csv',index_col='PassengerId')


In [3]:
# to view the whole training set
train

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [4]:
# to view the first 5 lines in the training set
train.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
# to view training set information
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [6]:
# to describe the numerical variables in the training set
train.describe()

,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [7]:
# to describe a categorical variable in the training set
train['Embarked'].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

In [8]:
# Change data type from numeric to categorical
train['Survived'] = train['Survived'].astype(str)

# change data type for 'Pclass'
train['Pclass'] = train['Pclass'].astype(str)
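
Converting to `str` is one way to mark a column as non-numeric. As a hedged alternative, pandas also has a dedicated categorical dtype, which can additionally record an ordering for a variable such as `Pclass`. The sketch below uses a small toy frame (`toy` and its values are made up for illustration, not taken from `train.csv`):

```python
import pandas as pd

# Toy stand-in for the training set (illustrative values only)
toy = pd.DataFrame({"Survived": [0, 1, 1, 0], "Pclass": [3, 1, 3, 2]})

# Unordered categorical: Survived is just a label
toy["Survived"] = toy["Survived"].astype("category")

# Ordered categorical: assume 3rd class < 2nd class < 1st class
pclass_type = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
toy["Pclass"] = toy["Pclass"].astype(pclass_type)

print(toy.dtypes)
print(toy["Pclass"].min())  # lowest class under the assumed ordering
```

Whether `str` or `category` is preferable depends on what the later steps expect; the rest of this notebook continues with the `str` version.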

In [9]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    object 
 1   Pclass    891 non-null    object 
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(2), object(7)
memory usage: 83.5+ KB


# Replace missing values

In [11]:
# show the rows where 'Age' is missing
train[train['Age'].isnull()]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...
860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S
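
Listing the affected rows works for one column at a time. To get an overview of all columns at once, the missing values can be counted per column. A minimal sketch on toy data (the frame `toy` is hypothetical and only mimics the three columns with gaps):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the same columns as the training set
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

missing_count = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()  # fraction of missing values per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("share", ascending=False))
```

A column with a very high share of missing values (as `Cabin` has in the `info()` output above, with 204 of 891 non-null) is a candidate for dropping rather than filling.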


In [None]:
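# replace missing values in "Age" with the mean
# (assign the result back; calling fillna with inplace=True on a selected column is unreliable in recent pandas)
train['Age'] = train['Age'].fillna(train['Age'].mean())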
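
The mean is only one possible fill value. When a column contains a few extreme values, the median is a commonly used alternative because it is less affected by them. The sketch below compares both on made-up ages (the series `ages` is illustrative, not the real column):

```python
import numpy as np
import pandas as pd

# Toy ages with two gaps and one large value
ages = pd.Series([22.0, 38.0, np.nan, 35.0, 80.0, np.nan], name="Age")

mean_filled = ages.fillna(ages.mean())      # gaps become the mean of the known ages
median_filled = ages.fillna(ages.median())  # gaps become the median of the known ages

print(mean_filled.tolist())
print(median_filled.tolist())
```

Here the single large value pulls the mean well above the median, so the two strategies fill the gaps with noticeably different ages.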


In [None]:
# replace missing values in "Embarked" with the most frequent value
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].value_counts().index[0])


In [None]:
# value counts for "Embarked" show the mode; .describe() reports it as well (top / freq)
train['Embarked'].value_counts()
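
Besides `value_counts()` and `.describe()`, the most frequent value can also be read off with `Series.mode()`. A small sketch on toy data (the series `embarked` is made up for illustration):

```python
import pandas as pd

# Toy port-of-embarkation column with one gap
embarked = pd.Series(["S", "C", None, "S", "Q"], name="Embarked")

# mode() returns a Series because ties are possible; take the first entry
most_frequent = embarked.mode()[0]
filled = embarked.fillna(most_frequent)

print(most_frequent)
print(filled.tolist())
```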

In [None]:
# drop "Cabin" column 
train = train.drop(columns='Cabin')


In [None]:
# Check the training set info again
train.info()



# Data Visualization
- visual representation of data
- to communicate information clearly and efficiently
- effective visualization helps users analyze and reason about data and evidence

In [None]:
# import libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# set the default figure size for the plots that follow
plt.rcParams['figure.figsize'] = (18, 6)

In [None]:
# boxplot for Age (pass the column explicitly as x)
ax = sns.boxplot(x=train['Age'])


In [None]:
# boxplot for Age Regarding Survived
bx = sns.boxplot(x = train['Survived'], y = train['Age'])
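
To read a boxplot, it helps to know how its parts are computed. By default the box spans the first to third quartile, and the whiskers reach to the most extreme points within 1.5 × IQR of the box; points beyond that are drawn individually. The sketch below reproduces that rule by hand on made-up ages (`ages` is illustrative, not the real column):

```python
import pandas as pd

# Toy ages (illustrative values only)
ages = pd.Series([2.0, 20.0, 24.0, 28.0, 30.0, 35.0, 38.0, 80.0], name="Age")

q1 = ages.quantile(0.25)  # lower edge of the box
q3 = ages.quantile(0.75)  # upper edge of the box
iqr = q3 - q1             # interquartile range

# Points outside these fences are the ones a default boxplot marks individually
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(outliers.tolist())
```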


In [None]:
# histogram for Age
train['Age'].hist()


In [None]:
# histogram for all numeric variables
train.hist(bins=10,figsize=(18,10),grid=False)


In [None]:
# Bar chart for Survived
train['Survived'].value_counts().plot(kind='bar')



In [None]:
# Bar chart as %
train['Survived'].value_counts(normalize=True).plot(kind='bar')


In [None]:
# show Survived as a proportion
train['Survived'].value_counts(normalize=True)
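
The overall proportion does not yet say which types of people survived. One way to break it down is a normalized cross-tabulation. The sketch below uses a toy frame (`toy` and its values are made up); `Survived` is stored as strings to match the conversion done earlier in this notebook:

```python
import pandas as pd

# Toy data (illustrative values only), Survived kept as str
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": ["0", "1", "1", "0", "0", "1"],
})

# normalize='index' makes each row sum to 1, giving the survival share per group
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(rates)
```

The same call with `toy["Pclass"]` in place of `toy["Sex"]` would give the breakdown by passenger class, assuming such a column is present.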


In [None]:
# histograms of Age, split by Sex (columns) and Survived (rows)
g = sns.FacetGrid(train, col='Sex', row='Survived', margin_titles=True)
g.map(plt.hist,'Age',color='blue')



In [None]:
# scatter plot of Age against Fare, one panel per Pclass, colored by Survived
g = sns.FacetGrid(train, hue='Survived', col='Pclass', margin_titles=True)
g = g.map(plt.scatter, 'Fare', 'Age', edgecolor='w').add_legend()


# Feature Engineering

- The process of using domain knowledge of the data to create features that make machine learning algorithms work
- Considered essential in applied machine learning / data analytics
- Difficult and expensive
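
As a small illustration of using domain knowledge, the two family-related columns `SibSp` and `Parch` could be combined into a single feature. The column names `FamilySize` and `IsAlone` below are hypothetical (they are not part of the dataset), and the toy values are made up:

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({"SibSp": [1, 1, 0, 0, 8], "Parch": [0, 0, 0, 0, 2]})

# Hypothetical feature: siblings/spouses + parents/children + the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Hypothetical flag: 1 if the passenger travels without family, else 0
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```

Whether such a feature actually helps has to be checked against the data; this is only a sketch of the idea.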

In [None]:
# Binning / Discretization

# give names for different age group
group_names = ['Young', 'Middle aged', 'Senior']



In [None]:
# divide Age into 3 equal-width intervals and give them the corresponding names

train['Age-binned']=pd.cut(train['Age'], 3 , labels=group_names)
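
With an integer as the second argument, `pd.cut` splits the observed range into intervals of equal width, so the groups can end up with very different numbers of passengers. The sketch below shows the computed edges on made-up ages and, for contrast, `pd.qcut`, which puts roughly the same number of values in each group (the series `ages` is illustrative):

```python
import pandas as pd

group_names = ['Young', 'Middle aged', 'Senior']

# Toy ages (illustrative values only)
ages = pd.Series([0.42, 22.0, 26.0, 35.0, 54.0, 80.0], name="Age")

# Equal-width bins; retbins=True also returns the bin edges
width_binned, edges = pd.cut(ages, 3, labels=group_names, retbins=True)

# Equal-frequency bins based on quantiles
freq_binned = pd.qcut(ages, 3, labels=group_names)

print(edges)
print(width_binned.value_counts())
print(freq_binned.value_counts())
```

Equal-width bins are easy to explain, while quantile bins give more balanced groups; which one fits better depends on the analysis.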


In [None]:
# View Age-binned in bar chart
train['Age-binned'].value_counts().plot(kind='bar')



In [None]:
# view the changes for "Age-binned"


In [None]:
# normalize Fare
# import library 

from sklearn import preprocessing


In [None]:
# Apply min-max normalization on a single attribute

minmax_scaler = preprocessing.MinMaxScaler().fit(train[['Fare']])
train['Fare_minmax']=minmax_scaler.transform(train[['Fare']])
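
Min-max normalization rescales each value to (x - min) / (max - min), so the smallest fare becomes 0 and the largest becomes 1. The sketch below applies the formula by hand to a few made-up fares (`fare` is illustrative, not the real column); with its default settings, `MinMaxScaler` should produce the same result:

```python
import pandas as pd

# Toy fares (illustrative values only)
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="Fare")

fare_min = fare.min()
fare_max = fare.max()

# (x - min) / (max - min): result lies in [0, 1]
fare_minmax = (fare - fare_min) / (fare_max - fare_min)

print(fare_minmax.round(4).tolist())
```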


In [None]:
# Show fare-minmax



In [None]:
# Apply z-score normalization (standardization) on a single attribute

zscore_scaler = preprocessing.StandardScaler().fit(train[['Fare']])
train['Fare_zscore']=zscore_scaler.transform(train[['Fare']])
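
Z-score normalization rescales each value to (x - mean) / std, giving a column with mean 0 and standard deviation 1. One detail worth knowing: `StandardScaler` uses the population standard deviation (`ddof=0`), while `Series.std()` in pandas defaults to the sample version (`ddof=1`). A sketch by hand on made-up fares (`fare` is illustrative):

```python
import pandas as pd

# Toy fares (illustrative values only)
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="Fare")

mu = fare.mean()
sigma = fare.std(ddof=0)   # population std, matching StandardScaler
sample_sigma = fare.std()  # pandas default (ddof=1), slightly larger

# (x - mean) / std
fare_zscore = (fare - mu) / sigma

print(fare_zscore.round(4).tolist())
print(sigma, sample_sigma)
```

On a column with hundreds of rows the difference between the two standard deviations is small, but it explains why a hand calculation with `.std()` does not match the scaler exactly.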


In [None]:
# get dummy variables for the categorical variable "Sex"

sexdummy =pd.get_dummies(train['Sex'])



In [None]:
sexdummy

In [None]:
# get dummy variables for "Embarked"
embarkeddummy = pd.get_dummies(train['Embarked'],prefix ='Embarked')

In [None]:
embarkeddummy
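
A full set of dummy columns is redundant: if a passenger did not embark at Q or S, the port must be C. Some models (for example linear regression with an intercept) are sensitive to this redundancy, and `get_dummies` has a `drop_first` option for it. A sketch on toy data (the series `embarked` is made up):

```python
import pandas as pd

# Toy port-of-embarkation column (illustrative values only)
embarked = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# One column per category
full = pd.get_dummies(embarked, prefix="Embarked", dtype=int)

# Drop the first category; it is implied when all remaining columns are 0
reduced = pd.get_dummies(embarked, prefix="Embarked", drop_first=True, dtype=int)

print(full)
print(reduced)
```

The notebook keeps all dummy columns in the steps that follow; dropping one is an option to consider depending on the model used later.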

In [None]:
# add the dummy variables to the data frame
trainwithdummy = pd.concat([train,sexdummy,embarkeddummy],axis=1,sort=True)


In [None]:
# drop the original columns that have been replaced by derived ones, plus identifiers
traindrop=trainwithdummy.drop(columns=['Name','Sex','Ticket','Embarked','Age'])

In [None]:
# check the updated data frame
traindrop.head()