### Data Exploration and Preparation

#### The Titanic Case - Prediction on Survival

- The sinking of the Titanic is one of the most infamous shipwrecks in history

- Factors associated with passenger survival
    - Women, Children
    - Upper-class / Social status
    - ...

- In this first project, we analyze **what types of people were likely to survive**

- To learn more about the project, see https://www.kaggle.com/c/titanic

#### Data Exploration

- Before starting the analysis, it is good practice to inspect the data first

In [None]:
# import libraries used for data management
import numpy as np
import pandas as pd

In [None]:
# load datasets
train = pd.read_csv('train.csv',index_col='PassengerId')


In [None]:
# to view the whole training set
train

In [None]:
# to view the first 5 lines in the training set
train.head()

In [None]:
# to view training set information
train.info()

In [None]:
# to describe the numerical variables in the training set
train.describe()

In [None]:
# to describe a categorical variable in the training set
train['Embarked'].describe()

In [None]:
# Change data type from numeric to categorical
train['Survived'] = train['Survived'].astype(str)

# change data type for 'Pclass'
train['Pclass'] = train['Pclass'].astype(str)

In [None]:
train.info()

#### Replace missing values

In [None]:
# show the rows where 'Embarked' is missing
# exercise: change the column name to show the missing values for 'Age'
train[train['Embarked'].isnull()]
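Before replacing anything, it can help to see how many values are missing in each column. The following is a minimal sketch on a made-up toy frame (the `toy` data and its values are assumptions for illustration, not rows from `train.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training set (values are made up)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# Number of missing values in every column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Rows that have at least one missing value in any column
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```

The same two calls should work unchanged on `train`.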

In [None]:
# replace missing values in "Age" with the mean
# (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
train['Age'] = train['Age'].fillna(train['Age'].mean())


In [None]:
# replace missing values in "Embarked" with the most frequent value
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].value_counts().index[0])
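The two replacement strategies used here (mean for a numeric column, most frequent value for a categorical column) can be checked on a small example. This is a sketch on made-up data; the `toy` frame is an assumption for illustration, and `.mode()[0]` is used as an equivalent way to get the most frequent value:

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (values are made up)
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0, 30.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# Numeric column: fill the gap with the column mean (30.0 for this data)
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())

# Categorical column: fill the gap with the most frequent value ('S' here)
most_frequent = toy['Embarked'].mode()[0]
toy['Embarked'] = toy['Embarked'].fillna(most_frequent)

print(toy)
```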


In [None]:
# value counts for "Embarked" to find the mode, you can use .describe() function as well
train['Embarked'].value_counts()

In [None]:
# drop "Cabin" column 
train = train.drop(columns='Cabin')


In [None]:
# Check the training set info again



#### Data Visualization
- Visual representation of data
- Communicates information clearly and efficiently
- Effective visualization helps users analyze and reason about data and evidence

In [None]:
# import libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure(figsize=(18,6))

In [None]:
# boxplot for Age (pass the column as x explicitly)
ax = sns.boxplot(x=train['Age'])


In [None]:
# boxplot for Age Regarding Survived
bx = sns.boxplot(x = train['Survived'], y = train['Age'])
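A box plot is a picture of a few summary numbers: the box spans the first and third quartiles, the line inside is the median, and with the default settings points more than 1.5 times the interquartile range beyond the box are drawn as outliers. A minimal sketch of those numbers on made-up ages (the `ages` values are an assumption for illustration):

```python
import pandas as pd

# Made-up ages, including one very young and one very old passenger
ages = pd.Series([2, 18, 22, 25, 28, 30, 35, 40, 80], name='Age')

q1 = ages.quantile(0.25)      # lower edge of the box
median = ages.quantile(0.50)  # line inside the box
q3 = ages.quantile(0.75)      # upper edge of the box
iqr = q3 - q1                 # interquartile range

# Points outside these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```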


In [None]:
# histogram for Age
train['Age'].hist()


In [None]:
# histogram for all numeric variables
train.hist(bins=10,figsize=(18,10),grid=False)


In [None]:
# Bar chart for Survived
train['Survived'].value_counts().plot(kind='bar')



In [None]:
# Bar chart as %
train['Survived'].value_counts(normalize=True).plot(kind='bar')


In [None]:
# show the share of each Survived value
train['Survived'].value_counts(normalize=True)


In [None]:
# histograms of Age, split by the categorical variables Sex and Survived
g = sns.FacetGrid(train, col='Sex', row='Survived', margin_titles=True)
g.map(plt.hist,'Age',color='blue')



In [None]:
# scatter plot of Fare vs. Age for each Pclass, colored by Survived
g = sns.FacetGrid(train, hue='Survived', col='Pclass', margin_titles=True)
g = g.map(plt.scatter, 'Fare', 'Age', edgecolor='w').add_legend()
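The faceted plots compare groups visually; the same comparison can also be tabulated. The sketch below uses `pd.crosstab` on a made-up frame (the `toy` data is an assumption for illustration, with `Survived` stored as strings as in this notebook) to get the share of each outcome within each group:

```python
import pandas as pd

# Made-up passengers; 'Survived' is a string label, as in the notebook
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male',
            'male', 'male', 'male', 'female'],
    'Survived': ['1', '1', '0', '0', '0', '1', '0', '1'],
})

# normalize='index' turns the counts in each row (each Sex) into proportions
rates = pd.crosstab(toy['Sex'], toy['Survived'], normalize='index')
print(rates)
```

On the real data, replacing `toy` with `train` should give the survival share per group.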


#### Feature Engineering

- The process of using domain knowledge of the data to create features that make machine learning algorithms work
- Considered essential in applied machine learning / data analytics
- Difficult and expensive

In [None]:
# Binning / Discretization

# give names for different age group
group_names = ['Young', 'Middle aged', 'Senior']



In [None]:
# divide Age into 3 equal-width intervals and give them the corresponding names

train['Age-binned']=pd.cut(train['Age'], 3 , labels=group_names)
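Passing an integer to `pd.cut` splits the value range into intervals of equal width, so the groups can end up with very different sizes when the data is skewed. The sketch below shows this on made-up ages and, for contrast, uses `pd.qcut`, which is not used in this notebook and is included only as an assumed alternative that makes the groups roughly equal in size:

```python
import pandas as pd

# Made-up, skewed ages: most values are below 30, one is 90
ages = pd.Series([1, 5, 20, 22, 24, 26, 28, 30, 90])
labels = ['Young', 'Middle aged', 'Senior']

# Equal-width binning: the range 1..90 is cut into 3 intervals of equal length
equal_width, edges = pd.cut(ages, 3, labels=labels, retbins=True)
print(edges)  # the 4 bin edges chosen by pandas
print(equal_width.value_counts())

# Equal-frequency binning: each bin receives about the same number of rows
equal_freq = pd.qcut(ages, 3, labels=labels)
print(equal_freq.value_counts())
```

With equal-width bins, almost every made-up passenger lands in 'Young'; whether that is acceptable depends on what the feature is meant to capture.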


In [None]:
# View Age-binned in bar chart
train['Age-binned'].value_counts().plot(kind='bar')



In [None]:
# view the changes for "Age-binned"


In [None]:
# normalize Fare
# import library 

from sklearn import preprocessing


In [None]:
# Apply min-max normalization on a single attribute

minmax_scaler = preprocessing.MinMaxScaler().fit(train[['Fare']])
train['Fare_minmax']=minmax_scaler.transform(train[['Fare']])
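Min-max normalization rescales a column to the range 0 to 1 using (x - min) / (max - min). The sketch below applies that formula by hand with pandas on made-up fares (the `fares` values are an assumption for illustration); with its default settings, `MinMaxScaler` should produce the same numbers:

```python
import pandas as pd

# Made-up fares
fares = pd.Series([0.0, 10.0, 25.0, 50.0], name='Fare')

# (x - min) / (max - min): the smallest fare maps to 0, the largest to 1
fare_minmax = (fares - fares.min()) / (fares.max() - fares.min())
print(fare_minmax.tolist())
```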


In [None]:
# Show fare-minmax



In [None]:
# Apply z-score normalization on a single attribute

zscore_scaler = preprocessing.StandardScaler().fit(train[['Fare']])
train['Fare_zscore']=zscore_scaler.transform(train[['Fare']])
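Z-score normalization subtracts the mean and divides by the standard deviation, so the result has mean 0 and standard deviation 1. The sketch below does this by hand on made-up fares (the `fares` values are an assumption for illustration). Note that `StandardScaler` divides by the population standard deviation, which corresponds to `ddof=0` in pandas, while `Series.std()` defaults to `ddof=1`:

```python
import pandas as pd

# Made-up fares with mean 5.0 and population standard deviation 2.0
fares = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], name='Fare')

mean = fares.mean()
std_population = fares.std(ddof=0)  # ddof=0 matches StandardScaler

fare_zscore = (fares - mean) / std_population
print(fare_zscore.tolist())
```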


In [None]:
# get dummy variables for the categorical variable "Sex"

sexdummy =pd.get_dummies(train['Sex'])



In [None]:
sexdummy

In [None]:
# get dummy variables for "Embarked"
embarkeddummy = pd.get_dummies(train['Embarked'],prefix ='Embarked')

In [None]:
embarkeddummy

In [None]:
# add the dummy variables to the data frame
trainwithdummy = pd.concat([train, sexdummy, embarkeddummy], axis=1, sort=True)
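The dummy-variable steps above can be traced end to end on a tiny example. The sketch below uses a made-up frame (the `toy` data is an assumption for illustration) and passes `dtype=int` so the output is 0/1 regardless of the pandas version; each row ends up with exactly one 1 across the new columns:

```python
import pandas as pd

# Made-up ports of embarkation
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One new 0/1 column per category, named Embarked_<value>
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int)

# Attach the dummy columns, then drop the original text column
encoded = pd.concat([toy, dummies], axis=1).drop(columns='Embarked')
print(encoded)
```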


In [None]:
# drop columns that are no longer needed from the data frame
traindrop = trainwithdummy.drop(columns=['Name', 'Sex', 'Ticket', 'Embarked', 'Age'])

In [None]:
# check the updated data frame
traindrop.head()