# Titanic Survival Predictions - Machine Learning Exploration

Hi, I'm Giodio Mitaart, a Computer Science student at BINUS University. In this notebook, I explore the Titanic dataset to revisit the Machine Learning courses that I have studied. If you have any feedback, please write it here! Thank you :D

### The main parts:
1. Import important libraries
2. Read and explore the dataset
3. Data analysis
4. Visualization
5. Data cleansing
6. Testing models

## 1. Import important libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sb
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## 2. Read and explore the dataset

In [None]:
train_df = pd.read_csv('../input/titanic/train.csv')
test_df = pd.read_csv('../input/titanic/test.csv')

#quick look at the training data
train_df.describe() #add parameter include='all' to see more

## 3. Data analysis

In [None]:
#get the features list in the dataset
print(train_df.columns)

#look at a random sample to get an idea of the features
train_df.sample(5) #or we can use train_df.head()

In [None]:
#get insights using dtypes
train_df.dtypes

Some info that we gained:
* Age, Fare, SibSp, Parch (Numerical Features)
* Survived, Sex, Embarked, Pclass (Categorical Features)
* Ticket, Cabin (Alphanumeric Features)

The data type of each feature:
* Age: float
* Fare: float
* SibSp: int
* Parch: int
* Survived: int
* Sex: string
* Embarked: string
* Pclass: int
* Ticket: string
* Cabin: string

In [None]:
#quick look at the training dataset
train_df.describe(include = "all")

### Insights:
* There are 891 passengers
* Looking closely, there is a gap in the Age feature: (891-714)/891 = about 19.9% of its values are missing. We probably need to handle this, because in my opinion age is a significant factor in a passenger's chance of survival
* The Cabin feature is also missing about 77% of its values. Because the gap is so large, we will drop this column
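
The missing-value percentages above can also be computed directly instead of by hand. Here is a minimal sketch on a small made-up table (`toy_df` and its values are assumptions for illustration, not the real data); the same expression should work on `train_df`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real training set (values are made up)
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isnull() marks each missing cell as True; the mean of a boolean column
# is the share of True values, i.e. the share of missing values
missing_pct = toy_df.isnull().mean() * 100
print(missing_pct.round(1))
```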

In [None]:
#check the missing values
train_df.isnull().sum()

### Let's make some hypotheses:
1. Age: Young people are more likely to survive than old people
2. Sex: Females are more likely to survive than males
3. Pclass: People from a higher class are more likely to survive
4. Parch: People who travel alone are more likely to survive than people who travel with family

## 4. Visualization

### Age feature visualization
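
Before binning the real data, it may help to see how `pd.cut` assigns labels. This is a small sketch with made-up ages (the `ages` series is an assumption for illustration), using the same bins and labels as the cell that follows. By default each bin includes its right edge, so the placeholder -0.5 falls in (-1, 0] and becomes 'Unknown'.

```python
import numpy as np
import pandas as pd

# Made-up ages; -0.5 is the placeholder used for missing values
ages = pd.Series([-0.5, 0.8, 5.0, 12.5, 17.0, 30.0, 60.0, 71.0])

bins = [-1, 0, 5, 12, 17, 25, 40, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teen', 'Young', 'Young Adult', 'Adult', 'Old']

# Each interval is open on the left and closed on the right by default,
# e.g. (0, 5] is 'Baby', so an age of exactly 5.0 is still a 'Baby'
groups = pd.cut(ages, bins, labels=labels)
print(pd.DataFrame({'Age': ages, 'Group': groups}))
```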

In [None]:
#divide the ages into logical groups
train_df['Age'] = train_df['Age'].fillna(-0.5)
test_df['Age'] = test_df['Age'].fillna(-0.5)

bins = [-1, 0, 5, 12, 17, 25, 40, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teen', 'Young', 'Young Adult', 'Adult', 'Old']

train_df['Group'] = pd.cut(train_df['Age'], bins, labels = labels)
test_df['Group'] = pd.cut(test_df['Age'], bins, labels = labels)

sb.barplot(x='Group', y='Survived', data = train_df)
plt.show()

### Sex feature visualization

In [None]:
#show the bar plot of survival chance by sex
sb.barplot(x='Sex', y='Survived', data = train_df)

#show the percentage of female and male that survived
print('Female survived in percentage: ', train_df['Survived'][train_df['Sex'] == 'female'].value_counts(normalize=True)[1]*100)
print('Male survived in percentage: ', train_df['Survived'][train_df['Sex'] == 'male'].value_counts(normalize=True)[1]*100)

As hypothesized above, females have a higher chance of survival than males.
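
The same percentages can be obtained with a `groupby`. This is a minimal sketch on made-up rows (`toy_df` is an assumption for illustration). Because `Survived` is 0 or 1, the mean within each group is the survival rate, and unlike indexing `value_counts(normalize=True)[1]`, this does not raise an error when a group has no survivors.

```python
import pandas as pd

# Made-up passengers, only to illustrate the calculation
toy_df = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'Survived': [1, 1, 0, 0, 1, 0],
})

# Survived is 0/1, so the group mean is the share of survivors in that group
survival_rate = toy_df.groupby('Sex')['Survived'].mean() * 100
print(survival_rate.round(2))
```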

### Pclass feature visualization

In [None]:
#show the bar plot of survival chance by Pclass
sb.barplot(x='Pclass', y='Survived', data = train_df)

#show the percentage of people survived by Pclass
print('Pclass 1 survived percentage: ', train_df['Survived'][train_df['Pclass'] == 1].value_counts(normalize = True)[1]*100)
print('Pclass 2 survived percentage: ', train_df['Survived'][train_df['Pclass'] == 2].value_counts(normalize = True)[1]*100)
print('Pclass 3 survived percentage: ', train_df['Survived'][train_df['Pclass'] == 3].value_counts(normalize = True)[1]*100)

As hypothesized above, people from a higher class have a higher chance of survival than people from a lower class.

### Parch feature visualization

In [None]:
#show the bar plot for the parent with child survival
sb.barplot(x='Parch', y='Survived', data = train_df)
plt.show()

## 5. Data Cleansing

Clean our data for missing values and unwanted information.

### Quick look at the test data, and get some insights

In [None]:
test_df.describe(include="all")

We will drop the Cabin and Ticket columns because not much useful information can be gained from these features. We will also take a look at the Embarked feature.

### Drop Cabin

In [None]:
train_df = train_df.drop(['Cabin'], axis = 1)
test_df = test_df.drop(['Cabin'], axis = 1)

### Drop Ticket

In [None]:
train_df = train_df.drop(['Ticket'], axis = 1)
test_df = test_df.drop(['Ticket'], axis = 1)

### Embarked feature

In [None]:
#find insight from this feature
print('Total of people embarking in Southampton: ')
s = train_df[train_df['Embarked'] == 'S'].shape[0]
print(s)

print('Total of people embarking in Cherbourg: ')
c = train_df[train_df['Embarked'] == 'C'].shape[0]
print(c)

print('Total of people embarking in Queenstown: ')
q = train_df[train_df['Embarked'] == 'Q'].shape[0]
print(q)

From the information above, it's clear that the majority of people embarked in Southampton. So, we can assume the missing values are most likely S (Southampton) and fill them with it.
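
The reasoning above (count each port, then fill the gaps with the most common one) can be sketched on a toy column. The `embarked` series below is made up for illustration; on the real data the most common value would be looked up the same way rather than hard-coded.

```python
import numpy as np
import pandas as pd

# Made-up embarkation codes with two missing entries
embarked = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S', np.nan])

# value_counts() skips missing values by default
counts = embarked.value_counts()
most_common = counts.idxmax()

# Replace the missing entries with the most frequent port
filled = embarked.fillna(most_common)
print(counts)
print(filled.tolist())
```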

In [None]:
#fill the missing values in Embarked feature with S
train_df = train_df.fillna({'Embarked': 'S'})

### Age feature

In [None]:
#put the train and test datasets in a list so both can be processed in one loop
combine_df = [train_df, test_df]

#extract title for every name in the combined dataset
#how to extract data from string variable
#https://www.kaggle.com/questions-and-answers/141854
for dataset in combine_df:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

#cross tabulation is a method to quantitatively analyze the relationship between multiple variables.
pd.crosstab(train_df['Title'], train_df['Sex'])

### Name feature

After extracting the titles, we will drop this feature since it is no longer needed.
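
Before the Name column is dropped, the title extraction used earlier can be sanity-checked on a few names. This is a small sketch; the `names` series is an illustrative sample written in the dataset's "Surname, Title. Given names" layout, and the pattern is the one from the loop above, written as a raw string.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" layout
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
])

# The pattern matches a space, then captures a run of letters,
# then requires a literal dot right after them
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```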

In [None]:
train_df = train_df.drop(['Name'], axis = 1)
test_df = test_df.drop(['Name'], axis = 1)

### Sex feature to numerical values

In [None]:
#mapping sex type to numerical value
sex_mapping = {"male":0, "female":1}
train_df['Sex'] = train_df['Sex'].map(sex_mapping)
test_df['Sex'] = test_df['Sex'].map(sex_mapping)

In [None]:
#peek at the data after mapping to numerical values
train_df.head(5)

### Embarked feature to numerical values

In [None]:
#mapping embarked type to numerical value
embarked_mapping = {'S': 0, 'C': 1, 'Q': 2}
train_df['Embarked'] = train_df['Embarked'].map(embarked_mapping)
test_df['Embarked'] = test_df['Embarked'].map(embarked_mapping)

In [None]:
#drop group and title (temporary)
train_df = train_df.drop(['Group'], axis = 1)
test_df = test_df.drop(['Group'], axis = 1)

train_df = train_df.drop(['Title'], axis = 1)
test_df = test_df.drop(['Title'], axis = 1)

In [None]:
#peek at the data after mapping
train_df.head(5)

## 6. Testing Model

Let's hold out 22% of the training data (test_size = 0.22) to test the accuracy.
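
To get a feel for what the split produces, here is a minimal sketch on stand-in arrays with the same number of rows as the training set (891). `X` and `y` are made up for illustration; only `test_size = 0.22` and `random_state = 0` mirror the cell that follows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in features and labels with 891 rows (values are made up)
X = np.arange(891 * 2).reshape(891, 2)
y = np.arange(891) % 2

# A fixed random_state makes the shuffle, and so the split, reproducible
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.22, random_state=0)
print(len(x_tr), len(x_te))
```

The two parts always add up to the original number of rows, so every passenger ends up in exactly one of them.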

In [None]:
#import sklearn's train_test_split
from sklearn.model_selection import train_test_split

predict = train_df.drop(['Survived', 'PassengerId'], axis=1)
target = train_df['Survived']
x_train, x_test, y_train, y_test = train_test_split(predict, target, test_size = 0.22, random_state = 0)

### Logistic Regression

In [None]:
#logistic regression and accuracy score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
acc_lr = round(accuracy_score(y_test, y_pred) * 100, 2)
print(acc_lr)

### SVM

In [None]:
#SVM
from sklearn.svm import SVC

svc = SVC()
svc.fit(x_train, y_train)
y_pred = svc.predict(x_test)
acc_svc = round(accuracy_score(y_test, y_pred) * 100, 2)
print(acc_svc)

### Random Forest

In [None]:
#random forest
from sklearn.ensemble import RandomForestClassifier

randomforest = RandomForestClassifier()
randomforest.fit(x_train, y_train)
y_pred = randomforest.predict(x_test)
acc_randomforest = round(accuracy_score(y_test, y_pred) * 100, 2)
print(acc_randomforest)

So, what's the best model?

In [None]:
ml_models = pd.DataFrame({
    'ML_Model': ['Logistic Regression', 'Support Vector Machines', 'Random Forest'],
    'Score_Accuracy': [acc_lr, acc_svc, acc_randomforest]})

ml_models.sort_values(by='Score_Accuracy', ascending=False)
