# AIP2 Lab Session #5:
# Titanic: Machine Learning from Disaster!
### **2020.10.28**<br/>
---

<br/>

***Library***

- NumPy
- pandas
- Matplotlib
- scikit-learn

![Sto%CC%88wer_Titanic.jpg](attachment:Sto%CC%88wer_Titanic.jpg)

In [None]:
# Libraries used in this project.
# Run this cell before executing any of the cells below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

## 1. Data analysis
$\quad$Before applying preprocessing and machine learning algorithms, it is important to first understand how the given data is organized. The data presented by this Kaggle project contains the following information.

<br>
$\quad$Let's take a look at the description and code below to see how the above information is stored in the data and how the data is distributed.
<br>
<br>

In [None]:
#  Read Data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

#print(train_data[["Age"]].hist(bins=20))
#print(train_data[["Fare"]].hist(bins=50))

$\quad$First, we read the data. The data consists of train.csv for training and test.csv for testing. We use the read_csv function of the pandas library to read both files. The imported data is stored in train_data and test_data, respectively.
<br>

$\quad$We need to check how this information is stored in the data we just read. Let's run the code below to inspect a few rows of train.csv and test.csv.

In [None]:
# read first 5 of train_data
train_data.head(5)

In [None]:
# read last 5 of train_data
train_data.tail(5)

$\quad$Each row of train_data contains the following information: "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked". Here, the 11 pieces of information other than "Survived" are the features, and "Survived" is the label.
<br>
<br>

In [None]:
# read first 5 of test_data
test_data.head()

In [None]:
# read last 5 of test_data
test_data.tail()

$\quad$Each row of test_data contains the following information: "PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked". Unlike train_data, there is no "Survived" label, because test_data is the data used to verify the model.
<br>
<br>
$\quad$As far as we can see, "PassengerId" is simply an index attached for ordering, so it is not needed to judge whether a passenger survived. Because "Pclass", "Age", "SibSp", "Parch", and "Fare" are numeric, machine learning algorithms can be applied to them even without preprocessing. However, the other information, such as "Name", "Sex", "Ticket", "Cabin", and "Embarked", needs appropriate preprocessing, such as removing it or extracting new information from it, before vectorization.
<br>
<br>
<br>


$\quad$Now we look at the distribution of the given data, its relationship with survival ("Survived"), and how to preprocess the data.
<br>
<br>
$\quad$First, let's look at the distribution of the numeric information in the data. Execute the following code to check the distribution.

In [None]:
# Analyze numeric information of train_data
train_data.describe(percentiles=[0.25, 0.75])

$\quad$Note that count (number of non-missing values), mean, std (standard deviation), min (minimum), the 25th, 50th, and 75th percentiles, and max (maximum) are printed.
<br>
<br>
$\quad$For "count", the number of "PassengerId" is 891, so train_data contains a total of 891 people. For "Age", its count is 714, thus "Age" is not known for 177 passengers.
<br>
<br>
$\quad$Next is mean. The average survival rate is 38.4%, the average ticket class is 2.3, the average number of siblings and spouses aboard is 0.52, the average number of parents and children aboard is 0.38, and the average fare is 32.2.
<br>
<br>
$\quad$The next values are min, max, 25%, 50%, and 75%. These values allow us to understand the overall distribution and to make various interpretations. For example, the 75% values of "SibSp" and "Parch" are 1 and 0, respectively, so most passengers boarded alone or with very few family members. Also, 75% of the fares are at most 31, but the max is 512, so a few passengers paid far more than the others. If you want to know the value at other percentiles, such as 60% or 80%, change the `percentiles` argument in the code above, for example to [0.6, 0.8].
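<br>
<br>
$\quad$As a minimal sketch of changing the percentiles, the snippet below calls `describe` on a small toy table. The fare values are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Toy fares (hypothetical values, not the Kaggle data)
toy = pd.DataFrame({"Fare": [7.25, 8.05, 13.0, 26.0, 31.0, 71.28, 512.33]})

# Ask for the 60th and 80th percentiles instead of the quartiles.
# pandas always adds the median (50%) to the output as well.
summary = toy.describe(percentiles=[0.6, 0.8])
print(summary)
```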
<br>
<br>
$\quad$The distribution of other information, except for the numeric information, can be seen by running the following code:

In [None]:
# Analyze non-numeric information of train_data
train_data.describe(include=['O'])

$\quad$The code above shows the distribution of the non-numeric information "Name", "Sex", "Ticket", "Cabin", "Embarked". The output is "count" (number of non-missing values), "unique" (number of distinct values), "top" (the most frequent value), and "freq" (how often the top value occurs).
<br>
<br>
$\quad$For "count", train_data has a total of 891 rows, so most columns have 891 values. However, the counts for "Cabin" and "Embarked" are lower, which confirms that some of their values are unknown.
<br>
<br>
$\quad$For "unique", "Name" has the value 891, so everyone in train_data has a different name. "Sex" has the value 2, for male and female. "Ticket" and "Cabin" have 681 and 147 unique values, respectively, so some passengers share a ticket or cabin number. Finally, the Titanic had three ports of embarkation, so "Embarked" has the value 3.
<br>
<br>
$\quad$"top" is the most common value, and "freq" is how many times it occurs. Because "Name" is different for each person, its freq is 1 and its top value is not meaningful. For "Sex", the top value is male with a freq of 577, so there were 577 male and 314 female passengers. And if you look at "Embarked", you can see that most of the passengers (644) boarded at port "S" (Southampton).
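<br>
<br>
$\quad$The same facts can also be read off directly rather than from the `describe` table. The sketch below uses a small toy table (hypothetical values, not train.csv) to count missing entries per column and to find the most frequent port.

```python
import numpy as np
import pandas as pd

# Toy table with a few missing entries (hypothetical values)
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Cabin": ["C85", np.nan, np.nan, "E46"],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Number of missing values in each column
missing = toy.isnull().sum()
print(missing)

# Frequency of each port; missing values are ignored by default
embarked_counts = toy["Embarked"].value_counts()
print(embarked_counts)
```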
<br>
<br>
<br>
<br>
<br>
<br>
$\quad$Next, we examine how each feature is related to "Survived".

In [None]:
# Statistics between "Pclass" and "Survived"
train_data[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

First, the relationship between "Pclass" and "Survived". The results show that the higher the ticket class (1st class being the highest), the higher the survival rate.

In [None]:
# Statistics between "Sex" and "Survived"
train_data[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

The survival rate of women is much higher than that of men.

In [None]:
# Statistics between "SibSp" and "Survived"
train_data[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=True)

"SibSp" is the number of siblings and spouses aboard. In general, the lower this number, the higher the survival rate.

In [None]:
# Statistics between "Parch" and "Survived"
train_data[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=True)

"Parch" is the number of parents and children aboard. As with "SibSp" above, the survival rate tends to be higher when the number is lower.

In [None]:
# Statistics between "Embarked" and "Survived"
train_data[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

"Embarked" is the port of embarkation. The survival rate is much higher for port "C" than for the other ports, and is lowest for "S", where the most people boarded.

$\quad$"Pclass", "Sex" and "Embarked" are good features that can be directly related to "Survived". Note that "Sex" must be vectorized, and the few unknown values of "Embarked" must be filled in, before these features can be used for training.
<br>
<br>
$\quad$As we have seen above, "SibSp" and "Parch" tend to show a higher survival rate when the number is small, but it is difficult to find a direct association. Since both of them count family members, it is preferable to use a single feature, "FamilySize", which is the sum of the two values, rather than using both. We may then apply the following preprocessing: '0' if the number is small (less than or equal to 4) and '1' if it is large (more than 4).

## 2. Preprocessing

$\quad$First, we create a new feature called "FamilySize" by adding "SibSp" and "Parch".
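<br>
<br>
$\quad$The tutorial video contains the actual code. As a rough sketch of the idea only, the snippet below builds "FamilySize" on a toy table; the table `toy` and its values are assumptions made for illustration.

```python
import pandas as pd

# Toy table (hypothetical values, not train.csv)
toy = pd.DataFrame({
    "SibSp": [1, 0, 3, 4],
    "Parch": [0, 0, 2, 1],
    "Survived": [1, 0, 0, 0],
})

# "FamilySize" is the sum of the two family-related counts
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

# The two original columns are no longer needed
toy = toy.drop(columns=["SibSp", "Parch"])
print(toy.head())
```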

In [None]:
# Read Data
(Code in tutorial video)

# Add "SibSp" and "Parch" to make "FamilySize"
(Code in tutorial video)

# Delete "SibSp" and "Parch"
(Code in tutorial video)

# Print top 5 values
(Code in tutorial video)

As a result, "SibSp" and "Parch" are combined into one "FamilySize". Now let's look at the following code to see how this feature relates to "Survived".

In [None]:
# Statistics between "FamilySize" and "Survived"
(Code in tutorial video)

To simplify further, we change each value of "FamilySize" to "1" if it is bigger than 4 and "0" if it is smaller than or equal to 4.
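<br>
<br>
One possible way to express this threshold, shown on toy values rather than the real data, is a boolean comparison converted to integers. This is a sketch and not necessarily the code used in the video.

```python
import pandas as pd

# Toy family sizes (hypothetical values)
toy = pd.DataFrame({"FamilySize": [0, 1, 4, 5, 10]})

# 1 if the family size is bigger than 4, otherwise 0
toy["FamilySize"] = (toy["FamilySize"] > 4).astype(int)
print(toy)
```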

In [None]:
# Divide "FamilySize" into 0 and 1 based on 4
(Code in tutorial video)

# Print top 5 values
(Code in tutorial video)

In [None]:
# Statistics between "FamilySize" and "Survived"
(Code in tutorial video)


$\quad$"Name", "Ticket", and "Cabin" cannot be directly related to "Survived". Therefore, we may remove these three features.
<br>
<br>
$\quad$However, if you look at "Name", you find information such as 'Mr', 'Mrs', 'Capt', 'Master' and so on. We extract this information from "Name" and create a new feature "Title". After the extraction, "Name", "Ticket" and "Cabin" features are removed.
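<br>
<br>
$\quad$A common way to pull out the title is a regular expression that captures the word followed by a period. The sketch below assumes names follow the "Surname, Title. Given names" pattern; the toy rows and the exact pattern are illustrative assumptions, not the code from the video.

```python
import pandas as pd

# Toy rows written in the dataset's name format (ticket/cabin values are made up)
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
    ],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Cabin": [None, "C85", None, None],
})

# Capture the first run of letters that is preceded by a space
# and followed by a period, e.g. " Mr." -> "Mr"
toy["Title"] = toy["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Remove the features we no longer use
toy = toy.drop(columns=["Name", "Ticket", "Cabin"])
print(toy["Title"].value_counts())
```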

In [None]:
# Extract "Name" and create "Title"
(Code in tutorial video)

# Delete "Name", "Ticket", "Cabin" features
(Code in tutorial video)

# Print statistics of "Title"
(Code in tutorial video)

There are very few passengers with a "Title" other than "Master", "Mr", "Mrs", and "Miss". So we replace these rare titles with "Rare", and female titles such as "Ms", "Mlle", and "Mme" are merged into "Miss". After the substitution, we vectorize 'Mr' to '1', 'Miss' to '2', 'Mrs' to '3', 'Master' to '4', 'Rare' to '5', and set '0' if "Title" is unknown. This can be done in the code below.
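<br>
<br>
As a hedged sketch of these steps on toy data: the snippet merges the female variants into "Miss" first, then marks every remaining non-main title as "Rare", and finally maps titles to the integer codes listed above. The order of the steps and the helper names (`main_titles`, `title_codes`) are assumptions for illustration.

```python
import pandas as pd

# Toy titles, including one unknown value (hypothetical data)
toy = pd.DataFrame({"Title": ["Mr", "Mlle", "Dr", "Mrs", "Master", "Ms", None]})

# Unify 'Ms', 'Mlle', 'Mme' as 'Miss'
toy["Title"] = toy["Title"].replace(["Ms", "Mlle", "Mme"], "Miss")

# Any known title outside the four main ones becomes 'Rare'
main_titles = ["Mr", "Miss", "Mrs", "Master"]
is_rare = toy["Title"].notna() & ~toy["Title"].isin(main_titles)
toy.loc[is_rare, "Title"] = "Rare"

# Vectorize; titles that are unknown map to NaN and are then set to 0
title_codes = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
toy["Title"] = toy["Title"].map(title_codes).fillna(0).astype(int)
print(toy)
```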

In [None]:
# The rest except the main "Title" is classified as "Rare"
(Code in tutorial video)

# Unify 'Ms', 'Mlle', 'Mme' as 'Miss'
(Code in tutorial video)

# "Title" vectorization
(Code in tutorial video)

# Set '0' for unknown "Title"
(Code in tutorial video)

# Print top 5 values
(Code in tutorial video)


$\quad$For "Age", 177 values are missing, and we fill them in with the average age (about 30) of the remaining passengers.
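<br>
<br>
$\quad$A minimal sketch of mean imputation on toy tables follows. Reusing the training mean for the test table is an assumption made here; the video may handle the test data differently.

```python
import numpy as np
import pandas as pd

# Toy tables with missing ages (hypothetical values)
train = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 30.0]})
test = pd.DataFrame({"Age": [np.nan, 40.0]})

# mean() skips missing values, so this is the average of the known ages
mean_age = train["Age"].mean()

# Fill the gaps in both tables with the training mean (assumption)
train["Age"] = train["Age"].fillna(mean_age)
test["Age"] = test["Age"].fillna(mean_age)
print(train)
print(test)
```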


In [None]:
# Fill missing ages with the average age
(Code in tutorial video)

# DIY: Preprocessing
$\quad$(pause the video and complete this part!)
<br>
<br>
$\quad$"Sex" and "Embarked" directly affect "Survived". Therefore, these two features do not require any special preprocessing; we simply fill in the missing values and vectorize them.
<br>
<br>
$\quad$First, we fill in the two missing values of "Embarked". In this preprocessing, we assign these two missing "Embarked" values to port 'S', where the most passengers boarded. Then, we set 'female' and 'male' of "Sex" to '0' and '1', and 'Q', 'C', and 'S' of "Embarked" to '0', '1', and '2', respectively. You can do this in the code below.
<br>
<br>
$\quad$**Fill in the '???' part!**

In [None]:
# 1. Fill unknown "Embarked" to 'S'
# >> Which data do we want to edit?
# >> The unknown data is stored as NaN (missing values), and we want to replace it with 'S'.
# (Hint: The code is almost the same as the cell right above!!!)
preprocessing_train_data[???].replace(???)
preprocessing_test_data[???].replace(???)

# 2. "Embarked", "Sex" vectorization
# >> To apply Logistic regression or SVM,
#    we want to convert features that are in string format into numeric values.
# >> Let's replace 'female' --> 0, 'male' --> 1
preprocessing_train_data.replace(['female', 'male'], [0, 1], inplace=True)

# >> Now, replace 'Q' --> 0, 'C' --> 1, 'S' --> 2 in the same way!
preprocessing_train_data.replace(???)

# >> We also need to vectorize the test set.
# >> 'female' --> 0, 'male' --> 1
preprocessing_test_data.replace(???)
# >> 'Q' --> 0, 'C' --> 1, 'S' --> 2
preprocessing_test_data.replace(???)

# Print top 5 values
preprocessing_train_data[['Sex', 'Embarked']].head()

In [None]:
# Print result
preprocessing_train_data.head()

The result contains 7 features: "Title", "Pclass", "Sex", "Age", "Fare", "Embarked", and "FamilySize". Note that "Age" and "Fare" still need proper normalization. Standardize these two features, that is, subtract the mean and divide by the standard deviation.
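<br>
<br>
One way to standardize, sketched on toy tables, is to compute the mean and standard deviation on the training data and apply the same values to the test data. Reusing the training statistics for the test table is an assumption made here. Note that pandas `std()` uses the sample standard deviation (ddof=1), while scikit-learn's `StandardScaler` uses the population version, so the two give slightly different numbers.

```python
import pandas as pd

# Toy tables (hypothetical values)
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 60.0]})
test = pd.DataFrame({"Age": [30.0], "Fare": [30.0]})

for column in ["Age", "Fare"]:
    # Statistics come from the training data only
    mean = train[column].mean()
    std = train[column].std()
    train[column] = (train[column] - mean) / std
    test[column] = (test[column] - mean) / std

print(train)
print(test)
```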
<br>
<br>
<br>
<br>
<br>
<br>
<br>

## 3. Machine Learning
$\quad$Now we apply machine learning algorithms to the preprocessed data. In this project, we will use 'Logistic Regression' and 'SVM'.
<br>
<br>
$\quad$To use each algorithm, we first create the training data by removing "Survived" and the meaningless "PassengerId" from the preprocessed data, and extract only "Survived" to create the label data. We also remove "PassengerId" from the test data. We then apply each algorithm and use the resulting model to predict the test data. The code is shown below.
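<br>
<br>
$\quad$To make the workflow concrete without the real data, the sketch below runs the same steps with logistic regression on a synthetic table. The table, its random labels, and the names `X_train` and `Y_train` are assumptions for illustration; the video's code may differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic stand-in for the preprocessed training data (not the Kaggle data)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "PassengerId": np.arange(1, n + 1),
    "Sex": rng.integers(0, 2, n),      # 0 = female, 1 = male
    "Pclass": rng.integers(1, 4, n),
})
# Synthetic label: Sex == 0 survives with probability 0.8, otherwise 0.2
survive_prob = np.where(toy["Sex"] == 0, 0.8, 0.2)
toy["Survived"] = (rng.random(n) < survive_prob).astype(int)

# Features: everything except the label and the meaningless id
X_train = toy.drop(columns=["PassengerId", "Survived"])
# Labels: only "Survived"
Y_train = toy["Survived"]

classifier = LogisticRegression()
classifier.fit(X_train, Y_train)

# Mean accuracy on the training set, in percent
accuracy = classifier.score(X_train, Y_train) * 100

# Predicted probability of the positive class, used for the ROC curve and AUC
Y_train_pred = classifier.predict_proba(X_train)[:, 1]
FPR, TPR, thresholds = roc_curve(Y_train, Y_train_pred)
AUC = roc_auc_score(Y_train, Y_train_pred)

print("Accuracy: ", "{0:.2f}".format(accuracy))
print("Area Under the Curve: ", "{0:.2f}".format(AUC))
```

Plotting `FPR` against `TPR` with `plt.plot(FPR, TPR)` then draws the ROC curve.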

In [None]:
# Creating data for training
(Code in tutorial video)

In [None]:
classifier = (Code in tutorial video)

# Fit the classifier to the training data
(Code in tutorial video)

# Check the accuracy, AUC, and ROC curve of the classifier set above
(Code in tutorial video)

In [None]:
# Test data prediction
(Code in tutorial video)

# This is how the model predicted
(Code in tutorial video)

# DIY: SVM classifier
$\quad$Now, you can do the same thing with SVM. Apply SVM to the training data and check the result.
<br>
<br>
$\quad$**Fill in the '???' part!**

In [None]:
# 1. Create an SVM classifier from scikit-learn:
#    remember that we imported SVC from sklearn.svm!
# (Note: you need to set 'probability' argument to True)
classifier = ???

# >> Train the classifier with the training set we preprocessed.
# >> Use fit() method.
classifier.???(???)

# >> Check the model accuracy on training set
# >> You can get the mean accuracy of a given dataset and labels
#    with score() method
accuracy = classifier.???(???) * 100

# 2. Draw ROC curve on training set
# >> First, you need to compute the probability of each label for the training set.
#    (which means you need to make *model predictions* for training set)
# >> You can do this with predict_proba() method
Y_train_pred = classifier.???(???)[:, 1]

# >> Calculate false positive rates, true positive rates,
#    and area under the curve (AUC) with
#    ground truth labels & predicted probability
FPR, TPR, thresholds = roc_curve(???)
AUC = roc_auc_score(???)

# Plot ROC curve.
# (hint: use plt.plot() to plot FPR and TPR)
???(???)

print("Accuracy: ", "{0:.2f}".format(accuracy))
print("Area Under the Curve: ", "{0:.2f}".format(AUC))

In [None]:
# Test data prediction
predict = classifier.predict(???)
predict = np.round(predict)

# This is how the model predicted
result = test_data.copy()
result["PREDICTION"] = predict
result.head()