
# Preparing The Dataset

Before we do anything, we need to import our modules.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

Now we can load our dataset: the train.csv portion of the Titanic dataset from https://www.kaggle.com/competitions/titanic/data.

In [None]:
titanic_data = pd.read_csv("https://raw.githubusercontent.com/aisutd/ML-Mondays/main/Week%202/titanic/train.csv")

In [None]:
titanic_data.head()

Some of our data wouldn't be useful for prediction. For instance, PassengerId, Name, and Ticket don't provide meaningful information for our machine learning models. For simplicity, we won't use Cabin either.

In [None]:
titanic_data = titanic_data.drop(['Name', 'PassengerId', 'Ticket', 'Cabin'], axis=1)

We also need to make sure our data is clean, meaning every row has a valid value in every column. There are many ways of doing this, but the simplest is to remove all rows that contain NaNs (missing values), which we wouldn't be able to train on as-is.

In [None]:
titanic_data = titanic_data.dropna()
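
Before dropping rows, it can help to see how much data is missing and where. The following is a minimal sketch on a hypothetical toy frame (`toy_df` and its values are made up for illustration, not taken from the Titanic data); the same calls should work on `titanic_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_data (values are made up)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
    "Embarked": ["S", "C", "S", None],
})

# Count the missing values in each column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# dropna() keeps only the rows that have no missing value in any column
cleaned = toy_df.dropna()
print(len(toy_df), "rows before,", len(cleaned), "rows after")
```

Comparing the row counts before and after shows how much data the simple approach throws away.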

The machine learning models that we will use take in numerical values, so we need to convert the categorical features, Sex and Embarked, to numbers.

In [None]:
titanic_data['Sex'] = titanic_data['Sex'].map({'female': 1, 'male': 0}).astype(int)
titanic_data['Embarked'] = titanic_data['Embarked'].map({'Q': 2, 'C': 1, 'S': 0}).astype(int)
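
Mapping Embarked to 0, 1, and 2 implies an ordering between ports that the data may not really have. One common alternative, not used in this workshop, is one-hot encoding with `pd.get_dummies`. A minimal sketch on a hypothetical toy column (`toy_df` is made up for illustration):

```python
import pandas as pd

# Hypothetical toy column standing in for titanic_data['Embarked']
toy_df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port, so no ordering between ports is implied
one_hot = pd.get_dummies(toy_df, columns=["Embarked"], dtype=int)
print(one_hot)
```

Each row then has a 1 in exactly one of the `Embarked_*` columns.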

We can also create new features that extract the most important parts of our data. Let's create a feature called IsAlone, which combines Parch (number of parents and children aboard) and SibSp (number of siblings and spouses aboard): it is 1 only if the passenger had no family members with them, and 0 otherwise.

In [None]:
titanic_data['FamilySize'] = titanic_data['SibSp'] + titanic_data['Parch'] + 1
titanic_data['IsAlone'] = 0
titanic_data.loc[titanic_data['FamilySize'] == 1, 'IsAlone'] = 1
titanic_data = titanic_data.drop(["SibSp", "Parch", "FamilySize"], axis=1)
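
To check the logic, here is a small sketch on a hypothetical toy frame (`toy_df` and its values are made up). It computes the same IsAlone flag in a single vectorized expression, which is an alternative phrasing rather than the workshop's own code.

```python
import pandas as pd

# Hypothetical passengers: (SibSp, Parch) pairs made up for illustration
toy_df = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# A passenger is alone when they have no siblings/spouses and no parents/children aboard
toy_df["IsAlone"] = ((toy_df["SibSp"] + toy_df["Parch"]) == 0).astype(int)
print(toy_df)
```

Only the second passenger, with zero in both columns, is flagged as alone.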

In [None]:
titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Using ML

Let's start actually analyzing this data with machine learning! First, we need to separate our inputs (X) from our output (y).

In [None]:
X = titanic_data.drop("Survived", axis=1)
y = titanic_data["Survived"]

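If we train and evaluate on the same data, our accuracy wouldn't tell us how well the model generalizes to passengers it hasn't seen. Instead, we need to split our full dataset into a training set and a test set. Here, 20% of the rows are held out for testing.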

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
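
As a quick sanity check on what the split does, the sketch below runs `train_test_split` on hypothetical toy data (`toy_X` and `toy_y` are made up). It also passes `stratify`, an optional argument the workshop code does not use, which keeps the class proportions similar in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, balanced binary labels
toy_X = pd.DataFrame({"feature": range(10)})
toy_y = pd.Series([0, 1] * 5)

# Hold out 20% of the rows; stratify keeps the label balance in each split
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=10, stratify=toy_y
)
print(len(X_tr), "training rows,", len(X_te), "test rows")
```

No row appears in both splits, which is what makes the test accuracy a fair estimate.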

From here, we can try a bunch of different algorithms to predict survival on our test set.

## Logistic Regression

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
acc_log = round(logreg.score(X_test, y_test) * 100, 2)
print("Test Accuracy = ", acc_log)

Test Accuracy =  78.32


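Logistic regression learns one coefficient per feature. These are not correlation coefficients; each one is the change in the log-odds of survival for a one-unit increase in that feature, with the other features held fixed. We can use them to interpret our model: a positive coefficient means the feature increases the odds of survival, while a negative coefficient decreases them.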

In [None]:
# Pair each input feature with its learned coefficient.
# The column is named "Correlation", but the values are log-odds weights, not correlations.
coeff_df = pd.DataFrame({"Feature": X.columns, "Correlation": logreg.coef_[0]})

coeff_df.sort_values(by='Correlation', ascending=False)

Unnamed: 0,Feature,Correlation
1,Sex,2.400595
5,IsAlone,0.137113
3,Fare,0.001221
2,Age,-0.033142
4,Embarked,-0.078758
0,Pclass,-1.326983


From our coefficients, we can see that being female was strongly associated with survival ("women and children first"). Conversely, higher ticket class numbers (that is, lower-class tickets) were associated with lower survival rates.
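
Coefficients are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios. The sketch below copies the values from the table above into a Series (the name `coeffs` is introduced here for illustration); in the notebook you could presumably use `logreg.coef_[0]` instead.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the output table above
coeffs = pd.Series({
    "Sex": 2.400595,
    "IsAlone": 0.137113,
    "Fare": 0.001221,
    "Age": -0.033142,
    "Embarked": -0.078758,
    "Pclass": -1.326983,
})

# exp(coefficient) is the factor by which the odds of survival are multiplied
# for a one-unit increase in the feature, holding the other features fixed
odds_ratios = np.exp(coeffs)
print(odds_ratios.round(2))
```

Under this model, being female multiplies the odds of survival by about 11, while each step up in Pclass number multiplies them by roughly 0.27.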

## Linear SVC

In [None]:
linear_svc = LinearSVC()
linear_svc.fit(X_train, y_train)
acc_linear_svc = round(linear_svc.score(X_test, y_test) * 100, 2)
print("Test Accuracy = ", acc_linear_svc)

Test Accuracy =  71.33




## SVC

In [None]:
svc = SVC()
svc.fit(X_train, y_train)
acc_svc = round(svc.score(X_test, y_test) * 100, 2)
print("Test Accuracy = ", acc_svc)

Test Accuracy =  66.43
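
Models such as SVC and KNeighbors depend on distances between rows, so a feature with a large range (like Fare) can dominate features with a small range (like Sex). Standardizing the features first often helps, though we have not verified that on this dataset. A minimal sketch on synthetic stand-in data from `make_classification` (not the Titanic data), with one feature deliberately inflated:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_demo[:, 0] *= 100  # exaggerate one feature's scale, similar in spirit to Fare

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

# The pipeline fits the scaler on the training split only, then trains SVC on the scaled features
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)
acc_scaled_svc = round(scaled_svc.score(X_te, y_te) * 100, 2)
print("Test Accuracy = ", acc_scaled_svc)
```

Putting the scaler inside the pipeline means the test set never influences the scaling statistics.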


## KNeighbors

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
acc_knn = round(knn.score(X_test, y_test) * 100, 2)
print("Test Accuracy = ", acc_knn)

Test Accuracy =  68.53


## Perceptron

In [None]:
perceptron = Perceptron()
perceptron.fit(X_train, y_train)
acc_perceptron = round(perceptron.score(X_test, y_test) * 100, 2)
print("Test Accuracy = ", acc_perceptron)

Test Accuracy =  68.53


## Decision Tree

In [None]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
acc_decision_tree = round(decision_tree.score(X_test, y_test) * 100, 2)
print("Test Accuracy = ", acc_decision_tree)

Test Accuracy =  76.22


## Random Forest

In [None]:
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)
acc_random_forest = round(random_forest.score(X_test, y_test) * 100, 2)
print("Test Accuracy = ", acc_random_forest)

Test Accuracy =  81.12
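
To compare the models side by side, we can collect the accuracies into one table. The sketch below hardcodes the values printed above so that it runs on its own; in the notebook you could use the `acc_*` variables instead. Note that the decision tree and random forest are randomized, so their numbers may differ between runs.

```python
import pandas as pd

# Test accuracies copied from the outputs above
results = pd.DataFrame({
    "Model": [
        "Logistic Regression", "Linear SVC", "SVC", "KNeighbors",
        "Perceptron", "Decision Tree", "Random Forest",
    ],
    "Test Accuracy": [78.32, 71.33, 66.43, 68.53, 68.53, 76.22, 81.12],
})

# Sort so the best-performing model comes first
results = results.sort_values(by="Test Accuracy", ascending=False).reset_index(drop=True)
print(results)
```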


# Future Work

We could probably do better in terms of test accuracy. There are a number of ways to get there.
*   Feature Engineering -> make better features, remove misleading features, etc.
*   Hyperparameter Tuning -> change the parameter values for our algorithms (for instance, change n_estimators, the number of trees, for RandomForestClassifier)
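
One way to search over parameter values is scikit-learn's `GridSearchCV`, which tries each combination with cross-validation. The sketch below uses synthetic stand-in data and a small, arbitrarily chosen grid over `n_estimators` (the number of trees in RandomForestClassifier) and `max_depth`; both the data and the grid values are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Candidate values to try; chosen arbitrarily for illustration
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(RandomForestClassifier(random_state=10), param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

In the notebook, you would fit the search on `X_train` and `y_train` and only score the chosen model on the test set at the end.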

If you want to try for yourself in a more competitive environment, check out the competition at Kaggle (https://www.kaggle.com/competitions/titanic).