# Using sklearn to Predict Titanic Survivors
Hi, I'm a new Kaggle user and a current undergraduate interested in data science and machine learning. In this kernel, I will build an ML model step by step with sklearn to predict whether each passenger aboard the Titanic survived. Please upvote and share if this helps! I'm always looking for suggestions and recommendations. Thank you!

# Contents
1. Importing Libraries and Packages
2. Loading and Viewing Data Set
3. Dealing with NaN Values
4. Plotting and Visualizing Data
5. Feature Engineering
6. Model Fitting and Predicting
7. Evaluating Model Performances
8. Submission

# 1. Importing Libraries and Packages
We will use these packages to help us manipulate the data and visualize the features/labels as well as measure how well our model performed.

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sb
from matplotlib import pyplot as plt
#display graphs inline in the notebook output (a comment on the same line as the magic is parsed as an argument)
%matplotlib inline

sb.set(style="whitegrid")

import warnings
warnings.filterwarnings('ignore')

# 2. Loading and Viewing Data Set
With pandas, we can load both the training and testing sets that we will later use to train and test our model. Before we begin, we should take a look at our data table to see the values that we'll be working with.

In [None]:
training = pd.read_csv('../input/train.csv')
testing = pd.read_csv('../input/test.csv')

In [None]:
training.head()

In [None]:
training.describe()

In [None]:
print(training.keys())
print(testing.keys())

# 3. Dealing with NaN Values
There are NaN values in our data set, most notably in the Age and Cabin columns. The Cabin column is mostly empty and the Ticket column is hard to turn into a useful feature, so we drop both. For the remaining gaps, we fill missing Age and Fare values with the column median and missing Embarked values with "S", the most common port, so that the model has a complete row for every passenger.

In [None]:
def null_table(training, testing):
    print("Training Data Frame")
    print(pd.isnull(training).sum()) 
    print(" ")
    print("Testing Data Frame")
    print(pd.isnull(testing).sum())

null_table(training, testing)

In [None]:
training.drop(labels = ['Cabin', 'Ticket'], axis = 1, inplace = True)
testing.drop(labels = ['Cabin', 'Ticket'], axis = 1, inplace = True)

null_table(training, testing)

In [None]:
#the median will be an acceptable value to place in the NaN cells
#assign the result back: fillna(inplace=True) on a selected column may not update the frame in newer pandas
training["Age"] = training["Age"].fillna(training["Age"].median())
testing["Age"] = testing["Age"].fillna(testing["Age"].median())
training["Embarked"] = training["Embarked"].fillna("S")
testing["Fare"] = testing["Fare"].fillna(testing["Fare"].median())

null_table(training, testing)
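
As a small illustration of the median fill, here is a sketch on a made-up frame. The `toy` DataFrame and its values are hypothetical and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only to illustrate the median fill
toy = pd.DataFrame({"Age": [22.0, np.nan, 30.0, 40.0, np.nan]})

# median() skips NaN values, so this is the median of 22, 30 and 40
age_median = toy["Age"].median()

# Assign the filled column back to the frame
toy["Age"] = toy["Age"].fillna(age_median)

print(toy["Age"].tolist())
print(toy["Age"].isnull().sum())
```

Because the median ignores missing cells, the filled value sits in the middle of the observed ages and is not pulled around by a few very old or very young passengers.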

# 4. Plotting and Visualizing Data
It is very important to understand and visualize any data we are going to use in a machine learning model. By visualizing, we can see the trends and general associations of variables like Sex and Age with survival rate. We can make a graph for each feature we want to work with to see how strongly it is associated with survival.

**Gender**

In [None]:
#can ignore the testing set for now
sb.barplot(x="Sex", y="Survived", data=training)
plt.title("Distribution of Survival based on Gender")
plt.show()

#survival rate within each gender: survivors of that gender / all passengers of that gender
female_survival_rate = training[training.Sex == "female"]["Survived"].mean()
male_survival_rate = training[training.Sex == "male"]["Survived"].mean()

print("Proportion of Females who survived:")
print(female_survival_rate)
print("Proportion of Males who survived:")
print(male_survival_rate)
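
A per-group survival rate is the mean of the 0/1 `Survived` column within each group. Here is a minimal sketch with `groupby`, using a hypothetical `toy` frame in place of the real training data:

```python
import pandas as pd

# Hypothetical toy data standing in for the Sex and Survived columns
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column within each group is that group's survival rate
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

In this toy frame, 2 of 3 females and 1 of 4 males survived, so the rates are about 0.67 and 0.25.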

Gender appears to be a very good feature to use to predict survival, as shown by the large difference in the proportion that survived. Let's take a look at how class plays a role in survival as well.

**Class**

In [None]:
sb.barplot(x="Pclass", y="Survived", data=training)
plt.ylabel("Survival Rate")
plt.title("Distribution of Survival Based on Class")

In [None]:
sb.barplot(x="Pclass", y="Survived", hue="Sex", data=training)
plt.ylabel("Survival Rate")
plt.title("Survival Rates Based on Gender and Class")

In [None]:
sb.barplot(x="Sex", y="Survived", hue="Pclass", data=training)
plt.ylabel("Survival Rate")
plt.title("Survival Rates Based on Gender and Class")

It appears that class also plays a role in survival, as shown by the bar graph. People in Pclass 1 were more likely to survive than people in the other 2 Pclasses.
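
The grouped bar charts can also be read as a table of survival rates by gender and class. Here is a sketch with `pivot_table` on a hypothetical `toy` frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the Sex, Pclass and Survived columns
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Pclass":   [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Rows are genders, columns are classes, cells are mean survival (the rate)
rates = toy.pivot_table(index="Sex", columns="Pclass",
                        values="Survived", aggfunc="mean")
print(rates)
```

Running the same call on the real `training` frame would give the numbers behind the bars.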

**Age**

In [None]:
survived_ages = training[training.Survived == 1]["Age"]
not_survived_ages = training[training.Survived == 0]["Age"]
plt.subplot(1, 2, 1)
sb.histplot(survived_ages, kde=False) #histplot replaces the deprecated distplot (seaborn 0.11 and later)
plt.axis([0, 100, 0, 100])
plt.title("Survived")
plt.ylabel("Count") #without a KDE the bars show raw counts, not proportions
plt.subplot(1, 2, 2)
sb.histplot(not_survived_ages, kde=False)
plt.axis([0, 100, 0, 100])
plt.title("Didn't Survive")
plt.show()

In [None]:
sb.stripplot(x="Survived", y="Age", data=training, jitter=True)

It appears that younger passengers were more likely to survive than older ones, as seen in the clustering in the strip plot and in the two age histograms.
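
One way to put a number on this impression is to bin ages and compare the survival rate per bin. The sketch below uses a hypothetical `toy` frame, and the bin edges and labels are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical toy data standing in for the Age and Survived columns
toy = pd.DataFrame({
    "Age":      [4, 9, 25, 31, 45, 52, 70],
    "Survived": [1, 1, 0, 1, 0, 0, 0],
})

# Arbitrary illustrative bins: (0, 12], (12, 40], (40, 100]
bins = [0, 12, 40, 100]
labels = ["child", "adult", "older"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Survival rate within each age group
rate_by_age = toy.groupby("AgeGroup", observed=True)["Survived"].mean()
print(rate_by_age)
```

On the real data, keep in mind that every missing age was filled with the median, so the bin containing the median holds many passengers whose true age is unknown.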

Finally, here is a pair plot that shows the pairwise relationships between all of the numeric features.

In [None]:
sb.pairplot(training)

# 5. Feature Engineering
Because values in columns like Sex and Embarked are categorical, we have to represent certain strings as numerical values in order to perform our classification with our model. 

In [None]:
training.sample(5)

In [None]:
testing.sample(5)

In [None]:
#map each category to an integer code; map() also leaves the columns with a numeric dtype
sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 0, "C": 1, "Q": 2}

training["Sex"] = training["Sex"].map(sex_codes)
training["Embarked"] = training["Embarked"].map(embarked_codes)

testing["Sex"] = testing["Sex"].map(sex_codes)
testing["Embarked"] = testing["Embarked"].map(embarked_codes)

In [None]:
training.sample(5)
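
Integer codes such as 0, 1, 2 for Embarked imply an ordering between ports that distance-based and linear models may treat as meaningful. A common alternative is one-hot encoding. This is a sketch with `pd.get_dummies` on a hypothetical `toy` frame, not the encoding used in this kernel:

```python
import pandas as pd

# Hypothetical toy data standing in for the Embarked column
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One 0/1 column per category; column names get the "Embarked_" prefix
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

Each row has exactly one 1 across the three new columns, so no port is treated as larger or smaller than another.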

We can combine SibSp and Parch into one synthetic feature called family size, which indicates the total number of family members on board for each passenger, including the passenger themselves (hence the + 1).

In [None]:
training["FamSize"] = training["SibSp"] + training["Parch"] + 1
testing["FamSize"] = testing["SibSp"] + testing["Parch"] + 1

In [None]:
#flag passengers travelling without family (a FamSize of 1 means only the passenger)
training["IsAlone"] = (training["FamSize"] == 1).astype(int)
testing["IsAlone"] = (testing["FamSize"] == 1).astype(int)
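
As a quick worked example of the two features, here is a sketch on three hypothetical passengers (the `toy` values are made up):

```python
import pandas as pd

# Hypothetical passengers: one with a sibling/spouse, one alone, one in a large family
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size counts siblings/spouses, parents/children, and the passenger
toy["FamSize"] = toy["SibSp"] + toy["Parch"] + 1

# A family size of 1 means the passenger travelled alone
toy["IsAlone"] = (toy["FamSize"] == 1).astype(int)
print(toy)
```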

In [None]:
#extract the title (the first word followed by a period) from each name
#str.extract works on the whole column at once, so no loop is needed
training['Title'] = training['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
testing['Title'] = testing['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
    
title_replacements = {'Mlle': 'Miss', 'Major': 'Mr', 'Col': 'Mr', 'Sir': 'Mr', 'Don': 'Mr', 'Mme': 'Miss',
          'Jonkheer': 'Mr', 'Lady': 'Mrs', 'Capt': 'Mr', 'Countess': 'Mrs', 'Ms': 'Miss', 'Dona': 'Mrs', 'Rev' : 'Mr', 'Dr' : 'Mr', 'Master': 'Mr'}

training.replace({'Title': title_replacements}, inplace=True)
testing.replace({'Title': title_replacements}, inplace=True)

training.loc[training["Title"] == "Miss", "Title"] = 0
training.loc[training["Title"] == "Mr", "Title"] = 1
training.loc[training["Title"] == "Mrs", "Title"] = 2

testing.loc[testing["Title"] == "Miss", "Title"] = 0
testing.loc[testing["Title"] == "Mr", "Title"] = 1
testing.loc[testing["Title"] == "Mrs", "Title"] = 2

In [None]:
print(set(training["Title"]))

In [None]:
training.sample(5)
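
To see what the title regex captures, here is a small sketch on a few names written in the data set's "Surname, Title. Given names" format. The `names` Series is a stand-in for the Name column:

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
])

# Capture the first run of letters that is immediately followed by a period
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series, which can be assigned directly to a new column.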

# 6. Model Fitting and Predicting
Now that our data has been processed and formatted properly, and we understand the data as well as its trends and associations, we can start to build our model. We can import different classifiers from sklearn.

**sklearn Models to Test**

In [None]:
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

In [None]:
from sklearn.metrics import accuracy_score #to evaluate how well model is predicting 

In [None]:
from sklearn.model_selection import GridSearchCV

**Defining Features in Training/Test Set**

In [None]:
features = ["Pclass", "Sex", "Age", "Embarked", "Fare", "FamSize", "IsAlone"]
X_train = training[features] #define training features set
y_train = training["Survived"] #define training label set
X_test = testing[features] #define testing features set
#we don't have y_test, that is what we're trying to predict with our model

**Validation Data Set**

Although we already have a test set, it has no labels, so we cannot use it to measure accuracy ourselves. It is also generally easy to overfit the training data with these classifiers. It is therefore useful to hold out part of the training data as a third data set, called the validation data set: a model that scores well on rows it never saw during fitting is less likely to be overfitting. We can make this third data set with sklearn's train_test_split function and use it to estimate the general accuracy of each model.

In [None]:
from sklearn.model_selection import train_test_split #to create validation data set

X_training, X_valid, y_training, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=0) #X_valid and y_valid are the validation sets
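
A single 80/20 split gives one accuracy number that depends on which rows land in the validation set. As a complementary sketch, k-fold cross-validation averages over several splits. The data here is synthetic (`make_classification`), and the choice of `LogisticRegression` with `max_iter=1000` and 5 folds is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 7 Titanic features and the Survived label
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

clf = LogisticRegression(max_iter=1000)

# Fit and score the model on 5 different train/validation splits
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

On the real data, `X_train` and `y_train` would take the place of `X` and `y`.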

**SVC Model**

In [None]:
svc_clf = SVC(kernel="linear") #we can try different parameters; gamma is ignored by the linear kernel, so it is omitted
svc_clf.fit(X_training, y_training)
pred_svc = svc_clf.predict(X_valid)
acc_svc = accuracy_score(y_valid, pred_svc)

print(acc_svc)

**LinearSVC Model**

In [None]:
linsvc_clf = LinearSVC()
linsvc_clf.fit(X_training, y_training)
pred_linsvc = linsvc_clf.predict(X_valid)
acc_linsvc = accuracy_score(y_valid, pred_linsvc)

print(acc_linsvc)

**RandomForest Model**

In [None]:
rf_clf = RandomForestClassifier(random_state=0) #fix the seed so the score is reproducible between runs
rf_clf.fit(X_training, y_training)
pred_rf = rf_clf.predict(X_valid)
acc_rf = accuracy_score(y_valid, pred_rf)

print(acc_rf)

**LogisticRegression Model**

In [None]:
logreg_clf = LogisticRegression()
logreg_clf.fit(X_training, y_training)
pred_logreg = logreg_clf.predict(X_valid)
acc_logreg = accuracy_score(y_valid, pred_logreg)

print(acc_logreg)

**KNeighbors Model**

In [None]:
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_training, y_training)
pred_knn = knn_clf.predict(X_valid)
acc_knn = accuracy_score(y_valid, pred_knn)

print(acc_knn)

**GaussianNB Model**

In [None]:
gnb_clf = GaussianNB()
gnb_clf.fit(X_training, y_training)
pred_gnb = gnb_clf.predict(X_valid)
acc_gnb = accuracy_score(y_valid, pred_gnb)

print(acc_gnb)

**DecisionTree Model**

In [None]:
dt_clf = DecisionTreeClassifier(random_state=0) #fix the seed so the score is reproducible between runs
dt_clf.fit(X_training, y_training)
pred_dt = dt_clf.predict(X_valid)
acc_dt = accuracy_score(y_valid, pred_dt)

print(acc_dt)
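
The fit, predict, and score steps are identical for every classifier, so they can also be written as one loop over a dictionary of models. This is a sketch on synthetic data with three of the classifiers. The names `models` and `accuracies` are hypothetical and do not appear elsewhere in this kernel:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features and labels
X, y = make_classification(n_samples=300, n_features=7, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model on the training split and score it on the validation split
accuracies = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    accuracies[name] = accuracy_score(y_va, model.predict(X_va))

for name, acc in accuracies.items():
    print(name, acc)
```

Adding another classifier then only requires one more dictionary entry.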

# 7. Evaluating Model Performances
After making so many models and predictions, we should evaluate and see which model performed the best and which model to use on our testing set.

In [None]:
model_performance = pd.DataFrame({
    'Model': ['SVC', 'Linear SVC', 'Random Forest', 
              'Logistic Regression', 'K Nearest Neighbors', 'Gaussian Naive Bayes',  
              'Decision Tree'],
    'Accuracy': [acc_svc, acc_linsvc, acc_rf, 
              acc_logreg, acc_knn, acc_gnb, acc_dt]
})

model_performance.sort_values(by='Accuracy', ascending=False)

It appears that the Random Forest model works the best on our validation set, so we will retrain it on the full training set and use it on the test set.
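
Accuracy alone does not show which kind of mistakes a model makes. A confusion matrix breaks the validation predictions down by true and predicted class. Here is a sketch on synthetic data. On the real data, `y_valid` and `pred_rf` would be passed instead:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features and labels
X, y = make_classification(n_samples=300, n_features=7, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_va, clf.predict(X_va))
print(cm)
```

The diagonal holds the correct predictions. The off-diagonal cells count passengers wrongly predicted to survive or wrongly predicted not to.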

In [None]:
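#refit on the full training set (training and validation rows together) before predicting on the test set
rf_clf = RandomForestClassifier(random_state=0)
rf_clf.fit(X_train, y_train)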
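
The `GridSearchCV` class imported earlier can be used to tune the forest instead of relying on its default parameters. Below is a sketch on synthetic data. The grid values are arbitrary assumptions, not settings tuned for the Titanic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features and labels
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# Arbitrary illustrative grid: 2 x 3 = 6 parameter combinations
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is a forest refit on all of the data passed to `fit`, so it can be used for prediction directly.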

# 8. Submission

In [None]:
submission_predictions = rf_clf.predict(X_test)

In [None]:
submission = pd.DataFrame({
        "PassengerId": testing["PassengerId"],
        "Survived": submission_predictions
    })

submission.to_csv("titanic.csv", index=False)
print(submission.shape)
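
Before uploading, it can help to sanity-check the file's structure. The helper below is a hypothetical sketch (`check_submission` is not part of any library) that tests the two columns written above, shown on a made-up three-row frame:

```python
import pandas as pd

def check_submission(df):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(df.columns) != ["PassengerId", "Survived"]:
        problems.append("columns must be exactly PassengerId, Survived")
        return problems
    if df.isnull().values.any():
        problems.append("contains missing values")
    if not df["PassengerId"].is_unique:
        problems.append("duplicate PassengerId values")
    if not df["Survived"].isin([0, 1]).all():
        problems.append("Survived must be 0 or 1")
    return problems

# Hypothetical rows, only to exercise the checks
toy_submission = pd.DataFrame({"PassengerId": [892, 893, 894],
                               "Survived": [0, 1, 0]})
print(check_submission(toy_submission))
```

Calling `check_submission(submission)` on the real frame would flag the same structural issues before the file is uploaded.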