# Titanic

In [None]:
# Make sure pandas and sklearn are installed!

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

We will look at the Titanic problem, in which we are given data on the passengers aboard the Titanic.

Our goal: build a model that accurately predicts which passengers survived the Titanic disaster.

First, let’s create DataFrames from our data. The data is split into two parts: the training data, which we use to build our model, and the test data, which we use to test our final results.


In [None]:
train = pd.read_csv("data/titanic_train.csv")
test = pd.read_csv("data/titanic_test.csv")

Let's take a quick peek at our data.

In [None]:
train.head()

In [None]:
# Meaning of Variables
from IPython.display import Image
Image("data/data_dictionary.png")

# Preprocessing

We will drop complicated features and convert string columns to numeric ones via one-hot encoding.
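
As a minimal sketch of what one-hot encoding does, here is `pd.get_dummies` applied to two toy frames (the values are made up and are not from the Titanic data). It also shows one possible way to guard against train and test ending up with different columns when a category is missing from one of them. The `reindex` step is an assumption on our part and is not used in the preprocessing cell of this notebook.

```python
import pandas as pd

# Toy frames standing in for train/test; the values are made up.
toy_train = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Age": [30.0, 41.0], "Embarked": ["S", "S"]})

# Each string column becomes one indicator column per category.
enc_train = pd.get_dummies(toy_train)
enc_test = pd.get_dummies(toy_test)

# The toy test frame only contains "S", so it lacks Embarked_C and Embarked_Q.
# Reindexing to the training columns adds them, filled with 0.
enc_test = enc_test.reindex(columns=enc_train.columns, fill_value=0)

print(list(enc_train.columns))
print(list(enc_test.columns))
```

After the `reindex`, both frames have the same columns in the same order, which is what a fitted model expects at prediction time.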

In [None]:
print(f"Shape of train data frame before preprocessing: {train.shape}")

train.drop(["Name","Ticket","Cabin"],axis=1,inplace=True) # Drop Name, Ticket and Cabin
test.drop(["Name","Ticket","Cabin"],axis=1,inplace=True) # Drop Name, Ticket and Cabin

# Impute missing numeric values with the column means of the training data
train_means = train.mean(numeric_only=True) # numeric_only skips string columns, which newer pandas cannot average
train.fillna(train_means,inplace=True)
test.fillna(train_means,inplace=True)

train = pd.get_dummies(train,dummy_na=True) # One Hot Encode Features
test = pd.get_dummies(test,dummy_na=True) # One Hot Encode Features

print(f"Shape of train data frame after preprocessing: {train.shape}")

Next, we will split our training data into training and validation sets.
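
By default `train_test_split` holds out 25% of the rows and, unless a seed is fixed, shuffles differently on each run, so the validation accuracy will vary between runs. The sketch below, on made-up arrays, shows how the optional `random_state` and `stratify` arguments make the split repeatable and keep the class balance the same in both parts. Whether to use them here is a judgment call; the split cell in this notebook keeps the defaults.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, 60% class 0 and 40% class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

# random_state fixes the shuffle; stratify=y preserves the class ratio.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_tr.shape, X_va.shape)   # sizes of the two parts
print(y_tr.mean(), y_va.mean()) # share of class 1 in each part
```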

In [None]:
# Create target 
target = train["Survived"]

# Drop Target from train set
del train["Survived"]

# Make Splits
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(train.drop(["PassengerId"],axis=1),target)

# Model Fitting

Now that we have split up our data, let's build random forest and AdaBoost classifiers for it.

We will fit the models on the training split and evaluate them on the validation split.

## Random Forest

In [None]:
# Fit Random Forest 
rf = RandomForestClassifier(n_estimators = 500) # Create Random Forest Object
rf.fit(X_train,y_train) # Fit Random Forest object
y_hat_rf = rf.predict(X_valid) # Predict on Validation Set
print(f"The accuracy of random forest is {rf.score(X_valid,y_valid)}!")

## AdaBoost

In [None]:
# Fit AdaBoost
adb = AdaBoostClassifier(n_estimators=30) # Create Adaboost Object
adb.fit(X_train,y_train) # Fit Adaboost object
y_hat_adb = adb.predict(X_valid) # Predict on Validation Set
print(f"The accuracy of adaboost is {adb.score(X_valid,y_valid)}!")

Wow, both models have done a pretty good job of predicting whether or not a passenger survived, judging by the validation accuracies printed above!
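
An accuracy number is easier to judge next to a simple baseline. One common choice, sketched here on made-up labels rather than the Titanic data, is scikit-learn's `DummyClassifier`, which always predicts the most frequent class in the training labels. In this notebook the same comparison would use `X_train`, `y_train`, `X_valid` and `y_valid`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels; the features are ignored by the dummy model.
toy_y_train = np.array([0] * 6 + [1] * 4)
toy_y_valid = np.array([0, 0, 0, 1, 1])
toy_X_train = np.zeros((10, 1))
toy_X_valid = np.zeros((5, 1))

# Always predict the majority class seen during fit (class 0 here).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X_train, toy_y_train)
baseline_acc = baseline.score(toy_X_valid, toy_y_valid)

print(f"Majority-class baseline accuracy: {baseline_acc}")
```

A model is only doing useful work to the extent that its validation accuracy exceeds this baseline.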

# Feature Importance

Feature importance lets us look under the hood of our models.

We compare which features each model thought were important.
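
To see what `feature_importances_` contains, here is a small sketch on synthetic data (an assumption for illustration, not the Titanic features). The label depends only on the `signal` column, so we would expect the forest to rank it well above `noise`. For a random forest these impurity-based importances are normalized to sum to 1.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label is determined by "signal"; "noise" is unrelated.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
toy_y = (toy_X["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(toy_X, toy_y)

# One importance value per input column, in the order of the columns.
importances = pd.Series(forest.feature_importances_, index=toy_X.columns)
print(importances.sort_values(ascending=False))
```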

In [None]:
def plot_importance(obj,columns=X_train.columns):
    model_type = type(obj).__name__ # Get Model Type Name, e.g. "RandomForestClassifier"
    importances = pd.Series(obj.feature_importances_,index=columns).sort_values(ascending=True)
    importances.plot(kind="barh",title=model_type+" Importance")



In [None]:
%matplotlib inline
plot_importance(rf)

In [None]:
plot_importance(adb)

# Kaggle Submission

If we choose to submit the Random Forest model:


In [None]:
pred = rf.predict(test.drop(["PassengerId"],axis=1)) 
sub = pd.DataFrame({"PassengerId":test["PassengerId"],"Survived":pred})

In [None]:
sub.to_csv("submissions/rf_sub.csv",index=False) # LB 0.74641

My submission scored 0.74641 on the leaderboard.