### Titanic Classification Project ###
By: Victoria Engler

Description: I recently decided to pursue Data Science full time. Coming from a background in data analytics and Python programming, I remembered that we tried this project in one of my first DS courses.

I couldn't remember if we ever actually completed it end to end, but I hoped to use this today to re-teach myself some of the DS fundamentals and strengthen my ever-growing skillset and knowledge.

For those who are unaware, the Titanic dataset challenge is a popular beginner's project. The challenge is to create a classifier that predicts whether an individual survived or died in the wreck.
Going through various models, I worked to decide whether each one was overfitted, underfitted, or even the right one to use. I learned from this project that there's no set answer in DS, and that there are many different methods and resources to learn from, all of which help me ask tougher questions about the data and be as accurate as possible. In the end, I am still left with questions and plan to keep tuning the models below to help me decide on the best path to success in any challenging scenario.

####  Loading in all the packages used

In [840]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from mlxtend.classifier import StackingCVClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.base import BaseEstimator, TransformerMixin
import warnings
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn import model_selection
from sklearn import metrics

warnings.filterwarnings("ignore")

#### Loading data

In [575]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
target = pd.read_csv('gender_submission.csv')

#### Appending the two CSVs together to do some EDA

In [765]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way.
entire_df = pd.concat([train, test])
entire_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [576]:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

## Setting up my data for success

In [577]:
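X_test = test[['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
               'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']]
# Note: gender_submission.csv is Kaggle's sample submission (it predicts that all
# women survive and all men die), so these are not the true test-set outcomes.
y_test = target['Survived']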

In [578]:
X_train = train[['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
                 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']]
y_train = train['Survived']

In [579]:
X_train.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

It's always interesting to observe the way the world was back then and compare it to now. I was interested below to see what the fare differences were by both class and sex.

In [788]:
classBySex=entire_df.groupby(['Sex', 'Pclass'])['Fare'].mean().unstack()
fig = px.bar(classBySex)
fig.show()
classBySex

Pclass           1          2          3
Sex
female  109.412385  23.234827  15.324250
male     69.888385  19.904946  12.415462


It's also interesting that females paid a higher average fare in every class. It makes me wonder whether the higher fare was related to the smaller number of women and children on board.

I'm also curious about the number of kids and how their fares differed by sex. The number of kids in each class was pretty consistent between girls and boys.

In [829]:
f=entire_df[(entire_df['Age']<18)]
df=f.groupby(['Sex'])['Pclass'].value_counts().unstack()
print(df)
px.bar(df)

Pclass  1   2   3
Sex              
female  8  18  46
male    7  15  60


The average fare for boys was slightly higher in all classes.

In [834]:
kidsFareBySex=entire_df[entire_df['Age']<18].groupby(['Sex', 'Pclass'])['Fare'].mean().unstack()
fig = px.bar(kidsFareBySex)
fig.show()


There were significantly more men than women in all classes, except class.

In [833]:
# Use >= so that 18-year-olds are not dropped from both the kids (<18) and adults groups.
f = entire_df[entire_df['Age'] >= 18]
df=f.groupby(['Sex'])['Pclass'].value_counts().unstack()
#print(df)
px.bar(df)

In first class, women's fares were surprisingly higher than men's.

In [835]:
# Use >= so that 18-year-olds are counted as adults (kids are Age < 18).
adultsFareBySex = entire_df[entire_df['Age'] >= 18].groupby(['Sex', 'Pclass'])['Fare'].mean().unstack()
fig = px.bar(adultsFareBySex)
fig.show()


#### Now to the data transformation and classification.

Creating the preprocessor and pipeline for the initial analysis

- Imputer to fill the null values in the numeric columns with the column mean
- Encoder to turn the categorical columns into binary (one-hot) numeric columns
- Scaler to make sure the numeric columns are all on the same scale
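To make the three steps above concrete, here is a minimal sketch on a made-up four-row frame. The rows, the `toy_*` names, and the `sparse_threshold=0` setting (used only to force a dense array for printing) are assumptions for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only; they are not taken from train.csv.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Sex": ["male", "female", "male", "female"],
    "Embarked": ["S", "C", "S", "S"],
})

# Numeric columns: fill missing values with the mean, then standardize.
toy_numeric = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)

# Categorical columns: one indicator column per category.
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", toy_numeric, ["Age", "Fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array so it prints readably
)

toy_transformed = toy_preprocessor.fit_transform(toy)
print(toy_transformed.shape)
print(toy_transformed)
```

The missing age is replaced by the column mean before scaling, so that row ends up at zero in the scaled Age column, and each category becomes its own 0/1 column.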

In [610]:
numeric_features = ["Age", "Fare"]
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)

categorical_features = ["Embarked", "Sex", "Pclass"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
# NOTE: sclf is the stacked classifier; it must already be defined when this cell runs.
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", sclf)]
)

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

model score: 0.871
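The cell above relies on `sclf`, the stacked classifier, whose definition is not shown in this write-up. As a rough sketch of what a stacked model can look like, the example below uses scikit-learn's `StackingClassifier` (a swapped-in alternative to the `StackingCVClassifier` imported from mlxtend) on synthetic data. The base learners, the meta-learner, and every `demo_*` name are assumptions, so the real `sclf` may well differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs without train.csv.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Hypothetical base learners and meta-learner; the notebook's sclf may differ.
demo_sclf = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold predictions from the base learners train the meta-learner
)

demo_sclf.fit(X_fit, y_fit)
demo_score = demo_sclf.score(X_hold, y_hold)
print(f"held-out accuracy: {demo_score:.3f}")
```

A model built this way could take the place of `sclf` as the final `"classifier"` step of the pipeline.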


In [604]:
y_pred=clf.predict(X_train)

In [606]:
accuracy_score(y_true=y_train, y_pred=y_pred)

0.7912457912457912

Defining 4 different models, just to take a look at how they do and see if I can adjust any of the hyperparameters.

In [839]:
# Note: Ridge is a regressor, so its .score() reports R^2 rather than accuracy
# (RidgeClassifier is the classification counterpart).
classifier1 = Ridge()
classifier2 = LogisticRegression()
classifier3 = DecisionTreeClassifier()

# Initializing Random Forest classifier
classifier4 = RandomForestClassifier(n_estimators=250,
                                     criterion='gini',
                                     max_depth=10,
                                     max_features=0.3,
                                     min_samples_split=3,
                                     random_state=1)

In [697]:
classifiers = {"Ridge": classifier1,
               "LGC": classifier2,
               "DT": classifier3,
               "RF": classifier4}
classification_scores = {}
for name, model in classifiers.items():
    # Same preprocessing for every model; only the final estimator changes.
    clf = Pipeline(
        steps=[("preprocessor", preprocessor), ("classifier", model)]
    )

    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    y_pred = clf.predict(X_train)

    cv_scores = cross_val_score(clf, X_train, y_train, cv=5)
    classification_scores[name] = (f'Score: {score}', f'CV Score: {cv_scores}')

In [698]:
classification_scores

{'Ridge': ('Score: 0.6824745596095751',
  'CV Score: [0.31460216 0.38270739 0.38369443 0.32383759 0.43457676]'),
 'LGC': ('Score: 0.9545454545454546',
  'CV Score: [0.78212291 0.81460674 0.78089888 0.76966292 0.80337079]'),
 'DT': ('Score: 0.8277511961722488',
  'CV Score: [0.73743017 0.76404494 0.80337079 0.76404494 0.80337079]'),
 'RF': ('Score: 0.8755980861244019',
  'CV Score: [0.82681564 0.79775281 0.85955056 0.8258427  0.84831461]')}
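The Ridge numbers are not directly comparable with the other three rows: `Ridge` is a regression model, so its `score` is R^2, while the classifiers report accuracy. The sketch below shows the difference on synthetic data; the `*_demo` data and the variable names are assumptions for illustration, not the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge, RidgeClassifier
from sklearn.metrics import accuracy_score, r2_score

# Synthetic binary-classification data used only for this comparison.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

reg = Ridge().fit(X_demo, y_demo)
clf_ridge = RidgeClassifier().fit(X_demo, y_demo)

# Ridge.score is the R^2 of its continuous predictions against the 0/1 labels.
ridge_r2 = reg.score(X_demo, y_demo)
manual_r2 = r2_score(y_demo, reg.predict(X_demo))

# RidgeClassifier.score is plain accuracy on predicted class labels.
ridge_acc = clf_ridge.score(X_demo, y_demo)
manual_acc = accuracy_score(y_demo, clf_ridge.predict(X_demo))

print(f"Ridge R^2:                {ridge_r2:.3f}")
print(f"RidgeClassifier accuracy: {ridge_acc:.3f}")
```

Swapping `Ridge()` for `RidgeClassifier()` in the comparison would put all four models on the same accuracy scale.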

At first glance, LogisticRegression had the best test score (0.955), while RandomForest had the strongest cross-validation scores. Below I dig deeper into LogisticRegression and DecisionTreeClassifier.

In [727]:
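DTparams = {"classifier__criterion": ["gini", "entropy"],
            "classifier__max_depth": [3, 7, 10],
            # A float min_samples_split is a fraction of the samples and must lie in (0, 1].
            "classifier__min_samples_split": np.logspace(-3, 0, 20),
            "preprocessor__cat__handle_unknown": ["ignore"]}

# Note: verbose and n_jobs only control logging and parallelism; of these entries,
# only fit_intercept changes the fitted model.
LRparams = {"classifier__verbose": [0, 1, 4],
            "classifier__n_jobs": [-1, 3, 5, None],
            "classifier__fit_intercept": [True, False],
            "preprocessor__cat__handle_unknown": ["ignore"]}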




LRclf = Pipeline(
    [("preprocessor", preprocessor), 
     ('classifier', LogisticRegression())]
                                            
)

LRgrid = GridSearchCV(estimator = LRclf, 
                    param_grid = LRparams, 
                    cv = 5, 
                    verbose=True,
                   n_jobs=-1,
                   scoring='precision')

DTclf = Pipeline(
    [("preprocessor", preprocessor),
     # max_features='auto' was removed in scikit-learn 1.3; for decision trees it meant 'sqrt'.
     ("classifier", DecisionTreeClassifier(random_state=7, max_features='sqrt'))
     ])


DTgrid = GridSearchCV(estimator = DTclf, 
                    param_grid = DTparams, 
                    cv = 5, 
                    verbose=True,
                   n_jobs=-1,
                   scoring='precision')

LRgrid.fit(X_train,y_train)
print(LRgrid.score(X_test,y_test))
print(LRgrid.best_params_)
best=LRgrid.best_score_
y_pred=LRgrid.predict_proba(X_test)
auc = metrics.roc_auc_score(y_test, y_pred[:,1])
# print(f"The AUC of the Logistic Regression classifier is {auc:.3f}")
print(f"The best score of the Logistic Regression Grid Search is {best}")
print('-----------------------------------------')
DTgrid.fit(X_train,y_train)
print(DTgrid.score(X_test,y_test))
print(DTgrid.best_params_)
y_pred=DTgrid.predict_proba(X_test)
best=DTgrid.best_score_
auc = metrics.roc_auc_score(y_test, y_pred[:,1])

# Print results
print(f"The best score of the Decision Tree Classifier is {best}")

Fitting 5 folds for each of 24 candidates, totalling 120 fits
0.9182389937106918
{'classifier__fit_intercept': True, 'classifier__n_jobs': -1, 'classifier__verbose': 0, 'preprocessor__cat__handle_unknown': 'ignore'}
The best score of the Logistic Regression Grid Search is 0.7441297423524584
-----------------------------------------
Fitting 5 folds for each of 120 candidates, totalling 600 fits
1.0
{'classifier__criterion': 'entropy', 'classifier__max_depth': 3, 'classifier__min_samples_split': 0.001, 'preprocessor__cat__handle_unknown': 'ignore'}
The best score of the Decision Tree Classifier is 0.9518355770602241
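In the Logistic Regression grid above, `verbose` and `n_jobs` do not change the fitted model, so the search is effectively only comparing `fit_intercept`. A grid over the regularization strength `C` is one way to give the search something to tune. The sketch below runs on synthetic data; the `C` values and all `demo_*` names are illustrative assumptions, not tuned choices from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the sketch runs without train.csv.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# C is the inverse regularization strength; these values are illustrative guesses.
demo_params = {
    "classifier__C": [0.01, 0.1, 1.0, 10.0],
    "classifier__fit_intercept": [True, False],
}

demo_grid = GridSearchCV(demo_pipe, demo_params, cv=5, scoring="accuracy")
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
print(f"best CV accuracy: {demo_grid.best_score_:.3f}")
```

The same `classifier__C` key should work with the `LRclf` pipeline, since its final step is also named `classifier`.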


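Seeing that the decision tree scored 1.0 on the test set (a precision score, since the grid search uses scoring='precision') is a clear sign to me that something is off, most likely overfitting. I need to keep exploring different statistics to make sure I'm not overfitting the model.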
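Before settling on overfitting as the explanation, one other thing may be worth checking. The `y_test` labels come from `gender_submission.csv`, which is Kaggle's sample submission and marks every woman as a survivor and every man as a non-survivor. A shallow tree that splits on `Sex` could therefore match those labels perfectly without having learned much. The helper below is a hypothetical check, shown on made-up rows, that measures how closely a set of labels follows that rule.

```python
import pandas as pd


def agreement_with_sex_rule(features: pd.DataFrame, labels: pd.Series) -> float:
    """Fraction of rows whose label is 1 for 'female' and 0 for 'male'."""
    rule = (features["Sex"] == "female").astype(int)
    return float((rule.to_numpy() == labels.to_numpy()).mean())


# Made-up rows for illustration only.
toy_features = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
toy_labels = pd.Series([0, 1, 1, 0])

print(agreement_with_sex_rule(toy_features, toy_labels))  # prints 1.0 for these toy rows
```

With the notebook's variables the call would be `agreement_with_sex_rule(X_test, y_test)`. A result of 1.0 there would suggest that the test scores mostly measure agreement with the sex-only baseline rather than true survival.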

In [735]:
DTclf = Pipeline(
    [("preprocessor", preprocessor),
     # max_features='auto' was removed in scikit-learn 1.3; for decision trees it meant 'sqrt'.
     ("classifier", DecisionTreeClassifier(criterion='entropy',
                                           random_state=7,
                                           max_features='sqrt',
                                           max_depth=3,
                                           min_samples_split=0.001))
     ])

cv_scores = cross_val_score(DTclf, X_train, y_train, cv=50)
print(f'CV Score: {cv_scores}')

CV Score: [0.83333333 0.66666667 0.66666667 0.88888889 0.66666667 0.77777778
 0.72222222 0.77777778 0.66666667 0.77777778 0.72222222 0.66666667
 0.88888889 0.77777778 0.72222222 0.77777778 0.94444444 0.83333333
 0.83333333 0.72222222 0.77777778 0.83333333 0.77777778 0.83333333
 0.88888889 0.72222222 0.77777778 0.77777778 0.83333333 0.94444444
 0.72222222 0.77777778 0.83333333 0.83333333 0.83333333 0.77777778
 0.66666667 0.72222222 0.72222222 0.83333333 0.83333333 0.82352941
 0.88235294 0.76470588 0.76470588 0.76470588 0.76470588 0.82352941
 0.82352941 0.88235294]


In [737]:
fig = px.scatter(cv_scores)
fig.show()

In [837]:
LRclf = Pipeline(
    [("preprocessor", preprocessor), 
     ("classifier", LogisticRegression(fit_intercept=True, n_jobs= -1, verbose= 0) )
     ])

cv_scores = cross_val_score(LRclf, X_train, y_train, cv=50)
print(f'CV Score: {cv_scores}')

CV Score: [0.94444444 0.55555556 0.83333333 0.83333333 0.72222222 0.83333333
 0.55555556 0.94444444 0.66666667 0.88888889 0.83333333 0.72222222
 0.88888889 0.72222222 0.66666667 0.77777778 0.94444444 0.72222222
 0.88888889 0.88888889 0.77777778 0.88888889 0.66666667 0.72222222
 0.83333333 0.77777778 0.77777778 0.83333333 0.66666667 0.94444444
 0.61111111 0.66666667 0.77777778 0.83333333 0.83333333 0.72222222
 0.72222222 0.77777778 0.83333333 0.77777778 0.83333333 0.76470588
 0.76470588 0.94117647 0.64705882 0.70588235 0.82352941 0.88235294
 0.82352941 0.82352941]


There is a bit more variability with the Logistic Regression classifier. I will look more deeply at these results to work out the best route; for now, I feel confident in Logistic Regression because of its consistency and accuracy score.

In [838]:
fig = px.scatter(cv_scores)
fig.show()
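The variability of the two 50-fold runs can also be put into numbers by comparing the mean and standard deviation of the fold scores. The sketch below defines a small hypothetical helper, `summarize_cv`, and applies it to the first six scores of each printed array only, to keep the example short. The full `cv_scores` arrays would be passed in the same way.

```python
import numpy as np


def summarize_cv(scores):
    """Return (mean, standard deviation) of a sequence of fold scores."""
    arr = np.asarray(scores, dtype=float)
    return arr.mean(), arr.std()


# First six fold scores from each printed array above; the full arrays have 50 entries.
dt_first_six = [0.83333333, 0.66666667, 0.66666667, 0.88888889, 0.66666667, 0.77777778]
lr_first_six = [0.94444444, 0.55555556, 0.83333333, 0.83333333, 0.72222222, 0.83333333]

for name, scores in [("DT", dt_first_six), ("LR", lr_first_six)]:
    mean, std = summarize_cv(scores)
    print(f"{name}: mean={mean:.3f}, std={std:.3f}")
```

A higher mean with a similar or lower standard deviation would be a stronger argument for one model than a scatter plot alone.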

Next steps: 
- Keep tuning hyperparameters
- Try re-splitting the data for more sampling and testing
- Look at more references on overfitting and see how we can re-evaluate the results
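For the re-splitting step, one option is to hold out part of the labelled training data as a validation set, so that scores are measured against known outcomes. The sketch below uses synthetic data in place of the preprocessed Titanic features. The 80/20 split, the model, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labelled rows of train.csv.
X_all, y_all = make_classification(n_samples=400, n_features=6, random_state=0)

# Hold out 20% with known labels; stratify keeps the class balance similar in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0, stratify=y_all
)

holdout_model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
train_acc = holdout_model.score(X_fit, y_fit)
val_acc = holdout_model.score(X_val, y_val)

print(f"train accuracy:      {train_acc:.3f}")
print(f"validation accuracy: {val_acc:.3f}")
```

A large gap between the two numbers is one common symptom of overfitting, which ties in with the last item on the list.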