<h1 align="center">Titanic with Scikit-Learn</h1>
<h1 align="center"><font size = 5>Zach Chase</h1>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin, TransformerMixin
from sklearn.metrics import (confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

import warnings
warnings.filterwarnings('ignore')

# Custon Transformer Class for Titanic
Writing custom scikit-learn transformers is a convenient way to organize the data
cleaning process. Consider the data in titanic.csv, which contains information about passengers on the maiden voyage of the RMS Titanic in 1912. Write a custom transformer class to
clean this data, implementing the transform() method as follows:
1. Extract a copy of data frame with just the "Pclass", "Sex", and "Age" columns.
2. Replace NaN values in the "Age" column (of the copied data frame) with the mean age.
The mean age of the training data should be calculated in fit() and used in transform()
(compare this step to using sklearn.preprocessing.Imputer).
3. Convert the "Pclass" column datatype to pandas categoricals (pd.CategoricalIndex).
4. Use pd.get_dummies() to convert the categorical columns to multiple binary columns
(compare this step to using sklearn.preprocessing.OneHotEncoder).
5. Cast the result as a NumPy array and return it.
Ensure that your transformer matches scikit-learn conventions (it inherits from the correct base
classes, fit() returns self, etc.).

In [2]:
class TitanicTransformer(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y =None):
        
        #Calculate mean age
        self.mean_age = X.Age.mean()
        
        return self
    
    
    def transform(self, X):
        
        #Fill Na for Age
        X.Age = X.Age.fillna(self.mean_age)
        
        #Get dummies for sex
        X.Sex = pd.get_dummies(X.Sex)
        #Pclass_types = ['1','2','3']
        #X.Pclass.astype("category", categories = Pclass_types).cat.codes
        
        #Get dummies for Pclass
        X.Pclass = pd.CategoricalIndex(X.Pclass)
        X.Pclass = pd.get_dummies(df.Pclass, drop_first=True)
        #X.Pclass = pd.Categorical(X.Pclass)
        #X.Pclass = X.Pclass.cat.codes
        
        X = X[["Age", "Sex", "Pclass"]]
        return X.values

# Problem 2
Read the data from titanic.csv with pd.read_csv(). The "Survived" column
indicates which passengers survived, so the entries of the column are the labels that we would
like to predict. Drop any rows in the raw data that have NaN values in the "Survived" column,
then separate the column from the rest of the data. Split the data and labels into training and
testing sets. Use the training data to fit a transformer from Problem 1, then use that transformer
to clean the training set, then the testing set. Finally, train a LogisticRegressionClassifier
and a RandomForestClassifier on the cleaned training data, and score them using the cleaned
test set

In [4]:
# read the data from filename
df = pd.read_csv("titanic.csv")

# drop rows that have NaN values in the survived column
df = df.dropna(subset = ["Survived"])

# separate survived column from rest of data
y = df["Survived"]
X = df.drop(["Survived"], axis = 1)

# split data and labels into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# use the train data to fit a transformer
titanic = TitanicTransformer()

# use the transformer to clean the training and test set
titanicX_train = titanic.fit(X_train)
cleanTitanicX_train = titanicX_train.transform(X_train)

titanicX_test = titanic.fit(X_test)
cleanTitanicX_test = titanicX_test.transform(X_test)

# train a log reg classifier
mylogreg = LogisticRegression().fit(cleanTitanicX_train, y_train)

# train a random for classifier
myRF = RandomForestClassifier().fit(cleanTitanicX_train, y_train)

# score both classifiers using the cleaned test set
logPredictions = mylogreg.predict(cleanTitanicX_test)
logScore = mylogreg.score(cleanTitanicX_test, y_test)
print('Logistic Regression Score \t:{}'.format(logScore))

rfPredictions = myRF.predict(cleanTitanicX_test)
rfScore = myRF.score(cleanTitanicX_test, y_test)
print('Random Forest Score \t\t:{}'.format(rfScore))

Logistic Regression Score 	:0.7591463414634146
Random Forest Score 		:0.75


# Problem 3
Use classification_report() to score your classifiers from Problem 2. Next,
do a grid search for each classifier (using only the cleaned training data), varying at least two
hyperparameters for each kind of model. Use classification_report() to score the resulting
best estimators with the cleaned test data. Try changing the hyperparameter spaces or scoring
metrics so that each grid search yields a better estimator.

In [5]:
# print classification report for log reg from previous problem
print('Logistic Regression Classification Report')
print(classification_report(y_test, logPredictions))

# print classification report for random forest from previous problem
print('Random Forest Classification Report')
print(classification_report(y_test, rfPredictions))

# grid search over 2+ hyperparameters for log reg
gsLogReg = LogisticRegression()
log_grid = {"solver": ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
           "penalty": ['l1', 'l2', 'elasticnet', None]}
log_gs = GridSearchCV(gsLogReg, log_grid, cv=4, n_jobs=-1, verbose=1)
log_gs.fit(cleanTitanicX_train, y_train)

# print classification report for best estimation
print('Best Logistic Regression Classification Report')
print(log_gs.best_params_, log_gs.best_score_)

# grid search over 2+ hyperparameters for random forests
gsRF = RandomForestClassifier()
rf_grid = {"n_estimators" : [10, 50, 100, 150],
           "min_impurity_split": [1e-5, 1e-6, 1e-7, 1e-8]}
rf_gs = GridSearchCV(gsRF, rf_grid, cv=4, n_jobs=-1, verbose=1)
rf_gs.fit(cleanTitanicX_train, y_train)


# print classification report for best estimation
print('Best Random Forest Classification Report')
print(rf_gs.best_params_, rf_gs.best_score_)

Problem 3 Logistic Regression Classification Report
              precision    recall  f1-score   support

         0.0       0.79      0.84      0.81       204
         1.0       0.70      0.63      0.66       124

    accuracy                           0.76       328
   macro avg       0.75      0.73      0.74       328
weighted avg       0.76      0.76      0.76       328

Problem 3 Random Forest Classification Report
              precision    recall  f1-score   support

         0.0       0.79      0.82      0.80       204
         1.0       0.68      0.64      0.66       124

    accuracy                           0.75       328
   macro avg       0.73      0.73      0.73       328
weighted avg       0.75      0.75      0.75       328

Fitting 4 folds for each of 20 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:    2.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Best Logistic Regression Classification Report
{'penalty': 'l1', 'solver': 'liblinear'} 0.7869462419113986
Fitting 4 folds for each of 16 candidates, totalling 64 fits
Best Random Forest Classification Report
{'min_impurity_split': 1e-07, 'n_estimators': 150} 0.7818773851003816


[Parallel(n_jobs=-1)]: Done  64 out of  64 | elapsed:    1.7s finished


# Problem 4
Make a pipeline with at least two transformers to further process the Titanic
dataset. Do a gridsearch on the pipeline and report the hyperparameters of the best estimator

In [6]:
# make a pipeline with 2+ tranformers
pipe = Pipeline([("scaler", StandardScaler()),
                 ("robust", RobustScaler()),
                 ("logReg", LogisticRegression())])

# grid search on the pipeline
pipe_param_grid = {"scaler__with_mean":[True, False],
                  "scaler__with_std": [True, False],
                   "logReg__solver": ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                   "logReg__tol": [1e-3, 1e-4, 1e-5],
                  "robust__with_centering": [True, False],
                  "robust__with_scaling": [True, False]}

pipe_gs = GridSearchCV(pipe,pipe_param_grid).fit(cleanTitanicX_train, y_train)

# report hyperparameters of the best estimator
print(pipe_gs.best_params_, pipe_gs.best_score_, sep='\n')

{'logReg__solver': 'newton-cg', 'logReg__tol': 0.001, 'robust__with_centering': True, 'robust__with_scaling': True, 'scaler__with_mean': True, 'scaler__with_std': True}
0.786931523878587


## Special Thanks

A very special thanks to the Brigham Young University ACME program for their guidance on this project. For further details, please visit https://acme.byu.edu/00000179-afb2-d74f-a3ff-bfbb157c0000/scikit19-pdf