# Titanic Survival Prediction with LazyPredict

I saw a post detailing LazyPredict and how it works, and it seemed pretty simple to implement, so I tried it out in this notebook. 

This notebook first uses LazyPredict to see what our options are, then outputs ten different submissions: nine from models that LazyPredict ranked among our top choices, plus one from a model it ranked among the worst performers. We end up with a set of generally good models whose parameters can easily be tinkered with.

There's definitely a better solution than scattering pip statements everywhere to get the different dependency versions to play nice, but in keeping with the theme of LazyPredict, I'm just going to be lazy.
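
When juggling pinned versions like this, it can help to check what is actually installed before and after each pip call. The snippet below is a minimal sketch using only the standard library (Python 3.8+); the helper name `installed_version` is made up for illustration. Note that it reports what is installed on disk, which may differ from a module that was already imported until the kernel is restarted.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Packages this notebook leans on; extend the tuple as needed.
for name in ("pandas", "scikit-learn", "lazypredict", "lightgbm", "xgboost"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```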

In [None]:
# let's get lazypredict
!pip3 install -U lazypredict

In [None]:
!pip3 install -U pandas==1.2.3 # I need this version of Pandas for now
import numpy as np             
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
#train_data.head()
test_data.head()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
# load data into X and y
y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = train_data[features]
#Prepare test data
X_submission = test_data[features]

In [None]:
# Select columns
numerical_cols = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]
categorical_cols = [cname for cname in X.columns if X[cname].nunique() < 10 and 
                        X[cname].dtype == "object"]

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='most_frequent')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

!pip3 install -U pandas==1.0.5  
import pandas as pd 
from lazypredict.Supervised import LazyClassifier 
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)
clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)

!pip install -U pandas==1.2.3 #hopefully the last one
import pandas as pd
# fit() returns a table of model scores and a table of predictions
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(models)

With the features `['Pclass', 'Sex', 'SibSp', 'Parch', 'Age']`, ExtraTreesClassifier is the most accurate model for us, tied with LGBMClassifier. So let's experiment with both and see which comes out on top.
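
Ties like this are easy to miss when reading a long results table by eye. Here is a small sketch of picking out every model that shares the top score. It assumes the LazyPredict results table is a DataFrame indexed by model name with an `Accuracy` column; the table below is a placeholder with made-up names and scores, not output from this notebook, and `top_models` is a hypothetical helper.

```python
import pandas as pd

# Placeholder scores, NOT real results from this notebook.
results = pd.DataFrame(
    {"Accuracy": [0.81, 0.81, 0.79, 0.62],
     "Time Taken": [0.12, 0.05, 0.03, 0.01]},
    index=pd.Index(["ModelA", "ModelB", "ModelC", "ModelD"], name="Model"),
)


def top_models(table, metric="Accuracy", tol=1e-9):
    """Return the names of every model within `tol` of the best value of `metric`."""
    best = table[metric].max()
    return table.index[table[metric] >= best - tol].tolist()


print(top_models(results))
```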

# ExtraTreesClassifier

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('model', ExtraTreesClassifier(n_estimators=100,
                                                              max_depth=8,
                                                              random_state=0))
                          ])

scores = cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='accuracy')

print("Accuracy scores:\n", scores, "\nAn average of: ", sum(scores) / len(scores))

In [None]:
pipeline.fit(X,y)
predictions = pipeline.predict(X_submission)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('ExtraTrees_submission.csv', index=False)
print("ExtraTrees submission was successfully saved!")

# LGBMClassifier

In [None]:
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('model', LGBMClassifier(boosting_type='goss',
                                                          n_estimators=100,
                                                          max_depth=5,
                                                          random_state=0))
                          ])

scores = cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='accuracy')

print("Accuracy scores:\n", scores, "\nAn average of: ", sum(scores) / len(scores))

In [None]:
pipeline.fit(X,y)
predictions = pipeline.predict(X_submission)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('LGBM_submission.csv', index=False)
print("LGBM submission was successfully saved!")

# Playing for a better score

Our results weren't that great, so we'll try a shotgun approach instead: make predictions with several of the other models LazyPredict listed with an accuracy in the 0.75-0.85 range.
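
The shotgun approach boils down to running the same pipeline with a different final estimator each time, so it can also be written as a loop. The sketch below shows that idea on a tiny made-up stand-in for the Titanic columns (the `toy` frame and its labels are invented so the block runs on its own); `make_preprocessor` and `score_candidates` are hypothetical helper names, and the three estimators are just a sample of the ones tried in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up stand-in for the Titanic columns, only so the sketch runs on its own.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1, 2, 3] * 4,
    "Sex": ["female", "male"] * 20,
    "SibSp": [0, 1, 0, 2, 1, 0, 0, 1, 0, 3] * 4,
    "Parch": [0, 0, 1, 0, 2, 0, 1, 0, 0, 1] * 4,
})
toy_y = pd.Series([1, 0] * 20, name="Survived")


def make_preprocessor():
    """Build a fresh preprocessor: impute numeric columns, impute and one-hot encode Sex."""
    return ColumnTransformer(transformers=[
        ("num", SimpleImputer(strategy="most_frequent"), ["Pclass", "SibSp", "Parch"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Sex"]),
    ])


candidates = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "Bagging": BaggingClassifier(n_estimators=10, random_state=0),
}


def score_candidates(models, X, y, cv=5):
    """Cross-validate each model inside the shared pipeline; return mean accuracy per name."""
    mean_scores = {}
    for name, model in models.items():
        pipe = Pipeline(steps=[("preprocessor", make_preprocessor()), ("model", model)])
        mean_scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean()
    return mean_scores


mean_scores = score_candidates(candidates, toy, toy_y)
for name, score in sorted(mean_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The same loop could also fit each pipeline on the full training data and write a file named after the dictionary key, which would replace the near-identical cells that follow.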

# AdaBoostClassifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('model', AdaBoostClassifier(n_estimators=100,
                                                              random_state=0))
                          ])

scores = cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='accuracy')

print("Accuracy scores:\n", scores, "\nAn average of: ", sum(scores) / len(scores))

In [None]:
pipeline.fit(X,y)
predictions = pipeline.predict(X_submission)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('AdaBoost_submission.csv', index=False)
print("AdaBoost submission was successfully saved!")

# LogisticRegression

In [None]:
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('model', LogisticRegression(max_iter=100,
                                                              solver='sag',
                                                              random_state=0))
                          ])

scores = cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='accuracy')

print("Accuracy scores:\n", scores, "\nAn average of: ", sum(scores) / len(scores))

In [None]:
pipeline.fit(X,y)
predictions = pipeline.predict(X_submission)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('LR_submission.csv', index=False)
print("LR submission was successfully saved!")

# NuSVC

In [None]:
from sklearn.svm import NuSVC
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('model', NuSVC(random_state=0))
                          ])

scores = cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='accuracy')

print("Accuracy scores:\n", scores, "\nAn average of: ", sum(scores) / len(scores))

In [None]:
pipeline.fit(X,y)
predictions = pipeline.predict(X_submission)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('NuSVC_submission.csv', index=False)
print("NuSVC submission was successfully saved!")

# KNeighborsClassifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('model', KNeighborsClassifier(n_neighbors=5,
                                                              leaf_size=30))
                          ])

scores = cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='accuracy')

print("Accuracy scores:\n", scores, "\nAn average of: ", sum(scores) / len(scores))

In [None]:
pipeline.fit(X,y)
predictions = pipeline.predict(X_submission)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('KNeighborsClassifier_submission.csv', index=False)
print("KNeighborsClassifier submission was successfully saved!")

# Bagging Classifier

In [None]:
from sklearn.ensemble import BaggingClassifier
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('model', BaggingClassifier(n_estimators=10,
                                                              random_state=0))
                          ])

scores = cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='accuracy')

print("Accuracy scores:\n", scores, "\nAn average of: ", sum(scores) / len(scores))

In [None]:
pipeline.fit(X,y)
predictions = pipeline.predict(X_submission)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('BaggingClassifier_submission.csv', index=False)
print("BaggingClassifier submission was successfully saved!")

# XGBClassifier

In [None]:
from xgboost import XGBClassifier
# n_jobs and random_state replace the deprecated nthread and seed arguments;
# missing is left at its default (NaN)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('model', XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
                                                           gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=10,
                                                           min_child_weight=1, n_estimators=100, n_jobs=-1,
                                                           objective='binary:logistic', reg_alpha=0, reg_lambda=1,
                                                           scale_pos_weight=1, random_state=0, subsample=1))
                          ])

scores = cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='accuracy')

print("Accuracy scores:\n", scores, "\nAn average of: ", sum(scores) / len(scores))

In [None]:
pipeline.fit(X,y)
predictions = pipeline.predict(X_submission)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('XGBClassifier_submission.csv', index=False)
print("XGBClassifier submission was successfully saved!")

# Label Propagation

In [None]:
from sklearn.semi_supervised import LabelPropagation
# Note: n_neighbors only takes effect with kernel='knn'; the default kernel is 'rbf'
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('model', LabelPropagation(n_neighbors=7,
                                                              max_iter=1000))
                          ])

scores = cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='accuracy')

print("Accuracy scores:\n", scores, "\nAn average of: ", sum(scores) / len(scores))

In [None]:
pipeline.fit(X,y)
predictions = pipeline.predict(X_submission)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('LabelProp_submission.csv', index=False)
print("LabelProp submission was successfully saved!")

# The Poor Scorer

According to LazyPredict, with the features `['Pclass', 'Sex', 'SibSp', 'Parch', 'Age']` selected, `QuadraticDiscriminantAnalysis` is one of the worst options we have. Let's try changing the parameters a bit to see how well we can get it to perform.

In [None]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('model', QuadraticDiscriminantAnalysis())
                          ])

scores = cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='accuracy')

print("Accuracy scores:\n", scores, "\nAn average of: ", sum(scores) / len(scores))

In [None]:
pipeline.fit(X,y)
predictions = pipeline.predict(X_submission)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('QDA_submission.csv', index=False)
print("QDA submission was successfully saved!")
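
The QDA cells above run with default settings. One parameter that may be worth varying is `reg_param`, which regularizes the per-class covariance estimates and might help when features are collinear (as one-hot columns can be). The sketch below searches a few values with `GridSearchCV`. It runs on synthetic data from `make_classification` rather than the Titanic set so that it stands on its own, and the grid values are arbitrary picks for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data so the sketch runs on its own; not the Titanic set.
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

qda_pipeline = Pipeline(steps=[("model", QuadraticDiscriminantAnalysis())])

# Arbitrary illustrative grid; "model__" routes the parameter to the pipeline step.
param_grid = {"model__reg_param": [0.0, 0.1, 0.3, 0.5]}

search = GridSearchCV(qda_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best reg_param:", search.best_params_["model__reg_param"])
print("Best mean CV accuracy:", round(search.best_score_, 3))
```

In the notebook itself, the same grid could be passed alongside the existing `preprocessor` step, with `X` and `y` in place of the synthetic data.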