# Titanic Modeling: Basic Version - 4 Models
**by Remington Greider-Little**<br/>
**Data Analytics @ Newman University**

**Data:** A previously cleaned version of [the Titanic data set from Kaggle](https://www.kaggle.com/c/titanic/overview).

**This Notebook:** A demonstration of a standard machine learning workflow: split the data, train four classifiers on the training split, and score each one on the held-out test split.

**Contents:**
1. Read and Review Data
2. Prepare Data Splits
3. Train Models
4. Test Models

In [8]:
# Essential Libraries
import numpy as np
import pandas as pd

# Libraries for Machine Learning Process
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# 1. Read and Review Data

This data has been cleaned in a previous EDA and preparation process.

In [9]:
# Read cleaned version of the data
df = pd.read_csv('data/titanic_cleaned.csv')
df.head(10)

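|   | Survived | Pclass | Sex | Age | Fare | Family_count | Cabin_ind |
|---|----------|--------|-----|-----|------|--------------|-----------|
| 0 | 0 | 3 | 0 | 22.0 | 7.25 | 1 | 0 |
| 1 | 1 | 1 | 1 | 38.0 | 71.2833 | 1 | 1 |
| 2 | 1 | 3 | 1 | 26.0 | 7.925 | 0 | 0 |
| 3 | 1 | 1 | 1 | 35.0 | 53.1 | 1 | 1 |
| 4 | 0 | 3 | 0 | 35.0 | 8.05 | 0 | 0 |
| 5 | 0 | 3 | 0 | 29.699118 | 8.4583 | 0 | 0 |
| 6 | 0 | 1 | 0 | 54.0 | 51.8625 | 0 | 1 |
| 7 | 0 | 3 | 0 | 2.0 | 21.075 | 4 | 0 |
| 8 | 1 | 3 | 1 | 27.0 | 11.1333 | 2 | 0 |
| 9 | 1 | 2 | 1 | 14.0 | 30.0708 | 1 | 0 |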


In [10]:
# Dataframe fundamental info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Survived      891 non-null    int64  
 1   Pclass        891 non-null    int64  
 2   Sex           891 non-null    int64  
 3   Age           891 non-null    float64
 4   Fare          891 non-null    float64
 5   Family_count  891 non-null    int64  
 6   Cabin_ind     891 non-null    int64  
dtypes: float64(2), int64(5)
memory usage: 48.9 KB
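
The `df.info()` summary above shows no missing values and only numeric dtypes, which is what the scikit-learn estimators used later expect. As a minimal sketch, the same check can be done programmatically. The helper name `is_model_ready` and the small `sample` frame are hypothetical and stand in for the real data.

```python
import pandas as pd

# Hypothetical stand-in rows (not the real cleaned Titanic file)
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 54.0],
})


def is_model_ready(frame: pd.DataFrame) -> bool:
    """Return True when every column is numeric and has no missing values."""
    no_missing = not frame.isnull().any().any()
    all_numeric = all(
        pd.api.types.is_numeric_dtype(dtype) for dtype in frame.dtypes
    )
    return no_missing and all_numeric


print(is_model_ready(sample))
```

Calling `is_model_ready(df)` on the notebook's frame would be one way to guard the later cells against an accidentally uncleaned file.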


# 2. Prepare Data Splits

In [11]:
# features — all columns except target variable
features = df.drop('Survived', axis=1)

# labels — only the target variable column
labels = df['Survived']

In [12]:
# Create Train and Test Splits
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Report Number and Proportion of Train and Test Features and Labels
print(f'Train Split: {X_train.shape[0]} Records, {len(y_train)} Labels = {len(y_train) / len(labels):.2%}')
print(f'Test Split: {X_test.shape[0]} Records, {len(y_test)} Labels = {len(y_test) / len(labels):.2%}')

Train Split: 712 Records, 712 Labels = 79.91%
Test Split: 179 Records, 179 Labels = 20.09%
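
The split above is purely random, so the share of survivors in the train and test splits can differ slightly. One optional variation, not used in this notebook, is to pass `stratify` to `train_test_split` so both splits keep the same class proportions. The sketch below uses hypothetical toy data (`toy_features`, `toy_labels`) rather than the Titanic frame.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 40% positive class (not the Titanic data)
rng = np.random.default_rng(0)
toy_features = pd.DataFrame({"x": rng.normal(size=100)})
toy_labels = pd.Series([1] * 40 + [0] * 60, name="target")

# stratify=toy_labels asks for the same class mix in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(f"Positive rate, train: {y_tr.mean():.2%}")
print(f"Positive rate, test:  {y_te.mean():.2%}")
```

In the notebook this would correspond to adding `stratify=labels` to the existing call, which would change the rows in each split and therefore the scores reported later.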


# 3. Train Models

In [15]:
# Define the models list
models = [LogisticRegression(),
          DecisionTreeClassifier(),
          RandomForestClassifier(),
          GradientBoostingClassifier()
         ]

# Train the models using the training features and labels
for model in models: 
    model.fit(X_train, y_train)
    # Report trained model
    print(f'Trained and ready: {model}')

Trained and ready: LogisticRegression()
Trained and ready: DecisionTreeClassifier()
Trained and ready: RandomForestClassifier()
Trained and ready: GradientBoostingClassifier()
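
The three tree-based models above are created without a `random_state`, so their internal randomness can produce slightly different fitted models, and slightly different scores, each time the notebook runs. A minimal sketch of fixing the seed follows, using synthetic data from `make_classification` as a stand-in for the Titanic features. The seed value 42 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Two separately constructed forests that share the same seed
first = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)
second = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# With a fixed seed, both forests make the same predictions
same_predictions = bool((first.predict(X_demo) == second.predict(X_demo)).all())
print(f"Identical predictions across runs: {same_predictions}")
```

The same argument is accepted by `DecisionTreeClassifier` and `GradientBoostingClassifier`. Setting it in the models list would make the reported scores repeatable, though the printed model names would then include the parameter.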


# 4. Test Models

In [16]:
# Test all models on the test split
for model in models:
    
    # Use the model to generate predictions for the Test split, based on its features only
    y_pred = model.predict(X_test)

    # Compare model's predictive performance to the provided test labels
    score = accuracy_score(y_test, y_pred) * 100

    # Report the model and its score
    print(model)
    print(f'  {score}')

LogisticRegression()
  82.12290502793296
DecisionTreeClassifier()
  77.09497206703911
RandomForestClassifier()
  80.44692737430168
GradientBoostingClassifier()
  80.44692737430168
