# Titanic Machine Learning Practice Project
In this practice project, we use machine learning to predict whether a given passenger survived the sinking of the Titanic. This is an ongoing competition on kaggle.com. I did this project to gain knowledge and experience in machine learning and data science.

In [2]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## Loading Data

In the code below I load the data and partition it into variables. Our prediction class is put into the y variable. Our predictor classes are put into X for the training data and X_test for the testing data that we will eventually make our final predictions with. PassengerId, Ticket, and Name are excluded from the training and testing data because these categories are not expected to contribute helpful information for survival prediction.

I then split the training data into train and validation sets. 

In [77]:
# Importing Data
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')

# Print the shape of each dataframe
print('Training Data:\n', train.shape)
print('\n\nTesting Data:\n', test.shape)

y = train.Survived

# Selecting all features, excluding PassengerId, Survived, Name, and Ticket
features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Cabin','Embarked']
X = train[features]
X_test = test[features]

# Splitting training data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

Training Data:
 (891, 12)


Testing Data:
 (418, 11)
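
The split above leaves `test_size` unset, so `train_test_split` falls back to its default of holding out 25% of the rows. The sketch below uses placeholder arrays with the same row count as train.csv (891) to show the resulting sizes. The names `X_demo` and `y_demo`, and the stratified variant at the end, are illustrative assumptions and not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as train.csv (891);
# the values are placeholders, not Titanic data.
X_demo = np.arange(891).reshape(-1, 1)
y_demo = np.arange(891) % 2

# With test_size and train_size left unset, 25% of the rows are held out.
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_va.shape)

# Optional variant (an assumption, not what the cell above does):
# stratify keeps the class balance similar in both parts.
X_tr_s, X_va_s, y_tr_s, y_va_s = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
print(y_tr_s.mean(), y_va_s.mean())
```

Stratifying on the target can be worth considering when the classes are imbalanced, since it keeps the validation set representative.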


## Data Exploration


In [78]:
# Checking for columns with missing values
print('Training Data')
print(X_train.shape)
miss_val_col = X_train.isnull().sum()
print(miss_val_col[miss_val_col>0])

print('\nValidation Data')
print(X_val.shape)
miss_val_col = X_val.isnull().sum()
print(miss_val_col[miss_val_col>0])

print('\nTesting Data')
print(X_test.shape)
miss_val_col = X_test.isnull().sum()
print(miss_val_col[miss_val_col>0])

Training Data
(668, 8)
Age         132
Cabin       519
Embarked      2
dtype: int64

Validation Data
(223, 8)
Age       45
Cabin    168
dtype: int64

Testing Data
(418, 8)
Age       86
Fare       1
Cabin    327
dtype: int64


In [87]:
# Dropping Cabin, which is missing in most rows of every split (see counts above)
X_train = X_train.drop('Cabin',axis=1)
X_val = X_val.drop('Cabin',axis=1)
X_test = X_test.drop('Cabin',axis=1)
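
According to the counts printed earlier, Cabin is missing in 519 of 668 training rows, 168 of 223 validation rows, and 327 of 418 testing rows, which is roughly 75 to 78 percent of each split. A minimal sketch of how such a decision could be made from missing-value fractions is shown below. The toy frame, the 0.5 cutoff, and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train; values are made up for illustration.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 54.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 8.05, 51.86],
})

# isnull().mean() gives the fraction of missing entries per column
missing_frac = toy.isnull().mean()
print(missing_frac)

# Hypothetical cutoff: drop any column missing in more than half of the rows
threshold = 0.5
to_drop = missing_frac[missing_frac > threshold].index.tolist()
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
```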

In [89]:
# Identifying categorical and numerical columns
print(X_train.dtypes)
cat_cols = X_train.columns[X_train.dtypes == 'object']
print('\nUnique values of categorical columns:\n',X_train[cat_cols].nunique())

num_cols = X_train.columns.drop(cat_cols)
print('\nNumerical columns:\n',num_cols)

Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
dtype: object

Unique values of categorical columns:
 Sex         2
Embarked    3
dtype: int64

Numerical columns:
 Index(['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')
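
The cell above finds categorical columns by comparing dtypes to `'object'`. That works for the output shown here, but how pandas stores strings can depend on the pandas version and configuration. As an alternative sketch, `select_dtypes` can split columns into numeric and non-numeric groups without naming the string dtype. The toy frame and the `_alt` variable names are assumptions for illustration.

```python
import pandas as pd

# Toy frame with a few of the Titanic column names; values are made up.
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, 38.0, 26.0],
    'Embarked': ['S', 'C', 'S'],
})

# Numeric columns (int and float) versus everything else
num_cols_alt = toy.select_dtypes(include='number').columns
cat_cols_alt = toy.select_dtypes(exclude='number').columns
print(list(cat_cols_alt), list(num_cols_alt))
```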


## Preprocessing

In [91]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Numerical data preprocessing
num_preproc = SimpleImputer(strategy='mean')

# Categorical data preprocessing
cat_preproc = Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='most_frequent')),
    ('onehot',OneHotEncoder(handle_unknown='ignore'))
])

# Combining numerical and categorical preprocessors
preproc = ColumnTransformer(transformers=[
    ('num',num_preproc,num_cols),
    ('cat',cat_preproc,cat_cols)
])
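
To see what this preprocessor does to the data, the sketch below applies the same kind of `ColumnTransformer` to a four-row toy frame. Missing numbers are replaced by the column mean, missing categories by the most frequent value, and each category becomes its own 0/1 column. The toy frame and the names `toy` and `toy_preproc` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data with one missing value in each column (made-up values)
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0, 30.0],
    'Sex': pd.Series(['male', 'female', np.nan, 'male'], dtype=object),
})

toy_preproc = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='mean'), ['Age']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['Sex']),
])

out = toy_preproc.fit_transform(toy)
# The result may be sparse depending on its density; convert for printing
out = out.toarray() if hasattr(out, 'toarray') else np.asarray(out)

# Columns: Age, Sex=female, Sex=male
print(out)
```

In this toy case the missing Age becomes 30.0 (the mean of 20, 40, and 30) and the missing Sex becomes 'male', the most frequent value.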

## Building model and pipeline

In [99]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)


In [101]:
first_pipeline = Pipeline(steps=[
    ('preprocessor',preproc),
    ('model',model)
])

first_pipeline.fit(X_train,y_train)
y_pred = first_pipeline.predict(X_val)

In [103]:
# Computing accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_val, y_pred)

0.7802690582959642
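
Accuracy is the fraction of validation rows predicted correctly. With 223 validation rows, the score above corresponds to 174 correct predictions (174 / 223 ≈ 0.7803). The sketch below shows on made-up labels that `accuracy_score` matches a manual computation, and also computes the accuracy of always predicting the majority class, which is a useful reference point. All arrays and names here are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, not taken from the Titanic data
y_true_demo = np.array([1, 0, 0, 1, 1])
y_pred_demo = np.array([1, 0, 1, 1, 0])

# accuracy_score is the share of positions where the two arrays agree
acc_sklearn = accuracy_score(y_true_demo, y_pred_demo)
acc_manual = (y_true_demo == y_pred_demo).mean()

# Reference point: accuracy of always predicting the more common class
majority_acc = max(y_true_demo.mean(), 1 - y_true_demo.mean())

print(acc_sklearn, acc_manual, majority_acc)
```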

In [104]:
# Function to build a pipeline and compute validation accuracy for the given preprocessors and model
def pipe_acc(num_preprocessor,cat_preprocessor,model):
    
    preproc = ColumnTransformer(transformers=[
        ('num',num_preprocessor,num_cols),
        ('cat',cat_preprocessor,cat_cols)
    ])
    my_pipeline = Pipeline(steps=[
        ('preprocessor',preproc),
        ('model',model)
    ])
    
    my_pipeline.fit(X_train,y_train)
    return accuracy_score(y_val,my_pipeline.predict(X_val))

In [106]:
# Analyzing parameters to numerical preprocessor


0.7802690582959642
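
The code for this last cell is not shown in the source. As a hypothetical sketch of what comparing numerical preprocessors could look like, the block below tries several `SimpleImputer` strategies with a helper shaped like `pipe_acc`. It runs on synthetic data so that it is self-contained. The data, the helper `demo_pipe_acc`, and every variable name are assumptions, and the scores it prints are not comparable to the Titanic results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (not Titanic data)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'Age': rng.normal(30, 10, n),
    'Fare': rng.exponential(30, n),
    'Sex': pd.Series(rng.choice(['male', 'female'], n), dtype=object),
})
# Blank out 40 Age values so the imputer has something to do
demo.loc[rng.choice(n, 40, replace=False), 'Age'] = np.nan
# Noisy synthetic target: follows Sex, flipped in about 20% of rows
target = ((demo['Sex'] == 'female') ^ (rng.random(n) < 0.2)).astype(int)

demo_num, demo_cat = ['Age', 'Fare'], ['Sex']
X_tr, X_va, y_tr, y_va = train_test_split(demo, target, random_state=42)

def demo_pipe_acc(num_preprocessor):
    # Same structure as pipe_acc, with a fresh model on every call
    preproc = ColumnTransformer(transformers=[
        ('num', num_preprocessor, demo_num),
        ('cat', OneHotEncoder(handle_unknown='ignore'), demo_cat),
    ])
    pipe = Pipeline(steps=[
        ('preprocessor', preproc),
        ('model', RandomForestClassifier(n_estimators=50, random_state=42)),
    ])
    pipe.fit(X_tr, y_tr)
    return accuracy_score(y_va, pipe.predict(X_va))

# Validation accuracy for each imputation strategy
results = {
    strategy: demo_pipe_acc(SimpleImputer(strategy=strategy))
    for strategy in ['mean', 'median', 'most_frequent']
}
for strategy, acc in results.items():
    print(strategy, acc)
```

Differences between strategies on a single validation split can be small and noisy, so cross-validation may give a steadier comparison.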