_Following the tutorial at https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy on the Kaggle Titanic problem (https://www.kaggle.com/c/titanic). I am using this tutorial as a refresher on the practical side of machine learning and data science, to ensure that I do not lose the skills from my master's degree. It will also let me see a project tackled from another perspective and, hopefully, learn something new._

# 1. Defining the problem
This is a supervised binary classification task: predict whether an individual survived the sinking of the RMS Titanic.

__Project summary:__ The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

# 2. Gathering the data
The data for this problem is not included in this repository, but it can be downloaded from https://www.kaggle.com/c/titanic/data.
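
Since the files have to be downloaded by hand, a quick check that they are in place can save a confusing error later. This is only a sketch: the `./data` folder and the two file names match the `read_csv` calls used later in this notebook, while the helper name `missing_data_files` is made up for illustration.

```python
from pathlib import Path

# File names used by the read_csv calls later in this notebook.
EXPECTED_FILES = ("train.csv", "test.csv")


def missing_data_files(data_dir="./data"):
    """Return the expected Kaggle files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_data_files()
if missing:
    print(f"Missing {missing}; download them from https://www.kaggle.com/c/titanic/data")
```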

# 3. Prepare the data
The data is already provided as .csv files in a deliberately easy-to-use layout, so the only requirement here is data cleaning.

We will start by importing the necessary libraries.

In [18]:
import sys
import pandas as pd
import matplotlib
import numpy as np
import scipy as sp
import IPython
from IPython import display
import sklearn

import random 
import time

import warnings
warnings.filterwarnings('ignore')
print('-'*25)

-------------------------


In [19]:
#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix

#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8

In [20]:
data_raw = pd.read_csv("./data/train.csv")
data_val = pd.read_csv("./data/test.csv") # not sure that the test data should be used for validation but this is what the tutorial says to do
data1 = data_raw.copy(deep=True) # copy of the training data used for playing with
data_cleaner = [data1, data_val]

print(data_raw.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None


## Data inspection
- Survived: our target (dependent) variable: 1 for survived and 0 for died
- Independent variables:
    - PassengerId & Ticket: assumed to be random unique identifiers, so they can be dropped from the dataset when training / predicting
    - Pclass: ordinal datatype for the ticket class (basically the social standing of the passenger): 1 is upper, 2 is middle and 3 is lower class
    - Name: nominal datatype that we could use for feature engineering, to extract a title (Mr/Mrs) or a family size from the surname
    - Sex and Embarked: nominal datatypes
    - Age and Fare: continuous quantitative datatypes
    - SibSp: represents the number of siblings/spouses aboard, and Parch represents the number of parents/children aboard. We can combine these to create a family size
    - Cabin: nominal datatype that could be used in feature engineering to approximate the position on the ship when the incident occurred, but there are so many null values that it adds little value and can therefore be dropped
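
The two feature-engineering ideas in the list above (family size from SibSp and Parch, title from Name) can be sketched on a few toy rows. This is an illustration under assumptions rather than the tutorial's own code: the column names `FamilySize` and `Title` are my own, and the regular expression assumes every name follows the "Surname, Title. Given names" pattern.

```python
import pandas as pd

# Toy rows written in the same "Surname, Title. Given names" format as Name.
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# Family size: siblings/spouses + parents/children + the passenger themselves.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Title: the text between the comma and the first full stop.
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(toy[["FamilySize", "Title"]])
```

Adding 1 for the passenger is a design choice: it means someone travelling alone gets a family size of 1 rather than 0.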

## The 4 C's of Cleaning
- Correcting: There do not appear to be any aberrant or unacceptable data inputs, so we can leave correcting until later inspection requires it
- Completing: There are missing values in the Age, Cabin, and Embarked fields (plus one missing Fare in the test set). Here we impute Age and Fare with the median, drop the Cabin attribute, and impute Embarked with the mode. These decisions are basic, and more complex approaches could be investigated at a later date
- Creating: the most obvious feature to create is a title extracted from the Name field (family size from SibSp and Parch is another candidate)
- Converting: There are no date or currency formats to worry about, but the object-typed columns must be handled: they can be converted to categorical variables
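
The converting step can be sketched as follows. This is a minimal illustration on a toy frame, not the tutorial's code: the `_Code` column names are hypothetical, and pandas category codes are only one option (the `LabelEncoder` imported earlier is another).

```python
import pandas as pd

# Toy frame with the two nominal columns discussed above.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})

for col in ["Sex", "Embarked"]:
    # Turn the object column into a pandas categorical ...
    toy[col] = toy[col].astype("category")
    # ... and keep an integer code per category for models that need numbers.
    toy[f"{col}_Code"] = toy[col].cat.codes

print(toy.dtypes)
print(toy)
```

The codes follow the sorted order of the categories, so they carry no meaning beyond identity. For nominal variables such as Embarked, one-hot encoding (for example `pd.get_dummies`) avoids implying an order.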

In [24]:
print(f"Train columns with null values:\n{data1.isnull().sum()}")
print('-'*20)

print(f'Test/Validation columns with null values:\n {data_val.isnull().sum()}')
print("-"*20)

data_raw.describe(include='all')

Train columns with null values:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
--------------------
Test/Validation columns with null values:
 PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
--------------------


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891 | 891.0 | 204 | 889 |
| unique | | | | 891 | 2 | | | | 681 | | 147 | 3 |
| top | | | | Braund, Mr. Owen Harris | male | | | | 347082 | | B96 B98 | S |
| freq | | | | 1 | 577 | | | | 7 | | 4 | 644 |
| mean | 446.0 | 0.383838 | 2.308642 | | | 29.699118 | 0.523008 | 0.381594 | | 32.204208 | | |
| std | 257.353842 | 0.486592 | 0.836071 | | | 14.526497 | 1.102743 | 0.806057 | | 49.693429 | | |
| min | 1.0 | 0.0 | 1.0 | | | 0.42 | 0.0 | 0.0 | | 0.0 | | |
| 25% | 223.5 | 0.0 | 2.0 | | | 20.125 | 0.0 | 0.0 | | 7.9104 | | |
| 50% | 446.0 | 0.0 | 3.0 | | | 28.0 | 0.0 | 0.0 | | 14.4542 | | |
| 75% | 668.5 | 1.0 | 3.0 | | | 38.0 | 1.0 | 0.0 | | 31.0 | | |
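
The null counts are easier to judge as a share of the 891 training rows. The sketch below just redoes that arithmetic from the counts printed above. On the real frame, `data1.isnull().mean() * 100` should give the same figures directly.

```python
import pandas as pd

n_rows = 891  # training rows reported by data_raw.info()

# Null counts copied from the training-set output above.
null_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Share of missing values per column, in percent.
missing_pct = (null_counts / n_rows * 100).round(1)
print(missing_pct)
```

Cabin is missing in roughly 77% of rows, which is the reason it is dropped rather than imputed, while Age (about 20%) and Embarked (well under 1%) are cheap to fill in.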


Now beginning the cleaning.

In [26]:
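for dataset in data_cleaner:
    # Complete missing Age values with the column median
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())

    # Complete missing Embarked values with the most frequent port
    dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])

    # Complete missing Fare values with the column median
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())

# Drop the identifier columns and the mostly-empty Cabin column from the training copy
data1.drop(['PassengerId', 'Cabin', 'Ticket'], axis=1, inplace=True)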

print(data1.isnull().sum())
print('-'*20)
print(data_val.isnull().sum())

Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
--------------------
PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
dtype: int64
