In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
from math import sqrt
%matplotlib inline
np.set_printoptions(precision=3)
fig_width = 6.9
golden_mean = (sqrt(5)-1.0)/2.0    # Aesthetic ratio
fig_height = fig_width*golden_mean # height in inches

params = {
    'text.latex.preamble': r'\usepackage{gensymb}',  # must be a string in Matplotlib >= 3.3
    'font.size': 10,
    'axes.labelsize': 10,  # font size for x and y labels
    'axes.titlesize': 12,
    'legend.fontsize': 8,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10,
    'text.usetex': True,  # requires a working LaTeX installation
    'figure.figsize': [fig_width, fig_height],
    'font.family': 'serif'
}
rcParams.update(params)


In this post I will show you, step by step, how to create a machine learning experiment with Scikit-learn that lets you predict whether you or your friends would have survived the sinking of the Titanic!


This recipe is based on a [Kaggle competition](https://www.kaggle.com/c/titanic) where the goal is to predict survival on the Titanic, based on real data. [Kaggle](https://www.kaggle.com/competitions) hosts machine learning competitions where anyone can download a dataset, train a model, and test the predictions on the website. The author of the best model wins a prize. It is a fun way to get started with machine learning.


## Load The Data

In [None]:
data = pd.read_csv('Data/Titanic.csv')

Scikit-learn expects numeric values and no blanks, so we first need to do a bit of data wrangling.

In [None]:
# 'Sex' is stored as a text value. We should convert (or 'map') it into numeric binaries 
# so it will be ready for scikit-learn.
data['Sex'] = data['Sex'].map({'male': 0,'female': 1})

In [None]:
# Let's also drop the 'Cabin', 'Embarked', 'Name' and 'Ticket' columns
data = data.drop(['Cabin', 'Embarked', 'Name', 'Ticket'], axis=1)

In [None]:
# Handle missing data
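# Fill missing ages with the median age. Assigning the result back avoids
# the chained in-place fillna, which recent pandas versions warn about.
data['Age'] = data['Age'].fillna(data['Age'].median())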

In [None]:
data.head()
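
After the wrangling steps, it can be worth confirming that no blanks remain and that every column is numeric. The sketch below repeats the same two steps on a tiny made-up frame (the `toy` DataFrame and its values are assumptions for illustration, not the Titanic data) and then runs the check.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 30.0, 40.0],
})

# Same steps as above: encode 'Sex' and fill missing ages with the median.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Count the remaining missing values per column; every count should be 0.
missing = toy.isnull().sum()
print(missing)

# Check that every column now has a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)
print(all_numeric)
```

Running `data.isnull().sum()` on the real frame gives the same kind of per-column summary. Note that `map` turns any value missing from the mapping into `NaN`, so this check also catches unexpected categories.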

### Arrange data into a features matrix and target vector


In [None]:
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
X = data[features]
y = data.Survived

## Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here is what we are going to cover in this step:

* Separate out a validation dataset.
* Set-up the test harness to use 10-fold cross validation.
* Build 6 different models to predict survival from passenger features.
* Select the best model.

#### Create a Validation Dataset

A better sense of a model's performance can be found using what's known as a holdout set: that is, we hold back some subset of the data from the training of the model, and then use this holdout set to check the model performance. This splitting can be done using the train_test_split utility in Scikit-Learn. We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.

In [None]:
from sklearn.model_selection import train_test_split
# hold back 20% of the data as a validation set and train on the remaining 80%
validation_size = 0.20
seed = 7
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_size, random_state=seed)

We use 'accuracy' as the metric to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). Note that scikit-learn reports accuracy as a fraction between 0 and 1 rather than as a percentage.
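
As a small worked example of this definition, the sketch below computes accuracy by hand and with scikit-learn's `accuracy_score`. The label arrays `y_true` and `y_hat` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 1 = survived, 0 = did not survive.
y_true = np.array([1, 0, 0, 1, 1, 0, 0, 1, 0, 0])
y_hat = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])

# Fraction of positions where the prediction matches the true label.
manual = (y_true == y_hat).mean()

# The same quantity computed by scikit-learn.
sk = accuracy_score(y_true, y_hat)

print(f"manual: {manual:.2f}, sklearn: {sk:.2f}")
```

Here 8 of the 10 predictions match, so both values are 0.80, i.e. 80% accuracy.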

### Build Models

We don't know in advance which algorithms will do well on this problem or which configurations to use, so we compare several of them with their default settings.

Let's evaluate 6 different algorithms:

* K-Nearest Neighbors (KNN)
* Gaussian Naive Bayes (NB)
* Logistic Regression (LR)
* Support Vector Machines (SVM)
* Decision Tree
* Random Forest


### Define cross-validation

Instead of relying on a single split, cross-validation splits the data repeatedly and trains multiple models. The most commonly used version of cross-validation is **k-fold cross-validation**, where k is a user-specified number, usually 5 or 10. We will use 10-fold cross-validation to estimate accuracy. This splits our training set into 10 parts, trains on 9 and tests on the remaining 1, and repeats until each part has been used for testing once.

Cross-validation is implemented in scikit-learn by the **cross_val_score** function from the model_selection module.
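
To make the splitting concrete, here is a minimal sketch that prints the train and test indices produced by `KFold`. It assumes a toy array of 10 samples and uses 5 folds instead of 10 to keep the output short; `X_toy` is a made-up name, not part of the Titanic data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples with 2 features each (values are arbitrary).
X_toy = np.arange(20).reshape(10, 2)

# shuffle=True is needed whenever random_state is set.
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

test_indices = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_toy)):
    # Each fold trains on 8 samples and tests on the remaining 2.
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
    test_indices.extend(test_idx.tolist())
```

Every sample lands in the test part of exactly one fold, which is why the averaged score uses all of the data for evaluation without ever testing a model on rows it was trained on.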

In [None]:
# Define models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Test options and evaluation metric
seed = 7
scoring = 'accuracy'

def classifiers(X_train, y_train):
    """Print the mean 10-fold cross-validation accuracy of each model and return the fold scores."""
    knn = KNeighborsClassifier()
    gnb = GaussianNB()
    logistic = LogisticRegression()
    svc = svm.SVC()
    dTree = tree.DecisionTreeClassifier()
    rForest = RandomForestClassifier()

    names = ["Nearest Neighbors", "Naive Bayes", "Logistic", "RBF SVM", "Decision Tree", "Random Forest"]
    models = [knn, gnb, logistic, svc, dTree, rForest]

    # shuffle=True is required when random_state is set;
    # recent scikit-learn versions raise an error otherwise.
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)

    results = []
    for name, model in zip(names, models):
        cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        print("%s: %f" % (name, cv_results.mean()))
    return results

In [None]:
results = classifiers(X_train, y_train)

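Random Forest appears to have the highest estimated accuracy. Because Random Forest is a randomized algorithm and no seed is fixed for it here, the exact scores can vary slightly between runs.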



## Make Predictions

The Random Forest algorithm was the most accurate model that we tested. Now we want to get an idea of its accuracy on our validation set. For comparison, we first fit a Logistic Regression model and then the Random Forest.

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_predicted = logreg.predict(X_test)
print("Prediction accuracy: %f" % logreg.score(X_test, y_test))

In [None]:
rForest = RandomForestClassifier()
rForest.fit(X_train, y_train) 
y_pred = rForest.predict(X_test)
print("Prediction accuracy: %f" % rForest.score(X_test, y_test))
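
A single accuracy number hides which kinds of mistakes a model makes. The sketch below shows how `confusion_matrix` and `classification_report` (both imported earlier) could be used for a closer look. To stay self-contained it trains on synthetic data from `make_classification`, which is an assumption for illustration; in the notebook you would pass `y_test` and `y_pred` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the Titanic features.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=7)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=7
)

# Fixing random_state makes the forest reproducible between runs.
model = RandomForestClassifier(random_state=7)
model.fit(Xd_train, yd_train)
yd_pred = model.predict(Xd_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(yd_test, yd_pred)
print(cm)

# Precision, recall and F1 score per class.
print(classification_report(yd_test, yd_pred))

acc = accuracy_score(yd_test, yd_pred)
print("Prediction accuracy: %f" % acc)
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the total number of validation samples equals the accuracy.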


## Exercise 

* Apply the learned concept on 