In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
from math import sqrt
%matplotlib inline
np.set_printoptions(precision=3)
fig_width = 6.9
golden_mean = (sqrt(5)-1.0)/2.0    # Aesthetic ratio
fig_height = fig_width*golden_mean # height in inches

params = {
    # newer Matplotlib releases expect a string here rather than a list
    'text.latex.preamble': r'\usepackage{gensymb}',
    'font.size': 10,
    'axes.labelsize': 10,   # fontsize for x and y labels
    'axes.titlesize': 12,
    'legend.fontsize': 8,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10,
    'text.usetex': True,    # requires a working LaTeX installation
    'figure.figsize': [fig_width, fig_height],
    'font.family': 'serif',
}
rcParams.update(params)

# Introduction

We aim to give an accessible introduction to applying machine learning techniques with [scikit-learn](http://scikit-learn.org/) to your own projects and datasets. Among the ML libraries available today, scikit-learn shines as one of the best options.

[Scikit-learn](http://scikit-learn.org/) is an open source Python library designed to tackle machine learning problems from beginning to end. It is used and praised by large companies such as Evernote and Spotify, as shown [here](http://scikit-learn.org/stable/testimonials/testimonials.html).


In this post I will show you, step by step, how to create a machine learning experiment with Scikit-learn that lets you predict whether you or your friends would have survived the sinking of the Titanic!


## Scikit-Learn's Estimator API

Every machine learning algorithm in Scikit-Learn is implemented via the Estimator API, which provides a consistent interface for a wide range of machine learning applications.

The steps in using the Scikit-Learn estimator API are as follows:

1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired values.
   * Hyperparameters are parameters that are not learnt directly within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator class.
3. Arrange data into a features matrix and target vector, as described in the Data Representation section below.
4. Fit the model to your data by calling the **fit()** method of the model instance.
5. Apply the model to new data:
   * For supervised learning, we often predict labels for unknown data using the **predict()** method.
   * For unsupervised learning, we often transform or infer properties of the data using the **transform()** or **predict()** method.
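
The five steps above can be sketched in a few lines. This is a minimal illustration on synthetic data generated with `make_classification`, not the Titanic data; the names `X_demo`, `y_demo`, `demo_model` and `demo_pred` are made up for this example, and `C=1.0` is simply the default regularisation strength written out explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression   # 1. choose a model class

# 3. arrange data into a features matrix and target vector
#    (synthetic stand-in data, assumed only for this sketch)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

demo_model = LogisticRegression(C=1.0)      # 2. choose hyperparameters at instantiation
demo_model.fit(X_demo, y_demo)              # 4. fit the model to the data
demo_pred = demo_model.predict(X_demo[:5])  # 5. predict labels for (here, the first five) samples

print(demo_pred.shape)
```

Swapping in a different estimator class only changes steps 1 and 2; the `fit()` and `predict()` calls stay the same.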
  
Let us apply these steps in the following example.


## Data Representation in Scikit-Learn

### Features matrix

A two-dimensional numerical array or matrix with shape [$n_{samples}$, $n_{features}$]. By convention, this features matrix is often stored in a variable named $X$.

The samples (i.e., rows) always refer to the individual objects described by the dataset.

The features (i.e., columns) always refer to the distinct observations that describe each sample.


### Target array

A one-dimensional array with length $n_{samples}$, usually holding the quantity we want to predict from the data. By convention it is often stored in a variable named $y$.
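
As a small sketch of these two shapes, the toy frame below reuses a few `Age`, `Fare` and `Survived` values like those in the Titanic data; the names `toy`, `X_toy` and `y_toy` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Four samples (rows), two features (columns) and one target column
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
    'Survived': [0, 1, 1, 1],
})

X_toy = toy[['Age', 'Fare']]   # features matrix: [n_samples, n_features]
y_toy = toy['Survived']        # target array: length n_samples

print(X_toy.shape)  # (4, 2)
print(y_toy.shape)  # (4,)
```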

## Load The Data

In [2]:
data = pd.read_csv('Data/Titanic.csv')

Scikit-learn expects numeric values and no blanks, so first we need to do a bit more wrangling.

In [3]:
# 'Sex' is stored as a text value. We should convert (or 'map') it into numeric binaries 
# so it will be ready for scikit-learn.
data['Sex'] = data['Sex'].map({'male': 0,'female': 1})

In [4]:
# Let's also drop the 'Cabin', 'Embarked', 'Name' and 'Ticket' columns
data = data.drop(['Cabin', 'Embarked', 'Name', 'Ticket'], axis=1)

In [5]:
# Handle missing data
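# Assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions warn about and may not apply to the DataFrame
data['Age'] = data['Age'].fillna(data['Age'].median())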
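
To see what median imputation does, here is a tiny sketch on a made-up `ages` series (a hypothetical stand-in, not the real `Age` column): the median is computed over the non-missing values and then written into the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry
ages = pd.Series([22.0, np.nan, 38.0, 26.0], name='Age')

median_age = ages.median()          # NaN is skipped: median of 22, 26, 38
filled = ages.fillna(median_age)    # returns a new Series with the gap filled

print(median_age)            # 26.0
print(filled.tolist())       # [22.0, 26.0, 38.0, 26.0]
print(filled.isnull().sum()) # 0
```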

In [6]:
data.head()

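|   | PassengerId | Survived | Pclass | Sex | Age  | SibSp | Parch | Fare    |
|---|-------------|----------|--------|-----|------|-------|-------|---------|
| 0 | 1           | 0        | 3      | 0   | 22.0 | 1     | 0     | 7.25    |
| 1 | 2           | 1        | 1      | 1   | 38.0 | 1     | 0     | 71.2833 |
| 2 | 3           | 1        | 3      | 1   | 26.0 | 0     | 0     | 7.925   |
| 3 | 4           | 1        | 1      | 1   | 35.0 | 1     | 0     | 53.1    |
| 4 | 5           | 0        | 3      | 0   | 35.0 | 0     | 0     | 8.05    |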


### Arrange data into a features matrix and target vector


In [9]:
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
X = data[features]
y = data.Survived

## Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here is what we are going to cover in this step:

* Separate out a validation dataset.
* Set up the test harness to use 10-fold cross-validation.
* Build several different models to predict survival from the passenger features.
* Select the best model.

#### Create a Validation Dataset

A better sense of a model's performance can be found using what's known as a holdout set: that is, we hold back some subset of the data from the training of the model, and then use this holdout set to check the model performance. This splitting can be done using the train_test_split utility in Scikit-Learn. We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.

In [11]:
from sklearn.model_selection import train_test_split
# hold back 20% of the data as a validation set and train on the remaining 80%
validation_size = 0.20
seed = 7
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_size, random_state=seed)
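
As a quick sanity check of what the split produces, the sketch below runs the same call on a made-up array of 1000 rows (the names `X_fake`, `y_fake` and the row count are assumptions for illustration only) and confirms both the 80/20 sizes and that a fixed `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 samples, 2 features, alternating 0/1 labels
X_fake = np.arange(1000 * 2).reshape(1000, 2)
y_fake = np.arange(1000) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_fake, y_fake, test_size=0.20, random_state=7)
print(X_tr.shape, X_te.shape)  # (800, 2) (200, 2)

# The same random_state gives the same split on a second call
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_fake, y_fake, test_size=0.20, random_state=7)
print(np.array_equal(X_te, X_te2))  # True
```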

In [12]:
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'


We use accuracy as the metric to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset. It is often multiplied by 100 and quoted as a percentage (e.g. 95% accurate); scikit-learn reports it as a fraction between 0 and 1.
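
A short worked example of this ratio, using made-up label arrays (`y_true_toy` and `y_pred_toy` are hypothetical): four of five predictions match, so the accuracy is 4/5.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels; they differ only at index 2
y_true_toy = np.array([1, 0, 1, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0])

# Fraction of positions where the prediction equals the true label
manual_accuracy = (y_true_toy == y_pred_toy).mean()
library_accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(manual_accuracy)   # 0.8
print(library_accuracy)  # 0.8
```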

### Build Models

We don’t know in advance which algorithms will do well on this problem or which configurations to use, so we try several and compare them.

Let’s evaluate five different algorithms:

* Logistic Regression (LR)
* K-Nearest Neighbors (KNN)
* Random Forest
* Gaussian Naive Bayes (NB)
* Support Vector Machines (SVM)


### LOGISTIC REGRESSION
Logistic regression models the probability of each class and places a decision boundary between the possibilities. With two classes it looks for a straight line (a hyperplane when there are more than two features) that best separates the training data.

In [23]:
# defining the model with its associated parameters
from sklearn.linear_model import LogisticRegression   # 1. choose model class
model = LogisticRegression()                          # 2. instantiate model (default hyperparameters)
model.fit(X_train, y_train)                           # 3. fit model to the training data
y_pred = model.predict(X_test)                        # 4. predict on new data

Finally, we can use the accuracy_score utility to see the fraction of predicted labels that match their true value:

In [24]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.75418994413407825

Thus our model achieves a test accuracy of about 75%.
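
The "straight line" idea can be checked directly. The sketch below, on synthetic data rather than the Titanic set (all names here are made up for illustration), rebuilds the predictions of a fitted binary `LogisticRegression` from its learned `coef_` and `intercept_`: a sample is assigned class 1 when the linear score is positive.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data, assumed only for this sketch
X_lin, y_lin = make_classification(n_samples=200, n_features=4, random_state=0)
lin_model = LogisticRegression().fit(X_lin, y_lin)

# Linear score w.x + b for every sample; its sign decides the predicted class
linear_scores = X_lin @ lin_model.coef_.ravel() + lin_model.intercept_[0]
manual_pred = (linear_scores > 0).astype(int)

print(np.array_equal(manual_pred, lin_model.predict(X_lin)))
```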

### RANDOM FOREST

A random forest is a 'meta estimator'. It will fit a number of decision trees (we'll have to tell it how many) on various sub-samples of the dataset. Then it will use averaging to improve the predictive accuracy and control over-fitting.

In [25]:
from sklearn.ensemble import RandomForestClassifier
rForest = RandomForestClassifier()
rForest.fit(X_train, y_train) 
y_pred = rForest.predict(X_test)  
accuracy_score(y_test, y_pred)

0.79329608938547491

Thus our model achieves a test accuracy of about 79%.
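
To make the "number of trees" and "averaging" parts concrete, here is a small sketch on synthetic data (not the Titanic set; `n_estimators=25` and all variable names are arbitrary choices for illustration). It sets the number of trees explicitly and compares the forest's class probabilities with the mean of its individual trees' probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data, assumed only for this sketch
X_rf, y_rf = make_classification(n_samples=200, n_features=4, random_state=0)

# n_estimators tells the forest how many decision trees to fit
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_rf, y_rf)
print(len(forest.estimators_))  # 25

# Average the per-tree class probabilities and compare with the forest's own output
tree_mean = np.mean([t.predict_proba(X_rf) for t in forest.estimators_], axis=0)
print(np.allclose(tree_mean, forest.predict_proba(X_rf)))
```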

##### Well that was easy! But how can we find out how well it worked?

### Define cross-validation

In cross-validation, the data is instead split repeatedly and multiple models are trained.  The most commonly used version of cross-validation is **k-fold cross-validation**, where k is a user-specified number, usually 5 or 10. We will use 10-fold cross validation to estimate accuracy. This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.

Cross-validation is implemented in scikit-learn using the **cross_val_score function** from the model_selection module.
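
A minimal sketch of 10-fold cross-validation with `cross_val_score`, again on synthetic data rather than the Titanic set (the variable names are made up for this example). Each of the ten scores is the accuracy on one held-out fold; the mean and standard deviation summarise them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic two-class data, assumed only for this sketch
X_cv, y_cv = make_classification(n_samples=200, n_features=4, random_state=0)

# cv=10 splits the data into 10 folds: train on 9, score on the remaining 1, repeat
cv_scores = cross_val_score(LogisticRegression(), X_cv, y_cv,
                            cv=10, scoring='accuracy')

print(len(cv_scores))  # 10
print(cv_scores.mean(), cv_scores.std())
```

Reporting the mean together with the spread gives a steadier estimate than the single train/test split used earlier.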

In [27]:
# Define models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

def classifiers(X_train,y_train,X_test,y_test):
    knn = KNeighborsClassifier()
    gnb = GaussianNB()
    logistic = LogisticRegression()
    svc = svm.SVC()
    dTree = tree.DecisionTreeClassifier()
    rForest = RandomForestClassifier()
    
    
    names = ["Nearest Neighbors", "Naive Bayes","Logistic","RBF SVM","Decision Tree","Random Forest"]
    model = [knn, gnb, logistic, svc, dTree, rForest]
    
    
    for (i, clf) in enumerate(model):  # iterate over the list of models, not the function itself