# Machine Learning Part 1

Machine learning is learning from data or examples.

The training data is the data used to learn the pattern that can predict outcomes. It consists of two components: the input features (such as age, fare, gender) and the output feature (such as survived/not survived). Input features are also called predictors, because they are used to predict the output. The output is also called response or outcome.

The training data can be thought of as a spreadsheet, listing inputs against outputs. Each row is considered an observation or sample. The process of representing each observation as an input is known formally as representation.
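As a rough illustration of this spreadsheet view, the sketch below stores a few made-up observations as rows and separates the input features from the output. The passenger values and the lower-case column names are invented for illustration; they are not taken from the Titanic files.

```python
# Hypothetical observations: each dict is one row (observation) of the spreadsheet.
rows = [
    {"age": 22.0, "fare": 7.25, "is_male": 1, "survived": 0},
    {"age": 38.0, "fare": 71.28, "is_male": 0, "survived": 1},
    {"age": 26.0, "fare": 7.93, "is_male": 0, "survived": 1},
]

feature_names = ["age", "fare", "is_male"]  # input features (predictors)
target_name = "survived"                    # output (response)

# X holds one list of feature values per observation; y holds the matching outputs.
X = [[row[name] for name in feature_names] for row in rows]
y = [row[target_name] for row in rows]

print(X[0], y[0])
```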

The process of learning the pattern so it can be used to predict the outcome of unseen combinations of features (instead of looking up known combinations) is called generalisation and is the key to machine learning: this is what allows an ML algorithm to be used in the wild on new data.

The test data has only input features, no outcomes. This is the data the model predicts outcomes for.

The process of using training data to learn patterns is called supervised learning, because the training data has input features and output features.

In the Titanic data, the output is either 0 (did not survive) or 1 (survived), which is a discrete output. This is a classification problem: the goal has two (or more) discrete outcomes, not a range of continuous possibilities.

Problems that have continuous real numbers as outcomes are called regression problems. For example, predicting car mileage from a range of possible car features.

Classification and regression problems are both supervised learning problems - they both have inputs and outcomes in the training data.


### Unsupervised learning
Unsupervised learning is used to infer a function that finds hidden patterns in 'unlabelled data', i.e. data with no classification or categorisation in the observations. An example would be customer segmentation for targeted marketing. This specific type is called clustering: using data to cluster customers according to features, so that different marketing tactics can be aimed at the different groups.
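To make clustering a little more concrete, here is a minimal sketch using k-means, which is one common clustering algorithm (the notes above don't name a specific one, so that choice is an assumption). The customer spend figures and the function name `kmeans_1d` are made up for illustration.

```python
def kmeans_1d(values, initial_centroids, iterations=10):
    """Very small k-means for one feature: assign each value to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # keep the old centroid if a cluster ends up empty
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer; note there are no labels at all.
spend = [12, 15, 14, 90, 95, 110]
centroids, clusters = kmeans_1d(spend, initial_centroids=[min(spend), max(spend)])
print(clusters)
```

The low spenders and high spenders end up in separate groups without anyone having labelled them first, which is the point of unsupervised learning.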


## Titanic Challenge

This has a binary output, so it is called a binary classification task. It can be trickier to work a problem with more than two classes, but the basic concepts in these notebooks can be applied to a multi-class problem.

The pattern used to predict the outcomes is also known as a classifier or classifier model.

## Classifiers

If you had a dataset with only two features and plotted it on a graph, the line (or curve, or whatever shape) that separates the group of observations with one outcome from the other observations is called the 'decision boundary' (also known as the separation boundary). If you have three features, they can be plotted in three dimensions and the decision boundary becomes a plane. With many features, the boundary becomes a multi-dimensional hyperplane.

That boundary is what a machine learning algorithm is trying to find, so it can accurately put new data into one of the categories based on its features.
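A minimal sketch of what using such a boundary looks like for two features, assuming the boundary is a straight line. The weights and bias here are hypothetical numbers picked by hand; a real algorithm would learn them from the training data.

```python
# Hypothetical linear boundary for two features: w1*x1 + w2*x2 + b = 0.
weights = [1.0, -1.0]
bias = 0.5

def classify(point):
    """Return 1 if the point falls on the positive side of the boundary, else 0."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score > 0 else 0

print(classify([2.0, 1.0]))  # score is 2.0 - 1.0 + 0.5 = 1.5
print(classify([0.0, 3.0]))  # score is 0.0 - 3.0 + 0.5 = -2.5
```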

There are many types of classifiers, from simple ones (such as logistic regression) to more complex, fancy ones including support vector machines, neural networks, and random forest classifiers.

## Performance metrics

To judge which is the best model to use for a problem and pick the right one, you need performance metrics. There are many, but the course I'm doing will focus on the ones that are best for classifier problems, specifically binary classification problems, because it's working on the Titanic challenge dataset.

Accuracy is the most commonly used performance metric, and it's the one Kaggle uses for the Titanic challenge. Other metrics include precision and recall.

If you have made your predictions and you have access to the actual output, you can compare them to judge the accuracy of the model. Is this why we divide the training data into training and validation sets? The measure is the percentage of predicted outcomes that match the actual outcomes. Accuracy should be as high as possible!

Precision and recall are particularly suited to binary classification. They are built from four counts: true negatives (predicted negative, actually negative), true positives (predicted positive, actually positive), false negatives (the model predicted negative, actual was positive), and false positives (the model predicted positive, actual was negative). You can make a 2x2 table of these counts, which is called a **confusion matrix** because it gives you an idea of where the model got confused and produced the wrong outcomes. You can use this to answer questions about the outcomes.
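The four counts can be tallied by hand, which helps make the table less abstract. This is a sketch with made-up labels; `confusion_counts` is a hypothetical helper name, and the `[[TN, FP], [FN, TP]]` layout is chosen to match the scikit-learn output shown later in this notebook.

```python
def confusion_counts(actual, predicted):
    """Count true negatives, false positives, false negatives, true positives."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:  # a == 1 and p == 0
            fn += 1
    return tn, fp, fn, tp

# Hypothetical actual and predicted outcomes for eight observations.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(actual, predicted)
matrix = [[tn, fp], [fn, tp]]  # rows: actual 0/1, columns: predicted 0/1
print(matrix)
```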

What percentage of positive predictions are correct? This is the count of true positive predictions divided by the sum of all predicted positive outcomes (true and false). The percentage is the **precision**.

What percentage of positive outcomes were predicted correctly? This is true positive cases divided by the sum of actual positive cases (true positive plus false negative). This fraction is known as **recall**.
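Written out as arithmetic, the two definitions above look like this. The counts are hypothetical (100 observations in total), just to show the formulas side by side with accuracy.

```python
# Hypothetical confusion matrix counts.
tp, fp, fn, tn = 40, 10, 20, 30

precision = tp / (tp + fp)                  # of the predicted positives, how many were right
recall = tp / (tp + fn)                     # of the actual positives, how many were found
accuracy = (tp + tn) / (tp + fp + fn + tn)  # all correct predictions over all observations

print('precision: {0:.2f}'.format(precision))
print('recall: {0:.2f}'.format(recall))
print('accuracy: {0:.2f}'.format(accuracy))
```

Note that precision and recall can differ quite a bit even when accuracy looks reasonable, which is why it's worth looking at all three.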

## Classifier evaluation

So how is the model actually evaluated? A popular method of evaluating classifiers is called train-test split. Take the training data and split it into two sets: training and validation/test (not to be confused with the test data from the challenge files). The training data is used to train the model, and then predictions are made using the validation/test dataset. The predictions are compared to the outcomes in the validation/test data to evaluate the model's performance.

The common split is 80/20 (80% training). It can be any split you want, but 80/20 is the most common.
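A rough sketch of what an 80/20 split does under the hood: shuffle the row indices, then cut them into two groups. This is a hand-rolled illustration using only the standard library; `split_indices` is a hypothetical helper, and library implementations may shuffle and round differently.

```python
import random

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the row indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two sets, so the model is never evaluated on rows it was trained on.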

## What is a baseline model

As best practice, it's a good idea to create a baseline model that doesn't use any machine learning. In a classifier problem, the baseline model always gives the output of the majority class.

So in a binary classification example, if the outputs are 1 and 0 and 60 out of 100 observations have an output of 1, then the most basic baseline model will always return 1. It will have an accuracy of 60%. So your machine learning models should have an accuracy of over 60% to beat the baseline; otherwise it makes no sense to use a machine learning model.
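The 60-out-of-100 example can be checked in a few lines. This is a sketch with no machine learning involved; `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the majority class and the accuracy of always predicting it."""
    majority_class, majority_count = Counter(labels).most_common(1)[0]
    return majority_class, majority_count / len(labels)

# 60 observations with output 1 and 40 with output 0, as in the example above.
labels = [1] * 60 + [0] * 40
majority_class, accuracy = majority_baseline_accuracy(labels)
print(majority_class, accuracy)
```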

## Preparing data for the model

First, import the data and the libraries.

In [1]:
#imports
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
%matplotlib inline

processed_data_path = os.path.join(os.getcwd(), 'data', 'processed')
read_train_path = os.path.join(processed_data_path, 'train.csv')
read_test_path = os.path.join(processed_data_path, 'test.csv')

train_df = pd.read_csv(read_train_path, index_col='PassengerId')
test_df = pd.read_csv(read_test_path, index_col='PassengerId')
df = pd.concat([train_df, test_df], axis=0)

In [2]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 33 columns):
Survived              891 non-null int64
Age                   891 non-null float64
Fare                  891 non-null float64
FamilySize            891 non-null int64
IsMother              891 non-null int64
IsMale                891 non-null int64
Deck_A                891 non-null int64
Deck_B                891 non-null int64
Deck_C                891 non-null int64
Deck_D                891 non-null int64
Deck_E                891 non-null int64
Deck_F                891 non-null int64
Deck_G                891 non-null int64
Deck_Z                891 non-null int64
Pclass_1              891 non-null int64
Pclass_2              891 non-null int64
Pclass_3              891 non-null int64
Title_Lady            891 non-null int64
Title_Master          891 non-null int64
Title_Miss            891 non-null int64
Title_Mr              891 non-null int64
Title_Mrs             891 non-

In [3]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 32 columns):
Age                   418 non-null float64
Fare                  418 non-null float64
FamilySize            418 non-null int64
IsMother              418 non-null int64
IsMale                418 non-null int64
Deck_A                418 non-null int64
Deck_B                418 non-null int64
Deck_C                418 non-null int64
Deck_D                418 non-null int64
Deck_E                418 non-null int64
Deck_F                418 non-null int64
Deck_G                418 non-null int64
Deck_Z                418 non-null int64
Pclass_1              418 non-null int64
Pclass_2              418 non-null int64
Pclass_3              418 non-null int64
Title_Lady            418 non-null int64
Title_Master          418 non-null int64
Title_Miss            418 non-null int64
Title_Mr              418 non-null int64
Title_Mrs             418 non-null int64
Title_Officer         418 n

## Data preparation

Most machine learning algorithms expect numerical arrays, so these need to be created for input and output.

In [4]:
# this takes all rows and all columns except Survived (i.e. from Age onwards) and creates a matrix of float values
# (as_matrix() was removed in pandas 1.0; to_numpy() is the replacement)
X = train_df.loc[:, 'Age':].to_numpy().astype('float')

# this creates a one-dimensional numpy array (or vector) of the outputs; the values stay as integers (0 or 1)
y = train_df['Survived'].to_numpy()

In [5]:
# both of these are now numpy arrays (a vector in the case of y, because it's one-dimensional) so the shape attribute 
# can be used.
# It's good practice to use uppercase variables for matrices and lower case for vectors
print(X.shape, y.shape)

(891, 32) (891,)


In [6]:
# now split X into training and test data
# it's good practice to set the random_state and it ensures you get the same selection each time you run this
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(712, 32) (712,)
(179, 32) (179,)
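The 712/179 shapes follow from 891 rows and `test_size=0.2`. As far as I can tell, scikit-learn rounds the test count up and the training count down; the sketch below reproduces the arithmetic under that assumption rather than calling the library.

```python
import math

n_samples = 891
test_size = 0.2

# Assumed rounding: test count rounds up, training count rounds down.
n_test = math.ceil(test_size * n_samples)           # 178.2 rounds up
n_train = math.floor((1 - test_size) * n_samples)   # 712.8 rounds down

print(n_train, n_test)
```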


In [7]:
# average survival in the train and test set
# use the numpy mean function to get the proportion of positive classes
print('the mean survival in train: {0:.3f}'.format(np.mean(y_train)))
print('the mean survival in test: {0:.3f}'.format(np.mean(y_test)))

the mean survival in train: 0.383
the mean survival in test: 0.385


Ideally, the proportion of positive outcomes should be about the same in the train and test sets, which it is here (0.383 and 0.385). It also helps if the classes are roughly balanced, i.e. around 50% positive outcomes. Here about 38% of outcomes are 1 and 62% are 0, which is imbalanced but usable. Datasets that are much more imbalanced can be hard to work with.

For example, in marketing conversion data, the positive outcomes could be only 2-3% of the data. So a different process needs to be followed to build and evaluate models (not covered in the course I'm following).

## Creating a Baseline Model

The DummyClassifier in scikit-learn creates a baseline model that can be used for evaluating the performance of the rest of your models. It's slightly more sophisticated than the manual, no code needed baseline above, but not very. However, it is something that can be used for automated comparisons.

In [8]:
# import function
from sklearn.dummy import DummyClassifier

In [9]:
# create model, specify most_frequent because the baseline model we want will always output the majority class
model_dummy = DummyClassifier(strategy='most_frequent', random_state=0)

In [10]:
# train model
model_dummy.fit(X_train, y_train)

DummyClassifier(constant=None, random_state=0, strategy='most_frequent')

In [11]:
# basic evaluation
# this will predict the output for the test data and then compare predicted output with actual output
# this will output the accuracy, which is the default score
print('score for baseline model: {0:.2f}'.format(model_dummy.score(X_test, y_test)))

score for baseline model: 0.61


What this score is saying is that if you always predict 0 for survived, the majority class, you will have 61% accuracy for the model. Very important! Now you'll know if machine learning algorithms are really better.

There are features in scikit-learn that will allow you to get other performance metrics, such as precision, confusion matrix, etc.

In [12]:
# performance metrics
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

In [13]:
# accuracy score
print('accuracy for baseline model: {0:.2f}'.format(accuracy_score(y_test, model_dummy.predict(X_test))))

accuracy for baseline model: 0.61


In [14]:
# confusion matrix
print('confusion matrix for baseline model:\n {0}'.format(confusion_matrix(y_test, model_dummy.predict(X_test))))

confusion matrix for baseline model:
 [[110   0]
 [ 69   0]]


In [15]:
# from the matrix, we are always predicting zero (negative outcome) never positive
# precision and recall score
print('precision for baseline model: {0:.2f}'.format(precision_score(y_test, model_dummy.predict(X_test))))
print('recall for baseline model: {0:.2f}'.format(recall_score(y_test, model_dummy.predict(X_test))))

precision for baseline model: 0.00
recall for baseline model: 0.00
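Both scores are zero for different reasons, which the counts from the confusion matrix make clear. The sketch below redoes the arithmetic by hand: recall is 0/69, while precision is 0/0 because the model never predicts a positive. My understanding is that scikit-learn reports an undefined precision like this as 0 and emits a warning; `safe_ratio` is a hypothetical helper that mimics that behaviour.

```python
# Counts read off the baseline confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 110, 0, 69, 0

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = safe_ratio(tp, tp + fp)  # 0/0: no positive predictions at all
recall = safe_ratio(tp, tp + fn)     # 0/69: none of the actual positives were found

print('accuracy: {0:.2f}'.format(accuracy))
print('precision: {0:.2f}'.format(precision))
print('recall: {0:.2f}'.format(recall))
```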




## Creating output for a Kaggle submission

In [16]:
# converting the Kaggle test data (not the validation/test data) to a matrix
test_X = test_df.to_numpy().astype('float')

In [17]:
# get predictions
predictions = model_dummy.predict(test_X)

In [18]:
# for a Kaggle submission, each prediction needs to be attached to a PassengerId
df_submission = pd.DataFrame({'PassengerId': test_df.index, 'Survived': predictions})

In [19]:
df_submission.head()

   PassengerId  Survived
0          892         0
1          893         0
2          894         0
3          895         0
4          896         0


In [20]:
submission_data_path = os.path.join(os.getcwd(), 'data', 'external')
submission_file_path = os.path.join(submission_data_path, '01_dummy.csv')

# setting index to False prevents an extra column being added to the output
df_submission.to_csv(submission_file_path, index=False)

You could also write a function to do all of this and then just call the function, which will be useful because it's the same process that will be used for future models.

In [21]:
def get_submission_file(model, filename):
    # converting to the matrix
    test_X = test_df.to_numpy().astype('float')
    # get predictions
    predictions = model.predict(test_X)
    # submission dataframe
    df_submission = pd.DataFrame({'PassengerId': test_df.index, 'Survived': predictions})
    # submission file
    submission_data_path = os.path.join(os.getcwd(), 'data', 'external')
    submission_file_path = os.path.join(submission_data_path, filename)
    #write to file
    df_submission.to_csv(submission_file_path, index=False)

In [22]:
get_submission_file(model_dummy, '01_dummy.csv')

## Logistic Regression model

The logistic regression model uses a sigmoid function to predict the probability of an output. Thinking about it on a graph, if you plot the sigmoid curve (probability on the y axis, your input on the x axis), then for any given input you can read off the probability at the point where that input value meets the curve. If you set the threshold to 0.5, then anything over 0.5 is class 1 and anything under 0.5 is class 0.
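A minimal sketch of the sigmoid and the threshold step. Here `z` stands for the single input on the x axis (in a fitted model it would be the weighted sum of the features plus an intercept); `predict_class` is a hypothetical helper, and treating a probability of exactly 0.5 as class 0 is an assumption, since the description above doesn't cover the tie.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Class 1 if the probability is over the threshold, otherwise class 0."""
    return 1 if sigmoid(z) > threshold else 0

for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), predict_class(z))
```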

In [23]:
# import function
from sklearn.linear_model import LogisticRegression

In [24]:
# create model
model_lr_1 = LogisticRegression(random_state=0)

In [25]:
# train model
model_lr_1.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [26]:
# accuracy score
print('accuracy for logistic regression - version 1: {0:.2f}'.format(model_lr_1.score(X_test, y_test)))

accuracy for logistic regression - version 1: 0.83


The accuracy score for this model is 83%, which is better than the baseline model's 61%. Hooray!

In [27]:
# accuracy score
print('accuracy for logistic regression - version 1: {0:.2f}'.format(accuracy_score(y_test, model_lr_1.predict(X_test))))
# confusion matrix
print('confusion matrix for logistic regression - version 1:\n {0}'.format(confusion_matrix(y_test, model_lr_1.predict(X_test))))
# precision and recall score
print('precision for logistic regression - version 1: {0:.2f}'.format(precision_score(y_test, model_lr_1.predict(X_test))))
print('recall for logistic regression - version 1: {0:.2f}'.format(recall_score(y_test, model_lr_1.predict(X_test))))

accuracy for logistic regression - version 1: 0.83
confusion matrix for logistic regression - version 1:
 [[95 15]
 [15 54]]
precision for logistic regression - version 1: 0.78
recall for logistic regression - version 1: 0.78


In [28]:
# coefficient properties/parameters
model_lr_1.coef_

array([[-0.02842268,  0.00455451, -0.50009089,  0.6178132 , -0.81392331,
         0.12845079, -0.17281789, -0.39317834,  0.52159979,  1.09941224,
         0.40341217, -0.18345052, -0.30036043,  0.96533486,  0.48256744,
        -0.34483448,  0.28089598,  1.21761328,  0.56363966, -1.44586305,
         1.07245548, -0.11273708, -0.47293646,  0.16255648,  0.24716933,
         0.28009428,  0.41324773,  0.49183528,  0.46198829,  0.14924424,
         0.37283516,  0.73023265]])
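One hedged way to read these numbers: in logistic regression each coefficient is the change in log-odds for a one-unit increase in that feature, so exponentiating it gives an odds ratio. The sketch below does this for the first five values of the array, assuming the coefficient order matches the column order shown by `train_df.info()` (Age, Fare, FamilySize, IsMother, IsMale). That pairing is an assumption; the notebook doesn't verify it.

```python
import math

# First five coefficients copied from the coef_ array above. The feature names
# are assumed to follow the column order of X (not verified in the notebook).
coefficients = {
    'Age': -0.02842268,
    'Fare': 0.00455451,
    'FamilySize': -0.50009089,
    'IsMother': 0.6178132,
    'IsMale': -0.81392331,
}

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}
for name, ratio in odds_ratios.items():
    print('{0}: {1:.3f}'.format(name, ratio))
```

A ratio below 1 means the feature lowers the odds of survival and a ratio above 1 raises them. The features here are not on the same scale, so the sizes of the coefficients shouldn't be compared directly across features.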

## Second Kaggle submission file

In [29]:
get_submission_file(model_lr_1, '02_lr.csv')

This submission got a 79% score on Kaggle. Hooray!