# Introduction to Machine Learning (Supervised Learning) with scikit-learn

Recap of workshop #1 (Pandas)

... we will be using pandas again

### Goals of this workshop

* What is machine learning and supervised learning?
* Become familiar with the scikit-learn API
* Become familiar with a common workflow for approaching a supervised learning problem
* Create your own model to predict values of homes in Boston (regression)
* Create your own model to classify spam emails (binary classification)
* Create your own model to classify images of handwritten digits (multi-class classification)



# 1)
### What is Machine Learning?

Machine learning is the process of extracting patterns and insight from data automatically. Machine learning models "learn" from the data they get to see and can then make inferences or predictions on new, unseen data.


### Supervised Learning (Regression and Classification)

Machine learning can be broadly classified into two categories: **Unsupervised Learning** and **Supervised Learning**

In supervised learning, we have data that has both input features and a *desired output*. The task is to build a model to make predictions on unseen data. The data that the model learns from must have labels for the attribute we are trying to predict.

Supervised learning is further broken down into two categories: **Classification** and **Regression**

In classification, the labels are *discrete*. This means there is a clear distinction between categories. In our email spam task coming up, the labels are "spam" or "not spam", two distinct categories. In digit classification, the labels are {0,1,...,9}, which make up 10 distinct categories. These categories must be 

In regression, the labels are continuous. This could be a price of a home, a grade in a course, or the price of a stock. 





# About scikit-learn

# Data Representation

Before we do any modeling, we have to understand how our data should be organized.

Data in scikit-learn is expected to be stored in a 2D array. Think matrices. In this workshop we will refer to this 2D array as **X**.
The shape of these arrays is **num_samples** x **num_features**

$$\mathbf{X} = \begin{bmatrix}
    x_{1}^{(1)} & x_{2}^{(1)} & x_{3}^{(1)} & \dots  & x_{m}^{(1)} \\
    x_{1}^{(2)} & x_{2}^{(2)} & x_{3}^{(2)} & \dots  & x_{m}^{(2)} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    x_{1}^{(n)} & x_{2}^{(n)} & x_{3}^{(n)} & \dots  & x_{m}^{(n)}
\end{bmatrix}.
$$


where 

**num_samples** is the number of observations your dataset contains

**num_features** is the number of features or attributes that each one of your observations has. Features can be real-valued or discrete-valued



We will refer to the labels as the **y** array

$$\mathbf{y} = \begin{bmatrix}
    y^{(1)} \\
    y^{(2)}  \\
    \vdots \\
    y^{(n)} 
\end{bmatrix}.
$$
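As a minimal sketch of the layout described above, the toy arrays below (hypothetical values, not taken from any workshop dataset) show an **X** with 4 observations and 3 features next to a matching **y** with one label per observation.

```python
import numpy as np

# Hypothetical toy data: 4 observations (rows), 3 features (columns).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 3.4, 5.4],
    [5.9, 3.0, 5.1],
])

# One label per observation, in the same row order as X.
y = np.array([0, 0, 1, 1])

num_samples, num_features = X.shape
print(X.shape)  # (num_samples, num_features)
print(y.shape)  # (num_samples,)
```

The only hard requirement is that **X** and **y** have the same number of rows, so that row *i* of **X** lines up with label *i* of **y**.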

In [None]:
import pandas as pd # for dataframes, loading data

import matplotlib.pyplot as plt # plotting

import numpy as np # arrays and matrices

# Training and Testing data

In supervised learning, we use training and testing sets in the process of building a model. Given our available data, we split the data into two sets: the training set and testing set. (70% training and 30% testing is common, but this choice is arbitrary). 

We will use the training set to train our models and the testing set to test them. We have to do this because a model that is evaluated on the same data it was trained on has already seen the answers, so its measured accuracy will be overly optimistic.
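The split described above can be sketched with scikit-learn's `train_test_split`. The arrays here are hypothetical placeholder data, and `random_state=1234` is only there to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 observations with 2 features each, plus 10 labels.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1234
)

print(X_train.shape, X_test.shape)
```

Rows are shuffled before splitting by default, and each row ends up in exactly one of the two sets.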


# 2)
# Classification

In this section we will work on two classification tasks:

1) Classifying images of handwritten digits

2) Classifying emails as spam or not spam

# 2.1)
### MNIST handwritten digits

The MNIST handwritten digits dataset we will use is a set of 28x28 pixel images of handwritten digits. This is a total of 784 pixels per image, and each pixel is a feature, so each observation has 784 features. Each pixel has a value in [0,255], which represents the shade intensity of the pixel. Most of the pixels have a value of 0, because most of each image is empty background.

In [None]:
mnist = pd.read_csv('mnist.csv')

In [None]:
mnist.shape

We have 42000 observations and 785 columns. One of these columns holds the labels, which leaves 784 features. Let's separate that column out to create the **y** vector.

In [None]:
mnist.head()

Split the data into the **X** matrix and the **y** vector

In [None]:
y_mnist = mnist['label']
X_mnist = mnist.drop('label', axis = 1)

In [None]:
print('The X matrix has %d observations (rows) and %d features (columns)' % (X_mnist.shape[0], X_mnist.shape[1]))
print('The y vector has %d rows' % (y_mnist.shape[0]))

To visualize some observations:

In [None]:
def plot_number(row_number):
    image = np.array(X_mnist.iloc[row_number,:]).reshape(28,28)
    plt.imshow(image, cmap='gray')
    plt.show()

In [None]:
plot_number(6)

### Split our data into training and testing sets

In [None]:
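from sklearn.model_selection import train_test_split
# 70% training / 30% testing; random_state makes the split repeatable
X_mnist_train, X_mnist_test, y_mnist_train, y_mnist_test = train_test_split(X_mnist, y_mnist, test_size=0.3, random_state=1234)
print(X_mnist_train.shape, X_mnist_test.shape)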

## K-Nearest-Neighbours Model

http://colah.github.io/posts/2014-10-Visualizing-MNIST/

In [None]:
# import knn

In [None]:
# .fit()


In [None]:
# .predict()


In [None]:
# .score()
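
One way the import, `.fit()`, `.predict()` and `.score()` steps above could look is sketched below. To stay runnable without `mnist.csv`, it uses scikit-learn's bundled 8x8 digits dataset (`load_digits`) as a stand-in, and `n_neighbors=5` is an illustrative choice rather than a tuned value. Both are assumptions of this sketch.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (an assumption): scikit-learn's bundled 8x8 digits,
# used here so the example runs without mnist.csv.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=1234
)

# n_neighbors=5 is an illustrative choice, not a tuned value.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)              # store the training observations

predictions = knn.predict(X_test)      # vote among the 5 nearest training points
accuracy = knn.score(X_test, y_test)   # fraction of test labels predicted correctly

print(predictions[:10])
print(accuracy)
```

With the workshop data, the same calls would take `X_mnist_train`, `y_mnist_train`, `X_mnist_test` and `y_mnist_test` in place of the stand-in arrays.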


# 2.2)
# Spam messages

In [None]:
spam_header = pd.read_csv('spambase_names.csv')
spam = pd.read_csv('spambase.csv', names=list(spam_header))

In [None]:
print(spam.shape)
spam.head()

Organize our data into X and y

In [None]:
y_spam = spam['spam']
X_spam = spam.drop(['spam'], axis=1)

Split into training and testing sets

## Logistic Regression Model

Logistic regression is a binary classifier. It models the probability of the True class as:

$$ \log\left(\frac{P(X)}{1-P(X)}\right) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{m}x_{m} $$

where $ x_{1},...,x_{m} $ are the $m$ distinct features and $ \beta_{0},...,\beta_{m} $ are the coefficients of the model

In the *spam* indicator in our data, 1 means spam and 0 means not spam.

$$ \log\left(\frac{P(Y=1)}{P(Y=0)}\right) = \log\left(\frac{P(\text{spam})}{P(\text{not spam})}\right) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{m}x_{m} $$

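In other words, the model expresses the log-odds of an email being spam as a linear function of its features, which in turn determines the probability that the email is spam.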
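
To make the log-odds equation concrete, the sketch below fits `LogisticRegression` and checks numerically that the log-odds of the predicted probabilities match $\beta_{0} + \beta_{1}x_{1} + ... + \beta_{m}x_{m}$. The data is synthetic (generated with `make_classification`) and only stands in for the spam data. The dataset parameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (an assumption): 500 rows, 5 features,
# labels 1 ("spam") and 0 ("not spam").
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=1234
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1234
)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is P(Y=1).
p_spam = logreg.predict_proba(X_test)[:, 1]
log_odds = np.log(p_spam / (1 - p_spam))

# Right-hand side of the equation: beta_0 + beta_1*x_1 + ... + beta_m*x_m.
linear_part = logreg.intercept_[0] + X_test.dot(logreg.coef_[0])

print(np.allclose(log_odds, linear_part))
print(logreg.score(X_test, y_test))
```

Here `logreg.intercept_` plays the role of $\beta_{0}$ and `logreg.coef_` holds $\beta_{1},...,\beta_{m}$, one per feature.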

In [None]:
# import 

In [None]:
# .fit()

Coefficient Interpretation

In [None]:
coefs = zip(list(X_spam_train),logreg.coef_.tolist()[0])
sorted_coefs = sorted(coefs, key = lambda tup: tup[1], reverse=True)
sorted_coefs

Evaluate

In [None]:
# .score()

# 3)
# Regression

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)

In [None]:
X_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
y_boston = boston.target

Split data

In [None]:
from sklearn.model_selection import train_test_split
X_boston_train, X_boston_test, y_boston_train, y_boston_test = train_test_split(X_boston, y_boston,
                                                                               test_size=0.3,
                                                                               random_state=1234)

# Linear Regression model

The linear regression model is of the form:

$$ y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{m}x_{m} $$
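
As a small illustration of this form, the sketch below fits `LinearRegression` to hypothetical noise-free data generated from known coefficients ($\beta_{0}=2$, $\beta_{1}=3$, $\beta_{2}=-1$, all chosen for this example) and then reads the fitted coefficients back from the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from known coefficients:
# y = 2 + 3*x1 - 1*x2
rng = np.random.RandomState(1234)
X = rng.rand(50, 2)
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

linreg = LinearRegression()
linreg.fit(X, y)

print(linreg.intercept_)    # estimate of beta_0
print(linreg.coef_)         # estimates of beta_1 and beta_2
print(linreg.score(X, y))   # R^2 on the data used for fitting
```

Because the data has no noise, the fitted values should recover the generating coefficients almost exactly. Real data such as the Boston set will not fit this cleanly.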

In [None]:
# import model

In [None]:
# init and .fit()


In [None]:
# .score()

In [None]:
boston_coefs = list(zip(list(X_boston_train), linreg.coef_.tolist()))  # list() so the pairs can be displayed and reused
boston_coefs

In [None]:
sorted(boston_coefs, key = lambda tup: tup[1], reverse = True)

# Recap of scikit-learn API

In [None]:
mnist.head()

In [None]:
mnist.shape

mnist_small = mnist.sample(10000)
mnist_small.shape

y_mnist_small = mnist_small['label']
X_mnist_small = mnist_small.drop('label', axis=1)

print(X_mnist_small.shape)
print(y_mnist_small.shape)

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

cv = StratifiedKFold(5)

scores = cross_val_score(knn, X_mnist_small, y_mnist_small, cv=cv)  # pass the 5-fold splitter defined above

In [None]:
scores

In [None]:
np.mean(scores)
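
For reference, a self-contained version of the cross-validation cells above might look like the following. It again uses the bundled `load_digits` data as a stand-in for `mnist.csv`, and the `shuffle=True` and `random_state` settings are assumptions added so the folds are repeatable.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (an assumption): bundled 8x8 digits instead of mnist.csv.
digits = load_digits()

knn = KNeighborsClassifier(n_neighbors=5)

# 5 folds, each keeping roughly the same class proportions as the full data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)

# One accuracy score per fold: train on 4 folds, score on the remaining one.
scores = cross_val_score(knn, digits.data, digits.target, cv=cv)

print(scores)
print(np.mean(scores))
```

Averaging the per-fold scores gives a steadier estimate of accuracy than a single train/test split, at the cost of fitting the model five times.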