# Machine Learning

<img src="https://static1.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png?format=250w">

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram


## What is Machine Learning?

> Field of study that gives computers the ability to learn without being explicitly programmed.
> -Arthur Samuel, 1959

<b>Short Answer</b>: The offspring of Statistics and Computer Science

<b>Better Answer</b>: A set of models that aim to learn something from a data set and then apply that knowledge to new data

<u>Utility</u>

 - Using labels from training data to classify new objects (e.g., images, digits, webpages)
 - Learning the relationship between explanatory features and a response variable in order to make predictions for new data (e.g., stock market)
 - Discovering natural clustering structure in data
 - Detecting low-dimensional structure in high-dimensional data
 - Finding outliers in large data sets
 - Game playing/Robotics
 
 >   *Essentially, all models are wrong, but some are useful.*
 
 >     -- George Box, Statistician (1919-2013)


## Navigating the Terminology 

<img src="figs/term.png">

## Different Types of Learning

<img src="figs/three.png">

From: [S. Raschka (2015)](https://www.slideshare.net/SebastianRaschka/nextgen-talk-022015/8-Learning_Labeled_data_Direct_feedback)

## Supervised vs. Unsupervised Learning 

<img src="figs/learn_types.png">

## Supervised Learning: Regression

Use a training set of $(\vec x,y)$ pairs to learn to predict $y$ for new $\vec x$. **Regression** is predicting a *continuous* outcome variable ($y$) from a vector of input features ($\vec x$). That is, we seek to learn:

$f(\vec x) = y$

In "theory-driven" MCMC modeling, we already think we know from physics what the functional form of $f$ is, and what we try to do is figure out the parameters of $f$ that best accommodate the data we have and the beliefs we start with. When we do not know a functional form for $f$, we take a more "data-driven" approach, such as with Gaussian Processes.

In `sklearn` there are a lot of "data driven" modelling possibilities.

- Linear Regression:  `linear_model.LinearRegression`
- Lasso & Ridge Reg.:  `linear_model.Lasso` / `linear_model.Ridge`
- Gaussian Process Regression: `gaussian_process.GaussianProcessRegressor`
- Nearest Neighbor Regression:  `neighbors.KNeighborsRegressor`
- Support Vector Regression:   `svm.SVR`
- Regression Trees:  `tree.DecisionTreeRegressor`
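
The estimators listed above share the same `fit` / `predict` interface, so they can be swapped in and out with very little code. The sketch below illustrates this on a small synthetic data set (an assumption made so the example runs without downloading anything); the hyperparameter values are arbitrary placeholders, not recommendations.

```python
import numpy as np
from sklearn import linear_model, neighbors, svm, tree
from sklearn.gaussian_process import GaussianProcessRegressor

# Synthetic data (assumption for illustration): y = 2*x0 - x1 + noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.1 * rng.normal(size=200)

# One instance of each estimator from the list; settings are arbitrary
models = {
    "LinearRegression": linear_model.LinearRegression(),
    "Lasso": linear_model.Lasso(alpha=0.01),
    "Ridge": linear_model.Ridge(alpha=1.0),
    "GaussianProcessRegressor": GaussianProcessRegressor(alpha=1e-2),
    "KNeighborsRegressor": neighbors.KNeighborsRegressor(n_neighbors=5),
    "SVR": svm.SVR(),
    "DecisionTreeRegressor": tree.DecisionTreeRegressor(max_depth=4, random_state=0),
}

# Every model is trained and queried through the same two calls
preds = {}
for name, model in models.items():
    model.fit(X_demo[:150], y_demo[:150])        # train on the first 150 rows
    preds[name] = model.predict(X_demo[150:])    # predict the remaining 50
    print(name, preds[name].shape)
```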

An aside on the "data driven" vs "theory driven" distinction...

### Regression

Let's take a look at the famous California Housing data. We don't have a good physics model for this (of course there are economic theories...). For now we just have data and seek a data-driven model.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import colors
from matplotlib import cm

import seaborn as sns
sns.set_context("talk")

from sklearn import datasets
import pandas as pd

%matplotlib inline

In [None]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

X = housing['data']  # 8 features (e.g. HouseAge,Latitude, AveBedrms, etc.)
Y = housing['target']  # response (median house price in $100,000)

df = pd.DataFrame(X, columns=housing.feature_names)
df["target"]  = housing['target']

# separate out the target into 5 different bins (for viz purposes)
nbins = 5
df["target_binned"] = pd.qcut(df["target"], nbins, labels=False)
df

In [None]:
print("feature vector shape=", X.shape)
print("output shape=", Y.shape)

In [None]:
print(housing.feature_names)

In [None]:
print(housing.DESCR)

In [None]:
f, axs = plt.subplots(1, 3, figsize=(12,6))

for i, ax in enumerate(axs):
    ax.scatter(X[:, i], Y, alpha=0.2, s=2)
    ax.set_xlabel(housing.feature_names[i])
    ax.set_ylabel("Median House Price (in $100,000)")
    
plt.subplots_adjust(wspace=0.5)

In [None]:
# pair plot of three features, colored by the binned house price
df1 = df[['MedInc', 'HouseAge', 'AveBedrms', "target_binned"]]

g = sns.pairplot(df1, hue="target_binned", palette=sns.color_palette("cubehelix", nbins),
                 plot_kws=dict(s=2, edgecolor=None, alpha=0.3))

### Basic Model Fitting

We need to create a **training set** and a **testing set**.

In [None]:
# use the first half of the rows for training and the second half for testing
half = len(Y) // 2
train_X = X[:half]
train_Y = Y[:half]
test_X = X[half:]
test_Y = Y[half:]
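
Slicing the arrays in half keeps the rows in their original order, which can give an unrepresentative split if the data happen to be sorted in some way. A common alternative is a shuffled split via `train_test_split`. The sketch below uses synthetic arrays (`X_demo`, `y_demo` are hypothetical stand-ins for `X` and `Y`) and an arbitrary `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and Y (assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])

# Shuffle the rows, then hold out half of them for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, shuffle=True, random_state=42
)
print(X_tr.shape, X_te.shape)
```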

## Linear Regression

The following are a set of methods intended for regression in which the target value is expected to be a linear combination of the input variables. In mathematical notation, if $\hat{y}$ is the predicted value:
$$\hat{y}(w, x) = w_0 + w_1 x_1 + ... + w_p x_p$$
Across the module, we designate the vector $w = (w_1, ..., w_p)$ as `coef_` and $w_0$ as `intercept_`.
To perform classification with generalized linear models, see Logistic regression.

http://scikit-learn.org/stable/modules/linear_model.html
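
To make the roles of `coef_` and `intercept_` concrete, the sketch below fits a linear model to synthetic data with known weights (the data and the true weights are assumptions chosen for illustration) and checks that `predict` agrees with evaluating $w_0 + w_1 x_1 + ... + w_p x_p$ by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known weights: w_0 = 3.0, w = (1.5, -2.0)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 2))
y_demo = 3.0 + 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.05 * rng.normal(size=500)

model = LinearRegression().fit(X_demo, y_demo)

# Evaluate the linear formula by hand from the fitted attributes
manual = model.intercept_ + X_demo @ model.coef_

print("intercept_:", model.intercept_)   # estimate of w_0
print("coef_:", model.coef_)             # estimates of (w_1, w_2)
print("matches predict():", np.allclose(manual, model.predict(X_demo)))
```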

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

clf = linear_model.LinearRegression()

In [None]:
# fit the model
clf.fit(train_X, train_Y)

In [None]:
clf.intercept_

In [None]:
# now do the prediction
Y_lr_pred = clf.predict(test_X)

# how well did we do?
mse = mean_squared_error(test_Y,Y_lr_pred) ; print(mse)

In [None]:
f, ax = plt.subplots(figsize=(6, 6))
ax.scatter(test_Y,Y_lr_pred - test_Y, s=2, alpha=0.3)
ax.set_title("Linear Regression Residuals - MSE = %.2f" % mse)
ax.set_xlabel("True Median House Price ($100,000)")
ax.set_ylabel("Residual")
ax.hlines(0,min(test_Y),max(test_Y),color="red")

## *k*-Nearest Neighbor (KNN) Regression

"The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of its training data (possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree.)."

<img src="http://scikit-learn.org/stable/_images/sphx_glr_plot_regression_001.png">

http://scikit-learn.org/stable/modules/neighbors.html
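
The principle quoted above can be sketched in a few lines of NumPy: for each query point, find the `k` closest training points and average their targets. The helper `knn_predict` below is a hypothetical illustration, not part of `sklearn`, and the data are synthetic; it is compared against `KNeighborsRegressor` with its default uniform weights and Euclidean distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic 1-D data (assumption for illustration): noisy sine curve
rng = np.random.default_rng(2)
X_fit = rng.uniform(-3, 3, size=(200, 1))
y_fit = np.sin(X_fit[:, 0]) + 0.1 * rng.normal(size=200)
X_new = np.array([[0.0], [1.0], [2.0]])

def knn_predict(X_train, y_train, X_query, k=5):
    """Brute-force k-NN regression: mean target of the k nearest training points."""
    # pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # indices of the k smallest distances for each query point
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

by_hand = knn_predict(X_fit, y_fit, X_new, k=5)
by_sklearn = KNeighborsRegressor(n_neighbors=5).fit(X_fit, y_fit).predict(X_new)
print(by_hand)
print(by_sklearn)
```

The brute-force version computes every pairwise distance, which is fine for small data; the tree-based indexing mentioned in the quote exists to avoid that cost on larger data sets.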

In [None]:
from sklearn import neighbors
from sklearn import preprocessing

# many methods work better on scaled X;
# fit the scaler on the training half only so no test information leaks in
scaler = preprocessing.PowerTransformer()
train_X = scaler.fit_transform(X[:half])
test_X = scaler.transform(X[half:])

clf1 = neighbors.KNeighborsRegressor(5)

# not scaled
#train_X = X[:half]
#test_X = X[half:]

clf1.fit(train_X,train_Y)

In [None]:
Y_knn_pred = clf1.predict(test_X)
mse = mean_squared_error(test_Y,Y_knn_pred) ; print(mse)

f, ax = plt.subplots(figsize=(10, 8))
ax.plot(test_Y, Y_knn_pred - test_Y, 'o', alpha=0.4)
ax.set_title("k-NN Residuals - MSE = %.2f" % mse)
ax.set_xlabel("True Median House Price ($100,000)")
ax.set_ylabel("Residual")
ax.hlines(0,min(test_Y),max(test_Y),color="red")

## Error Estimation & Model Selection

**Q**: How will our model perform on future data?

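So far, we’ve split the data, using one set to train the model and the other to test its performance.

This train-test strategy helps detect over-fitting to the sample on hand, but it wastes data (the test rows are never used for training) and can produce poor error estimates, since the result depends on which rows happen to land in the test set.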

cf. https://scikit-learn.org/stable/modules/cross_validation.html

### Model Selection: Cross-Validation


- *K-fold CV*: randomly split the training data into K "folds." For each $k=1,...,K$, train the model only on the data not in fold $k$ & predict for the data in fold $k$. Compute the performance metric over the CV predictions.

- *Leave-one-out (LOO) CV*: n-fold CV with n = number of training points.
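
The K-fold procedure described above can be written out as an explicit loop. The sketch below does this on synthetic data (an assumption so that it runs standalone) and checks that the out-of-fold predictions agree with what `cross_val_predict` returns for the same splitter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression data (assumption for illustration)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.0, 0.0, -1.0]) + 0.2 * rng.normal(size=120)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# For each fold k: train on everything outside the fold, predict inside it
cv_pred = np.empty_like(y_demo)
for train_idx, test_idx in kf.split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    cv_pred[test_idx] = model.predict(X_demo[test_idx])

# Performance metric computed over the pooled CV predictions
cv_mse = mean_squared_error(y_demo, cv_pred)

# The library helper should reproduce the same out-of-fold predictions
same = np.allclose(
    cv_pred, cross_val_predict(LinearRegression(), X_demo, y_demo, cv=kf)
)
print(cv_mse, same)
```

Setting `n_splits` equal to the number of rows would turn the same loop into leave-one-out CV.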


<img src="https://www.evernote.com/l/AUWvg9caKz1OO7opS2Ji3Z7OwOFkLCrg2WsB/image.png">

<img src="figs/YWgro.gif" width=50%>

In [None]:
from sklearn import model_selection

In [None]:
import numpy as np
from sklearn import datasets

housing = fetch_california_housing()

X = housing['data'] ; y = housing['target']

from sklearn import linear_model
clf = linear_model.LinearRegression()

from sklearn.model_selection import cross_val_score, cross_val_predict

def print_cv_score_summary(model, xx, yy, cv, verbose=False):
    scores = cross_val_score(model, xx, yy, cv=cv, n_jobs=1, verbose=verbose)
    print("mean: {:.3f}, stdev: {:.3f}".format(
        np.mean(scores), np.std(scores)))

In [None]:
# Returns the coefficient of determination R^2 of the prediction.
print_cv_score_summary(clf, X, y,
                       cv=model_selection.KFold(10, shuffle=True, random_state = 42), verbose=True)

In [None]:
predictions = cross_val_predict(clf, X, y, 
                                cv=model_selection.KFold(10, shuffle=True, random_state = 42), n_jobs=1)

In [None]:
mse = mean_squared_error(y, predictions) ; print(mse)

f, ax = plt.subplots(figsize=(10,6))
ax.scatter(y, predictions - y,alpha=0.2,edgecolors=None)
ax.set_title("CV kfold linear model - MSE = %.2f" % mse)
ax.set_xlabel("True Median House Price ($100,000)")
ax.set_ylabel("Residual")
ax.hlines(0,min(y),max(y),color="red")
ax.set_xlim(0,5.1)

In [None]:
clf_knn = neighbors.KNeighborsRegressor(15)
print_cv_score_summary(clf_knn, X, y,
                       cv=model_selection.KFold(5, shuffle=True, random_state = 42), verbose=True)
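
Because k-NN relies on distances, feature scaling can matter a great deal, and inside cross-validation the scaler should be re-fit on each training fold. One way to arrange that is to wrap the scaler and the model in a pipeline. The sketch below is an illustration on synthetic data where an irrelevant feature has a much larger scale; it uses `StandardScaler` as a simple stand-in rather than the `PowerTransformer` used earlier.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): the target depends only on feature 0,
# while feature 1 is pure noise on a much larger scale
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(300, 2))
X_demo[:, 1] *= 1000.0
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=300)

cv = KFold(5, shuffle=True, random_state=42)

# k-NN on raw features: distances are dominated by the large-scale feature
raw_scores = cross_val_score(KNeighborsRegressor(15), X_demo, y_demo, cv=cv)

# Pipeline: the scaler is fit on each training fold, then applied to the held-out fold
scaled_model = make_pipeline(StandardScaler(), KNeighborsRegressor(15))
scaled_scores = cross_val_score(scaled_model, X_demo, y_demo, cv=cv)

print("raw R^2:    %.3f" % raw_scores.mean())
print("scaled R^2: %.3f" % scaled_scores.mean())
```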

In [None]:
model_selection.GridSearchCV?

In [None]:
parameters = {"n_neighbors": [5, 8, 10, 12, 15, 20],  "weights": ["uniform", "distance"]}

knn_tune = model_selection.GridSearchCV(clf_knn, parameters, 
                                        n_jobs = -1, cv = 10, verbose=True, scoring='neg_mean_squared_error')

knn_opt = knn_tune.fit(X, y)

In [None]:
knn_opt.best_estimator_
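
Besides `best_estimator_`, a fitted `GridSearchCV` object exposes `best_params_`, `best_score_`, and `cv_results_`. With `scoring='neg_mean_squared_error'` the scores are negated MSE values (so that larger is better), which means the sign has to be flipped to recover the MSE. A minimal sketch on synthetic data, with an arbitrary small grid:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic 1-D data (assumption for illustration)
rng = np.random.default_rng(5)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=200)

grid = {"n_neighbors": [3, 5, 10], "weights": ["uniform", "distance"]}
search = GridSearchCV(
    KNeighborsRegressor(), grid, cv=5, scoring="neg_mean_squared_error"
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE of the best setting; flip the sign
best_mse = -search.best_score_
print("best parameters:", search.best_params_)
print("best CV MSE:", best_mse)
print("settings tried:", len(search.cv_results_["params"]))
```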