# Outline
- ### What is scikit-learn?
- ### Quick intro to machine learning
- ### General process
- ### Walk-through examples
    - Classification
    - Regression
    - (if time) Cross-validation
- ### Conclusion

## What is sci-kit learn?
- scikit-learn.org
- "sci" = science
- a Python module for machine learning
- Installation: [instructions for various OS](http://scikit-learn.org/stable/install.html)

## Very quick intro to machine learning
- Methods of learning (useful) information from data
    - Given samples and their attributes ("features"), what can we say about some new example? what can we say about the samples in general?
    - Examples: weather prediction given previous days' weather data, product recommendation given users' browsing histories, fake news detection given an article/post... and many, many others! 
- 2 most common problem settings
    - Supervised: data available to us come with "labels" (i.e. thing we want to predict)
        - Classification: prediction of group membership 
        <img src="http://scikit-learn.org/stable/_images/sphx_glr_plot_logistic_multinomial_002.png", width=500>
        - Regression: prediction of a (real) value
        <img src="http://scikit-learn.org/stable/_images/sphx_glr_plot_isotonic_regression_001.png", width=500>
    - Unsupervised: data available to us don't come with labels
        - Clustering: discovery of group membership
        <img src="http://scikit-learn.org/stable/_images/sphx_glr_plot_dbscan_001.png", width=500>
        - Density estimation: discovery of distribution of data
- Today's focus: supervised learning

## General process (supervised learning)
- Training, validation, testing
- Training:
    - Given: some examples with features and labels
    - Learn the relationship between labels and features: usually described by a function **f**, i.e. learning f such that
    *label $\approx$ f(features)*
    - How is this *learned*? 
        - Problem often is formulated as minimizing a cost function (or maximizing a likelihood function)
        - Specifically:
            - Given data points $(x_1, y_1), ... (x_N, y_N)$ where $x_i$ are usually feature vectors 
            (i.e. data attributes represented numerically)
            - Using your knowledge of probability and what you know about the data, 
            you hypothesize a possible relationship between $x_i$ and its label $y_i$: 
            $\hat{y}_i = f(x_i)$, or commonly: $$\hat{y}_i = f(\sum_{j=1}^{m}w_jx_{i,j})$$
            The problem then becomes: solving for the weights $w_j$ that helps model this relationship
            - Since we have true labels $y_i$, formulate a cost function:
            $$L(w) = \sum_{i=1}^{N}g(y_i, \hat{y}_i)$$
            - Best parameters $w_j$ are then found by solving:
            $$w^{\ast} = argmin_{w} L(w)$$
- Testing: 
    - Given: new data (not seen during training) -- **NEVER** mix training and testing data!
    - Predict the labels
    - Evaluate predicted labels by comparing to ground truth
- Validation:
    - Preferably a separate set from training and testing
    - Used for tuning parameters of f() and help *prevent overfitting*, aka "regularization"
    $$L(w) = g(y_i, \hat{y}_i) + \lambda||w||^2_2$$
    - Sometimes the amount of data is too small --> cross validation

## Example 1: classification
- Problem: detect heart disease from a patient's record
- Data: [UCI heart disease data set (short version)](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)
    - [train](https://courses.cs.washington.edu/courses/cse546/05sp/psetdata/processed.cleveland.data.train)/[test](https://courses.cs.washington.edu/courses/cse546/05sp/psetdata/processed.cleveland.data.test) -- processed data courtesy of CSE 546 (Spring 2005)
    - attributes: 14 (13 features + label)

In [24]:
# might be necessary, import here first
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

# data preprocessing function
def load_data(file_name):
    flines = open(file_name).readlines()
    flines = [x.rstrip() for x in flines]
    flines = [x.split(',') for x in flines]
    flines = [[float(s) for s in x] for x in flines]
    feats = [x[:-1] for x in flines]
    labels = [x[-1] for x in flines]
    labels = [int(x>0) for x in labels]
    return feats, labels


In [23]:
# load data downloaded
X_train, y_train = load_data('hd.cleveland.data.train.csv')
X_test, y_test = load_data('hd.cleveland.data.test.csv')

In [26]:
# peek at X_train, y_train
print np.array(X_train)
print np.array(X_train).shape

[[ 63.   1.   1. ...,   3.   0.   6.]
 [ 67.   1.   4. ...,   2.   3.   3.]
 [ 67.   1.   4. ...,   2.   2.   7.]
 ..., 
 [ 50.   0.   4. ...,   1.   0.   3.]
 [ 64.   0.   4. ...,   1.   0.   3.]
 [ 57.   1.   3. ...,   1.   1.   7.]]
(203, 13)


We will first try a simple [logistic regression classifier](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)
- (very abbreviated) math background: 
    - logistic function: $$\sigma(x) = \frac{1}{1 + e^{-x}}$$
    ... to model $p(y_i=1 \,| \, x_i)$, 
    and $p(y_i=0 \, | \, x_i) = 1 - \sigma(x_i)$
    
    - Then decide on predicted label as follows:
    $\hat{y_i} = 1$ if $p(y_i=1 \,| \, x_i) > p(y_i=0 \,| \, x_i)$ 
    
    and vice versa
    
    - Recall: $$L(w) = \sum_{i=1}^{N}g(y_i, \hat{y}_i)$$
    Here 
    $$-g(y_i, \hat{y}_i) = y_i\log\sigma(x_i) + (1-y_i)\log(1-\sigma(x_i))$$
    
    
- Find more [detailed derivation](http://cs229.stanford.edu/notes/cs229-notes1.pdf) in many, many tutorials/texts

In [28]:
from sklearn import linear_model
logistic = linear_model.LogisticRegression()

# fit the classifier
logistic.fit(X_train, y_train)
print logistic.coef_

[[-0.03986426  0.99840617  0.63739152  0.01867455  0.00452262 -0.52640388
   0.42199075 -0.04683685  0.46728745  0.25581275  0.15207522  0.84417809
   0.38177672]]


In [33]:
# use learned classifier to predict test data y_pred = f(X_test)
y_pred = logistic.predict(X_test)
print y_pred
print np.array(y_test)

[0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 1 0 0 1 1 1 0 0
 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0
 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 1 1 0 0 1 1 0 1 1 0 0]
[0 0 1 1 1 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 1 1 0 0 1 1 1 0 0
 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 1 0 1 0 1 0 1 0 0
 0 1 0 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 0 1 1 1 1 1 1 0]


Now, evaluating the model:
- In classification problems, we often care about
    - Precision $p = \frac{TP}{TP + FP}$
    - Recall $r = \frac{TP}{TP + FN}$
    - Accuracy $a = \frac{TP + TN}{P+N}$
    - F1 score $f1 = \frac{2*r*p}{r+p}$
    - CM in sklearn given as 
    $[[TN,FP];$
    $[FN,TP]]$

In [30]:
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# compute and print scores
print "CM:"
print confusion_matrix(y_test, y_pred)
print "precision:", precision_score(y_test, y_pred)
print "recall:", recall_score(y_test, y_pred)
print "accuracy: ", accuracy_score(y_test, y_pred)
print "f1: ", f1_score(y_test, y_pred)

CM:
[[48  4]
 [15 33]]
precision: 0.891891891892
recall: 0.6875
accuracy:  0.81
f1:  0.776470588235


Now try to refit the data with some different parameters:

In [31]:
logistic = linear_model.LogisticRegression(C=100)
logistic.fit(X_train, y_train)
y_pred = logistic.predict(X_test)

print "CM:"
print confusion_matrix(y_test, y_pred)
print "precision:", precision_score(y_test, y_pred)
print "recall:", recall_score(y_test, y_pred)
print "accuracy: ", accuracy_score(y_test, y_pred)
print "f1: ", f1_score(y_test, y_pred)


CM:
[[48  4]
 [16 32]]
precision: 0.888888888889
recall: 0.666666666667
accuracy:  0.8
f1:  0.761904761905


How about a different classifier? 
- There are [lots](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) to play with: SVM, decision trees are among the most popular default choices


In [35]:
from sklearn.svm import LinearSVC
clf = LinearSVC(C=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print "CM:"
print confusion_matrix(y_test, y_pred)
print "precision:", precision_score(y_test, y_pred)
print "recall:", recall_score(y_test, y_pred)
print "accuracy: ", accuracy_score(y_test, y_pred)
print "f1: ", f1_score(y_test, y_pred)



CM:
[[51  1]
 [38 10]]
precision: 0.909090909091
recall: 0.208333333333
accuracy:  0.61
f1:  0.338983050847


In [36]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print "CM:"
print confusion_matrix(y_test, y_pred)
print "precision:", precision_score(y_test, y_pred)
print "recall:", recall_score(y_test, y_pred)
print "accuracy: ", accuracy_score(y_test, y_pred)
print "f1: ", f1_score(y_test, y_pred)

CM:
[[44  8]
 [19 29]]
precision: 0.783783783784
recall: 0.604166666667
accuracy:  0.73
f1:  0.682352941176


## Example 2: regression
Same problem setting as above, but predict real-valued scores instead of presence/absence/class membership. 
We begin with a standard choice: [linear regression](http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)

- Recall: $$L(w) = \sum_{i=1}^{N}g(y_i, \hat{y}_i)$$
    Here 
    $$g(y_i, \hat{y}_i) = (y_i - \hat{y_i})^2$$

    And
    $$\hat{y_i} = \sum_{j=1}^{m}w_jx_{i,j}$$

Let's apply linear regression on our previous heart disease problem
- Note/disclaimer: this is just a toy example! 
- Details on issues/problems later...

In [38]:
# data preprocessing function, version 2 for regression task
def load_data_cont(file_name):
    flines = open(file_name).readlines()
    flines = [x.rstrip() for x in flines]
    flines = [x.split(',') for x in flines]
    flines = [[float(s) for s in x] for x in flines]
    feats = [x[:-1] for x in flines]
    labels = [x[-1] for x in flines]
    return feats, labels

# load data downloaded
X_train, y_train = load_data_cont('hd.cleveland.data.train.csv')
X_test, y_test = load_data_cont('hd.cleveland.data.test.csv')

print np.array(y_train)

[ 0.  2.  1.  0.  0.  0.  3.  0.  2.  1.  0.  0.  2.  0.  0.  0.  1.  0.
  0.  0.  0.  0.  1.  3.  4.  0.  0.  0.  0.  3.  0.  2.  1.  0.  0.  0.
  3.  1.  3.  0.  4.  0.  0.  0.  1.  4.  0.  4.  0.  0.  0.  0.  2.  0.
  1.  1.  1.  1.  0.  0.  2.  0.  1.  0.  2.  2.  1.  0.  2.  1.  0.  3.
  1.  1.  1.  0.  1.  0.  0.  3.  0.  0.  0.  3.  0.  0.  0.  0.  0.  0.
  0.  3.  0.  0.  0.  1.  2.  3.  0.  0.  0.  0.  0.  0.  3.  0.  2.  1.
  2.  3.  1.  1.  0.  2.  2.  0.  0.  0.  3.  2.  3.  4.  0.  3.  1.  0.
  3.  3.  0.  0.  0.  0.  0.  0.  0.  0.  4.  3.  1.  0.  0.  1.  0.  1.
  0.  1.  4.  0.  0.  0.  0.  0.  0.  4.  3.  1.  1.  1.  2.  0.  0.  4.
  0.  0.  0.  0.  0.  0.  1.  0.  3.  0.  1.  0.  4.  1.  0.  1.  0.  0.
  3.  2.  0.  0.  1.  0.  0.  2.  1.  2.  0.  3.  1.  2.  0.  3.  0.  0.
  0.  1.  0.  0.  0.]


In [40]:
reg = linear_model.LinearRegression()
# fit model:
reg.fit(X_train, y_train)

# print coefficients
print reg.coef_
print reg.intercept_

# Common evaluation scores for regression
from sklearn.metrics import mean_squared_error, mean_absolute_error
print "MSE: ", mean_squared_error(y_test, y_pred)
print "MAE: ", mean_absolute_error(y_test, y_pred) 


[ -6.03270243e-04   1.96276775e-01   2.80906865e-01   6.04260193e-03
   2.89013845e-04  -1.57798558e-01   1.37685901e-01  -6.38560647e-03
   6.28071200e-02   1.54847785e-01   1.43634231e-01   3.64274504e-01
   1.59684804e-01]
-1.55286599229
MSE:  1.42
MAE:  0.76


- Add simple regularization ([ridge regression](http://scikit-learn.org/stable/modules/linear_model.html#ridge-regression))


In [44]:
# ridge regressor
reg = linear_model.Ridge(alpha=0.1)
reg.fit(X_train, y_train)

# print coefficients
print reg.coef_
print reg.intercept_

# Common evaluation scores for regression
from sklearn.metrics import mean_squared_error, mean_absolute_error
print "MSE: ", mean_squared_error(y_test, y_pred)
print "MAE: ", mean_absolute_error(y_test, y_pred) 


[ -5.97861518e-04   1.95666600e-01   2.80758785e-01   6.03936276e-03
   2.88213336e-04  -1.57111316e-01   1.37668124e-01  -6.39156389e-03
   6.27571634e-02   1.54926841e-01   1.43273063e-01   3.64008186e-01
   1.59771758e-01]
-1.55055330931
MSE:  1.42
MAE:  0.76


- Play around with others!

In [46]:
from sklearn.tree import DecisionTreeRegressor
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)

print reg.feature_importances_

# Common evaluation scores for regression
from sklearn.metrics import mean_squared_error, mean_absolute_error
print "MSE: ", mean_squared_error(y_test, y_pred)
print "MAE: ", mean_absolute_error(y_test, y_pred) 



[ 0.06667992  0.00487795  0.07382952  0.12881689  0.07092588  0.          0.0021317
  0.13888155  0.02087771  0.19666514  0.00936427  0.00480968  0.28213979]
MSE:  1.42
MAE:  0.76


In [48]:
from sklearn.model_selection import cross_val_score
clf = LinearSVC()

# reload classification data
X_train, y_train = load_data('hd.cleveland.data.train.csv')
print cross_val_score(clf, X_train, y_train, cv=5)

[ 0.76190476  0.70731707  0.475       0.675       0.8       ]


## Important issues to note
- Common practice: normalizing features so that values are roughly on the similar scale
- Be careful in how you interpret data/what was learned
    - In our linear regression example, the problem setting probably does not make a lot of sense
- Dealing with real-valued vs categorical data
    - categorical data are often represented as one-hot sub-vectors
    - real-valued data are often quantized to several levels
- Different classifiers are more robust/more appropriate for different problem types -- see [ML cheatsheet](http://scikit-learn.org/stable/tutorial/machine_learning_map/)
- Cross validation when there is not enough data -- See [Model selection and Evaluation](http://scikit-learn.org/stable/model_selection.html#model-selection)
    - So common and well-accepted that a [stackexchange](https://stats.stackexchange.com/) is named after it
    - Divide training data into k partitions (folds) 1, 2, ... k; prefereably keeping the same distribution among folds
    - Train k classifiers instead of 1:
        - For fold 1, use folds {2, ... k} to train model, then evaluate on data of fold 1
        - For fold i, use folds {1, ... k}\i to train model, then evaluate on data of fold i
        - Report scoring metric as average of k scores in each fold


Other useful topics to check out
- [feature preprocessing/normalization](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing)
- [feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)
- [dimensionality reduction](http://scikit-learn.org/stable/modules/decomposition.html#decompositions)

## Conclusion
- Use [responsibly](https://xkcd.com/643/)
<img src="https://imgs.xkcd.com/comics/ohm.png">
- Handy guide by sci-kit learn: [ML cheatsheet](http://scikit-learn.org/stable/tutorial/machine_learning_map/)
- A very recent critique piece on [misused ML](https://medium.com/@blaisea/physiognomys-new-clothes-f2d4b59fdd6a)
- Try to understand what's going on before implementing/using a certain algorithm
- Resources:
    - Free Coursera [class(es)](https://www.coursera.org/learn/machine-learning)
    - Pedro Domingos' ["A Few Useful Things to Know about ML"](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)
    - scikit-learn's [tutorial pages/docs](http://scikit-learn.org/stable/tutorial/index.html)
    - Some classic textbooks: 
        - K. Murphy, Machine Learning: a Probabilistic Perspective, 2012
        - C. Bishop, Pattern Recognition and Machine Learning, 2007
        - R. Duda, P. Hart & D. Stork, Pattern Classification, 2001
        - T. Mitchell, Machine Learning, 1997
    - UW classes