# Course: Supervised Learning with scikit-learn
- [DataCamp course link](https://www.datacamp.com/courses/supervised-learning-with-scikit-learn/)

In [1]:
# Pre-load modules used later
from IPython.display import Image

## Chapter 1: Classification
- [Slides](slides/Supervised Learning with scikit-learn/ch1_slides.pdf)


- Models are implemented as Python classes and serve two functions:
  1. Implement algos for learning & predicting (behavior)
  1. Store information learned from training/fitting (data)
- **K-nearest neighbors**
  - larger k = smoother decision boundary = less complex model
  - smaller k = more granular decision boundary = more complex model = more prone to overfitting
- Key model performance measure: 
  - **accuracy** = # correct predictions / total # of data points
- **Model complexity curves** are awesome!


- **Train/test split**

In [26]:
from sklearn.model_selection import train_test_split
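
A minimal sketch of the train/test split combined with k-nearest neighbors and the accuracy measure. The iris dataset, the 30% test size, and the three values of k are assumptions chosen for illustration, not values taken from the course.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing; stratify keeps class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Compare a more complex model (small k) with smoother ones (larger k)
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_acc = knn.score(X_train, y_train)  # accuracy on seen data
    test_acc = knn.score(X_test, y_test)     # accuracy on unseen data
    print(f"k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

Plotting train and test accuracy against k gives the model complexity curve mentioned above.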

## Chapter 2: Regression
- [Slides](slides/Supervised Learning with scikit-learn/ch2_slides.pdf)


- **Loss function** - Linear regression typically uses the OLS (Ordinary Least Squares) method, which minimizes the sum of squared residuals on the *training* data. This is equivalent to minimizing the *MSE (Mean Squared Error)*, and therefore the *RMSE (Root Mean Squared Error)*, of the fitted *predictions*.
- Key model performance measure: 
  - **R-squared** = **R²** = proportion of variance in the target variable that is explained by the feature variables.
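
A minimal sketch of fitting a linear regression and reporting both RMSE and R² on held-out data. The synthetic data (a noisy line) is an assumption made so that the example is self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

reg = LinearRegression().fit(X_train, y_train)  # OLS fit on the training set
y_pred = reg.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)  # same value as reg.score(X_test, y_test)
print(f"RMSE={rmse:.3f}  R2={r2:.3f}")
```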
  

- **k-fold cross-validation**
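  ```python
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import cross_val_score

  # X (features) and y (target) are assumed to be loaded already
  reg = LinearRegression()
  cv_results = cross_val_score(reg, X, y, cv=5)  # one R² score per fold
  ```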

### Regularization
- Purpose is to penalize & discourage large coefficients (which can lead to overfitting)
- The regularization strength is denoted as *"alpha"* (the parameter name in scikit-learn) or *"lambda"*
- Common methods:
  - `Ridge` - (L2) Typical first choice for regression models.
  - `Lasso` - (L1) Useful for feature selection; it shrinks the coefficients of the least-impactful features to exactly zero.
  - `ElasticNet` - The penalty term is a linear combination of the L1 and L2 penalties. The `l1_ratio` parameter sets the mix between them:
  >\begin{equation*}
penalty = {a * L1\ +\ b * L2}, \quad l1\_ratio = \frac{a}{a + b}
\end{equation*}
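
A minimal sketch comparing the three regularized regressors on the same data. The synthetic dataset (only the first two of five features carry signal) and the `alpha` values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data (an assumption): only the first 2 of 5 features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
lasso = Lasso(alpha=0.5).fit(X, y)                    # L1 penalty
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

print("ridge:", ridge.coef_.round(3))
print("lasso:", lasso.coef_.round(3))
print("enet: ", enet.coef_.round(3))
```

Ridge shrinks every coefficient a little, while Lasso drives the coefficients of the three noise features to zero, which is why it doubles as a feature selector.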

  
### Logistic regression
- Used for *binary* classification (NOT for continuous target variables despite its name)
- Key model performance measure:
  - **ROC curve** (Receiver Operating Characteristic curve)
  - **AUC** (Area Under the [ROC] Curve) - The larger the better!
- Regularization param for logreg is **C**, which controls the *inverse* of the regularization strength. A *large C* can lead to an *overfit* model, while a *small C* can lead to an *underfit* model.
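
A minimal sketch of scoring a logistic regression with the ROC curve and AUC. The synthetic dataset from `make_classification` and the value `C=1.0` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

logreg = LogisticRegression(C=1.0, max_iter=1000)
logreg.fit(X_train, y_train)

# Column 1 holds the predicted probability of the positive class
y_prob = logreg.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_prob)  # points on the ROC curve
auc = roc_auc_score(y_test, y_prob)
print(f"AUC={auc:.3f}")
```

Both metrics are computed from the predicted probabilities rather than the 0/1 labels, because the ROC curve sweeps over every possible decision threshold.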

### Support Vector Machines (SVM)
- A discriminative classifier formally defined by a separating hyperplane.
- Hyperparameters:
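  - C - controls the *inverse* of the regularization strength (as with logreg: larger C = weaker regularization)
  - gamma - kernel coefficient for the non-linear kernels (e.g. RBF); larger gamma = more complex decision boundary
- **Support-vector clustering (SVC)**: A similar method that also builds on kernel functions but is appropriate for *unsupervised* learning. Not to be confused with scikit-learn's `SVC` class, which stands for Support Vector *Classification* and is a supervised classifier.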

## Chapter 3: Fine-tuning your model
- [Slides](slides/Supervised Learning with scikit-learn/ch3_slides.pdf)


### "Confusion matrix" (That's what that thing is called!)
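- Gives a fuller picture of model performance than accuracy alone, which can be misleading when the classes are imbalanced. Accuracy itself in confusion-matrix terms: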
>\begin{equation*}
\mathbf{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\end{equation*}

- Other Confusion matrix metrics:
>\begin{equation*}
\mathbf{precision} = \frac{TP}{TP + FP}
\end{equation*}
  - Intuition: High *precision* means very few false positives; for example, very few real emails were predicted as spam.


- 
>\begin{equation*}
\mathbf{recall/sensitivity/hit\ rate} = \frac{TP}{TP + FN}
\end{equation*}
  - Intuition: High *recall* means for example that a high percentage of spam emails were predicted correctly as spam.


- 
>\begin{equation*}
\mathbf{F1\ score} = 2 * \frac{precision * recall}{precision + recall}
\end{equation*}
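
A small worked example of the four formulas, using scikit-learn's metric functions. The ten hand-made labels are an assumption chosen so the arithmetic is easy to check by eye.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hand-made labels (an assumption for illustration): 1 = spam, 0 = real email
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                    # 4 2 1 3

print(accuracy_score(y_true, y_pred))    # (3 + 4) / 10 = 0.7
print(precision_score(y_true, y_pred))   # 3 / (3 + 2) = 0.6
print(recall_score(y_true, y_pred))      # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))          # 2 * 0.45 / 1.35 ≈ 0.667
```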

In [27]:
# ROC curve example
Image(url="https://www.medcalc.org/manual/_help/images/roc_intro3.png")

### Hyperparameter tuning
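- Hyperparameters cannot be explicitly learned by fitting the model; they must be set before fitting.
- Trial & error thru **grid search** is an effective method for finding the best hyperparams. The **randomized search** alternative saves computation time by sampling a fixed number of candidate settings instead of trying them all.
- Essential to use *cross-validation* here, and to keep a **hold-out set** that is untouched during tuning for the final evaluation.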

In [28]:
from sklearn.model_selection import GridSearchCV
# OR
from sklearn.model_selection import RandomizedSearchCV
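
A minimal sketch of grid search with cross-validation plus a hold-out set. The iris data, the k-NN estimator, and the grid of `n_neighbors` values are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Set the hold-out set aside first; it plays no part in the tuning
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)  # 5-fold CV for each candidate k

print(search.best_params_, round(search.best_score_, 3))
print("hold-out accuracy:", round(search.score(X_hold, y_hold), 3))
```

`RandomizedSearchCV` is used the same way, except that it takes `param_distributions` and an `n_iter` budget instead of an exhaustive `param_grid`.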

## Chapter 4: Preprocessing and pipelines

### Encoding categorical values
- **One hot encoding** - Uses **"dummy variables"** to map/encode categorical values to binary values.
  - scikit-learn: `OneHotEncoder()` 
  - pandas: `get_dummies()`
- **Label encoding** - Method mapping each category to its own numerical value.
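
A minimal sketch of the two encodings on a toy column. The `color` data is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy data (an assumption for illustration)
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# pandas: one dummy column per category
dummies = pd.get_dummies(df)
print(list(dummies.columns))  # ['color_blue', 'color_green', 'color_red']

# scikit-learn: same idea; the default output is a sparse matrix
onehot = OneHotEncoder().fit_transform(df[["color"]]).toarray()
print(onehot.shape)  # (4, 3)

# Label encoding: one integer per category, assigned in sorted order
labels = LabelEncoder().fit_transform(df["color"])
print(labels.tolist())  # [2, 1, 0, 1]
```

Label encoding implies an ordering (blue < green < red) that is usually meaningless, which is one reason one hot encoding is often preferred for input features.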

### Imputing missing data
- sklearn's `SimpleImputer()` uses `strategy='mean'` as the default method, replacing each missing value with the mean of its column.

In [29]:
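import numpy as np
from sklearn.impute import SimpleImputer
# SimpleImputer expects np.nan (not the string 'NaN') as the missing-value marker
imp = SimpleImputer(missing_values=np.nan, strategy='mean')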
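
A minimal sketch of mean imputation in action. The 3x2 toy matrix is an assumption for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix (an assumption for illustration) with one gap per column
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp.fit_transform(X)

print(imp.statistics_)  # column means learned in fit: [2. 3.]
print(X_filled)         # each gap replaced by the mean of its own column
```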

### The Pipeline design pattern

- The **`Pipeline`** object takes a sequential list of steps
- Output of one step is input to next step
- Each `step` is a *tuple* with two elements:
  - Name - string
  - Transform - object implementing .fit() and .transform() (the final step can be an estimator, which only needs .fit())
- A Pipeline can itself also serve as a step
- `make_pipeline(*steps)` is a shorthand constructor which does not require step names.


- Preprocessing multiple dtypes -- Use sklearn helper functions to combine steps which can't be directly fed from one to another:
  - [**`FunctionTransformer()`**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html) - Turns a Python function into a Pipeline-compatible Transformer object.  In our use case we created FunctionTransformer() objects to return numeric and text columns from DataFrame separately.
  - [**`FeatureUnion()`**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) - Combines multiple feature matrices from sub-pipelines into one 'union' step in the pipeline.
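
A minimal sketch of the numeric-plus-text pattern described above. The toy DataFrame, its column names (`amount`, `text`), and the helper functions `get_numeric` and `get_text` are hypothetical names made up for this example; they are not the course's code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy data (an assumption for illustration): one numeric and one text column
df = pd.DataFrame({
    "amount": [10.0, np.nan, 250.0, 300.0, 12.0, 280.0],
    "text": ["pencils paper", "paper glue", "bus fuel",
             "bus repair", "glue pencils", "fuel repair"],
})
y = [0, 0, 1, 1, 0, 1]

def get_numeric(frame):
    return frame[["amount"]]   # 2-D: numeric steps expect a matrix

def get_text(frame):
    return frame["text"]       # 1-D: CountVectorizer expects strings

union = FeatureUnion([
    ("numeric", Pipeline([
        ("selector", FunctionTransformer(get_numeric, validate=False)),
        ("imputer", SimpleImputer()),
    ])),
    ("text", Pipeline([
        ("selector", FunctionTransformer(get_text, validate=False)),
        ("vectorizer", CountVectorizer()),
    ])),
])

pl = Pipeline([("union", union), ("clf", LogisticRegression(max_iter=1000))])
pl.fit(df, y)
print(pl.predict(df))
```

Each sub-pipeline sees the full DataFrame, picks out its own columns, and the union stacks the resulting feature matrices side by side (1 numeric column + 6 word-count columns here).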


In [30]:
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

steps = [('imputation', imp),
         ('logistic_regression', logreg)]

pipeline = Pipeline(steps)

Hyperparameters can be specified with the **'`step_name__parameter_name`'** notation for scikit-learn:

In [31]:
parameters = {'SVM__C': [1, 10, 100],
              'SVM__gamma': [0.1, 0.01]}
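
A minimal sketch of how this parameter grid plugs into a pipeline and a grid search. The breast cancer dataset and the `scaler` step are assumptions for illustration; only the `SVM` step name and the grid values come from the cell above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

# The step name 'SVM' is the prefix used in the parameter grid keys
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("SVM", SVC())])

parameters = {"SVM__C": [1, 10, 100],
              "SVM__gamma": [0.1, 0.01]}

cv = GridSearchCV(pipeline, parameters, cv=5)
cv.fit(X_train, y_train)

print(cv.best_params_)
print("test accuracy:", round(cv.score(X_test, y_test), 3))
```

Searching over the whole pipeline means the scaler is re-fit inside every cross-validation fold, so no information from the validation fold leaks into the preprocessing.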

### Normalizing (aka Centering & Scaling)

Typical normalization methods:
- *Standardization*: All features centered around zero with a variance of one. Values are not bounded to a fixed range.
  - (Subtract the mean and divide by the standard deviation)
- *Min-max scaling*: Range from 0 to 1
  - (Subtract the minimum and divide by the range)
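
A minimal sketch of both scalers on one toy column (the four values are an assumption for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])  # toy column (an assumption)

X_std = StandardScaler().fit_transform(X)   # (x - mean) / std
X_mm = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min)

print(X_std.ravel().round(3))  # mean 0, std 1; values can fall outside [-1, 1]
print(X_mm.ravel().round(3))   # 0 at the minimum, 1 at the maximum
```

The outlier (10.0) ends up at roughly 1.7 after standardization, which shows that standardized values are not confined to a fixed interval.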



---

# Course: Machine Learning with the Experts: School Budgets
- [DataCamp.com course link](https://www.datacamp.com/courses/machine-learning-with-the-experts-school-budgets/)
- Full code used in this course is in [this Jupyter Notebook](https://github.com/datacamp/course-resources-ml-with-experts-budgets/blob/master/notebooks/1.0-full-model.ipynb)


## Chapter 1: Exploring the raw data
- [Slides](slides/Machine Learning with the Experts - School Budgets/ch1_slides.pdf)

- **Log loss** for binary classification
  \begin{equation*}
logloss = - \frac{1}{N} \sum_{i=1}^N \left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right)
\end{equation*}
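
A minimal sketch that evaluates the formula by hand and checks it against `sklearn.metrics.log_loss`. The labels and probabilities are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.6, 0.05])  # predicted P(y = 1); made-up values

# Per-sample penalty, a direct translation of the summand in the formula
per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
manual = per_sample.mean()

print(round(manual, 4), round(log_loss(y_true, p), 4))  # the two agree
print(per_sample.round(3))  # the confident miss (p=0.05 for a true 1) dominates
```

Being confident and wrong is penalized far more heavily than being unsure, which is the key property of this metric.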


In [32]:
Image(url="http://wiki.fast.ai/images/4/43/Log_loss_graph.png", width=450)

-  **[`.predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict) vs [`.predict_proba()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba)**: The `LogisticRegression` model method `.predict()` returns only the predicted class labels (binary 0/1 here), while `.predict_proba()` returns the class probabilities ranging from 0 to 1, which is what log loss is computed from.

## Chapter 2: Creating a simple first model
- [Slides](slides/Machine Learning with the Experts - School Budgets/ch2_slides.pdf)

### Multi-class classification tips
- This data has rarely occurring labels, therefore a standard train-test-split may be error-prone (a rare label can end up in only one of the two splits).  
  - Scikit-learn's **`StratifiedShuffleSplit`** is a solution, however it works only with a *single target variable*.
  - A helper method has been created to do a multi-class stratified train-test split: [**`multilabel_train_test_split()`**](https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/data/multilabel.py)
- **OneVsRestClassifier** fits a separate classifier for each class.

In [33]:
from sklearn.multiclass import OneVsRestClassifier
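
A minimal sketch of `OneVsRestClassifier` wrapping a logistic regression. The iris dataset (3 classes) is an assumption for illustration; the course applies the same wrapper to its own budget labels.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

print(len(clf.estimators_))               # 3: one binary classifier per class
print(clf.predict_proba(X[:2]).round(3))  # one probability column per class
```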

### NLP tools & tips

- **`CountVectorizer()`** - Given a collection of text documents, this tokenizes them, builds a vocabulary, and calculates per-document token counts: the "bag-of-words" representation. (I did this shit 18 years ago with Perl.)
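
A minimal sketch of the bag-of-words counts on a two-document toy corpus (the sentences are an assumption for illustration).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]  # toy corpus (an assumption)

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix: one row per document

print(sorted(vec.vocabulary_))  # ['cat', 'mat', 'on', 'sat', 'the']
print(counts.toarray())
# [[1 0 0 1 1]
#  [1 1 1 1 2]]
```

Columns follow the sorted vocabulary, so the `2` in the second row is the count of "the". Word order is discarded, hence "bag" of words.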

## Chapter 3: Improving your model
- [Slides](slides/Machine Learning with the Experts - School Budgets/ch3_slides.pdf)

- More Pipelines practice

## Chapter 4: Learning from the experts
- [Slides](slides/Machine Learning with the Experts - School Budgets/ch4_slides.pdf)

A number of tricks (optimization methods) were used by the [DrivenData competition winner](https://www.drivendata.org/competitions/4/box-plots-for-education/).
- NLP:
  - n-gram usage
  - Tokenization method
  - Normalization (feature scaling)
    - scikit-learn's [`MaxAbsScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html)
  - [**Dimensionality reduction**](https://en.wikipedia.org/wiki/Dimensionality_reduction)
    - scikit-learn's [`SelectKBest`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html), which uses a [chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test)
- Stats:
  - **Interaction terms**: A statistical method that lets your model express what happens if two features appear together.
    - Implemented in scikit-learn's [`PolynomialFeatures`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).
    - Course uses custom [`SparseInteractions`](https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py) to add support for sparse matrices to save memory.
    - Equation:
  \begin{equation*}
\beta_1 x_1 + \beta_2 x_2 + \beta_3(x_1 * x_2)
\end{equation*}
- Computation:
  - "**Hashing trick**" to reduce memory consumption with negligible loss in accuracy.
    - sklearn's [`HashingVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) (used in place of `CountVectorizer`)
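
A minimal sketch of two of the tricks listed above, interaction terms and the hashing trick, using stock scikit-learn classes. The toy inputs and `n_features=16` are assumptions for illustration; the course's custom `SparseInteractions` is not used here.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import PolynomialFeatures

# Interaction terms: keep x1 and x2 and add the product x1 * x2
X = np.array([[2.0, 3.0]])
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)
print(interactions.fit_transform(X))  # [[2. 3. 6.]]

# Hashing trick: tokens are hashed straight to column indices, so no
# vocabulary is stored and the output width is fixed up front
docs = ["bus fuel", "bus repair and fuel"]  # toy corpus (an assumption)
hashed = HashingVectorizer(n_features=16).transform(docs)
print(hashed.shape)  # (2, 16)
```

`HashingVectorizer` needs no `fit` step because it keeps no state, which is where the memory saving comes from. The price is that different tokens can collide in the same column.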
