# Scikit-learn concepts

Scikit-learn is a very popular and well-written/well-documented library for doing classical machine learning in Python

There are two key concepts you need to learn in order to use scikit-learn:

- Data representation
- The estimator API

After learning these concepts you will be able to use scikit-learn as a laboratory for experimenting with different algorithms

The possible uses for this laboratory are:

- Learning and understanding a new ML algorithm
- Exploring properties of a new dataset
- Establishing a baseline model before trying something with higher complexity
- Preprocessing data for use in more advanced algorithms and workflows

In addition to these notes, we recommend the ["Introduction to Scikit-Learn" chapter](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html) of Jake VanderPlas's excellent, free book "Python Data Science Handbook"

Some of the content in this section and others (will be noted) is borrowed from Jake's notes -- see the [license](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/LICENSE-TEXT) for his materials

## Data representation

For both supervised and unsupervised learning tasks, you must hand scikit-learn a 2-dimensional dataset containing one or more samples of one or more features

We'll call this the feature or input matrix

Each row of the feature matrix is treated as an observation and each column is a feature or variable

**Example**: In the housing price example, suppose we wanted to use `sqft_living`, `bedrooms`, and `bathrooms` as features in order to predict `price`. In this case our feature matrix `X` would look like

$$X=\begin{bmatrix} 
\text{sqft_living}_1 & \text{bedrooms}_1 & \text{bathrooms}_1 \\
\text{sqft_living}_2 & \text{bedrooms}_2 & \text{bathrooms}_2 \\
\vdots & \vdots & \vdots \\
\text{sqft_living}_N & \text{bedrooms}_N & \text{bathrooms}_N
\end{bmatrix}$$

where the subscripts denote a sample number and we have $N$ samples
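As a minimal sketch of what such a feature matrix looks like in code, the block below builds one with pandas. The numbers are made up for illustration and the names `df`, `X_housing`, and `X_array` are hypothetical; only the column names come from the example above.

```python
import pandas as pd

# Hypothetical housing data: the values are made up for illustration
df = pd.DataFrame(
    {
        "sqft_living": [1180, 2570, 770, 1960],
        "bedrooms": [3, 3, 2, 4],
        "bathrooms": [1.0, 2.25, 1.0, 3.0],
        "price": [221900, 538000, 180000, 604000],
    }
)

# Select the feature columns: one row per sample, one column per feature
X_housing = df[["sqft_living", "bedrooms", "bathrooms"]]
print(X_housing.shape)  # (4, 3), i.e. (N samples, 3 features)

# The same data as a 2-dimensional NumPy array
X_array = X_housing.to_numpy()
print(X_array.ndim)  # 2
```

Either the DataFrame or the NumPy array can be handed to scikit-learn as the feature matrix.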

### Labels

The labels or targets for a supervised learning task should be provided as either a 1-dimensional array or a pandas Series

In the housing price example we would have

$$
y = \begin{bmatrix}
\text{price}_1 \\
\text{price}_2 \\
\vdots \\
\text{price}_N
\end{bmatrix}
$$

where again subscripts denote sample numbers and we have $N$ samples

If we had more than one target variable (suppose we want to predict both the house price and estimated property tax) we would stack the labels side by side as columns

For example

$$
y = \begin{bmatrix}
\text{price}_1 & \text{tax}_1 \\
\text{price}_2 & \text{tax}_2\\
\vdots & \vdots \\
\text{price}_N & \text{tax}_N
\end{bmatrix}
$$
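The two layouts of `y` can be sketched as follows. The values and the `tax` column are made up for illustration, and the names `y_single` and `y_multi` are hypothetical.

```python
import pandas as pd

# Hypothetical values, made up for illustration
df = pd.DataFrame(
    {
        "price": [221900, 538000, 180000, 604000],
        "tax": [2100, 5200, 1700, 6100],
    }
)

# Single target: a 1-dimensional pandas Series of length N
y_single = df["price"]
print(y_single.shape)  # (4,)

# Multiple targets: one column per target, shape (N, number of targets)
y_multi = df[["price", "tax"]].to_numpy()
print(y_multi.shape)  # (4, 2)
```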

## Estimator API

One of the biggest benefits of learning to use scikit-learn is that once you understand the core API you can very quickly try out different algorithms and estimators

**Disclaimer**: the content below is my own framing: a summary of how I think about and use scikit-learn. It hopefully lines up with the documentation and other resources, but I make no promises!

The scikit-learn estimator API has two main types of actor (class):

1. Transformer: Takes a feature matrix `X` and returns a transformed feature matrix
2. Predictor: Takes a feature matrix `X` and returns a predicted value/label/class/cluster
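One rough way to see the distinction is to check which methods each kind of object exposes. The sketch below uses `StandardScaler` as the transformer and `LinearRegression` as the predictor; the variable names are hypothetical.

```python
from sklearn import linear_model, preprocessing

scaler_example = preprocessing.StandardScaler()
linreg_example = linear_model.LinearRegression()

# Both kinds of estimator can be fit
print(hasattr(scaler_example, "fit"), hasattr(linreg_example, "fit"))  # True True

# Only the transformer has .transform
print(hasattr(scaler_example, "transform"), hasattr(linreg_example, "transform"))  # True False

# Only the predictor has .predict
print(hasattr(scaler_example, "predict"), hasattr(linreg_example, "predict"))  # False True
```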

### Transformers

As an example transformer, consider the operation of standardizing your dataset:

Let $x^i$ represent the $i$th column of the matrix `X`. To standardize $x^i$ we would subtract its mean and divide by its standard deviation: $\tilde{x}^i_j = \frac{x^i_j - \text{mean}(x^i)}{\text{std}(x^i)}$

The fitting process for this transformer would be to compute the mean and standard deviation of each feature for later usage

The scikit-learn class `preprocessing.StandardScaler` implements this procedure

In order to fit the transformer we use the `.fit` method

In order to transform a feature matrix `X` we then call the `.transform` method

Let's see an example

In [8]:
from sklearn import preprocessing
import numpy as np

scaler = preprocessing.StandardScaler()

X = np.arange(12.0).reshape((4, 3))
print("X =\n", X)

In [10]:
X.mean(axis=0)

array([4.5, 5.5, 6.5])

In [11]:
X.std(axis=0)

array([3.35410197, 3.35410197, 3.35410197])

In [12]:
scaler.fit(X)  # will compute mean and std of each column
transformed_X = scaler.transform(X)

print("transformed_X =\n", transformed_X)

transformed_X =
 [[-1.34164079 -1.34164079 -1.34164079]
 [-0.4472136  -0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079  1.34164079]]


In [13]:
transformed_X.mean(axis=0)  # should be 0

array([0., 0., 0.])

In [14]:
transformed_X.std(axis=0)  # should be 1

array([1., 1., 1.])
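To make the "compute now, use later" idea concrete, here is a small sketch that inspects what the scaler stored during `.fit`. It rebuilds the same `X` as above so that it runs on its own; `manual` is a hypothetical name for the hand-computed result.

```python
import numpy as np
from sklearn import preprocessing

X = np.arange(12.0).reshape((4, 3))
scaler = preprocessing.StandardScaler().fit(X)

# Statistics learned during .fit (a trailing underscore marks fitted attributes)
print(scaler.mean_)   # column means
print(scaler.scale_)  # column standard deviations

# .transform reuses the stored statistics; this reproduces it by hand
manual = (X - scaler.mean_) / scaler.scale_
print(np.allclose(manual, scaler.transform(X)))  # True
```

Because `.transform` only reuses the stored `mean_` and `scale_`, it never recomputes statistics from the data passed to it.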

**Check for understanding**: With your neighbor do the following:

- Use `scaler` to transform the variable `new_X` defined in the code cell below
    - Check the mean and standard deviation along `axis=0` (as we did above) for both `new_X` and the transformed version
    - Does the transformed version of `new_X` have mean 0 and std 1? Why or why not?
- Repeat the above, but this time transform `new_X` with the method `scaler.fit_transform`
    - Check the mean and standard deviation
    - Does this transformed version of `new_X` have mean 0 and std 1? Why or why not? Explain

In [17]:
new_X = np.random.randn(100, 3)

# your code here

### Predictors

Predictors are the actual machine learning algorithms you've likely heard of: linear regression, decision trees, neural networks, etc.

Like transformers, they have a `.fit` method that takes as input the feature matrix `X` and the target vector `y` and fits the machine learning model

Unlike transformers there is not a `.transform` method, but rather a `.predict` method

`.predict` takes as an input a feature matrix `X` and returns a predicted target, label, or cluster

We'll dive in more when we get to real datasets, but for now we'll show you how to make the basic linear regression predictor

In [22]:
from sklearn import linear_model

y = X[:, 1] + 2 * X[:, 0] - X[:, 2]
linreg = linear_model.LinearRegression()
linreg.fit(X, y)
linreg.predict(X)

array([-1.,  5., 11., 17.])
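Because every predictor shares the same `.fit`/`.predict` interface, swapping algorithms takes very little code. The sketch below reuses the toy `X` and `y` from above and adds `tree.DecisionTreeRegressor` as a second predictor; that choice, and the `results` dictionary, are illustrative assumptions rather than part of the original example.

```python
import numpy as np
from sklearn import linear_model, tree

X = np.arange(12.0).reshape((4, 3))
y = X[:, 1] + 2 * X[:, 0] - X[:, 2]

# Two different algorithms, one interface
results = {}
for predictor in [linear_model.LinearRegression(), tree.DecisionTreeRegressor()]:
    predictor.fit(X, y)                       # same call for every predictor
    name = type(predictor).__name__
    results[name] = predictor.predict(X)      # same call for every predictor
    print(name, results[name])
```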

**Check for understanding**: With your neighbor do the following:

- Come up with a new target vector and name it `y2` -- this can be anything you want
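- Fit `linreg` to `X` and `y2` with `.fit`, then call `.predict` on `X` and compare the predictions to `y2` (note that `LinearRegression` has no `.fit_predict` method; that method is provided by clustering estimators such as `cluster.KMeans`)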
- Discuss with your neighbor

### Pipelines

It is very common in scikit-learn user code to see one or more transformers applied before sending the transformed data to a predictor

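In order to do this properly the user would have to call `.fit` on all transformers with training data, call `.transform` on each of them in turn, and then call `.fit` on the predictor using the transformed data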

In order to make this convenient and reproducible for users scikit-learn has the notion of a **pipeline**

The basic structure of the pipeline is:

```python
[
    transformer1,
    transformer2,
    ...,
    transformerN,
    predictor
]
```

Once you have a pipeline you can call `.fit` on the pipeline and scikit-learn will do the following:

```python
X1 = transformer1.fit_transform(X)
X2 = transformer2.fit_transform(X1)
...
XN = transformerN.fit_transform(XN_minus_1)  # XN_minus_1: output of the previous transformer
predictor.fit(XN, y)
```

You can then call `.predict` on the pipeline and sklearn will do the following:

```python
X1 = transformer1.transform(X)
X2 = transformer2.transform(X1)
...
XN = transformerN.transform(XN_minus_1)  # XN_minus_1: output of the previous transformer
output = predictor.predict(XN)
```

and return `output` for you.

Notice:

- All pipeline steps are `.fit` when you call `.fit`
- When `.fit`ting or `.predict`ting, the pipeline will call `.transform` on all transformer layers using the data you pass in

There are a few ways to make a pipeline, but my favorite is as follows:

In [25]:
from sklearn import pipeline, linear_model

model = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LinearRegression()
)

model.fit(X, y)

model.predict(X)

array([-1.,  5., 11., 17.])
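As a rough check of the fit and predict steps described earlier, the sketch below runs the same two-step model once through a pipeline and once by hand, then compares the outputs. It rebuilds `X` and `y` so that it runs on its own; `pipeline_output` and `manual_output` are hypothetical names.

```python
import numpy as np
from sklearn import linear_model, pipeline, preprocessing

X = np.arange(12.0).reshape((4, 3))
y = X[:, 1] + 2 * X[:, 0] - X[:, 2]

# Pipeline version
model = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LinearRegression(),
)
model.fit(X, y)
pipeline_output = model.predict(X)

# make_pipeline names each step after its lowercased class name
print(list(model.named_steps))  # ['standardscaler', 'linearregression']

# Manual version of the same steps
scaler = preprocessing.StandardScaler()
linreg = linear_model.LinearRegression()
X1 = scaler.fit_transform(X)                         # fit step: fit, then transform
linreg.fit(X1, y)
manual_output = linreg.predict(scaler.transform(X))  # predict step: transform only

print(np.allclose(pipeline_output, manual_output))  # True
```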

**Check for understanding**: Look at the routines available in the `preprocessing` module. Try including more than one transformer in your pipeline. Does it work?