# Reminders on scikit-learn and machine learning (correction)

Inspired by Xavier Dupré's course

A few simple exercises on *scikit-learn*. The notebook is long for newcomers to machine learning and probably holds few surprises for those who already have some experience.

In [1]:
%matplotlib inline

## Synthetic dataset

We simulate a random dataset drawn from the uniform distribution $\mathcal{U}_{(0,1)}$.

In [2]:
from numpy import random
n = 1000
X = random.rand(n, 2)
X[:5]

array([[0.02037951, 0.30968762],
       [0.29413   , 0.37153774],
       [0.41002219, 0.33296235],
       [0.86395508, 0.87805471],
       [0.94593263, 0.3296339 ]])

Let's create a starting model: $Y = 3 X_1 - 2 X_2^2 + \epsilon$.

We'll need to approximate $Y$ using the descriptors $X_1$ and $X_2$.

$\epsilon \sim \mathcal{U}_{(0,1)}$ is a noise term that cannot be predicted from the descriptors.

In [3]:
y = X[:, 0] * 3 - 2 * X[:, 1] ** 2 + random.rand(n)
y[:5]

array([-2.99442799e-03,  1.27015060e+00,  1.85177846e+00,  1.78228679e+00,
        3.38708163e+00])

## Exercise 1: splitting into training and test sets

We need to test our model on data different from the data used for training **in order to measure how well it generalizes**. As we have seen, the empirical risk on a given dataset is not representative of the true risk, and the model may overfit the training set.

In our case, we want the model to learn the law $3 X_1 - 2 X_2^2$, and **overfitting would amount to memorizing the noise vector $\epsilon$**, which only accounts for variations in $Y$ that are independent of the descriptors.

Simply use [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [4]:
# to fill
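
One possible way to fill this cell, as a minimal sketch. The data is regenerated with a fixed seed so that the block runs on its own; the seed and `test_size=0.25` are assumptions, not values imposed by the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Regenerate the synthetic data with a fixed seed (the seed is an assumption).
rng = np.random.RandomState(0)
n = 1000
X = rng.rand(n, 2)
y = 3 * X[:, 0] - 2 * X[:, 1] ** 2 + rng.rand(n)

# Hold out 25% of the rows for testing (proportion chosen arbitrarily here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing models across cells.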

## Exercise 2: learn a linear regression

Find the weights $\theta = \begin{pmatrix}
           \theta_{0} \\
           \theta_{1} \\
           \theta_{2}
         \end{pmatrix}$ that solve $\underset{\theta}{\arg\min} \sum_{i=1}^{n}|Y_i-f_{\theta}(\mathbf{X}_i)|^2$

where $f_{\theta}(\mathbf{X}) = \theta_0 + \sum_{d=1}^{D}\theta_d X_d$, with $D=2$ in our case.

Calculate the coefficient $R^2$:
$$R^2=1-\frac{\sum_{i=1}^{n}|Y_i-f_{\theta}(\mathbf{X}_i)|^2}{\sum_{i=1}^{n}|Y_i-\overline{Y}|^2}$$

where $\mathbf{X} = \begin{pmatrix}
           X_{1} \\
           X_{2}
         \end{pmatrix}$ and $\overline{Y}=\frac{1}{n}\sum_{i=1}^{n}Y_i$.
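
To make the formula concrete, the sketch below computes $R^2$ by hand on four made-up values (the numbers are purely illustrative, not taken from the dataset) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only, not taken from the dataset above.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

Here the residual sum of squares is 0.10 and the total sum of squares is 5, so both values are 0.98 up to floating-point rounding.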

Use: [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), [r2_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html).

In [None]:
# to fill
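
A possible solution, sketched under the same assumptions as before (data regenerated with an arbitrary fixed seed, 25% of the rows held out). The variable names `lin` and `r2_lin` are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Same synthetic data as in the notebook, with a fixed seed (assumption).
rng = np.random.RandomState(0)
X = rng.rand(1000, 2)
y = 3 * X[:, 0] - 2 * X[:, 1] ** 2 + rng.rand(1000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Ordinary least squares on the raw descriptors X_1 and X_2.
lin = LinearRegression()
lin.fit(X_train, y_train)

# R2 measured on data the model has not seen.
r2_lin = r2_score(y_test, lin.predict(X_test))
print("intercept:", lin.intercept_)
print("coefficients:", lin.coef_)
print("R2 on the test set:", r2_lin)
```

The coefficient of $X_1$ should come out close to 3. The model can only approximate the $X_2^2$ term with a straight line, which is what Exercise 3 addresses.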

## Exercise 3: improve the model by applying a well-chosen transformation

The starting model is: $Y = 3 X_1 - 2 X_2^2 + \epsilon$. Simply add polynomial features with [PolynomialFeatures](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).

With the parameter:
```python
degree=2
```
the initial descriptor set $\mathbf{X} = \begin{pmatrix} X_{1} \\ X_{2} \end{pmatrix}$
becomes $\mathbf{X} = \begin{pmatrix} 1 \\ X_{1} \\ X_{2} \\ X_{1}^2 \\ X_{1}X_{2} \\ X_{2}^2 \end{pmatrix}$, which gives:

$$f_{\theta}(\mathbf{X}) = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \theta_3 X_1^2 + \theta_4 X_1X_2 + \theta_5 X_2^2$$

In [None]:
# to fill
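
A possible solution, as a sketch. Chaining the two steps with `make_pipeline` is a choice made here, not something the exercise requires; the seed and the split proportion are again assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic data as in the notebook, with a fixed seed (assumption).
rng = np.random.RandomState(0)
X = rng.rand(1000, 2)
y = 3 * X[:, 0] - 2 * X[:, 1] ** 2 + rng.rand(1000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Expand (X_1, X_2) into all monomials of degree <= 2, then fit least squares.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_train, y_train)

r2_poly = r2_score(y_test, poly_model.predict(X_test))
print("R2 on the test set with degree-2 features:", r2_poly)
```

Because the expanded feature set contains $X_2^2$, the true law now lies inside the model family, and the remaining error is essentially the noise $\epsilon$.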

## Exercise 4: learn a random forest

Use: [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

In [10]:
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
# fit the regressor on the training set
clf.fit(X_train, y_train)
# R2 score on the test set
clf.score(X_test, y_test)


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

## Exercise 5: A bit of math

Compare the two models on the following data. What do you notice, and why?

In [13]:
X_test2 = random.rand(n, 2) + 0.5
y_test2 = X_test2[:, 0] * 3 - 2 * X_test2[:, 1] ** 2 + random.rand(n)

In [None]:
# to fill
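
A possible way to run the comparison, as a sketch. It assumes that "the two models" are the degree-2 polynomial regression and the random forest; the seed, the split sizes and `n_estimators=100` are also assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)

# Training data on [0, 1]^2, as in the rest of the notebook.
X_train = rng.rand(750, 2)
y_train = 3 * X_train[:, 0] - 2 * X_train[:, 1] ** 2 + rng.rand(750)

# Second dataset: same law, but descriptors shifted to [0.5, 1.5]^2.
X_test2 = rng.rand(1000, 2) + 0.5
y_test2 = 3 * X_test2[:, 0] - 2 * X_test2[:, 1] ** 2 + rng.rand(1000)

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
forest = RandomForestRegressor(n_estimators=100, random_state=0)
poly_model.fit(X_train, y_train)
forest.fit(X_train, y_train)

r2_poly_shift = r2_score(y_test2, poly_model.predict(X_test2))
r2_forest_shift = r2_score(y_test2, forest.predict(X_test2))
print("polynomial regression:", r2_poly_shift)
print("random forest:", r2_forest_shift)
```

A tree predicts the average of training targets that fall in a leaf, so a forest cannot output values outside the range seen during training. The polynomial model has learned the functional form and keeps following it beyond $[0,1]^2$.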

## Exercise 6: making a graph with...

Include the scatterplot of the first and second datasets, the predictions of the two models, a legend, a title... using [pandas](https://pandas.pydata.org/) or [matplotlib](https://matplotlib.org/) directly, as you wish.

In [None]:
# to fill
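
One possible layout among many, sketched with matplotlib. It plots predicted against observed values for both models, one panel per dataset. The helper `simulate`, the seed, and the choice of fresh samples rather than the train/test split are all assumptions made to keep the block self-contained.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)

def simulate(n, shift):
    """Draw X uniformly on [shift, shift + 1]^2 and build y from the starting model."""
    X = rng.rand(n, 2) + shift
    y = 3 * X[:, 0] - 2 * X[:, 1] ** 2 + rng.rand(n)
    return X, y

X_train, y_train = simulate(750, 0.0)
datasets = {
    "first dataset": simulate(250, 0.0),
    "second dataset (shifted by 0.5)": simulate(250, 0.5),
}
models = {
    "polynomial regression": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()
    ),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
for model in models.values():
    model.fit(X_train, y_train)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, (title, (X_new, y_new)) in zip(axes, datasets.items()):
    # One scatter per model: observed value on x, prediction on y.
    for label, model in models.items():
        ax.scatter(y_new, model.predict(X_new), s=8, alpha=0.5, label=label)
    # Diagonal showing where a perfect prediction would fall.
    lims = [y_new.min(), y_new.max()]
    ax.plot(lims, lims, color="black", linewidth=1, label="perfect prediction")
    ax.set_xlabel("observed y")
    ax.set_ylabel("predicted y")
    ax.set_title(title)
    ax.legend()
```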

## Exercise 7: Illustrating overfitting with a decision tree

As the complexity of the model increases, overfitting eventually appears. Conversely, the model using only $X_1$ and $X_2$ is not necessarily suited to the problem and underfits.

<img src="images/ex_over-underfitting.png"> 

Use the first dataset.

In [None]:
import pandas
from sklearn.tree import DecisionTreeRegressor

res = []
for md in range(1, 20):
    # to fill
    
    res.append(dict(profondeur=md, r2_train=r2_train, r2_test=r2_test))

df = pandas.DataFrame(res)
df.head(10)
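
A possible way to fill the loop, as a sketch. The data is regenerated with an arbitrary fixed seed, and `random_state=0` on the tree is an assumption added for reproducibility; the column name `profondeur` (depth) is kept from the template.

```python
import numpy as np
import pandas
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same synthetic data as in the notebook, with a fixed seed (assumption).
rng = np.random.RandomState(0)
X = rng.rand(1000, 2)
y = 3 * X[:, 0] - 2 * X[:, 1] ** 2 + rng.rand(1000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

res = []
for md in range(1, 20):
    # One tree per maximum depth; a deeper tree is a more complex model.
    tree = DecisionTreeRegressor(max_depth=md, random_state=0)
    tree.fit(X_train, y_train)
    r2_train = r2_score(y_train, tree.predict(X_train))
    r2_test = r2_score(y_test, tree.predict(X_test))
    res.append(dict(profondeur=md, r2_train=r2_train, r2_test=r2_test))

df = pandas.DataFrame(res)
print(df)
```

With deep trees the training score approaches 1 while the test score stays below it: the extra depth is spent fitting the noise.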

In [None]:
ax = df.plot(x='profondeur', y=['r2_train', 'r2_test'])
ax.set_title("R2 as a function of depth");

## Exercise 8: Increasing the number of features and regularizing a linear regression

The aim is to examine the impact of regularizing the coefficients of a linear regression as the number of features increases. We use polynomial features and a regularized linear regression: [Ridge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) or [Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html).

In [None]:
from sklearn.linear_model import Ridge, Lasso
import numpy.linalg as nplin
import numpy
import pandas

def coef_non_zero(coef):
    return sum(numpy.abs(coef) > 0.001)

res = []
for d in range(1, 21):
    # to fill

df = pandas.DataFrame(res)
df.head(21)
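
A possible way to fill the loop, as a sketch. Several choices here are assumptions: the seed, the regularization strengths `alpha=1.0` and `alpha=0.01`, `max_iter=10000` for the Lasso, the column names, and the restriction to degrees 1 to 10 (instead of 1 to 20) to keep the run short.

```python
import numpy as np
import pandas
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

def coef_non_zero(coef):
    # Count coefficients whose magnitude exceeds the notebook's 0.001 threshold.
    return int(np.sum(np.abs(coef) > 0.001))

# Same synthetic data as in the notebook, with a fixed seed (assumption).
rng = np.random.RandomState(0)
X = rng.rand(1000, 2)
y = 3 * X[:, 0] - 2 * X[:, 1] ** 2 + rng.rand(1000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

res = []
for d in range(1, 11):
    # The models fit their own intercept, so the constant column is dropped.
    poly = PolynomialFeatures(degree=d, include_bias=False)
    P_train = poly.fit_transform(X_train)
    P_test = poly.transform(X_test)
    row = dict(degree=d, nb_features=P_train.shape[1])
    models = [
        ("ridge", Ridge(alpha=1.0)),
        ("lasso", Lasso(alpha=0.01, max_iter=10000)),
    ]
    for name, model in models:
        model.fit(P_train, y_train)
        row["r2_test_" + name] = r2_score(y_test, model.predict(P_test))
        row["nnz_" + name] = coef_non_zero(model.coef_)
    res.append(row)

df = pandas.DataFrame(res)
print(df)
```

The L1 penalty of the Lasso drives many coefficients exactly to zero, whereas the L2 penalty of Ridge only shrinks them, which is what the count of non-zero coefficients is meant to show.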

Make two graphs, each as a function of the number of features in the model:

- Graph 1: number of features
- Graph 2: number of non-zero coefficients

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(12, 4))
# to fill