<a href="https://colab.research.google.com/github/scaomath/wustl-math450/blob/main/Lectures/Math_450_Notebook_9_(Overfit).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding lecture 9 of Math 450

Things you need to know:

## Python
- Dictionary
- Generator, iterator, `iter()`, `next()`, `enumerate()`, `try: except:` flow control.
- Matrix-vector multiplications and "broadcastability".


## PyTorch
- `loss.backward()` vs hand computation.
- Why `with torch.no_grad():` is necessary in manual gradient descent computation and inference.
- Build a simple neural network using `torch.nn.Sequential()`.
- Build a complex neural network such as a CNN using the `torch.nn.Module` class: the constructor, inheritance, and the usage of `super`.
- Hand-compute gradient descent for a binary classification problem using torch `DataLoader` interface for (mini-batch) SGD.
- How to implement the parameter update on top of autograd using the `Optimizer` class.
- Convolutional neural network (CNN, to be learned)

## Project
- Implement an optimizer to train a CNN to classify handwritten Japanese characters.
- Tune it using validation to achieve reasonable accuracy.


## Today
- Overfit and validation

## Data fitting

For a set of data $\{(x^{(i)},y^{(i)})\}_{i=1}^{N}$, the NN model only "fits" the data roughly, not precisely. Yet we can achieve reasonably good accuracy with it.

## Types of data fitting

- **Interpolation**: suppose we know $n+1$ distinct grid points
$x_0, x_1, x_2, \dots, x_n$ in an interval $[a, b]$, and the values at each of these
points, $f_k = f(x_k)$, but we do not know $f$'s analytical expression. Then the interpolation problem is to find an approximation $h(x)$, defined at any point $x \in [a, b]$, that **coincides** with $f(x)$ at every $x_k$.

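- **Regression**: alternatively, we can consider a regression model that minimizes the mean square error $\dfrac{1}{n+1}\displaystyle\sum_{k=0}^{n}\big(f(x_k) - h(x_k; W)\big)^2$, where $h(x_k; W)$ is the NN's output with parameters $W$. Unlike interpolation, $h$ does not need to coincide with $f$ at the points $x_k$.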
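
The difference between the two types of fitting can be seen in a minimal sketch. The sample function $\sin(x)$, the 5 grid points, and the use of `np.polyfit` are assumptions made for illustration only; they are not part of the lecture's example.

```python
import numpy as np

# Hypothetical data for illustration: n + 1 = 5 samples of f(x) = sin(x).
x_nodes = np.linspace(0.0, 3.0, 5)
f_nodes = np.sin(x_nodes)

# Interpolation: a degree-n polynomial through n + 1 distinct points
# reproduces the data at the grid points (up to round-off).
interp_coeffs = np.polyfit(x_nodes, f_nodes, deg=len(x_nodes) - 1)
interp_residual = np.max(np.abs(np.polyval(interp_coeffs, x_nodes) - f_nodes))

# Regression: a lower-degree polynomial (here a line) only minimizes the
# mean square error and generally does not pass through the data points.
reg_coeffs = np.polyfit(x_nodes, f_nodes, deg=1)
reg_mse = np.mean((np.polyval(reg_coeffs, x_nodes) - f_nodes) ** 2)

print(f"max interpolation residual at the nodes: {interp_residual:.2e}")
print(f"regression MSE at the nodes:             {reg_mse:.2e}")
```

The interpolant has a residual at round-off level at the grid points, while the regression line keeps a visible error there.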


## Tools to use
Today we will borrow some tools from the `scikit-learn` package.

Reference: Adapted from [https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html) to be more readable.

## Data

`X_train`, `y_train` are our training data. In the following example, we have 10 training samples.

## Model
We will use the linear regression in the `scikit-learn` package to fit not just a linear function but a polynomial function of any degree, e.g. $h(x) = w_{10} x^{10} + w_9 x^9 + \dots + w_1 x + b$, to the data.

Remark: for those who are interested, adding the powers $x^p$ as features means that the linear regression essentially uses a Vandermonde matrix as its design matrix.
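
The remark can be checked with a short sketch. The sample points and the degree below are hypothetical values chosen for illustration; `np.vander` with `increasing=True` is used only as a reference to compare against.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([0.0, 0.5, 1.0, 2.0])  # hypothetical sample points
degree = 3

# Columns produced by PolynomialFeatures for a single input feature:
# 1, x, x^2, x^3 (the bias column of ones is included by include_bias=True).
poly = PolynomialFeatures(degree=degree, include_bias=True)
features = poly.fit_transform(x.reshape(-1, 1))

# Vandermonde matrix with increasing powers of x along the columns.
vandermonde = np.vander(x, N=degree + 1, increasing=True)

print(features)
print("same as Vandermonde matrix:", np.allclose(features, vandermonde))
```

Each row of the feature matrix holds the powers of one sample, so fitting a polynomial reduces to a linear least-squares problem in the weights.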

## Validation
We choose a set of testing points and check whether our model (built from only 10 noisy samples) approximates the true function $x^2$ to a reasonable accuracy.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.pipeline import Pipeline # chains feature generation and regression, handy for testing high-degree polynomials
from sklearn.preprocessing import PolynomialFeatures # evaluating polynomials at points
from sklearn.linear_model import LinearRegression # we have used this before

In [None]:
np.random.seed(42)
X_train = np.linspace(0,2,10)
# true function is x^2, adding some noise
true_function = lambda x: x**2
y_train = true_function(X_train) + np.random.normal(0,0.5, size=10)
plt.scatter(X_train, y_train, s=40, alpha=0.8);

In [None]:
# linear regression
poly_degree = 1
polynomial_features = PolynomialFeatures(degree=poly_degree, include_bias=True)
linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])

In [None]:
pipeline.fit(X_train.reshape(-1,1), y_train)

In [None]:
num_samples = 100
X_test = np.linspace(0, 2, num_samples) # these are the testing points
y_pred = pipeline.predict(X_test.reshape(-1,1)) 
y_true = true_function(X_test)
error = np.mean((y_pred - y_true)**2)

plt.figure(figsize=(8,6))
plt.plot(X_test, y_pred, linewidth = 2, label="Model's prediction")
plt.plot(X_test, y_true, linewidth = 2, label="True function")
plt.scatter(X_train, y_train, edgecolor='b', s=40, label="Training samples")
plt.legend(loc='best', fontsize = 'x-large')
plt.title(f"Mean Square Error = {error:.2e}", fontsize = 'xx-large')
plt.show()

# What if we increase the degree?
Try increasing the degree gradually in `PolynomialFeatures()`. Since we have packed `PolynomialFeatures()` and `LinearRegression()` into one `Pipeline` object, we can change the degree through the pipeline itself.

In [None]:
# now we use the pipeline to change the degree of polynomial_features directly, without redefining it
# this is simpler than the pipeline usage in scikit-learn's example
pipeline.set_params(polynomial_features__degree=3)
pipeline.fit(X_train.reshape(-1,1), y_train)

## validation
num_samples = 100
X_test = np.linspace(0, 2, num_samples) # these are the testing points
y_pred = pipeline.predict(X_test.reshape(-1,1)) # these are the values predicted by the model
y_true = true_function(X_test)
error = np.mean((y_pred - y_true)**2)

plt.figure(figsize=(8,6))
plt.plot(X_test, y_pred, linewidth = 2, label="Model's prediction")
plt.plot(X_test, y_true, linewidth = 2, label="True function")
plt.scatter(X_train, y_train, edgecolor='b', s=40, label="Training samples")
plt.legend(loc='best', fontsize = 'x-large')
plt.title(f"Mean Square Error = {error:.2e}", fontsize = 'xx-large');
plt.show()
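
To compare several degrees at once, the experiment above could be repeated in a loop. The following is a minimal sketch under the same setup (10 noisy samples of $x^2$, 100 testing points); the dictionaries `train_errors` and `test_errors` are names introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same setup as above: 10 noisy samples of x^2 on [0, 2].
np.random.seed(42)
X_train = np.linspace(0, 2, 10)
y_train = X_train**2 + np.random.normal(0, 0.5, size=10)
X_test = np.linspace(0, 2, 100)
y_true = X_test**2

pipeline = Pipeline([("polynomial_features", PolynomialFeatures(include_bias=True)),
                     ("linear_regression", LinearRegression())])

train_errors, test_errors = {}, {}
for degree in range(1, 10):
    # Change only the degree; the rest of the pipeline is reused.
    pipeline.set_params(polynomial_features__degree=degree)
    pipeline.fit(X_train.reshape(-1, 1), y_train)
    train_pred = pipeline.predict(X_train.reshape(-1, 1))
    test_pred = pipeline.predict(X_test.reshape(-1, 1))
    # Training error: against the noisy samples; testing error: against x^2.
    train_errors[degree] = np.mean((train_pred - y_train) ** 2)
    test_errors[degree] = np.mean((test_pred - y_true) ** 2)
    print(f"degree {degree}: train MSE = {train_errors[degree]:.2e}, "
          f"test MSE = {test_errors[degree]:.2e}")
```

The training error cannot increase with the degree, because each polynomial space contains the previous one, and at degree 9 the polynomial interpolates all 10 noisy samples. A small training error alone therefore says little about how well the model approximates the true function, which is why the error on the testing points is tracked separately.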

# Metric for regression problem

Coefficient of determination $R^2$
$$
R^2\Big(\mathbf{y}^{\text{Actual}}, \mathbf{y}^{\text{Pred}}\Big) = 1 - \frac{\displaystyle\sum_{i=1}^{n_{\text{test}}} \left(y^{(i),\text{Actual}} - y^{(i),\text{Pred}}\right)^2}{\displaystyle\sum_{i=1}^{n_\text{test}} (y^{(i),\text{Actual}} - \bar{y}^{\text{Actual}})^2}
\quad 
\text{ where }\; \bar{y}^{\text{Actual}} = \displaystyle\frac{1}{n_{\text{test}}} 
\sum_{i=1}^{n_\text{test}} y^{(i),\text{Actual}}
$$
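
As a sanity check of the formula, a minimal sketch can compute $R^2$ by hand and compare it with `r2_score` from `sklearn.metrics`. The helper name `r2_manual` and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_actual, y_pred):
    """Coefficient of determination computed directly from the formula."""
    ss_res = np.sum((y_actual - y_pred) ** 2)            # numerator
    ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)  # denominator
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted values.
y_actual = np.array([0.0, 1.0, 4.0, 9.0])
y_pred = np.array([0.5, 1.5, 3.5, 8.0])

r2_by_hand = r2_manual(y_actual, y_pred)
r2_sklearn = r2_score(y_actual, y_pred)
print(f"manual R^2 = {r2_by_hand:.4f}, sklearn R^2 = {r2_sklearn:.4f}")
```

A perfect prediction gives $R^2 = 1$, predicting the mean $\bar{y}^{\text{Actual}}$ everywhere gives $R^2 = 0$, and a model worse than the mean gives a negative value.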