## **Overfitting, underfitting, bias-variance tradeoff**

*This notebook draws on several online articles that describe overfitting and underfitting* [1,2]

When developing a machine learning model, it is exciting to see the model perform well.

Unfortunately, strong performance on the training data alone does not guarantee strong performance in the real world. A model that fits its training data too closely may have overfit, and it will then generalize poorly to data it has not seen. Let's begin with a simple example of overfitting.

[1] https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html

[2] https://www.dataquest.io/blog/learning-curves-machine-learning/


In [None]:
# We've already met scikit-learn, the machine learning toolkit for Python.
# In our previous workshops, we used scikit-learn to build a regression model for some data.
# The recipe will be the same; we will build a linear regression model, but it will be based on contrived data.

# Begin by loading NumPy, Matplotlib, and scikit-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

In [None]:
# To demonstrate overfitting, we need a "true" function.  This is the sinusoidal function that we are trying to fit a model to.
def true_fun(X):
    return np.cos(1.5 * np.pi * X)

# Now we need to build our dataset.

# seed() ensures that we all see the same plots.  Change the seed if you want the data to come out differently.
np.random.seed(0)

# We are selecting 30 data points to fit to.
n_samples = 30

# We are going to choose some random points as input to the function
X = np.sort(np.random.rand(n_samples))
# Then we evaluate the function and we add some noise
y = true_fun(X) + np.random.randn(n_samples) * 0.1

plt.scatter(X,y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Samples from our Mystery function")
plt.show()

In [None]:
# Given our samples of {(x0,y0),(x1,y1),(x2,y2),...,(xn,yn)}, let's fit a model to it.

# Line model

# A linear model of degree 1 is a line
degree = 1
polynomial_features = PolynomialFeatures(degree=degree,include_bias=False)

# Linear Regression and the Pipeline will take care of the fitting, so we don't have to.  You can review the prior workshop for the math related to Linear Regression
# https://www.youtube.com/watch?v=OG8ZFDBt5f0&t=4s
linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
# This does the fit.  For any statistical model that you are developing, there will be a fit function.  It takes the X values (inputs) and the y values (known outputs).
pipeline.fit(X[:, np.newaxis], y)

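# This "scores" the model with 10-fold cross-validation.  scikit-learn reports the negative MSE (so that higher is better),
# which is why the plot title below negates the mean score.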
scores = cross_val_score(pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10)

X_test = np.linspace(0, 1, 100)
plt.scatter(X, pipeline.predict(X[:, np.newaxis]), label="Model")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.xlabel("x")
plt.ylabel("y")
plt.xlim((0, 1))
plt.ylim((-2, 2))
plt.legend(loc="best")
plt.title("Degree {}\nMSE = {:.2}(+/- {:.2})".format(degree, -scores.mean(), scores.std()))
plt.show()

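# The Mean Squared Error (MSE) is the average of the squared differences between the known outputs and the model's predictions.
# The value in the plot title is the cross-validated MSE, measured on folds that were held out from fitting.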
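
The definition of MSE can be checked by hand. The sketch below uses made-up numbers (`y_known` and `y_model` are hypothetical and are not taken from this notebook's data) and compares a direct NumPy calculation with scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical known outputs and model predictions, chosen only for illustration
y_known = np.array([1.0, 0.5, -0.2, -0.9])
y_model = np.array([0.8, 0.6, 0.0, -1.0])

# Squared differences are 0.04, 0.01, 0.04, 0.01, so their average is 0.025
mse_by_hand = np.mean((y_known - y_model) ** 2)

# scikit-learn computes the same quantity
mse_sklearn = mean_squared_error(y_known, y_model)

print(mse_by_hand, mse_sklearn)
```

Both calculations agree. `cross_val_score` with `scoring="neg_mean_squared_error"` reports this same quantity with the sign flipped, computed on each held-out fold.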


In [None]:
# A straight line didn't do a very good job.  That is disappointing, but not surprising.  Maybe a higher-order polynomial will do better.

# A polynomial model of degree 2
degree = 2
polynomial_features = PolynomialFeatures(degree=degree,include_bias=False)

linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
# This does the fit.  For any statistical model that you are developing, there will be a fit function.  It takes the X values (inputs) and the y values (known outputs).
pipeline.fit(X[:, np.newaxis], y)

# This "scores" the performance of the model
scores = cross_val_score(pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10)

X_test = np.linspace(0, 1, 100)
plt.scatter(X, pipeline.predict(X[:, np.newaxis]), label="Model")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.xlabel("x")
plt.ylabel("y")
plt.xlim((0, 1))
plt.ylim((-2, 2))
plt.legend(loc="best")
plt.title("Degree {}\nMSE = {:.2}(+/- {:.2})".format(degree, -scores.mean(), scores.std()))
plt.show()




In [None]:
# Degree 2 is better than degree 1.  Let's try a higher-order polynomial.

# INSERT YOUR FAVORITE POLYNOMIAL DEGREE HERE
myFavoritDegree = __


In [None]:

degree = myFavoritDegree
polynomial_features = PolynomialFeatures(degree=degree,include_bias=False)

linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
# This does the fit.  For any statistical model that you are developing, there will be a fit function.  It takes the X values (inputs) and the y values (known outputs).
pipeline.fit(X[:, np.newaxis], y)

# This "scores" the performance of the model
scores = cross_val_score(pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10)

plt.scatter(X, pipeline.predict(X[:, np.newaxis]), label="Model")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.xlabel("x")
plt.ylabel("y")
plt.xlim((0, 1))
plt.ylim((-2, 2))
plt.legend(loc="best")
plt.title("Degree {}\nMSE = {:.2}(+/- {:.2})".format(degree, -scores.mean(), scores.std()))
plt.show()
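
One way to see overfitting in numbers is to compare the error on the training samples with the cross-validated error as the degree grows. The sketch below is self-contained and rebuilds the same dataset with the same seed; the list of degrees (1, 2, 4, 15) is an assumption chosen for illustration, not a value prescribed by this notebook. One would typically expect the training error to keep shrinking while the cross-validated error eventually grows.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

def true_fun(X):
    return np.cos(1.5 * np.pi * X)

# Rebuild the same 30 noisy samples used in the cells above
np.random.seed(0)
n_samples = 30
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

results = {}
for degree in [1, 2, 4, 15]:  # illustrative degrees (an assumption)
    pipeline = Pipeline([
        ("polynomial_features", PolynomialFeatures(degree=degree, include_bias=False)),
        ("linear_regression", LinearRegression()),
    ])
    pipeline.fit(X[:, np.newaxis], y)

    # Error on the same points the model was fitted to
    train_mse = mean_squared_error(y, pipeline.predict(X[:, np.newaxis]))

    # Error on held-out folds; the scores are negative MSE, so flip the sign
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)
    cv_mse = -scores.mean()

    results[degree] = (train_mse, cv_mse)
    print("Degree {:2d}: training MSE = {:.2e}, cross-validated MSE = {:.2e}".format(
        degree, train_mse, cv_mse))
```

A large gap between the two columns is a common warning sign: the model reproduces the samples it has seen but does poorly on samples it has not.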




In [None]:
# Remember that a regression model builds a curve, so it can make predictions at points other than the known samples.
# Let's add some additional points into the mix and see how it does


# ADD SOME NEW POINTS TO SEE WHAT HAPPENS
newPoints = np.array([__,__,__,__])

In [None]:


degrees = [1, 2, myFavoritDegree]

plt.figure(figsize=(14, 5))

for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate the models using cross-validation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)

    plt.scatter(X, pipeline.predict(X[:, np.newaxis]), label="Model")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.scatter(newPoints, pipeline.predict(newPoints[:, np.newaxis]), label="New Points")    
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}".format(degrees[i]))

plt.show()




In [None]:
# Now we need to figure out what is really going on, because the predictions at our new points just don't seem to fit as well.

# Let's look at each regression model and see what function it created


degrees = [1, 2, myFavoritDegree]

plt.figure(figsize=(14, 5))

for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate the models using cross-validation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples") 
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}".format(degrees[i]))

plt.show()
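
The plots above hint at the bias-variance tradeoff named in the title: a low-degree model misses the shape of the true function (bias), while a high-degree model changes a lot depending on which samples it happened to see (variance). The following is a rough sketch of how both could be estimated by refitting on many freshly drawn datasets. The helper `bias_variance`, the number of repeats, the degrees, and the evaluation grid on [0.1, 0.9] are all assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def true_fun(X):
    return np.cos(1.5 * np.pi * X)

def bias_variance(degree, n_repeats=100, n_samples=30, noise=0.1, seed=0):
    """Hypothetical helper: estimate squared bias and variance of a polynomial fit."""
    rng = np.random.default_rng(seed)
    # Interior grid, so the estimate is not dominated by extrapolation at the edges
    x_grid = np.linspace(0.1, 0.9, 50)
    preds = np.empty((n_repeats, x_grid.size))

    for r in range(n_repeats):
        # Draw a fresh noisy dataset from the same true function
        X = rng.random(n_samples)
        y = true_fun(X) + rng.normal(scale=noise, size=n_samples)
        model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                              LinearRegression())
        model.fit(X[:, np.newaxis], y)
        preds[r] = model.predict(x_grid[:, np.newaxis])

    # Squared bias: how far the average prediction is from the true function
    bias_sq = np.mean((preds.mean(axis=0) - true_fun(x_grid)) ** 2)
    # Variance: how much predictions move from one dataset to the next
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in [1, 4, 15]}
for degree, (bias_sq, variance) in results.items():
    print("Degree {:2d}: bias^2 = {:.2e}, variance = {:.2e}".format(
        degree, bias_sq, variance))
```

Under these assumptions, the degree 1 model should show the largest squared bias (underfitting) and the degree 15 model the largest variance (overfitting).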






### **ROOM TO TAKE NOTES AND LESSONS LEARNED**








--

