---

_You are currently looking at **version 0.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the Jupyter Notebook FAQ course resource._

---

# Assignment 2

In this assignment you'll explore the relationship between model complexity and generalization performance, by adjusting key parameters of various supervised learning models. Part 1 of this assignment will look at regression and Part 2 will look at classification.

## Part 1 - Regression

In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10


X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

def intro():
    %matplotlib notebook

    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4);

intro()
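
As a quick check on the setup above: when no `test_size` is given, `train_test_split` holds out 25% of the samples, so the 15 noisy points are divided into 11 training points and 4 test points. The sketch below regenerates the same arrays and prints the resulting shapes; the `_demo` names are illustrative and not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Recreate the noisy sine data exactly as in the cell above
np.random.seed(0)
n_demo = 15
x_demo = np.linspace(0, 10, n_demo) + np.random.randn(n_demo) / 5
y_demo = np.sin(x_demo) + x_demo / 6 + np.random.randn(n_demo) / 10

# Default split: 75% train, 25% test (the test size is rounded up)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(x_demo, y_demo, random_state=0)
print(Xtr_demo.shape, Xte_demo.shape)
```

With so few test points, the test $R^2$ scores in the questions below can change sharply from one degree to the next.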


### Question 1

Write a function that fits a polynomial LinearRegression model on the *training data* `X_train` for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. `np.linspace(0,10,100)`) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.

<img src="assets/polynomialreg1.png" style="width: 1000px;"/>

The figure above shows the fitted models plotted on top of the original data (using `plot_one()`).

<br>
*This function should return a numpy array with shape `(4, 100)`*
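
To make the feature expansion concrete, here is a minimal sketch of what `PolynomialFeatures` produces for a one-dimensional input. The arrays `x_small` and `x_new` are made-up values for illustration. The transformer is fitted once on the training input and then reused with `transform` on new points.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two illustrative training points (hypothetical values)
x_small = np.array([2.0, 3.0])

# degree=3 without the constant column: each x becomes [x, x^2, x^3]
poly_demo = PolynomialFeatures(degree=3, include_bias=False)
train_feats = poly_demo.fit_transform(x_small.reshape(-1, 1))

# New points are expanded with the already fitted transformer
x_new = np.array([[10.0]])
new_feats = poly_demo.transform(x_new)

print(train_feats)
print(new_feats)
```

The 1-D input has to be reshaped to a column with `reshape(-1, 1)` because scikit-learn transformers expect a 2-D array of shape `(n_samples, n_features)`.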

In [3]:
def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    
    # Iteration to generate the polynomial for every degree
    degrees = [1,3,6,9]
    results = []
    for i in degrees:

        # Polynomial transformation of the features
        # Here we change the degree in every iteration
        poly = PolynomialFeatures(degree=i, include_bias=False)
        X_train_poly = poly.fit_transform(X_train.reshape(-1,1))

        # Linear Regression of the transformed polynomial features
        linreg = LinearRegression()
        linreg.fit(X_train_poly, y_train)

        # Predictions by the Regression model
        # We define the values to predict
        X_predict_input = np.linspace(0, 10, 100).reshape(-1,1)
        # Remember! The values to predict must be transformed with the same fitted transformer as the training features
        X_predict_input_poly = poly.transform(X_predict_input)

        # Predictions
        y_predicted = linreg.predict(X_predict_input_poly)

        # Appends the predictions for each degree
        results.append(y_predicted)

    # Transforms the list into an array, each row contains the predictions for each degree
    # It generates an array with shape (4, 100)
    results = np.array(results)
    

    return results

In [4]:
# feel free to use the function plot_one() to replicate the figure 
# from the prompt once you have completed question one
def plot_one(degree_predictions):
    plt.figure(figsize=(10,5))
    plt.plot(X_train, y_train, 'o', label='training data', markersize=10)
    plt.plot(X_test, y_test, 'o', label='test data', markersize=10)
    for i,degree in enumerate([1,3,6,9]):
        plt.plot(np.linspace(0,10,100), degree_predictions[i], alpha=0.8, lw=2, label='degree={}'.format(degree))
    plt.ylim(-1,2.5)
    plt.legend(loc=4)

plot_one(answer_one())



### Question 2

Write a function that fits a polynomial LinearRegression model on the training data `X_train` for degrees 0 through 9. For each model compute the $R^2$ (coefficient of determination) regression score on the training data as well as the test data, and return both of these arrays in a tuple.

*This function should return a tuple of numpy arrays `(r2_train, r2_test)`. Both arrays should have shape `(10,)`*
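
As a reminder of what the score measures, $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. The sketch below computes it by hand on made-up numbers and compares the result with `sklearn.metrics.r2_score`. The helper `r2_manual` is a hypothetical name used only here.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up targets and two sets of predictions
y_true_demo = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.1, 1.9, 3.2, 3.8]   # close to the targets
bad_pred = [4.0, 4.0, 4.0, 4.0]    # worse than predicting the mean

r2_good = r2_manual(y_true_demo, good_pred)
r2_bad = r2_manual(y_true_demo, bad_pred)
print(r2_good, r2_score(y_true_demo, good_pred))
print(r2_bad, r2_score(y_true_demo, bad_pred))
```

The second case shows that $R^2$ can be negative when a model predicts worse than the constant mean, which is worth keeping in mind when reading test scores for very low or very high degrees.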

In [5]:
def answer_two():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.metrics import r2_score

    # Initialize list to store the scores
    train_scores = []
    test_scores = []

    # Iteration to generate the polynomial for every degree with values between [0-9]
    degrees = range(10)
    for degree in degrees:

        ## Polynomial transformation of the features ##
        # include_bias is True only when degree=0, so that the transformer still
        # outputs one (constant) column; for every other degree LinearRegression
        # fits the intercept itself
        bias = (degree == 0)

        # The degree changes in every iteration; fit on the training data,
        # then reuse the fitted transformer on the test data
        poly = PolynomialFeatures(degree=degree, include_bias=bias)
        X_train_poly = poly.fit_transform(X_train.reshape(-1,1))
        X_test_poly = poly.transform(X_test.reshape(-1,1))

        ## Linear Regression of the transformed polynomial features
        linreg = LinearRegression()
        linreg.fit(X_train_poly, y_train)


        ## Predictions of y values for each data set
        y_train_predicted = linreg.predict(X_train_poly)
        y_test_predicted = linreg.predict(X_test_poly)


        ## Scores using sklearn.metrics.r2_score
        # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
        # r2_score(y_true, y_pred)
        test_score = r2_score(y_test, y_test_predicted)
        train_score = r2_score(y_train, y_train_predicted)

        test_scores.append(test_score)
        train_scores.append(train_score)

    # After all the iterations we build a tuple with two arrays containing the values of each score list
    tupla = (np.array(train_scores), np.array(test_scores))

    return tupla

### Question 3

Based on the $R^2$ scores from question 2 (degree levels 0 through 9), what degree level corresponds to a model that is underfitting? What degree level corresponds to a model that is overfitting? What choice of degree level would provide a model with good generalization performance on this dataset? 

(Hint: Try plotting the $R^2$ scores from question 2 to visualize the relationship)

*This function should return a tuple with the degree values in this order: `(Underfitting, Overfitting, Good_Generalization)`*

In [6]:
def answer_three():
    
    # Let's use answer_two() to obtain the train and test scores for the different degrees (0-9) and unpack them into variables
    train_scores, test_scores = answer_two()
    degrees = range(10)

    # Let's plot the scores
    def graph():

        import matplotlib.pyplot as plt
        plt.figure()
        plt.plot(degrees, train_scores, label='training data')
        plt.plot(degrees, test_scores, label='test data')
        plt.title('Degrees vs R2 scores')
        plt.xlabel('Complexity (Degree)')
        plt.ylabel('R2  scores')
        plt.legend(loc=2);
        plt.show()
    
    # Comment/uncomment to see/hide the graph
    #graph()
    
    # According to the graph we can choose the proper degrees (Underfitting, Overfitting, Good_Generalization)
    result = (2,9,7)

    return result

### Question 4

Training models on high degree polynomial features can result in overfitting. Train two models: a non-regularized LinearRegression model and a Lasso Regression model (with parameters `alpha=0.01`, `max_iter=10000`, `tol=0.1`) on polynomial features of degree 12. Return the $R^2$ score on the test set for both the LinearRegression model and the Lasso model.

*This function should return a tuple `(LinearRegression_R2_test_score, Lasso_R2_test_score)`*

In [12]:
def answer_four():
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.metrics import r2_score

    ## 1st Polynomial degree 12 transformation of X_train and X_test(features) ##
    # We will use X_train to train the model and X_test to predict values and compute the R2
    poly = PolynomialFeatures(degree=12, include_bias=False)
    X_train_poly = poly.fit_transform(X_train.reshape(-1,1))
    X_test_poly = poly.transform(X_test.reshape(-1,1))

    ## 2nd Regression models:
    # A- Linear Regression of the transformed polynomial features
    # Let's train the model
    linreg = LinearRegression()
    linreg.fit(X_train_poly, y_train)
    # Let's predict the X_test_poly values using the Linear Regression model
    y_test_predicted_linreg = linreg.predict(X_test_poly)
    # Score
    LinearRegression_R2_test_score = r2_score(y_test, y_test_predicted_linreg)


    # B- Lasso Regression of the transformed polynomial features
    # Let's train the model
    linlasso = Lasso(alpha=0.01, max_iter=10000, tol=0.1)
    linlasso.fit(X_train_poly, y_train)
    # Let's predict the X_test_poly values using the Lasso model
    y_test_predicted_lasso = linlasso.predict(X_test_poly)
    # Score
    Lasso_R2_test_score = r2_score(y_test, y_test_predicted_lasso)

    
    # Result
    result = (LinearRegression_R2_test_score, Lasso_R2_test_score)

    return result
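
To see why regularization matters here, it can help to compare the fitted coefficients directly. The following is a rough sketch on synthetic data generated in the same spirit as Part 1 (30 points, an assumption made for this example), not the assignment's graded solution. Lasso minimizes the squared error plus `alpha` times the L1 norm of the coefficients, which tends to shrink them and can set some exactly to zero.

```python
import warnings
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy sine data (illustrative, not the assignment's arrays)
rng = np.random.RandomState(0)
x_syn = np.linspace(0, 10, 30)
y_syn = np.sin(x_syn) + x_syn / 6 + rng.randn(30) / 10

# Degree-12 polynomial features, as in Question 4
X_syn_poly = PolynomialFeatures(degree=12, include_bias=False).fit_transform(
    x_syn.reshape(-1, 1)
)

ols = LinearRegression().fit(X_syn_poly, y_syn)
with warnings.catch_warnings():
    # Unscaled high-degree features may trigger convergence warnings
    warnings.simplefilter("ignore")
    lasso = Lasso(alpha=0.01, max_iter=10000, tol=0.1).fit(X_syn_poly, y_syn)

print("OLS   sum of |coef|:", np.abs(ols.coef_).sum())
print("Lasso sum of |coef|:", np.abs(lasso.coef_).sum())
print("Lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0)))
```

The powers of `x` span many orders of magnitude, so in practice the features would usually be scaled before applying Lasso. The assignment leaves them unscaled, and this sketch does the same.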

## Part 2 - Classification

For this section of the assignment we will be working with the [UCI Mushroom Data Set](http://archive.ics.uci.edu/ml/datasets/Mushroom?ref=datanews.io) stored in `mushrooms.csv`. The data will be used to train a model to predict whether or not a mushroom is poisonous. The following attributes are provided:

*Attribute Information:*

1. cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s 
2. cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s 
3. cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y 
4. bruises?: bruises=t, no=f 
5. odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s 
6. gill-attachment: attached=a, descending=d, free=f, notched=n 
7. gill-spacing: close=c, crowded=w, distant=d 
8. gill-size: broad=b, narrow=n 
9. gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y 
10. stalk-shape: enlarging=e, tapering=t 
11. stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=? 
12. stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s 
13. stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s 
14. stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y 
15. stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y 
16. veil-type: partial=p, universal=u 
17. veil-color: brown=n, orange=o, white=w, yellow=y 
18. ring-number: none=n, one=o, two=t 
19. ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z 
20. spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y 
21. population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y 
22. habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

<br>

The data in the mushrooms dataset is currently encoded with strings. These values will need to be encoded to numeric to work with sklearn. We'll use pd.get_dummies to convert the categorical variables into indicator variables. 
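
As a small illustration of what `pd.get_dummies` does, the sketch below encodes a made-up three-row frame. The column names `class` and `odor` and their values are assumptions chosen to resemble the mushroom data; the real file has many more columns.

```python
import pandas as pd

# Tiny made-up frame in the style of the mushroom data (assumed layout)
toy_df = pd.DataFrame({
    "class": ["p", "e", "e"],   # p = poisonous, e = edible
    "odor": ["n", "a", "n"],
})

# Each string column becomes one indicator column per distinct value
toy_dummies = pd.get_dummies(toy_df)
print(list(toy_dummies.columns))
print(toy_dummies)
```

If the first column of `mushrooms.csv` is a `class` column with values `e` and `p`, as assumed here, the encoded frame starts with `class_e` and `class_p`. That would explain why the next cell takes column 1 as the target and the columns from index 2 onward as features.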

In [20]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


mush_df = pd.read_csv('assets/mushrooms.csv')
mush_df2 = pd.get_dummies(mush_df)

X_mush = mush_df2.iloc[:,2:]
y_mush = mush_df2.iloc[:,1]


X_train2, X_test2, y_train2, y_test2 = train_test_split(X_mush, y_mush, random_state=0)

### Question 5

Using `X_train2` and `y_train2` from the preceding cell, train a DecisionTreeClassifier with default parameters and random_state=0. What are the 5 most important features found by the decision tree?

*This function should return a list of length 5 of the feature names in descending order of importance.*

In [21]:
def answer_five():
    from sklearn.tree import DecisionTreeClassifier

    # Let's create and train the classifier with default parameters and random_state=0
    clf = DecisionTreeClassifier(random_state=0).fit(X_train2, y_train2)

    # Let's pair names with importances, sort them by descending importance and select the 5 largest values
    important_features_tuples = sorted(list(zip(X_train2.columns, clf.feature_importances_)), key=lambda x: x[1], reverse=True)[:5]

    # Let's save the names of the five most important features
    result = []
    for tup in important_features_tuples:
        result.append(tup[0])

    return result

['odor_n', 'stalk-root_c', 'stalk-root_r', 'spore-print-color_r', 'odor_l']
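
An alternative way to rank the features, sketched here on a toy frame rather than the mushroom data, is to wrap `feature_importances_` in a pandas Series and use `nlargest`. The frame `X_toy` and its column names are invented for the example. The label is made identical to column `a`, so the tree only needs that one feature.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): the label equals column "a"
X_toy = pd.DataFrame({
    "a": [0, 0, 1, 1, 0, 1],
    "b": [0, 1, 0, 1, 1, 0],
    "c": [1, 1, 1, 0, 0, 0],
})
y_toy = X_toy["a"]

tree_toy = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Index the importances by column name, then take the largest ones
importances = pd.Series(tree_toy.feature_importances_, index=X_toy.columns)
top_features = importances.nlargest(2).index.tolist()
print(importances)
print(top_features)
```

This gives the same ordering as sorting the `(name, importance)` tuples, with less manual bookkeeping.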

### Question 6

For this question, use the `validation_curve` function in `sklearn.model_selection` to determine training and test scores for a Support Vector Classifier (`SVC`) with varying parameter values.

Create an `SVC` with default parameters (i.e. `kernel='rbf', C=1`) and `random_state=0`. Recall that the kernel width of the RBF kernel is controlled using the `gamma` parameter.  Explore the effect of `gamma` on classifier accuracy by using the `validation_curve` function to find the training and test scores for 6 values of `gamma` from `0.0001` to `10` (i.e. `np.logspace(-4,1,6)`).
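
To get a feel for what `gamma` does, recall that the RBF kernel is $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. The sketch below evaluates it for two made-up points at squared distance 1 over the same six `gamma` values, using `rbf_kernel` from `sklearn.metrics.pairwise` and a manual formula for comparison.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

gammas = np.logspace(-4, 1, 6)

# Two illustrative points whose squared Euclidean distance is 1
p = np.array([[0.0, 0.0]])
q = np.array([[1.0, 0.0]])

# Kernel similarity between p and q for each gamma
similarities = np.array([rbf_kernel(p, q, gamma=g)[0, 0] for g in gammas])
manual = np.exp(-gammas * 1.0)

for g, s in zip(gammas, similarities):
    print(f"gamma={g:g}  K(p, q)={s:.6f}")
```

Small `gamma` makes distant points look similar, which gives a wide kernel and smooth decision boundaries. Large `gamma` makes the similarity drop off quickly, so each training point influences only its immediate neighbourhood.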

For each level of `gamma`, `validation_curve` will use 3-fold cross validation (use `cv=3, n_jobs=2` as parameters for `validation_curve`), returning two 6x3 (6 levels of gamma x 3 fits per level) arrays of the scores for the training and test sets in each fold.

Find the mean score across the three models for each level of `gamma` for both arrays, creating two arrays of length 6, and return a tuple with the two arrays.

e.g.

if one of your array of scores is

    array([[ 0.5,  0.4,  0.6],
           [ 0.7,  0.8,  0.7],
           [ 0.9,  0.8,  0.8],
           [ 0.8,  0.7,  0.8],
           [ 0.7,  0.6,  0.6],
           [ 0.4,  0.6,  0.5]])
       
it should then become

    array([ 0.5,  0.73333333,  0.83333333,  0.76666667,  0.63333333, 0.5])

*This function should return a tuple of numpy arrays `(training_scores, test_scores)` where each array in the tuple has shape `(6,)`.*
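
The averaging step in the example above can be reproduced with `np.mean` along `axis=1`, which collapses the three fold scores in each row into one value per `gamma`. The array below is the example array from the prompt.

```python
import numpy as np

# Example scores from the prompt: 6 gamma levels x 3 folds
example_scores = np.array([[0.5, 0.4, 0.6],
                           [0.7, 0.8, 0.7],
                           [0.9, 0.8, 0.8],
                           [0.8, 0.7, 0.8],
                           [0.7, 0.6, 0.6],
                           [0.4, 0.6, 0.5]])

# axis=1 averages across the folds (columns), leaving one mean per row
mean_per_gamma = np.mean(example_scores, axis=1)
print(mean_per_gamma)
```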

In [23]:
def answer_six():
    from sklearn.svm import SVC
    from sklearn.model_selection import validation_curve
    
    # We define the range of gamma values that we want to test in the SVC classifier
    gamma_range = np.logspace(-4,1,6)

    # We create the classifier
    clf = SVC(kernel='rbf', C=1, random_state=0)

    # Validation curve
    train_scores, test_scores = validation_curve(clf, X_mush, y_mush,
                                                param_name='gamma',
                                                param_range=gamma_range, # range of values to evaluate
                                                scoring = "accuracy", # type of scorer we want
                                                cv=3, # number of cross-validation folds
                                                n_jobs=2) # parallel jobs, as requested in the prompt
    
    # Average scores for every gamma value
    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    
    # Let's create a tuple with both arrays
    result = (train_scores_mean,test_scores_mean)
    
    return result

(array([0.89838749, 0.98104382, 0.99895372, 1.        , 1.        ,
        1.        ]),
 array([0.88749385, 0.82951748, 0.84170359, 0.86582964, 0.83616445,
        0.51797144]))

### Question 7

Based on the scores from question 6, what gamma value corresponds to a model that is underfitting? What gamma value corresponds to a model that is overfitting? What choice of gamma would provide a model with good generalization performance on this dataset? 

(Hint: Try plotting the scores from question 6 to visualize the relationship)

*This function should return a tuple with the gamma values in this order: `(Underfitting, Overfitting, Good_Generalization)`*

In [27]:
%matplotlib inline

def answer_seven():
    
    def graph():

        import matplotlib.pyplot as plt

        # Let's unpack the results from answer_six()
        train_scores_mean,test_scores_mean = answer_six()
        gamma_range = np.logspace(-4,1,6)

        plt.figure()
        plt.title('Validation Curve with SVM')
        plt.xlabel(r'$\gamma$ (gamma)')
        plt.ylabel('Accuracy')
        plt.ylim(0.0, 1.1)
        lw = 5

        plt.semilogx(gamma_range, train_scores_mean, label='Training score',
                    color='darkorange', lw=lw)
        plt.semilogx(gamma_range, test_scores_mean, label='Cross-validation score',
                    color='navy', lw=lw)


        plt.legend(loc='best')
        plt.show()
    
    # Comment/uncomment to hide or show the graph
    graph()

    # From the graph we can identify the gamma values in this order: (Underfitting, Overfitting, Good_Generalization)
    result = (0.001, 10.0, 0.1)

    return result