# Regression: Ridge Regression (interpretation)
## Coursera University of Washington ML specialization Regression course week 5 assignment

In this notebook, we will run ridge regression multiple times with different L2 penalties to see which one produces the best fit. We will revisit the example of polynomial regression as a means to see the effect of L2 regularization. In particular, we will:
* Use scikit-learn to run polynomial regression
* Use matplotlib to visualize polynomial regressions
* Use scikit-learn to run polynomial regression, this time with an L2 penalty
* Use matplotlib to visualize polynomial regressions under L2 regularization
* Choose the best L2 penalty using cross-validation.
* Assess the final fit using test data.

We will continue to use the House data from previous notebooks.  (In the next programming assignment for this module, you will implement your own ridge regression learning algorithm using gradient descent.)

In [None]:
import numpy as np
import pandas as pd
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load the data set to a data frame

In [None]:
# dictionary with dataset column names and their corresponding data types
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 
              'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 
              'sqft_living':float, 'floors':float, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 
              'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

os.chdir('/kaggle/input/polynomialregression')
sales_df = pd.read_csv('kc_house_data.csv', dtype = dtype_dict)
sales_df.sort_values(by=['sqft_living', 'price'], inplace=True)
sales_df.head()

The target variable that we want to predict is `price`, and the only input feature that we consider is `sqft_living`.

In [None]:
sqft_living = sales_df.loc[:, 'sqft_living'].values.reshape(-1, 1)
price = sales_df.loc[:, 'price'].values.reshape(-1, 1)    

Use scikit-learn's `PolynomialFeatures`, which accepts a maximal ‘degree’ in its constructor and an array ‘feature’ in its `fit_transform` method. It returns a NumPy array whose first column is a constant column of ones (the bias term), whose second column equals ‘feature’, and whose remaining columns equal ‘feature’ raised to increasing integer powers up to ‘degree’.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

def create_polynomial_features(degree_n, input_features):
    polynomial_features = PolynomialFeatures(degree=degree_n)
    return polynomial_features.fit_transform(input_features)    

In [None]:
poly1_data = create_polynomial_features(1, sqft_living)
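To make the column layout concrete, here is a small NumPy-only sketch. It uses `np.vander` as a stand-in for `PolynomialFeatures` on a single input column (an assumption made for illustration; the notebook itself uses scikit-learn), and the helper name `polynomial_columns` is hypothetical.

```python
import numpy as np

def polynomial_columns(feature, degree):
    # Hypothetical helper: columns are feature**0, feature**1, ..., feature**degree,
    # which should match PolynomialFeatures(degree) for a single input column.
    feature = np.asarray(feature, dtype=float).ravel()
    return np.vander(feature, degree + 1, increasing=True)

demo = polynomial_columns([2.0, 3.0], 3)
print(demo)
```

Column 0 is the constant term, which is why the degree-1 coefficient sits at index 1 of the learned weights.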

Write a function to:
1. Build a polynomial data set using training_data[‘sqft_living’] as the feature and the current degree
2. Learn a model on TRAINING data to predict ‘price’ based on your polynomial data set at the current degree
3. Plot the predicted price on top of the scatter plot of sqft_living vs price at the current degree
4. Print the model statistics (on the training data), namely the root-mean-squared error (RMSE) and the r2-score at the current degree
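The error measures that appear in this notebook (RSS, mean squared error, and RMSE) differ only by scaling. The sketch below works through the arithmetic on a handful of made-up numbers; none of the values come from the house data.

```python
import math

# Toy observed and predicted prices (made-up numbers, for illustration only)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

residuals = [t - p for t, p in zip(y_true, y_pred)]
rss = sum(r ** 2 for r in residuals)   # residual sum of squares
mse = rss / len(y_true)                # mean squared error = RSS / n
rmse = math.sqrt(mse)                  # root mean squared error = sqrt(MSE)

print(rss, mse, rmse)
```

Because all three are monotone transformations of one another for a fixed data set, they rank models identically; only the units differ.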

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression

deg_color_map = {1: 'red', 2: 'green', 3: 'blue', 4: 'cyan', 5: 'grey', 6: 'gold', 7: 'lavender', 8: 'lime', 9: 'magenta', 15: 'coral'}

def regression_degree_n(degree_n, data, regressor):
    # data : data frame containing both input features and target variable
    # returns : The coefficient of degree 1 i.e. theta1
    input_features = data.loc[:, 'sqft_living'].values.reshape(-1, 1)
    y = data.loc[:, 'price'].values.reshape(-1, 1)    
    input_features_degree_n = create_polynomial_features(degree_n, input_features)    
    regressor.fit(input_features_degree_n, y)
    predicted_y = regressor.predict(input_features_degree_n)        
    rmse_deg = np.sqrt(mean_squared_error(y, predicted_y))
    r2_deg = r2_score(y, predicted_y)
    plt.plot(input_features, predicted_y, color=deg_color_map[degree_n], label='degree {}'.format(degree_n))            
    print('For model complexity of polynomial degree {}:'.format(degree_n))
    print('The learned coefficients = {}'.format(regressor.coef_))
    print('The root mean squared error = {}, r2_score = {}'.format(rmse_deg, r2_deg))
    # column 0 of the polynomial features is the bias column, so index 1 is the degree-1 coefficient
    return regressor.coef_[0][1]


Produce a scatter plot of the training data (sqft_living vs price) and add the fitted models for polynomial degrees 1 and 15.

In [None]:
import matplotlib.pyplot as plt

def plot_sqftliving_price(sqft_living, price):
    fig, ax = plt.subplots(figsize=(12,6))    
    plt.plot(sqft_living, price, 'o', mfc='cyan', mec='orange')
    xrange = np.linspace(0, 14000, 15)
    ax.set_xticks(xrange)
    plt.xlabel('sqft_living')
    plt.ylabel('price')    

plot_sqftliving_price(sqft_living, price)
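# Note: the `normalize` argument was removed in scikit-learn 1.2, so this notebook assumes an older release
linear_regressor = LinearRegression(normalize=True)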
regression_degree_n(1, sales_df, linear_regressor)    
regression_degree_n(15, sales_df, linear_regressor)    
plt.legend()
plt.title('Polynomial regression')    
plt.show()

Note: When we have so many features and so few data points, the solution can become numerically unstable, which can sometimes lead to strange, unpredictable results. Thus, rather than using no regularization, we will introduce a tiny amount of regularization (`l2_small_penalty = 1e-5`, passed to `Ridge` as `alpha`) to make the solution numerically stable. (In lecture, we discussed the fact that regularization can also help with numerical stability, and here we are seeing a practical example.)
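One way to see the stability effect is through the condition number of the matrix that least squares has to invert. The sketch below is a simplified illustration, not the notebook's exact setup: it assumes inputs rescaled to [0, 1], builds the features with `np.vander`, and penalizes every column including the constant one.

```python
import numpy as np

# Assumption for illustration: 50 inputs rescaled to [0, 1] instead of raw square footage
x = np.linspace(0.0, 1.0, 50)
X = np.vander(x, 16, increasing=True)     # degree-15 polynomial features

gram = X.T @ X                            # matrix that plain least squares must invert
l2_penalty = 1e-5
gram_ridge = gram + l2_penalty * np.eye(X.shape[1])   # ridge adds the penalty to the diagonal

cond_plain = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(gram_ridge)
print('condition number without penalty: {:.3e}'.format(cond_plain))
print('condition number with penalty:    {:.3e}'.format(cond_ridge))
```

Even a penalty this small bounds the smallest eigenvalue away from zero, which is what keeps the solve well behaved.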

With the L2 penalty specified above, fit the model and print out the learned weights.

In [None]:
from sklearn.linear_model import Ridge

l2_small_penalty = 1e-5
ridge_regressor = Ridge(alpha=l2_small_penalty, normalize=True)
plot_sqftliving_price(sqft_living, price)
regression_degree_n(15, sales_df, ridge_regressor)    
plt.legend()
plt.title('Polynomial regression of degree 15 with l2 penalty')    
plt.show()

Note the decrease in the magnitude of the learned coefficients once an L2 penalty is applied.

# Observe overfitting

Recall from Week 3 that the polynomial fit of degree 15 changed wildly whenever the data changed. In particular, when we split the sales data into four subsets and fit the model of degree 15, the result came out to be very different for each subset. The model had a *high variance*. We will see in a moment that ridge regression reduces such variance. But first, we must reproduce the experiment we did in Week 3.

Please download the provided CSV files for each subset and load them with the given dictionary of types.

In [None]:
# dtype_dict same as above
set_1 = pd.read_csv('wk3_kc_house_set_1_data.csv', dtype=dtype_dict)
set_2 = pd.read_csv('wk3_kc_house_set_2_data.csv', dtype=dtype_dict)
set_3 = pd.read_csv('wk3_kc_house_set_3_data.csv', dtype=dtype_dict)
set_4 = pd.read_csv('wk3_kc_house_set_4_data.csv', dtype=dtype_dict)

Next, fit a 15th degree polynomial on `set_1`, `set_2`, `set_3`, and `set_4`, using 'sqft_living' to predict prices. Print the weights and make a plot of the resulting model.

This time, use `l2_smaller_penalty = 1e-9`.

In [None]:
l2_smaller_penalty=1e-9
coeff_deg1_list = []
ridge_regressor_smaller = Ridge(alpha=l2_smaller_penalty, normalize=True)

def fit_and_plot_model(data, regressor, coef_deg1, label):
    plot_sqftliving_price(sqft_living, price)
    coef_deg1_1 = regression_degree_n(15, data, regressor)    
    coef_deg1.append(coef_deg1_1)    
    plt.legend()
    plt.title('Polynomial regression of degree 15 with l2 penalty ({})'.format(label))    
    plt.show()
    
fit_and_plot_model(set_1, ridge_regressor_smaller, coeff_deg1_list, 'set 1')    

In [None]:
fit_and_plot_model(set_2, ridge_regressor_smaller, coeff_deg1_list, 'set 2')    

In [None]:
fit_and_plot_model(set_3, ridge_regressor_smaller, coeff_deg1_list, 'set 3')    

In [None]:
fit_and_plot_model(set_4, ridge_regressor_smaller, coeff_deg1_list, 'set 4')    

The four curves should differ from one another a lot, as should the coefficients you learned.

***QUIZ QUESTION:  For the models learned in each of these training sets, what are the smallest and largest values you learned for the coefficient of feature `power_1`?***  (For the purpose of answering this question, negative numbers are considered "smaller" than positive numbers. So -5 is smaller than -3, and -3 is smaller than 5 and so forth.)

In [None]:
sorted_coef_deg1 = sorted(coeff_deg1_list)
print(sorted_coef_deg1)

# Ridge regression comes to the rescue

Generally, whenever we see weights change so much in response to a change in the data, we believe the variance of our estimate to be large. Ridge regression aims to address this issue by penalizing "large" weights. (The weights of the degree-15 model looked quite small, but they are not that small, because the 'sqft_living' input is on the order of thousands.)
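The shrinking effect of the penalty can be seen directly from the closed-form ridge solution. The following is a minimal sketch on synthetic data, assuming no intercept and no feature normalization; `ridge_weights` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def ridge_weights(X, y, l2_penalty):
    # Closed-form ridge solution w = (X^T X + l2_penalty * I)^(-1) X^T y.
    # Simplification: no intercept and no feature normalization.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + l2_penalty * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)            # synthetic data, not the house data
X = rng.normal(size=(50, 5))
y = X @ np.array([4.0, -3.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

weight_norms = []
for l2_penalty in [0.01, 1.0, 100.0]:
    w = ridge_weights(X, y, l2_penalty)
    weight_norms.append(np.linalg.norm(w))
    print(l2_penalty, np.round(w, 3))
```

The norm of the weight vector falls as the penalty grows, which is the mechanism that damps the swings between subsets.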

With a much larger penalty (`l2_big_penalty = 1.23e2` in the cell below), fit a 15th-order polynomial model on `set_1`, `set_2`, `set_3`, and `set_4`. Other than the change in the penalty, the code should be the same as the experiment above.

In [None]:
l2_big_penalty = 1.23e2
coeff_deg1_list_2 = []
ridge_regressor_big = Ridge(alpha=l2_big_penalty, normalize=True)


In [None]:
fit_and_plot_model(set_1, ridge_regressor_big, coeff_deg1_list_2, 'set 1')    

In [None]:
fit_and_plot_model(set_2, ridge_regressor_big, coeff_deg1_list_2, 'set 2')    

In [None]:
fit_and_plot_model(set_3, ridge_regressor_big, coeff_deg1_list_2, 'set 3')   

In [None]:
fit_and_plot_model(set_4, ridge_regressor_big, coeff_deg1_list_2, 'set 4')   

These curves should vary a lot less, now that you applied a high degree of regularization.

***QUIZ QUESTION:  For the models learned with the high level of regularization in each of these training sets, what are the smallest and largest values you learned for the coefficient of feature `power_1`?*** (For the purpose of answering this question, negative numbers are considered "smaller" than positive numbers. So -5 is smaller than -3, and -3 is smaller than 5 and so forth.)

In [None]:
sorted_coef_deg1_2 = sorted(coeff_deg1_list_2)
print(sorted_coef_deg1_2)

# Selecting an L2 penalty via cross-validation

Just like the polynomial degree, the L2 penalty is a "magic" parameter we need to select. We could use the validation set approach as we did in the last module, but that approach has a major disadvantage: it leaves fewer observations available for training. **Cross-validation** seeks to overcome this issue by using all of the training set in a smart way.

We will implement a kind of cross-validation called **k-fold cross-validation**. The method gets its name because it involves dividing the training set into k segments of roughly equal size. Similar to the validation set method, we measure the validation error with one of the segments designated as the validation set. The major difference is that we repeat the process k times as follows:

Set aside segment 0 as the validation set, fit a model on the rest of the data, and evaluate it on this validation set<br>
Set aside segment 1 as the validation set, fit a model on the rest of the data, and evaluate it on this validation set<br>
...<br>
Set aside segment k-1 as the validation set, fit a model on the rest of the data, and evaluate it on this validation set

After this process, we compute the average of the k validation errors and use it as an estimate of the generalization error. Notice that all observations are used for both training and validation as we iterate over segments of data.

To estimate the generalization error well, it is crucial to shuffle the training data before dividing them into segments. For the purpose of this assignment, let us download the csv file containing pre-shuffled rows of training and validation sets combined: wk3_kc_house_train_valid_shuffled.csv. In practice, you would shuffle the rows with a dynamically determined random seed.
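For reference, shuffling can be sketched with NumPy's random generator. The fixed `seed` below is a hypothetical choice made only so the sketch is repeatable; calling `np.random.default_rng()` with no argument draws fresh entropy instead.

```python
import numpy as np

seed = 42                                      # hypothetical fixed seed, for repeatability only
rng = np.random.default_rng(seed)

n_rows = 10
shuffled_positions = rng.permutation(n_rows)   # a random ordering of 0..n_rows-1
print(shuffled_positions)
```

With a pandas DataFrame, such positions could be passed to `DataFrame.iloc` to reorder the rows before splitting them into segments.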

In [None]:
train_valid_shuffled = pd.read_csv('wk3_kc_house_train_valid_shuffled.csv', dtype=dtype_dict)
test = pd.read_csv('wk3_kc_house_test_data.csv', dtype=dtype_dict)
print(len(train_valid_shuffled))
print(len(test))

Once the data is shuffled, we divide it into segments of roughly equal size. Each segment should receive about `n/k` elements, where `n` is the number of observations in the training set and `k` is the number of segments. Since segment 0 starts at index 0 and contains `n/k` elements, it ends at index `(n/k)-1`. Segment 1 starts where segment 0 left off, at index `(n/k)`. With `n/k` elements, segment 1 ends at index `(n*2/k)-1`. Continuing in this fashion, we deduce that segment `i` starts at index `(n*i/k)` and ends at `(n*(i+1)/k)-1`. The divisions here are integer divisions (`//` in Python 3), so every index is a whole number.

With this pattern in mind, we write a short loop that prints the starting and ending indices of each segment, just to make sure we are getting the splits right.

In [None]:
n = len(train_valid_shuffled)
k = 10 # 10-fold cross-validation

for i in range(k):
    start = (n*i)//k          # integer division keeps the indices whole
    end = (n*(i+1))//k - 1    # inclusive end index of segment i
    print(i, (start, end))
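The index arithmetic is easy to check on a toy case where `n` is not a multiple of `k`. The helper name `segment_bounds` is hypothetical, and the sizes (23 rows, 5 segments) are chosen purely for illustration.

```python
def segment_bounds(n, k):
    # Hypothetical helper: inclusive (start, end) positions of each of the k segments
    return [((n * i) // k, (n * (i + 1)) // k - 1) for i in range(k)]

bounds = segment_bounds(23, 5)
print(bounds)

# Every position 0..22 should be covered exactly once, in order
covered = [pos for start, end in bounds for pos in range(start, end + 1)]
```

With integer division the segments differ in size by at most one row, and together they tile the data with no gaps or overlaps.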

Let us familiarize ourselves with row slicing in pandas. To extract a contiguous slice of rows from a DataFrame, use a colon in square brackets. For instance, the following cell extracts rows 0 to 9 of `train_valid_shuffled`. Notice that the first index (0) is included in the slice but the last index (10) is omitted.

In [None]:
train_valid_shuffled[0:10] # rows 0 to 9

Now let us extract individual segments with array slicing. Consider the scenario where we group the houses in the `train_valid_shuffled` dataframe into k=10 segments of roughly equal size, with starting and ending indices computed as above. Extract the fourth segment (segment 3) and assign it to a variable called `validation4`.

In [None]:
n = len(train_valid_shuffled)
k = 10

def get_segment_indices(num_segments, num_elements):
    k_segments = []
    for i in range(num_segments):
        start = (num_elements*i)//num_segments
        end = (num_elements*(i+1))//num_segments - 1
        k_segments.append((start, end))
    return k_segments

k_segments = get_segment_indices(k, n)
start_4 = k_segments[3][0]
end_4 = k_segments[3][1]
# `end_4` is inclusive, so the slice must run to end_4 + 1
validation4 = train_valid_shuffled[start_4:end_4+1]

To verify that we have the right elements extracted, run the following cell, which computes the average price of the fourth segment. When rounded to the nearest whole number, the average should be $536,234.

In [None]:
print(int(round(validation4['price'].mean(), 0)))

After designating one of the k segments as the validation set, we train a model using the rest of the data. To choose the remainder, we slice (0:start) and (end+1:n) of the data and paste them together. In pandas, `pd.concat()` pastes together two disjoint sets of rows originating from a common dataset (the older `DataFrame.append()` method was removed in pandas 2.0). For instance, the following cell pastes together the first and last two rows of the `train_valid_shuffled` dataframe.

In [None]:
n = len(train_valid_shuffled)
first_two = train_valid_shuffled[0:2]
last_two = train_valid_shuffled[n-2:n]
print(pd.concat([first_two, last_two]))

Extract the remainder of the data after *excluding* fourth segment (segment 3) and assign the subset to `train4`.

In [None]:
pre_validation4 = train_valid_shuffled[0: start_4]
post_validation4 = train_valid_shuffled[end_4+1: n]
train4 = pd.concat([pre_validation4, post_validation4])  # DataFrame.append was removed in pandas 2.0
print(len(validation4))
print(len(train4))

To verify that we have the right elements extracted, run the following cell, which computes the average price of the data with the fourth segment excluded. When rounded to the nearest whole number, the average should be $539,450.

In [None]:
print(int(round(train4['price'].mean(), 0)))

Now we are ready to implement k-fold cross-validation. Write a function that computes k validation errors by designating each of the k segments as the validation set. It accepts as parameters (i) `k`, (ii) `l2_penalty`, (iii) the dataframe, (iv) the name of the output column (e.g. `price`), (v) the list of feature names and (vi) the polynomial degree. The function returns the average validation error using the k segments as validation sets.

* For each i in [0, 1, ..., k-1]:
  * Compute the starting and ending indices of segment i and call them 'start' and 'end'
  * Form validation set by taking a slice (start:end+1) from the data.
  * Form training set by appending slice (end+1:n) to the end of slice (0:start).
  * Train a linear model using training set just formed, with a given l2_penalty
  * Compute validation error using validation set just formed

In [None]:
def k_fold_cross_validation(k, l2_penalty, data, output_name, features_list, degree_n):
    n = len(data)
    k_segments = get_segment_indices(k, n)
    k_validation_errors = []
    # generate polynomial features from the given input features
    X = data.loc[:, features_list].values.reshape(-1, 1)
    y = data.loc[:, output_name].values.reshape(-1, 1)    
    poly_X = create_polynomial_features(degree_n, X)    
    poly_data = np.concatenate((poly_X, y), axis=1)
    for start_i, end_i in k_segments:
        # segment bounds are inclusive, so the validation slice runs to end_i + 1
        validation_i = poly_data[start_i:end_i+1]
        pre_validation_i = poly_data[0: start_i]
        post_validation_i = poly_data[end_i+1: n]
        train_i = np.concatenate((pre_validation_i, post_validation_i), axis=0)
        regressor = Ridge(alpha=l2_penalty, normalize=True)
        # exclude the last column which is the output variable y to get X
        train_X = train_i[:, 0:-1]
        valid_X = validation_i[:, 0:-1]
        # the last column is y
        train_y = train_i[:, -1]
        valid_y = validation_i[:, -1]    
        regressor.fit(train_X, train_y)
        valid_predicted_y = regressor.predict(valid_X)        
        rss = np.sum((valid_y - valid_predicted_y)**2)
        k_validation_errors.append(rss)
    return np.mean(k_validation_errors)

result = k_fold_cross_validation(10, 1000, train_valid_shuffled, 'price', ['sqft_living'], 15)
print(result)

Once we have a function to compute the average validation error for a model, we can write a loop to find the model that minimizes the average validation error. Write a loop that does the following:
* We will again be aiming to fit a 15th-order polynomial model using the `sqft_living` input
* For each l2_penalty in [10^3, 10^3.5, 10^4, 10^4.5, ..., 10^9] (to get this in Python, you can use the NumPy function `np.logspace(3, 9, num=13)`): run 10-fold cross-validation with l2_penalty.
* Report which L2 penalty produced the lowest average validation error.

Note: since the degree of the polynomial is now fixed to 15, to make things faster, you should generate polynomial features in advance and re-use them throughout the loop. Make sure to use `train_valid_shuffled` when generating polynomial features!
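The idea of building the features once and reusing them for every penalty can be sketched as follows. This is a simplified stand-in on synthetic data: it uses a closed-form ridge solve instead of scikit-learn's `Ridge`, a degree-5 polynomial, and a small penalty grid, all of which are assumptions for illustration rather than the assignment's settings. The helper names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, l2_penalty):
    # Closed-form ridge weights (simplified stand-in for sklearn's Ridge)
    return np.linalg.solve(X.T @ X + l2_penalty * np.eye(X.shape[1]), X.T @ y)

def average_validation_rss(features, y, k, l2_penalty):
    n = len(y)
    errors = []
    for i in range(k):
        start, stop = (n * i) // k, (n * (i + 1)) // k   # stop is exclusive
        valid = np.arange(start, stop)
        train = np.concatenate((np.arange(0, start), np.arange(stop, n)))
        w = ridge_fit(features[train], y[train], l2_penalty)
        errors.append(np.sum((y[valid] - features[valid] @ w) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(1)                 # synthetic data, not the house data
x = rng.uniform(0.0, 1.0, size=60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=60)

features = np.vander(x, 6, increasing=True)    # built once, reused for every penalty
penalties = np.logspace(-6, 0, num=7)
scores = [average_validation_rss(features, y, 5, p) for p in penalties]
best_penalty = penalties[int(np.argmin(scores))]
print(best_penalty)
```

Only the row indices change from fold to fold, so the polynomial expansion never needs to be recomputed inside the loop.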

In [None]:
penalty_rss_list = []
for penalty in np.logspace(3, 9, num=13):
    result = k_fold_cross_validation(10, penalty, train_valid_shuffled, 'price', ['sqft_living'], 15)
    penalty_rss_list.append((penalty, result))
    print('penalty:{} --> rss:{}'.format(penalty, result))

sorted_penalty_rss_list = sorted(penalty_rss_list, key=lambda item:item[1])
print(sorted_penalty_rss_list[0])

***QUIZ QUESTION: What is the best value for the L2 penalty according to 10-fold cross-validation?***

Once you have found the best value for the L2 penalty using cross-validation, it is important to retrain a final model on all of the training data using this value of `l2_penalty`. This way, your final model will be trained on the entire training set rather than on k-1 of its segments.

In [None]:
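regressor = Ridge(alpha=sorted_penalty_rss_list[0][0], normalize=True)
# retrain on the training + validation rows only; `sales_df` is the full data set and would leak the test houses
X = train_valid_shuffled.loc[:, 'sqft_living'].values.reshape(-1, 1)
y = train_valid_shuffled.loc[:, 'price'].values.reshape(-1, 1)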
poly15_X = create_polynomial_features(15, X)    
regressor.fit(poly15_X, y)
# make predictions on the test dataset
X_test = test.loc[:, 'sqft_living'].values.reshape(-1, 1)
poly15_X_test = create_polynomial_features(15, X_test)    
y_test = test.loc[:, 'price'].values.reshape(-1, 1)    
predicted_y_test = regressor.predict(poly15_X_test)        
rss_test = np.sum((y_test - predicted_y_test)**2)

***QUIZ QUESTION: Using the best L2 penalty found above, train a model using all training data. What is the RSS on the TEST data of the model you learn with this L2 penalty?***

In [None]:
print(rss_test)