In this notebook I run a plain linear regression and a Ridge regression. Ridge can perform slightly worse on the training set but better on the test set, because it adds a penalty on the size of the coefficients: this shrinks them toward zero, which reduces overfitting when some variables matter less than others or are strongly correlated with each other.
Plain linear regression has no such penalty, so its coefficient estimates are unbiased but can vary more from one sample to another.
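
For reference, the two objectives can be written side by side. This is the standard textbook formulation, not something taken from the notebook's output; the symbol alpha is chosen here to match scikit-learn's `alpha` parameter, and the intercept beta_0 is left unpenalised:

```latex
\hat{\beta}_{\text{OLS}} = \arg\min_{\beta_0,\,\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2

\hat{\beta}_{\text{ridge}} = \arg\min_{\beta_0,\,\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2 + \alpha \sum_{j=1}^{p} \beta_j^2
```

With alpha = 0 the second expression reduces to the first; larger values of alpha pull the coefficients closer to zero.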

First, I'm importing what I need. The dataset I'll be using is Boston Housing prices, which can be loaded directly from scikit-learn (the versions I found here on Kaggle didn't have all the variables I had used before).

In [None]:

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

After importing, I'm splitting the data in two: (1) my target variable, the price, in df_labels, and (2) the independent variables, in df.

In [None]:
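# loading the Boston dataset
# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs scikit-learn < 1.2
from sklearn.datasets import load_boston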
house_price = load_boston()
df_labels = pd.DataFrame(house_price.target)
df = pd.DataFrame(house_price.data)
print(df_labels.head())
print(df.head())

All good, but the columns have no names yet, so I'm adding them:

In [None]:
df_labels.columns = ['PRICE']
df.columns = house_price.feature_names
print(df.shape)
print(df_labels.shape)

Then I'm creating a new combined dataframe, in case I need it.


In [None]:
df_total = df.merge(df_labels, left_index = True, right_index = True)
df_total.head()

Here I'm having a look at the variables to see if there are any missing values to take care of, or categorical variables that should be encoded.

In [None]:

# describe() is not the last expression in the cell, so print it explicitly
print(df_total.describe())
df_total.info()

Fortunately this is a nice dataset: no missing values and no categorical variables that need encoding :)
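
The same check can be made explicit by counting missing values per column and listing any non-numeric columns. This is a small sketch on a made-up frame (`toy` and its values are hypothetical stand-ins for df_total, with one missing value planted on purpose):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_total; one NaN is planted in RM
toy = pd.DataFrame({
    "RM": [6.5, 6.4, np.nan, 7.0],
    "CHAS": [0, 1, 0, 0],
    "PRICE": [24.0, 21.6, 34.7, 33.4],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Columns whose dtype is not numeric (candidates for encoding)
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric_columns)
```

On the real data, `df_total.isna().sum()` would be the equivalent call.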

Then I'm having a look at the distribution of my target variable, the price:

In [None]:
plt.hist(df_labels['PRICE'], bins = 8)

Looks good, roughly bell-shaped. I'm also checking skewness and kurtosis, which are not bad.

In [None]:
from scipy.stats import skew,kurtosis 
print(skew(df_labels['PRICE']))
print(kurtosis(df_labels['PRICE'])) 
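
To make clear what these two numbers measure, here is a sketch that computes them by hand with NumPy. The helper names `sample_skewness` and `excess_kurtosis` are my own; the formulas are the biased moment estimators, which is what scipy's `skew` and `kurtosis` use by default (and `kurtosis` reports excess kurtosis, so a normal distribution gives 0, not 3):

```python
import numpy as np

def sample_skewness(values):
    """Third central moment divided by the variance to the power 1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return float(m3 / m2 ** 1.5)

def excess_kurtosis(values):
    """Fourth central moment divided by the squared variance, minus 3."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)
    m4 = np.mean(centered ** 4)
    return float(m4 / m2 ** 2 - 3.0)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 10]
print(sample_skewness(symmetric), excess_kurtosis(symmetric))
print(sample_skewness(right_skewed))
```

A positive skewness means a longer right tail; a symmetric sample gives 0.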

Then I'm getting to the part I'm most interested in, the correlations. A bit of research was needed to understand the variables, so here are some of the highly correlated ones:
LSTAT is the % of the population with lower status, RM is the average number of rooms per dwelling, PTRATIO is the ratio between pupils and teachers, TAX is the property-tax rate.

In [None]:
corr_matrix = df_total.corr(method = 'pearson')
corr_matrix 
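
The full matrix is hard to scan, so one option is to pull out only the PRICE column and rank it by absolute value. A sketch on synthetic data follows (the frame `toy_total`, the NOISE column and all coefficients are invented for illustration; only the column names RM, LSTAT and PRICE echo the real dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_total: PRICE rises with RM and falls with LSTAT
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.0, 0.7, n)
lstat = rng.normal(12.0, 7.0, n)
noise_col = rng.normal(0.0, 1.0, n)
price = 5.0 * rm - 0.8 * lstat + rng.normal(0.0, 1.0, n)
toy_total = pd.DataFrame(
    {"RM": rm, "LSTAT": lstat, "NOISE": noise_col, "PRICE": price}
)

# Correlation of every feature with the target, target itself dropped
target_corr = toy_total.corr(method="pearson")["PRICE"].drop("PRICE")

# Order by strength of the correlation, keeping the sign
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

On the notebook's data the same two lines would start from `corr_matrix['PRICE']`.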

In the next step I'm scaling the data and then splitting it into train and test sets to begin training my model.

In [None]:
# standardize and train/test split: standardize only data, not target
df = preprocessing.scale(df)
X_train, X_test, y_train, y_test = train_test_split(
    df, df_labels, test_size=0.3, random_state=10)
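
Scaling before the split means the test rows contribute to the means and standard deviations used for scaling. A common alternative, which is not what this notebook does, is to split first and fit the scaler on the training rows only. A sketch on random data (`X_toy`, `y_toy` and the other names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data: 100 rows, 3 features
rng = np.random.default_rng(10)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = rng.normal(size=100)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=10)

# ... then learn the scaling parameters from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training mean and std

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

The scaled test set will not have exactly zero mean and unit variance, which is expected.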

Fitting the regression:

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train)

And then checking the RMSE of my model:

In [None]:
#on train set
from sklearn.metrics import mean_squared_error
y_train_predicted = lin_reg.predict(X_train)
lin_mse = mean_squared_error(y_train_predicted, y_train)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
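
For clarity, the RMSE is just the square root of the average squared difference between actual and predicted values. A minimal NumPy version (the helper name `rmse` is my own) looks like this:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    # ravel() flattens (n, 1) columns so the subtraction is element-wise
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors of 3 and 4 give squared errors 9 and 16, mean 12.5
print(rmse([3.0, 5.0], [0.0, 1.0]))
```

Because the errors are squared before averaging, a few large misses weigh more than many small ones.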

In [None]:
print(lin_reg.intercept_)
print(lin_reg.coef_)

Now I'm applying the model on the test data:

In [None]:
#on test set
y_test_predicted = lin_reg.predict(X_test)
lin_mse = mean_squared_error(y_test_predicted, y_test)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

And I'm getting a slightly higher RMSE.
Now I want to compare the error with the target variable itself (expressed in thousands of dollars): there's an error of 5.41 on a variable with values between 5 and 50 and a mean of 22.

In [None]:
#let's see how rmse compares to the rest of the target var descriptives
df_labels['PRICE'].describe()
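
To put the error in proportion, it can be divided by the mean and by the range of the target. The sketch below simply reuses the rounded figures quoted in the text (5.41, 22, 5 and 50), so the percentages are approximate:

```python
# Rounded figures quoted in the text, in thousands of dollars
rmse_test = 5.41
price_mean = 22.0
price_min, price_max = 5.0, 50.0

# Error as a share of the typical value and of the full spread
relative_to_mean = rmse_test / price_mean
relative_to_range = rmse_test / (price_max - price_min)

print(f"{relative_to_mean:.1%} of the mean, {relative_to_range:.1%} of the range")
```

So a typical prediction is off by roughly a quarter of the average price.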

Now moving on to the Ridge regression. In the first example alpha is 0, so there is no penalty and the training RMSE is identical to the one from the linear regression above:

In [None]:
#do the same with ridge
ridge_reg = Ridge(alpha=0)
ridge_reg.fit(X_train, y_train)
y_train_predicted = ridge_reg.predict(X_train)
ridge_mse = mean_squared_error(y_train_predicted, y_train)
ridge_rmse = np.sqrt(ridge_mse)
ridge_rmse 

Here I'm testing with a new alpha to see how the RMSE changes:

In [None]:
ridge_reg = Ridge(alpha=50)
ridge_reg.fit(X_train, y_train)
y_train_predicted = ridge_reg.predict(X_train)
ridge_mse = mean_squared_error(y_train_predicted, y_train)
ridge_rmse = np.sqrt(ridge_mse)
ridge_rmse 
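
What alpha does can be seen from the closed-form Ridge solution. The sketch below uses NumPy on synthetic data (all names and numbers are invented, and scikit-learn may use a different solver internally): with alpha = 0 it reproduces ordinary least squares, and with alpha = 50 the coefficients shrink while the training RMSE cannot go down.

```python
import numpy as np

# Synthetic data with known coefficients
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 4))
true_coef = np.array([3.0, -2.0, 0.5, 0.0])
y_toy = X_toy @ true_coef + rng.normal(scale=1.0, size=80)

# Centre X and y so the intercept drops out and is not penalised
X_c = X_toy - X_toy.mean(axis=0)
y_c = y_toy - y_toy.mean()

def ridge_fit(alpha):
    """Solve (X'X + alpha * I) beta = X'y for beta."""
    n_features = X_c.shape[1]
    return np.linalg.solve(X_c.T @ X_c + alpha * np.eye(n_features), X_c.T @ y_c)

def train_rmse(coef):
    """RMSE of the fitted values on the (centred) training data."""
    return float(np.sqrt(np.mean((y_c - X_c @ coef) ** 2)))

ols_coef = np.linalg.lstsq(X_c, y_c, rcond=None)[0]

results = {}
for alpha in [0.0, 50.0]:
    coef = ridge_fit(alpha)
    results[alpha] = (float(np.linalg.norm(coef)), train_rmse(coef))
    print(alpha, results[alpha])
```

A higher training RMSE with alpha = 50 is therefore expected; the question is whether the test RMSE improves.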

And finally, I'm using Ridge with cross-validation, which tells me which alpha is the best from a list of alphas:

In [None]:
from sklearn.linear_model import RidgeCV
regr_cv = RidgeCV(alphas=[0.1,0.3, 0.5,0.7, 1.0, 10.0, 50.0])
model = regr_cv.fit(X_train, y_train)

In [None]:
model.alpha_
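
Roughly the same selection can be done by hand, which shows what is being compared. The sketch below scores each alpha with 5-fold cross-validation on synthetic data; the data and the choice of 5 folds are my assumptions. By default RidgeCV uses an efficient leave-one-out scheme instead, so its pick can differ.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(120, 5))
y_toy = X_toy @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(size=120)

alphas = [0.1, 0.3, 0.5, 0.7, 1.0, 10.0, 50.0]

# Average cross-validated RMSE for each candidate alpha
cv_rmse = {}
for alpha in alphas:
    scores = cross_val_score(
        Ridge(alpha=alpha), X_toy, y_toy,
        scoring="neg_root_mean_squared_error", cv=5)
    cv_rmse[alpha] = float(-scores.mean())  # scores are negated RMSEs

# The best alpha is the one with the lowest average validation RMSE
best_alpha = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse)
print(best_alpha)
```

The key point is that each alpha is judged on rows the model did not see during fitting.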

RMSE when we're using the alpha we got after cross-validation:

In [None]:
y_train_predicted = regr_cv.predict(X_train)
ridge_mse = mean_squared_error(y_train_predicted, y_train)
ridge_rmse = np.sqrt(ridge_mse)
ridge_rmse 

In [None]:
# helper: fit a Ridge model with alpha = i and print its training RMSE
def function(i):
    ridge_reg = Ridge(alpha = i)
    ridge_reg.fit(X_train, y_train)
    y_train_predicted = ridge_reg.predict(X_train)
    ridge_mse = mean_squared_error(y_train_predicted, y_train)
    ridge_rmse = np.sqrt(ridge_mse)
    print(ridge_rmse)

In [None]:
function(0.1)
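
A variant of this helper that returns the values instead of printing them makes it easy to compare train and test RMSE over several alphas. This is a sketch on synthetic data; `ridge_rmse_pair` and the other names are hypothetical and not part of the notebook:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(150, 6))
y_toy = X_toy @ np.array([4.0, -3.0, 2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=2.0, size=150)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=10)

def ridge_rmse_pair(alpha):
    """Fit Ridge on the training rows and return (train RMSE, test RMSE)."""
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    return float(train_rmse), float(test_rmse)

rmse_by_alpha = {alpha: ridge_rmse_pair(alpha) for alpha in [0.1, 1.0, 10.0, 50.0]}
for alpha, (train_rmse, test_rmse) in rmse_by_alpha.items():
    print(f"alpha={alpha:>5}: train RMSE={train_rmse:.3f}, test RMSE={test_rmse:.3f}")
```

The training RMSE can only grow as alpha grows; whether the test RMSE drops depends on the data.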

And let's see what RMSE I'll find on the test data with the cross-validated value for alpha:

In [None]:
#on test set
# regr_cv was refit with the best alpha; the global ridge_reg still holds alpha=50
y_test_predicted = regr_cv.predict(X_test)
ridge_mse = mean_squared_error(y_test_predicted, y_test)
ridge_rmse = np.sqrt(ridge_mse)
ridge_rmse