# LINEAR REGRESSION

One of the simplest supervised learning algorithms in our toolkit.

It is a common and useful method for making predictions when the target vector is a quantitative value (home price, age, ...).

It assumes that the relationship between the features and the target vector is approximately linear.

**The effect** (also called coefficient, weight or parameter) of the features on the target vector is constant.

Linear models are highly interpretable, because each coefficient is the effect on the target vector of a one-unit change in its feature.



## FITTING A LINE

You want to train a model that represents a linear relationship between the feature and the target vector.

For the sake of explanation we train our model using only two features, which means our model will be:

**ŷ = Bo + B1 * x1 + B2 * x2 + ε**   (where ε is the error term)

In [1]:
# Load libraries

from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older release

load_boston == The target value is the median value of Boston homes in the 1970s, in thousands of dollars.
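For readers whose scikit-learn release no longer ships `load_boston`, the same kind of two-feature fit can be sketched on synthetic data. This is only a stand-in under stated assumptions: `make_regression` and the names `X_demo`, `y_demo`, `true_coef` and `demo_model` are illustrative choices, not part of the original notebook, and the numbers will not match the Boston outputs shown here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the Boston data (an assumption for illustration):
# 100 observations, two features, no noise, known generating coefficients.
X_demo, y_demo, true_coef = make_regression(
    n_samples=100, n_features=2, noise=0.0, coef=True, random_state=0
)

# Fit an ordinary least-squares model, as in the notebook.
demo_model = LinearRegression().fit(X_demo, y_demo)

# With noise-free data the fitted coefficients match the generating ones.
print("intercept:", demo_model.intercept_)
print("fitted coefficients:", demo_model.coef_)
print("true coefficients:  ", true_coef)
```

Because the data is generated without noise, the fit recovers the generating coefficients, which makes it easy to check that the model behaves as expected.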

In [2]:
# Load data with only two features

boston = load_boston()
features = boston.data[:,0:2]   # All rows, two first columns
target = boston.target

In [3]:
# Create linear regression

regression = LinearRegression()

In [4]:
# Fit Linear Regression

model = regression.fit(features, target)
model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Bo == Bias or Intercept
B1 and B2 == Coefficients identified by fitting the model

In [5]:
# View the intercept Bo

Bo = model.intercept_
Bo

22.485628113468223

In [6]:
# View the feature coefficients 

coefficients = model.coef_
coefficients

array([-0.35207832,  0.11610909])


The first feature in our solution is the number of crimes per resident.

The model coefficient of this feature was approximately -0.35. Because the target is expressed in thousands of dollars, multiplying this coefficient by 1000 gives the change in house price for each additional crime per capita.

In [7]:
crime_unit_change = model.coef_[0]*1000
print("Each additional crime per capita changes the price of the house by approximately:", crime_unit_change, "$")

Each additional crime per capita changes the price of the house by approximately: -352.0783156402677 $


In [8]:
# The price of the first home in the dataset is the first value in the target vector * 1000

real_price = target[0]*1000
real_price

24000.0

In [9]:
# Using predict we can calculate the value of the house

predicted_price = model.predict(features)[0]*1000
predicted_price

24573.366631705547

In [10]:
difference = real_price - predicted_price

print("The model was off by:", difference, "$")

The model was off by: -573.3666317055468 $
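The prediction above can be reproduced by hand from the intercept and the coefficients, which also makes the "one-unit change" interpretation concrete. The sketch below copies the rounded values printed in the notebook, so the result agrees with `predict` only up to that rounding; the helper name `predict_by_hand` is hypothetical.

```python
# Values copied from the notebook outputs above.
intercept = 22.485628113468223
coefficients = [-0.35207832, 0.11610909]
first_observation = [6.32e-03, 1.80e+01]


def predict_by_hand(observation):
    """Return intercept + sum(coefficient * feature value)."""
    return intercept + sum(b * x for b, x in zip(coefficients, observation))


# Prediction for the first home, converted from thousands of dollars.
prediction = predict_by_hand(first_observation)
print(prediction * 1000)

# Raise the first feature by one unit and hold the second fixed:
# the prediction moves by the first coefficient.
shifted = [first_observation[0] + 1, first_observation[1]]
print(predict_by_hand(shifted) - prediction)
```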


## Handling Interaction Effects

You have a feature whose effect on the target variable depends on another feature.

Solution: Create an interaction term to capture that dependence using scikit-learn's PolynomialFeatures.

In [11]:
# Load Libraries

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

In [12]:
# Create an interaction term

interaction = PolynomialFeatures(degree = 3, include_bias = False, interaction_only = True)
features_interaction = interaction.fit_transform(features)

In [13]:
# Create Linear Regression 

regression = LinearRegression()

In [14]:
# Fit the Linear Regression

model = regression.fit(features_interaction, target)

Sometimes a feature's effect on our target variable is at least partially dependent on another feature.

Example: There are two factors that determine the sweetness of coffee.

1. Add sugar

2. Stir the cup

Both are needed for the coffee to taste sweet; either one on its own has no effect.

We can account for interaction effects by including a new feature comprising the product of corresponding values from the interacting features.

ŷ = Bo + B1*x1 + B2*x2 + B3*x1*x2 + ε
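A toy version of the coffee example shows why the product term matters. The sketch below assumes 0/1 flags for sugar and stirring and a sweetness of 1 only when both are present; it fits the model with and without the interaction term using NumPy's least-squares solver rather than scikit-learn.

```python
import numpy as np

# Toy version of the coffee example (values are illustrative assumptions):
# sugar and stir are 0/1 flags; the coffee is sweet only when both are 1.
sugar = np.array([0.0, 1.0, 0.0, 1.0])
stir = np.array([0.0, 0.0, 1.0, 1.0])
sweetness = sugar * stir

ones = np.ones_like(sugar)

# Model without an interaction term: Bo + B1*sugar + B2*stir
X_main = np.column_stack([ones, sugar, stir])
coef_main, *_ = np.linalg.lstsq(X_main, sweetness, rcond=None)
sse_main = np.sum((sweetness - X_main @ coef_main) ** 2)

# Model with the interaction term: Bo + B1*sugar + B2*stir + B3*sugar*stir
X_inter = np.column_stack([ones, sugar, stir, sugar * stir])
coef_inter, *_ = np.linalg.lstsq(X_inter, sweetness, rcond=None)
sse_inter = np.sum((sweetness - X_inter @ coef_inter) ** 2)

print("SSE without interaction:", sse_main)
print("SSE with interaction:   ", sse_inter)
print("coefficients with interaction:", coef_inter.round(6))
```

Without the product term the model cannot represent "sweet only when both happen" and leaves a residual error; with it, all the weight lands on B3 and the fit is exact.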

In [15]:
# View the feature values for the first observation

features[0]

array([6.32e-03, 1.80e+01])

To create an interaction term, we simply multiply these two values together for every observation.

**Polynomial Features**

Creates interaction terms for all combinations of features. We can then use model selection strategies to identify the combination of features and interaction terms that produces the best model.

Three parameters to note:

interaction_only : True ==> Tells PolynomialFeatures to return only interaction terms (no polynomial features)

include_bias : False ==> Prevents the output from including a bias (a feature containing only 1s)

degree : maximum number of features to create interaction terms from (for example, 3 if we want interactions among up to three features). Without interaction_only, it is instead the maximum polynomial degree (x, x², x³)
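The effect of these three parameters is easiest to see on a single small observation. The sketch below uses illustrative values (x1 = 2, x2 = 3) that are not from the Boston data, and compares the three settings side by side.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One observation with two features, x1 = 2 and x2 = 3 (illustrative values).
X_small = np.array([[2.0, 3.0]])

# Interaction terms only: x1, x2, x1*x2
only_interactions = PolynomialFeatures(
    degree=2, include_bias=False, interaction_only=True
).fit_transform(X_small)

# Full degree-2 expansion: x1, x2, x1^2, x1*x2, x2^2
full_polynomial = PolynomialFeatures(
    degree=2, include_bias=False, interaction_only=False
).fit_transform(X_small)

# include_bias=True prepends a constant column of ones.
with_bias = PolynomialFeatures(
    degree=2, include_bias=True, interaction_only=True
).fit_transform(X_small)

print(only_interactions)
print(full_polynomial)
print(with_bias)
```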


In [16]:
# For each observation multiply the value of the first and second feature

interaction_term = np.multiply(features[:,0], features[:,1])

In [17]:
# View interaction term for first observation.

interaction_term[0]

0.11376

We can check the output of PolynomialFeatures from our solution by verifying that the first observation's feature values and interaction term value match our manually calculated version.

In [18]:
# View the values of the first observation

features_interaction[0]

array([6.3200e-03, 1.8000e+01, 1.1376e-01])

## Fitting a Nonlinear Relationship

Create a Polynomial Regression by including polynomial features in a linear regression model.

Convert this:        ŷ = Bo + B1*x1

into this:   ŷ = Bo + B1*x1 + B2*x1^2 + ... + Bd*x1^d + ε,   where d is the degree of the polynomial.

We make the model more flexible by raising an existing feature to some power (x², x³) and adding the results as new features; linear regression treats these values like any other feature.

In [28]:
features_non_linear = boston.data[:,0:1]

In [29]:
# Create polynomial features x^2 and x^3

polynomial = PolynomialFeatures(degree = 3, include_bias = False)
features_polynomial = polynomial.fit_transform(features_non_linear)

In [30]:
# Create a Linear Regression

regression = LinearRegression()

In [31]:
# Fit the linear regression

model = regression.fit(features_polynomial, target)

In [32]:
# View first observation of the dataset

features_non_linear[0]

array([0.00632])

To create a polynomial feature we raise the first observation's value to the second power, x².



In [33]:
# View first observation raised to the second power x²

features_non_linear[0]**2

array([3.99424e-05])

This would be our new feature. We then also raise the value to the third power, x³.

In [34]:
# View first observation raised to the third power x³ 

features_non_linear[0]**3

array([2.52435968e-07])

By including all three features in our features matrix and then running a linear regression we have conducted a polynomial regression.

In [35]:
# View the first observation values for x, x²,x³

features_polynomial[0]


array([6.32000000e-03, 3.99424000e-05, 2.52435968e-07])
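To see what the extra powers buy us, the following sketch compares a straight-line fit with a cubic fit on synthetic curved data. The data-generating curve and all variable names are assumptions made for illustration, and NumPy's least-squares solver stands in for `LinearRegression`.

```python
import numpy as np

# Synthetic curved data (an assumption for illustration):
# y = 1 + 2x - x^2 + 0.5x^3, with no noise.
x = np.linspace(-2.0, 2.0, 21)
y = 1.0 + 2.0 * x - 1.0 * x**2 + 0.5 * x**3

ones = np.ones_like(x)

# Straight-line model: Bo + B1*x
X_line = np.column_stack([ones, x])
coef_line, *_ = np.linalg.lstsq(X_line, y, rcond=None)
sse_line = np.sum((y - X_line @ coef_line) ** 2)

# Polynomial regression: the same least-squares fit on x, x^2 and x^3
X_cubic = np.column_stack([ones, x, x**2, x**3])
coef_cubic, *_ = np.linalg.lstsq(X_cubic, y, rcond=None)
sse_cubic = np.sum((y - X_cubic @ coef_cubic) ** 2)

print("SSE, straight line:", sse_line)
print("SSE, cubic features:", sse_cubic)
print("recovered coefficients:", coef_cubic.round(6))
```

The fitting procedure is unchanged between the two models; only the feature matrix differs, which is why polynomial regression is still a linear regression in its coefficients.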

## Reducing Variance with Regularization 

Use a learning algorithm that includes a shrinkage penalty (regularization), such as ridge regression or lasso regression. They differ in the shrinkage penalty they apply. As a rule of thumb: Ridge VS Lasso == Better predictions VS More interpretable model

In standard linear regression the model trains to minimize the sum of squared errors between the true values y and the predictions ŷ.

Regularized regression learners are similar, but they add a shrinkage penalty that shrinks the coefficients of the model toward zero.

**Elastic Net** A simple regression model with both penalties included.

Regardless of which one we use, both ridge and lasso regression penalize large or complex models by including the coefficient values in the loss function we are trying to minimize.
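Written out in a common textbook form, the loss functions look as follows. Treat this as a sketch: scikit-learn's own objectives scale some terms differently, α is the penalty strength, B_j are the coefficients, and the mixing weight ρ for elastic net is a symbol introduced here for illustration.

$$
\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

$$
\text{Ridge: } \text{RSS} + \alpha \sum_{j=1}^{p} B_j^2
\qquad
\text{Lasso: } \text{RSS} + \alpha \sum_{j=1}^{p} |B_j|
$$

$$
\text{Elastic net: } \text{RSS} + \alpha \left( \rho \sum_{j=1}^{p} |B_j| + (1 - \rho) \sum_{j=1}^{p} B_j^2 \right)
$$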

In [38]:
# Load libraries 

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

### Ridge Regression 

The shrinkage penalty is a tuning hyperparameter multiplied by the sum of the squared coefficients.
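The effect of this penalty can be sketched with the closed-form ridge solution in plain NumPy. The data below is synthetic, the helper `ridge_coefficients` is hypothetical, and the formula assumes the usual objective of squared error plus alpha times the sum of squared coefficients on centered data; it is an illustration, not scikit-learn's solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 50 observations, 3 features.
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Center features and target so the intercept is not penalized.
X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()


def ridge_coefficients(alpha):
    """Closed-form ridge solution: (X'X + alpha*I)^-1 X'y."""
    n_features = X_centered.shape[1]
    gram = X_centered.T @ X_centered + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X_centered.T @ y_centered)


# Larger alpha means a stronger penalty and smaller coefficients.
alphas = [0.0, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(ridge_coefficients(a)) for a in alphas]
for a, norm in zip(alphas, norms):
    print(f"alpha={a:>6}: coefficient norm = {norm:.4f}")
```

With alpha set to 0 the formula reduces to ordinary least squares; as alpha grows, the coefficient vector is pulled steadily toward zero.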

In [39]:
# Load data

features_ridge = boston.data
target_ridge = boston.target

In [45]:
# Standardize features

scaler = StandardScaler()

features_standardized = scaler.fit_transform(features_ridge)

In [46]:
# Create a Ridge Regression with an alpha value

regression = Ridge(alpha = 0.5)

In [50]:
# Fit the linear Regression 

model = regression.fit(features_standardized, target)
model.coef_

array([-0.92396151,  1.07393055,  0.12895159,  0.68346136, -2.0427575 ,
        2.67854971,  0.01627328, -3.09063352,  2.62636926, -2.04312573,
       -2.05646414,  0.8490591 , -3.73711409])

The hyperparameter alpha lets us control how much we penalize the coefficients, with higher values of alpha creating simpler models.

The ideal value of alpha should be tuned like any other hyperparameter. 

#### scikit-learn includes a RidgeCV method that allows us to select the ideal value of alpha


In [52]:
# Load libraries

from sklearn.linear_model import RidgeCV

In [55]:
# Create Ridge Regression with three alpha values

regression_ridgecv = RidgeCV(alphas = [0.1, 1.0, 10.0])

In [56]:
# Fit the linear regression

model_cv = regression_ridgecv.fit(features_standardized, target)

In [57]:
# View Coefficients

model_cv.coef_

array([-0.91987132,  1.06646104,  0.11738487,  0.68512693, -2.02901013,
        2.68275376,  0.01315848, -3.07733968,  2.59153764, -2.0105579 ,
       -2.05238455,  0.84884839, -3.73066646])

In [58]:
# We can easily view the best model's alpha value.

model_cv.alpha_

1.0
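What RidgeCV automates can be sketched by hand: score each candidate alpha with cross-validation and keep the best one. The sketch below assumes synthetic data and 5-fold cross-validation, whereas RidgeCV defaults to an efficient leave-one-out scheme, so it illustrates the idea rather than reproducing the exact procedure.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption for illustration).
X_cv, y_cv = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=0
)

candidate_alphas = [0.1, 1.0, 10.0]

# Mean cross-validated score per alpha. scikit-learn reports negative MSE,
# so larger (closer to zero) is better.
mean_scores = {}
for alpha in candidate_alphas:
    scores = cross_val_score(
        Ridge(alpha=alpha), X_cv, y_cv,
        scoring="neg_mean_squared_error", cv=5,
    )
    mean_scores[alpha] = scores.mean()

best_alpha = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best alpha:", best_alpha)
```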

### Lasso Regression

Simplify your linear regression model by reducing the number of features.

The shrinkage penalty is a tuning hyperparameter multiplied by the sum of the absolute value of all coefficients.

In [59]:
# Load Libraries 

from sklearn.linear_model import Lasso

In [60]:
# Create Lasso Regression 

regression_lasso = Lasso(alpha = 0.5)

In [61]:
# Fit the linear regression

model_lasso = regression_lasso.fit(features_standardized, target)

Lasso can shrink the coefficients of a model all the way to 0, effectively reducing the number of features in the model. For example, in our solution we set alpha to 0.5 and we can see that many coefficients are 0, meaning that their corresponding features are not used in the model.

In [62]:
# View coefficients

model_lasso.coef_

array([-0.11526463,  0.        , -0.        ,  0.39707879, -0.        ,
        2.97425861, -0.        , -0.17056942, -0.        , -0.        ,
       -1.59844856,  0.54313871, -3.66614361])

If we increase alpha to a much higher value we can see that literally none of the features are being used.

In [65]:
#Create Lasso alpha = 10 

regression_a10 = Lasso(alpha=10)
model_a10 = regression_a10.fit(features_standardized, target)
model_a10.coef_

array([-0.,  0., -0.,  0., -0.,  0., -0.,  0., -0., -0., -0.,  0., -0.])

The practical benefit of this effect is that we could include 100 features in our feature matrix and then, by adjusting lasso's alpha hyperparameter, produce a model that uses only, say, the 10 most important features.

This lets us reduce variance while improving the interpretability of our model (fewer features are easier to explain).
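The 100-feature scenario can be tried out on synthetic data. The sketch below assumes a dataset from `make_regression` in which only 10 of the 100 features influence the target, and counts how many coefficients survive at two illustrative alpha values; the exact count at the smaller alpha depends on the data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration): 100 features,
# of which only 10 actually influence the target.
X_wide, y_wide = make_regression(
    n_samples=200, n_features=100, n_informative=10,
    noise=1.0, random_state=0,
)

# Count how many coefficients survive at each penalty strength.
nonzero_counts = {}
for alpha in [1.0, 1000.0]:
    lasso = Lasso(alpha=alpha).fit(X_wide, y_wide)
    nonzero_counts[alpha] = int(np.count_nonzero(lasso.coef_))

print(nonzero_counts)
```

A moderate alpha keeps only a subset of the features, while a very large alpha removes them all, mirroring the behaviour seen with alpha = 10 on the Boston data.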