# Linear Regression Project

Using the final version of the Ames Housing dataset I worked on through the feature engineering section of the course, I created a linear regression model (an Elastic Net), tuned its parameters with a grid search on the training data, and then evaluated the model's performance on a test set.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Data

In [2]:
df = pd.read_csv("../DATA/AMES_Final_DF.csv")

In [3]:
df.head()

Unnamed: 0,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,...,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_VWD,Sale Type_WD,Sale Condition_AdjLand,Sale Condition_Alloca,Sale Condition_Family,Sale Condition_Normal,Sale Condition_Partial
0,141.0,31770,6,5,1960,1960,112.0,639.0,0.0,441.0,...,0,0,0,0,1,0,0,0,1,0
1,80.0,11622,5,6,1961,1961,0.0,468.0,144.0,270.0,...,0,0,0,0,1,0,0,0,1,0
2,81.0,14267,6,6,1958,1958,108.0,923.0,0.0,406.0,...,0,0,0,0,1,0,0,0,1,0
3,93.0,11160,7,5,1968,1968,0.0,1065.0,0.0,1045.0,...,0,0,0,0,1,0,0,0,1,0
4,74.0,13830,5,5,1997,1998,0.0,791.0,0.0,137.0,...,0,0,0,0,1,0,0,0,1,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925 entries, 0 to 2924
Columns: 274 entries, Lot Frontage to Sale Condition_Partial
dtypes: float64(11), int64(263)
memory usage: 6.1 MB


**Separated out the data into X features and y labels**

In [5]:
X = df.drop('SalePrice', axis = 1)
y = df['SalePrice']

**Split X and y into a training set and a test set. Because I later use a grid search with cross-validation, I set the test proportion to 10%. To get the same data split as the solutions notebook, I specified random_state = 101.**

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=101)
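The effect of test_size and random_state can be checked on placeholder data. This is a minimal sketch, assuming a synthetic array with the same number of rows as the Ames frame (2925); the names X_demo and y_demo are hypothetical and the values are not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count as the Ames frame (2925 rows).
# The values are placeholders, not the real data.
n_rows = 2925
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=101
)

# Repeating the call with the same random_state reproduces the same shuffle.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=101
)

print(len(X_tr), len(X_te))          # sizes of the two partitions
print(np.array_equal(X_te, X_te2))   # identical test rows on both calls
```

Fixing random_state is what makes the split reproducible across runs and across notebooks.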

**The dataset's features have a variety of scales and units. For optimal regression performance, I scaled the X features.**

### fit
    Purpose: The fit method is used to compute the necessary parameters (like mean and standard deviation for StandardScaler) from the training data (X_train).

    Action: In the context of StandardScaler, fit calculates the mean and standard deviation for each feature in X_train.

    When to Use: Call fit on the training data only, so that the scaler learns its parameters (mean and standard deviation) from it.

### transform
    Purpose: The transform method uses these parameters to transform the data. This transformation applies the learned parameters to scale (or normalize) the data.

    Action: In StandardScaler, it subtracts the mean and divides by the standard deviation for each feature, using the parameters learned during the fit.

    When to Use: Both training and test data, but the key point is that it uses the parameters learned from the training data only.

### fit_transform
    Purpose: This is a convenience method that combines fit and transform into one step.

    Action: For StandardScaler, it first calculates the mean and standard deviation on the training data (X_train), and then it immediately scales the X_train data using these parameters.
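The three methods above can be reproduced by hand with NumPy. This is a minimal sketch, assuming hypothetical toy data (X_train_demo, X_test_demo) rather than the Ames features. It computes the mean and standard deviation on the training set only, applies them to both sets, and compares the result with StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical toy data: two features on very different scales.
X_train_demo = rng.normal(loc=[1500.0, 5.0], scale=[400.0, 1.5], size=(100, 2))
X_test_demo = rng.normal(loc=[1500.0, 5.0], scale=[400.0, 1.5], size=(20, 2))

# "fit": learn per-feature mean and standard deviation from the training set only.
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the training-set parameters to both sets.
manual_train = (X_train_demo - mu) / sigma
manual_test = (X_test_demo - mu) / sigma

scaler_demo = StandardScaler()
sk_train = scaler_demo.fit_transform(X_train_demo)  # fit + transform in one call
sk_test = scaler_demo.transform(X_test_demo)        # transform only, no refit

print(np.allclose(manual_train, sk_train), np.allclose(manual_test, sk_test))
```

Calling fit_transform on the test set instead would compute a second set of statistics from test data, which leaks information about the test set into the preprocessing.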

In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
scaler = StandardScaler()

In [10]:
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

**Elastic Net model for L1 and L2 Regularization (adding penalty to loss function for generalization)**

### Linear Regression (LinearRegression()):

    Purpose: Linear Regression is used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
    
    Method: The method involves finding the line that best fits the data, which is done by minimizing the sum of the squares of the differences between the observed and predicted values (known as the least squares method).
    
    Usage: It's generally used when the relationship between the variables is approximately linear and the model is not prone to overfitting (for example, when there are few features relative to the number of samples).
    
### Ridge Regression (Ridge(alpha)):

    Purpose: Ridge Regression, which uses L2 regularization, is an extension of linear regression that adds a regularization term to the loss function.
    
    Method: In Ridge Regression, the loss function is the least squares loss plus a term proportional to the square of the magnitude of the coefficients. The proportionality constant is the alpha parameter. This additional term penalizes large coefficients and helps to reduce model complexity and prevent overfitting.
    
    Parameter alpha: The alpha parameter controls the strength of the regularization. When alpha is zero, Ridge Regression is equivalent to Linear Regression. As alpha increases, the impact of the regularization term grows, leading to smaller coefficients (but not zero).
    
    Usage: It's used when the data is prone to multicollinearity (independent variables are highly correlated) or when you want to prevent overfitting in a model with many features.
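The relationship between least squares and Ridge can be sketched with the closed-form solution. This is an illustration under stated assumptions: the data is synthetic, the helper ridge_closed_form is hypothetical, and the intercept is turned off so that the formula (X^T X + alpha * I) w = X^T y matches scikit-learn's Ridge exactly.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + rng.normal(scale=0.1, size=50)

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols = ridge_closed_form(X_demo, y_demo, alpha=0.0)     # alpha = 0: plain least squares
w_ridge = ridge_closed_form(X_demo, y_demo, alpha=10.0)  # alpha > 0: shrunk coefficients

# Cross-check against scikit-learn's Ridge with the intercept turned off.
sk_ridge = Ridge(alpha=10.0, fit_intercept=False).fit(X_demo, y_demo)

print(np.allclose(w_ridge, sk_ridge.coef_))
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))
```

With alpha set to zero the penalty term vanishes and the solution is the ordinary least squares fit; a positive alpha pulls the coefficient vector toward zero.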
    
    

## Regularization

Regularization is a technique used in machine learning to prevent overfitting, which occurs when a model learns the training data too well, including its noise and outliers, leading to poor performance on unseen data. Regularization does this by adding a penalty to the loss function used to train the model. This penalty discourages overly complex models and helps the model generalize to new, unseen data.

The two most common types of regularization are L1 and L2 regularization.

### L1 Regularization (Lasso Regression):

    Mechanism: L1 regularization adds a penalty equal to the absolute value of the magnitude of coefficients. This is mathematically represented as the sum of the absolute values of the coefficients.

    Effect: The key characteristic of L1 regularization is that it can lead to sparse models, where some coefficients can become exactly zero. This means L1 regularization can be used for feature selection, identifying which features are important for the prediction.


### L2 Regularization (Ridge Regression):

    Mechanism: L2 regularization adds a penalty equal to the square of the magnitude of coefficients. This is mathematically represented as the sum of the squares of the coefficients.

    Effect: Unlike L1, L2 regularization does not lead to sparse models, and no coefficients are ever zeroed out. It only shrinks the size of the coefficients. L2 is good for handling multicollinearity and model complexity by distributing the error among all terms, but it doesn’t help in reducing the number of features.
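The difference between the L1 and L2 effects described above can be observed on synthetic data. This is a minimal sketch, assuming a hypothetical ground truth in which only the first three of ten features matter; the alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
n_samples, n_features = 200, 10
X_demo = rng.normal(size=(n_samples, n_features))

# Hypothetical ground truth: only the first three features matter.
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=n_samples)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)  # L1 penalty
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)  # L2 penalty

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(n_zero_lasso, n_zero_ridge)
```

Lasso sets the coefficients of uninformative features exactly to zero, while Ridge keeps every coefficient non-zero and only shrinks them.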

### Comparison:

Feature Selection: L1 regularization can zero out coefficients, effectively performing feature selection. L2 does not.

Stability: L2 regularization tends to give more stable coefficient estimates than L1 when features are correlated, because L1 may keep one feature from a correlated group and zero out the rest.

Usage: L1 is used when we have a high number of features and need to determine which are important. L2 is used when we need to handle multicollinearity, or when the model overfits but all features are important.

### Combining L1 and L2 (Elastic Net):

Combination of L1 and L2 Regularization: 
Elastic Net includes both the L1 and L2 penalty terms in its regularization formula. 
The L1 term makes it capable of reducing the coefficients of less important features to zero (like Lasso), facilitating feature selection.
The L2 term shrinks the coefficients of correlated predictors (like Ridge), which helps in handling multicollinearity.

Regularization Parameters: There are two key parameters in Elastic Net:

    Alpha (α): Overall regularization strength. A larger value of α means more regularization.
    
    L1 Ratio: Determines the mix between L1 and L2 regularization. An L1 ratio of 1 is equivalent to Lasso, while an L1 ratio of 0 is equivalent to Ridge. Values in between 0 and 1 give both L1 and L2 effects.
    
Hyperparameter Tuning: Selecting the right values for α and the L1 ratio is crucial for the performance of the Elastic Net model. This is typically done via **cross-validation**.

Standardization: Before applying Elastic Net, it's a good practice to **standardize** the features so that they are on a comparable scale.
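How alpha and the L1 ratio combine can be written out directly. The sketch below assumes the penalty form that scikit-learn documents for ElasticNet, alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2; the helper elastic_net_penalty and the synthetic data are hypothetical. It also checks that an L1 ratio of 1 gives the same fit as Lasso.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

def elastic_net_penalty(coef, alpha, l1_ratio):
    """Penalty term of scikit-learn's ElasticNet objective (data-fit term excluded)."""
    l1 = np.sum(np.abs(coef))
    l2 = np.sum(coef ** 2)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2

coef_demo = np.array([3.0, -4.0, 0.0])  # |w|_1 = 7, |w|_2^2 = 25
print(elastic_net_penalty(coef_demo, alpha=1.0, l1_ratio=1.0))  # pure L1: 7.0
print(elastic_net_penalty(coef_demo, alpha=1.0, l1_ratio=0.0))  # pure L2: 12.5
print(elastic_net_penalty(coef_demo, alpha=1.0, l1_ratio=0.5))  # mix: 9.75

# With l1_ratio=1 the Elastic Net fit coincides with Lasso at the same alpha.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
print(np.allclose(enet.coef_, lasso.coef_))
```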

In [11]:
from sklearn.linear_model import ElasticNet

In [12]:
base_elastic_model = ElasticNet()

**The Elastic Net model has two main parameters, alpha and the L1 ratio. I created a dictionary parameter grid of values for the ElasticNet.**

In [13]:
param_grid = {'alpha':[0.1,1,5,10,50,100],
              'l1_ratio':[.1, .5, .7, .9, .95, .99, 1]}

**I created a GridSearchCV object and ran a grid search for the best parameters for my model based on my scaled training data.**

GridSearchCV is a method that performs exhaustive search over a specified parameter values grid for an estimator. It's used to find the best combination of parameters from the provided param_grid.
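What "exhaustive" means here can be sketched with the standard library. The snippet below is an illustration only, not how GridSearchCV is implemented internally: it enumerates every combination in a grid with the same values as param_grid (copied into the hypothetical name param_grid_demo) and counts the fits for 5-fold cross-validation.

```python
from itertools import product

# Same values as the notebook's param_grid, copied here so the block is self-contained.
param_grid_demo = {'alpha': [0.1, 1, 5, 10, 50, 100],
                   'l1_ratio': [.1, .5, .7, .9, .95, .99, 1]}

# Every combination of one alpha with one l1_ratio is a candidate.
names = list(param_grid_demo)
candidates = [dict(zip(names, values))
              for values in product(*param_grid_demo.values())]

n_folds = 5
print(len(candidates))            # 6 * 7 = 42 candidates
print(len(candidates) * n_folds)  # 42 * 5 = 210 cross-validation fits
```

By default GridSearchCV also refits the best candidate once on the full training set after the search (refit=True), which is in addition to the cross-validation fits counted here.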

**Mean Squared Error (MSE):**

This is a common metric used for regression tasks, which measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value.

Why Negative?: In scikit-learn, many model selection tools are designed to choose the higher score as better. However, because MSE is a value that is better when lower (you want a small error), scikit-learn uses the negative of the MSE. This way, a model that results in a lower MSE will have a higher “negative MSE score” and be deemed better by selection tools like GridSearchCV.
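The sign flip can be illustrated with two hypothetical sets of predictions (model_a and model_b are made-up names, not models from this notebook). Picking the maximum negated MSE selects the same model as picking the minimum MSE.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
# Hypothetical predictions from two candidate models.
preds = {
    'model_a': np.array([2.5, 5.0, 8.0]),
    'model_b': np.array([1.0, 6.0, 10.0]),
}

# Mean squared error per model (lower is better).
mse = {name: float(np.mean((y_true - p) ** 2)) for name, p in preds.items()}
# Negated MSE per model (higher is better).
neg_mse = {name: -value for name, value in mse.items()}

best_by_score = max(neg_mse, key=neg_mse.get)  # what a "higher is better" selector picks
best_by_error = min(mse, key=mse.get)          # what "lowest error" picks
print(best_by_score, best_by_error)
```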

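**Higher is Better:** with scoring='neg_mean_squared_error', GridSearchCV ranks candidates by negated MSE, so the best candidate is the one whose score is closest to zero.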

In [14]:
from sklearn.model_selection import GridSearchCV

In [15]:
# the verbose level is a personal preference; it only controls how much progress is logged
grid_model = GridSearchCV(estimator=base_elastic_model,
                          param_grid=param_grid,
                          scoring='neg_mean_squared_error',
                          cv=5,
                          verbose=1)

In [16]:
grid_model.fit(scaled_X_train,y_train)

Fitting 5 folds for each of 42 candidates, totalling 210 fits


  model = cd_fast.enet_coordinate_descent(

GridSearchCV(cv=5, estimator=ElasticNet(),
             param_grid={'alpha': [0.1, 1, 5, 10, 50, 100],
                         'l1_ratio': [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1]},
             scoring='neg_mean_squared_error', verbose=1)

**Here I have displayed the best combination of parameters for my model**

In [17]:
grid_model.best_params_

{'alpha': 100, 'l1_ratio': 1}
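An l1_ratio of 1 means the selected model is a pure Lasso, and both selected values sit at the upper edge of the grid, so a wider alpha range might be worth trying. The cross-validated score behind the selection can be read from the fitted search object. The following is a minimal sketch on synthetic data with a smaller, hypothetical grid (grid_demo), showing how best_score_ (a negated MSE) converts to an RMSE.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.3, size=120)

# Hypothetical small grid, not the one used in the notebook.
grid_demo = GridSearchCV(
    estimator=ElasticNet(),
    param_grid={'alpha': [0.01, 0.1, 1], 'l1_ratio': [0.5, 1]},
    scoring='neg_mean_squared_error',
    cv=5,
)
grid_demo.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE across folds for the best candidate.
cv_rmse = np.sqrt(-grid_demo.best_score_)
print(grid_demo.best_params_, cv_rmse)

# predict() delegates to best_estimator_, which was refit on all of X_demo.
same = np.allclose(grid_demo.predict(X_demo),
                   grid_demo.best_estimator_.predict(X_demo))
print(same)
```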

**Here is the evaluation for my model's performance on the unseen 10% scaled test set.**

In [18]:
y_pred = grid_model.predict(scaled_X_test)

In [19]:
from sklearn.metrics import mean_absolute_error,mean_squared_error

In [20]:
mean_absolute_error(y_test,y_pred)

14195.354900562172

In [21]:
np.sqrt(mean_squared_error(y_test,y_pred))

20558.508566893164

In [22]:
np.mean(df['SalePrice'])

180815.53743589742
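To put the two error metrics in context, they can be expressed relative to the mean sale price. The sketch below defines MAE and RMSE by hand (the helpers mae and rmse are hypothetical equivalents of the scikit-learn calls above, checked on made-up toy values) and then divides the figures reported in this notebook by the mean price.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Made-up toy values, only to exercise the two helpers.
toy_true = [100.0, 200.0, 300.0]
toy_pred = [110.0, 190.0, 330.0]
print(mae(toy_true, toy_pred), rmse(toy_true, toy_pred))

# Figures copied from the notebook outputs above.
reported_mae = 14195.354900562172
reported_rmse = 20558.508566893164
mean_price = 180815.53743589742

mae_pct = 100 * reported_mae / mean_price
rmse_pct = 100 * reported_rmse / mean_price
print(round(mae_pct, 2), round(rmse_pct, 2))
```

On these numbers the MAE is about 7.9% of the mean sale price and the RMSE about 11.4%. RMSE is larger than MAE because squaring weights large errors more heavily.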