# Introduction: Ridge, Lasso, and Elastic Net Regression with the Ames Housing Dataset

Welcome to this Jupyter Notebook, where we'll be exploring the world of regression analysis using the **Ames Housing Dataset**. In this analysis, we'll not only delve into simple linear regression but also extend our exploration to multiple linear regression and regularized regression techniques like  <span style="color:red">**Ridge**</span>,  <span style="color:red">**Lasso**</span>, and  <span style="color:red">**Elastic Net**</span>. By employing these techniques, we aim to predict a continuous target variable, the house price, based on multiple input features.

### Objective:
Our primary goal is to understand the relationships between various features of a house, such as its size, overall quality, exterior quality, and others, and its sale price. By constructing a multiple linear regression model, we aim to predict house prices based on these selected features and assess the accuracy and performance of our model using various metrics.

$$\Large \displaystyle \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \dots + \hat{\beta}_p x_{ip}$$

 - $\hat{y}_i$ is the predicted value of the dependent variable (house price) for the $i^{th}$ observation.
 - $\hat{\beta}_0$ is the y-intercept of the regression line.
 - $\hat{\beta}_j$ represents the coefficient of the $j^{th}$ feature.
 - $x_{ij}$ is the value of the $j^{th}$ feature for the $i^{th}$ observation.
 - $p$ is the total number of features used in the model.
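
As a minimal illustration of the formula above, the sketch below computes $\hat{y}_i$ for a single observation in plain Python. The intercept, coefficients, and feature values are made up for the example; they are not estimates from the Ames data.

```python
# Hypothetical fitted coefficients (not estimated from the Ames data)
beta_0 = 10_000.0          # intercept
betas = [50.0, 8_000.0]    # coefficients for two illustrative features

# Hypothetical feature values for one house: living area and quality rating
x_i = [1_500.0, 7.0]

# y_hat_i = beta_0 + sum_j beta_j * x_ij
y_hat_i = beta_0 + sum(b * x for b, x in zip(betas, x_i))
print(y_hat_i)  # 141000.0
```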


### Dataset Overview:
The Ames Housing Dataset provides a comprehensive snapshot of the housing market in Ames, Iowa. It contains detailed information about various attributes of houses, from their physical characteristics to sale details. 

### Structure of this Notebook:
1. [Installing and Importing Necessary Libraries](#ch1)
2. [Loading the Ames Housing Dataset](#ch2)
3. [Splitting the Data](#ch3)
4. [Data Processing](#ch4)
5. [Training the Models](#ch5)
6. [Overview of Model Training Results](#ch6)
7. [Making Predictions](#ch7)
8. [Evaluating the Model](#ch8)
9. [Conclusion](#ch9)

By the end of this notebook, you'll have a comprehensive understanding of how to build, train, and evaluate various regression models, including Linear Regression, Ridge, Lasso, and Elastic Net, using Python and scikit-learn. Moreover, you'll be familiar with the power of GridSearchCV in model optimization. Let's embark on this analytical journey!

## 1. Installing and Importing Necessary Libraries <a id='ch1'></a>

Before starting our analysis, we need to import the necessary Python libraries that will be used throughout this notebook:

- **`pandas`**: A foundational library for data manipulation and analysis. It provides data structures for efficiently storing large datasets and tools for reshaping, aggregating, and merging data.

- **`sklearn.datasets`**: From scikit-learn, this module allows us to fetch datasets, including the Ames Housing dataset, providing a convenient way to load data for our analysis.

- **`sklearn.model_selection`**: This module offers various utilities for model selection, including `train_test_split` for partitioning our data, `GridSearchCV` for exhaustive search over specified parameter values, and `cross_val_score` for evaluating metric scores by cross-validation.

- **`sklearn.linear_model`**: Houses various linear models. In this notebook, we'll be using `LinearRegression` for basic regression, and `Ridge`, `Lasso`, and `ElasticNet` for regularized regression techniques.

- **`sklearn.metrics`**: Provides functions for model evaluation, including `mean_squared_error` to measure the average squared difference between actual and predicted values, `r2_score` to compute the coefficient of determination indicating the model's explanatory power, and `make_scorer` to create custom scoring functions for model validation processes.

- **`sklearn.impute`**: Contains methods for imputation, which is the process of replacing missing data with substituted values. Here, we'll use `SimpleImputer` to handle missing values in our dataset.

- **`sklearn.preprocessing`**: Offers common utility functions and transformer classes to change raw feature vectors into a representation more suitable for downstream estimators. We'll use `StandardScaler` to standardize features, `PolynomialFeatures` to generate polynomial and interaction features, and `OneHotEncoder` to convert categorical variables into a format that can be provided to machine learning algorithms to improve predictions.

- **`sklearn.pipeline`**: Provides utilities to build a composite estimator, streamlining many of the routine processes. We'll use `Pipeline` to assemble several steps that can be cross-validated together.

- **`sklearn.compose`**: This module provides utilities to work with heterogeneous data and to integrate transformers into a pipeline. We'll use `ColumnTransformer` to apply different preprocessing steps to different subsets of the features.

By importing these libraries upfront, we ensure a smooth workflow, allowing us to focus on the core analysis without interruptions.

In [1]:
# Importing libraries
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

## 2. Loading the Ames Housing Dataset <a id='ch2'></a>

The **Ames Housing Dataset** is a comprehensive dataset that provides detailed information about individual residential homes in Ames, Iowa. It contains **79 explanatory variables** describing various aspects of the houses, such as their physical characteristics, location, and sale details. Given the richness of this dataset, it's a popular choice for regression analysis in machine learning.

For this analysis, we'll utilize 10 of the available features in the dataset to predict the sale price of the houses:

- `GrLivArea`: This represents the above-ground living area in square feet. It's intuitive that the size of the living area would be a significant predictor of house price.

- `OverallQual`: This is an overall material and finish quality rating, ranging from 1 (very poor) to 10 (very excellent). Houses with higher quality materials and finishes generally sell for higher prices.

- `YearBuilt`: The year the house was originally constructed. Newer houses might fetch higher prices due to modern design, better insulation, newer materials, etc.

- `TotalBsmtSF`: Total square feet of the basement area. A larger basement can add significant value to a house, especially if it's finished.

- `GarageCars`: Size of the garage in car capacity. A larger garage can add value, especially in areas where parking is at a premium or where winters are harsh.

- `GarageArea`: Size of the garage in square feet. A larger garage can add value to a property.

- `1stFlrSF`: First-floor square feet. The size of the first floor can be a significant factor in house pricing.

- `FullBath`: Number of full bathrooms above grade. Bathrooms can significantly influence a home's value.

- `CentralAir`: Central air conditioning can be a significant factor in home pricing, especially in areas with hot summers.

- `ExterQual`: Evaluates the quality of the material on the exterior. High-quality materials can enhance the curb appeal and durability of a home, thus affecting its price.

These features capture various aspects of a house, including its size, quality, age, and amenities. Including a mix of continuous (like `GrLivArea` and `TotalBsmtSF`), ordinal (like `OverallQual`, which is stored as an integer rating), and categorical (like `CentralAir` and `ExterQual`) features can provide a comprehensive view of the house's characteristics.

To load the dataset, we use the `fetch_openml` function from scikit-learn. This function fetches datasets from the **OpenML repository**, making it easy to access a wide range of datasets for machine learning and data analysis.

In [2]:
# Fetch the Ames Housing Dataset and load it as a pandas DataFrame
housing = fetch_openml(name="house_prices", as_frame=True)

# Define the entire dataset as X
X = housing.data

# Select the desired independent variables (features) 
selected_features = [
    'GrLivArea', 'OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GarageCars', 
    'GarageArea', '1stFlrSF', 'FullBath', 'CentralAir', 'ExterQual'
]

# Extract these features from the dataset
X_selected = X[selected_features]

# Set the target variable as the sale price of the houses
y = housing.target

## 3. Splitting the Data: <a id='ch3'></a>

Before training our linear regression model, it's essential to split the dataset into two parts: a <span style="color:red">**training set**</span> and a <span style="color:red">**testing set**</span>. This allows us to train our model on one subset of the data and then test its performance on a separate, unseen subset. This approach helps evaluate how well our model is likely to perform on new, unseen data.

In this notebook, we're using the `train_test_split` function from scikit-learn to achieve this split:

- `X_train`, `y_train`: These are the features and target variable for the training set, respectively. The model will learn from this data.

- `X_test`, `y_test`: These are the features and target variable for the testing set, respectively. We'll use this data to evaluate the model's performance.

We've reserved 20% of the data for testing (`test_size=0.2`). The `random_state` parameter is set to 1337, ensuring that the data split is reproducible. This means that every time we run this code, we'll get the same train/test split, which is useful for consistent results and comparisons.

In [3]:
# Split the selected features and target into training and testing sets, with 20% of the data reserved for testing
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=1337)
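
The effect of fixing `random_state` can be checked on a small example. The sketch below uses toy lists in place of the housing data (the names `X_toy` and `y_toy` are ours) and shows that two calls with the same seed return the same 80/20 partition.

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for the housing features and prices
X_toy = [[i] for i in range(10)]
y_toy = list(range(10))

# Two calls with the same random_state produce the same partition
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1337)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1337)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))  # 8 2
print(split_a == split_b)    # True
```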

## 4. Data Processing <a id='ch4'></a>

The `SettingWithCopyWarning` is a common warning in pandas and arises when you try to modify a subset of a DataFrame, which might be a view rather than a copy. This can lead to unexpected behavior because changes to the subset might not reflect in the original DataFrame or vice versa.

To fix this, we explicitly create a copy of the DataFrame before making modifications. This ensures that we're working with an independent copy and not a view of the original data.

By using `X_clean = X_selected.copy()`, we create an explicit copy of the selected features, and then all modifications are made to this copy, avoiding the warning.

In [4]:
# Create a copy of the data to avoid SettingWithCopyWarning
X_clean = X_selected.copy()

### Data Preprocessing

Data preprocessing is a crucial step in the machine learning pipeline. It involves preparing the raw data to make it suitable for model training. For the **Ames Housing Dataset**, we'll focus on several preprocessing tasks:

1. <span style="color:red">**Identifying Data Types**</span>: The first step is to categorize each feature based on its data type. Features can be broadly classified into:
 - **Numerical Features**: These are quantifiable variables, representing a measurable quantity.
 - **Categorical Features**: These represent distinct categories or labels without any inherent order or priority.


2. <span style="color:red">**Handling Missing Values**</span>: Real-world datasets frequently contain missing values, which can hinder the performance of machine learning algorithms. Instead of discarding rows with missing values, we'll address this by imputing them:
 - **Numerical Columns**: Missing values will be replaced with the column's mean.
 - **Categorical Columns**: Missing values will be substituted with the most frequent category in the column. This approach ensures we retain valuable data that would otherwise be lost.
 
 
3. <span style="color:red">**Feature Engineering**</span>: This step involves creating new features or modifying existing ones to enhance the model's predictive power. For our dataset:
 - **Standardization**: We'll employ the StandardScaler to standardize numerical features, ensuring each has a mean of 0 and a standard deviation of 1. This transformation ensures that each feature contributes equally to the model's performance and aids in the optimization algorithm's convergence during training.
 - **Polynomial Features**: To capture potential non-linear relationships between the features and the target variable, we'll introduce polynomial features. Specifically, we'll generate quadratic features (degree 2) using the `PolynomialFeatures` class. This adds the square of each original numerical feature as well as the pairwise products (interaction terms) between them, potentially helping our model recognize parabolic relationships and interactions.


4. <span style="color:red">**Constructing the Preprocessing Pipeline**</span>: To streamline the preprocessing steps, we use a `ColumnTransformer`. This allows us to apply different preprocessing steps to different subsets of the features.


5. <span style="color:red">**Creating the Final Pipelines for Multiple Models**</span>: To facilitate a systematic comparison of different regression models, we'll set up individual pipelines for each of them. This approach ensures that each model benefits from the same preprocessing steps, allowing for a fair comparison. Here's a breakdown:
 - **Models List**: We've chosen a set of regression models for our analysis: Linear Regression, Ridge, Lasso, and ElasticNet. Each of these models has its strengths and can provide unique insights into the data.
 - **Pipelines Dictionary**: For each model, we create a dedicated pipeline that integrates our preprocessing steps with the model. This results in a dictionary of pipelines, one for each model.

By meticulously preprocessing our data, we lay a robust foundation for the subsequent modeling phase, ensuring our algorithms operate on clean and well-structured data.

### 4.1 - Identifying Data Types

In [5]:
# Identify numerical columns
num_cols = X_clean.select_dtypes(include=['float64', 'int64']).columns.tolist()

# Identify categorical columns
cat_cols = X_clean.select_dtypes(exclude=['float64', 'int64']).columns.tolist()

### 4.2 - Handle Missing Values

In [6]:
# Define imputer for numerical columns (using mean as the default strategy)
num_imputer = SimpleImputer(strategy="mean")

# Define imputer for categorical columns (using the most frequent value as the strategy)
cat_imputer = SimpleImputer(strategy="most_frequent")
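
What the two strategies compute can be sketched in plain Python. This is only an illustration of the idea on made-up values, with `None` marking a missing entry; it is not how `SimpleImputer` is implemented.

```python
from collections import Counter
from statistics import mean

# Toy columns with missing entries marked as None (hypothetical values)
garage_area = [400.0, None, 600.0, 500.0]
central_air = ["Y", "Y", None, "N"]

# Mean imputation for a numerical column
num_fill = mean(v for v in garage_area if v is not None)
garage_area_filled = [num_fill if v is None else v for v in garage_area]

# Most-frequent imputation for a categorical column
cat_fill = Counter(v for v in central_air if v is not None).most_common(1)[0][0]
central_air_filled = [cat_fill if v is None else v for v in central_air]

print(garage_area_filled)  # [400.0, 500.0, 600.0, 500.0]
print(central_air_filled)  # ['Y', 'Y', 'Y', 'N']
```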

### 4.3 - Feature Engineering

In [7]:
# Define standard scaler for numerical columns
scaler = StandardScaler()

# Define polynomial feature creator (degree 2 for quadratic features)
poly = PolynomialFeatures(degree=2, include_bias=False)
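
Standardization itself is a one-line formula, $z = (x - \mu) / \sigma$. The sketch below applies it to a few made-up living-area values using the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import mean, pstdev

# Toy living-area values in square feet (hypothetical)
values = [1000.0, 1500.0, 2000.0]

mu = mean(values)       # 1500.0
sigma = pstdev(values)  # population standard deviation

# z = (x - mean) / std
z_scores = [(v - mu) / sigma for v in values]
print([round(z, 3) for z in z_scores])  # [-1.225, 0.0, 1.225]
```

After the transformation the values have mean 0 and standard deviation 1, so features measured in different units end up on a comparable scale.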

### 4.4 - Constructing the Preprocessing Pipeline

In [8]:
# Define a preprocessing pipeline to handle both numerical and categorical columns.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([('imputer', num_imputer), ('scaler', scaler), ('poly', poly)]), num_cols),
        ('cat', Pipeline([('imputer', cat_imputer), ('onehot', OneHotEncoder(handle_unknown='ignore'))]), cat_cols)
    ])
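
The role of `handle_unknown='ignore'` may be easier to see in a hand-written sketch. The helper `one_hot` below is a hypothetical stand-in, not the scikit-learn implementation: categories are learned from the training values, and a category that was never seen during training is encoded as all zeros instead of raising an error.

```python
# Categories seen while "fitting" on the training data (toy values)
train_values = ["Gd", "TA", "Ex", "TA"]
categories = sorted(set(train_values))  # ['Ex', 'Gd', 'TA']

def one_hot(value, categories):
    """Return one indicator per known category; unseen values map to all zeros."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("Gd", categories))  # [0, 1, 0]
print(one_hot("Fa", categories))  # [0, 0, 0]  ("Fa" was not seen in training)
```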

### 4.5 - Creating the Final Pipeline

In [9]:
# List of models to loop through
models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'ElasticNet': ElasticNet()
}

pipelines = {}

# Create a pipeline for each model
for model_name, model in models.items():
    pipelines[model_name] = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])

## 5. Training the Models <a id='ch5'></a>

After preprocessing our data, the next crucial step is to train our machine learning models. In this analysis, we're not just limiting ourselves to a single model. Instead, we're exploring multiple regression models from scikit-learn, namely `LinearRegression`, `Ridge`, `Lasso`, and `ElasticNet`.

The process of training these models involves a few steps:

- <span style="color:red">**Defining Hyperparameters Grid**</span>: For each model, we define a set of hyperparameters that we want to tune. For instance, for the `Ridge` and `Lasso` models, we're tuning the `alpha` parameter, which controls the strength of the regularization. For ElasticNet, we're also tuning the `l1_ratio`, which determines the mix of **L1** and **L2 regularization**.

- <span style="color:red">**Grid Search with Cross-Validation**</span>: We utilize `GridSearchCV` to perform an exhaustive search over the specified hyperparameter values for each model. This method not only trains the models but also performs cross-validation to determine which hyperparameter combinations yield the best performance.

- <span style="color:red">**Storing the Best Models**</span>: After the grid search, we store the best-performing model (with the best hyperparameters) and its cross-validation score for each model type.
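
For reference, the objectives these estimators minimize can be written as follows, as we understand them from the scikit-learn documentation. Here $\beta$ is the coefficient vector (intercept omitted for brevity), $n$ is the number of training observations, $\alpha$ is the `alpha` parameter, and $\rho$ is the `l1_ratio`:

$$\text{Ridge:} \quad \min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2$$

$$\text{Lasso:} \quad \min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1$$

$$\text{Elastic Net:} \quad \min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \rho \lVert \beta \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert \beta \rVert_2^2$$

Setting $\rho = 1$ in the Elastic Net objective recovers the Lasso penalty, while smaller values of $\rho$ shift weight toward the L2 term.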

The code for this process is as follows:

In [10]:
# Hyperparameters grid for each model
param_grids = {
    'LinearRegression': {},
    'Ridge': {'regressor__alpha': [0.01, 0.1, 1, 10, 100]},
    'Lasso': {'regressor__alpha': [0.01, 0.1, 1, 10, 100]},
    'ElasticNet': {
        'regressor__alpha': [0.01, 0.1, 1, 10, 100],
        'regressor__l1_ratio': [0.1, 0.5, 0.9]
    }
}

# Define multiple scoring metrics
scoring = {
    'r2': make_scorer(r2_score),
    'neg_mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False)
}

# Dictionary to store the best models and their scores
best_models = {}
best_r2_scores = {}
best_mse_scores = {}

# Perform grid search with cross-validation for each model
for model_name, pipeline in pipelines.items():
    grid_search = GridSearchCV(pipeline, param_grids[model_name], cv=5, scoring=scoring, refit='r2', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    # Store the best model and its scores
    best_models[model_name] = grid_search.best_estimator_
    best_r2_scores[model_name] = grid_search.cv_results_['mean_test_r2'][grid_search.best_index_]
    best_mse_scores[model_name] = grid_search.cv_results_['mean_test_neg_mean_squared_error'][grid_search.best_index_]
    
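    # Convert cv_results_ to a DataFrame; this is overwritten on every iteration,
    # so after the loop it holds the results of the last model (ElasticNet)
    results_df = pd.DataFrame(grid_search.cv_results_)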

By the end of this step, we'll have trained models for each type, optimized for their best hyperparameters, and ready for evaluation on the test data.

### Code Breakdown: <span style="color:red">GridSearchCV</span>

`GridSearchCV` is a function provided by scikit-learn that performs an exhaustive search over a specified parameter grid. It trains a model for every combination of hyperparameters in the grid and uses cross-validation to evaluate the performance of each combination. This helps in finding the best hyperparameters for a given model.

Here's a breakdown of the parameters and methods used:

1. <span style="color:red">**Initialization Parameters**</span>
 - `pipeline`: This is the model pipeline that will be trained and evaluated. A pipeline typically consists of a sequence of data processing steps followed by a machine learning model. In this case, the pipeline includes preprocessing steps (like imputation, scaling, etc.) and a regression model (like `LinearRegression`, `Ridge`, etc.).

 - `param_grids[model_name]`: This is the grid of hyperparameters over which the search will be performed. It's a dictionary where keys are hyperparameter names (with the appropriate prefix if they're part of a pipeline) and values are lists of values to try. For example, for the `Ridge` model, it might look like `{'regressor__alpha': [0.01, 0.1, 1, 10, 100]}`, indicating that the `alpha` hyperparameter of the Ridge regressor should be tuned over those values.

 - `cv=5`: This specifies the number of cross-validation folds. The training data is split into 5 parts (or "folds"). The model is trained 5 times, each time using 4 of the folds for training and the remaining fold for validation. This helps in getting a more robust estimate of the model's performance.

 - `scoring`: This is a dictionary specifying multiple metrics to evaluate the performance of the model during cross-validation. In this case, we're using both the negative mean squared error and the $R^2$ score.

 - `refit='r2'`: This parameter indicates which metric is used to select the best hyperparameters and to refit the final model on the full training set. Here, we're using the $R^2$ score.

 - `n_jobs=-1`: This parameter specifies how many processors should be used to train and evaluate the models in parallel. A value of `-1` means that all available processors on the machine will be used, speeding up the grid search process.


2. <span style="color:red">**Methods and Attributes**</span>
 - `grid_search.fit(X_train, y_train)`: Starts the grid search, training the model for every combination of hyperparameters using the training data.
 - `best_models[model_name] = grid_search.best_estimator_`: Extracts and stores the best model for the current model type.
 - `best_r2_scores[model_name]` and `best_mse_scores[model_name]`: Extract and store the best cross-validation scores achieved by the best model for both $R^2$ and negative mean squared error metrics.
 - `results_df = pd.DataFrame(grid_search.cv_results_)`: Converts the detailed results of the grid search into a pandas DataFrame.

In summary, the line of code initializes a grid search with cross-validation for the given model pipeline and hyperparameter grid. Once the `fit` method is called on this `grid_search` object, it will train the model for every combination of hyperparameters in the grid, evaluate their performance using 5-fold cross-validation, and store the best model and its hyperparameters.
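
To get a feel for how much work an exhaustive search implies, the sketch below enumerates the ElasticNet grid from this notebook with `itertools.product` and counts the resulting fits. It only counts candidates and trains nothing.

```python
from itertools import product

# Same ElasticNet grid as in the notebook
elastic_net_grid = {
    "regressor__alpha": [0.01, 0.1, 1, 10, 100],
    "regressor__l1_ratio": [0.1, 0.5, 0.9],
}
n_folds = 5

# Every combination of hyperparameter values is one candidate
keys = list(elastic_net_grid)
candidates = [dict(zip(keys, combo)) for combo in product(*elastic_net_grid.values())]

n_fits = len(candidates) * n_folds
print(len(candidates))  # 15
print(n_fits)           # 75 cross-validation fits, plus one final refit
print(candidates[0])    # {'regressor__alpha': 0.01, 'regressor__l1_ratio': 0.1}
```

Adding another hyperparameter or more values per hyperparameter multiplies this count, which is why `n_jobs=-1` helps on larger grids.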

## 6. Overview of Model Training Results <a id='ch6'></a>

After performing an exhaustive grid search with cross-validation for each regression model, we've compiled the results into a comprehensive table. This table provides insights into the performance of each model-hyperparameter combination, allowing us to compare and identify the best-performing model.

Here's a breakdown of the metrics presented in the table:

- `mean_fit_time`: Represents the average time (in seconds) taken to fit the model to the training data for each cross-validation fold. This metric gives an idea of the computational efficiency of the model.

- `std_fit_time`: Indicates the standard deviation of the fit times across the cross-validation folds. A higher value might suggest variability in the training times, possibly due to differences in the data folds or other factors.

- `mean_score_time`: Represents the average time (in seconds) taken to score the model on the validation set for each cross-validation fold. This metric provides insights into how quickly the model can make predictions.

- `std_score_time`: Indicates the standard deviation of the score times across the cross-validation folds. Like `std_fit_time`, a higher value might suggest variability in the scoring times.

- `param_regressor__alpha` (and other `param_` columns): These columns show the specific hyperparameters used for the model in that row. For instance, `param_regressor__alpha` displays the regularization strength used for Ridge, Lasso, and ElasticNet models.

- `split0_test_r2` to `split4_test_r2` (and the matching `split*_test_neg_mean_squared_error` columns): These columns represent the performance of the model on each of the five cross-validation folds. They provide insights into the consistency of the model's performance across different subsets of the data. Because we passed two scoring metrics, the column names end in the metric name rather than in `score`.

- `mean_test_r2` and `mean_test_neg_mean_squared_error`: Represent the average performance of the model across all cross-validation folds. These are crucial metrics as they give an overall idea of how well the model is expected to perform on unseen data.

- `std_test_r2` and `std_test_neg_mean_squared_error`: Indicate the standard deviation of the test scores across the cross-validation folds. A smaller value suggests that the model's performance is consistent across different data subsets.

- `rank_test_r2` and `rank_test_neg_mean_squared_error`: Provide a ranking of the hyperparameter combinations based on the corresponding mean test score, with 1 being the best.

By analyzing these metrics, we can gain a comprehensive understanding of each model's performance, efficiency, and consistency.
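
How the mean, standard deviation, and rank columns relate to the per-fold scores can be sketched by hand. The fold scores below are invented for two hypothetical candidates, and the population standard deviation is used on the assumption that this matches what scikit-learn reports.

```python
from statistics import mean, pstdev

# Hypothetical per-fold R^2 scores for two candidate settings
fold_scores = {
    "candidate_a": [0.86, 0.80, 0.85, 0.84, 0.85],
    "candidate_b": [0.80, 0.70, 0.82, 0.78, 0.80],
}

# Mean and spread across folds
summary = {
    name: {"mean": mean(scores), "std": pstdev(scores)}
    for name, scores in fold_scores.items()
}

# Rank 1 goes to the highest mean score
ordered = sorted(summary, key=lambda name: summary[name]["mean"], reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

print(ranks)  # {'candidate_a': 1, 'candidate_b': 2}
```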

In [11]:
# Sort the results dataframe by mean_test_r2 in descending order
sorted_results_df = results_df.sort_values(by='mean_test_r2', ascending=False)
sorted_results_df.head(5)

| | `mean_fit_time` | `std_fit_time` | `mean_score_time` | `std_score_time` | `param_regressor__alpha` | `param_regressor__l1_ratio` | `params` | `split0_test_r2` | `split1_test_r2` | `split2_test_r2` | ... | `std_test_r2` | `rank_test_r2` | `split0_test_neg_mean_squared_error` | `split1_test_neg_mean_squared_error` | `split2_test_neg_mean_squared_error` | `split3_test_neg_mean_squared_error` | `split4_test_neg_mean_squared_error` | `mean_test_neg_mean_squared_error` | `std_test_neg_mean_squared_error` | `rank_test_neg_mean_squared_error` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 0.111555 | 0.043172 | 0.010418 | 0.005 | 0.1 | 0.5 | `{'regressor__alpha': 0.1, 'regressor__l1_ratio...` | 0.864866 | -0.125797 | 0.848594 | ... | 0.393954 | 1 | -936723100.0 | -7065881000.0 | -889845000.0 | -891917300.0 | -1157572000.0 | -2188388000.0 | 2440747000.0 | 1 |
| 5 | 0.14942 | 0.082181 | 0.009667 | 0.001973 | 0.1 | 0.9 | `{'regressor__alpha': 0.1, 'regressor__l1_ratio...` | 0.863165 | -0.13135 | 0.851002 | ... | 0.395582 | 2 | -948518600.0 | -7100732000.0 | -875689100.0 | -931071100.0 | -1160791000.0 | -2203360000.0 | 2450602000.0 | 2 |
| 0 | 0.096889 | 0.041137 | 0.021173 | 0.016388 | 0.01 | 0.1 | `{'regressor__alpha': 0.01, 'regressor__l1_rati...` | 0.862856 | -0.134728 | 0.851108 | ... | 0.396857 | 3 | -950659000.0 | -7121933000.0 | -875066100.0 | -933497800.0 | -1162220000.0 | -2208675000.0 | 2458551000.0 | 3 |
| 1 | 0.133434 | 0.106109 | 0.010348 | 0.001218 | 0.01 | 0.5 | `{'regressor__alpha': 0.01, 'regressor__l1_rati...` | 0.861127 | -0.153743 | 0.851586 | ... | 0.404051 | 4 | -962641300.0 | -7241278000.0 | -872257600.0 | -945673000.0 | -1169978000.0 | -2238366000.0 | 2503412000.0 | 4 |
| 3 | 0.147301 | 0.092709 | 0.008855 | 0.001076 | 0.1 | 0.1 | `{'regressor__alpha': 0.1, 'regressor__l1_ratio...` | 0.863109 | -0.171717 | 0.846866 | ... | 0.411898 | 5 | -948902400.0 | -7354088000.0 | -900000600.0 | -880904100.0 | -1178482000.0 | -2252475000.0 | 2553022000.0 | 5 |


To get the overall best model across all the models, we can compare the best scores stored in the `best_r2_scores` dictionary. The following code displays the best estimator across all models:

In [12]:
# Identify the model with the highest R^2 score
best_model_name = max(best_r2_scores, key=best_r2_scores.get)

# Display the best estimator across all models
best_overall_model = best_models[best_model_name]
best_overall_model

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler()),
                                                                  ('poly',
                                                                   PolynomialFeatures(include_bias=False))]),
                                                  ['GrLivArea', 'OverallQual',
                                                   'YearBuilt', 'TotalBsmtSF',
                                                   'GarageCars', 'GarageArea',
                                                   '1stFlrSF', 'FullBath']),
                                                 ('cat',
                            

Based on the results from `GridSearchCV` and the provided hyperparameter grid, the ElasticNet model with an `alpha` value of 0.1 performed the best in terms of the metric used for model selection (the $R^2$ score, as set by `refit='r2'`) during cross-validation on the training data. In the results table above, the top-ranked row pairs this `alpha` with an `l1_ratio` of 0.5.

Remember, the `alpha` parameter in ElasticNet controls the strength of the regularization. A smaller `alpha` means less regularization and a model that's more likely to fit the training data closely, while a larger `alpha` increases the regularization strength, which can make the model more general and potentially prevent overfitting.

Within the grid we tried (`[0.01, 0.1, 1, 10, 100]`), an `alpha` of 0.1 is on the smaller end of the spectrum. This means the chosen model is regularized relatively lightly compared to models with higher `alpha` values from the grid, although still more strongly than with the smallest value we tried, 0.01.
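
The shrinking effect of `alpha` can be seen in a deliberately simplified setting. The sketch below is not the Elastic Net fit used in this notebook: it uses a pure L2 (ridge) penalty, one centered feature, no intercept, and made-up data, because that case has a closed-form slope, $\hat{\beta} = \sum x_i y_i / (\sum x_i^2 + \alpha)$.

```python
# Toy one-feature data, already centered (hypothetical values)
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.0, 2.1, 3.9]

def ridge_slope(x, y, alpha):
    """Closed-form slope minimizing sum((y - b*x)^2) + alpha * b^2 (no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

# The same alpha values as in the notebook's grid
slopes = [ridge_slope(x, y, alpha) for alpha in [0.01, 0.1, 1, 10, 100]]
for alpha, slope in zip([0.01, 0.1, 1, 10, 100], slopes):
    print(alpha, round(slope, 4))
```

As `alpha` grows, the slope moves steadily toward zero, which is the sense in which larger values mean stronger regularization.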

It's also worth noting that while this model performed the best on the training data during cross-validation, it's essential to evaluate its performance on a separate test set to get a true sense of its predictive capabilities on unseen data.

## 7. Making Predictions <a id='ch7'></a>

Having trained various models on the Ames Housing Dataset, our Elastic Net model with an alpha of 0.1 emerged as the best performer on the training data. To further assess its efficacy, it's essential to evaluate its predictions on data it hasn't encountered during training. This step is pivotal, offering insights into the model's potential real-world performance and robustness.

To make predictions on the testing data, we utilize the `predict` method of our best-performing Elastic Net model. This method ingests the features from our testing set (`X_test`) and outputs the predicted values for our target variable, `SalePrice`.

The following code:

In [13]:
# Predict the target values for the testing data using the best model
y_pred = best_overall_model.predict(X_test)

carries out this prediction, saving the results in the `y_pred` variable. Subsequently, we can juxtapose these predicted values with the actual ones to gauge the accuracy and reliability of our chosen model.

## 8. Evaluating the Model <a id='ch8'></a>

After making predictions with our trained model, it's crucial to evaluate its performance. Two common metrics used for regression tasks are:

1. <span style="color:red">**Mean Squared Error (MSE)**</span>: Measures the average squared difference between the actual values (`y_test`) and the predicted values (`y_pred`). A lower MSE indicates a better fit of the model to the data, while a higher MSE suggests potential underfitting or overfitting. The formula for MSE is given by:

$$\Large \displaystyle \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

 - $n$ is the total number of observations
 - $y_i$ is the actual value of the observation
 - $\hat{y}_i$ is the predicted value of the observation

2. <span style="color:red">**Coefficient of Determination ($R^2$ value)**</span>: Represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). An $R^2$ value closer to 1 indicates a better fit. The formula for $R^2$ is:

$$\Large \displaystyle R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}$$

 - $\text{SS}_{\text{res}}$ represents the sum of the squared residuals
 - $\text{SS}_{\text{tot}}$ represents the total sum of squares

In this section, we compute the MSE using scikit-learn's `mean_squared_error` function and the $R^2$ value using its `r2_score` function. These values give us a quantitative measure of how well our best model, the Elastic Net, predicts house prices on the test set based on the selected features.
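
Both formulas can be checked by hand on a tiny example. The prices below are invented for illustration; the code follows the definitions of MSE, $\text{SS}_{\text{res}}$, and $\text{SS}_{\text{tot}}$ given above.

```python
# Hypothetical actual and predicted sale prices (in dollars)
y_true = [200_000.0, 150_000.0, 300_000.0, 250_000.0]
y_hat = [210_000.0, 140_000.0, 290_000.0, 260_000.0]

n = len(y_true)
y_mean = sum(y_true) / n

# MSE = (1/n) * sum (y_i - y_hat_i)^2
ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_hat))
mse_manual = ss_res / n

# R^2 = 1 - SS_res / SS_tot
ss_tot = sum((yt - y_mean) ** 2 for yt in y_true)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual)           # 100000000.0
print(round(r2_manual, 3))  # 0.968
```

Note that the MSE is expressed in squared dollars, which is why values in the hundreds of millions are not unusual for house prices.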

In [14]:
# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared Value: {r2:.2f}")

Mean Squared Error (MSE): 654287636.08
R-squared Value: 0.85


## 9. Conclusion: <a id='ch9'></a>

Throughout this analysis, we delved deep into the Ames Housing dataset, exploring the relationships between various features and the target variable: house prices. Our journey began with a simple linear regression model, which provided a foundational understanding of the data's structure and the potential predictors.

The initial model, focusing solely on the living area of houses, achieved an $R^2$ value of 49.2%. This was a decent starting point, but it indicated that nearly half of the variability in house prices was left unexplained by the model.

As we expanded our horizons, incorporating a broader set of features, the results became increasingly promising. A multiple regression model that included variables like the overall quality of materials and finishes, and the exterior quality, achieved an $R^2$ value of 82.7%. This was a substantial leap from the simple model, explaining over 82% of the variability in house prices.

However, our best model, an Elastic Net model with an alpha of 0.1, surpassed even that, achieving an $R^2$ value of 85%. This model not only outperformed the simple linear regression but also the multiple regression model, underscoring the power of regularization in enhancing predictive accuracy.

Furthermore, the **mean squared error (MSE)**, a metric indicating the average squared difference between actual and predicted prices, saw consistent reductions as we refined our models. This trajectory underscores the enhanced accuracy achieved with each iteration.

In conclusion, the size of the living area, the overall quality of materials and finishes, and the exterior quality play pivotal roles in determining house prices in the Ames Housing dataset. All three variables positively influence the price, emphasizing the premium placed on size, quality, and aesthetic appeal in the real estate market. The progression from our simple to the Elastic Net model underscores the importance of feature selection, regularization, and the potential of machine learning in making more accurate predictions.

### Potential Improvements

1. <span style="color:red">**Higher Degree Polynomials**</span>: Introducing higher-degree polynomials can capture more complex relationships in the data. However, care should be taken to avoid overfitting.
2. <span style="color:red">**Expanded Hyperparameter Search**</span>: While we explored a range of values for `alpha` in our Elastic Net model, a more exhaustive search or using techniques like `RandomizedSearchCV` could potentially find even better hyperparameters.
3. <span style="color:red">**Feature Engineering**</span>: Creating new features or transforming existing ones can sometimes unveil hidden patterns in the data.
4. <span style="color:red">**Ensemble Methods**</span>: Combining predictions from multiple models can often lead to more robust and accurate predictions.

By continually refining our approach and exploring new techniques, we can hope to push the boundaries of predictive accuracy even further in future analyses.