# Solution Seekers Group

Lead of the Study Group Discussion: **Badr Bensassi**

Author: **Youssef Laouina**

Email: *laouina.yusuf@gmail.com*

# Prepare Data

Importing our libraries

In [None]:
import warnings

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

warnings.simplefilter(action="ignore", category=FutureWarning)

## Importing our data into a Pandas DataFrame

In [None]:
df = pd.read_csv(
    'https://raw.githubusercontent.com/tirthajyoti/Machine-Learning-with-Python/master/Datasets/USA_Housing.csv', index_col=False)

In [None]:
df.info()

In [None]:
df.head()

## Specifying our 2D dataset

In [None]:
df_2d = df[['Avg. Area House Age', 'Price']]

In [None]:
# To display the numbers in normal notation
pd.options.display.float_format = '{:.2f}'.format

df_2d.describe()

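According to the **Central Limit Theorem**, the distributions of ***sample means*** and ***sums*** tend to approximate a **normal distribution** as the sample size increases, regardless of the distribution of the population from which the samples are drawn (provided that population has finite variance).

Features such as `Avg. Area House Age` are themselves averages, so let's examine whether their distributions look approximately normal!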

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax_flat = ax.flatten()

sns.kdeplot(data=df_2d, x='Avg. Area House Age', color='orange', fill=True, ax=ax_flat[0])
sns.kdeplot(data=df_2d, x='Price'              , color='blue'  , fill=True, ax=ax_flat[1])
    
plt.show()
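The theorem itself can also be checked with a small simulation. The sketch below is only an illustration: it uses a hypothetical, strongly skewed population (exponential, not the housing data) and compares the skewness of the sample means for a small and a large sample size. The helper names `sample_means` and `skewness` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed population (exponential); NOT taken from the housing data
population = rng.exponential(scale=1.0, size=100_000)

def sample_means(data, sample_size, n_samples, rng):
    """Draw n_samples random samples of the given size and return their means."""
    draws = rng.choice(data, size=(n_samples, sample_size), replace=True)
    return draws.mean(axis=1)

def skewness(values):
    """Third standardized moment: close to 0 for a symmetric distribution."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

means_small = sample_means(population, sample_size=2, n_samples=5_000, rng=rng)
means_large = sample_means(population, sample_size=200, n_samples=5_000, rng=rng)

print("Skewness of population:    ", round(skewness(population), 2))
print("Skewness of means (n=2):   ", round(skewness(means_small), 2))
print("Skewness of means (n=200): ", round(skewness(means_large), 2))
```

The skewness of the sample means should shrink toward zero as the sample size grows, which is what the theorem predicts.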

### Why is it useful to know the distribution of your data?

Exploring the distribution of data features can help you understand their characteristics, such as their central tendencies and variability. This information is useful for several reasons:

* It helps you identify outliers or unusual patterns in the data.

* It helps you choose appropriate statistical techniques and model assumptions.

* It provides insights into the relationship between the variables, which can inform the choice of predictors in your model.

## Explore: Visual analytics methods

In [None]:
sns.scatterplot(data=df_2d, x='Avg. Area House Age', y='Price', color='yellow', edgecolor='black', s=50);

- There appears to be some kind of relationship between `Price` and `Avg. Area House Age`.

## Regression Model Assumptions

Will a linear model be sufficient to catch the relationship between `Price` and `Avg. Area House Age`?

Well, it depends on some factors...

We make a few assumptions when we use linear regression to model the relationship between a response and a predictor. These assumptions are essentially **conditions** that should be met before we draw inferences regarding the model estimates or before we use a model to make a prediction.

* The true relationship is linear (between a response and a predictor)
* Errors are normally distributed
* Homoscedasticity of errors (or, equal variance around the line).
* Independence of the observations



**How do we check regression assumptions?** We examine the variability left over **after** we fit the regression line. We simply graph the residuals and look for any unusual patterns.

If a linear model makes sense, the residuals will:

* have a constant variance
* be approximately normally distributed (with a mean of zero), and
* be independent of one another.
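The residual checks listed above can be sketched numerically. This is a rough illustration on hypothetical synthetic data (a known linear relationship with constant-variance noise), fitted with `np.polyfit` rather than the scikit-learn model used later in this notebook. The half-split comparison and the lag-1 correlation are informal stand-ins for proper diagnostic plots and tests.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: a truly linear relationship plus constant-variance noise
x = rng.uniform(2, 10, size=500)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=500)

# Fit a straight line by least squares (np.polyfit returns the slope first for deg=1)
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# 1. Residual mean: zero (up to rounding) for a least-squares fit with an intercept
print("Residual mean:", round(residuals.mean(), 6))

# 2. Rough constant-variance check: residual spread in the lower vs. upper half of x
order = np.argsort(x)
low_half, high_half = residuals[order[:250]], residuals[order[250:]]
print("Std (low x): ", round(low_half.std(), 3))
print("Std (high x):", round(high_half.std(), 3))

# 3. Rough independence check: correlation between neighbouring residuals (ordered by x)
ordered = residuals[order]
lag1_corr = np.corrcoef(ordered[:-1], ordered[1:])[0, 1]
print("Lag-1 correlation:", round(lag1_corr, 3))
```

Similar spreads in both halves and a lag-1 correlation near zero are consistent with the assumptions. A clear difference in spread or a strong correlation would suggest that a straight line is missing some structure.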

## Splitting our data

In [None]:
feature_matrix = df[['Avg. Area House Age']]
target_vector = df['Price']

print(f"Feature Matrix: {feature_matrix.shape}",
      f"Target Vector: {target_vector.shape}",
      sep='\n')

# Building the model

Importing the scikit-learn packages

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(feature_matrix,
                                                    target_vector,
                                                    train_size=0.8,
                                                    random_state=7)
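To make the idea of the split concrete, here is a minimal sketch of a shuffle-and-cut split written with NumPy. It is not scikit-learn's implementation and will not reproduce the same rows as `random_state=7` above. The function name `manual_train_test_split` and the toy arrays are hypothetical.

```python
import numpy as np

def manual_train_test_split(X, y, train_size=0.8, seed=7):
    """Shuffle the row indices, then cut them into a training and a test part."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_train = int(len(X) * train_size)
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 rows, 2 features (hypothetical, not the housing data)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_tr, X_te, y_tr, y_te = manual_train_test_split(X, y)

print("Train:", X_tr.shape, y_tr.shape)
print("Test: ", X_te.shape, y_te.shape)
```

Every row lands in exactly one of the two parts, so the test set stays unseen while the model is being fitted.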

## Baseline Model

Calculate the mean of our target vector `y_train` and assign it to the variable `y_mean`.

In [None]:
y_mean = np.mean(y_train)

In [None]:
# dummy model predictions
y_pred_baseline = [y_mean] * len(y_train)

### Performance Metrics: Baseline Model

Calculate the baseline mean absolute error for your predictions in `y_pred_baseline` as compared to the true targets in `y_train`.


In [None]:
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)

print("Mean house price", round(y_mean, 2))
print("Baseline MAE:", round(mae_baseline, 2))
print("R-squared: ", round(r2_score(y_true=y_train, y_pred=y_pred_baseline), 4))
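A baseline that always predicts the training mean scores an R-squared of exactly zero on the training data, because its residual sum of squares equals the total sum of squares. The small worked example below uses hypothetical target values (not the housing prices) to show both metrics computed by hand.

```python
import numpy as np

# Hypothetical target values (not taken from the housing data)
y = np.array([200.0, 250.0, 300.0, 400.0, 350.0])

y_mean = y.mean()
y_pred = np.full_like(y, y_mean)  # the baseline predicts the mean everywhere

# Mean absolute error: average distance between truth and prediction
mae = np.abs(y - y_pred).mean()

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = ((y - y_pred) ** 2).sum()
ss_tot = ((y - y_mean) ** 2).sum()
r2 = 1 - ss_res / ss_tot

print("Mean:", y_mean)
print("MAE:", mae)
print("R-squared:", r2)
```

Any model worth keeping should therefore reach an R-squared above zero and an MAE below the baseline's.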

### Visual representation of the baseline model

In [None]:
sns.scatterplot(x=X_train['Avg. Area House Age'], y=y_train, color='yellow', edgecolor='black', s=50)
plt.plot(X_train['Avg. Area House Age'], y_pred_baseline, color='red', label='Dummy Model')

plt.legend()
plt.show()

## Simple Linear Regression Model

In [None]:
# Instantiate the model
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

In [None]:
y_pred_lm = model.predict(X_train)

### Performance Metrics: Simple Linear Regression Model

In [None]:
mae_lm = mean_absolute_error(y_train, y_pred_lm)

print("Mean house price", round(y_mean, 2))
print("Training MAE:", round(mae_lm, 2))
print("R-squared: ", round(r2_score(y_true=y_train, y_pred=y_pred_lm), 4))

It looks like our model performs a little better than the baseline. 🎉🎉

Now let's check our test performance. Remember, once we test our model, there's no more iteration allowed. 

### Visual representation of the Simple Linear Regression Model

In [None]:
sns.scatterplot(x=X_train['Avg. Area House Age'], y=y_train, color='yellow', edgecolor='black', s=50)
plt.plot(X_train['Avg. Area House Age'], y_pred_lm, color='red', label='Linear Model')
# plt.plot(X_train['Avg. Area House Age'], y_pred_baseline, color='blue', label='Dummy Model')


plt.legend()
plt.show()

# Communicate Results

Let's take a look at the equation our model has come up with for predicting `Price` based on `Avg. Area House Age`.

<center><img src="../images/proj-2.005_single.png" alt="Equation: y = beta 0 + beta 1 * x" style="width: 400px;"/></center> 

In [None]:
intercept    = model.intercept_
coefficients = model.coef_

print(
    f"Price = {np.round(intercept, 2)} + ({np.round(coefficients[0], 2)} * Avg. Area House Age) "
)
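For a single predictor, the least-squares intercept and slope have a closed form: the slope is the covariance of `x` and `y` divided by the variance of `x`, and the intercept is `mean(y) - slope * mean(x)`. The sketch below applies these formulas to hypothetical points that lie exactly on a known line. The names `beta_0` and `beta_1` follow the equation above and are not variables from this notebook.

```python
import numpy as np

# Hypothetical points lying exactly on y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x

# slope = covariance(x, y) / variance(x)
beta_1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()

# intercept = mean(y) - slope * mean(x)
beta_0 = y.mean() - beta_1 * x.mean()

print(f"y = {beta_0:.2f} + ({beta_1:.2f} * x)")
```

Because the points are noise-free, the formulas recover the intercept of 1 and the slope of 2 used to generate them.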

## Model Deployment

In [None]:
from ipywidgets import Dropdown, FloatSlider, IntSlider, interact

In [None]:
def make_prediction(house_age):
    data = {
        "Avg. Area House Age" : house_age
    }
    
    df = pd.DataFrame(data, index=[0])
    prediction = model.predict(df).round(2)[0]
    
    return f"Predicted house price: ~ ${prediction}"

In [None]:
make_prediction(5)

In [None]:
interact(
    make_prediction,
    house_age=IntSlider(
        min=0,
        max=50,
        value=X_train["Avg. Area House Age"].mean()
    )
);

# Prepare Data

## Specifying our 3D dataset

In [None]:
df_3d = df[['Avg. Area House Age', 'Avg. Area Number of Rooms', 'Price']]

In [None]:
df_3d.describe()

Examining the distribution of our data features

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
ax_flat = ax.flatten()

sns.kdeplot(data=df_3d, x='Avg. Area Number of Rooms', color='yellow', fill=True, ax=ax_flat[0])
sns.kdeplot(data=df_3d, x='Avg. Area House Age'      , color='orange', fill=True, ax=ax_flat[1])
sns.kdeplot(data=df_3d, x='Price'                    , color='blue'  , fill=True, ax=ax_flat[2])
    
plt.show()

## Explore: Visual analytics methods

In [None]:
import plotly.express as px
import plotly.graph_objects as go

In [None]:
# Create 3D scatter plot
fig = px.scatter_3d(
    data_frame=df_3d,
    x='Avg. Area House Age',
    y='Avg. Area Number of Rooms',
    z='Price',
    color_discrete_sequence=['yellow'],
    width=700,
    height=600,
)

# Refine formatting
fig.update_traces(
    marker={"size": 4, "line": {"width": 2, "color": "DarkSlateGrey"}},
    selector={"mode": "markers"},
)

# Display figure
fig.show()

## Splitting our data

In [None]:
feature_matrix_3d = df[['Avg. Area Number of Rooms', 'Avg. Area House Age']]
target_vector_3d = df['Price']

print(f"Feature Matrix: {feature_matrix_3d.shape}",
      f"Target Vector: {target_vector_3d.shape}",
      sep='\n')

# Building the model

Train-Test Split

In [None]:
X_train_3d, X_test_3d, y_train_3d, y_test_3d = train_test_split(feature_matrix_3d,
                                                    target_vector_3d,
                                                    train_size=0.8,
                                                    random_state=7)

## Baseline Model

In [None]:
y_mean_3d = np.mean(y_train_3d)

# dummy model predictions
y_pred_baseline_3d = [y_mean_3d] * len(y_train_3d)

### Performance Metrics: Baseline Model

In [None]:
mae_baseline_3d = mean_absolute_error(y_train_3d, y_pred_baseline_3d)

print("Mean house price", round(y_mean_3d, 2))
print("Baseline MAE:", round(mae_baseline_3d, 2))
print("R-squared: ", round(r2_score(y_true=y_train_3d, y_pred=y_pred_baseline_3d), 4))

### Visual representation of the baseline model

In [None]:
# Create 3D scatter plot
fig = px.scatter_3d(
    data_frame=df_3d,
    x='Avg. Area House Age',
    y='Avg. Area Number of Rooms',
    z='Price',
    color_discrete_sequence=['yellow'],
    width=700,
    height=600,
)

# Create x and y coordinates for model representation
x_plane = np.linspace(df_3d["Avg. Area House Age"].min(), df_3d["Avg. Area House Age"].max(), 10)
y_plane = np.linspace(df_3d["Avg. Area Number of Rooms"].min(), df_3d["Avg. Area Number of Rooms"].max(), 10)

xx, yy = np.meshgrid(x_plane, y_plane)

# z coordinates: the baseline predicts the same mean price at every grid point
zz = np.full(xx.shape, y_pred_baseline_3d[0])

# Add plane to figure
fig.add_trace(go.Surface(x=xx, y=yy, z=zz))


# Refine formatting
fig.update_traces(
    marker={"size": 4, "line": {"width": 2, "color": "DarkSlateGrey"}},
    selector={"mode": "markers"},
)

# Display figure
fig.show()

## Multiple Linear Regression Model

In [None]:
# Instantiate the model
model_3d = LinearRegression()

# Fit the model
model_3d.fit(X_train_3d, y_train_3d)

In [None]:
y_pred_lm_3d = model_3d.predict(X_train_3d)

### Performance Metrics: Multiple Linear Regression Model

In [None]:
mae_lm_3d = mean_absolute_error(y_train_3d, y_pred_lm_3d)

print("Mean house price", round(y_mean_3d, 2))
print("Training MAE:", round(mae_lm_3d, 2))
print("R-squared: ", round(r2_score(y_true=y_train_3d, y_pred=y_pred_lm_3d), 4))

It looks like our model performs a little better than the baseline. 🎉

Now let's check our test performance. Remember, once we test our model, there's no more iteration allowed. 

### Visual representation of the Multiple Linear Regression Model

In [None]:
# Create 3D scatter plot
fig = px.scatter_3d(
    data_frame=df_3d,
    x='Avg. Area House Age',
    y='Avg. Area Number of Rooms',
    z='Price',
    color_discrete_sequence=['yellow'],
    width=700,
    height=600,
)

# Create x and y coordinates for model representation
x_plane = np.linspace(df_3d["Avg. Area House Age"].min(), df_3d["Avg. Area House Age"].max(), 10)
y_plane = np.linspace(df_3d["Avg. Area Number of Rooms"].min(), df_3d["Avg. Area Number of Rooms"].max(), 10)

xx, yy = np.meshgrid(x_plane, y_plane)

# Use model to predict z coordinates at every (x, y) grid point
grid = pd.DataFrame({"Avg. Area Number of Rooms": yy.ravel(), "Avg. Area House Age": xx.ravel()})
zz = model_3d.predict(grid).reshape(xx.shape)

# Add plane to figure
fig.add_trace(go.Surface(x=xx, y=yy, z=zz))


# Refine formatting
fig.update_traces(
    marker={"size": 4, "line": {"width": 2, "color": "DarkSlateGrey"}},
    selector={"mode": "markers"},
)

# Display figure
fig.show()
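A surface drawn over a mesh grid needs one z value for every (x, y) grid point. A common pattern is to flatten the grid into one row per point, predict, and reshape the result back to the grid's shape. The sketch below illustrates that pattern with hypothetical plane coefficients `b0`, `b1`, `b2` standing in for a fitted model.

```python
import numpy as np

# Hypothetical plane coefficients: z = b0 + b1 * x + b2 * y
b0, b1, b2 = 1.0, 2.0, 3.0

x_plane = np.linspace(0, 1, 4)
y_plane = np.linspace(0, 2, 5)
xx, yy = np.meshgrid(x_plane, y_plane)  # both arrays have shape (5, 4)

# Flatten the grid into one row per (x, y) pair, predict, then restore the grid shape
points = np.column_stack([xx.ravel(), yy.ravel()])
z_flat = b0 + points @ np.array([b1, b2])
zz = z_flat.reshape(xx.shape)

print("Grid shape:   ", xx.shape)
print("Surface shape:", zz.shape)
```

The resulting `zz` has the same shape as `xx` and `yy`, which is what a surface trace expects.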

# Communicate Results

Let's take a look at the equation our model has come up with for predicting `Price` based on `Avg. Area House Age` and `Avg. Area Number of Rooms`.

<center><img src="../images/proj-2.005_3d.png" alt="Equation: y = beta 0 + beta 1 * x 1 + beta 2 * x 2" style="width: 400px;"/></center>

In [None]:
intercept    = model_3d.intercept_
coefficients = model_3d.coef_

print(
    f"Price = {np.round(intercept, 2)} "
    f"+ ({np.round(coefficients[0], 2)} * Avg. Area Number of Rooms) "
    f"+ ({np.round(coefficients[1], 2)} * Avg. Area House Age) "
)
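With several predictors, the coefficients solve a least-squares problem on a design matrix whose first column is all ones (for the intercept). The sketch below solves that problem with `np.linalg.lstsq` on hypothetical, noise-free data. The names `x1`, `x2` and the coefficient values are assumptions chosen for illustration, not results from the housing data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predictors and a noise-free target: y = 5 + 3 * x1 - 2 * x2
x1 = rng.uniform(0, 10, size=50)
x2 = rng.uniform(0, 10, size=50)
y = 5.0 + 3.0 * x1 - 2.0 * x2

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution of X @ beta ≈ y
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_0, beta_1, beta_2 = beta

print(f"y = {beta_0:.2f} + ({beta_1:.2f} * x1) + ({beta_2:.2f} * x2)")
```

Each coefficient describes the change in the target for a one-unit change in its predictor while the other predictor is held fixed.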

## Model Deployment

In [None]:
def make_prediction_3d(house_rooms, house_age):
    data = {
        "Avg. Area Number of Rooms" : house_rooms,
        "Avg. Area House Age" : house_age,
    }
    
    df = pd.DataFrame(data, index=[0])
    prediction = model_3d.predict(df).round(2)[0]
    
    return f"Predicted house price: ~ ${prediction}"

In [None]:
make_prediction_3d(3, 5)

In [None]:
interact(
    make_prediction_3d,
    house_age=IntSlider(
        min=0,
        max=50,
        value=X_train_3d["Avg. Area House Age"].mean()
    ),
    house_rooms=IntSlider(
        min=0,
        max=10,
        value=3
    )
);

# Assessing model accuracy

In [None]:
# dictionary of results
results_dict = {'Training RMSE':
                    {
                        "SLR": np.sqrt(mean_squared_error(y_train, model.predict(X_train))),
                        "MLR": np.sqrt(mean_squared_error(y_train_3d, model_3d.predict(X_train_3d))),
                        "Diff": (np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
                                - np.sqrt(mean_squared_error(y_train_3d, model_3d.predict(X_train_3d))))
                    },
                'Test RMSE':
                    {
                        "SLR": np.sqrt(mean_squared_error(y_test, model.predict(X_test))),
                        "MLR": np.sqrt(mean_squared_error(y_test_3d, model_3d.predict(X_test_3d))),
                        "Diff": (np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
                                - np.sqrt(mean_squared_error(y_test_3d, model_3d.predict(X_test_3d))))
                    },
                'Training MAE':
                    {
                        "SLR": mean_absolute_error(y_train, model.predict(X_train)),
                        "MLR": mean_absolute_error(y_train_3d, model_3d.predict(X_train_3d)),
                        "Diff": (mean_absolute_error(y_train, model.predict(X_train))
                                - mean_absolute_error(y_train_3d, model_3d.predict(X_train_3d)))
                    },
                'Test MAE':
                    {
                        "SLR": mean_absolute_error(y_test, model.predict(X_test)),
                        "MLR": mean_absolute_error(y_test_3d, model_3d.predict(X_test_3d)),
                        "Diff": (mean_absolute_error(y_test, model.predict(X_test))
                                - mean_absolute_error(y_test_3d, model_3d.predict(X_test_3d)))
                    },
                'R-squared':
                    {
                        "SLR": round(r2_score(y_true=y_train, y_pred=y_pred_lm), 4),
                        "MLR": round(r2_score(y_true=y_train_3d, y_pred=y_pred_lm_3d), 4),
                        "Diff": ( - round(r2_score(y_true=y_train, y_pred=y_pred_lm), 4)
                                 + round(r2_score(y_true=y_train_3d, y_pred=y_pred_lm_3d), 4))
                    }
                }

In [None]:
results_df = pd.DataFrame(data=results_dict)
results_df

It appears that adding more predictors to our model helped make it a little bit better!
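The table reports both RMSE and MAE because they weigh errors differently: RMSE squares each error before averaging, so a few large misses raise it more than they raise MAE. The sketch below shows this with hypothetical predictions. The helper functions `mae` and `rmse` are written out here for illustration and are not the scikit-learn functions used above.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Hypothetical values: same total absolute error, distributed differently
y_true = np.array([100.0, 100.0, 100.0, 100.0])
even_errors = np.array([110.0, 90.0, 110.0, 90.0])      # four errors of size 10
one_big_error = np.array([100.0, 100.0, 100.0, 140.0])  # one error of size 40

print("Even errors   -> MAE:", mae(y_true, even_errors), "RMSE:", rmse(y_true, even_errors))
print("One big error -> MAE:", mae(y_true, one_big_error), "RMSE:", rmse(y_true, one_big_error))
```

Both prediction sets have the same MAE, but the RMSE doubles when the error is concentrated in a single point. A large gap between the two metrics hints at a few poorly predicted observations.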

Are these metrics enough to assess the quality of our model?
What can be done to make our

# **“All Models are wrong, but some are useful.”**
***George Box***