# Solution Seekers Group

Lead of the Study Group Discussion: **Badr Bensassi**

Author: **Youssef Laouina**


> **“All models are wrong, but some are useful.”**

*George Box*

# Prepare Data

## Importing our libraries

In [None]:
import warnings

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.model_selection import train_test_split

warnings.simplefilter(action="ignore", category=FutureWarning)

## Importing our data into a Pandas DataFrame

In [None]:
df = pd.read_csv(
    'https://raw.githubusercontent.com/tirthajyoti/Machine-Learning-with-Python/master/Datasets/USA_Housing.csv', index_col=False)

In [None]:
df.info()

In [None]:
df.head()

# Residuals Analysis: 2D dataset

In [None]:
df_2d = df[['Avg. Area House Age', 'Price']]

feature_matrix = df_2d[['Avg. Area House Age']]
target_vector = df_2d['Price']

print(f"Feature Matrix: {feature_matrix.shape}",
      f"Target Vector: {target_vector.shape}",
      sep='\n')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(feature_matrix,
                                                    target_vector,
                                                    train_size=0.8,
                                                    random_state=7)

## Restoring the model with Pickle


In [None]:
# restore the model just like we would read a file
model_load_path = "slr_model.pkl"

with open(model_load_path,'rb') as file:
    unpickled_model = pickle.load(file)
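
The cell above assumes that `slr_model.pkl` already exists on disk, presumably saved from an earlier session. As a minimal sketch of the full save-and-restore round trip, the example below fits a scikit-learn `LinearRegression` on synthetic data (not the housing dataset), pickles it to a temporary file, and loads it back. The names `demo_model`, `restored_model`, and `demo_model.pkl` are hypothetical, and the assumption that the original model was a `LinearRegression` is ours.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic single-feature data (illustrative only, not the housing dataset)
rng = np.random.default_rng(7)
X_demo = rng.uniform(2, 10, size=(50, 1))
y_demo = 150_000 * X_demo[:, 0] + 200_000 + rng.normal(0, 50_000, size=50)

# Fit a simple linear regression (assumed model type)
demo_model = LinearRegression().fit(X_demo, y_demo)

# Save the fitted model to a temporary file, then restore it
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(demo_path, "wb") as file:
        pickle.dump(demo_model, file)
    with open(demo_path, "rb") as file:
        restored_model = pickle.load(file)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(
    demo_model.predict(X_demo), restored_model.predict(X_demo)
)
print("Predictions match after round trip:", same_predictions)
```

Because unpickling can execute arbitrary code, this pattern is best kept to model files you created yourself.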

### Using the model file to make predictions


In [None]:
# get predictions from unpickled model
y_pred_lm_train = unpickled_model.predict(X_train)

**Let's check how well the line fits** by measuring how far each observed y-value is from the corresponding predicted value on the line.

Each of these differences is a residual, computed for observation $i$ as:

$$e_i = y_i - \hat{y}_i$$
   

In [None]:
residuals = y_train - y_pred_lm_train
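
As a quick sanity check on the residual formula, the sketch below uses synthetic data and `np.polyfit` as a stand-in for the pickled model. For an ordinary least-squares line with an intercept, the training residuals average to zero and are uncorrelated with the feature. Whether the same holds for `unpickled_model` depends on how it was fitted, which this notebook does not show.

```python
import numpy as np

# Synthetic data: a noisy straight line (illustrative only)
rng = np.random.default_rng(0)
x_demo = rng.uniform(2, 10, size=200)
y_demo = 3.0 * x_demo + 5.0 + rng.normal(0, 1.0, size=200)

# Least-squares line with an intercept
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)
y_hat_demo = slope * x_demo + intercept

# e_i = y_i - y_hat_i
residuals_demo = y_demo - y_hat_demo

# Both quantities should be zero up to floating-point error
mean_residual = residuals_demo.mean()
residual_dot_x = np.dot(residuals_demo, x_demo)
print("Mean residual:", mean_residual)
print("Residuals . x:", residual_dot_x)
```

Running `residuals.mean()` on the notebook's own residuals is a similar check: a value far from zero relative to the price scale would suggest the model was not fitted on this training split.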

## Visualizing Residuals

In [None]:
sns.histplot(residuals, kde=True, color='orange', fill=True)
plt.title("Residuals' Distribution")
plt.xlabel('Residuals')
plt.show()

### Fitted vs. residuals plot to check homoscedasticity

In [None]:
# each point is one training observation: fitted value on x, residual on y
sns.scatterplot(x=y_pred_lm_train, y=residuals, color='yellow', edgecolor='black', s=80, label='Training residuals')
plt.axhline(0, color='red')
plt.title('Residuals vs. Fitted values')
plt.ylabel('Residuals')
plt.xlabel('Fitted Values')

plt.legend()
plt.show()
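
The plot is judged by eye, so a rough numeric companion can help. The sketch below sorts observations by fitted value, splits them in half, and compares the residual variance of the two halves. A ratio near 1 is consistent with homoscedasticity; a ratio far from 1 suggests the spread changes with the fitted value. This is an informal check under our own assumptions, not a formal statistical test, and `split_variance_ratio` is a hypothetical helper, not a library function.

```python
import numpy as np

def split_variance_ratio(fitted, resid):
    """Residual variance in the upper half of fitted values divided by the lower half."""
    fitted = np.asarray(fitted, dtype=float)
    resid = np.asarray(resid, dtype=float)
    order = np.argsort(fitted)          # indices that sort by fitted value
    half = len(order) // 2
    lower = resid[order[:half]]         # residuals at small fitted values
    upper = resid[order[half:]]         # residuals at large fitted values
    return upper.var(ddof=1) / lower.var(ddof=1)

# Synthetic demonstration (illustrative only)
rng = np.random.default_rng(1)
fitted_demo = rng.uniform(0, 10, size=2000)
constant_noise = rng.normal(0, 1.0, size=2000)
growing_noise = rng.normal(0, 1.0, size=2000) * (0.2 + fitted_demo / 5)

ratio_constant = split_variance_ratio(fitted_demo, constant_noise)
ratio_growing = split_variance_ratio(fitted_demo, growing_noise)
print("Constant spread ratio:", ratio_constant)
print("Growing spread ratio:", ratio_growing)
```

In this notebook the helper could be called as `split_variance_ratio(y_pred_lm_train, residuals)`.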

### Residual Sum of Squares

Next, we measure the overall error of the fit with the **Residual Sum of Squares** (RSS):
   
$$RSS = \sum_{i=1}^n(y_i-\hat{y}_i)^2$$

In [None]:
print("Residual sum of squares:", (residuals ** 2).sum())

**Note:** The RSS depends on the scale of the target variable. When the target takes large values, for example in the millions or billions, the residuals tend to be large too, and squaring them makes the RSS larger still. A large RSS therefore does not by itself indicate a poor fit.
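
To make the scale effect concrete, the sketch below computes the RSS for four made-up prices and predictions, first in dollars and then in thousands of dollars. Dividing the target by 1,000 divides the RSS by 1,000,000 even though the fit is identical. The values and the helper `rss` are hypothetical and chosen only for easy arithmetic.

```python
import numpy as np

# Made-up prices and predictions in dollars (illustrative only)
y_true_demo = np.array([250_000.0, 310_000.0, 190_000.0, 420_000.0])
y_pred_demo = np.array([260_000.0, 300_000.0, 200_000.0, 400_000.0])

def rss(y_true, y_pred):
    """Residual sum of squares: sum over i of (y_i - y_hat_i)^2."""
    return float(np.sum((y_true - y_pred) ** 2))

# Same fit, two different units for the target
rss_dollars = rss(y_true_demo, y_pred_demo)
rss_thousands = rss(y_true_demo / 1_000, y_pred_demo / 1_000)

# Root mean squared error is expressed in the units of the target
rmse_dollars = np.sqrt(rss_dollars / len(y_true_demo))

print("RSS in dollars^2:", rss_dollars)
print("RSS in (thousand dollars)^2:", rss_thousands)
print("Ratio:", rss_dollars / rss_thousands)
print("RMSE in dollars:", rmse_dollars)
```

Dividing the RSS by the number of observations and taking the square root gives an error in the same units as the price, which is often easier to interpret than the raw RSS.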