# Linear Regression

---

In [None]:
!pip install numpy

In [None]:
!pip install pandas

In [None]:
!pip install matplotlib

In [None]:
!pip install seaborn

## 2. Import the required libraries

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

## 3. Load Data

In [None]:
# Load data
train_data = pd.read_csv('data/myfileCleaned.csv')
# Extract the target variable (the features are selected later, in the modeling section)
y = train_data['y']

In [None]:
train_data

## 6. Linear Regression
**Preparing some tools :**

**Evaluating the model with the most common regression metrics :**

https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

**Mean Absolute Error (MAE) :** 
\begin{align*} 
MAE = \frac{1}{n} \sum_{i=1}^n |y_i-\hat{y}_i|
\end{align*}


**Residual Sum of Squares (RSS) :** 

\begin{align*} 
 RSS = \sum_{i=1}^n(y_i-\hat{y}_i)^2
\end{align*}

**Mean Squared Error (MSE) :** 
\begin{align*} 
  MSE = \frac{1}{n} \sum_{i=1}^n(y_i-\hat{y}_i)^2
\end{align*}


**Root Mean Squared Error (RMSE) :** 
\begin{align*} 
  RMSE = \sqrt {\frac{1}{n} \sum_{i=1}^n(y_i-\hat{y}_i)^2 }
\end{align*}


* Comparing these metrics :
    * **MAE** is the easiest to understand, because it is the average absolute error
    * **MSE** is more popular than MAE, because squaring the errors "punishes" larger errors more heavily, which tends to be useful in the real world
    * **RMSE** is even more popular than MSE, because taking the square root makes it interpretable in the "y" units
* All of these are **loss functions**, because we want to minimize them
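
To make the formulas concrete, here is a minimal sketch that computes the four metrics by hand with NumPy. The arrays `y_true` and `y_pred` are hypothetical values chosen only for illustration; they are not taken from the dataset used in this notebook.

```python
import numpy as np

# Hypothetical true values and predictions, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

errors = y_true - y_pred        # residuals: [1, 0, -1, -3]

mae = np.mean(np.abs(errors))   # average absolute error
rss = np.sum(errors ** 2)       # sum of squared residuals
mse = rss / len(y_true)         # RSS averaged over the n samples
rmse = np.sqrt(mse)             # square root brings us back to the units of y

print(f"MAE  = {mae:.4f}")      # 1.2500
print(f"RSS  = {rss:.4f}")      # 11.0000
print(f"MSE  = {mse:.4f}")      # 2.7500
print(f"RMSE = {rmse:.4f}")     # 1.6583
```

The single error of 3 contributes 9 of the 11 units of RSS, which is why MSE and RMSE react more strongly to large errors than MAE does.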

**R^2 :** 
* aka : coefficient of determination
* if the model only predicts the mean of the targets, the R^2 value is 0
  ==> the model is poor
* if the model is perfectly predicting the targets, R^2 value would be 1
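
The two boundary cases can be checked numerically. The sketch below assumes the common definition R^2 = 1 - RSS / TSS, where TSS is the sum of squared deviations of the targets from their mean. The helper names `r_squared` and `explained_variance` and the array `y_demo` are hypothetical. The second helper is included because explained variance is a related metric that is easy to confuse with R^2: it ignores a constant offset in the predictions, while R^2 does not.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS (TSS: squared deviations of y from its mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - rss / tss

def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y); a constant offset does not change it."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    return 1.0 - np.var(residuals) / np.var(y_true)

# Hypothetical targets, chosen only for illustration
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r2_mean = r_squared(y_demo, np.full_like(y_demo, y_demo.mean()))  # always predict the mean
r2_perfect = r_squared(y_demo, y_demo)                            # perfect predictions
r2_shifted = r_squared(y_demo, y_demo + 1.0)                      # every prediction off by +1
ev_shifted = explained_variance(y_demo, y_demo + 1.0)

print(r2_mean)     # 0.0
print(r2_perfect)  # 1.0
print(r2_shifted)  # 0.5
print(ev_shifted)  # 1.0
```

The last two lines show the difference: predictions that are all off by the same amount lower R^2 but leave the explained variance at 1.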

In [None]:
!pip install scikit-learn

In [None]:
# Create evaluation functions 
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Create function to get metrics from predicted values vs true values
def get_scores(predictions, y):
    """
    Calculates between predictions and true labels: 
        - mae,
        - mse,
        - rmse
        - r2
    """
    mae = mean_absolute_error(y, predictions)
    mse = mean_squared_error(y, predictions)
    rmse = np.sqrt(mse)
    # r2_score is the coefficient of determination; explained_variance_score
    # differs from it whenever the residuals have a non-zero mean
    r2 = r2_score(y, predictions)
    scores = {"MAE": mae,
              "MSE": mse,
              "RMSE": rmse,
              "R2": r2
              }
    return scores

def show_scores(scores):
    """
    Shows metrics: 
        - mae,
        - mse,
        - rmse
        - r2
    """
    print(f"\tR^2 : {scores['R2']:.2f}")
    print(f"\tMAE : {scores['MAE']:.2f}")
    print(f"\tMSE : {scores['MSE']:.2f}")
    print(f"\tRMSE : {scores['RMSE']:.2f}")
    print('')

def show_scores_data_frame(scores, col_name):
    """
    Shows metrics in a data frame: 
        - mae,
        - mse,
        - rmse
        - r2
    """
    # One column named col_name, indexed by metric name
    df = pd.Series(scores, name=col_name).to_frame()
    print(df)

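def show_intercept_coefs(lr, cols):
    """
    Shows the intercept and the coefficients of a fitted linear model,
    sorted from the largest to the smallest coefficient
    """
    print(f"Intercept b_0 : {lr.intercept_}")
    print('')
    cdf = pd.DataFrame(lr.coef_, cols, columns=['Coefficient'])
    cdf = cdf.sort_values(by=['Coefficient'], ascending=False)
    print(cdf)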

In [None]:
# Work on a copy so that later changes do not modify train_data
train_data_encoded = train_data.copy()

## 7. Modeling (I)

**Splitting Dataset into the Training Set and Validation Set :**
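
As a rough illustration of what a hold-out split does, the sketch below shuffles row indices and reserves 20% of them for validation, the same proportion as `test_size=0.2` in the split cell further down. It is a simplified stand-in, not scikit-learn's implementation; `n_samples` and `val_fraction` are hypothetical names and values.

```python
import numpy as np

n_samples = 10       # hypothetical dataset size
val_fraction = 0.2   # same proportion as test_size=0.2

rng = np.random.default_rng(42)         # fixed seed so the split is reproducible
shuffled = rng.permutation(n_samples)   # row indices in random order

n_val = int(np.ceil(val_fraction * n_samples))
val_idx = shuffled[:n_val]      # rows held out for validation
train_idx = shuffled[n_val:]    # remaining rows used for fitting

print("train rows:", sorted(train_idx.tolist()))
print("val rows  :", sorted(val_idx.tolist()))
```

Every row lands in exactly one of the two sets, so the validation scores are computed on data the model has not seen during fitting.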

**Multiple Linear Regression (two features) :**
* Trying to find the `intercept` $b_0$ and the `coefficients` (slopes) $b_1$ and $b_2$ for :
\begin{align*}
\hat{y} = b_0 + b_1{x}_{1} + b_2{G_F}
\end{align*}
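
Finding $b_0$, $b_1$ and $b_2$ is an ordinary least squares problem. The sketch below solves it directly with `np.linalg.lstsq` on synthetic data. The data-generating values (10, 3, 10), the noise level, and the assumption that `G_F` is a 0/1 indicator are all illustrative choices, not facts about the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data (an assumption for illustration, not the notebook's dataset)
x = rng.uniform(0, 10, size=n)
g_f = rng.integers(0, 2, size=n)     # assumed binary indicator
noise = rng.normal(0, 1, size=n)
y_synth = 10 + 3 * x + 10 * g_f + noise

# Design matrix: a column of ones for the intercept b_0, then the two features
A = np.column_stack([np.ones(n), x, g_f])

# Least squares solution minimizes the residual sum of squares
coefs, *_ = np.linalg.lstsq(A, y_synth, rcond=None)
b0, b1, b2 = coefs

print(f"b_0 = {b0:.2f}, b_1 = {b1:.2f}, b_2 = {b2:.2f}")
```

With enough samples and moderate noise, the estimates land close to the values used to generate the data.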

In [None]:
train_data_encoded.shape

In [None]:
X = pd.DataFrame(train_data_encoded[['x', 'G_F']])
y = train_data_encoded['y']
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)

**Making Predictions on `X_train` and `X_val` :**

In [None]:
train_predictions = lr.predict(X_train)
val_predictions = lr.predict(X_val)

train_scores = get_scores(train_predictions, y_train)
val_scores = get_scores(val_predictions, y_val)

print('')
#print(f"Training :")
#show_scores(train_scores)
show_scores_data_frame(train_scores, 'Training')
print('')
#print(f"Validation :")
#show_scores(val_scores)
show_scores_data_frame(val_scores, 'Validation')
#
print('')

show_intercept_coefs(lr, X_train.columns)

* With the fitted `intercept` and `coefficients` printed above, the model
\begin{align*}
\hat{y} = b_0 + b_1{x}_{1} + b_2{G_F}
\end{align*}

  becomes :

  $$\hat{y} = 9.97 + 3.00 {x}_{1} + 10.00 {G_F}$$
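
As a quick sanity check, the fitted equation can be evaluated by hand. The sketch below plugs in $x_1 = 20$ and $G_F = 1$, the same example values used in the single-prediction cell later on. The coefficients are the rounded values shown in the equation, so the result may differ slightly from what `lr.predict` returns.

```python
# Rounded intercept and coefficients taken from the fitted equation
b0, b1, b2 = 9.97, 3.00, 10.00

# Example inputs for x_1 and G_F
x1, g_f = 20, 1

y_hat = b0 + b1 * x1 + b2 * g_f
print(f"y_hat = {y_hat:.2f}")  # 79.97
```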

**R^2 (default score): each scikit-learn estimator has a default metric, returned by `score()`; for regressors it is R^2 :**

In [None]:
r_2_on_train = lr.score(X_train, y_train) 
print(f"R^2 on Training data = {r_2_on_train}")
r_2_on_val = lr.score(X_val, y_val) 
print(f"R^2 on Validation data = {r_2_on_val}")

**Plot Predicted values vs True Values :**

In [None]:
plt.scatter(y_val, val_predictions)
plt.xlabel('True values (y_val)')
plt.ylabel('Predicted values')
plt.show()

In [None]:
# Example of predicting a single value using a new data point
# Use the same column names as in training to avoid a feature-name warning
new_data = pd.DataFrame([[20, 1]], columns=['x', 'G_F'])  # Example values for x and G_F
single_prediction = lr.predict(new_data)
print(f"Predicted value for the new data point {new_data.iloc[0].tolist()}: {single_prediction[0]}")

**Saving and Loading Models in Scikit-learn :**
- Scikit-learn models can be saved and loaded with the Python packages :
  * `Pickle` ( not covered in student guide )
  * `Joblib`
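
For completeness, here is a minimal sketch of the `joblib` route using `joblib.dump` and `joblib.load`. The tiny dataset, the variable names, and the file name `demo_lr.joblib` are hypothetical; a temporary directory is used so the example does not touch the project's `models/` folder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny hypothetical dataset following y = 1 + 2 * x
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_lr.joblib")  # hypothetical file name
    joblib.dump(demo_model, model_path)       # save the fitted model
    restored_model = joblib.load(model_path)  # load it back

# The restored model should give the same predictions as the original
same = np.allclose(restored_model.predict(X_demo), demo_model.predict(X_demo))
print(same)
```

As with `pickle`, files saved this way should only be loaded from trusted sources, because loading can execute arbitrary code.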

In [None]:
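import os
import pickle

# save the model as a pickle file
model_pkl_file = "models/lr1.pkl"
# create the target directory if it does not exist yet
os.makedirs(os.path.dirname(model_pkl_file), exist_ok=True)

with open(model_pkl_file, 'wb') as file:
    pickle.dump(lr, file)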

In [None]:
# load model from pickle file
with open(model_pkl_file, 'rb') as file:  
    loaded_model = pickle.load(file)

In [None]:
# Check the loaded model by predicting the same data point as before
new_data = pd.DataFrame([[20, 1]], columns=['x', 'G_F'])  # Example values for x and G_F
single_prediction = loaded_model.predict(new_data)
print(f"Predicted value for the new data point {new_data.iloc[0].tolist()}: {single_prediction[0]}")