# Machine Learning: Training and evaluating a Regression model: An example
---
## Contents
1. Introduction
2. Setup
   1. Import required Python modules
   2. Load the dataset into a Pandas dataframe
3. Explore the dataset
4. Prepare the dataset
   1. Separate the dataset into `X` and `y` sets
   2. Apply required data engineering (one-hot encoding)
   3. Split the data into training and test datasets
5. Train the `ridge regression` model
6. Evaluate the model
7. Assignment
---
## Introduction
This notebook provides an example of training and evaluating a regression model.

The dataset contains data on 50 startups located in New York, California and Florida, split roughly evenly across the three states (about 17 per state). The variables in the dataset are Profit, R&D spending, Administration spending, Marketing spending, and State (the location of the business).

This is a publicly available dataset from Kaggle: https://www.kaggle.com/datasets/farhanmd29/50-startups

Our ML training objective is to predict the amount of profit (the dependent variable, `y`) from the independent variables (`X`): R&D spending, Administration spending, Marketing spending and the location (state) of the business.


## Download content from Amazon S3

In [None]:
# Create the data directory if it does not already exist
import os

target_path = "./data"

# exist_ok=True makes this a no-op when the directory is already there
os.makedirs(target_path, exist_ok=True)

Read in the essential static variables that are shared across notebooks from the store. These values are set in notebook 00.
The variables include a reference to the location of the data on Amazon S3.

In [None]:
%store -r

In [None]:
s3_path_to_data = f"{djl_mme_sklearn_data}{startups_test_data_csv}"
s3_path_to_data

The following line copies the data to the local folder that we just created. After this, the file can be accessed and used directly.

In [None]:
!aws s3 cp $s3_path_to_data $target_path

---
## Setup
Import the Python modules that we need for the model training and evaluation process.

The following set of imports is fairly typical for training a statistical model with scikit-learn.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score

Load the dataset that we are going to use for training and testing.

The Kaggle dataset described above, with information about 50 startups, has been downloaded to a local directory (`data` in this case).

The next step is to read the CSV file into a Pandas dataframe so that we can easily explore its content and get the dataset ready for model training.

The dataframe `head` method displays the first few rows of the dataframe.

In [None]:
startups_df = pd.read_csv('data/50_Startups.csv')
startups_df.head()

---
## Explore the dataset 


Print a concise summary of the DataFrame.
The `info` method prints information about a DataFrame, including the index dtype, the column dtypes, the non-null counts and the memory usage.

In [None]:
startups_df.info()

Take a look at the statistics of the numerical fields.

In [None]:
startups_df.describe()

Take a look at the shape of the dataset (rows, columns)

In [None]:
startups_df.shape

---
## Prepare the dataset for training

1. Separate the data into the `independent variables`, aka `X`, aka the `observations`, and the `dependent variable`, aka `y`, aka the target `label`
2. Apply `feature engineering`. In this example, this is limited to `one-hot encoding` of the `State` variable.
3. Check the `cross-correlation` of the `independent variables`
4. Split the data into `training` and `test` datasets


In [None]:
X = startups_df.iloc[:, :-1]    # All but the last column are the observations (aka independent variables)
y = startups_df.iloc[:, -1]     # The last column is the dependent variable that we want to predict
X.head()

Categorical data cannot be used directly for regression and needs to be transformed into numeric data. The solution is to use dummy variables: for each category we create a variable that takes one of two values, zero or one.

In [None]:
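# Create an instance of the one-hot encoder
enc = OneHotEncoder()

# Encode State; take the column names from the fitted encoder (sorted category
# order) and reuse X's index so that the join below lines up row by row
enc_df = pd.DataFrame(
    enc.fit_transform(X[['State']]).toarray(),
    columns=enc.categories_[0],
    index=X.index,
)
# Join the encoded columns to the main dataframe on the index
X = X.join(enc_df)
X.head()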
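
To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy dataframe. The frame, its values and the `Spend` column are made up for illustration, and it uses `pd.get_dummies` as a shorter alternative to `OneHotEncoder`, not the method used in this notebook.

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration)
toy = pd.DataFrame({
    "Spend": [100.0, 200.0, 300.0],
    "State": ["New York", "California", "Florida"],
})

# One 0/1 column per distinct State value; columns come out in sorted order
dummies = pd.get_dummies(toy["State"], dtype=int)

# Replace the string column with its encoded columns
toy_encoded = toy.drop(columns="State").join(dummies)
print(toy_encoded)
```

Each row has exactly one `1` among the three state columns, which is what makes the encoded variables mutually exclusive.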

At this point we drop the `State` variable from `X`. A variable of this datatype (string) cannot be in the dataset when it is given to the ML algorithm.

In [None]:
X = X.drop('State', axis=1)
X.head()

Many ML algorithms assume that the independent variables really are independent, that is, that they do not depend on each other.
This can be explored by looking at the cross-correlation of the variables.
Some negative correlation between the one-hot-encoded variables is expected, because they are binary and mutually exclusive: when one is 1, the others must be 0.
This is not an issue that we are concerned about for this example.
The following cells show the correlation coefficients as a table and as a heatmap.
The coefficients show a high correlation between R&D Spend and Marketing Spend; however, for this example we may reasonably assume that neither of these two is determined by the other.

In [None]:
X.corr()

In [None]:
sns.heatmap(X.corr(), annot=True)
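
The expected correlation between one-hot-encoded columns can be checked on a small hypothetical example. The sketch below assumes three equally frequent categories (two made-up rows each); in that case every pair of dummy columns has a Pearson correlation of -1/(k-1) = -0.5 for k = 3. The real dataset will give slightly different values if the states are not perfectly balanced.

```python
import pandas as pd

# Six hypothetical rows, two per category, already one-hot encoded
dummies = pd.DataFrame({
    "California": [1, 1, 0, 0, 0, 0],
    "Florida":    [0, 0, 1, 1, 0, 0],
    "New York":   [0, 0, 0, 0, 1, 1],
})

# Pairwise Pearson correlation, the same call used on X above
corr = dummies.corr()
print(corr.round(2))
```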

In [None]:
# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
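
As a rough illustration of what this split produces, the sketch below applies the same `test_size` and `random_state` to stand-in arrays with 50 rows (the arrays and the `_demo` names are invented here). With 50 rows, 20% corresponds to 10 test rows and 40 training rows, and fixing `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the startups dataset (50)
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed selects the same test rows
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te_again))
```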

---
## Train the model

1. Instantiate the model (in this case we'll use Ridge regression, a linear regression with L2 regularization)
2. Train the model, using the `fit` method

In [None]:
# Fit (train) the model
regressor = Ridge()  # Instantiate the Ridge regression model
regressor.fit(X_train, y_train)  # Fit the model to the training data
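
To give a feel for what the regularization does, here is a simplified sketch of the ridge objective, min ||y - Xw||² + alpha·||w||², solved in closed form with NumPy. It is an illustration under stated assumptions, not scikit-learn's implementation: it omits the intercept that `Ridge` fits by default, and the data and the helper `ridge_closed_form` are invented for this example. `Ridge()` uses `alpha=1.0` unless told otherwise.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min_w ||y - Xw||^2 + alpha * ||w||^2 (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data: a known linear relationship plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=40)

# A larger alpha shrinks the coefficient vector towards zero
norms = []
for alpha in (0.0, 1.0, 100.0):
    w = ridge_closed_form(X_demo, y_demo, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(alpha, np.round(w, 3), round(norms[-1], 3))
```

With `alpha=0.0` this reduces to ordinary least squares. Because the penalty acts on the size of the coefficients, its effect depends on the scale of each feature.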

---
## Evaluate the model

1. Run inference on the test data set and get the predictions
2. Compare the predictions to the target values for the test dataset

In [None]:
# Run the predictions
y_pred = regressor.predict(X_test)

In [None]:
# Assess the model with the R^2 metric
score = r2_score(y_test, y_pred)
print(f'R2 Model Score: {score:0.4}')

In [None]:
# Assess the model with the Root Mean Squared Error metric
score = root_mean_squared_error(y_test, y_pred)
print(f'Root Mean Squared Error Model Score: {score:0.6}')

In [None]:
# Assess the model with the Mean Absolute Error metric
score = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error Model Score: {score:0.6}')
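
For reference, the three metrics above can be computed by hand. The sketch below uses four made-up actual and predicted values, chosen so that the arithmetic is easy to follow; it follows the standard definitions of these metrics and is not output from this model.

```python
import math

# Hypothetical actual and predicted values, chosen for easy arithmetic
y_true = [100.0, 200.0, 300.0, 400.0]
y_hat = [110.0, 190.0, 320.0, 380.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_hat)]  # -10, 10, -20, 20

# Mean Absolute Error: average size of the errors
mae = sum(abs(e) for e in errors) / n

# Root Mean Squared Error: square root of the average squared error
rmse = math.sqrt(sum(e * e for e in errors) / n)

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
# prints: MAE=15.00 RMSE=15.81 R2=0.98
```

MAE and RMSE are in the same units as the target (here, profit), so lower is better. R² is unitless, and a value close to 1 means the predictions explain most of the variance in the target.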

Compare the actual values and predicted values

In [None]:
# Calculate the residuals
residuals = y_test - y_pred
print('Residuals: ', residuals)

In [None]:
# Plot the predicted values against the actual labels; the closer the points are to the diagonal, the smaller the residuals
sns.scatterplot(x=y_test, y=y_pred, s=140)
plt.xlabel('y_test data')
plt.ylabel('Predictions')

### Evaluation Conclusion
Given the small size of the dataset, in terms of both the number of observations and the number of features, the model fits the test data fairly well, as indicated by the R² score and the error metrics above.

With model results such as these, we might check in with the project's business lead to discuss our findings and next steps.