### Installing Required Packages

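Before starting, we need to install the Python packages that provide the models used in this project. Random Forest, ElasticNet, LSTM, WaveNet, ARIMA, and SARIMA are model names rather than packages: they come from `scikit-learn`, `tensorflow`, and `statsmodels`, while `xgboost`, `prophet`, and `catboost` are packages in their own right. `scikit-learn` is not part of the command below, so it is assumed to be available in the environment already. To install the packages, use the following command: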


In [1]:
!pip install tensorflow xgboost statsmodels prophet catboost

Collecting catboost
  Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl (99.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.2/99.2 MB 9.1 MB/s eta 0:00:00
Installing collected packages: catboost
Successfully installed catboost-1.2.8


**Milestone #3** <br>
First, we import the required libraries.

### Importing Required Libraries

In this section, we import all the necessary libraries that are needed for data processing, model training, and evaluation.

1. **Data Manipulation and Preprocessing**:
   - `numpy`: Used for numerical operations and handling arrays.
   - `pandas`: Essential for data manipulation and reading datasets (e.g., CSV files).
   - `StandardScaler`: For scaling and normalizing data to have zero mean and unit variance.

2. **Modeling and Machine Learning Algorithms**:
   - `XGBRegressor`: The XGBoost model, which is a gradient boosting method for regression.
   - `CatBoostRegressor`: A gradient boosting model optimized for categorical features.
   - `RandomForestRegressor`: A tree-based ensemble method for regression tasks.

3. **Model Selection and Evaluation**:
   - `GridSearchCV`: Used for hyperparameter tuning by searching across a parameter grid.
   - `train_test_split`: To split the dataset into training and testing sets.
   - `mean_squared_error`, `mean_absolute_error`, `r2_score`, `explained_variance_score`, `mean_absolute_percentage_error`: Metrics to evaluate the model’s performance.

4. **Serialization**:
   - `joblib`: To save and load models after training for later use.

5. **Data Path**:
   - The dataset is located at the path `'/content/walmart_cleaned_machine.csv'`.

This setup prepares all the necessary tools for building, training, evaluating, and saving machine learning models using XGBoost, CatBoost, and Random Forest algorithms.

In [2]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, explained_variance_score, mean_absolute_percentage_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from joblib import dump, load
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor

datapath = r'/content/walmart_cleaned_machine.csv'

### Loading and Preparing the Data

In this step, we load the dataset and prepare it for time series analysis:

1. **Loading the Data**:
   - We use `pandas.read_csv()` to load the dataset from a CSV file located at `'/content/walmart_cleaned_machine.csv'`. The `parse_dates` parameter ensures that the **'date'** column is parsed as a datetime object, which is crucial for time series analysis.

2. **Previewing the Data**:
   - We use `df.head()` to preview the first few rows of the dataset, allowing us to inspect its structure and check that the data has been loaded and formatted correctly.

In [3]:
import pandas as pd
df = pd.read_csv(r'/content/walmart_cleaned_machine.csv', parse_dates=['date'])

df.head()

Unnamed: 0,Store,Dept,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Type,Size,Month,Year,WeekOfYear,Quarter,Season,IsPromoWeek,date
0,1,1,1643690.9,0,42.31,2.572,211.096358,8.106,2,151315,2,2010,5,1,0,False,2010-02-05
1,1,1,1641957.44,1,38.51,2.548,211.24217,8.106,2,151315,2,2010,6,1,0,True,2010-02-12
2,1,1,1611968.17,0,39.93,2.514,211.289143,8.106,2,151315,2,2010,7,1,0,False,2010-02-19
3,1,1,1409727.59,0,46.63,2.561,211.319643,8.106,2,151315,2,2010,8,1,0,False,2010-02-26
4,1,1,1554806.68,0,46.5,2.625,211.350143,8.106,2,151315,3,2010,9,1,1,False,2010-03-05
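
A quick way to confirm that `parse_dates` did its job is to check the column's dtype and the spacing between rows. The sketch below uses a hypothetical three-row sample that copies a few values from the preview above; the name `sample` and the in-memory CSV are illustrative assumptions, not part of the notebook.

```python
import io

import pandas as pd

# Hypothetical sample mimicking three columns of the real file.
csv_text = """Store,Weekly_Sales,date
1,1643690.90,2010-02-05
1,1641957.44,2010-02-12
1,1611968.17,2010-02-19
"""

sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# The date column should be datetime64 rather than object (plain strings).
date_is_datetime = pd.api.types.is_datetime64_any_dtype(sample["date"])
print(sample.dtypes)
print("date parsed as datetime:", date_is_datetime)

# Weekly data should be spaced seven days apart.
gaps_in_days = sample["date"].diff().dropna().dt.days.tolist()
print("gaps between rows (days):", gaps_in_days)
```

The same two checks can be run on `df` itself; a gap other than 7 days within one store and department would point to missing weeks.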


### Preparing the Data for All Models

1. **Loading the Data**:
    - We load the dataset from a CSV file using **pandas**.
    ```python
    data = pd.read_csv(datapath)
    ```

2. **Feature Scaling**:
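    - We use **StandardScaler** to standardize the listed columns so that each has a mean of 0 and a standard deviation of 1. Tree-based models such as **XGBoost**, **CatBoost**, and **Random Forest** are largely insensitive to feature scaling, so this step mainly keeps the inputs on a common scale. Note that the list also includes the target, `Weekly_Sales`, so the error metrics reported later are in standardized units rather than the original sales units.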
    ```python
    scaler = StandardScaler()

    features = ['Fuel_Price', 'Temperature', 'CPI', 'Unemployment', 'Size', 'Weekly_Sales']
    data[features] = scaler.fit_transform(data[features])
    ```

3. **Preparing Input and Output Variables**:
    - **X** represents the features (independent variables), and **y** represents the target variable (`Weekly_Sales`).
    ```python
    X = data.drop(columns=['Weekly_Sales', 'date'])
    y = data['Weekly_Sales'].values.reshape(-1, 1)
    ```

4. **Splitting the Data**:
    - We split the data into training and testing sets using **train_test_split** from **sklearn** with 80% for training and 20% for testing. This is used for models such as **XGBoost**, **CatBoost**, and **Random Forest**.
    ```python
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    ```

5. **Printing Data Shapes**:
    - Finally, we print the shapes of the training and testing datasets to confirm that the data is correctly formatted for the models.
    ```python
    print(f'X_train shape: {X_train.shape}')
    print(f'y_train shape: {y_train.shape}')
    ```

In [4]:
# Prepare The Data for all models.
# This is the part where you prepare the data for each and every model before getting into training
data = pd.read_csv(datapath)

scaler = StandardScaler()

features = ['Fuel_Price', 'Temperature', 'CPI', 'Unemployment', 'Size', 'Weekly_Sales']

data[features] = scaler.fit_transform(data[features])

X = data.drop(columns=['Weekly_Sales', 'date'])
y = data['Weekly_Sales'].values.reshape(-1, 1)

# This is used for XGBoost, CatBoost, Random Forest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'X_train shape: {X_train.shape}')
print(f'y_train shape: {y_train.shape}')

X_train shape: (337256, 15)
y_train shape: (337256, 1)
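
The cell above fits the scaler on the full dataset before splitting, and it scales the target as well. A common alternative, sketched here on synthetic data, is to split first, fit the scaler on the training rows only, and leave the target in its original units. The frame `toy` and every name in this block are assumptions made for illustration; this is not the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for a few columns of the Walmart data.
toy = pd.DataFrame({
    "Fuel_Price": rng.normal(3.0, 0.4, 200),
    "Temperature": rng.normal(60.0, 15.0, 200),
    "Weekly_Sales": rng.normal(1_500_000, 200_000, 200),
})

numeric_features = ["Fuel_Price", "Temperature"]  # the target is left out
X_toy = toy[numeric_features]
y_toy = toy["Weekly_Sales"].to_numpy()  # 1-D target in original units

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Fit on the training rows only, then reuse the same statistics for the test rows.
toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

print("train feature means:", X_tr_scaled.mean(axis=0).round(6))
print(X_tr_scaled.shape, X_te_scaled.shape, y_tr.shape)
```

With this ordering the test rows never influence the scaling statistics, and metrics computed on `y_te` are in the original sales units.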


### Training the Models

In this section, we train each model with its own hyperparameters. The values used here are set by hand; they are tuned with **GridSearchCV** in later sections.

1. **Training the XGBoost Model**:
    - We start by initializing the **XGBRegressor** model with the following hyperparameters:
        - **n_estimators**: The number of boosting rounds (200).
        - **max_depth**: The maximum depth of the decision trees (7).
        - **learning_rate**: The rate at which the model learns (0.01).
        - **subsample**: The fraction of samples used to build each tree (0.8).
        - **colsample_bytree**: The fraction of features used to build each tree (1.0).
        - **objective**: The objective function used for regression tasks (`'reg:squarederror'`).
        - **random_state**: A random seed to ensure reproducibility (42).
    
    - After initializing the model, we fit it to the training data (**X_train**, **y_train**).

2. **Confirmation**:
    - After training the model, we print a message confirming that **XGBoost** has been successfully trained.

This training process ensures that the XGBoost model is ready for making predictions and evaluations.

In [5]:
# Now for the training part.
# Each model has its own parameters

# XGBoost
xgb_model = XGBRegressor(n_estimators=200, max_depth=7, learning_rate=0.01, subsample=0.8,
                         colsample_bytree=1.0, objective='reg:squarederror', random_state=42)
xgb_model.fit(X_train, y_train)
###
print('XGBoost has been trained!')

XGBoost has been trained!
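
The number of boosting rounds and the learning rate interact: a small learning rate usually needs more rounds to reach the same fit. The sketch below illustrates this on synthetic data, using scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor` so that it runs without extra packages. The data, the names, and the stand-in model are all assumptions, and the numbers do not transfer to the Walmart dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic non-linear regression problem with little noise.
X_demo = rng.uniform(-3, 3, size=(600, 3))
y_demo = (
    10 * np.sin(X_demo[:, 0])
    + 5 * X_demo[:, 1] ** 2
    + rng.normal(scale=0.1, size=600)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same number of rounds, two learning rates.
test_mse = {}
for rate in (0.01, 0.1):
    model = GradientBoostingRegressor(
        n_estimators=200, max_depth=3, learning_rate=rate,
        subsample=0.8, random_state=42,
    )
    model.fit(X_tr, y_tr)
    test_mse[rate] = mean_squared_error(y_te, model.predict(X_te))

for rate, mse in test_mse.items():
    print(f"learning_rate={rate}: test MSE = {mse:.3f}")
```

On this toy problem, 200 rounds at a rate of 0.01 leave the model underfit compared with a rate of 0.1, which is one reason to search over both values rather than fix them by hand.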


### Training the CatBoost Model

In this section, we train the **CatBoostRegressor** model with the specified parameters:

1. **Training the CatBoost Model**:
    - We initialize the **CatBoostRegressor** model with the following hyperparameters:
        - **iterations**: The number of boosting iterations (10000).
        - **learning_rate**: The rate at which the model learns during each iteration (0.1).
        - **verbose**: Controls the verbosity of the training process (set to 1 to display training progress).

    - After initializing the model, we fit it to the training data (**X_train**, **y_train**).

2. **Confirmation**:
    - Once the model has been trained, we print a message confirming that **CatBoost** has been successfully trained.

This process ensures that the CatBoost model is ready for predictions and further evaluation.

In [6]:
# CatBoost
catboost_model = CatBoostRegressor(iterations=10000, learning_rate=0.1, verbose=1)
catboost_model.fit(X_train, y_train)
###
print('CatBoost has been trained!')

Streaming output truncated to the last 5000 lines.
5001:	learn: 0.0100187	total: 3m 58s	remaining: 3m 58s
5002:	learn: 0.0100160	total: 3m 58s	remaining: 3m 58s
5003:	learn: 0.0100138	total: 3m 58s	remaining: 3m 58s
5004:	learn: 0.0100113	total: 3m 58s	remaining: 3m 58s
5005:	learn: 0.0100085	total: 3m 58s	remaining: 3m 58s
5006:	learn: 0.0100054	total: 3m 58s	remaining: 3m 58s
5007:	learn: 0.0100032	total: 3m 58s	remaining: 3m 58s
5008:	learn: 0.0100012	total: 3m 59s	remaining: 3m 58s
5009:	learn: 0.0099986	total: 3m 59s	remaining: 3m 58s
5010:	learn: 0.0099935	total: 3m 59s	remaining: 3m 58s
5011:	learn: 0.0099922	total: 3m 59s	remaining: 3m 58s
5012:	learn: 0.0099901	total: 3m 59s	remaining: 3m 57s
5013:	learn: 0.0099878	total: 3m 59s	remaining: 3m 57s
5014:	learn: 0.0099856	total: 3m 59s	remaining: 3m 57s
5015:	learn: 0.0099832	total: 3m 59s	remaining: 3m 57s
5016:	learn: 0.0099809	total: 3m 59s	remaining: 3m 57s
5017:	learn: 0.0099779	total: 3m 59s	remaining: 3m 57s
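
The log shows the training loss changing only in the fifth decimal place around iteration 5000, which suggests that many of the 10000 iterations add little. One general remedy is early stopping: hold out part of the training data and stop once the validation score no longer improves. The sketch below shows the idea with scikit-learn's `GradientBoostingRegressor` on synthetic data. It is a swapped-in illustration, not CatBoost's API, and every name in it is an assumption; CatBoost has its own early-stopping options, which are not shown here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic linear signal with noticeable noise.
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=1.0, size=500)

max_rounds = 2000
early_model = GradientBoostingRegressor(
    n_estimators=max_rounds,   # upper bound on boosting rounds
    learning_rate=0.1,
    validation_fraction=0.2,   # share of the training data held out internally
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    random_state=42,
)
early_model.fit(X_demo, y_demo)

# n_estimators_ reports how many rounds were actually kept.
rounds_used = early_model.n_estimators_
print(f"rounds used: {rounds_used} of {max_rounds}")
```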


### Training the Random Forest Regressor Model

In this section, we train the **RandomForestRegressor** model with the specified parameters:

1. **Training the Random Forest Regressor**:
    - We initialize the **RandomForestRegressor** model with the following hyperparameters:
        - **n_estimators**: The number of trees in the forest (200).
        - **random_state**: A random seed to ensure reproducibility (42).

    - After initializing the model, we fit it to the training data (**X_train**, **y_train**).

2. **Confirmation**:
    - Once the model has been trained, we print a message confirming that the **Random Forest Regressor** has been successfully trained.

This ensures that the Random Forest Regressor model is ready for predictions and further analysis.

In [7]:
# Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=200, random_state=42)
# ravel() passes a 1-D target, which avoids scikit-learn's DataConversionWarning
rf_model.fit(X_train, y_train.ravel())
###
print('Random Forest Regressor has been trained!')

Random Forest Regressor has been trained!
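
scikit-learn's forest estimators expect a 1-D target and emit a `DataConversionWarning` when `y` is a column vector, such as the `reshape(-1, 1)` target built in the data preparation cell. The sketch below reproduces this on synthetic data and shows that flattening the target with `ravel()` avoids the warning. All names are illustrative assumptions.

```python
import warnings

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.exceptions import DataConversionWarning

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_flat = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)
y_column = y_flat.reshape(-1, 1)  # shape (100, 1), like y in the notebook


def fit_warns(target):
    """Return True if fitting a small forest on `target` raises DataConversionWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, target)
    return any(issubclass(w.category, DataConversionWarning) for w in caught)


column_warns = fit_warns(y_column)
flat_warns = fit_warns(y_column.ravel())

print("column vector warns:", column_warns)
print("flattened target warns:", flat_warns)
```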


### Hyperparameter Tuning for XGBoost Model using GridSearchCV

In this section, we use **GridSearchCV** to find the best hyperparameters for the **XGBoost** model by performing an exhaustive search over a specified parameter grid:

1. **Defining the Hyperparameters Grid**:
    - We define a dictionary `xgb_params` that contains the possible values for the following hyperparameters:
        - **n_estimators**: Number of boosting rounds (100 or 200).
        - **max_depth**: Maximum depth of the trees (4, 6, or 8).
        - **learning_rate**: Rate at which the model learns (0.01 or 0.1).
        - **subsample**: Fraction of samples used to train each tree (0.8 or 1.0).
        - **colsample_bytree**: Fraction of features used to train each tree (0.8 or 1.0).

2. **GridSearchCV Setup**:
    - We initialize the **GridSearchCV** with the following parameters:
        - **estimator**: The base model, in this case, an **XGBRegressor** with `objective='reg:squarederror'` and a fixed `random_state=42`.
        - **param_grid**: The hyperparameters to search over (`xgb_params`).
        - **scoring**: The evaluation metric used for selecting the best model (`neg_mean_squared_error`).
        - **cv**: The number of cross-validation folds (3).
        - **n_jobs**: The number of CPU cores to use during the search (set to -1 to use all available cores).
        - **verbose**: The verbosity level of the output (set to 1 to show progress).

3. **Fitting the Model**:
    - We fit the **GridSearchCV** to the training data (**X_train**, **y_train**) to perform the hyperparameter search.

4. **Best Model Selection**:
    - Once the search is complete, we print the best parameters found during the search and extract the best estimator (model) with `xgb_grid.best_estimator_`.

This process selects the best **XGBoost** hyperparameters among the 48 combinations in the grid, as measured by 3-fold cross-validated MSE. Values outside the grid are not explored.


In [8]:
from sklearn.model_selection import GridSearchCV
xgb_params = {
    'n_estimators': [100, 200],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}
xgb_grid = GridSearchCV(estimator=XGBRegressor(objective='reg:squarederror', random_state=42),
                        param_grid=xgb_params, scoring='neg_mean_squared_error',
                        cv=3, n_jobs=-1, verbose=1)
xgb_grid.fit(X_train, y_train)
print("Best parameters for XGBoost:", xgb_grid.best_params_)
xgb_model = xgb_grid.best_estimator_

Fitting 3 folds for each of 48 candidates, totalling 144 fits
Best parameters for XGBoost: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 8, 'n_estimators': 200, 'subsample': 0.8}
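
The "48 candidates, 144 fits" line in the output follows directly from the grid: the candidate count is the product of the number of values per hyperparameter, and each candidate is fitted once per fold. As a small check, scikit-learn's `ParameterGrid` can enumerate the combinations; the dictionary is copied from the cell above, and `cv_folds`, `n_candidates`, and `total_fits` are names introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Copied from the tuning cell above.
xgb_params = {
    'n_estimators': [100, 200],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
cv_folds = 3

# 2 * 3 * 2 * 2 * 2 combinations, each fitted once per fold.
n_candidates = len(ParameterGrid(xgb_params))
total_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` trains one more model on the full training set using the best combination, which is what `best_estimator_` returns.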


### Hyperparameter Tuning for CatBoost Model using GridSearchCV

In this section, we use **GridSearchCV** to find the best hyperparameters for the **CatBoost** model by performing an exhaustive search over a specified parameter grid:

1. **Defining the Hyperparameters Grid**:
    - We define a dictionary `cat_params` that contains the possible values for the following hyperparameters:
        - **iterations**: The number of boosting iterations (500 or 1000).
        - **learning_rate**: The rate at which the model learns (0.01 or 0.1).
        - **depth**: The depth of the trees (6, 8, or 10).

2. **GridSearchCV Setup**:
    - We initialize the **GridSearchCV** with the following parameters:
        - **estimator**: The base model, in this case, a **CatBoostRegressor** with `verbose=0` to suppress training output.
        - **param_grid**: The hyperparameters to search over (`cat_params`).
        - **scoring**: The evaluation metric used for selecting the best model (`neg_mean_squared_error`).
        - **cv**: The number of cross-validation folds (3).
        - **n_jobs**: The number of CPU cores to use during the search (set to -1 to use all available cores).
        - **verbose**: The verbosity level of the output (set to 1 to show progress).

3. **Fitting the Model**:
    - We fit the **GridSearchCV** to the training data (**X_train**, **y_train**) to perform the hyperparameter search.

4. **Best Model Selection**:
    - Once the search is complete, we print the best parameters found during the search and extract the best estimator (model) with `catboost_grid.best_estimator_`.

This process selects the best **CatBoost** hyperparameters among the combinations in the grid, as measured by 3-fold cross-validated MSE. Values outside the grid are not explored.


In [None]:
from sklearn.model_selection import GridSearchCV
cat_params = {
    'iterations': [500, 1000],
    'learning_rate': [0.01, 0.1],
    'depth': [6, 8, 10]
}
catboost_grid = GridSearchCV(estimator=CatBoostRegressor(verbose=0),
                             param_grid=cat_params, scoring='neg_mean_squared_error',
                             cv=3, n_jobs=-1, verbose=1)
catboost_grid.fit(X_train, y_train)
print("Best parameters for CatBoost:", catboost_grid.best_params_)
catboost_model = catboost_grid.best_estimator_

### Hyperparameter Tuning for Random Forest Regressor using GridSearchCV

In this section, we use **GridSearchCV** to find the best hyperparameters for the **RandomForestRegressor** model by performing an exhaustive search over a specified parameter grid:

1. **Defining the Hyperparameters Grid**:
    - We define a dictionary `rf_params` that contains the possible values for the following hyperparameters:
        - **n_estimators**: The number of trees in the forest (100 or 200).
        - **max_depth**: The maximum depth of the trees (`None`, 10, or 20).
        - **min_samples_split**: The minimum number of samples required to split an internal node (2 or 5).

2. **GridSearchCV Setup**:
    - We initialize the **GridSearchCV** with the following parameters:
        - **estimator**: The base model, in this case, a **RandomForestRegressor** with a fixed `random_state=42` for reproducibility.
        - **param_grid**: The hyperparameters to search over (`rf_params`).
        - **scoring**: The evaluation metric used for selecting the best model (`neg_mean_squared_error`).
        - **cv**: The number of cross-validation folds (3).
        - **n_jobs**: The number of CPU cores to use during the search (set to -1 to use all available cores).
        - **verbose**: The verbosity level of the output (set to 1 to show progress).

3. **Fitting the Model**:
    - We fit the **GridSearchCV** to the training data (**X_train**, **y_train**) to perform the hyperparameter search.

4. **Best Model Selection**:
    - Once the search is complete, we print the best parameters found during the search and extract the best estimator (model) with `rf_grid.best_estimator_`.

This process selects the best **RandomForestRegressor** hyperparameters among the combinations in the grid, as measured by 3-fold cross-validated MSE. Values outside the grid are not explored.


In [None]:
from sklearn.model_selection import GridSearchCV
rf_params = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}
rf_grid = GridSearchCV(estimator=RandomForestRegressor(random_state=42),
                       param_grid=rf_params, scoring='neg_mean_squared_error',
                       cv=3, n_jobs=-1, verbose=1)
rf_grid.fit(X_train, y_train.ravel())  # 1-D target avoids DataConversionWarning
print("Best parameters for Random Forest:", rf_grid.best_params_)
rf_model = rf_grid.best_estimator_
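
Because the searches use `scoring='neg_mean_squared_error'`, the fitted object's `best_score_` is a negated MSE: scikit-learn maximizes scores, so errors are reported with a minus sign. The sketch below runs a deliberately tiny search on synthetic data and converts the score back to an RMSE. The grid, the data, and the names are assumptions chosen to keep the run short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(150, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=150)

demo_grid = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [10, 20], "max_depth": [None, 5]},
    scoring="neg_mean_squared_error",
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the best candidate (negative MSE).
best_cv_mse = -demo_grid.best_score_
best_cv_rmse = np.sqrt(best_cv_mse)

print("best params:", demo_grid.best_params_)
print(f"cross-validated MSE: {best_cv_mse:.4f}, RMSE: {best_cv_rmse:.4f}")
```

The per-candidate scores are available in `demo_grid.cv_results_`, which helps to see whether the best candidate sits at the edge of the grid.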

### Model Predictions and Evaluation

In this section, we make predictions using various trained models and evaluate their performance using multiple regression metrics.

1. **Making Predictions**:
   - Predictions are made for each model using the test data (**X_test**):
     - **XGBoost**: `xgboost_pred = xgb_model.predict(X_test)`
     - **CatBoost**: `catboost_pred = catboost_model.predict(X_test)`
     - **Random Forest**: `rf_pred = rf_model.predict(X_test)`

2. **Evaluating Model Performance**:
   - We define a function `calculate_metrics(y_true, y_pred)` to compute multiple evaluation metrics for each model:
     - **Mean Squared Error (MSE)**: Measures the average squared difference between actual and predicted values.
     - **Root Mean Squared Error (RMSE)**: The square root of MSE, giving a measure in the same units as the target variable.
     - **Mean Absolute Error (MAE)**: The average of the absolute differences between actual and predicted values.
     - **Mean Absolute Percentage Error (MAPE)**: The mean absolute error relative to the actual values. scikit-learn returns it as a fraction rather than a percentage, so 0.05 means 5%.
     - **R2 Score**: A measure of how well the model explains the variance of the target variable.
     - **Explained Variance Score**: Measures the proportion of variance in the target variable that is explained by the model.

3. **Next Steps**:
   - The metrics for each model can now be calculated and compared to evaluate which model performs the best on the given task.

This section ensures that predictions are made for all models and their performance is evaluated using standard regression metrics.

In [11]:
# Moment of truth! predictions and evaluations

xgboost_pred = xgb_model.predict(X_test)
catboost_pred = catboost_model.predict(X_test)
rf_pred = rf_model.predict(X_test)

# Define a function to calculate multiple regression metrics for multiple models
def calculate_metrics(y_true, y_pred):
    metrics = {
        'Mean Squared Error (MSE)': mean_squared_error(y_true, y_pred),
        'Root Mean Squared Error (RMSE)': np.sqrt(mean_squared_error(y_true, y_pred)),
        'Mean Absolute Error (MAE)': mean_absolute_error(y_true, y_pred),
        'Mean Absolute Percentage Error (MAPE)': mean_absolute_percentage_error(y_true, y_pred),
        'R2 Score': r2_score(y_true, y_pred),
        'Explained Variance Score': explained_variance_score(y_true, y_pred)
    }
    return metrics
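
To make the six metrics concrete, the following sketch computes them by hand with NumPy on a four-point example, so the formulas behind the scikit-learn functions are visible. The arrays are made-up values chosen for easy arithmetic, not results from the notebook.

```python
import numpy as np

# Made-up values chosen so the arithmetic is easy to follow.
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([1.0, 5.0, 6.0, 10.0])

residuals = y_true - y_pred  # [1, -1, 0, -2]

mse = np.mean(residuals ** 2)                        # (1 + 1 + 0 + 4) / 4
rmse = np.sqrt(mse)
mae = np.mean(np.abs(residuals))                     # (1 + 1 + 0 + 2) / 4
mape = np.mean(np.abs(residuals) / np.abs(y_true))   # a fraction, not a percentage
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
explained_variance = 1.0 - np.var(residuals) / np.var(y_true)

print(f"MSE={mse}, RMSE={rmse:.4f}, MAE={mae}, MAPE={mape}")
print(f"R2={r2:.2f}, explained variance={explained_variance:.2f}")
```

Here R2 (0.70) is lower than the explained variance (0.75) because the residuals have a non-zero mean: explained variance ignores a constant offset in the predictions, while R2 penalizes it.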

### Model Performance Evaluation

In this section, we calculate and display the evaluation metrics for the models **XGBoost**, **CatBoost**, and **Random Forest**.

1. **Metric Calculation**:
   - We use the `calculate_metrics()` function to compute the following metrics for each model's predictions:
     - **XGBoost**: Evaluated on `xgboost_pred` with `y_test`.
     - **CatBoost**: Evaluated on `catboost_pred` with `y_test`.
     - **Random Forest**: Evaluated on `rf_pred` with `y_test`.

2. **Displaying Results**:
   - The results for each model are stored in a dictionary called `predicts`.
   - We iterate through the dictionary to print the calculated metrics for each model.
   - For each model, the following evaluation metrics are displayed:
     - **Mean Squared Error (MSE)**
     - **Root Mean Squared Error (RMSE)**
     - **Mean Absolute Error (MAE)**
     - **Mean Absolute Percentage Error (MAPE)**
     - **R2 Score**
     - **Explained Variance Score**

This process enables a clear comparison between the models, helping identify which performs best on the given dataset.

In [12]:
# Metric calculations cell
predicts = {'XGBoost': calculate_metrics(y_test, xgboost_pred),
            'CatBoost': calculate_metrics(y_test, catboost_pred),
            'Random Forest': calculate_metrics(y_test, rf_pred),
            }

print(f'Model Performance Evaluation: ')
for model, dictionary in predicts.items():
    print(f'----------------------\nFor {model} Model:')
    for key, value in dictionary.items():
        print(f'{key}: {value}')

Model Performance Evaluation: 
----------------------
For XGBoost Model:
Mean Squared Error (MSE): 0.0005504128331826292
Root Mean Squared Error (RMSE): 0.023460878781124742
Mean Absolute Error (MAE): 0.017533436705632444
Mean Absolute Percentage Error (MAPE): 0.08700977166345086
R2 Score: 0.9994506636322722
Explained Variance Score: 0.9994506662495007
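
Because `Weekly_Sales` was standardized together with the features, the MSE, RMSE, and MAE above are in standard-deviation units rather than sales units. An RMSE or MAE in scaled units can be converted back by multiplying it by the standard deviation the scaler learned for the target. The sketch below demonstrates the relationship on synthetic data with a scaler fitted on the target alone; `target_scaler` and the other names are assumptions, and in the notebook the corresponding value would be the entry of `scaler.scale_` at the position of `Weekly_Sales` in `features`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic weekly sales, standardized the same way as in the notebook.
sales = rng.normal(1_000_000, 250_000, size=(500, 1))
target_scaler = StandardScaler()
sales_scaled = target_scaler.fit_transform(sales)

# Pretend predictions: the scaled truth plus a small error.
pred_scaled = sales_scaled + rng.normal(0.0, 0.02, size=sales_scaled.shape)

rmse_scaled = float(np.sqrt(np.mean((sales_scaled - pred_scaled) ** 2)))

# Route 1: map predictions back to original units, then compute RMSE.
pred_original = target_scaler.inverse_transform(pred_scaled)
rmse_original = float(np.sqrt(np.mean((sales - pred_original) ** 2)))

# Route 2: multiply the scaled RMSE by the target's standard deviation.
sales_std = float(target_scaler.scale_[0])
rmse_rescaled = rmse_scaled * sales_std

print(f"RMSE (scaled units):   {rmse_scaled:.4f}")
print(f"RMSE (original units): {rmse_original:,.0f}")
```

R2 and explained variance are unaffected by this rescaling. MAPE is not: on a standardized target, values near zero inflate the ratio, so it is better computed after converting predictions back to original units.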


In [13]:
# Save the tuned XGBoost model.
dump(xgb_model, 'xgb_model.joblib')

['xgb_model.joblib']
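
A saved model is only useful if it loads back and predicts identically. The sketch below shows a `dump`/`load` round trip on a small scikit-learn model written to a temporary directory. The model, data, and file name are illustrative assumptions; the same pattern should apply to `xgb_model.joblib`, provided compatible library versions are installed when loading.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo[:, 0] + X_demo[:, 1]

demo_model = RandomForestRegressor(n_estimators=5, random_state=42).fit(X_demo, y_demo)

# Save to a temporary directory and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.joblib"
    dump(demo_model, model_path)
    restored_model = load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(
    demo_model.predict(X_demo), restored_model.predict(X_demo)
)
print("round trip preserves predictions:", same_predictions)
```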