### Summary of the Steps

---

#### **Initial Model Training**

Trained five baseline models, mostly with **default hyperparameters** (XGBoost was run with `n_estimators=50` to shorten training):

1. **Linear Regression**
2. **Decision Tree Regressor**
3. **Random Forest Regressor**
4. **Gradient Boosting Regressor**
5. **XGBoost Regressor**

Measured model performance using:
- **Mean Squared Error (MSE)**
- **R² Score**
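
As a minimal sketch of what these two metrics measure, both can be computed by hand. The function names `mse_by_hand` and `r2_by_hand` and the sales values are made up for illustration; the notebook itself uses `mean_squared_error` and `r2_score` from scikit-learn.

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """1 - SS_res / SS_tot: 0 matches a constant mean predictor, negative is worse."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative daily sales values (made up, not taken from the dataset)
y_true = [900.0, 1400.0, 1000.0, 2200.0]
y_pred = [1000.0, 1300.0, 1100.0, 2000.0]

mse = mse_by_hand(y_true, y_pred)
r2 = r2_by_hand(y_true, y_pred)
print(f"MSE: {mse:.1f}, R²: {r2:.3f}")
```

MSE is in squared target units, so it grows quickly with large errors, while R² is unitless and compares the model against always predicting the mean.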

---

#### **Grid Search Hyperparameter Tuning**

- Applied **GridSearchCV** to:
  - **Random Forest Regressor**
  - **Gradient Boosting Regressor**
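- Searched a 16-candidate grid per model with 5-fold cross-validation, scored by R².
- Refit each model with its best parameters and evaluated it on the held-out test set to check for performance improvements.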
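
The number of fits that `GridSearchCV` reports for this notebook's Random Forest grid can be reproduced by enumerating the grid directly. This is a sketch in plain Python that only counts candidates; it does not train anything.

```python
from itertools import product

# Same grid as the Random Forest grid search in this notebook
rf_param_grid = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [10, 20],
    "model__min_samples_split": [2, 5],
    "model__min_samples_leaf": [1, 2],
}

# Every combination of one value per parameter is one candidate
candidates = [
    dict(zip(rf_param_grid, combo)) for combo in product(*rf_param_grid.values())
]

n_folds = 5  # cv=5 in the notebook
total_fits = len(candidates) * n_folds
print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
```

Adding a third value to each of the four parameters would raise the count from 16 to 81 candidates, which is why exhaustive grids get expensive quickly.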

---

#### **Random Search Hyperparameter Tuning**

- Used **RandomizedSearchCV** for:
  - **Random Forest Regressor**
  - **Gradient Boosting Regressor**
- Random Search sampled 20 hyperparameter combinations per model from a larger search space, rather than fitting every combination exhaustively as Grid Search does.
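
To put the sampling in perspective, the sketch below counts the full search spaces used for Random Search in this notebook and draws 20 candidates without replacement. The helper names `grid_size` and `sample_candidates` are hypothetical, and `random.sample` is only a stand-in for illustration, not scikit-learn's internal sampler.

```python
import math
import random
from itertools import product

# Search spaces copied from the Random Search cell of this notebook
rf_space = {
    "model__n_estimators": [100, 200, 300],
    "model__max_depth": [10, 20, 30, None],
    "model__min_samples_split": [2, 5, 10],
    "model__min_samples_leaf": [1, 2, 4],
}
gb_space = {
    "model__n_estimators": [100, 200, 300],
    "model__learning_rate": [0.01, 0.05, 0.1],
    "model__max_depth": [3, 5, 7],
    "model__min_samples_split": [2, 5, 10],
}


def grid_size(space):
    """Number of distinct combinations in a discrete search space."""
    return math.prod(len(values) for values in space.values())


def sample_candidates(space, n_iter, seed):
    """Draw n_iter distinct combinations (illustrative stand-in sampler)."""
    all_combos = [dict(zip(space, combo)) for combo in product(*space.values())]
    return random.Random(seed).sample(all_combos, n_iter)


n_iter = 20  # n_iter=20 in the notebook
rf_candidates = sample_candidates(rf_space, n_iter, seed=0)

for name, space in [("Random Forest", rf_space), ("Gradient Boosting", gb_space)]:
    size = grid_size(space)
    print(f"{name}: {n_iter} of {size} combinations ({n_iter / size:.1%})")
```

With 5 folds, 20 sampled candidates cost 100 fits per model, compared with 540 and 405 fits for exhaustive searches over the same spaces.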

---

### Interpretation of Results

---

#### **Best Performing Models**

1. **Linear Regression**  
   - **MSE**: 1,120,112.41  
   - **R² Score**: 0.502  
   - Explains about **50.2%** of the variance in sales, making it the **best-performing model** overall.

2. **Gradient Boosting Regressor (Random Search)**  
   - **MSE**: 1,172,775.89  
   - **R² Score**: 0.479  
   - After **Random Search** tuning, this model's R² rose from 0.465 (default) to 0.479, close to Linear Regression's performance.
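
Because MSE is in squared units, taking its square root gives an error in the same units as `net_sales`, which is easier to read. The sketch below applies that conversion to the two MSE values reported above; RMSE is not computed in the notebook itself.

```python
import math

# MSE values as reported in this section
reported_mse = {
    "Linear Regression": 1120112.41,
    "Gradient Boosting Regressor (Random Search)": 1172775.89,
}

# RMSE = sqrt(MSE), expressed in the units of the target (net_sales)
rmse = {name: math.sqrt(value) for name, value in reported_mse.items()}

for name, value in rmse.items():
    print(f"{name}: RMSE ≈ {value:.0f}")
```

Both models are off by a little over one thousand sales units on a typical test day, so the gap between them is small relative to the error itself.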

---

#### **Random Forest Regressor**

- **Performance improved** with both **Grid Search** and **Random Search**:
  - **Default R² Score**: 0.367  
  - **Grid Search R² Score**: 0.416  
  - **Random Search R² Score**: 0.435  
- Tuning reduced MSE and raised R², but the model still lags behind Linear Regression and Gradient Boosting.

---

#### **XGBoost Regressor**

- **R² Score**: 0.268  
- Underperformed, possibly because:
  - **Hyperparameters** were not tuned (the model ran with `n_estimators=50` and otherwise default settings).
  - The **feature set** may not capture enough patterns for XGBoost to leverage.

---

#### **Decision Tree Regressor**

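- **R² Score**: -0.034 (negative, so worse on the test set than always predicting the mean)  
- Poor performance likely due to:
  - **Overfitting** the training data (tree depth was left unrestricted).
  - Failing to **generalize** to new data.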
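
A negative R² can be read directly from the definition R² = 1 − MSE / MSE_baseline, where the baseline always predicts the test-set mean. As a rough consistency check, the sketch below backs out that baseline MSE from two of the reported metric pairs; the variable names are illustrative.

```python
# (MSE, R²) pairs as printed by the notebook's results cell
reported = {
    "Linear Regression": (1120112.4094821122, 0.5019548878268173),
    "Decision Tree Regressor": (2324804.485278358, -0.03369760110640829),
}

# Rearranging R² = 1 - MSE / baseline_mse gives baseline_mse = MSE / (1 - R²)
baseline_mse = {name: mse / (1 - r2) for name, (mse, r2) in reported.items()}

for name, (mse, _) in reported.items():
    verdict = "worse" if mse > baseline_mse[name] else "better"
    print(f"{name}: MSE {mse:,.0f} is {verdict} than the mean baseline "
          f"({baseline_mse[name]:,.0f})")
```

Both models imply the same baseline of about 2.25 million, as expected for a shared test set, and the Decision Tree's MSE sits just above it.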

---

### Conclusion

- **Linear Regression** and **Gradient Boosting** performed the best.
- **Random Forest** showed improvement with hyperparameter tuning.
- **XGBoost** and **Decision Tree** underperformed, suggesting the need for further tuning or feature engineering.


## 1. Feature Overview

### Numerical Features

1. **`mean_temp_c`** – Mean temperature (continuous)  
2. **`total_rain_mm`** – Total rainfall (continuous)  
3. **`total_snow_mm`** – Total snowfall (continuous)  
4. **`is_holiday`** – Binary indicator (0 or 1)  
5. **`is_holiday_prev_1`** – Binary indicator (0 or 1)  
6. **`is_holiday_next_1`** – Binary indicator (0 or 1)  
7. **`is_holiday_prev_2`** – Binary indicator (0 or 1)  
8. **`is_holiday_next_2`** – Binary indicator (0 or 1)  
9. **`month`** – Month number, 1 to 12 (categorical in nature, but treated as numerical here)  

### Categorical Features

1. **`day_of_week`** – Day of the week (categorical: Monday, Tuesday, etc.)  
2. **`season`** – Season (categorical: Winter, Spring, Summer, Fall)  

---

## 2. Preprocessing Steps

### Numerical Preprocessing

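For numerical features, apply **standardization** to scale the data. This gives each numerical feature a mean of 0 and a standard deviation of 1 on the training set. Predictions from plain **Linear Regression** do not depend on feature scale, but standardized features make coefficients comparable and help regularized or gradient-based models.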

#### **StandardScaler** will be used for:

- `mean_temp_c`  
- `total_rain_mm`  
- `total_snow_mm`  
- `month`  
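
A minimal sketch of the standardization step, assuming made-up temperature and rainfall values rather than the real dataset. The scaler learns its mean and standard deviation from the training rows only and reuses them for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 is temperature (°C), column 1 is rainfall (mm)
X_train = np.array([[-5.0, 0.0], [0.0, 2.0], [10.0, 0.0], [15.0, 10.0]])
X_test = np.array([[5.0, 3.0]])  # happens to equal the training means

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print("train means:", X_train_scaled.mean(axis=0))
print("train stds: ", X_train_scaled.std(axis=0))
print("test row:   ", X_test_scaled)
```

Fitting the scaler inside a `Pipeline`, as the notebook does, gives the same separation between training and test statistics without manual bookkeeping.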

### Categorical Preprocessing

For categorical features, apply **one-hot encoding** to convert them into binary columns. This is suitable for models like **Linear Regression** and **tree-based models**.

#### **OneHotEncoder** will be used for:

- `day_of_week`  
- `season`  

### Binary Features

Binary features (0 or 1) do not need scaling or encoding; they pass through the `ColumnTransformer` unchanged via `remainder='passthrough'`.

#### **Binary features**:

- `is_holiday`  
- `is_holiday_prev_1`  
- `is_holiday_next_1`  
- `is_holiday_prev_2`  
- `is_holiday_next_2`  


In [32]:
import pandas as pd
import numpy as np
import yaml
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

In [6]:
# Load the config.yaml file
with open("../config.yaml", "r") as file:
    config = yaml.safe_load(file)

In [7]:
# Get the path to the clean data
clean_data_path = config['output_data']['cleaned_merged_climate_sales']

# Load the clean data
ml_df = pd.read_csv(clean_data_path)

# Inspect the data
print(ml_df.head())

         date        day  gross_sales  returns  discounts_comps  net_sales  \
0  2023-02-01  Wednesday       919.07      0.0           -33.35     885.72   
1  2023-02-02   Thursday      1463.52      0.0           -20.61    1442.91   
2  2023-02-03     Friday      1051.04      0.0            -9.60    1041.44   
3  2023-02-04   Saturday      2243.72      0.0           -12.43    2231.29   
4  2023-02-05     Sunday      1405.99      0.0           -25.12    1380.87   

   gift_card_sales     tax     tip  refunds_by_amount  ...  total_precip_mm  \
0              0.0   84.44   42.35                0.0  ...              0.0   
1              0.0  108.76   72.70                0.0  ...              0.0   
2              0.0   93.65   49.94                0.0  ...              0.3   
3              0.0  176.67  186.98                0.0  ...              0.0   
4              0.0   85.04   77.20                0.0  ...              0.0   

   total_snow_mm  holiday_name  is_holiday  is_holiday_p

In [10]:
# Ensure 'date' is in datetime format
ml_df['date'] = pd.to_datetime(ml_df['date'], errors='coerce')

# Feature Engineering: derive day of week and month from the date (overwrites any existing columns)
ml_df['day_of_week'] = ml_df['date'].dt.day_name()
ml_df['month'] = ml_df['date'].dt.month

In [12]:
# Target variable
y = ml_df['net_sales']

# Features
X = ml_df[['mean_temp_c', 'total_rain_mm', 'total_snow_mm', 'is_holiday', 
        'is_holiday_prev_1', 'is_holiday_next_1', 'is_holiday_prev_2', 
        'is_holiday_next_2', 'day_of_week', 'month', 'season']]

In [18]:
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regressor': DecisionTreeRegressor(random_state=0),
    'Random Forest Regressor': RandomForestRegressor(n_estimators=100, random_state=0),
    'Gradient Boosting Regressor': GradientBoostingRegressor(n_estimators=100, random_state=0)
}

In [19]:
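# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Define the preprocessor used by every pipeline below:
# scale numerical columns, one-hot encode categorical ones, pass binary flags through
numerical_features = ['mean_temp_c', 'total_rain_mm', 'total_snow_mm', 'month']
categorical_features = ['day_of_week', 'season']
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'
)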

# Train and evaluate the models
results = {}

for model_name, model in models.items():
    # Create pipeline
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    
    # Train the model
    pipeline.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_pred = pipeline.predict(X_test)
    
    # Calculate performance metrics
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Store the results
    results[model_name] = {'MSE': mse, 'R² Score': r2}

In [20]:
# Display the results
results

{'Linear Regression': {'MSE': 1120112.4094821122,
  'R² Score': 0.5019548878268173},
 'Decision Tree Regressor': {'MSE': 2324804.485278358,
  'R² Score': -0.03369760110640829},
 'Random Forest Regressor': {'MSE': 1423006.169818243,
  'R² Score': 0.36727665770800844},
 'Gradient Boosting Regressor': {'MSE': 1202801.463574221,
  'R² Score': 0.46518815006713143}}

In [23]:
# Define the parameter grid for Random Forest Regressor
rf_param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [10, 20],
    'model__min_samples_split': [2, 5],
    'model__min_samples_leaf': [1, 2]
}

# Define the parameter grid for Gradient Boosting Regressor
gb_param_grid = {
    'model__n_estimators': [100, 200],
    'model__learning_rate': [0.01, 0.1],
    'model__max_depth': [3, 5],
    'model__min_samples_split': [2, 5]
}

# Create pipelines for the models
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(random_state=0))
])

gb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', GradientBoostingRegressor(random_state=0))
])

# Perform GridSearchCV for Random Forest Regressor
rf_grid_search = GridSearchCV(rf_pipeline, rf_param_grid, cv=5, scoring='r2', verbose=1, n_jobs=-1)
rf_grid_search.fit(X_train, y_train)

# Perform GridSearchCV for Gradient Boosting Regressor
gb_grid_search = GridSearchCV(gb_pipeline, gb_param_grid, cv=5, scoring='r2', verbose=1, n_jobs=-1)
gb_grid_search.fit(X_train, y_train)

# Get the best parameters and scores for each model
rf_best_params = rf_grid_search.best_params_
rf_best_score = rf_grid_search.best_score_

gb_best_params = gb_grid_search.best_params_
gb_best_score = gb_grid_search.best_score_

rf_best_params, rf_best_score, gb_best_params, gb_best_score


Fitting 5 folds for each of 16 candidates, totalling 80 fits
Fitting 5 folds for each of 16 candidates, totalling 80 fits


({'model__max_depth': 10,
  'model__min_samples_leaf': 2,
  'model__min_samples_split': 5,
  'model__n_estimators': 200},
 np.float64(0.41418605028285765),
 {'model__learning_rate': 0.1,
  'model__max_depth': 3,
  'model__min_samples_split': 5,
  'model__n_estimators': 100},
 np.float64(0.45097737141212557))

In [26]:
# Train the Random Forest Regressor with the best parameters
best_rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(
        n_estimators=200,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        random_state=0))
])

best_rf_pipeline.fit(X_train, y_train)

# Train the Gradient Boosting Regressor with the best parameters
best_gb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', GradientBoostingRegressor(
        learning_rate=0.1,
        n_estimators=100,
        max_depth=3,
        min_samples_split=5,
        random_state=0))
])

best_gb_pipeline.fit(X_train, y_train)

# Evaluate both models on the test set
rf_y_pred = best_rf_pipeline.predict(X_test)
gb_y_pred = best_gb_pipeline.predict(X_test)

# Calculate performance metrics
rf_mse = mean_squared_error(y_test, rf_y_pred)
rf_r2 = r2_score(y_test, rf_y_pred)

gb_mse = mean_squared_error(y_test, gb_y_pred)
gb_r2 = r2_score(y_test, gb_y_pred)

# Display the results
{'Random Forest': {'MSE': rf_mse, 'R² Score': rf_r2},
 'Gradient Boosting': {'MSE': gb_mse, 'R² Score': gb_r2}}


{'Random Forest': {'MSE': 1314337.1948173903, 'R² Score': 0.41559506877629393},
 'Gradient Boosting': {'MSE': 1223375.971764633,
  'R² Score': 0.45603993141259513}}
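
As an alternative to copying the best parameters by hand, a fitted `GridSearchCV` exposes `best_estimator_`, which (with the default `refit=True`) is already refit on the full training set. The sketch below uses synthetic data from `make_regression` and a deliberately tiny grid, so its numbers say nothing about the sales data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", GradientBoostingRegressor(random_state=0)),
])
param_grid = {"model__n_estimators": [20, 50], "model__max_depth": [2, 3]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="r2")
search.fit(X_train, y_train)

# best_estimator_ is the whole pipeline, refit with the winning parameters
best_pipeline = search.best_estimator_
test_r2 = best_pipeline.score(X_test, y_test)  # R² on the held-out test set

print("best params:", search.best_params_)
print("mean CV R²: ", search.best_score_)
print("test R²:    ", test_r2)
```

Note that `best_score_` is the mean cross-validated R² on the training folds, so it is expected to differ from the test-set R².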

In [31]:
# Define the target variable
y = ml_df['net_sales']

# Define the features with updated columns
X = ml_df[['mean_temp_c', 'total_rain_mm', 'total_snow_mm', 'is_holiday',
        'is_holiday_prev_1', 'is_holiday_next_1', 'is_holiday_prev_2',
        'is_holiday_next_2', 'day_of_week', 'month', 'season']]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Define numerical and categorical features
numerical_features = ['mean_temp_c', 'total_rain_mm', 'total_snow_mm', 'month']
categorical_features = ['day_of_week', 'season']

# Preprocessing for numerical data
num_transformer = StandardScaler()

# Preprocessing for categorical data
cat_transformer = OneHotEncoder(handle_unknown='ignore')

# Combine preprocessors in a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, numerical_features),
        ('cat', cat_transformer, categorical_features)
    ],
    remainder='passthrough'  # Keep binary features as they are
)

# Apply preprocessing to training and testing data separately
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

# Initialize the XGBoost model with reduced estimators for quicker execution
xgb_model = XGBRegressor(n_estimators=50, random_state=0)

# Train the XGBoost model on the preprocessed data
xgb_model.fit(X_train_preprocessed, y_train)

# Make predictions on the test set
xgb_y_pred = xgb_model.predict(X_test_preprocessed)

# Calculate performance metrics
xgb_mse = mean_squared_error(y_test, xgb_y_pred)
xgb_r2 = r2_score(y_test, xgb_y_pred)

# Display the results for XGBoost
{'XGBoost': {'MSE': xgb_mse, 'R² Score': xgb_r2}}

{'XGBoost': {'MSE': 1646640.7665049487, 'R² Score': 0.2678401039748596}}

In [34]:
# Define the parameter grid for Random Forest Regressor
rf_param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [10, 20, 30, None],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4]
}

# Define the parameter grid for Gradient Boosting Regressor
gb_param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__learning_rate': [0.01, 0.05, 0.1],
    'model__max_depth': [3, 5, 7],
    'model__min_samples_split': [2, 5, 10]
}

# Create pipelines for the models
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(random_state=0))
])

gb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', GradientBoostingRegressor(random_state=0))
])

# Perform RandomizedSearchCV for Random Forest Regressor
rf_random_search = RandomizedSearchCV(rf_pipeline, rf_param_grid, n_iter=20, cv=5, scoring='r2', verbose=1, n_jobs=-1, random_state=0)
rf_random_search.fit(X_train, y_train)

# Perform RandomizedSearchCV for Gradient Boosting Regressor
gb_random_search = RandomizedSearchCV(gb_pipeline, gb_param_grid, n_iter=20, cv=5, scoring='r2', verbose=1, n_jobs=-1, random_state=0)
gb_random_search.fit(X_train, y_train)

# Get the best parameters and scores for each model
rf_best_params = rf_random_search.best_params_
rf_best_score = rf_random_search.best_score_

gb_best_params = gb_random_search.best_params_
gb_best_score = gb_random_search.best_score_

rf_best_params, rf_best_score, gb_best_params, gb_best_score

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Fitting 5 folds for each of 20 candidates, totalling 100 fits


({'model__n_estimators': 100,
  'model__min_samples_split': 10,
  'model__min_samples_leaf': 4,
  'model__max_depth': 10},
 np.float64(0.4339342127867581),
 {'model__n_estimators': 100,
  'model__min_samples_split': 5,
  'model__max_depth': 3,
  'model__learning_rate': 0.1},
 np.float64(0.45097737141212557))

In [35]:
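# Retrain Random Forest Regressor with hand-set hyperparameters.
# Note: n_estimators=200 and max_depth=30 differ from rf_random_search.best_params_
# printed above (n_estimators=100, max_depth=10).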
best_rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(
        n_estimators=200,
        max_depth=30,
        min_samples_split=10,
        min_samples_leaf=4,
        random_state=0))
])

best_rf_pipeline.fit(X_train, y_train)
rf_y_pred = best_rf_pipeline.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_y_pred)
rf_r2 = r2_score(y_test, rf_y_pred)

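# Retrain Gradient Boosting Regressor with hand-set hyperparameters.
# Note: learning_rate=0.05 and min_samples_split=10 differ from gb_random_search.best_params_
# printed above (learning_rate=0.1, min_samples_split=5).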
best_gb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', GradientBoostingRegressor(
        n_estimators=100,
        learning_rate=0.05,
        max_depth=3,
        min_samples_split=10,
        random_state=0))
])

best_gb_pipeline.fit(X_train, y_train)
gb_y_pred = best_gb_pipeline.predict(X_test)
gb_mse = mean_squared_error(y_test, gb_y_pred)
gb_r2 = r2_score(y_test, gb_y_pred)

# Display the results
{'Random Forest': {'MSE': rf_mse, 'R² Score': rf_r2},
 'Gradient Boosting': {'MSE': gb_mse, 'R² Score': gb_r2}}


{'Random Forest': {'MSE': 1270920.2990006572, 'R² Score': 0.4348998926173707},
 'Gradient Boosting': {'MSE': 1172775.894501888,
  'R² Score': 0.4785386743449651}}