### __Demand Prediction Model Documentation__
This documentation describes the data preprocessing, feature selection, model training, and evaluation steps used to build a demand prediction model with XGBoost. Each step is designed to handle the complexities of the data and improve the model's predictive power.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Data Loading:__

In [None]:
import pandas as pd

df = pd.read_feather("/Users/skylerwilson/Desktop/PartsWise/Data/Processed/parts_data.feather")

__Purpose__: Load the dataset from a Feather file for efficient data reading and processing.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Feature and Target Selection__

In [None]:
# Select the features (X) and target (y) for modeling
feature_cols = [col for col in df.columns if col not in {'part_number', 'description', 'supplier_name',
                                                         'sales_last_jan','sales_last_feb', 'sales_last_mar', 'sales_last_apr', 'sales_last_may',
                                                         'sales_last_jun', 'sales_last_jul', 'sales_last_aug', 'sales_last_sep',
                                                         'sales_last_oct', 'sales_last_nov', 'sales_last_dec', 'sales_jan',
                                                         'sales_feb', 'sales_mar', 'sales_apr', 'sales_may', 'sales_jun', 
                                                         'sales_jul', 'sales_aug', 'sales_sep', 'sales_oct', 'sales_nov', 
                                                         'sales_dec', 'sales_this_year', 'sales_last_year', 'sales_revenue',
                                                         'price', 'sales_to_stock_ratio', 'rolling_12_month_sales', 'cogs',
                                                         'margin', 'quantity', 'demand'}]
X = df[feature_cols]
y = df['rolling_12_month_sales']


__Purpose:__ Select the features (X) by excluding the identifier columns and the sales, pricing, and demand columns listed above, and set the target variable (y) to rolling_12_month_sales. Rolling 12-month sales is used to gauge demand for each part, based on the number of sales on a rolling basis.
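
The exclusion pattern can be illustrated on a toy table. This is a minimal sketch, not the real dataset: the rows are made up, and the columns `lead_time_days` and `stock_on_hand` are hypothetical stand-ins for whatever feature columns remain after the exclusion.

```python
import pandas as pd

# Hypothetical miniature of the parts table. 'lead_time_days' and
# 'stock_on_hand' are invented feature names used only for illustration.
toy = pd.DataFrame({
    "part_number": ["A1", "B2", "C3"],
    "sales_jan": [4, 0, 9],
    "rolling_12_month_sales": [40, 3, 95],
    "lead_time_days": [7, 21, 14],
    "stock_on_hand": [12, 1, 30],
})

# Same pattern as above: keep every column that is not in the excluded set.
excluded = {"part_number", "sales_jan", "rolling_12_month_sales"}
feature_cols = [col for col in toy.columns if col not in excluded]

X_toy = toy[feature_cols]
y_toy = toy["rolling_12_month_sales"]

print(feature_cols)  # ['lead_time_days', 'stock_on_hand']
```

Keeping the target column in the excluded set matters: if it stayed in X, the model would simply read the answer from its inputs.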


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Train-Test Split:__

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, shuffle=True)


__Purpose:__ Split the dataset into training (70%) and testing (30%) sets to evaluate model performance on unseen data.

__Hyperparameters:__
-  test_size=0.3: 30% of the data is held out for testing.
-  random_state=42: Fixes the shuffle seed so the split is reproducible.
-  shuffle=True: Shuffles the rows before splitting.
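
As a quick check of what these settings do, the sketch below splits a small synthetic array twice with the same arguments. The arrays `X_demo` and `y_demo` are made up for illustration and are not part of the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two splits with identical arguments, including the same random_state.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42, shuffle=True)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42, shuffle=True)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# With a fixed seed, both calls return exactly the same partition.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True
```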

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Preprocessing Pipeline:__

In [None]:
from sklearn.preprocessing import RobustScaler, PowerTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('scaler', RobustScaler()),
            ('power_trans', PowerTransformer(method='yeo-johnson'))]),
        numerical_features)
    ])

X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)


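__Purpose:__ Apply robust scaling followed by a power transformation to the numerical features. The preprocessor is fit on the training set only and then applied to the test set, so no test-set statistics leak into training.
-  __Robust scaler:__ subtracts the median and scales the data by a quantile range (by default the IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile), which makes the scaling less sensitive to outliers than mean and standard deviation scaling.
-  __Yeo-Johnson transformation:__ a monotonic power transformation that reduces skew and stabilizes variance, making each feature's distribution closer to Gaussian. Unlike Box-Cox, it also handles zero and negative values, which occur after robust scaling.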
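
The two transformations can be checked on synthetic data. This is a minimal sketch assuming a right-skewed (log-normal) feature, which is an illustrative choice and not a claim about the real columns; the `skewness` helper is defined here only for the demonstration.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, RobustScaler

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # synthetic right-skewed feature

# RobustScaler by hand: subtract the median, divide by the IQR.
q1, med, q3 = np.percentile(x, [25, 50, 75])
manual = (x - med) / (q3 - q1)
scaled = RobustScaler().fit_transform(x)
print(np.allclose(manual, scaled))

# Yeo-Johnson on the scaled values (which now include negatives).
transformed = PowerTransformer(method="yeo-johnson").fit_transform(scaled)

def skewness(a):
    """Sample skewness: third central moment divided by std cubed."""
    a = a.ravel()
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)

print(f"skew after scaling:     {skewness(scaled):.2f}")
print(f"skew after Yeo-Johnson: {skewness(transformed):.2f}")
```

Robust scaling only shifts and rescales, so the skew is unchanged by it. The power transformation is the step that reshapes the distribution.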

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Hyperparameter Space for Hyperopt__

In [None]:
import numpy as np
from hyperopt import hp

space = {
    'objective': 'reg:pseudohubererror',
    'colsample_bytree': hp.uniform('colsample_bytree', 0.4, 1.0),
    'gamma': hp.uniform('gamma', 0.25, 1.0),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.05)),
    'max_depth': hp.quniform('max_depth', 5, 15, 1),
    'min_child_weight': hp.quniform('min_child_weight', 3, 15, 1),
    'n_estimators': hp.quniform('n_estimators', 350, 750, 10),
    'reg_alpha': hp.loguniform('reg_alpha', np.log(0.0001), np.log(1)),
    'reg_lambda': hp.loguniform('reg_lambda', np.log(1), np.log(3)),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
    'max_delta_step': hp.quniform('max_delta_step', 5, 10, 1),
    'huber_slope': hp.uniform('huber_slope', 0.2, 0.3),
}


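__Purpose:__ Define the search space for hyperparameter optimization using Hyperopt. hp.uniform samples uniformly within a range, hp.loguniform samples uniformly on a log scale, and hp.quniform samples values rounded to a step size. hp.quniform returns floats, which is why the objective function casts the integer-valued parameters.

__Hyperparameters:__
-  objective: 'reg:pseudohubererror', fixed rather than searched. Pseudo-Huber loss is a smooth approximation of Huber loss that is less sensitive to outliers than squared error.
-  colsample_bytree: Fraction of features (columns) sampled for each tree.
-  gamma: Minimum loss reduction required to make a further partition.
-  learning_rate: Step size shrinkage used to prevent overfitting.
-  max_depth: Maximum depth of a tree.
-  min_child_weight: Minimum sum of instance weight needed in a child.
-  n_estimators: Number of boosting rounds.
-  reg_alpha: L1 regularization term on weights.
-  reg_lambda: L2 regularization term on weights.
-  subsample: Fraction of training samples (rows) used for each tree.
-  max_delta_step: Maximum delta step allowed for each leaf output.
-  huber_slope: The delta parameter of the Pseudo-Huber loss, which sets where the loss changes from roughly quadratic to roughly linear.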
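
To make the role of the Pseudo-Huber objective more concrete, the sketch below evaluates the textbook Pseudo-Huber formula with NumPy. It assumes that huber_slope plays the role of delta in that formula; it does not call XGBoost, and the residual values are arbitrary.

```python
import numpy as np

def pseudo_huber(residual, delta):
    """Textbook Pseudo-Huber loss: delta^2 * (sqrt(1 + (r / delta)^2) - 1)."""
    r = np.asarray(residual, dtype=float)
    return delta ** 2 * (np.sqrt(1.0 + (r / delta) ** 2) - 1.0)

delta = 0.25  # a value inside the huber_slope range searched above
residuals = np.array([0.001, 0.1, 1.0, 10.0, 100.0])

for r, loss in zip(residuals, pseudo_huber(residuals, delta)):
    # Small residuals behave like r^2 / 2; large ones like delta * (|r| - delta).
    print(f"r={r:>7}: pseudo-Huber={loss:.6f}  squared/2={r**2 / 2:.6f}")
```

For large residuals the loss grows linearly rather than quadratically, so a few parts with extreme sales do not dominate training the way they would under squared error.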

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__KFold Cross-Validation:__

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

__Purpose:__ Use KFold cross-validation to evaluate the model's performance.

__Hyperparameters:__
-  n_splits=5: Number of folds.
-  shuffle=True: Shuffle the data before splitting into folds.
-  random_state=42: Ensures reproducibility.
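
The fold structure these settings produce can be inspected directly. This is a small sketch on a synthetic 10-row array (`X_demo` is made up); it shows that each row lands in exactly one validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # synthetic stand-in data
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

val_indices = []
for fold, (train_idx, val_idx) in enumerate(kf_demo.split(X_demo)):
    # Each fold trains on 8 rows and validates on the remaining 2.
    print(f"fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
    val_indices.extend(val_idx.tolist())

# Across the 5 folds, every row is used for validation exactly once.
print(sorted(val_indices) == list(range(10)))  # True
```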

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Objective Function for Hyperopt:__

In [None]:
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
from hyperopt import STATUS_OK

def objective(params):
    params['n_estimators'] = int(params['n_estimators'])
    params['max_depth'] = int(params['max_depth'])
    params['min_child_weight'] = int(params['min_child_weight'])
    params['max_delta_step'] = int(params['max_delta_step']) 
    
    model = XGBRegressor(**params)
    # Reuse the shuffled 5-fold splitter (kf) defined above; cv=5 would build unshuffled folds
    scores = cross_val_score(model, X_train_transformed, y_train, scoring='neg_mean_absolute_error', cv=kf)
    return {'loss': -scores.mean(), 'status': STATUS_OK}

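__Purpose:__ Define the objective function that Hyperopt minimizes. scikit-learn scorers follow a higher-is-better convention, so cross_val_score returns the negated MAE for each fold. The function negates the mean of those scores and returns the cross-validated mean absolute error (MAE) as the loss.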
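
The sign convention is easy to get backwards, so here is a minimal sketch of it. It uses LinearRegression on synthetic data as a lightweight stand-in for the XGBRegressor; the data, coefficients, and variable names are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (stand-in for the transformed training data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         scoring="neg_mean_absolute_error", cv=cv)

print(scores)          # one negated MAE per fold, all <= 0
loss = -scores.mean()  # flip the sign to get a positive MAE to minimize
print(loss)
```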

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Hyperparameter Optimization with Hyperopt__

In [None]:
from hyperopt import fmin, tpe, Trials

trials = Trials()
best_hyperparams = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=500, trials=trials)
print("Best Hyperparameters:", best_hyperparams)

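__Purpose:__ Perform hyperparameter optimization using the Tree-structured Parzen Estimator (TPE) algorithm. fmin returns the best value found for each hp expression in the search space, keyed by its label. Fixed entries such as objective are not included in the result, and hp.quniform values are returned as floats.

__Hyperparameters:__

__max_evals=500:__ Maximum number of evaluations of the objective function.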

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Recursive Feature Elimination with Cross-Validation (RFECV):__

In [None]:
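from sklearn.feature_selection import RFECV

# Assumption: `model` (not defined in the original) is an XGBRegressor rebuilt from best_hyperparams.
# fmin returns hp.quniform values as floats and omits the fixed objective, so both are restored here.
int_params = {'n_estimators', 'max_depth', 'min_child_weight', 'max_delta_step'}
best_params = {k: int(v) if k in int_params else v for k, v in best_hyperparams.items()}
model = XGBRegressor(objective='reg:pseudohubererror', **best_params)
rfecv = RFECV(estimator=model, step=1, cv=KFold(10), scoring='neg_mean_absolute_error')
rfecv.fit(X_train_transformed, y_train)
selected_features_mask = rfecv.support_
feature_ranking = rfecv.ranking_
selected_features = [feature for feature, selected in zip(numerical_features, selected_features_mask) if selected]
print(f"Optimal number of features: {rfecv.n_features_}")
# Assumed definition of the *_rfe matrices used in later cells: keep only the selected columns.
X_train_transformed_rfe = rfecv.transform(X_train_transformed)
X_test_transformed_rfe = rfecv.transform(X_test_transformed)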

__Purpose:__ Perform feature selection using RFECV to select the best subset of features.

__Hyperparameters:__

__step=1:__ Number of features to remove at each iteration.

__cv=KFold(10):__ 10-fold cross-validation.

__scoring='neg_mean_absolute_error':__ Scoring metric.
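
The mechanics of RFECV can be tried on a small synthetic problem. The sketch below swaps in LinearRegression for the XGBoost model to keep it fast, and uses make_regression data with 3 informative features out of 8; both choices are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data: 8 features, of which only 3 carry signal.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3,
                                 noise=5.0, random_state=0)

selector = RFECV(
    estimator=LinearRegression(),  # stand-in estimator; any model exposing coef_ or feature_importances_ works
    step=1,                        # drop one feature per elimination round
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_absolute_error",
)
selector.fit(X_demo, y_demo)

print("features kept:", selector.n_features_)
print("support mask: ", selector.support_)
print("ranking:      ", selector.ranking_)  # selected features have rank 1

# transform() keeps only the selected columns.
X_reduced = selector.transform(X_demo)
print(X_reduced.shape)
```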

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Model Training with Best Hyperparameters:__

In [None]:
model.fit(X_train_transformed_rfe, y_train)

__Purpose:__ Train the XGBoost model using the best hyperparameters and selected features.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Model Evaluation__

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred = model.predict(X_test_transformed_rfe)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('\nModel Performance')
print(f"Test MSE: {mse}")
print(f"Test RMSE: {rmse}")
print(f"Test MAE: {mae}")
print(f"Test R² Score: {r2}")

__Purpose:__ Evaluate the model's performance on the held-out test set using several metrics.

__Metrics:__

__Mean Squared Error (MSE):__ The average of the squared errors, where each error is the difference between the actual and predicted value. Because the errors are squared, large errors are penalized more heavily. Lower MSE indicates better model performance.

__Root Mean Squared Error (RMSE):__ The square root of MSE, which expresses the error in the same units as the target variable.

__Mean Absolute Error (MAE):__ The average magnitude of the errors in a set of predictions, without considering their direction. It is less sensitive to large errors than MSE and RMSE.

__R² Score:__ The proportion of the variance in the target that the model explains. A score of 1 indicates perfect predictions, 0 matches a model that always predicts the mean of the target, and negative values indicate a worse fit than that baseline.
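
The four metrics can be computed by hand on a tiny example to see how they relate. The values in `y_true` and `y_hat` below are made up for illustration and are not model output.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])  # hypothetical actual values
y_hat = np.array([2.0, 5.0, 4.0, 8.0])   # hypothetical predictions

errors = y_true - y_hat                   # [ 1.  0. -2. -1.]
mse = np.mean(errors ** 2)                # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                       # about 1.2247
mae = np.mean(np.abs(errors))             # (1 + 0 + 2 + 1) / 4 = 1.0

# R² compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(errors ** 2)                      # 6.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 14.75
r2 = 1.0 - ss_res / ss_tot                        # about 0.5932

print(mse, rmse, mae, r2)
```

RMSE is larger than MAE here because the single error of 2 is weighted more heavily once it is squared.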