### __Obsolescence Risk Machine Learning Model__

__Load and Prepare the Dataset__
The dataset is loaded from a Feather file. Parts that are already obsolete (12 or more months without a sale, per months_no_sale) are excluded, and a new binary target variable, obsolescence_dummy, is created.

In [None]:
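import numpy as np
import pandas as pd

df = pd.read_feather("/Users/skylerwilson/Desktop/PartsWise/Data/Processed/parts_data.feather")

# Exclude parts that are already obsolete (12 or more months without a sale)
df = df[df['months_no_sale'] < 12].copy()
# Label the remaining parts: 1 if 6 to 11 months without a sale, otherwise 0
df['obsolescence_dummy'] = np.where(df['months_no_sale'] >= 6, 1, 0)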


__Purpose:__ Label the remaining parts with a binary target at a 6-month threshold. Parts count as obsolete after 12 months without a sale, so 6 months marks the halfway (50%) point toward obsolescence.
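
The labeling rule can be checked on a few rows. This is a minimal sketch: the toy months_no_sale values are invented for illustration, and only the 12-month exclusion and the 6-month threshold come from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the thresholds (12 and 6 months) come from the notebook
toy = pd.DataFrame({"months_no_sale": [0, 3, 5, 6, 9, 11, 12, 24]})

# Drop parts that are already obsolete (12 or more months without a sale)
toy = toy[toy["months_no_sale"] < 12].copy()

# Label the remaining parts: 1 = 6 to 11 months without a sale, 0 = 0 to 5 months
toy["obsolescence_dummy"] = np.where(toy["months_no_sale"] >= 6, 1, 0)

labels = toy["obsolescence_dummy"].tolist()
class_counts = toy["obsolescence_dummy"].value_counts().to_dict()
print(labels, class_counts)
```

Printing the class counts on the real data is a quick way to confirm the direction and size of the class imbalance before any resampling.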

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Define Features and Target Variable:__ The features and target variable are defined, excluding columns irrelevant to the prediction.

In [None]:
drop_cols = ['part_number', 'description', 'supplier_name', 
            'sales_to_stock_ratio', 'reorder_point', 'sales_last_jan', 'sales_last_feb',
            'sales_last_mar', 'sales_last_apr', 'sales_last_may', 'sales_last_jun',
            'sales_last_jul', 'sales_last_aug', 'sales_last_sep', 'sales_last_oct',
            'sales_last_nov', 'sales_last_dec', 'sales_jan', 'sales_feb',
            'sales_mar', 'sales_apr', 'sales_may', 'sales_jun', 'sales_jul',
            'sales_aug', 'sales_sep', 'sales_oct', 'sales_nov', 'sales_dec',
            'sales_revenue', 'cogs', 'cost_per_unit', 'rolling_12_month_sales',
            'price', 'safety_stock', 'months_no_sale']

X = df.drop(columns=drop_cols + ['obsolescence_dummy'])
y = df['obsolescence_dummy']


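__Purpose:__ Define the target variable and remove identifier columns and any columns that have high multicollinearity. months_no_sale is also dropped because the target is derived directly from it.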
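
One way to spot highly collinear columns is a pairwise correlation screen. The sketch below is an assumption about how such a check might look, not the procedure used to build drop_cols: the helper highly_correlated_pairs, the 0.9 cutoff, and the toy columns quantity and margin are all hypothetical.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, |r|) for numeric column pairs with |Pearson r| >= threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if r >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Synthetic data: sales_revenue is almost a multiple of quantity, margin is unrelated
rng = np.random.default_rng(0)
quantity = rng.integers(1, 50, size=200).astype(float)
toy = pd.DataFrame({
    "quantity": quantity,
    "sales_revenue": quantity * 12.5 + rng.normal(0, 1, size=200),
    "margin": rng.normal(0.3, 0.05, size=200),
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

A pairwise screen only catches two-column relationships. A variance inflation factor check would be needed to catch a column that is a combination of several others.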

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Split the Data:__ The dataset is split into training, validation, and test sets.

In [None]:
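from sklearn.model_selection import train_test_split
# 70% training; the remaining 30% is split evenly into validation and test (15% each)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)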


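__Purpose:__ Hold out separate validation and test sets so that tuning and calibration never see the test data. This guards against data leakage and checks that the model works well on unseen data.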
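
With imbalanced classes, a plain random split can leave the three sets with slightly different class ratios. The sketch below reproduces the 70/15/15 split on synthetic data and adds the stratify argument, which the cell above does not use. Treat it as a suggested variation under that assumption, not as the notebook's method.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: roughly 80% of labels fall in class 1
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.8).astype(int)

# Same two-step split as above, with stratification on the labels
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(part)}, share of class 1 = {part.mean():.3f}")
```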

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Data Preprocessing:__ Numerical features are scaled and transformed, and SMOTE is applied to handle class imbalance. Because of the large number of obsolete parts in inventory, the data is heavily imbalanced toward the 1 class, so SMOTE artificially creates more instances of the 0 class.

In [None]:
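from imblearn.over_sampling import SMOTE
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, RobustScaler

numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns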
preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([
        ('scaler', RobustScaler()),
        ('power_trans', PowerTransformer(method='yeo-johnson'))]),
     numerical_features)])

# Fit the preprocessor on the training data only, then transform it
X_train_preprocessed = preprocessor.fit_transform(X_train)

# Apply SMOTE to the training data only, after preprocessing
smote = SMOTE(random_state=42)
X_train_rs, y_train_rs = smote.fit_resample(X_train_preprocessed, y_train)

# Transform validation and test data with the already-fitted preprocessor (no resampling)
X_val_transformed = preprocessor.transform(X_val)
X_test_transformed = preprocessor.transform(X_test)


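__Purpose:__ Preprocess the data into a format that's better suited to classification ML models

-  __RobustScaler:__ removes the median and scales the data according to the quantile range (defaults to the IQR: interquartile range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile), so outliers have little influence on the scaling.
-  __ColumnTransformer:__ applies the scaling and power-transform pipeline to the selected numerical columns. With the default remainder setting, columns that are not listed are dropped from the output.
-  __Yeo-Johnson Transformation:__ a power transformation that reduces skew and stabilizes variance so that each feature is closer to a normal distribution. Unlike Box-Cox, it also handles zero and negative values.
-  __SMOTE:__ an oversampling technique that generates synthetic samples of the minority class by interpolating between a minority sample and its nearest minority-class neighbors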
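
To make the scaling and oversampling steps concrete, the sketch below reimplements both by hand in NumPy on made-up numbers. It is a simplified illustration, not the imblearn or scikit-learn implementation: the helper smote_like_sample is hypothetical, and it shows only the core interpolation idea behind SMOTE.

```python
import numpy as np

# Robust scaling by hand: (x - median) / IQR, so the outlier does not set the scale
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
q1, median, q3 = np.percentile(x, [25, 50, 75])
x_scaled = (x - median) / (q3 - q1)

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(minority))
    distances = np.linalg.norm(minority - minority[i], axis=1)
    # Position 0 after sorting is the point itself (assumes no duplicate rows)
    neighbors = np.argsort(distances)[1:k + 1]
    j = rng.choice(neighbors)
    gap = rng.random()  # position along the segment, in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

# Four hypothetical minority-class points at the corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = np.array(
    [smote_like_sample(minority, rng=np.random.default_rng(seed)) for seed in range(20)])

print(x_scaled)
print(synthetic[:3])
```

Every synthetic point lies on a line segment between two real minority points, which is why SMOTE should be applied to the training set only: interpolated points in the validation or test sets would no longer represent real parts.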

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Hyperparameter Tuning:__ Hyperparameters for the RandomForestClassifier are defined and optimized using RandomizedSearchCV

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': np.arange(200, 501, 100),  # 200, 300, 400, 500
    'max_depth': list(np.arange(2, 10, 1)),  # 2 through 9; shallow trees to limit overfitting
    'min_samples_split': np.arange(5, 25, 5),  # 5, 10, 15, 20
    'min_samples_leaf': np.arange(5, 25, 5),  # 5, 10, 15, 20
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True]
}

model = RandomForestClassifier(random_state=42)
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_distributions,
    n_iter=1000,  # You can adjust the number of iterations
    cv=5,  # Cross-validation setting
    verbose=2,
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train_rs, y_train_rs)

# Best model and parameters from the randomized search
best_model = random_search.best_estimator_
print(best_model)
best_params = random_search.best_params_


__Hyperparameters:__
-  __n_estimators:__ This represents the number of trees in the random forest. A higher number of trees generally improves the performance but also increases the computational cost. Here, it's set to a range from 200 to 500, with steps of 100.
-  __max_depth:__ This parameter determines the maximum depth of each tree in the forest. A deeper tree can model more complex patterns but might also lead to overfitting. The values here run from 2 to 9.
-  __min_samples_split:__ This parameter specifies the minimum number of samples required to split an internal node. Setting a higher value can prevent the model from learning overly specific patterns (overfitting). The values here are 5, 10, 15, and 20.
-  __min_samples_leaf:__ This represents the minimum number of samples required to be at a leaf node. This helps control overfitting by ensuring that leaf nodes have a minimum number of observations. The values here are 5, 10, 15, and 20.
-  __max_features:__ This parameter specifies the number of features to consider when looking for the best split. Common values are 'sqrt' (square root of the number of features) and 'log2' (log base 2 of the number of features). These values help reduce the correlation between trees in the forest.
-  __bootstrap:__ This boolean parameter determines whether bootstrap samples are used when building trees. If set to True, each tree is built from a random sample with replacement from the training set.

__Randomized Search__: performs hyperparameter optimization by randomly sampling from the defined parameter distributions
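
It helps to compare n_iter with the size of the search space. The arithmetic below uses the same parameter lists as the cell above. When every entry is a finite list, the number of distinct combinations is the product of the list lengths.

```python
import math
import numpy as np

# Same search space as the tuning cell
param_distributions = {
    'n_estimators': np.arange(200, 501, 100),
    'max_depth': list(np.arange(2, 10, 1)),
    'min_samples_split': np.arange(5, 25, 5),
    'min_samples_leaf': np.arange(5, 25, 5),
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True],
}

# Number of distinct combinations: 4 * 8 * 4 * 4 * 2 * 1
n_combinations = math.prod(len(values) for values in param_distributions.values())

n_iter, cv = 1000, 5
coverage = n_iter / n_combinations  # fraction of the space that gets sampled
total_fits = n_iter * cv            # cross-validation fits, excluding the final refit

print(n_combinations, round(coverage, 3), total_fits)
```

With 1,000 of the 1,024 combinations sampled, the search is close to exhaustive. Under that reading, either a much smaller n_iter or an exhaustive grid search would be a reasonable alternative, depending on the compute budget.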

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Feature Selection:__ Recursive feature elimination with cross-validation (RFECV) is used to select the most important features.

In [None]:
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

selector = RFECV(estimator=best_model, step=1, cv=StratifiedKFold(10), scoring='f1')
selector = selector.fit(X_train_rs, y_train_rs)

# Get the selected features. The preprocessor outputs only the numerical columns,
# so names are taken from numerical_features rather than from every column of X_train.
feature_names = list(numerical_features)
selected_features = [name for name, keep in zip(feature_names, selector.support_) if keep]

# Transform training, validation, and test data to keep only the selected features
X_train_selected = selector.transform(X_train_rs)
X_val_selected = selector.transform(X_val_transformed)
X_test_selected = selector.transform(X_test_transformed)


__Purpose:__ Perform feature selection to remove insignificant features and help prevent overfitting

__RFECV:__ Recursive feature elimination with cross-validation. It repeatedly removes the least important feature and uses the cross-validated score (here F1) to choose how many features to keep

__Stratified K Fold:__ Cross validation technique where folds are made by preserving the percentage of samples for each class.
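
The sketch below runs the same selection pattern on synthetic data. To keep it fast, it swaps in LogisticRegression for the tuned random forest and uses 5 folds instead of 10. The dataset and the feature_0 to feature_7 names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced data with 3 informative features out of 8
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, weights=[0.3, 0.7], random_state=0)
feature_names = np.array([f"feature_{i}" for i in range(X.shape[1])])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
selector = RFECV(estimator=LogisticRegression(max_iter=1000),
                 step=1, cv=cv, scoring="f1")
selector.fit(X, y)

# support_ is a boolean mask aligned with the columns passed to fit
selected = feature_names[selector.support_]
X_selected = selector.transform(X)

# Stratification keeps the share of class 1 nearly identical in every test fold
fold_shares = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]

print(selected)
print([round(share, 3) for share in fold_shares])
```

Because support_ follows the column order of the array given to fit, the list of names must come from that same array's columns, in the same order.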

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

__Model Training and Calibration:__ The best model is retrained on the selected features and calibrated using the validation data. Calibration adjusts the probabilities predicted by the classifier so that they better match the observed outcome frequencies.

In [None]:
from sklearn.calibration import CalibratedClassifierCV

# Re-train the classifier on the selected features (this refits best_model in place)
final_model = best_model.fit(X_train_selected, y_train_rs)

# Calibrate the prefit model on the validation set with sigmoid (Platt) scaling
calibrator = CalibratedClassifierCV(final_model, cv='prefit', method='sigmoid')
calibrated_model = calibrator.fit(X_val_selected.astype(np.float64), y_val.astype(np.float64))

# Predict calibrated probabilities of class 1 on the test set
y_test_proba = calibrated_model.predict_proba(X_test_selected)[:, 1].astype(np.float64)

# Alternative calibration with isotonic regression, fitted on the same validation set
calibrator_ = CalibratedClassifierCV(final_model, cv='prefit', method='isotonic')
calibrated_model_ = calibrator_.fit(X_val_selected.astype(np.float64), y_val.astype(np.float64))


__Purpose:__ Adjust the predicted probabilities so they are more representative of the true likelihood of an event.

-  __Sigmoid:__ (Platt scaling) fits a logistic regression model to the scores output by the classifier. This method assumes that the relationship between the classifier's scores and the true likelihoods follows a sigmoid curve.
-  __Isotonic:__ non-parametric calibration method that fits a piecewise constant non-decreasing function to the predicted probabilities. It is more flexible than the sigmoid method but can overfit when the calibration set is small.
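
The two methods can be compared on held-out data with the Brier score (lower is better). The sketch below builds both calibrators by hand on synthetic data to show what each one fits. It approximates what CalibratedClassifierCV does and is not a replacement for it: the dataset, the forest settings, and the large C value (used to make the logistic fit nearly unregularized) are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic data split into training, calibration (validation), and test sets
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
raw_val = forest.predict_proba(X_val)[:, 1]
raw_test = forest.predict_proba(X_test)[:, 1]

# Sigmoid: a one-feature logistic regression from raw score to probability
sigmoid = LogisticRegression(C=1e6).fit(raw_val.reshape(-1, 1), y_val)
p_sigmoid = sigmoid.predict_proba(raw_test.reshape(-1, 1))[:, 1]

# Isotonic: a non-decreasing mapping from raw score to probability
isotonic = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(raw_val, y_val)
p_isotonic = isotonic.predict(raw_test)

for name, p in [("raw", raw_test), ("sigmoid", p_sigmoid), ("isotonic", p_isotonic)]:
    print(f"{name}: Brier score = {brier_score_loss(y_test, p):.4f}")
```

Both calibrators are fitted on the validation scores only and evaluated on the test set, mirroring the split of roles in the notebook.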