<a href="https://colab.research.google.com/github/wikeyen/ox_ml_for_biz/blob/master/ML_for_Business_Assessment_HT25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning for Business - Assessment Notebook HT25

This is the assignment for the class. You can go through the code by executing each of the cells. We start the notebook by describing the setting and providing you with the context of the exercise. At the end, you will find the questions that you need to answer and further information on your final deliverable.

### Please enter your candidate number below:

*(Enter your candidate number here)*



---



# 1. The Assignment Setting

Our assignment is centered around a major problem in the management of retailer inventory. Specifically, we will explore the application of machine learning techniques as a method to identify and correct inventory record errors.

## Your Role in this Exercise

Imagine that you serve as the head of the Inventory Division of a leading grocery retailer with about 450 large stores across Europe. Your team is responsible for managing the inventory operations of the organization. One of the major challenges that your team aims to address is the issue of inventory errors.

## The Business Challenge: Inventory Errors

Ensuring the availability of fast-moving items is essential for grocery retailers like your company. To this end, maintaining accurate inventory records has been, and remains, a central problem for managing retail operations. In many cases, however, retailers are unable to make accurate reordering decisions because their inventory records (i.e., the recorded amount of stock available on the shelf) are inaccurate. These discrepancies between the physical and recorded stock lead to poor reordering decisions and to the over- or understocking of products, which increases waste or lost sales, respectively.

## An Opportunity to Apply Machine Learning

You recently hired a data scientist (Anna) in your team to explore how data-driven tools, like machine learning, could improve business operations. After six months of work, Anna is called to present the progress she has made. She argues that machine learning can address the inventory error problem through models that predict the likelihood that a record contains a discrepancy between the amount of stock registered in the inventory records and the amount of stock actually on the shelf. She claims that being able to predict potential discrepancies will allow stores to correct emerging out-of-stock scenarios quickly by reordering products accordingly.

Specifically, she shares with you the following notebook, suggesting that she can now predict potential errors in the inventory stock at the SKU level. In her work, she is using a dataset that was created by your team that includes information at the SKU (stock-keeping unit) level about actual sales, forecast sales, and product size, among others.

Currently, Anna is the only member of your team who is an expert on ML. Given that you are her direct manager and gained experience in Analytics and Machine Learning during your MBA at SBS, you have asked Anna to share her coding notebook with you in advance so that you can look into it in preparation for your meeting.


## Your Task

Run the Python code in the following cells, as we do in the classroom, and try to get a good understanding of it. This is the first part of your assignment. You can use BARD or ChatGPT if you encounter any difficulties in interpreting parts of the code. However, you are expected to conduct the review and write the report on your own. Once you have run all the cells, your assignment will be to answer the questions described at the end of the notebook and submit your report to us.


# 2. Setup

### Import libraries

Import libraries for managing data structures and plotting figures

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Import the dataset

Note that all the values for continuous variables have been normalized to facilitate the algorithm training process. Thus, they do not reflect the actual numbers that were recorded by the business.

In [None]:
data = pd.read_csv("https://drive.google.com/uc?id=15akNjGB2rSjQwz7sXsWevZk0xDqarNja", encoding="utf-8")

The **dictionary** for the dataset is located here: [DataDictionary](https://docs.google.com/document/d/1AkvZHloktsCa6Tu0p7IkcXpyOM4zMGuJ/edit?usp=sharing&ouid=107468850368923160966&rtpof=true&sd=true)

In [None]:
data.head(6)

# 3. Data exploration

### Numerical exploration

Number of columns and rows in the dataset

In [None]:
print('Data size : ', data.shape)

Checking for missing values

In [None]:
print('Null values per column : \n', data.isnull().sum())

Get an overview of all the variables in the dataset.

In [None]:
data.describe()

Removing rows that contain missing values (NAs).

In [None]:
data = data.dropna()

In [None]:
print('\nBalance of positive and negative error classes (%): \n',
      data['stock_error'].value_counts(normalize=True) * 100)
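The class balance printed above matters when reading accuracy figures later on. As a minimal sketch on hypothetical toy labels (the values below are made up and are not taken from the dataset), the following shows how `value_counts(normalize=True)` reports the balance and why a model that always predicts the majority class can still look accurate:

```python
import pandas as pd

# Hypothetical toy labels (assumption): 1 = inventory record error, 0 = no error
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="stock_error")

# Share of each class in percent, computed the same way as in the cell above
balance_pct = toy_target.value_counts(normalize=True) * 100
print(balance_pct)

# Accuracy of a naive model that always predicts the majority class
majority_class = toy_target.mode()[0]
naive_accuracy = (toy_target == majority_class).mean()
print("Naive accuracy:", naive_accuracy)
```

On imbalanced data, this naive accuracy is a useful baseline to keep in mind when interpreting the metrics reported further down.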

### Splitting the data



In [None]:
from sklearn.model_selection import train_test_split

Separating the independent variables (features) from the target variable.

In [None]:
X = data.drop(['stock_error'], axis = 1)
target = data['stock_error']

Let's perform one-hot-encoding to ensure that all the variables in the dataset are in numerical format.

In [None]:
cat = X.select_dtypes(include='O').keys()
X_new = pd.get_dummies(X, columns = cat, drop_first=True)
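To make the effect of the one-hot-encoding step concrete, here is a small sketch on a hypothetical two-column table (the column names `size` and `category` are assumptions for illustration only). With `drop_first=True`, the first category level is dropped and acts as the reference level:

```python
import pandas as pd

# Hypothetical toy data (assumption): one numeric and one categorical column
toy = pd.DataFrame({
    "size": [0.2, 0.5, 0.9],
    "category": ["dairy", "bakery", "produce"],
})

# One-hot encode the categorical column; drop_first=True removes the first
# level in sorted order ("bakery"), which becomes the implicit reference level
encoded = pd.get_dummies(toy, columns=["category"], drop_first=True)
print(encoded)
print(list(encoded.columns))
```

A row where all remaining dummy columns are zero (or False) therefore belongs to the dropped reference level.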

Splitting the data into our training and testing data sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_new,
                                                    target,
                                                    test_size = 0.3,
                                                    random_state = 44,
                                                    stratify=target)

Check: how many observations are included in our training and testing sets?

In [None]:
print(X_train.shape)
print(X_test.shape)
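The `stratify=target` argument used in the split keeps the share of positive cases the same in both subsets. A minimal sketch on hypothetical toy data (100 rows with 10% positives, chosen purely for illustration) shows this behaviour:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data (assumption): 90 negatives and 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=44, stratify=y_toy
)

# Sizes of the two subsets and the share of positives in each
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

Without stratification, a random split of imbalanced data could leave the test set with noticeably more or fewer positives than the training set.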

# 4. Training of Algorithm # 1

### Loading the model

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf = RandomForestClassifier(random_state=22)

Training the model

In [None]:
clf.fit(X_train, y_train)

## Making predictions

We set a classification threshold first.


In [None]:
t=0.5
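The threshold converts predicted probabilities into 0/1 labels, so changing it changes how many records are flagged as errors. A minimal sketch with hypothetical probabilities (not model output) illustrates the same `np.where(... > t, 1, 0)` rule that is used later in the notebook:

```python
import numpy as np

# Hypothetical predicted probabilities of a stock error (assumption)
toy_probs = np.array([0.10, 0.35, 0.50, 0.65, 0.90])

# Count how many records are flagged at different thresholds
flagged_counts = {}
for threshold in (0.3, 0.5, 0.7):
    flagged = np.where(toy_probs > threshold, 1, 0)
    flagged_counts[threshold] = int(flagged.sum())
    print(threshold, flagged)

print(flagged_counts)
```

Because the comparison is strict, a probability exactly equal to the threshold is not flagged. A lower threshold flags more records (fewer missed errors, more false alarms), and a higher threshold does the opposite.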

### Predictions on the training set


Computing ROC-AUC

In [None]:
from sklearn.metrics import roc_auc_score, confusion_matrix

In [None]:
prob_est_train = clf.predict_proba(X_train)
roc_train = roc_auc_score(y_train, prob_est_train[:, 1])
print('The {} has an ROC-AUC on the training set of {}'.format('Random Forest', roc_train))
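As a reminder of what ROC-AUC measures, it can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below checks this interpretation on four hypothetical scores (toy values, not model output) against `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores (assumption)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

auc_sklearn = roc_auc_score(y_true, scores)

# Manual check: share of (positive, negative) pairs ranked correctly,
# counting ties as half
pos_scores = scores[y_true == 1]
neg_scores = scores[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores]
auc_pairs = float(np.mean(pairs))

print(auc_sklearn, auc_pairs)
```

Note that ROC-AUC depends only on the ranking of the scores, so it does not change with the classification threshold `t`.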

Plotting the Confusion Matrix

In [None]:
y_pred_train_rf = np.where(prob_est_train[:,1] > t, 1, 0)
cm_rf_train = confusion_matrix(y_true=y_train, y_pred=y_pred_train_rf)

In [None]:
%matplotlib inline
plt.figure(figsize=(5, 5))
sns.heatmap(cm_rf_train, annot=True, fmt="d")
plt.title('Confusion matrix for {}'.format('Random Forest on Training Set'))
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

### Predictions on the testing set

In [None]:
prob_est_test_rf = clf.predict_proba(X_test)
roc_test_rf = roc_auc_score(y_test, prob_est_test_rf[:, 1])
print('The {} has an ROC-AUC on the testing set of {}'.format('Random Forest', roc_test_rf))

Plotting the Confusion Matrix

In [None]:
y_pred_test_rf = np.where(prob_est_test_rf[:,1] > t, 1, 0)
cm_rf_test = confusion_matrix(y_true=y_test, y_pred = y_pred_test_rf)

In [None]:
%matplotlib inline
plt.figure(figsize=(5, 5))
sns.heatmap(cm_rf_test, annot=True, fmt="d")
plt.title('Confusion matrix for {}'.format('Random Forest on Testing Set'))
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Calculating further metrics

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, y_pred_test_rf))
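The precision, recall, and F1 values in the classification report follow directly from the confusion matrix counts. The sketch below recomputes them for the positive class from hypothetical counts (made-up numbers, not the notebook's results):

```python
# Hypothetical confusion matrix counts (assumption)
tp, fp, fn, tn = 30, 20, 10, 940

precision = tp / (tp + fp)          # share of flagged records that are real errors
recall = tp / (tp + fn)             # share of real errors that are flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

In this toy case accuracy is high even though 40% of the flagged records are false alarms, which is why precision and recall are worth reading alongside accuracy on imbalanced data.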

Deep-dive into model predictions. Which features are driving the model?

In [None]:
feature_importances = clf.feature_importances_

# Create a DataFrame for better visualization
importance_df = pd.DataFrame({
    "Feature": X_train.columns,
    "Importance": feature_importances
}).sort_values(by="Importance", ascending=False)
print(importance_df[0:15])
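The `feature_importances_` attribute of a random forest is impurity-based: the values are normalised to sum to 1 and are computed from the training data. As a minimal sketch on synthetic data (generated with `make_classification`, so unrelated to the inventory dataset), the following also computes permutation importance on held-out data, which is a different technique from the one used above and is shown here only as a possible cross-check:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic toy data (assumption): 5 features, 2 of them informative
X_toy, y_toy = make_classification(n_samples=300, n_features=5, n_informative=2,
                                   n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=0, stratify=y_toy)

toy_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances: one value per feature, summing to 1
impurity_importance = toy_clf.feature_importances_
print(impurity_importance, impurity_importance.sum())

# Permutation importance: drop in test score when a feature is shuffled
perm = permutation_importance(toy_clf, X_te, y_te, n_repeats=5, random_state=0)
print(perm.importances_mean)
```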

Let's use the SHAP framework. This step can take several minutes to run.

In [None]:
import shap
clf.set_params(n_jobs=-1)
explainer = shap.TreeExplainer(clf, approximate=True)

sample_size = 50  # number of test rows sampled for the SHAP analysis


sample_indices = np.random.choice(len(X_test), sample_size, replace=True)
X_sample = X_test.iloc[sample_indices].copy()
shap_values_sample = explainer.shap_values(X_sample)


In [None]:
plt.title("SHAP Summary Plot for Random Forest")
shap.summary_plot(shap_values_sample[:, :, 1] ,X_sample, feature_names=X_sample.columns)

# 5. Training of Algorithm # 2

In [None]:
import tensorflow as tf
import keras.metrics

### Defining the metrics for the evaluation

In [None]:
METRICS = [
    keras.metrics.TruePositives(name='tp'),
    keras.metrics.FalsePositives(name='fp'),
    keras.metrics.TrueNegatives(name='tn'),
    keras.metrics.FalseNegatives(name='fn'),
    keras.metrics.BinaryAccuracy(name='accuracy'),
    keras.metrics.AUC(name='auc'),
    keras.metrics.AUC(name='prc', curve='PR'),
]

### Loading the model

In [None]:
# Setting the number of layers and neurons per layer
neurons = 70
hidden_layers = 2

In [None]:
# Calculating the initial bias
neg, pos = np.bincount(target)
initial_bias = np.log([pos / neg])
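The initial bias is the log-odds of the positive class. Passing it through the sigmoid output layer gives back the base rate `pos / (pos + neg)`, so an untrained network starts by predicting the observed error frequency. A quick numerical check with hypothetical class counts (made-up numbers):

```python
import numpy as np

# Hypothetical class counts (assumption)
neg, pos = 900, 100

# Same formula as in the cell above: log-odds of the positive class
initial_bias = np.log([pos / neg])

# Sigmoid of the log-odds recovers the share of positives
prior = 1 / (1 + np.exp(-initial_bias))
print(initial_bias, prior, pos / (pos + neg))
```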

In [None]:
# Splitting the data into training and validation sets
X_train_ann, X_val_ann, y_train_ann, y_val_ann = train_test_split(X_train, y_train, test_size=0.2, stratify=y_train,
                                                          random_state=44)
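Because this validation split is taken from the training set created earlier, the original data ends up divided into three parts. The arithmetic can be sketched as follows, using the two `test_size` values from the notebook:

```python
# Fractions taken from the two train_test_split calls in the notebook
test_fraction = 0.3          # first split: held-out test set
val_fraction_of_train = 0.2  # second split: validation share of the training set

train_fraction = 1 - test_fraction
val_fraction = train_fraction * val_fraction_of_train
ann_train_fraction = train_fraction * (1 - val_fraction_of_train)

print(f"test={test_fraction:.2f} validation={val_fraction:.2f} ann_train={ann_train_fraction:.2f}")
```

The neural network is therefore fitted on roughly 56% of the cleaned data, while the random forest is fitted on the full 70% training set.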

In [None]:
# Initialising the model
ann = tf.keras.models.Sequential()

# Adding fully connected layers
for _ in range(hidden_layers):
    ann.add(tf.keras.layers.Dense(units=neurons, activation='relu'))

ann.add(tf.keras.layers.Dropout(0.2))                                                           # Add a dropout layer
ann.add(tf.keras.layers.Dense(units=1,activation='sigmoid', bias_initializer=tf.keras.initializers.Constant(initial_bias)))    # Add the output layer

# Compiling the model
ann.compile(optimizer= tf.optimizers.Adam(learning_rate=0.01), loss='binary_crossentropy', metrics=METRICS)

Training of the model


In [None]:
baseline_history = ann.fit(X_train_ann,
                           y_train_ann,
                           batch_size=32,
                           epochs=100,
                           validation_data=(X_val_ann, y_val_ann),
                           )


## Making predictions

### Predictions on the training set

In [None]:
plt.plot(baseline_history.epoch, baseline_history.history['auc'])
plt.xlabel('Epoch')
plt.ylabel('ROC-AUC on the training data')

In [None]:
t2 = 0.5  # classification threshold for the neural network
train_predictions_baseline = pd.DataFrame(ann.predict(X_train))

In [None]:
roc_train_ann = roc_auc_score(y_train, train_predictions_baseline.iloc[:, 0])
print('The {} has an ROC-AUC on the training set of {}'.format('Neural Network', roc_train_ann))

In [None]:
confusion_train_ann = confusion_matrix(y_true= y_train, y_pred = train_predictions_baseline.iloc[:, 0] > t2)

Plotting the Confusion Matrix

In [None]:
%matplotlib inline
plt.figure(figsize=(5, 5))
sns.heatmap(confusion_train_ann, annot=True, fmt="d")
plt.title('Confusion matrix for {}'.format('Neural Network on Training Set'))
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

### Predictions on the testing set


In [None]:
ann_predictions_test = pd.DataFrame(ann.predict(X_test))

In [None]:
roc_test_ann = roc_auc_score(y_test, ann_predictions_test.iloc[:, 0])
print('The {} has an ROC-AUC on the testing set of {}'.format('Neural Network', roc_test_ann))

In [None]:
confusion_test_ann = confusion_matrix(y_true= y_test, y_pred = ann_predictions_test.iloc[:, 0] > t2)

Plotting the Confusion Matrix

In [None]:
%matplotlib inline
plt.figure(figsize=(5, 5))
sns.heatmap(confusion_test_ann, annot=True, fmt="d")
plt.title('Confusion matrix for {}'.format('Neural Network on Testing Set'))
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Plotting SHAP for ANN

In [None]:
background = X_train_ann.sample(50, random_state=42).astype(np.float32).values

sample_size = 50
sample_indices = np.random.choice(len(X_test), sample_size, replace=True)
X_sample_np = X_test.iloc[sample_indices].copy().astype(np.float32).values
X_sample_df = pd.DataFrame(X_sample_np, columns=X_test.columns)


explainer_ANN = shap.DeepExplainer(ann, background)
shap_values_ANN = explainer_ANN.shap_values(X_sample_np)

In [None]:
# Plot the summary plot; np.squeeze drops the trailing dimension of the single sigmoid output
plt.title("SHAP Summary Plot for ANN")
shap.summary_plot(np.squeeze(shap_values_ANN, axis=-1), X_sample_df, feature_names=X_sample_df.columns)



---



# Your Assignment

First, make sure that you have added your **candidate number** at the top!

Then run the above notebook, and **answer the following questions** (all parts are weighted as indicated in the parentheses at the beginning of each question):

1.   (20%) State what kind of machine learning algorithms have been implemented in this workbook and briefly interpret the results obtained. In addition, discuss the advantages and limitations of the two modeling approaches taken here. Justify your answer based on the specific case study and results you observe!

2.   (30%) Based on the data provided, compare the key factors driving the predictive power of the proposed learners and explain whether they match your intuition. Are there any features that you would like to exclude or further investigate? If yes, why? Leveraging your insights from Questions 1-2, state which approach you would propose to Anna to implement for the task at hand.

3.   (30%) Propose to Anna potential ways to better understand and improve the performance of the derived models. Are there other machine learning algorithms, data sources, data pre-processing, or model post-processing techniques that she could use to improve these results? Your answer here is expected to be descriptive (coding is not necessary). However, we expect you to provide arguments, using the course materials, to justify your suggestions. In your answers, take into consideration the cost, feasibility, and effectiveness of your recommendations for this particular business context.

4.  (20%) Suppose that you conclude with Anna that one of the models can proceed into implementation. Are there additional tests or analyses that you would like to conduct before integrating the model into the stores' IT system? What are some potential challenges that you foresee and how do you expect to remedy these challenges?

5.  (OPTIONAL) You can add your own code to the notebook to improve the predictive performance of the derived models or to support your answers to questions 1-4 with additional evidence. You can suggest modified or new models, use other algorithms, or implement some of your suggestions from Q3. If you decide to do so, indicate your changes with an appropriate text comment before each cell. Note that submitting your own code with the assignment will not "hurt" (reduce) your grade. However, please ensure that your code runs without any errors before the submission and that the output of the models is printed. Remember that this part is optional and can only help you towards a distinction in the class. You can use open-access large language models, such as ChatGPT or BARD, for the optional "coding" part of the assignment.

# Your answer (1,500 words max)

Word counts cover the main body of text, including in-text citations, tables, figures, and diagrams, but excluding appendices, footnotes, references, as well as any Python code or comments that you add to the notebook as part of the optional question.

*(Please type here)*



---



***Now save your notebook, print it as PDF, and submit the PDF to SAMS!***