# Problem

The goal is to identify products at risk of going on backorder before it happens, so that the business has time to react.

## What is a Backorder?
Backorders are products that are temporarily out of stock but that a customer is still permitted to order against future inventory.   
A backorder generally indicates that customer demand for a product or service exceeds a company’s capacity to supply it. Backorders are both good and bad: strong demand can drive them, but so can suboptimal planning.

## Data description

The data file contains historical data for the 8 weeks prior to the week we are trying to predict, taken as weekly snapshots at the start of each week. The columns are defined as follows:

    sku - Random ID for the product

    national_inv - Current inventory level for the part

    lead_time - Transit time for product (if available)

    in_transit_qty - Amount of product in transit from source

    forecast_3_month - Forecast sales for the next 3 months

    forecast_6_month - Forecast sales for the next 6 months

    forecast_9_month - Forecast sales for the next 9 months

    sales_1_month - Sales quantity for the prior 1 month time period

    sales_3_month - Sales quantity for the prior 3 month time period

    sales_6_month - Sales quantity for the prior 6 month time period

    sales_9_month - Sales quantity for the prior 9 month time period

    min_bank - Minimum recommended amount to stock

    potential_issue - Source issue for part identified

    pieces_past_due - Parts overdue from source

    perf_6_month_avg - Source performance for prior 6 month period

    perf_12_month_avg - Source performance for prior 12 month period

    local_bo_qty - Amount of stock orders overdue

    deck_risk - Part risk flag

    oe_constraint - Part risk flag

    ppap_risk - Part risk flag

    stop_auto_buy - Part risk flag

    rev_stop - Part risk flag

    went_on_backorder - Product actually went on backorder. This is the target value.
    
         Yes or 1 : Product backordered

         No or 0  : Product not backordered

# Loading the required libraries

In [101]:
#!pip install imbalanced-learn  # provides the imblearn module

In [102]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

from sklearn.ensemble import RandomForestClassifier

from imblearn.over_sampling import SMOTE

from sklearn.model_selection import GridSearchCV

from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, roc_curve, auc

import matplotlib.pyplot as plt

# Identify the Right Error Metrics

    The right error metric has to be chosen based on the business problem.
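
As a hedged illustration of why the metric choice matters: roughly 19% of the products in this data set go on backorder (see the target distribution later in this notebook), so a model that always predicts "No" reaches about 81% accuracy while catching no backorders at all. The sketch below uses made-up labels and the predictions of a hypothetical model, not the notebook's data, to compare accuracy with recall and precision.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy labels (assumed for illustration): 1 = backordered, 0 = not backordered.
y_true = np.array([0] * 81 + [1] * 19)

# A useless baseline that never predicts a backorder.
y_pred_baseline = np.zeros_like(y_true)
baseline_accuracy = accuracy_score(y_true, y_pred_baseline)
baseline_recall = recall_score(y_true, y_pred_baseline)

# A hypothetical model: 10 false alarms, 15 of the 19 backorders caught.
y_pred_model = np.array([1] * 10 + [0] * 71 + [1] * 15 + [0] * 4)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_model).ravel()
model_accuracy = accuracy_score(y_true, y_pred_model)
model_recall = recall_score(y_true, y_pred_model)
model_precision = precision_score(y_true, y_pred_model)

print(f"baseline: accuracy={baseline_accuracy:.2f}, recall={baseline_recall:.2f}")
print(f"model:    accuracy={model_accuracy:.2f}, recall={model_recall:.2f}, "
      f"precision={model_precision:.2f}")
```

If a missed backorder costs the business more than a false alarm, recall on the backordered class is the more informative number. That trade-off is a business decision that the code cannot settle.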

# Loading the data

In [103]:
data = pd.read_csv('BackOrders.csv')

# Understand the Data - Exploratory Data Analysis (EDA)

## Number of rows and columns

In [104]:
data.shape

(61589, 23)

## First and last 5 rows

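| | sku | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | ... | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty | deck_risk | oe_constraint | ppap_risk | stop_auto_buy | rev_stop | went_on_backorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1888279 | 117 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 15 | ... | 0 | -99.0 | -99.0 | 0 | No | No | Yes | Yes | No | No |
| 1 | 1870557 | 7 | 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0.5 | 0.28 | 0 | Yes | No | No | Yes | No | No |
| 2 | 1475481 | 258 | 15.0 | 10 | 10 | 77 | 184 | 46 | 132 | 256 | ... | 0 | 0.54 | 0.7 | 0 | No | No | No | Yes | No | No |
| 3 | 1758220 | 46 | 2.0 | 0 | 0 | 0 | 0 | 1 | 2 | 6 | ... | 0 | 0.75 | 0.9 | 0 | Yes | No | No | Yes | No | No |
| 4 | 1360312 | 2 | 2.0 | 0 | 4 | 6 | 10 | 2 | 2 | 5 | ... | 0 | 0.97 | 0.92 | 0 | No | No | No | Yes | No | No |


| | sku | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | ... | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty | deck_risk | oe_constraint | ppap_risk | stop_auto_buy | rev_stop | went_on_backorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61584 | 1397275 | 6 | 8.0 | 0 | 24 | 24 | 24 | 0 | 7 | 9 | ... | 0 | 0.98 | 0.98 | 0 | No | No | No | Yes | No | No |
| 61585 | 3072139 | 130 | 2.0 | 0 | 40 | 80 | 140 | 18 | 108 | 230 | ... | 0 | 0.51 | 0.28 | 0 | No | No | No | Yes | No | No |
| 61586 | 1909363 | 135 | 9.0 | 0 | 0 | 0 | 0 | 10 | 40 | 65 | ... | 0 | 1.0 | 0.99 | 0 | No | No | Yes | Yes | No | No |
| 61587 | 1845783 | 63 | NaN | 0 | 0 | 0 | 0 | 452 | 1715 | 3425 | ... | 0 | -99.0 | -99.0 | 1 | No | No | No | No | No | Yes |
| 61588 | 1200539 | 0 | 2.0 | 0 | 8 | 8 | 8 | 0 | 1 | 1 | ... | 0 | 0.79 | 0.78 | 0 | Yes | No | No | Yes | No | Yes |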


## Summary statistics
    Using the describe function

| | sku | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | sales_9_month | min_bank | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 61589.0 | 61589.0 | 58186.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 |
| mean | 2037188.0 | 287.721882 | 7.559619 | 30.192843 | 169.2728 | 315.0413 | 453.576 | 44.742957 | 150.732631 | 283.5465 | 419.6427 | 43.087256 | 1.6054 | -6.264182 | -5.863664 | 1.205361 |
| std | 656417.8 | 4233.906931 | 6.498952 | 792.869253 | 5286.742 | 9774.362 | 14202.01 | 1373.805831 | 5224.959649 | 8872.27 | 12698.58 | 959.614135 | 42.309229 | 25.537906 | 24.844514 | 29.981155 |
| min | 1068628.0 | -2999.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -99.0 | -99.0 | 0.0 |
| 25% | 1498574.0 | 3.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.62 | 0.64 | 0.0 |
| 50% | 1898033.0 | 10.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 4.0 | 6.0 | 0.0 | 0.0 | 0.82 | 0.8 | 0.0 |
| 75% | 2314826.0 | 57.0 | 8.0 | 0.0 | 12.0 | 25.0 | 36.0 | 6.0 | 17.0 | 34.0 | 51.0 | 3.0 | 0.0 | 0.96 | 0.95 | 0.0 |
| max | 3284895.0 | 673445.0 | 52.0 | 170976.0 | 1126656.0 | 2094336.0 | 3062016.0 | 295197.0 | 934593.0 | 1799099.0 | 2631590.0 | 192978.0 | 7392.0 | 1.0 | 1.0 | 2999.0 |


## Data types

sku                    int64
national_inv           int64
lead_time            float64
in_transit_qty         int64
forecast_3_month       int64
forecast_6_month       int64
forecast_9_month       int64
sales_1_month          int64
sales_3_month          int64
sales_6_month          int64
sales_9_month          int64
min_bank               int64
potential_issue       object
pieces_past_due        int64
perf_6_month_avg     float64
perf_12_month_avg    float64
local_bo_qty           int64
deck_risk             object
oe_constraint         object
ppap_risk             object
stop_auto_buy         object
rev_stop              object
went_on_backorder     object
dtype: object

__Observations__

* sku is an identifier and therefore `categorical`, but it is interpreted as `int64`.
* potential_issue, deck_risk, oe_constraint, ppap_risk, stop_auto_buy, rev_stop, and went_on_backorder are also `categorical` but are interpreted as `object`.
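
The outputs in the EDA subsections above are the kind of result that `head`, `tail`, `describe`, and `dtypes` return in pandas. The code cells themselves are not shown in this notebook, so the following is only a minimal sketch on a tiny stand-in frame with made-up values, not a reproduction of the original cells.

```python
import numpy as np
import pandas as pd

# Toy stand-in for BackOrders.csv (values assumed for illustration only).
data = pd.DataFrame({
    "sku": [1001, 1002, 1003, 1004, 1005, 1006],
    "national_inv": [117, 7, 258, 46, 2, 0],
    "lead_time": [np.nan, 2.0, 15.0, 2.0, 2.0, 8.0],
    "deck_risk": ["No", "Yes", "No", "Yes", "No", "No"],
    "went_on_backorder": ["No", "No", "No", "No", "No", "Yes"],
})

first_rows = data.head()    # first 5 rows
last_rows = data.tail()     # last 5 rows
summary = data.describe()   # numeric columns only by default
col_types = data.dtypes     # one dtype per column

print(data.shape)
print(summary)
print(col_types)
```

Note that `count` in the summary excludes missing values, which is why `lead_time` shows a lower count than the other columns in the output above.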

# Data pre-processing

## Convert all the attributes to appropriate type

Data type conversion

    Use astype('category') to convert the potential_issue, deck_risk, oe_constraint, ppap_risk, stop_auto_buy, rev_stop, and went_on_backorder attributes to categorical attributes.
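
The conversion cell is not shown here, so the following is a minimal sketch of one way to do it, run on a small stand-in frame with made-up values. The name `cat_attrs` is an assumption introduced for this example.

```python
import pandas as pd

# Toy stand-in with the seven Yes/No attributes (values assumed).
data = pd.DataFrame({
    "national_inv": [117, 7, 258],
    "potential_issue": ["No", "No", "Yes"],
    "deck_risk": ["No", "Yes", "No"],
    "oe_constraint": ["No", "No", "No"],
    "ppap_risk": ["Yes", "No", "No"],
    "stop_auto_buy": ["Yes", "Yes", "No"],
    "rev_stop": ["No", "No", "No"],
    "went_on_backorder": ["No", "No", "Yes"],
})

# Hypothetical helper list naming the columns to convert.
cat_attrs = ["potential_issue", "deck_risk", "oe_constraint", "ppap_risk",
             "stop_auto_buy", "rev_stop", "went_on_backorder"]

# Convert each column to the pandas 'category' dtype.
for col in cat_attrs:
    data[col] = data[col].astype("category")

print(data.dtypes)
```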


### Re-display data type of each variable

sku                     int64
national_inv            int64
lead_time             float64
in_transit_qty          int64
forecast_3_month        int64
forecast_6_month        int64
forecast_9_month        int64
sales_1_month           int64
sales_3_month           int64
sales_6_month           int64
sales_9_month           int64
min_bank                int64
potential_issue      category
pieces_past_due         int64
perf_6_month_avg      float64
perf_12_month_avg     float64
local_bo_qty            int64
deck_risk            category
oe_constraint        category
ppap_risk            category
stop_auto_buy        category
rev_stop             category
went_on_backorder    category
dtype: object

### Summary statistics
    Use the describe function to display the summary statistics for all features
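
By default `describe` summarises only numeric columns. Passing `include='all'` adds the `unique`, `top`, and `freq` rows for categorical columns, which matches the shape of the output below. This is a sketch on made-up values, not the original cell.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (values assumed).
data = pd.DataFrame({
    "national_inv": [117, 7, 258, 46],
    "deck_risk": pd.Categorical(["No", "Yes", "No", "No"]),
})

# include='all' reports numeric and categorical statistics in one table;
# statistics that do not apply to a column are shown as NaN.
full_summary = data.describe(include="all")
print(full_summary)
```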

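| | sku | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | ... | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty | deck_risk | oe_constraint | ppap_risk | stop_auto_buy | rev_stop | went_on_backorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 61589.0 | 61589.0 | 58186.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 | 61589.0 | ... | 61589.0 | 61589.0 | 61589.0 | 61589.0 | 61589 | 61589 | 61589 | 61589 | 61589 | 61589 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 2 | 2 | 2 | 2 | 2 | 2 |
| top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | No | No | No | Yes | No | No |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 48145 | 61577 | 53792 | 59303 | 61569 | 50296 |
| mean | 2037188.0 | 287.721882 | 7.559619 | 30.192843 | 169.2728 | 315.0413 | 453.576 | 44.742957 | 150.732631 | 283.5465 | ... | 1.6054 | -6.264182 | -5.863664 | 1.205361 | NaN | NaN | NaN | NaN | NaN | NaN |
| std | 656417.8 | 4233.906931 | 6.498952 | 792.869253 | 5286.742 | 9774.362 | 14202.01 | 1373.805831 | 5224.959649 | 8872.27 | ... | 42.309229 | 25.537906 | 24.844514 | 29.981155 | NaN | NaN | NaN | NaN | NaN | NaN |
| min | 1068628.0 | -2999.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | -99.0 | -99.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | 1498574.0 | 3.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.62 | 0.64 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | 1898033.0 | 10.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 4.0 | ... | 0.0 | 0.82 | 0.8 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | 2314826.0 | 57.0 | 8.0 | 0.0 | 12.0 | 25.0 | 36.0 | 6.0 | 17.0 | 34.0 | ... | 0.0 | 0.96 | 0.95 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |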


## Delete sku attribute

In [112]:
# check the number of unique values of sku


61589
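
The count of 61589 unique values equals the number of rows, so every `sku` is distinct and the column carries no information a model could generalise from. A minimal sketch of the check and the drop, on made-up values (the original cell is not shown):

```python
import pandas as pd

# Toy frame where every sku is unique (values assumed for illustration).
data = pd.DataFrame({
    "sku": [1888279, 1870557, 1475481, 1758220],
    "national_inv": [117, 7, 258, 46],
})

n_unique_sku = data["sku"].nunique()
print(n_unique_sku)

# An identifier that is unique per row cannot help prediction, so drop it.
if n_unique_sku == len(data):
    data = data.drop(columns=["sku"])

print(data.columns.tolist())
```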

## Missing Data

    Missing value analysis and dropping the records with missing values

national_inv            0
lead_time            3403
in_transit_qty          0
forecast_3_month        0
forecast_6_month        0
forecast_9_month        0
sales_1_month           0
sales_3_month           0
sales_6_month           0
sales_9_month           0
min_bank                0
potential_issue         0
pieces_past_due         0
perf_6_month_avg        0
perf_12_month_avg       0
local_bo_qty            0
deck_risk               0
oe_constraint           0
ppap_risk               0
stop_auto_buy           0
rev_stop                0
went_on_backorder       0
dtype: int64

Observe the number of records before and after removing the records with missing values

0

About 5.5% of the records (3,403 of roughly 61K) have a missing value, all of them in `lead_time`. For this initial analysis we drop all of these records, which leaves 58,186 records.
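
A minimal sketch of the missing-value analysis and removal described in this section, on a small made-up frame. The variable names are assumptions for the example; the original cells are not shown.

```python
import numpy as np
import pandas as pd

# Toy frame with missing lead_time values (values assumed for illustration).
data = pd.DataFrame({
    "national_inv": [117, 7, 258, 63],
    "lead_time": [np.nan, 2.0, 15.0, np.nan],
    "went_on_backorder": ["No", "No", "No", "Yes"],
})

# Missing values per column, and the share of rows affected.
missing_per_column = data.isnull().sum()
missing_fraction = data.isnull().any(axis=1).mean()
print(missing_per_column)

# Drop every row that has at least one missing value.
rows_before = len(data)
data = data.dropna(axis=0)
rows_after = len(data)

# Total number of missing values left; should be 0 after dropna.
remaining_missing = int(data.isnull().sum().sum())
print(rows_before, rows_after, remaining_missing)
```

Dropping rows is the simplest option. Imputing `lead_time` would be an alternative if losing these records turned out to matter.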

## Train and test split

### Target attribute distribution 

No     47217
Yes    10969
Name: went_on_backorder, dtype: int64

In [118]:
# display the normalised target distribution


No     0.811484
Yes    0.188516
Name: went_on_backorder, dtype: float64

### Split the data into train and test
sklearn.model_selection.train_test_split

    Split arrays or matrices into random train and test subsets

In [119]:
# Split the data into trainset and testset
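
The split cell is empty in this notebook, so the following is a sketch under stated assumptions. `test_size=0.3` is inferred from the training index shown later (length 40730, about 70% of the 58,186 complete records). `random_state=123` is an arbitrary choice made for this example, and the data is a made-up stand-in.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in data (values assumed for illustration only).
data = pd.DataFrame({
    "national_inv": list(range(20)),
    "lead_time": [2.0, 8.0] * 10,
    "went_on_backorder": ["No"] * 16 + ["Yes"] * 4,
})

# Separate the features from the target.
X = data.drop(columns=["went_on_backorder"])
y = data["went_on_backorder"]

# 70/30 split; random_state is an assumed value, used only for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))
```

Passing `stratify=y` would force both subsets to keep the same class proportions, which can be useful with an imbalanced target like this one.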

### Target attribute distribution after the split

In [120]:
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

No     0.811859
Yes    0.188141
Name: went_on_backorder, dtype: float64
No     0.81061
Yes    0.18939
Name: went_on_backorder, dtype: float64


## Convert categorical target attribute to numeric

Use `LabelEncoder` to convert the categorical target attribute __went_on_backorder__ to numeric
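
A minimal sketch of this step on made-up labels. `LabelEncoder` sorts the class labels, so "No" maps to 0 and "Yes" maps to 1. The encoder is fitted on the training target only and then reused for the test target. The names `y_train_enc` and `y_test_enc` are assumptions for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy targets (values assumed for illustration only).
y_train = pd.Series(["No", "Yes", "No", "No"], name="went_on_backorder")
y_test = pd.Series(["Yes", "No"], name="went_on_backorder")

# Learn the label-to-integer mapping from the training target.
le = LabelEncoder()
le.fit(y_train)

# Apply the same mapping to both subsets.
y_train_enc = le.transform(y_train)
y_test_enc = le.transform(y_test)

print(le.classes_)
print(pd.Series(y_train_enc).value_counts(normalize=True))
```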

### Target attribute distribution for trainset and testset

0    0.811859
1    0.188141
dtype: float64

0    0.81061
1    0.18939
dtype: float64

## Checking the data types

national_inv            int64
lead_time             float64
in_transit_qty          int64
forecast_3_month        int64
forecast_6_month        int64
forecast_9_month        int64
sales_1_month           int64
sales_3_month           int64
sales_6_month           int64
sales_9_month           int64
min_bank                int64
potential_issue      category
pieces_past_due         int64
perf_6_month_avg      float64
perf_12_month_avg     float64
local_bo_qty            int64
deck_risk            category
oe_constraint        category
ppap_risk            category
stop_auto_buy        category
rev_stop             category
went_on_backorder    category
dtype: object

## Standardize the numerical attributes

__Note__: Decision Tree and Random Forest models do not need standardized numeric attributes, because their splits depend only on the ordering of the feature values, not on their scale.

In [126]:
# get numerical cols and store in num_cols


Index(['national_inv', 'lead_time', 'in_transit_qty', 'forecast_3_month',
       'forecast_6_month', 'forecast_9_month', 'sales_1_month',
       'sales_3_month', 'sales_6_month', 'sales_9_month', 'min_bank',
       'pieces_past_due', 'perf_6_month_avg', 'perf_12_month_avg',
       'local_bo_qty'],
      dtype='object')
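
A minimal sketch of selecting the numeric columns and standardizing them, on made-up values. The scaler is fitted on the training set only and then applied to both sets, so no test-set statistics leak into training. The `X_train[num_cols]` output shown later in this notebook still holds raw values, so whether the original cells applied scaling is unclear; the sketch therefore writes the result to new, hypothetical variables instead of overwriting anything.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/test frames (values assumed for illustration only).
X_train = pd.DataFrame({
    "national_inv": [479, 3, 1, 65],
    "lead_time": [8.0, 12.0, 5.0, 4.0],
    "deck_risk": pd.Categorical(["No", "Yes", "No", "No"]),
}, index=[24039, 12800, 59501, 27604])
X_test = pd.DataFrame({
    "national_inv": [21, 4],
    "lead_time": [2.0, 12.0],
    "deck_risk": pd.Categorical(["No", "Yes"]),
}, index=[10, 11])

# Numeric column names; the categorical column is excluded.
num_cols = X_train.select_dtypes(include="number").columns
print(num_cols)

# Fit on the training data only, then transform both sets.
scaler = StandardScaler()
scaler.fit(X_train[num_cols])
X_train_num_scaled = pd.DataFrame(scaler.transform(X_train[num_cols]),
                                  columns=num_cols, index=X_train.index)
X_test_num_scaled = pd.DataFrame(scaler.transform(X_test[num_cols]),
                                 columns=num_cols, index=X_test.index)
print(X_train_num_scaled.round(3))
```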

## Converting Categorical attributes to Numeric attributes

### Store categorical attributes name

In [127]:
# get categorical cols and store in cat_cols

Index(['potential_issue', 'deck_risk', 'oe_constraint', 'ppap_risk',
       'stop_auto_buy', 'rev_stop'],
      dtype='object')

### Convert categorical attributes to numeric attributes using OneHotEncoder

OneHotEncoder(drop='first', sparse=False)
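
The cells that select `cat_cols` and fit the encoder are not shown, so this is a sketch on made-up values. In scikit-learn 1.2 and later the `sparse` argument is named `sparse_output` and `get_feature_names()` is replaced by `get_feature_names_out()`; the sketch tries the newer names first and falls back to the older ones.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy train/test frames (values assumed for illustration only).
X_train = pd.DataFrame({
    "national_inv": [479, 3, 1, 65],
    "deck_risk": pd.Categorical(["No", "Yes", "No", "No"]),
    "stop_auto_buy": pd.Categorical(["Yes", "Yes", "No", "Yes"]),
}, index=[24039, 12800, 59501, 27604])
X_test = pd.DataFrame({
    "national_inv": [21, 4],
    "deck_risk": pd.Categorical(["Yes", "No"]),
    "stop_auto_buy": pd.Categorical(["No", "Yes"]),
}, index=[10, 11])

# Names of the categorical columns.
cat_cols = X_train.select_dtypes(include="category").columns

# drop='first' keeps one indicator column per two-level attribute.
try:
    ohe = OneHotEncoder(drop="first", sparse_output=False)  # scikit-learn >= 1.2
except TypeError:
    ohe = OneHotEncoder(drop="first", sparse=False)         # older releases

# Fit on the training data only, then transform both sets.
ohe.fit(X_train[cat_cols])
encoded_train = ohe.transform(X_train[cat_cols])

if hasattr(ohe, "get_feature_names_out"):
    feature_names = ohe.get_feature_names_out()
else:
    feature_names = ohe.get_feature_names()

# Restore the original row index so the frame can be joined back later.
cat_df_test = pd.DataFrame(ohe.transform(X_test[cat_cols]),
                           columns=feature_names).set_index(X_test.index)
print(encoded_train)
print(cat_df_test)
```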

In [129]:
ohe.transform(X_train[cat_cols])

array([[0., 0., 0., 1., 1., 0.],
       [0., 0., 0., 1., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       ...,
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.]])

In [130]:
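# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
pd.DataFrame(ohe.transform(X_train[cat_cols]),
             columns = ohe.get_feature_names()).set_index(X_train.index)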

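| | x0_Yes | x1_Yes | x2_Yes | x3_Yes | x4_Yes | x5_Yes |
|---|---|---|---|---|---|---|
| 24039 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 12800 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 59501 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 35410 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 27604 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 57535 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 18760 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 29635 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 16647 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |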


In [131]:
X_train[num_cols]

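| | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | sales_9_month | min_bank | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24039 | 479 | 8.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.83 | 0.91 | 0 |
| 12800 | 3 | 12.0 | 0 | 36 | 46 | 56 | 12 | 23 | 39 | 56 | 2 | 0 | 0.82 | 0.79 | 0 |
| 59501 | 1 | 5.0 | 0 | 2 | 3 | 5 | 2 | 2 | 4 | 6 | 1 | 0 | 0.99 | 0.97 | 0 |
| 35410 | 1 | 4.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.73 | 0.78 | 0 |
| 27604 | 65 | 4.0 | 63 | 167 | 309 | 465 | 44 | 180 | 321 | 509 | 47 | 0 | 0.96 | 0.85 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 57535 | 53672 | 4.0 | 31750 | 56000 | 140000 | 196000 | 21071 | 66174 | 127006 | 185983 | 17747 | 0 | 0.98 | 0.96 | 0 |
| 18760 | 4 | 12.0 | 0 | 0 | 2 | 4 | 1 | 1 | 4 | 6 | 0 | 0 | 0.78 | 0.78 | 0 |
| 29635 | 3 | 2.0 | 0 | 4 | 8 | 12 | 0 | 4 | 6 | 12 | 0 | 0 | 0.98 | 0.99 | 0 |
| 16647 | 21 | 2.0 | 0 | 0 | 0 | 0 | 5 | 10 | 15 | 25 | 0 | 0 | 0.89 | 0.93 | 0 |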


In [132]:
cat_df_train = pd.DataFrame(ohe.transform(X_train[cat_cols]), 
                            columns = ohe.get_feature_names()).set_index(X_train.index)

In [133]:
cat_df_test = pd.DataFrame(ohe.transform(X_test[cat_cols]), 
                           columns = ohe.get_feature_names()).set_index(X_test.index)

In [134]:
X_train[num_cols].index

Int64Index([24039, 12800, 59501, 35410, 27604, 13466, 32836, 45431, 17827,
            21719,
            ...
            23503, 25122, 48877,  8211, 16278, 57535, 18760, 29635, 16647,
            55804],
           dtype='int64', length=40730)

In [135]:
cat_df_train.index

Int64Index([24039, 12800, 59501, 35410, 27604, 13466, 32836, 45431, 17827,
            21719,
            ...
            23503, 25122, 48877,  8211, 16278, 57535, 18760, 29635, 16647,
            55804],
           dtype='int64', length=40730)

## Concatenate numerical & encoded categorical features

In [137]:
# check Null Values

0
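
The concatenation cell is not shown, so the following is a sketch on made-up values. `pd.concat` with `axis=1` aligns rows by index label, which is why the encoded frames were given the original index with `set_index` earlier. The second half of the sketch shows what happens if that step is skipped. All variable names other than `cat_df_train` are assumptions for the example.

```python
import pandas as pd

# Toy numeric features carrying the original (shuffled) row labels.
X_train_num = pd.DataFrame(
    {"national_inv": [479, 3, 1], "lead_time": [8.0, 12.0, 5.0]},
    index=[24039, 12800, 59501],
)

# Encoded categorical features with the same index.
cat_df_train = pd.DataFrame(
    {"x0_Yes": [0.0, 0.0, 0.0], "x3_Yes": [1.0, 1.0, 0.0]},
    index=[24039, 12800, 59501],
)

# Column-wise concatenation; rows are matched on their index labels.
X_train_final = pd.concat([X_train_num, cat_df_train], axis=1)
n_missing = int(X_train_final.isnull().sum().sum())
print(X_train_final.shape, n_missing)

# Counter-example: a default 0..n-1 index does not match the labels,
# so pandas produces extra rows filled with NaN.
misaligned = pd.concat([X_train_num, cat_df_train.reset_index(drop=True)],
                       axis=1)
n_missing_misaligned = int(misaligned.isnull().sum().sum())
print(misaligned.shape, n_missing_misaligned)
```

A null count of 0 after concatenation, as in the output above, is a quick confirmation that the indexes were aligned.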