# Final Project


For your final project, you will build a classifier for
the **Backorder Prediction** dataset by following our
operationalized machine learning pipeline.

![AppliedML_Workflow IMAGE MISSING](../images/AppliedML_Workflow.png)


--- 

## Data

Details of the dataset are located here:

Dataset (originally posted on Kaggle): https://www.kaggle.com/tiredgeek/predict-bo-trial

The files are accessible in the JupyterHub environment:
 * `/dsa/data/all_datasets/back_order/Kaggle_Training_Dataset_v2.csv`
 * `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`

The data is used to predict whether a product will go on backorder.
 
**NOTE:** The training data file is 117MB.  
You can easily lock up a notebook with bad coding practices.  
Please save your project early and often, and use `git commits` to checkpoint your progress.

## Exploration, Training, and Validation

You will examine the _training_ dataset and perform 
 * **data preparation and exploratory data analysis**, 
 * **anomaly detection / removal**,
 * **dimensionality reduction** and then
 * **train and validate 3 different models**.

For the 3 models, you are free to pick any estimator from scikit-learn 
or any model we have covered so far using TensorFlow.

### Validation Assessment

Your first, intermediate result will be an **assessment** of the models' performance.
This assessment should be grounded in a 10-fold cross-validation methodology.

This should include the confusion matrix and F-score for each classifier.
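
As a minimal sketch of such an assessment, the block below runs 10-fold cross-validation and derives a confusion matrix and F-score from the out-of-fold predictions. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, and `model` are assumptions for illustration, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared training features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)

model = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Each sample is predicted by a model that did not see it during fitting.
y_pred = cross_val_predict(model, X_demo, y_demo, cv=cv)

cm = confusion_matrix(y_demo, y_pred)
f1 = f1_score(y_demo, y_pred)
print(cm)
print("F1:", round(f1, 3))
```

Running the cross-validation on the training portion keeps the held-out test data untouched until the final evaluation.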


---

## Testing

Once you have chosen your final model, you will need to re-train it using all the training data.


--- 
##  Overview / Roadmap

**General steps**:

* Dataset carpentry & Exploratory Data Analysis
  * Develop functions to perform the necessary steps, because you will have to apply the same carpentry to both the Training and the Testing data.
* Create 3 pipelines, each does:
    * Anomaly detection
    * Dimensionality reduction
    * Model training/validation
* Train the chosen model on the full training data
* Evaluate the model against the testing data
* Write a summary of your processing and an analysis of the model performance


#### <span style="background:yellow">Note:</span> The use of sklearn Pipelines and FeatureUnion is optional.   
However, your three models should follow a readable path from data to cross-validation statistics.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd

## Load dataset

**Description**
~~~
sku - Random ID for the product
national_inv - Current inventory level for the part
lead_time - Transit time for product (if available)
in_transit_qty - Amount of product in transit from source
forecast_3_month - Forecast sales for the next 3 months
forecast_6_month - Forecast sales for the next 6 months
forecast_9_month - Forecast sales for the next 9 months
sales_1_month - Sales quantity for the prior 1 month time period
sales_3_month - Sales quantity for the prior 3 month time period
sales_6_month - Sales quantity for the prior 6 month time period
sales_9_month - Sales quantity for the prior 9 month time period
min_bank - Minimum recommended amount to stock
potential_issue - Source issue for part identified
pieces_past_due - Parts overdue from source
perf_6_month_avg - Source performance for prior 6 month period
perf_12_month_avg - Source performance for prior 12 month period
local_bo_qty - Amount of stock orders overdue
deck_risk - Part risk flag
oe_constraint - Part risk flag
ppap_risk - Part risk flag
stop_auto_buy - Part risk flag
rev_stop - Part risk flag
went_on_backorder - Product actually went on backorder. **This is the target value.**
~~~

**Note**: This is a real-world dataset without any processing.  
There will also be warnings due to the fact that the 1st column mixes integer and string values.  
The last column is what we are trying to predict.
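
One possible way to avoid the mixed-type warning is to tell pandas up front that the first column is text. The sketch below uses a tiny in-memory CSV as a stand-in for the real file; the sample rows and the name `demo` are assumptions for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: numeric SKUs plus a trailing footer row.
csv_text = (
    "sku,national_inv,went_on_backorder\n"
    "1026827,0,No\n"
    "1043384,2,No\n"
    "(2 rows),,\n"
)

# Reading sku as text avoids mixed-type inference for that column.
demo = pd.read_csv(io.StringIO(csv_text), dtype={"sku": str})

# The footer row has no target value, so it can be dropped on that criterion.
demo = demo.dropna(subset=["went_on_backorder"])
print(demo)
```

The same `dtype` argument can be combined with `na_values` in a single `read_csv` call.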

In [2]:
# Dataset location
DATASET = '/dsa/data/all_datasets/back_order/Kaggle_Training_Dataset_v2.csv'
assert os.path.exists(DATASET)

# Shuffling the dataset has been moved to later in the workflow to simplify exploratory analysis. Additional web sources 
# verified that the -99 values found in perf_6_month_avg and perf_12_month_avg are meant to be placeholders for NaN values. 
# The negative values in national_inv are described as valid values.

# We will replace the -99 values with NaNs while pushing the data into a Pandas dataframe.

na_other = {'perf_6_month_avg':-99, 'perf_12_month_avg':-99}

dataset = pd.read_csv(DATASET, na_values=na_other)
dataset.describe()




Unnamed: 0,national_inv,lead_time,in_transit_qty,forecast_3_month,forecast_6_month,forecast_9_month,sales_1_month,sales_3_month,sales_6_month,sales_9_month,min_bank,pieces_past_due,perf_6_month_avg,perf_12_month_avg,local_bo_qty
count,1687860.0,1586967.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1558382.0,1565810.0,1687860.0
mean,496.1118,7.872267,44.05202,178.1193,344.9867,506.3644,55.92607,175.0259,341.7288,525.2697,52.7723,2.043724,0.7823812,0.7769763,0.6264507
std,29615.23,7.056024,1342.742,5026.553,9795.152,14378.92,1928.196,5192.378,9613.167,14838.61,1254.983,236.0165,0.2370141,0.2304902,33.72224
min,-27256.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.7,0.69,0.0
50%,15.0,8.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,4.0,0.0,0.0,0.85,0.83,0.0
75%,80.0,9.0,0.0,4.0,12.0,20.0,4.0,15.0,31.0,47.0,3.0,0.0,0.97,0.96,0.0
max,12334400.0,52.0,489408.0,1427612.0,2461360.0,3777304.0,741774.0,1105478.0,2146625.0,3205172.0,313319.0,146496.0,1.0,1.0,12530.0


In [3]:
# Summarise the non-numerical data

dataset.describe(include=['O'])

Unnamed: 0,sku,potential_issue,deck_risk,oe_constraint,ppap_risk,stop_auto_buy,rev_stop,went_on_backorder
count,1687861,1687860,1687860,1687860,1687860,1687860,1687860,1687860
unique,1687861,2,2,2,2,2,2,2
top,1579532,No,No,No,No,Yes,No,No
freq,1,1686953,1300377,1687615,1484026,1626774,1687129,1676567


In [4]:
# Show top 10 rows

dataset.head(10)

Unnamed: 0,sku,national_inv,lead_time,in_transit_qty,forecast_3_month,forecast_6_month,forecast_9_month,sales_1_month,sales_3_month,sales_6_month,...,pieces_past_due,perf_6_month_avg,perf_12_month_avg,local_bo_qty,deck_risk,oe_constraint,ppap_risk,stop_auto_buy,rev_stop,went_on_backorder
0,1026827,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,,0.0,No,No,No,Yes,No,No
1,1043384,2.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.99,0.99,0.0,No,No,No,Yes,No,No
2,1043696,2.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,,0.0,Yes,No,No,Yes,No,No
3,1043852,7.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.1,0.13,0.0,No,No,No,Yes,No,No
4,1044048,8.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,,0.0,Yes,No,No,Yes,No,No
5,1044198,13.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.82,0.87,0.0,No,No,No,Yes,No,No
6,1044643,1095.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,,0.0,Yes,No,No,Yes,No,No
7,1045098,6.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,Yes,No,Yes,Yes,No,No
8,1045815,140.0,,0.0,15.0,114.0,152.0,0.0,0.0,0.0,...,0.0,,,0.0,No,No,No,Yes,No,No
9,1045867,4.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.82,0.87,0.0,No,No,No,Yes,No,No


In [5]:
# Show top 10 rows of columns that were cut out

dataset.loc[0:9, 'sales_9_month':'potential_issue']


Unnamed: 0,sales_9_month,min_bank,potential_issue
0,0.0,0.0,No
1,0.0,0.0,No
2,0.0,0.0,No
3,0.0,1.0,No
4,4.0,2.0,No
5,0.0,0.0,No
6,0.0,4.0,No
7,0.0,0.0,No
8,0.0,0.0,No
9,0.0,0.0,No


In [6]:
# Show bottom 10 rows

dataset.tail(10)

Unnamed: 0,sku,national_inv,lead_time,in_transit_qty,forecast_3_month,forecast_6_month,forecast_9_month,sales_1_month,sales_3_month,sales_6_month,...,pieces_past_due,perf_6_month_avg,perf_12_month_avg,local_bo_qty,deck_risk,oe_constraint,ppap_risk,stop_auto_buy,rev_stop,went_on_backorder
1687851,1373539,-6.0,9.0,36.0,130.0,130.0,130.0,0.0,0.0,54.0,...,0.0,0.03,0.1,42.0,No,No,No,Yes,No,No
1687852,1478683,2.0,8.0,0.0,966.0,966.0,1116.0,47.0,512.0,1361.0,...,0.0,0.84,0.77,46.0,No,No,No,Yes,No,No
1687853,1489920,0.0,2.0,0.0,2071.0,3025.0,3412.0,4.0,764.0,764.0,...,0.0,0.98,0.99,4.0,No,No,No,No,No,Yes
1687854,1392420,124.0,8.0,140.0,410.0,780.0,1240.0,128.0,464.0,849.0,...,0.0,0.85,0.9,1.0,No,No,No,Yes,No,No
1687855,1407754,0.0,2.0,0.0,10.0,10.0,10.0,0.0,5.0,7.0,...,0.0,0.69,0.69,5.0,Yes,No,No,Yes,No,No
1687856,1373987,-1.0,,0.0,5.0,7.0,9.0,1.0,3.0,3.0,...,0.0,,,1.0,No,No,No,Yes,No,No
1687857,1524346,-1.0,9.0,0.0,7.0,9.0,11.0,0.0,8.0,11.0,...,0.0,0.86,0.84,1.0,Yes,No,No,No,No,Yes
1687858,1439563,62.0,9.0,16.0,39.0,87.0,126.0,35.0,63.0,153.0,...,0.0,0.86,0.84,6.0,No,No,No,Yes,No,No
1687859,1502009,19.0,4.0,0.0,0.0,0.0,0.0,2.0,7.0,12.0,...,0.0,0.73,0.78,1.0,No,No,No,Yes,No,No
1687860,(1687860 rows),,,,,,,,,,...,,,,,,,,,,


## Processing

In this section, the goal is to figure out:

* which columns we can use directly,  
* which columns are usable after some processing,  
* and which columns are not processable or obviously irrelevant (like product id) that we will discard.

Then process and prepare this dataset for creating a predictive model.

In [7]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1687861 entries, 0 to 1687860
Data columns (total 23 columns):
sku                  1687861 non-null object
national_inv         1687860 non-null float64
lead_time            1586967 non-null float64
in_transit_qty       1687860 non-null float64
forecast_3_month     1687860 non-null float64
forecast_6_month     1687860 non-null float64
forecast_9_month     1687860 non-null float64
sales_1_month        1687860 non-null float64
sales_3_month        1687860 non-null float64
sales_6_month        1687860 non-null float64
sales_9_month        1687860 non-null float64
min_bank             1687860 non-null float64
potential_issue      1687860 non-null object
pieces_past_due      1687860 non-null float64
perf_6_month_avg     1558382 non-null float64
perf_12_month_avg    1565810 non-null float64
local_bo_qty         1687860 non-null float64
deck_risk            1687860 non-null object
oe_constraint        1687860 non-null object
ppap_risk        

### Take samples and examine the dataset

In [8]:
dataset.iloc[:3,:6]

Unnamed: 0,sku,national_inv,lead_time,in_transit_qty,forecast_3_month,forecast_6_month
0,1026827,0.0,,0.0,0.0,0.0
1,1043384,2.0,9.0,0.0,0.0,0.0
2,1043696,2.0,,0.0,0.0,0.0


In [9]:
dataset.iloc[:3,6:12]

Unnamed: 0,forecast_9_month,sales_1_month,sales_3_month,sales_6_month,sales_9_month,min_bank
0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
dataset.iloc[:3,12:18]

Unnamed: 0,potential_issue,pieces_past_due,perf_6_month_avg,perf_12_month_avg,local_bo_qty,deck_risk
0,No,0.0,,,0.0,No
1,No,0.0,0.99,0.99,0.0,No
2,No,0.0,,,0.0,Yes


In [11]:
dataset.iloc[:3,18:24]

Unnamed: 0,oe_constraint,ppap_risk,stop_auto_buy,rev_stop,went_on_backorder
0,No,No,Yes,No,No
1,No,No,Yes,No,No
2,No,No,Yes,No,No


### Drop columns that are obviously irrelevant or not processable

In [12]:
# Add code below this comment  (Question #E8001)
# ----------------------------------

dataset = dataset.drop(['sku'], axis = 1)

# We will also drop the last row consisting of all NaN values

dataset = dataset[:-1]


### Find unique values of string columns

Now try to make sure that these Yes/No columns really only contain Yes or No.  
If that's true, proceed to convert them into binaries (0s and 1s).

**Tip**: use [unique()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) function of pandas Series.

Example

~~~python
print('went_on_backorder', dataset['went_on_backorder'].unique())
~~~

In [13]:
# All the column names of these yes/no columns
yes_no_columns = list(filter(lambda i: dataset[i].dtype!=np.float64, dataset.columns))
print(yes_no_columns)

# Add code below this comment  (Question #E8002)
# ----------------------------------
print('potential_issue', dataset['potential_issue'].unique())
print('deck_risk', dataset['deck_risk'].unique())
print('oe_constraint', dataset['oe_constraint'].unique())
print('ppap_risk', dataset['ppap_risk'].unique())
print('stop_auto_buy', dataset['stop_auto_buy'].unique())
print('rev_stop', dataset['rev_stop'].unique())
print('went_on_backorder', dataset['went_on_backorder'].unique())


['potential_issue', 'deck_risk', 'oe_constraint', 'ppap_risk', 'stop_auto_buy', 'rev_stop', 'went_on_backorder']
potential_issue ['No' 'Yes']
deck_risk ['No' 'Yes']
oe_constraint ['No' 'Yes']
ppap_risk ['No' 'Yes']
stop_auto_buy ['Yes' 'No']
rev_stop ['No' 'Yes']
went_on_backorder ['No' 'Yes']


You may see **nan** also as possible values representing missing values in the dataset.

We fill them with the most frequent value, the [Mode](https://en.wikipedia.org/wiki/Mode_%28statistics%29) in statistics.

In [14]:
# This step is not necessary because apparently the only row with NaNs for the string columns was that last row

# for column_name in yes_no_columns:
    # mode = dataset[column_name].apply(str).mode()[0]
    # print('Filling missing values of {} with {}'.format(column_name, mode))
    # dataset[column_name].fillna(mode, inplace=True)

In [15]:
# Let's see where our remaining NaNs are

dataset.isnull().any()

national_inv         False
lead_time             True
in_transit_qty       False
forecast_3_month     False
forecast_6_month     False
forecast_9_month     False
sales_1_month        False
sales_3_month        False
sales_6_month        False
sales_9_month        False
min_bank             False
potential_issue      False
pieces_past_due      False
perf_6_month_avg      True
perf_12_month_avg     True
local_bo_qty         False
deck_risk            False
oe_constraint        False
ppap_risk            False
stop_auto_buy        False
rev_stop             False
went_on_backorder    False
dtype: bool

In [16]:
# Let's see just how many rows include NaNs

dataset.shape[0] - dataset.dropna().shape[0]

129478

In [17]:
# If we were to remove these rows from the dataset how much data would we lose?

129478/1687860*100

7.671133861813184

In [18]:
# About 7.7% of the rows contain NaNs, which is less than 10%. I believe it is reasonable to drop these observations
# without materially affecting the results of our modeling

dataset = dataset.dropna()
dataset.isnull().any().any()

False

In [73]:
# Referring back to when we described the went_on_backorder variable, we see that there were 1676567 "No" values. How many
# yes values were there?

1687860-1676567

11293

In [19]:
# Let's look at these figures now that we removed observations that included NaNs

dataset.describe(include=['O'])

Unnamed: 0,potential_issue,deck_risk,oe_constraint,ppap_risk,stop_auto_buy,rev_stop,went_on_backorder
count,1558382,1558382,1558382,1558382,1558382,1558382,1558382
unique,2,2,2,2,2,2,2
top,No,No,No,No,Yes,No,No
freq,1557502,1244372,1558137,1375836,1522966,1558013,1547519


In [20]:
# We now have 1547519 "No" values, so how many did we lose?

1676567-1547519

129048

In [21]:
# Okay. So how many "Yes" values do we have now?

1558382-1547519

10863

In [22]:
# Which means we lost:

11293-10863

430
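
The per-class bookkeeping above can also be computed in one step. The following is a sketch on a toy frame; the helper name `class_loss_after_dropna` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def class_loss_after_dropna(df, target):
    """Count rows per class before and after dropna(), with percent lost."""
    before = df[target].value_counts()
    after = df.dropna()[target].value_counts().reindex(before.index, fill_value=0)
    return pd.DataFrame({
        "before": before,
        "after": after,
        "pct_lost": 100 * (before - after) / before,
    })


# Toy data: one "No" row and one "Yes" row have a missing lead_time.
toy = pd.DataFrame({
    "lead_time": [8.0, np.nan, 2.0, 9.0, np.nan, 4.0],
    "went_on_backorder": ["No", "No", "Yes", "No", "Yes", "No"],
})

summary = class_loss_after_dropna(toy, "went_on_backorder")
print(summary)
```

Applied to the counts reported above, the same arithmetic gives roughly 3.8% of the "Yes" rows lost (430 of 11293) and roughly 7.7% of the "No" rows lost (129048 of 1676567).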

### Convert yes/no columns into binary (0s and 1s)

In [23]:
# Add code below this comment  (Question #E8003)
# ----------------------------------

for column_name in yes_no_columns:
    dataset[column_name] = dataset[column_name].apply(['No', 'Yes'].index)


Now all columns should be either int64 or float64.
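
An alternative sketch of the same conversion uses an explicit mapping and then checks that only numeric columns remain. The toy frame and the names `toy` and `flag_columns` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "lead_time": [8.0, 2.0, 9.0],
    "deck_risk": ["No", "Yes", "No"],
    "went_on_backorder": ["No", "No", "Yes"],
})

flag_columns = ["deck_risk", "went_on_backorder"]
for column_name in flag_columns:
    # An explicit mapping turns any unexpected label into NaN instead of raising.
    toy[column_name] = toy[column_name].map({"No": 0, "Yes": 1})

# After conversion, every column should have a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)
print(toy.dtypes)
print(all_numeric)
```

With `map`, a stray label would surface as a NaN that a later `isnull()` check can catch, whereas `['No', 'Yes'].index` raises a `ValueError` on the first unexpected value.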

In [24]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1558382 entries, 1 to 1687859
Data columns (total 22 columns):
national_inv         1558382 non-null float64
lead_time            1558382 non-null float64
in_transit_qty       1558382 non-null float64
forecast_3_month     1558382 non-null float64
forecast_6_month     1558382 non-null float64
forecast_9_month     1558382 non-null float64
sales_1_month        1558382 non-null float64
sales_3_month        1558382 non-null float64
sales_6_month        1558382 non-null float64
sales_9_month        1558382 non-null float64
min_bank             1558382 non-null float64
potential_issue      1558382 non-null int64
pieces_past_due      1558382 non-null float64
perf_6_month_avg     1558382 non-null float64
perf_12_month_avg    1558382 non-null float64
local_bo_qty         1558382 non-null float64
deck_risk            1558382 non-null int64
oe_constraint        1558382 non-null int64
ppap_risk            1558382 non-null int64
stop_auto_buy        

In [25]:
# Ok now let's shuffle the dataset

dataset = dataset.sample(frac = 1).reset_index(drop=True)

In [26]:
# As we mentioned, our target variable classes are very imbalanced. To compensate for this, we will downsample the data so
# there are as many "No"s as "Yes"s for went_on_backorder in our training set. Additionally, the smaller sample set will
# allow us to train our models more quickly.

from sklearn.utils import resample

# Separate majority and minority values
dataset_majority = dataset[dataset.went_on_backorder==0]
dataset_minority = dataset[dataset.went_on_backorder==1]

# Downsample majority values
dataset_majority_downsampled = resample(dataset_majority, 
                                 replace=False, # sample without replacement
                                 n_samples=(dataset_minority.went_on_backorder).count(),     # to match minority class
                                 random_state=123) # reproducible results

# Combine minority value with downsampled majority value
dataset_downsampled = pd.concat([dataset_majority_downsampled, dataset_minority])

# Check that we did this right: show value counts (they should be equal)
dataset_downsampled.went_on_backorder.value_counts()

1    10863
0    10863
Name: went_on_backorder, dtype: int64

In [27]:
# And now let's split into train and test subsets

from sklearn.model_selection import train_test_split, cross_val_score

X = dataset_downsampled.iloc[:, :-1]
y = dataset_downsampled['went_on_backorder']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
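
As a hedged variation on the split above, `train_test_split` also accepts `stratify` and `random_state`, which keep the class ratio identical in both subsets and make the split repeatable. The synthetic arrays and the seed value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Balanced synthetic stand-in for the downsampled data (assumption).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4))
y_demo = np.array([0] * 100 + [1] * 100)

# stratify keeps the 50/50 class ratio in both subsets;
# random_state makes the split reproducible across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print(len(X_tr), len(X_te))
print(np.bincount(y_tr), np.bincount(y_te))
```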

## Pipeline

In this section, design an operationalized machine learning pipeline, which includes:

* Anomaly detection
* Dimensionality Reduction
* Train a model

You can add more notebook cells or import any Python modules as needed.
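
The three stages can be sketched end to end as follows. This is only an illustration on synthetic data; the estimator choices, parameter values, and the names `X_demo`, `detector`, and `pipe` are assumptions rather than a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training features and labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=12, n_informative=5, random_state=0
)

# 1. Anomaly detection: predict() returns -1 for outliers and 1 for inliers.
detector = IsolationForest(n_estimators=100, random_state=0).fit(X_demo)
inliers = detector.predict(X_demo) == 1
X_clean, y_clean = X_demo[inliers], y_demo[inliers]

# 2. Dimensionality reduction and 3. model training, chained in one Pipeline.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression()),
])
pipe.fit(X_clean, y_clean)

print("rows kept:", X_clean.shape[0], "of", X_demo.shape[0])
print("training accuracy:", round(pipe.score(X_clean, y_clean), 3))
```

Outlier removal happens outside the `Pipeline` here because scikit-learn pipeline steps transform features but do not drop rows.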

In [28]:
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest, RandomForestClassifier

from sklearn.decomposition import PCA, FactorAnalysis, NMF
from sklearn.preprocessing import StandardScaler, scale, Normalizer
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report, accuracy_score


### Your 1st pipeline 
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

In [29]:
# Add code below this comment  (Question #E8004)
# ----------------------------------

# For our first pipeline we will use a Gaussian Naive Bayes model.
# We will start with anomaly detection/removal using an Elliptic Envelope.

envelope = EllipticEnvelope().fit(X_train)

outliers = envelope.predict(X_train)==-1
X_train_clean = X_train[~outliers]
y_train_clean = y_train[~outliers]

In [30]:
# Now we build the pipeline. We will scale the data using StandardScaler() and reduce dimensions by feature selection via 
# SelectKBest and mutual_info_classif. After experimentation, 6 features were determined to produce the best results.

pipe_nb = Pipeline([('scl', StandardScaler()),
                    ('selector', SelectKBest(mutual_info_classif, k=6)),
                    ('clf', GaussianNB())])

pipe_nb.fit(X_train_clean, y_train_clean)


Pipeline(memory=None,
     steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selector', SelectKBest(k=6, score_func=<function mutual_info_classif at 0x7f1367b05598>)), ('clf', GaussianNB(priors=None))])

In [31]:
scores_nb = cross_val_score(pipe_nb, X_test, y_test, cv=10)
scores_nb

array([ 0.50995406,  0.51687117,  0.53527607,  0.51533742,  0.50766871,
        0.50766871,  0.51687117,  0.50998464,  0.51459293,  0.50691244])

In [32]:
scores_nb.mean()

0.51411373250876813

In [33]:
predictions_nb = pipe_nb.predict(X_test)

unique_label = np.unique(y_test)
print(pd.DataFrame(confusion_matrix(y_test, predictions_nb, labels=unique_label), 
                   index=['true:{:}'.format(x) for x in unique_label], 
                   columns=['pred:{:}'.format(x) for x in unique_label]))

        pred:0  pred:1
true:0    2820     427
true:1    1732    1539


In [34]:
print(classification_report(y_test, predictions_nb))

             precision    recall  f1-score   support

          0       0.62      0.87      0.72      3247
          1       0.78      0.47      0.59      3271

avg / total       0.70      0.67      0.66      6518



In [35]:
accuracy_score(y_test, predictions_nb)

0.66876342436330161

### Your 2nd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

In [36]:
# Add code below this comment  (Question #E8005)
# ----------------------------------

# For our second pipeline we will use a Logistic Regression model.
# We will start with anomaly detection/removal using an Isolation Forest.

iso_forest = IsolationForest(n_estimators=250,bootstrap=True).fit(X_train, y_train)

outliers = iso_forest.predict(X_train)==-1
X_train_iso = X_train[~outliers]
y_train_iso = y_train[~outliers]


In [37]:
# Now we build the pipeline. We will scale the data using Normalizer() and reduce dimensions by feature extraction via 
# Principal Component Analysis. After experimentation, 10 components were determined to produce the best results.

pipe_lr = Pipeline([('scl', Normalizer()),
                    ('pca', PCA(n_components=10)),
                    ('clf', LogisticRegression())])

pipe_lr.fit(X_train_iso, y_train_iso)

Pipeline(memory=None,
     steps=[('scl', Normalizer(copy=True, norm='l2')), ('pca', PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [38]:
scores_lr = cross_val_score(pipe_lr, X_test, y_test, cv=10)
scores_lr

array([ 0.82848392,  0.82055215,  0.83128834,  0.83128834,  0.84662577,
        0.83128834,  0.83282209,  0.82334869,  0.85253456,  0.83717358])

In [39]:
scores_lr.mean()

0.83354057866798625

In [40]:
predictions_lr = pipe_lr.predict(X_test)

unique_label = np.unique(y_test)
print(pd.DataFrame(confusion_matrix(y_test, predictions_lr, labels=unique_label), 
                   index=['true:{:}'.format(x) for x in unique_label], 
                   columns=['pred:{:}'.format(x) for x in unique_label]))

        pred:0  pred:1
true:0    2472     775
true:1     326    2945


In [41]:
print(classification_report(y_test, predictions_lr))

             precision    recall  f1-score   support

          0       0.88      0.76      0.82      3247
          1       0.79      0.90      0.84      3271

avg / total       0.84      0.83      0.83      6518



In [42]:
accuracy_score(y_test, predictions_lr)

0.83108315434182267

### Your 3rd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

In [43]:
# Add code below this comment  (Question #E8006)
# ----------------------------------

# For our third pipeline we will use a Random Forest model.
# We will start with anomaly detection/removal using a One Class Support Vector Machine and utilize a Radial Basis Function
# kernel.

# Named oc_svm so that it does not shadow the sklearn `svm` module imported earlier
oc_svm = OneClassSVM(kernel='rbf').fit(X_train, y_train)

svm_outliers = oc_svm.predict(X_train)==-1
X_train_svm = X_train[~svm_outliers]
y_train_svm = y_train[~svm_outliers]

In [44]:
# Now we build the pipeline. We will not scale the data because the Random Forest model relies on rules, and would not be
# affected by any monotonic transformations of the variables. Dimension reduction will be executed via Factor Analysis.
# After experimentation, 10 factors were determined to produce the best results.

pipe_rf = Pipeline([('fa', FactorAnalysis(n_components=10)),
                    ('clf', RandomForestClassifier())])

pipe_rf.fit(X_train_svm, y_train_svm)

Pipeline(memory=None,
     steps=[('fa', FactorAnalysis(copy=True, iterated_power=3, max_iter=1000, n_components=10,
        noise_variance_init=None, random_state=0, svd_method='randomized',
        tol=0.01)), ('clf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

In [45]:
scores_rf = cross_val_score(pipe_rf, X_test, y_test, cv=10)
scores_rf

array([ 0.87136294,  0.86503067,  0.87883436,  0.85736196,  0.8696319 ,
        0.89110429,  0.86042945,  0.88018433,  0.87250384,  0.85867896])

In [46]:
scores_rf.mean()

0.87051227058086211

In [47]:
predictions_rf = pipe_rf.predict(X_test)

unique_label = np.unique(y_test)
print(pd.DataFrame(confusion_matrix(y_test, predictions_rf, labels=unique_label), 
                   index=['true:{:}'.format(x) for x in unique_label], 
                   columns=['pred:{:}'.format(x) for x in unique_label]))

        pred:0  pred:1
true:0    2010    1237
true:1     165    3106


In [48]:
print(classification_report(y_test, predictions_rf))

             precision    recall  f1-score   support

          0       0.92      0.62      0.74      3247
          1       0.72      0.95      0.82      3271

avg / total       0.82      0.78      0.78      6518



In [49]:
accuracy_score(y_test, predictions_rf)

0.78490334458422828

## Document the cross-validation analysis for the three models

**<span style="background:yellow">Don't forget to share your chosen models and their cross-validation performance with the class on the discussion board for module 8.</span>** 

---

# Retrain a model using the full training data set

## Train
Use the full training data set to train the model.

In [50]:
# Add code below this comment  (Question #E8008)
# ----------------------------------

# We'll now train our chosen model on the entire training dataset. Remember, this is still training the model, so we still need
# to account for the Yes/No imbalance in our target variable. This time let's upsample the data:

dataset_minority_upsampled = resample(dataset_minority, 
                                replace=True,     # sample with replacement
                                n_samples=(dataset_majority.went_on_backorder).count(),    # to match majority class
                                random_state=456) # reproducible results

# Combine upsampled minority class with majority class
dataset_upsampled = pd.concat([dataset_minority_upsampled, dataset_majority])

# Check that we did this right: show value counts
dataset_upsampled.went_on_backorder.value_counts()

1    1547519
0    1547519
Name: went_on_backorder, dtype: int64

In [51]:
# What follows splits the upsampled training dataset into 4 subsets and fits the model on each in turn, so that the kernel
# does not run out of memory. Each subset is run through an Isolation Forest to detect outliers, just as we did earlier
# with this model. Caution: each call to pipe_lr.fit refits the whole pipeline from scratch, so the model left at the end
# reflects only the last subset fitted, not all four.

one = np.random.rand(len(dataset_upsampled)) < .5
a = dataset_upsampled[one]
b = dataset_upsampled[~one]

In [52]:
two = np.random.rand(len(a)) < .5
c = a[two]
d = a[~two]

In [53]:
three = np.random.rand(len(b)) < .5
e = b[three]
f = b[~three]

In [54]:
X1 = c.iloc[:, :-1]
y1 = c['went_on_backorder']

In [55]:
X2 = d.iloc[:, :-1]
y2 = d['went_on_backorder']

In [56]:
X3 = e.iloc[:, :-1]
y3 = e['went_on_backorder']

In [57]:
X4 = f.iloc[:, :-1]
y4 = f['went_on_backorder']

In [58]:
iso_forest = IsolationForest(n_estimators=250,bootstrap=True).fit(X1, y1)

outliers = iso_forest.predict(X1)==-1
X1_iso = X1[~outliers]
y1_iso = y1[~outliers]

pipe_lr.fit(X1_iso, y1_iso)


Pipeline(memory=None,
     steps=[('scl', Normalizer(copy=True, norm='l2')), ('pca', PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [59]:
iso_forest = IsolationForest(n_estimators=250,bootstrap=True).fit(X2, y2)

outliers = iso_forest.predict(X2)==-1
X2_iso = X2[~outliers]
y2_iso = y2[~outliers]

pipe_lr.fit(X2_iso, y2_iso)


Pipeline(memory=None,
     steps=[('scl', Normalizer(copy=True, norm='l2')), ('pca', PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [60]:
iso_forest = IsolationForest(n_estimators=250,bootstrap=True).fit(X3, y3)

outliers = iso_forest.predict(X3)==-1
X3_iso = X3[~outliers]
y3_iso = y3[~outliers]

pipe_lr.fit(X3_iso, y3_iso)


Pipeline(memory=None,
     steps=[('scl', Normalizer(copy=True, norm='l2')), ('pca', PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [61]:
iso_forest = IsolationForest(n_estimators=250,bootstrap=True).fit(X4, y4)

outliers = iso_forest.predict(X4)==-1
X4_iso = X4[~outliers]
y4_iso = y4[~outliers]

pipe_lr.fit(X4_iso, y4_iso)


Pipeline(memory=None,
     steps=[('scl', Normalizer(copy=True, norm='l2')), ('pca', PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
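
Because `fit` starts over on every call, one way to learn from all chunks is an estimator that supports `partial_fit`. The sketch below swaps in `SGDClassifier` (a linear model trained by stochastic gradient descent) in place of `LogisticRegression`, which has no `partial_fit`; this substitution, the synthetic data, and the chunk count are assumptions for illustration, not the method used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the large upsampled training set (assumption).
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# Split row indices into 4 chunks, mirroring the 4 subsets used above.
chunks = np.array_split(np.arange(len(X_demo)), 4)

# First pass: accumulate scaling statistics chunk by chunk.
scaler = StandardScaler()
for idx in chunks:
    scaler.partial_fit(X_demo[idx])

# Second pass: update the same classifier on each chunk in turn.
clf = SGDClassifier(random_state=0)
classes = np.unique(y_demo)  # all labels must be declared for partial_fit
for idx in chunks:
    clf.partial_fit(scaler.transform(X_demo[idx]), y_demo[idx], classes=classes)

accuracy = clf.score(scaler.transform(X_demo), y_demo)
print("accuracy after seeing all chunks:", round(accuracy, 3))
```

Each `partial_fit` call updates the existing coefficients instead of discarding them, so the final model has seen every chunk once.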

### Save the trained model with the pickle library.

In [62]:
# Add code below this comment  (Question #E8009)
# ----------------------------------

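try:
    import joblib as jb  # standalone joblib, used by newer scikit-learn releases
except ImportError:
    from sklearn.externals import joblib as jb  # fallback for older scikit-learn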

jb.dump(pipe_lr, "PipedLogReg.pkl")




['PipedLogReg.pkl']
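
A quick way to gain confidence in a saved model is a round trip: dump it, load it back, and confirm both copies predict identically. The sketch below assumes the standalone `joblib` package is installed; the file name `demo_model.pkl` and the synthetic data are hypothetical.

```python
import os
import tempfile

import joblib  # assumption: standalone joblib is available
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic model to persist (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

# Write to a temporary directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("predictions identical:", same)
```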

### Reload the trained model from the pickle file
### Load the Testing Data and evaluate your model

 * `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`

In [63]:
# Add code below this comment  (Question #E8010)
# ----------------------------------

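try:
    import joblib as jb  # standalone joblib, used by newer scikit-learn releases
except ImportError:
    from sklearn.externals import joblib as jb  # fallback for older scikit-learn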
loaded_model = jb.load('PipedLogReg.pkl')


In [64]:
# We will load the testing dataset and perform the same carpentry we performed on the training set prior to model fitting.

TESTSET = '/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv'
assert os.path.exists(TESTSET)

testset = pd.read_csv(TESTSET, na_values=na_other)




In [65]:
testset = testset.drop(['sku'], axis = 1)

testset = testset.dropna()

yes_no_columns = list(filter(lambda i: testset[i].dtype!=np.float64, testset.columns))
for column_name in yes_no_columns:
    testset[column_name] = testset[column_name].apply(['No', 'Yes'].index)
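
Since the same carpentry is applied to both the training and the testing files, it could be wrapped in a single function. The sketch below is one way to do that; the helper name `prepare_backorder_frame` and the toy frame are hypothetical, and it assumes the -99 placeholders were already turned into NaN at read time via `na_values`.

```python
import numpy as np
import pandas as pd


def prepare_backorder_frame(raw):
    """Hypothetical helper: drop sku, drop rows with NaNs, encode Yes/No as 1/0."""
    df = raw.drop(columns=["sku"]).dropna().copy()
    # Same rule as the notebook: every non-float64 column is a Yes/No flag.
    flag_columns = [c for c in df.columns if df[c].dtype != np.float64]
    for column_name in flag_columns:
        df[column_name] = df[column_name].map({"No": 0, "Yes": 1})
    return df.reset_index(drop=True)


# Toy frame with a missing lead_time and a footer-style row (assumption).
raw = pd.DataFrame({
    "sku": ["1", "2", "3", "(3 rows)"],
    "lead_time": [8.0, np.nan, 2.0, np.nan],
    "deck_risk": ["No", "Yes", "Yes", np.nan],
    "went_on_backorder": ["No", "No", "Yes", np.nan],
})

prepared = prepare_backorder_frame(raw)
print(prepared)
```

Calling one function on both files reduces the risk that the training and testing data drift apart in how they are cleaned.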

In [66]:
testset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 222974 entries, 2 to 242074
Data columns (total 22 columns):
national_inv         222974 non-null float64
lead_time            222974 non-null float64
in_transit_qty       222974 non-null float64
forecast_3_month     222974 non-null float64
forecast_6_month     222974 non-null float64
forecast_9_month     222974 non-null float64
sales_1_month        222974 non-null float64
sales_3_month        222974 non-null float64
sales_6_month        222974 non-null float64
sales_9_month        222974 non-null float64
min_bank             222974 non-null float64
potential_issue      222974 non-null int64
pieces_past_due      222974 non-null float64
perf_6_month_avg     222974 non-null float64
perf_12_month_avg    222974 non-null float64
local_bo_qty         222974 non-null float64
deck_risk            222974 non-null int64
oe_constraint        222974 non-null int64
ppap_risk            222974 non-null int64
stop_auto_buy        222974 non-null int64

In [67]:
testset.isnull().any().any()

False

In [68]:
testset = testset.sample(frac = 1).reset_index(drop=True)

In [69]:
X = testset.iloc[:, :-1]
y = testset['went_on_backorder']

## Test

Test your new model using the testing data set.
 * `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`

In [70]:
# Add code below this comment  (Question #E8011)
# ----------------------------------

loaded_model_predictions = loaded_model.predict(X)

unique_label = np.unique(y)
print(pd.DataFrame(confusion_matrix(y, loaded_model_predictions, labels=unique_label), 
                   index=['true:{:}'.format(x) for x in unique_label], 
                   columns=['pred:{:}'.format(x) for x in unique_label]))

        pred:0  pred:1
true:0  169829   50581
true:1     330    2234


In [71]:
print(classification_report(y, loaded_model_predictions))

             precision    recall  f1-score   support

          0       1.00      0.77      0.87    220410
          1       0.04      0.87      0.08      2564

avg / total       0.99      0.77      0.86    222974



In [72]:
accuracy_score(y, loaded_model_predictions)

0.771672930476199
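
The reported scores follow directly from the confusion matrix above. The short calculation below recomputes them from the four counts, which makes it easier to see why precision for the "Yes" class is so low despite high recall.

```python
# Counts copied from the test-set confusion matrix above.
tn, fp = 169829, 50581   # true "No" rows: predicted No / predicted Yes
fn, tp = 330, 2234       # true "Yes" rows: predicted No / predicted Yes

precision = tp / (tp + fp)   # share of predicted backorders that were real
recall = tp / (tp + fn)      # share of real backorders that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("precision:", round(precision, 2))
print("recall:   ", round(recall, 2))
print("f1:       ", round(f1, 2))
print("accuracy: ", round(accuracy, 3))
```

The 50581 false positives dwarf the 2234 true positives, so only about 4% of the parts flagged as backorders actually went on backorder.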

## Conclusion

## Reflect

Imagine you are a data scientist who has been tasked with developing a system to save your 
company money by predicting and preventing back orders of parts in the supply chain.

Write a **brief summary** for "management" that details your findings, 
your level of certainty and trust in the models, 
and recommendations for operationalizing these models for the business.

# Save your notebook!
## Then `File > Close and Halt`