Although the problem is probably **best seen as a regression problem**, I've chosen to use neural networks for this approach because

**a)** they are nonlinear models, and

**b)** even though we are doing classification, neural nets with a softmax output function produce a probability for *each* class, so we get a probability for every possible output (quantity bucket).
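To make point **b)** concrete, here is a minimal sketch of what a softmax output does. The scores are made-up numbers for three hypothetical quantity buckets, not values from the model trained below.

```python
import numpy as np

def softmax(logits):
    """Map a vector of raw scores to probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# hypothetical raw scores for three quantity buckets (illustration only)
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())
```

Every bucket gets a strictly positive probability, and the largest score gets the largest share.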

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys

# helper functions
sys.path.insert(0, "../src/lib")

import dataset as dataset_funcs
import cleaning as cleaning_funcs

pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [2]:
sales_df = pd.read_csv('../data/raw/sales.csv')
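# pd.to_datetime is used because recent pandas versions reject the unit-less "datetime64" dtype in astype
sales_df["DATE_ORDER"] = pd.to_datetime(sales_df["DATE_ORDER"])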
sales_df["UNIT_PRICE"] = sales_df["REVENUE"] / sales_df["QTY_ORDER"]
sales_df.sample(5)

Unnamed: 0,PROD_ID,DATE_ORDER,QTY_ORDER,REVENUE,UNIT_PRICE
51347,P7,2015-06-10,1.0,799.0,799.0
277848,P8,2015-08-21,1.0,379.0,379.0
66703,P7,2015-07-20,1.0,749.0,749.0
286360,P8,2015-10-13,1.0,379.0,379.0
154504,P7,2015-02-06,1.0,849.0,849.0


As mentioned in the **EXPLORATORY DATA ANALYSIS**, we will remove some bad data from our dataset (using helper functions) so that these errors do not propagate to the model.

In [3]:
sales_df = cleaning_funcs.clean_sales_dataframe(sales_df)
sales_df.shape

(351090, 5)

Let's split the data by product, because each product may have different dynamics.

In [4]:
group_ids = ['P1','P2','P3','P4','P5','P6','P7','P8','P9']
# group by column name; a plain string key keeps get_group(prod_id) working across pandas versions
grouped = sales_df.groupby("PROD_ID")
(p1,p2,p3,p4,p5,p6,p7,p8,p9) = [grouped.get_group(prod_id) for prod_id in group_ids]

Let's use P7 in the first run because it's the product with the most available data.

In [5]:
p7 = p7.sort_values(['DATE_ORDER'])

In [6]:
# now just select the columns we will use in this very simple model
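# .copy() gives an independent frame, so assigning to a column later does not trigger SettingWithCopyWarning
p7 = p7[["UNIT_PRICE","QTY_ORDER"]].copy()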

> I will perform a transformation to make this dataset more amenable to classification.

Although (as I said before) classification may not be the best way to approach this problem, we can make it a little bit better if we reduce the cardinality of the target and we make it more "discrete". In other words, I will reduce the space of possible targets from a continuous range to a discrete set of choices.

Since values larger than 10.0 for the QTY_ORDER are exceedingly rare, let's consider all values larger than or equal to 10 as just 10. 

This change will cause the dataset to have just 9 possible choices for the target, thereby making it more amenable to classification (rather than regression) methods.

In [7]:
p7["QTY_ORDER"] = p7["QTY_ORDER"].apply(lambda qty: 10.0 if qty >= 10.0 else qty)
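As a side note, the same capping can be written with pandas' `Series.clip`, which avoids calling a Python lambda on every row. This is a small sketch on made-up quantities, only to show that both forms agree.

```python
import pandas as pd

# made-up order quantities, including a few large outliers
qty = pd.Series([1.0, 2.0, 9.0, 10.0, 15.0, 120.0])

capped_apply = qty.apply(lambda q: 10.0 if q >= 10.0 else q)  # form used above
capped_clip = qty.clip(upper=10.0)                            # vectorized equivalent

print(capped_clip.tolist())
```

On the real data this would read `p7["QTY_ORDER"] = p7["QTY_ORDER"].clip(upper=10.0)`.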

In [8]:
X,y = dataset_funcs.make_Xy_simple(p7)

In [9]:
X.shape,y.shape

((195938, 1), (195938,))
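The helper `make_Xy_simple` lives in `../src/lib` and is not shown here. Judging only from the shapes above, it probably does something like the sketch below, where `make_Xy_sketch` is a hypothetical stand-in (an assumption, not the actual helper): the unit price becomes a one-column feature matrix and the quantity becomes the target vector.

```python
import pandas as pd

def make_Xy_sketch(df):
    # hypothetical stand-in for dataset_funcs.make_Xy_simple (assumed behavior)
    X = df[["UNIT_PRICE"]].to_numpy()  # 2D array, shape (n_samples, 1)
    y = df["QTY_ORDER"].to_numpy()     # 1D array, shape (n_samples,)
    return X, y

# toy frame with the same two columns as p7
toy = pd.DataFrame({"UNIT_PRICE": [799.0, 749.0, 849.0],
                    "QTY_ORDER": [1.0, 2.0, 1.0]})
X_toy, y_toy = make_Xy_sketch(toy)
print(X_toy.shape, y_toy.shape)
```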

For this approach I will use a simple neural network with softmax output function.

In [10]:
from sklearn.preprocessing import StandardScaler 
from sklearn.neural_network import MLPClassifier

Neural networks train better on standardized inputs, so we need to rescale the data first.

In [11]:
sc = StandardScaler()  
# if we were running a production system, we would obviously just fit on the training data
sc.fit(X)  
X = sc.transform(X)
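For reference, this is roughly what the "fit only on the training data" variant mentioned in the comment could look like. It is a sketch on synthetic prices; the split ratio and variable names are my own choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.uniform(300, 900, size=(200, 1))  # synthetic unit prices
y_demo = rng.integers(1, 4, size=200)          # synthetic quantity buckets

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# learn mean and standard deviation from the training split only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)  # reuse the training statistics
```

This way no information about the held-out rows leaks into the preprocessing step.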

In [12]:
from sklearn.neural_network import MLPClassifier

In [13]:
# tanh gives us more smoothness than the default rectified linear unit (ReLU)
mlp = MLPClassifier(activation='tanh')
mlp.fit(X,y)

MLPClassifier(activation='tanh', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

Since our target column (y) holds several discrete values (categories), sklearn's `MLPClassifier` treats the task as multiclass classification, and its `predict_proba` method returns a probability for every category.

`predict_proba` takes a 2D array of inputs and outputs the predicted probability of *each* category. So if we call `mlp.predict_proba` on a single price such as 1200, we get an array of 9 elements, one per category, in the order given by `mlp.classes_` (sorted ascending): the first element is the probability that QTY_ORDER is 1, the second is the probability of the next quantity bucket, and so on.

> Note that, since the outputs of `predict_proba` are probabilities, they must always sum to 1.

In [14]:
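# caveat: mlp was fit on standardized prices, so a raw price like 1200 should be passed through sc.transform(...) before predicting
mlp.predict_proba(np.array([1200]).reshape(-1,1))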

array([[  8.99470467e-01,   3.21877380e-02,   5.92790407e-03,
          5.47249721e-03,   1.79870732e-03,   2.21117699e-06,
          1.76275238e-02,   3.65303541e-02,   9.82597760e-04]])
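To read such an array, it helps to pair each probability with its label from `classes_`. Here is a small self-contained sketch on synthetic data (the toy prices, quantities and the names `clf` and `scaler` are assumptions for illustration, not the model above). Note that the query price goes through the same scaler that was used for training.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
prices = rng.uniform(300, 1300, size=(300, 1))  # synthetic unit prices
# toy rule: cheaper items are ordered in larger quantities
qty = np.where(prices[:, 0] < 600, 3.0,
               np.where(prices[:, 0] < 900, 2.0, 1.0))

scaler = StandardScaler().fit(prices)
clf = MLPClassifier(activation="tanh", max_iter=300, random_state=0)
clf.fit(scaler.transform(prices), qty)

# scale the query exactly like the training data, then predict
query = scaler.transform(np.array([[1200.0]]))
probs = clf.predict_proba(query)[0]

for label, p in zip(clf.classes_, probs):
    print(f"P(QTY_ORDER = {label:g}) = {p:.3f}")
```

The columns follow `clf.classes_`, so the mapping stays correct even when some quantity never appears in the training data.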

Comments: this solution is somewhat unstable (it may yield different results every time it's run) because training a neural net is stochastic: the weights are initialized randomly and the optimizer works on shuffled mini-batches, as in SGD (stochastic gradient descent).
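One way to make runs repeatable is to fix `random_state`, which seeds both the weight initialization and the mini-batch shuffling in `MLPClassifier`. Below is a minimal sketch on synthetic data; the seed value and the helper `fit_and_predict` are arbitrary choices of mine.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))       # synthetic, already standardized feature
y_demo = (X_demo[:, 0] > 0).astype(int)  # toy binary target

def fit_and_predict(seed):
    # hypothetical helper: train a fresh network with the given seed
    clf = MLPClassifier(activation="tanh", max_iter=200, random_state=seed)
    clf.fit(X_demo, y_demo)
    return clf.predict_proba(np.array([[0.5]]))

same_a = fit_and_predict(seed=42)
same_b = fit_and_predict(seed=42)
print(np.allclose(same_a, same_b))
```

Fixing the seed makes one run repeatable; it does not make the model less sensitive to the seed, so comparing a few different seeds is still worthwhile.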