# Pre-processing and Modeling

* [1. Pre-processing](#1)
    * [1.1 The goal](#1.1)
    * [1.2 Read in reduced input files](#1.2)
    * [1.3 Prepare X array](#1.3)
    * [1.4 Prepare y array](#1.4)
    * [1.5 Save X and y to the file](#1.5)
* [2. Modeling](#2)
    * [2.1 A simple LSTM](#2.1)
        * [2.1.1 Regression](#2.1.1)
        * [2.1.2 Classification](#2.1.2)
    * [2.2 Stacked LSTM](#2.2)

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl # colors
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
from datetime import datetime as dt
import pickle
import sys

plt.rcParams['axes.grid'] = True
plt.rcParams['axes.axisbelow'] = True
plt.rcParams['axes.xmargin'] = 0.02
plt.rcParams['axes.ymargin'] = 0.1
plt.rcParams['xtick.minor.visible'] = True
plt.rcParams['ytick.minor.visible'] = True

plt.rcParams['figure.dpi'] = 100
plt.rcParams['savefig.bbox'] = 'tight'

In [2]:
# how many products allowed per order
max_n_products = 80

We are trying to predict the content of a customer's next basket with a recurrent neural network, specifically a Long Short-Term Memory (LSTM) model. The first part of this notebook prepares the data; the second part covers the modeling.

## 1. Pre-processing <a id='1'></a>

### 1.1 The goal <a id='1.1'></a>

model(X) = y

The input X to the LSTM model is a 3-D array (experiments, time_dependence, features), while the output y is a 2-D array (experiments, features). The time_dependence and features vectors need to have the same size for every entry in experiments.

In our case, the desired input and output data should be of the format X=(clients,orders,products) and y=(clients,products), respectively. 

But the data we currently have are (clients, orders_c) in the orders file and (orders_c, products_o) in the products_prior and products_train files. The subscripts *_c and *_o denote that the corresponding vectors do not have a fixed size, but vary with each customer and each order, respectively. The products_prior file contains the information related to X, while the products_train file relates to y.
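To make the target layout concrete, here is a minimal sketch of turning variable-length baskets into a fixed-size, zero-padded 3-D array. The `baskets` dictionary and the name `X_toy` are made up for illustration, and the sketch assumes that 0 is never a valid product ID.

```python
import numpy as np

# Hypothetical toy data: each client has a variable number of orders,
# and each order holds a variable number of product IDs.
baskets = {
    1: [[196, 14084], [196]],    # client 1: two orders
    2: [[32792, 47766, 20574]],  # client 2: one order
}

n_orders = max(len(orders) for orders in baskets.values())
n_products = max(len(o) for orders in baskets.values() for o in orders)

# Zero stands for "no product" (assumption: 0 is not a real product ID).
X_toy = np.zeros((len(baskets), n_orders, n_products), dtype=np.uint16)
for c, orders in enumerate(baskets.values()):
    for t, products in enumerate(orders):
        X_toy[c, t, :len(products)] = products

print(X_toy.shape)  # (clients, orders, products)
```

The notebook reaches the same layout with pandas pivots instead of Python loops, which scales better to millions of rows.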

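In the following we will pad these variable-length vectors with zeros to obtain fixed-size arrays, first for X and then for y.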


### 1.2 Read in reduced input files <a id='1.2'></a>

Let's read in the files, set up some memory monitoring, and keep from the orders file only those orders that belong to the train sample (prior and next orders).

In [3]:
orders         = pd.read_csv("./data/orders.csv",usecols=['order_id','user_id','eval_set','order_number'])
products_prior = pd.read_csv("./data/order_products__prior.csv",usecols=['order_id','product_id','add_to_cart_order'])
products_train = pd.read_csv("./data/order_products__train.csv",usecols=['order_id','product_id','add_to_cart_order'])
#products_sub   = pd.read_csv("./data/sample_submission.csv")

print(orders.shape)
print(products_prior.shape) 
print(products_train.shape) 
#print(products_sub.shape)

(3421083, 4)
(32434489, 3)
(1384617, 3)


In [4]:
def size(x,n=2):
    return '{} G'.format(round(sys.getsizeof(x)/1_000_000_000.,n))

In [5]:
print(size(orders))
print(size(products_prior))
print(size(products_train))

0.29 G
0.78 G
0.03 G


In [6]:
round(orders.memory_usage(deep=True).sum()/1_000_000_000.,2)

0.29
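`read_csv` falls back to 64-bit integers and Python strings when no dtypes are given. The sketch below shows one possible way to request narrower dtypes at read time. The in-memory CSV and the name `small` are illustrative, and the chosen dtypes are an assumption based on the value ranges printed in this notebook; whether this helps in the later pivot steps has not been measured here.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for orders.csv (illustrative rows only).
csv = io.StringIO(
    "order_id,user_id,eval_set,order_number\n"
    "2539329,1,prior,1\n"
    "2398795,1,prior,2\n"
    "1187899,1,train,11\n"
)

small = pd.read_csv(
    csv,
    dtype={
        "order_id": "uint32",      # order IDs are in the millions
        "user_id": "uint32",       # about 200k users
        "eval_set": "category",    # only a handful of distinct strings
        "order_number": "uint8",   # at most ~100 orders per user
    },
)

print(small.dtypes)
print(small.memory_usage(deep=True))
```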

In [7]:
orders.head(20)

Unnamed: 0,order_id,user_id,eval_set,order_number
0,2539329,1,prior,1
1,2398795,1,prior,2
2,473747,1,prior,3
3,2254736,1,prior,4
4,431534,1,prior,5
5,3367565,1,prior,6
6,550135,1,prior,7
7,3108588,1,prior,8
8,2295261,1,prior,9
9,2550362,1,prior,10


In [8]:
products_prior.head()

Unnamed: 0,order_id,product_id,add_to_cart_order
0,2,33120,1
1,2,28985,2
2,2,9327,3
3,2,45918,4
4,2,30035,5


In [9]:
products_train.head()

Unnamed: 0,order_id,product_id,add_to_cart_order
0,1,49302,1
1,1,11109,2
2,1,10246,3
3,1,49683,4
4,1,43633,5


Let's separate the customers into those whose data are used for training and those used for testing/prediction, and keep only the training ones:

In [10]:
uid = orders.groupby('user_id')['eval_set'].unique()
uid

user_id
1         [prior, train]
2         [prior, train]
3          [prior, test]
4          [prior, test]
5         [prior, train]
               ...      
206205    [prior, train]
206206     [prior, test]
206207     [prior, test]
206208     [prior, test]
206209    [prior, train]
Name: eval_set, Length: 206209, dtype: object

In [11]:
uid.astype('str').str.contains('train')

user_id
1          True
2          True
3         False
4         False
5          True
          ...  
206205     True
206206    False
206207    False
206208    False
206209     True
Name: eval_set, Length: 206209, dtype: bool

In [12]:
uid_train = uid[uid.astype('str').str.contains('train')].reset_index()
uid_train = uid_train.drop('eval_set',axis=1)
uid_train

Unnamed: 0,user_id
0,1
1,2
2,5
3,7
4,8
...,...
131204,206199
131205,206200
131206,206203
131207,206205


In [13]:
size(uid_train,3)

'0.001 G'

Now merge back with orders. The result contains only the prior and next orders of customers in the training sample; the test subsample (both prior and next orders) is dropped.
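The same selection can also be written with a boolean mask instead of a groupby followed by a merge. This is only an alternative sketch on a made-up frame (`orders_toy` is hypothetical), not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the orders table.
orders_toy = pd.DataFrame({
    "order_id": [10, 11, 12, 20, 21],
    "user_id":  [1, 1, 1, 3, 3],
    "eval_set": ["prior", "prior", "train", "prior", "test"],
})

# Users that have a 'train' order ...
train_users = orders_toy.loc[orders_toy["eval_set"] == "train", "user_id"].unique()

# ... and all of their orders (prior and train).
orders_train_toy = orders_toy[orders_toy["user_id"].isin(train_users)]
print(orders_train_toy)
```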

In [14]:
orders_train = uid_train.merge(orders,on='user_id',how='left')
orders_train.head(50)

Unnamed: 0,user_id,order_id,eval_set,order_number
0,1,2539329,prior,1
1,1,2398795,prior,2
2,1,473747,prior,3
3,1,2254736,prior,4
4,1,431534,prior,5
5,1,3367565,prior,6
6,1,550135,prior,7
7,1,3108588,prior,8
8,1,2295261,prior,9
9,1,2550362,prior,10


In [15]:
orders_train.shape

(2178586, 4)

In [16]:
size(orders_train)

'0.2 G'

Finally, split into X and y. X corresponds to prior orders, y to train orders (if confused, think of "train" as "next" orders in the train sample).

In [17]:
orders_train_X = orders_train[orders_train['eval_set']=='prior'].drop('eval_set',axis=1)
orders_train_y = orders_train[orders_train['eval_set']=='train'].drop('eval_set',axis=1)

print(orders_train_X.shape,orders_train_y.shape)

(2047377, 3) (131209, 3)


In [18]:
size(orders_train_X),size(orders_train_y,3)

('0.07 G', '0.004 G')

In [19]:
#del orders_train

### 1.3 Prepare X array <a id='1.3'></a>

In [20]:
orders_train_X

Unnamed: 0,user_id,order_id,order_number
0,1,2539329,1
1,1,2398795,2
2,1,473747,3
3,1,2254736,4
4,1,431534,5
...,...,...,...
2178580,206209,2558525,9
2178581,206209,2266710,10
2178582,206209,1854736,11
2178583,206209,626363,12


In [21]:
size(orders_train_X)

'0.07 G'

#### 1.3.1 orders_c to orders <a id='1.3.1'></a>

Let's pivot the X frame and stack it back to turn the variable-length "orders_c" vectors into a fixed-size "orders" vector. The size is set by the longest "orders_c" vector in the file; missing orders are filled with zeros.

In [22]:
orders_train_X = orders_train_X.pivot(index='user_id',columns='order_number',values='order_id').fillna(0)
orders_train_X.head(25)

order_number,1,2,3,4,5,6,7,8,9,10,...,90,91,92,93,94,95,96,97,98,99
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,2539329.0,2398795.0,473747.0,2254736.0,431534.0,3367565.0,550135.0,3108588.0,2295261.0,2550362.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2168274.0,1501582.0,1901567.0,738281.0,1673511.0,1199898.0,3194192.0,788338.0,1718559.0,1447487.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2717275.0,1909121.0,2267326.0,157374.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2565571.0,2402008.0,121053.0,1695742.0,3321109.0,1682003.0,1014379.0,776265.0,2507052.0,1396119.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,600894.0,2979257.0,2570360.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,280530.0,2762092.0,1830137.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,1224907.0,68288.0,2115522.0,83395.0,1353310.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,2618231.0,2560699.0,2288946.0,19256.0,1378982.0,2363981.0,1906169.0,1757094.0,519471.0,2298068.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,2906263.0,3206747.0,1366559.0,1385564.0,1691859.0,2127409.0,2505309.0,3067644.0,2109666.0,1625325.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,1737705.0,1681401.0,2680214.0,3197376.0,3237467.0,2616505.0,2648319.0,2430354.0,2373492.0,900554.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
orders_train_X.shape, size(orders_train_X)

((131209, 99), '0.1 G')

In [24]:
orders_train_X = orders_train_X.stack()
orders_train_X.name = 'order_id'

In [25]:
orders_train_X.shape,size(orders_train_X)

((12989691,), '0.18 G')

In [26]:
orders_train_X = orders_train_X.reset_index()
orders_train_X.head(50)

Unnamed: 0,user_id,order_number,order_id
0,1,1,2539329.0
1,1,2,2398795.0
2,1,3,473747.0
3,1,4,2254736.0
4,1,5,431534.0
5,1,6,3367565.0
6,1,7,550135.0
7,1,8,3108588.0
8,1,9,2295261.0
9,1,10,2550362.0


In [27]:
orders_train_X.groupby('user_id')['order_number'].count().head()

user_id
1    99
2    99
5    99
7    99
8    99
Name: order_number, dtype: int64

We can see zeros among order_id for a given user, and the length of orders array is always 99. Cool, that's what we wanted.
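The pivot-and-stack padding can be checked on a toy frame. The data below are invented for illustration; only the sequence of pandas calls mirrors the cells above.

```python
import pandas as pd

# Two users with 3 and 1 orders, respectively (made-up IDs).
toy = pd.DataFrame({
    "user_id":      [1, 1, 1, 2],
    "order_number": [1, 2, 3, 1],
    "order_id":     [101, 102, 103, 201],
})

# Wide format: one column per order_number, missing orders become 0.
wide = toy.pivot(index="user_id", columns="order_number", values="order_id").fillna(0)

# Back to long format: every user now has the same number of rows.
long = wide.stack().rename("order_id").reset_index()
counts = long.groupby("user_id")["order_number"].count()

print(long)
print(counts)
```

Every user ends up with as many rows as the longest order history, and the padded rows carry `order_id == 0`.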

#### 1.3.2 products_o to products <a id='1.3.2'></a>

Let's modify the products file. We don't want the matrix to be too sparse, so for each order we keep only the first 80 products (max_n_products). See the plot and discussion in Sec. 3.5.2.1 of the [previous notebook](./data_wringling_and_EDA.ipynb#3.5.2.1).

In [28]:
products_prior = products_prior[products_prior['add_to_cart_order']<= max_n_products]

In [29]:
size(products_prior)

'1.04 G'

Now, merge products with orders and pivot to change to fixed size products array. 

In [30]:
orders_train_X.shape,size(orders_train_X)

((12989691, 3), '0.31 G')

In [31]:
size(orders_train_X)

'0.31 G'

In [32]:
def convert_step2(x,p):
    x = x.merge(p,on='order_id',how='left').fillna(0)
    
    x = x.pivot(index=['user_id','order_number'],columns='add_to_cart_order',values='product_id').fillna(0)
    
    return x

In [33]:
%%time
orders_train_X2 = convert_step2(orders_train_X,products_prior)
orders_train_X2.head(25)

CPU times: user 46.6 s, sys: 26.1 s, total: 1min 12s
Wall time: 1min 44s


Unnamed: 0_level_0,add_to_cart_order,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,71.0,72.0,73.0,74.0,75.0,76.0,77.0,78.0,79.0,80.0
user_id,order_number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
1,1,0.0,196.0,14084.0,12427.0,26088.0,26405.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0.0,196.0,10258.0,12427.0,13176.0,26088.0,13032.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3,0.0,196.0,12427.0,10258.0,25133.0,30450.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,4,0.0,196.0,12427.0,10258.0,25133.0,26405.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5,0.0,196.0,12427.0,10258.0,25133.0,10326.0,17122.0,41787.0,13176.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,6,0.0,196.0,12427.0,10258.0,25133.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,7,0.0,196.0,10258.0,12427.0,25133.0,13032.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,8,0.0,12427.0,196.0,10258.0,25133.0,46149.0,49235.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,9,0.0,49235.0,46149.0,25133.0,196.0,10258.0,12427.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,10,0.0,196.0,46149.0,39657.0,38928.0,25133.0,10258.0,35951.0,13032.0,12427.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
del orders_train_X

In [35]:
orders_train_X2 = orders_train_X2.drop(orders_train_X2.columns[0],axis=1)

In [36]:
orders_train_X2.shape, size(orders_train_X2)

((12989691, 80), '8.38 G')

Cool. Our dataframe size changed from 0.3 G to 8.4 G, but that's still OK for our memory.
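The extra first column (labelled 0.0) that was dropped above comes from the padded orders: they have no match in the products table, so after the left merge and `fillna(0)` their `add_to_cart_order` is 0. A small sketch with invented data shows the effect:

```python
import pandas as pd

# One real order (101) and one padded order (order_id 0.0) for a single user.
orders_toy = pd.DataFrame({
    "user_id":      [1, 1],
    "order_number": [1, 2],
    "order_id":     [101.0, 0.0],
})
products_toy = pd.DataFrame({
    "order_id":          [101, 101],
    "product_id":        [196, 14084],
    "add_to_cart_order": [1, 2],
})

merged = orders_toy.merge(products_toy, on="order_id", how="left").fillna(0)
wide = merged.pivot(index=["user_id", "order_number"],
                    columns="add_to_cart_order",
                    values="product_id").fillna(0)

print(wide)                                # columns 0.0, 1.0, 2.0
trimmed = wide.drop(columns=wide.columns[0])
print(trimmed.shape)
```

Column 0.0 only ever contains zeros, so dropping it loses no information.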

In [37]:
#id = 1
#orders_train_X2.iloc[99*(id-1):99*(id-1)+20]

In [38]:
#%whos DataFrame

In [39]:
#print(size(orders))
#print(size(orders_train))
#print(size(orders_train_X2))
#print(size(orders_train_y,3))
#print(size(products_prior))
#print(size(products_train))
#print(size(uid_train,3))

Now time to convert to numpy array. Of what type? We know that we have only positive integers in the dataset. What is the largest value?

In [40]:
orders_train_X2.min().min(),orders_train_X2.max().max()

(0.0, 49688.0)

49k is still below 65k, the largest value that uint16 can hold (65535). Let's use the uint16 data type then.
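The limit can be read from NumPy directly rather than remembered. The short check below also shows why it matters: casting an integer that is too large does not raise, it wraps around. The value 70000 is an arbitrary example, not a value from the dataset.

```python
import numpy as np

max_product_id = 49688                 # largest value found in the frame above
limit = np.iinfo(np.uint16).max        # 65535
print(limit, max_product_id <= limit)

# An out-of-range integer is silently wrapped modulo 65536.
wrapped = np.array([70000]).astype(np.uint16)
print(wrapped)                         # 70000 - 65536 = 4464
```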

In [41]:
X = orders_train_X2.to_numpy(dtype=np.uint16) # max uint16 = 65535
X

array([[  196, 14084, 12427, ...,     0,     0,     0],
       [  196, 10258, 12427, ...,     0,     0,     0],
       [  196, 12427, 10258, ...,     0,     0,     0],
       ...,
       [    0,     0,     0, ...,     0,     0,     0],
       [    0,     0,     0, ...,     0,     0,     0],
       [    0,     0,     0, ...,     0,     0,     0]], dtype=uint16)

In [42]:
del orders_train_X2

In [43]:
print(X.nbytes/10**9,'G')

2.07835056 G


Very good, we reduced the size of X to 2G!
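The gain follows directly from the bytes per element: float64 takes 8 bytes, uint16 takes 2. A quick back-of-the-envelope check with the shape printed above:

```python
n_rows, n_cols = 12_989_691, 80

bytes_float64 = n_rows * n_cols * 8   # values as stored in the DataFrame
bytes_uint16 = n_rows * n_cols * 2    # values after conversion

print(bytes_float64 / 10**9, "G")
print(bytes_uint16 / 10**9, "G")
print(bytes_float64 // bytes_uint16, "x smaller")
```

The float64 estimate is slightly below the 8.38 G reported for the DataFrame; the difference is presumably the index, which this estimate ignores.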

What about the shape?

In [44]:
X.shape

(12989691, 80)

We need to reshape it to the desired 3-D format:

In [45]:
X = np.reshape(X,(-1,99,80))
X.shape

(131209, 99, 80)

In [46]:
X

array([[[  196, 14084, 12427, ...,     0,     0,     0],
        [  196, 10258, 12427, ...,     0,     0,     0],
        [  196, 12427, 10258, ...,     0,     0,     0],
        ...,
        [    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0]],

       [[32792, 47766, 20574, ...,     0,     0,     0],
        [16797, 47526,  8479, ...,     0,     0,     0],
        [47766, 32792, 20574, ...,     0,     0,     0],
        ...,
        [    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0]],

       [[15349, 21413, 48775, ...,     0,     0,     0],
        [43693,  6808, 24535, ...,     0,     0,     0],
        [26604, 11777, 20754, ...,     0,     0,     0],
        ...,
        [    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,    

Excellent. Now the only remaining step is to shift, for each customer, the last order to the [-1] position in the orders dimension, the second-to-last order to the [-2] position, and so on. This is necessary to properly simulate the time dependence of the orders relative to the next order.

In [47]:
# we need to roll in the 2nd dimension, let's transpose to 3rd dimension and roll on a 1-d array
X2 = X.transpose((0,2,1))

for i,j in np.ndindex(X2.shape[:2]):
    
    if i < 3 and j==0:
        print('-------')
        print('i=',i,'j=',j,'X2[i,j].shape=',X2[i,j].shape)
        print(X2[i,j],'<- before')
    
    if np.any(X2[i,j]): # only non-zero rows, do nothing for all-zeros
        lnz = np.max(np.nonzero(X2[i,j]))
        X2[i,j] = np.roll(X2[i,j], X2[i,j].shape[0]-1-lnz)

        if i < 3 and j==0:
            print('index of last non-zero=',lnz)
            print(X2[i,j],'<- after')
            
X = X2.transpose((0,2,1))

-------
i= 0 j= 0 X2[i,j].shape= (99,)
[  196   196   196   196   196   196   196 12427 49235   196     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0] <- before
index of last non-zero= 9
[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     

In [48]:
X

array([[[    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0],
        ...,
        [12427,   196, 10258, ...,     0,     0,     0],
        [49235, 46149, 25133, ...,     0,     0,     0],
        [  196, 46149, 39657, ...,     0,     0,     0]],

       [[    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0],
        ...,
        [32792, 47209, 22124, ...,     0,     0,     0],
        [42342, 30908,    79, ...,     0,     0,     0],
        [24852, 16589,  1559, ...,     0,     0,     0]],

       [[    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0],
        ...,
        [43693,  6808, 24535, ...,     0,     0,     0],
        [26604, 11777, 20754, ...,     0,    

Ok, looks good.
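One point worth double-checking: the loop above computes the shift separately for every (customer, product slot) pair. If a customer's last order is shorter than an earlier one, the higher product slots are shifted further than the lower ones, and products from different orders can end up in the same row. A possible alternative, sketched here under the assumption that each customer's real orders occupy the leading rows, shifts whole order rows per customer. The function name `right_align_orders` is hypothetical.

```python
import numpy as np

def right_align_orders(X):
    """Move each customer's orders to the end of the orders axis, keeping rows intact."""
    out = np.zeros_like(X)
    for i in range(X.shape[0]):
        has_order = X[i].any(axis=1)      # True for rows that are not all zeros
        n = int(has_order.sum())
        if n:
            out[i, -n:] = X[i][has_order]  # keep chronological order
    return out

# One customer, two orders: the first has 3 products, the last only 1.
X_toy = np.array([[[5, 6, 7],
                   [8, 0, 0],
                   [0, 0, 0]]], dtype=np.uint16)

print(right_align_orders(X_toy)[0])
```

With the per-slot roll, the last row of this toy example would become `[8, 6, 7]`, mixing the two orders; the row-wise version keeps `[8, 0, 0]` as the most recent order.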

### 1.4 Prepare y array <a id='1.4'></a>

Let's repeat the same steps for the y array. This one should be much quicker, as it has one dimension fewer to consider.

In [49]:
size(orders_train_y,3)

'0.004 G'

In [50]:
orders_train_y = orders_train_y.drop('order_number',axis=1)
orders_train_y

Unnamed: 0,user_id,order_id
10,1,1187899
25,2,1492625
30,5,2196797
51,7,525192
55,8,880375
...,...,...
2178537,206199,2585586
2178561,206200,943915
2178567,206203,2371631
2178571,206205,1716008


#### 1.4.1 products_o to products <a id='1.4.1'></a>

For y the merging needs to be done with the products_train dataset. Again we keep only the first 80 products in each order. Let's pivot to change from variable to fixed size.

In [51]:
products_train = products_train[products_train['add_to_cart_order']<=max_n_products]

In [52]:
def convert_y(y,p):
    y = y.merge(p,on='order_id')
    y = y.pivot(index=['user_id','order_id'],columns='add_to_cart_order',values='product_id').fillna(0)
    return y

In [53]:
orders_train_y = convert_y(orders_train_y,products_train)
orders_train_y.head(50)

Unnamed: 0_level_0,add_to_cart_order,1,2,3,4,5,6,7,8,9,10,...,71,72,73,74,75,76,77,78,79,80
user_id,order_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
1,1187899,196.0,25133.0,38928.0,26405.0,39657.0,10258.0,13032.0,26088.0,27845.0,49235.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1492625,22963.0,7963.0,16589.0,32792.0,41787.0,22825.0,13640.0,24852.0,45066.0,9387.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2196797,15349.0,19057.0,16185.0,21413.0,20843.0,20114.0,48204.0,40706.0,21616.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,525192,12053.0,47272.0,37999.0,13198.0,43967.0,40852.0,17638.0,29894.0,45066.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,880375,15937.0,5539.0,10960.0,23165.0,22247.0,4853.0,27104.0,7058.0,41259.0,37803.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1094988,27555.0,42347.0,27596.0,8834.0,26604.0,12075.0,8467.0,38988.0,30252.0,18926.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,1822501,29650.0,48720.0,24654.0,10177.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,1827621,27435.0,27086.0,4210.0,47078.0,19934.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,2316178,11042.0,32115.0,28601.0,29615.0,15869.0,37434.0,3808.0,15172.0,8744.0,29509.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,2180313,18534.0,1217.0,12720.0,16797.0,43352.0,4374.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [54]:
orders_train_y.shape

(131209, 80)

In [55]:
size(orders_train_y,3)

'0.098 G'

In [56]:
orders_train_y.min().min(),orders_train_y.max().max()

(0.0, 49688.0)

The max value in the dataframe is smaller than 65k, so again, let's choose uint16.

In [57]:
y = orders_train_y.to_numpy(dtype=np.uint16)
y

array([[  196, 25133, 38928, ...,     0,     0,     0],
       [22963,  7963, 16589, ...,     0,     0,     0],
       [15349, 19057, 16185, ...,     0,     0,     0],
       ...,
       [15693, 37188, 21469, ...,     0,     0,     0],
       [27845, 21137, 28745, ...,     0,     0,     0],
       [ 6846,  9405, 24852, ...,     0,     0,     0]], dtype=uint16)

### 1.5 Save X and y to the file <a id='1.5'></a>

In [58]:
X.shape,y.shape

((131209, 99, 80), (131209, 80))

In [59]:
print(X.nbytes/10**9,y.nbytes/10**9,'G')

2.07835056 0.02099344 G


In [60]:
with open('./data/xy_train.pkl', 'wb') as f:
    pickle.dump((X,y), f)
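Since X is mostly zeros, NumPy's compressed archive format may give a smaller file than pickle. The sketch below is an alternative, not what this notebook uses; the toy arrays and the temporary path are placeholders, and the actual size gain on the real data has not been measured here.

```python
import os
import tempfile
import numpy as np

# Small stand-ins for the real X and y.
X_toy = np.zeros((4, 3, 2), dtype=np.uint16)
y_toy = np.ones((4, 2), dtype=np.uint16)

path = os.path.join(tempfile.mkdtemp(), "xy_train_toy.npz")
np.savez_compressed(path, X=X_toy, y=y_toy)

# Arrays come back with their original shape and dtype.
with np.load(path) as data:
    X_back, y_back = data["X"], data["y"]

print(X_back.shape, X_back.dtype, y_back.shape)
```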

In [61]:
#!ls -lrt ./data/**pkl

## 2. Modeling <a id='2'></a>

Let's prepare the libraries to run the simplest LSTM model.

In [2]:
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import RepeatVector
from keras.layers import TimeDistributed

from keras.callbacks import EarlyStopping
from keras.models import load_model

In [3]:
with open('./data/xy_train.pkl', 'rb') as f:
    X,y = pickle.load(f)

In [4]:
X.shape,y.shape

((131209, 99, 80), (131209, 80))

In [5]:
print(X.nbytes/10**9,y.nbytes/10**9,'G')

2.07835056 0.02099344 G


Let's split the data, holding out 25% of the training sample for local model testing.

In [6]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=123)

In [7]:
del X, y

In [8]:
X_train.shape,y_train.shape

((98406, 99, 80), (98406, 80))

In [9]:
X_test.shape,y_test.shape

((32803, 99, 80), (32803, 80))

### 2.1 A simple LSTM <a id='2.1'></a>

Cool. Now let's build the first model. A simple LSTM.

In [10]:
model = Sequential()
model.add(LSTM(80, activation='relu', input_shape=(99, 80),return_sequences=False))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 80)                51520     
Total params: 51,520
Trainable params: 51,520
Non-trainable params: 0
_________________________________________________________________


#### 2.1.1 Regression <a id='2.1.1'></a>

In [11]:
model.compile(optimizer='adam',loss='mse')

In [12]:
%%time
h = model.fit(X_train, y_train,validation_data=(X_test, y_test),epochs=3)#50,callbacks=[callback])

Epoch 1/3
Epoch 2/3
Epoch 3/3
CPU times: user 16min 48s, sys: 2min 14s, total: 19min 2s
Wall time: 5min 19s


In [13]:
y_pred_train = model.predict(X_train)
y_pred_test  = model.predict(X_test)

In [14]:
y_pred_train.shape,y_pred_test.shape

((98406, 80), (32803, 80))

In [15]:
#train

In [16]:
y_train[0]

array([27817, 44116, 35050,  5303, 43654, 47626, 45007, 44514, 44098,
         329, 28132,  2164,  8758, 18370, 23734,  4947,  2078, 41329,
       30489, 36695, 41273, 25931, 19053, 28601, 23375, 46720, 27156,
        7628,   691,  9531, 12206, 48697,    45,  7308, 38217,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0],
      dtype=uint16)

In [17]:
y_pred_train[0].astype(int)

array([23747, 42457,     0, 28073,  7846,     0,  9087,     0, 26296,
           0,     0,     0,     0, 18730,     0,     0, 26298, 19102,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,  1803,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0])

In [18]:
#test

In [19]:
y_test[0]

array([ 9166, 49191, 34604,  9826, 33957,  9124,  2103,  7301,  1579,
        9996, 37065, 43789, 38773, 19068,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0],
      dtype=uint16)

In [20]:
y_pred_test[0].astype(int)

array([14948, 19853, 35729, 26982, 12702,     0, 22150,     0,     0,
           0,     0,     0,     0, 18417,     0,     0,  1199,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0, 14755,     0])

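Hmm, not a single predicted product ID matches the actual basket. Not good yet. Note that the MSE loss treats product IDs as numeric magnitudes, although they are only labels.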
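Rather than comparing arrays by eye, the agreement between a predicted and an actual basket can be quantified. The sketch below uses the F1 score over sets of product IDs; this choice of metric and the function name `basket_f1` are assumptions for illustration, not something defined elsewhere in this notebook.

```python
import numpy as np

def basket_f1(y_true, y_pred):
    """F1 overlap between two baskets, ignoring the padding value 0 and item order."""
    true_set = set(np.asarray(y_true).astype(int).tolist()) - {0}
    pred_set = set(np.rint(y_pred).astype(int).tolist()) - {0}
    hits = len(true_set & pred_set)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_set)
    recall = hits / len(true_set)
    return 2 * precision * recall / (precision + recall)

# Invented example: one of two predicted products is in the true basket of three.
score = basket_f1([196, 25133, 38928, 0], [196.2, 12.0, 0.0, 0.0])
print(round(score, 3))
```

Averaging this score over all customers would give a single number to compare models with.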

#### 2.1.2 Classification <a id='2.1.2'></a>

In [37]:
model.compile(optimizer='adam',loss='categorical_crossentropy', metrics=['accuracy'])

In [38]:
%time
h2 = model.fit(X_train, y_train,validation_data=(X_test, y_test),epochs=3)

CPU times: user 6 µs, sys: 1e+03 ns, total: 7 µs
Wall time: 13.1 µs
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [39]:
y_pred_train = model.predict(X_train)
y_pred_test  = model.predict(X_test)

In [40]:
y_pred_train.shape,y_pred_test.shape

((98406, 80), (32803, 80))

In [25]:
#train

In [41]:
y_train[0]

array([27817, 44116, 35050,  5303, 43654, 47626, 45007, 44514, 44098,
         329, 28132,  2164,  8758, 18370, 23734,  4947,  2078, 41329,
       30489, 36695, 41273, 25931, 19053, 28601, 23375, 46720, 27156,
        7628,   691,  9531, 12206, 48697,    45,  7308, 38217,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0],
      dtype=uint16)

In [45]:
y_pred_train[0]

array([0.0000000e+00, 4.2134872e+26, 4.0348615e+26, 3.7858474e+26,
       0.0000000e+00, 3.0835876e+26, 0.0000000e+00, 2.5266079e+26,
       2.2706776e+26, 0.0000000e+00, 0.0000000e+00, 1.6058394e+26,
       0.0000000e+00, 0.0000000e+00, 1.0811509e+26, 9.1862203e+25,
       0.0000000e+00, 0.0000000e+00, 6.7748084e+25, 1.9364758e+00,
       5.2561571e+25, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 3.2255732e+25, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 1.1298065e-01, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       6.9757147e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 5.3478134e-01, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 3.6851361e-01,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e

In [28]:
#test

In [43]:
y_test[0]

array([ 9166, 49191, 34604,  9826, 33957,  9124,  2103,  7301,  1579,
        9996, 37065, 43789, 38773, 19068,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0],
      dtype=uint16)

In [46]:
y_pred_test[0]

array([0.0000000e+00, 4.2134872e+26, 4.0348615e+26, 3.7858474e+26,
       0.0000000e+00, 3.0835876e+26, 0.0000000e+00, 2.5266079e+26,
       2.2706776e+26, 0.0000000e+00, 0.0000000e+00, 1.6058394e+26,
       0.0000000e+00, 0.0000000e+00, 1.0811509e+26, 9.1862203e+25,
       0.0000000e+00, 0.0000000e+00, 6.7748084e+25, 1.9364758e+00,
       5.2561571e+25, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 3.2255732e+25, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 1.1298065e-01, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       6.9757147e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 5.3478134e-01, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 3.6851361e-01,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e

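Even worse here: the predictions shown for the train and test samples are identical and reach values of order 1e26. This is not surprising, since categorical cross-entropy expects one-hot (or probability) targets and a probability-like output, while here the targets are raw product IDs and the output comes from a ReLU LSTM layer.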
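For a classification setup, the targets would first have to be encoded as one indicator per product. A minimal sketch of such a multi-hot encoding follows; the function name `to_multi_hot` is hypothetical, and pairing it with a sigmoid output and a binary cross-entropy loss is a common choice for multi-label problems rather than something tested in this notebook.

```python
import numpy as np

def to_multi_hot(y, n_products):
    """Turn padded product-ID rows into 0/1 indicator rows of length n_products + 1."""
    out = np.zeros((y.shape[0], n_products + 1), dtype=np.uint8)
    rows = np.repeat(np.arange(y.shape[0]), y.shape[1])
    out[rows, y.ravel().astype(np.intp)] = 1
    out[:, 0] = 0                      # index 0 is padding, not a product
    return out

# Two toy baskets over a catalogue of 4 products.
y_toy = np.array([[3, 1, 0],
                  [2, 0, 0]], dtype=np.uint16)
y_hot = to_multi_hot(y_toy, n_products=4)
print(y_hot)
```

With 49688 products and 131209 customers the full indicator matrix would take roughly 6.5 GB as uint8, so in practice it would probably have to be built batch by batch.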

### 2.2 Stacked LSTM <a id='2.2'></a>

Inspired by this post, [https://stackabuse.com/solving-sequence-problems-with-lstm-in-keras-part-2/](https://stackabuse.com/solving-sequence-problems-with-lstm-in-keras-part-2/), section "Many-to-Many Sequence Problems with Multiple Features":

In [31]:
model2 = Sequential()
model2.add(LSTM(80, activation='relu', input_shape=(99, 80)))
model2.add(RepeatVector(100))
model2.add(LSTM(80, activation='relu', return_sequences=False))
#model2.add(TimeDistributed(Dense(1)))
model2.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 80)                51520     
_________________________________________________________________
repeat_vector (RepeatVector) (None, 100, 80)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 80)                51520     
Total params: 103,040
Trainable params: 103,040
Non-trainable params: 0
_________________________________________________________________


In [32]:
model2.compile(optimizer='adam', loss='mse')

In [33]:
%%time
h3 = model2.fit(X_train, y_train,validation_data=(X_test, y_test),epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3
CPU times: user 37min 35s, sys: 5min 5s, total: 42min 41s
Wall time: 11min 14s


In [34]:
y_pred_train = model2.predict(X_train)
y_pred_test  = model2.predict(X_test)

In [35]:
y_pred_train[0]#.astype(int)

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan], dtype=float32)

In [36]:
y_pred_test[0]#.astype(int)

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan], dtype=float32)

Not good: the model returns only NaNs, which suggests that the training diverged numerically. We need to try more advanced models. And apply them correctly :-)