# Market basket analysis using bayseian network

This project aims to predict which previously purchased products will be in a user’s next order.  

We tackle this by creating a Bayesian Belief network to analyze the behavior of the customers, and predict their next purchase, along with the order in which they are bought in that transaction.  

More details on this problem is provided on the Kaggle challenge page. https://www.kaggle.com/c/instacart-market-basket-analysis/



## Review of past techniques

The Instacart dataset is one of the most popular datasets, and the ML community has attempted numerous techniques to maximise the performance of their algorithm. Prior to the release of the dataset, Instacart used  XGBoost, word2vec and Annoy in production [link](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2). Some of the algorithms attempted by previous works are Gradient boosting machines, XGBoost and Catboost. Some of the more compute intensive algorithms used were ensemble methods and recurrent neural networks(RNNs). An overview is listed [here](https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/36848). 

One of the attempts that caught our attention was the Bayesian network by Mr Jon Jones. It was appealing because it seemed a very intuitive approach. Having used services similar to Instacart before, we realised that there are certain factors that strongly influence our purchase of goods. Therefore, we decided to help further the efforts of Mr Jones.  

This project aims to help implement his solution [Calculate a Prior and Bayes Factors](https://www.kaggle.com/johnoliverjones/calculate-a-prior-and-bayes-factors-0-318)
 and get it running completely.

## Overview
This file uses p(reordered|product_id) derived from order_products__prior data as a **Prior**. This is to be used in Bayesian Updating of our Prior: our_products_prior['prob_reordered'].  

The notion is that after calculating Bayes Factors for each test product purchase the final probability that a product will be reordered is the **Posterior** probability.  Beginning when a product is first purchased (say order k of n total orders) then the **Posterior = BFn x BFn-1 x ... x BFk x Prior**.

Many others here have noticed the correlation between reordered and add_to_cart_order and aisle. I have added an engineered factor I call reorder_count (or count of reordered items in a cart). Using these three variables, I have derived a simple Augmented Naive Bayesian Network as a model to calculate the Bayes Factors for updating.

![Bayesian Network model of reordered][1]

by Jones. A deeper understanding is found in [Bayesian Solution: updating beliefs with 'real' data](https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/36312)

  [1]: http://elmtreegarden.com/wp-content/uploads/2017/07/Augmented-Naive-Bayesian-Network.png

## Downloading the data

The data is a public dataset released by Instacart, a large same-day grocery delivery service.  
We will use the dataset provided at https://www.instacart.com/datasets/grocery-shopping-2017 and store this in our directory that consists this notebook file.

Reading the data and Checking the contents below.

In [1]:
import pandas as pd
import numpy as np
import operator

# reading data

# directory = 'instacart_2017_05_01/'
directory = '/home/rs5788/instacart/'

print('Loading prior orders')
prior_orders = pd.read_csv(directory + 'order_products__prior.csv', dtype={
        'order_id': np.int32,
        'product_id': np.int32,
        'add_to_cart_order': np.int16,
        'reordered': np.int8})

print('Loading orders')
orders = pd.read_csv(directory + 'orders.csv', dtype={
        'order_id': np.int32,
        'user_id': np.int32,
        'eval_set': 'category',
        'order_number': np.int16,
        'order_dow': np.int8,
        'order_hour_of_day': np.int8,
        'days_since_prior_order': np.float32})

print('Loading aisles info')
aisles = pd.read_csv(directory + 'products.csv', engine='c',
                           usecols = ['product_id','aisle_id'],
                       dtype={'product_id': np.int32, 'aisle_id': np.int32})

pd.set_option('display.float_format', lambda x: '%.3f' % x)

print("\n Checking the loaded CSVs")
print("Prior orders:", prior_orders.shape)
print("Orders", orders.shape)
print("Aisles:", aisles.shape)

Loading prior orders
Loading orders
Loading aisles info

 Checking the loaded CSVs
Prior orders: (32434489, 4)
Orders (3421083, 7)
Aisles: (49688, 2)


## Visualizing the tables

In [2]:
prior_orders.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [3]:
orders.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [4]:
aisles.head()

Unnamed: 0,product_id,aisle_id
0,1,61
1,2,104
2,3,94
3,4,38
4,5,5


In [5]:
# removing all user_ids not in the test set from both files to save memory
# the test users present ample data to make models. (and saves space)

test  = orders[orders['eval_set'] == 'test' ]
user_ids = test['user_id'].values
orders = orders[orders['user_id'].isin(user_ids)]

test.shape

(75000, 7)

## Calculate the initial Informed Prior  
## $$ p(reordered\ |\ product\_id) $$

In [6]:
# Calculate the Prior : p(reordered|product_id)

prior = pd.DataFrame(prior_orders.groupby('product_id')['reordered']     \
                     .agg([('number_of_orders',len),('sum_of_reorders','sum')]))

prior['prior_p'] = (prior['sum_of_reorders']+1)/(prior['number_of_orders']+2) # Informed Prior
# prior['prior_p'] = 1/2  # Flat Prior
# prior.drop(['number_of_orders','sum_of_reorders'], axis=1, inplace=True)

print('Here is The Prior: our first guess of how probable it is that a product be reordered once it has been ordered.')
prior.head()

Here is The Prior: our first guess of how probable it is that a product be reordered once it has been ordered.


Unnamed: 0_level_0,number_of_orders,sum_of_reorders,prior_p
product_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1852,1136.0,0.613
2,90,12.0,0.141
3,277,203.0,0.731
4,329,147.0,0.447
5,15,9.0,0.588


In [7]:
# merge everything into one dataframe and save any memory space

comb = pd.DataFrame()
comb = pd.merge(prior_orders, orders, on='order_id', how='right')

# slim down comb - 
comb.drop(['eval_set','order_dow','order_hour_of_day'], axis=1, inplace=True)
del prior_orders
del orders

comb = pd.merge(comb, aisles, on ='product_id', how = 'left')
del aisles

prior.reset_index(inplace = True)
comb = pd.merge(comb, prior, on ='product_id', how = 'left')
del prior

print('combined data in DataFrame comb')
comb.head()

combined data in DataFrame comb


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,order_number,days_since_prior_order,aisle_id,number_of_orders,sum_of_reorders,prior_p
0,13,17330.0,1.0,0.0,45082,2,1.0,27.0,484.0,323.0,0.667
1,13,27407.0,2.0,0.0,45082,2,1.0,51.0,976.0,373.0,0.382
2,13,35419.0,3.0,0.0,45082,2,1.0,124.0,1244.0,701.0,0.563
3,13,196.0,4.0,0.0,45082,2,1.0,77.0,35791.0,27791.0,0.776
4,13,44635.0,5.0,0.0,45082,2,1.0,51.0,701.0,237.0,0.339


## Building factors
Build the factors needed for a model of probability of reordered. This model forms our
hypothesis H and allows the calculation of each Bayes Factor:  

$$ BF = p(e|H)/(1-p(e|H)) $$  
where e is the test user product buying history. See DAG of model above.  

We discretize reorder count into categories, 9 buckets, being sure to include 0 as bucket. These bins maximize mutual information with ['reordered'].

In [8]:
recount = pd.DataFrame()
recount['reorder_c'] = comb.groupby(comb.order_id)['reordered'].sum().fillna(0)
bins = [-0.1, 0, 2,4,6,8,11,14,19,71]
cat =  ['None','<=2','<=4','<=6','<=8','<=11','<=14','<=19','>19']
recount['reorder_b'] = pd.cut(recount['reorder_c'], bins, labels = cat)
recount.reset_index(inplace = True)

comb = pd.merge(comb, recount, how = 'left', on = 'order_id')
del recount
comb.head(50)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,order_number,days_since_prior_order,aisle_id,number_of_orders,sum_of_reorders,prior_p,reorder_c,reorder_b
0,13,17330.0,1.0,0.0,45082,2,1.0,27.0,484.0,323.0,0.667,0.0,
1,13,27407.0,2.0,0.0,45082,2,1.0,51.0,976.0,373.0,0.382,0.0,
2,13,35419.0,3.0,0.0,45082,2,1.0,124.0,1244.0,701.0,0.563,0.0,
3,13,196.0,4.0,0.0,45082,2,1.0,77.0,35791.0,27791.0,0.776,0.0,
4,13,44635.0,5.0,0.0,45082,2,1.0,51.0,701.0,237.0,0.339,0.0,
5,13,26878.0,6.0,0.0,45082,2,1.0,64.0,217.0,124.0,0.571,0.0,
6,13,25783.0,7.0,0.0,45082,2,1.0,64.0,2773.0,1472.0,0.531,0.0,
7,13,41290.0,8.0,0.0,45082,2,1.0,31.0,19692.0,13173.0,0.669,0.0,
8,13,33198.0,9.0,0.0,45082,2,1.0,115.0,42934.0,31500.0,0.734,0.0,
9,13,23020.0,10.0,0.0,45082,2,1.0,77.0,1928.0,1182.0,0.613,0.0,


Discretize 'add_to_cart_order' (atco) into categories, 8 buckets.
These bins maximize mutual information with ['recount'].

In [9]:
bins = [0,2,3,5,7,9,12,17,80]
cat = ['<=2','<=3','<=5','<=7','<=9','<=12','<=17','>17']

comb['atco1'] = pd.cut(comb['add_to_cart_order'], bins, labels = cat)
del comb['add_to_cart_order']
print('comb')
comb.head(50)

comb


Unnamed: 0,order_id,product_id,reordered,user_id,order_number,days_since_prior_order,aisle_id,number_of_orders,sum_of_reorders,prior_p,reorder_c,reorder_b,atco1
0,13,17330.0,0.0,45082,2,1.0,27.0,484.0,323.0,0.667,0.0,,<=2
1,13,27407.0,0.0,45082,2,1.0,51.0,976.0,373.0,0.382,0.0,,<=2
2,13,35419.0,0.0,45082,2,1.0,124.0,1244.0,701.0,0.563,0.0,,<=3
3,13,196.0,0.0,45082,2,1.0,77.0,35791.0,27791.0,0.776,0.0,,<=5
4,13,44635.0,0.0,45082,2,1.0,51.0,701.0,237.0,0.339,0.0,,<=5
5,13,26878.0,0.0,45082,2,1.0,64.0,217.0,124.0,0.571,0.0,,<=7
6,13,25783.0,0.0,45082,2,1.0,64.0,2773.0,1472.0,0.531,0.0,,<=7
7,13,41290.0,0.0,45082,2,1.0,31.0,19692.0,13173.0,0.669,0.0,,<=9
8,13,33198.0,0.0,45082,2,1.0,115.0,42934.0,31500.0,0.734,0.0,,<=9
9,13,23020.0,0.0,45082,2,1.0,77.0,1928.0,1182.0,0.613,0.0,,<=12


These are the children Nodes of reordered:atco, aisle, recount. Build occurrence tables
first, then calculate probabilities. Then merge to add atco into comb.

In [10]:
atco_fac = pd.DataFrame()
atco_fac = comb.groupby(['reordered', 'atco1'])['atco1'].agg(np.count_nonzero).unstack('atco1')

tot = pd.DataFrame()
tot = np.sum(atco_fac,axis=1)

atco_fac = atco_fac.iloc[:,:].div(tot, axis=0)
atco_fac = atco_fac.stack('atco1')
atco_fac = pd.DataFrame(atco_fac)
atco_fac.reset_index(inplace = True)
atco_fac.rename(columns = {0:'atco_fac_p'}, inplace = True)

comb = pd.merge(comb, atco_fac, how='left', on=('reordered', 'atco1'))
comb.head(50)

Unnamed: 0,order_id,product_id,reordered,user_id,order_number,days_since_prior_order,aisle_id,number_of_orders,sum_of_reorders,prior_p,reorder_c,reorder_b,atco1,atco_fac_p
0,13,17330.0,0.0,45082,2,1.0,27.0,484.0,323.0,0.667,0.0,,<=2,0.152
1,13,27407.0,0.0,45082,2,1.0,51.0,976.0,373.0,0.382,0.0,,<=2,0.152
2,13,35419.0,0.0,45082,2,1.0,124.0,1244.0,701.0,0.563,0.0,,<=3,0.074
3,13,196.0,0.0,45082,2,1.0,77.0,35791.0,27791.0,0.776,0.0,,<=5,0.143
4,13,44635.0,0.0,45082,2,1.0,51.0,701.0,237.0,0.339,0.0,,<=5,0.143
5,13,26878.0,0.0,45082,2,1.0,64.0,217.0,124.0,0.571,0.0,,<=7,0.128
6,13,25783.0,0.0,45082,2,1.0,64.0,2773.0,1472.0,0.531,0.0,,<=7,0.128
7,13,41290.0,0.0,45082,2,1.0,31.0,19692.0,13173.0,0.669,0.0,,<=9,0.108
8,13,33198.0,0.0,45082,2,1.0,115.0,42934.0,31500.0,0.734,0.0,,<=9,0.108
9,13,23020.0,0.0,45082,2,1.0,77.0,1928.0,1182.0,0.613,0.0,,<=12,0.126


Calculate other two factors probability tables, then probability and merge into comb

In [11]:
aisle_fac = pd.DataFrame()
aisle_fac = comb.groupby(['reordered', 'atco1', 'aisle_id'])['aisle_id']\
                .agg(np.count_nonzero).unstack('aisle_id')

tot = np.sum(aisle_fac,axis=1)

aisle_fac = aisle_fac.iloc[:,:].div(tot, axis=0)
aisle_fac = aisle_fac.stack('aisle_id')
aisle_fac = pd.DataFrame(aisle_fac)
aisle_fac.reset_index(inplace = True)
aisle_fac.rename(columns = {0:'aisle_fac_p'}, inplace = True)

comb = pd.merge(comb, aisle_fac, how = 'left', on = ('aisle_id','reordered','atco1'))
comb.head(50)

Unnamed: 0,order_id,product_id,reordered,user_id,order_number,days_since_prior_order,aisle_id,number_of_orders,sum_of_reorders,prior_p,reorder_c,reorder_b,atco1,atco_fac_p,aisle_fac_p
0,13,17330.0,0.0,45082,2,1.0,27.0,484.0,323.0,0.667,0.0,,<=2,0.152,0.004
1,13,27407.0,0.0,45082,2,1.0,51.0,976.0,373.0,0.382,0.0,,<=2,0.152,0.003
2,13,35419.0,0.0,45082,2,1.0,124.0,1244.0,701.0,0.563,0.0,,<=3,0.074,0.001
3,13,196.0,0.0,45082,2,1.0,77.0,35791.0,27791.0,0.776,0.0,,<=5,0.143,0.011
4,13,44635.0,0.0,45082,2,1.0,51.0,701.0,237.0,0.339,0.0,,<=5,0.143,0.004
5,13,26878.0,0.0,45082,2,1.0,64.0,217.0,124.0,0.571,0.0,,<=7,0.128,0.003
6,13,25783.0,0.0,45082,2,1.0,64.0,2773.0,1472.0,0.531,0.0,,<=7,0.128,0.003
7,13,41290.0,0.0,45082,2,1.0,31.0,19692.0,13173.0,0.669,0.0,,<=9,0.108,0.014
8,13,33198.0,0.0,45082,2,1.0,115.0,42934.0,31500.0,0.734,0.0,,<=9,0.108,0.015
9,13,23020.0,0.0,45082,2,1.0,77.0,1928.0,1182.0,0.613,0.0,,<=12,0.126,0.008


Last factor is reorder_count_factor   

In [12]:
   
recount_fac = pd.DataFrame()
recount_fac = comb.groupby(['reordered', 'atco1', 'reorder_b'])['reorder_b']\
                    .agg(np.count_nonzero).unstack('reorder_b')

tot = pd.DataFrame()
tot = np.sum(recount_fac,axis=1)

recount_fac = recount_fac.iloc[:,:].div(tot, axis=0)
recount_fac.stack('reorder_b')
recount_fac = pd.DataFrame(recount_fac.unstack('reordered').unstack('atco1')).reset_index()
recount_fac.rename(columns = {0:'recount_fac_p'}, inplace = True)

comb = pd.merge(comb, recount_fac, how = 'left', on = ('reorder_b', 'reordered', 'atco1'))
recount_fac.head(50)

Unnamed: 0,reorder_b,reordered,atco1,recount_fac_p
0,<=11,0.0,<=12,0.179
1,<=11,0.0,<=17,0.199
2,<=11,0.0,<=2,0.044
3,<=11,0.0,<=3,0.037
4,<=11,0.0,<=5,0.042
5,<=11,0.0,<=7,0.057
6,<=11,0.0,<=9,0.092
7,<=11,0.0,>17,0.115
8,<=11,1.0,<=12,0.226
9,<=11,1.0,<=17,0.111


We use the factors in comb + the prior_p to update a posterior for each product purchased.

In [13]:
p = pd.DataFrame()
p = (comb.loc[:,'atco_fac_p'] * comb.loc[:,'aisle_fac_p'] * comb.loc[:,'recount_fac_p'])
p.reset_index()
comb['p'] = p

comb.head(30)

Unnamed: 0,order_id,product_id,reordered,user_id,order_number,days_since_prior_order,aisle_id,number_of_orders,sum_of_reorders,prior_p,reorder_c,reorder_b,atco1,atco_fac_p,aisle_fac_p,recount_fac_p,p
0,13,17330.0,0.0,45082,2,1.0,27.0,484.0,323.0,0.667,0.0,,<=2,0.152,0.004,0.359,0.0
1,13,27407.0,0.0,45082,2,1.0,51.0,976.0,373.0,0.382,0.0,,<=2,0.152,0.003,0.359,0.0
2,13,35419.0,0.0,45082,2,1.0,124.0,1244.0,701.0,0.563,0.0,,<=3,0.074,0.001,0.303,0.0
3,13,196.0,0.0,45082,2,1.0,77.0,35791.0,27791.0,0.776,0.0,,<=5,0.143,0.011,0.259,0.0
4,13,44635.0,0.0,45082,2,1.0,51.0,701.0,237.0,0.339,0.0,,<=5,0.143,0.004,0.259,0.0
5,13,26878.0,0.0,45082,2,1.0,64.0,217.0,124.0,0.571,0.0,,<=7,0.128,0.003,0.218,0.0
6,13,25783.0,0.0,45082,2,1.0,64.0,2773.0,1472.0,0.531,0.0,,<=7,0.128,0.003,0.218,0.0
7,13,41290.0,0.0,45082,2,1.0,31.0,19692.0,13173.0,0.669,0.0,,<=9,0.108,0.014,0.192,0.0
8,13,33198.0,0.0,45082,2,1.0,115.0,42934.0,31500.0,0.734,0.0,,<=9,0.108,0.015,0.192,0.0
9,13,23020.0,0.0,45082,2,1.0,77.0,1928.0,1182.0,0.613,0.0,,<=12,0.126,0.008,0.17,0.0


## Filing Bayess Factors for intermediate orders
We now split into three dataframes. Two of them are reordered == 1 and == 0
This is done because Bayes Factor (BF) is calculated differently for each case.  
We then append this to a table called 'comb_last' which will be using for modelling Bayes Factor depending on whether the product was ordered or not.

In [14]:

# Calculate bf0 for products when first purchased aka reordered=0
comb0 = pd.DataFrame()
comb0 = comb[comb['reordered']==0]
comb0.loc[:,'first_order'] = comb0['order_number']
# now every product that was ordered has a posterior in usr.
comb0.loc[:,'beta'] = 1
comb0.loc[:,'bf'] = (comb0.loc[:,'prior_p'] * comb0.loc[:,'p']/(1 - comb0.loc[:,'p'])) # bf1
# Small 'slight of hand' here. comb0.bf is really the first posterior and second prior.

# Calculate beta and BF1 for the reordered products
comb1 = pd.DataFrame()
comb1 = comb[comb['reordered']==1]

comb1.loc[:,'beta'] = (1 - .05*comb1.loc[:,'days_since_prior_order']/30)
comb1.loc[:,'bf'] = (1 - comb1.loc[:,'p'])/comb1.loc[:,'p'] # bf0


comb_last = pd.DataFrame()
comb_last = pd.concat([comb0, comb1], axis=0).reset_index(drop=True)

comb_last = comb_last[['reordered', 'user_id', 'order_id', 'product_id','reorder_c','order_number',
                       'bf','beta','atco_fac_p', 'aisle_fac_p', 'recount_fac_p']]
comb_last = comb_last.sort_values((['user_id', 'order_number', 'bf']))

pd.set_option('display.float_format', lambda x: '%.6f' % x)
comb_last.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,reordered,user_id,order_id,product_id,reorder_c,order_number,bf,beta,atco_fac_p,aisle_fac_p,recount_fac_p
1941951,0.0,3,1374495,24810.0,0.0,1,0.000173,1.0,0.108048,0.015029,0.191657
1941952,0.0,3,1374495,32402.0,0.0,1,0.000217,1.0,0.125739,0.01571,0.169924
1941947,0.0,3,1374495,39190.0,0.0,1,0.000339,1.0,0.142619,0.016031,0.258537
1941950,0.0,3,1374495,39922.0,0.0,1,0.000547,1.0,0.108048,0.051583,0.191657
1941944,0.0,3,1374495,17668.0,0.0,1,0.000647,1.0,0.152094,0.020466,0.359063


In [15]:
first_order = pd.DataFrame()
first_order = comb_last[comb_last.reordered == 0]
first_order.rename(columns = {'order_number':'first_o'}, inplace = True)
first_order.loc[:,'last_o'] = comb_last.groupby(['user_id'])['order_number'].transform(max)
first_order = first_order[['user_id','product_id','first_o','last_o']]

comb_last = pd.merge(comb_last, first_order, on = ('user_id', 'product_id'), how = 'left')
comb_last.head()

#com = pd.DataFrame()
#com = comb_last[(comb_last.user_id == 3) & (comb_last.first_o < comb_last.order_number)]
#com.groupby([('order_id', 'product_id', 'order_number')])['bf'].agg(np.sum).head(50)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  **kwargs)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,reordered,user_id,order_id,product_id,reorder_c,order_number,bf,beta,atco_fac_p,aisle_fac_p,recount_fac_p,first_o,last_o
0,0.0,3,1374495,24810.0,0.0,1,0.000173,1.0,0.108048,0.015029,0.191657,1,12
1,0.0,3,1374495,32402.0,0.0,1,0.000217,1.0,0.125739,0.01571,0.169924,1,12
2,0.0,3,1374495,39190.0,0.0,1,0.000339,1.0,0.142619,0.016031,0.258537,1,12
3,0.0,3,1374495,39922.0,0.0,1,0.000547,1.0,0.108048,0.051583,0.191657,1,12
4,0.0,3,1374495,17668.0,0.0,1,0.000647,1.0,0.152094,0.020466,0.359063,1,12


## Visualizing next steps
$$ Bayes\ Factor\ (BF) = \frac{p\ (e\ |\ reordered)}{p\ (e)} $$  
The probability of evidence p(e) is considered fixed so that a very simple relationship emerges: Posterior ~= p(e | reordered) x Prior!!! where ~= is "proportional to". This is a very simple recursive method to Update a Prior. 
$$ Posterior\ ~=\ BFn\ *\ BFn-1\ *\ ...\ *\ BF1\ *\ Prior $$

To get the final list of possible products a user may reorder, we get a full list of the orders and items ordered. Each product begins with some Prior and this Prior is Updated with each order.  

This helpful spreadsheet shows four orders and the calculations for p(reordered) (aka posterior) as an example.
We now try to emulate this for the entire dataset.
![Sample](http://elmtreegarden.com/wp-content/uploads/2017/03/Process-on-four-orders.gif)  

Visualizing the same on our dataset. Let's consider user_id == 3. The bayes factors for the orders placed are calculated in the above cells. We now find a way to emulate intermediate bayes factors.  

First we obtain the bayes factors for the first orders of each product.

In [16]:
temp = pd.pivot_table(comb_last[(comb_last.user_id == 3) & (comb_last.first_o == comb_last.order_number)],
                     values = 'bf', index = ['user_id', 'product_id'],
                     columns = 'order_number', dropna=False)
temp.head(10)

Unnamed: 0_level_0,order_number,1,2,3,4,5,6,7,8,10
user_id,product_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
3,248.0,,5.3e-05,,,,,,,
3,1005.0,,,,,,,,,6.6e-05
3,1819.0,,,,0.000195,,,,,
3,7503.0,,,0.000166,,,,,,
3,8021.0,,0.000186,,,,,,,
3,9387.0,0.003135,,,,,,,,
3,12845.0,,,,0.000139,,,,,
3,14992.0,,,,,,0.000719,,,
3,15143.0,0.000984,,,,,,,,
3,16797.0,0.0022,,,,,,,,


We see in the above pivot table that only orders where the products exist contain probabilities of events. We now fill this with bayes factors.  
Padding this value to add bayes factor where there is no reorder (bf0)

In [17]:
temp = temp.fillna(method='pad', axis=1).fillna(1)
temp.head(10)

Unnamed: 0_level_0,order_number,1,2,3,4,5,6,7,8,10
user_id,product_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
3,248.0,1.0,5.3e-05,5.3e-05,5.3e-05,5.3e-05,5.3e-05,5.3e-05,5.3e-05,5.3e-05
3,1005.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,6.6e-05
3,1819.0,1.0,1.0,1.0,0.000195,0.000195,0.000195,0.000195,0.000195,0.000195
3,7503.0,1.0,1.0,0.000166,0.000166,0.000166,0.000166,0.000166,0.000166,0.000166
3,8021.0,1.0,0.000186,0.000186,0.000186,0.000186,0.000186,0.000186,0.000186,0.000186
3,9387.0,0.003135,0.003135,0.003135,0.003135,0.003135,0.003135,0.003135,0.003135,0.003135
3,12845.0,1.0,1.0,1.0,0.000139,0.000139,0.000139,0.000139,0.000139,0.000139
3,14992.0,1.0,1.0,1.0,1.0,1.0,0.000719,0.000719,0.000719,0.000719
3,15143.0,0.000984,0.000984,0.000984,0.000984,0.000984,0.000984,0.000984,0.000984,0.000984
3,16797.0,0.0022,0.0022,0.0022,0.0022,0.0022,0.0022,0.0022,0.0022,0.0022


We now get the bayes factors for when reorders are 1(bf1), so we can update the earlier dataframe with.

In [18]:
pd.pivot_table(comb_last[comb_last.first_o <= comb_last.order_number],
                              values = 'bf', index = ['user_id', 'product_id'],
                              columns = 'order_number').head(10)

Unnamed: 0_level_0,order_number,1,2,3,4,5,6,7,8,9,10,...,90,91,92,93,94,95,96,97,98,99
user_id,product_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
3,248.0,,5.3e-05,,,,,,,,,...,,,,,,,,,,
3,1005.0,,,,,,,,,,6.6e-05,...,,,,,,,,,,
3,1819.0,,,,0.000195,,5170.703379,6002.580877,,,,...,,,,,,,,,,
3,7503.0,,,0.000166,,,,,,,,...,,,,,,,,,,
3,8021.0,,0.000186,,,,,,,,,...,,,,,,,,,,
3,9387.0,0.003135,,,1140.546997,207.747036,431.317144,517.759206,,,,...,,,,,,,,,,
3,12845.0,,,,0.000139,,,,,,,...,,,,,,,,,,
3,14992.0,,,,,,0.000719,325.882378,,,,...,,,,,,,,,,
3,15143.0,0.000984,,,,,,,,,,...,,,,,,,,,,
3,16797.0,0.0022,,,,,,239.540761,,330.398809,,...,,,,,,,,,,


Finally, update to the bf1 values to the earlier padded table.

In [19]:
temp.update(pd.pivot_table(comb_last[comb_last.first_o <= comb_last.order_number],
                              values = 'bf', index = ['user_id', 'product_id'],
                              columns = 'order_number'))
temp.head(10)

Unnamed: 0_level_0,order_number,1,2,3,4,5,6,7,8,10
user_id,product_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
3,248.0,1.0,5.3e-05,5.3e-05,5.3e-05,5.3e-05,5.3e-05,5.3e-05,5.3e-05,5.3e-05
3,1005.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,6.6e-05
3,1819.0,1.0,1.0,1.0,0.000195,0.000195,5170.703379,6002.580877,0.000195,0.000195
3,7503.0,1.0,1.0,0.000166,0.000166,0.000166,0.000166,0.000166,0.000166,0.000166
3,8021.0,1.0,0.000186,0.000186,0.000186,0.000186,0.000186,0.000186,0.000186,0.000186
3,9387.0,0.003135,0.003135,0.003135,1140.546997,207.747036,431.317144,517.759206,0.003135,0.003135
3,12845.0,1.0,1.0,1.0,0.000139,0.000139,0.000139,0.000139,0.000139,0.000139
3,14992.0,1.0,1.0,1.0,1.0,1.0,0.000719,325.882378,0.000719,0.000719
3,15143.0,0.000984,0.000984,0.000984,0.000984,0.000984,0.000984,0.000984,0.000984,0.000984
3,16797.0,0.0022,0.0022,0.0022,0.0022,0.0022,0.0022,239.540761,0.0022,0.0022


## Interating through the dataset
We now add the above Bayes Factor procedure to calculate what the spreadsheet explains through all of the 200,000+ users in the dataset. Once that is calculated, we multiply all the factors along the products axis to calculate the posterior.

In [20]:
import logging
logging.basicConfig(filename='bayes.log',level=logging.DEBUG)
logging.debug("Started Posterior calculations")
print("Started Posterior calculations")

pred = pd.DataFrame(columns=['user_id', 'product_id'])
# comb_last_temp = pd.DataFrame()
# com = pd.DataFrame()

for uid in comb_last.user_id.unique():
    if uid % 1000 == 0:
        print("Posterior calculated until user %d" % uid)
        logging.debug("Posterior calculated until user %d" % uid)
    
#     del comb_last_temp
    comb_last_temp = pd.DataFrame()
    comb_last_temp = comb_last[comb_last['user_id'] == uid].reset_index()
    
#     del com
    com = pd.DataFrame()
    com = pd.pivot_table(comb_last_temp[comb_last_temp.first_o == comb_last_temp.order_number],
                         values = 'bf', index = ['user_id', 'product_id'],
                         columns = 'order_number', dropna=False)
    com = com.fillna(method='pad', axis=1).fillna(1)
    com.update(pd.pivot_table(comb_last_temp[comb_last_temp.first_o <= comb_last_temp.order_number],
                              values = 'bf', index = ['user_id', 'product_id'],
                              columns = 'order_number'))

    com.reset_index(inplace=True)
    com['posterior'] = com.product(axis=1)
    
    pred = pred.append(com.sort_values(by=['posterior'], ascending=False).head(10)    \
                       .groupby('user_id')['product_id'].apply(list).reset_index())    

print("Posterior calculated for all users")
logging.debug("Posterior calculated for all users")
pred = pred.rename(columns={'product_id': 'products'})
pred.head()

Started Posterior calculations
Posterior calculated until user 5000
Posterior calculated until user 7000
Posterior calculated until user 8000
Posterior calculated until user 16000
Posterior calculated until user 22000
Posterior calculated until user 24000
Posterior calculated until user 29000
Posterior calculated until user 30000
Posterior calculated until user 31000
Posterior calculated until user 34000
Posterior calculated until user 35000
Posterior calculated until user 39000
Posterior calculated until user 40000
Posterior calculated until user 42000
Posterior calculated until user 43000
Posterior calculated until user 46000
Posterior calculated until user 47000
Posterior calculated until user 56000
Posterior calculated until user 57000
Posterior calculated until user 61000
Posterior calculated until user 66000
Posterior calculated until user 69000
Posterior calculated until user 70000
Posterior calculated until user 72000
Posterior calculated until user 73000
Posterior calculated u

Unnamed: 0,user_id,products
0,3.0,"[39190.0, 18599.0, 21903.0, 47766.0, 9387.0, 2..."
0,4.0,"[26576.0, 25623.0, 21573.0, 25146.0, 37646.0, ..."
0,6.0,"[21903.0, 25659.0, 8424.0, 49401.0, 38293.0, 4..."
0,11.0,"[27959.0, 8309.0, 35948.0, 14947.0, 17794.0, 1..."
0,12.0,"[10863.0, 7076.0, 13176.0, 14992.0, 49683.0, 3..."


## Preparing for submission
We get the recent order for the user from the test dataset and query the order ID column along with the product IDs. We then format the product IDs as per the submission sample in the kaggle competition rules and export the same to a CSV.

In [21]:
pred = pred.merge(test, on='user_id', how='left')[['order_id', 'products']]
pred['products'] = pred['products'].apply(lambda x: [int(i) for i in x])    \
                    .astype(str).apply(lambda x: x.strip('[]').replace(',', ''))
pred.head()

Unnamed: 0,order_id,products
0,2774568,39190 18599 21903 47766 9387 22035 43961 1005 ...
1,329954,26576 25623 21573 25146 37646 19057 22199 1776...
2,1528013,21903 25659 8424 49401 38293 45007 11068 27521...
3,1376945,27959 8309 35948 14947 17794 13176 20383 24799...
4,1356845,10863 7076 13176 14992 49683 37687 17794 48364...


In [22]:
pred.to_csv('predictions.csv', index=False)