## Project Phase 2 - Data Preprocessing

From the previous phase, we found that the probability of reorder is highest for Produce, Dairy Eggs, Beverages and Snacks. We will see whether our models confirm this. Because we used Google Colab, the work had to be split across multiple notebooks, one per task. Since we have not attached a separate document, we explain each step alongside the code in the notebook.

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
%matplotlib inline

data_dir = 'data/'

### Data Preprocessing for Model Training

In [2]:
aisles = pd.read_csv(data_dir+'aisles.csv')
departments = pd.read_csv(data_dir+'departments.csv')

products = pd.read_csv(data_dir+'products.csv')
orders = pd.read_csv(data_dir+'orders.csv')

order_products_prior = pd.read_csv(data_dir+'order_products__prior.csv')
print(order_products_prior.shape)
print(orders.head())
# We have already seen what they look like

(32434489, 4)
   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  \
0   2539329        1    prior             1          2                  8   
1   2398795        1    prior             2          3                  7   
2    473747        1    prior             3          3                 12   
3   2254736        1    prior             4          4                  7   
4    431534        1    prior             5          4                 15   

   days_since_prior_order  
0                     NaN  
1                    15.0  
2                    21.0  
3                    29.0  
4                    28.0  


In [3]:
orders_prior = orders[orders['eval_set'] == 'prior']
# Reducing size of dataset so it can run on our laptops/Colab
orders_prior, _ = train_test_split(orders_prior, test_size=0.7)
print(orders_prior.shape)
orders_train = orders[orders['eval_set'] == 'train']
orders_train, _ = train_test_split(orders_train, test_size=0.7)
print(orders_train.shape)
orders_test = orders[orders['eval_set'] == 'test']
orders_test, _ = train_test_split(orders_test, test_size=0.7)
print(orders_test.shape)

(964462, 7)
(39362, 7)
(22500, 7)
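
The subsampling above draws a different 30% of the orders on every run, because no seed is passed to `train_test_split`. A minimal sketch of a reproducible version follows. It assumes a fixed `random_state` is acceptable, and the toy frame and the seed value 42 are hypothetical, not taken from the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for orders.csv (hypothetical values, not the real data)
toy_orders = pd.DataFrame({
    "order_id": range(1, 15),
    "eval_set": ["prior"] * 10 + ["train"] * 2 + ["test"] * 2,
})

toy_prior = toy_orders[toy_orders["eval_set"] == "prior"]

# A fixed random_state makes the kept subsample identical on every run
sample_a, _ = train_test_split(toy_prior, test_size=0.7, random_state=42)
sample_b, _ = train_test_split(toy_prior, test_size=0.7, random_state=42)

same_rows = sample_a["order_id"].tolist() == sample_b["order_id"].tolist()
print(len(sample_a), same_rows)
```

With a seed in place, the shapes printed above and every downstream result can be regenerated exactly in a different notebook.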


In [4]:
orders, _ = train_test_split(orders, test_size=0.7)
print(orders.shape)

(1026324, 7)


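Next, we merge everything into one big dataframe. Because the joins are inner joins, only the prior order lines whose order landed in the 30% sample of orders are kept, which is why the row count drops from 32,434,489 to 9,737,270.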

In [5]:
all_data = pd.merge(products, aisles, on='aisle_id')
all_data = pd.merge(all_data, departments, on='department_id')

all_orders = pd.merge(order_products_prior, orders, on='order_id')
all_data = pd.merge(all_data, all_orders, on='product_id')
print(all_data.shape)

(9737270, 15)
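
To illustrate how this chain of merges behaves, here is a small sketch on toy tables. All values are hypothetical, and the `validate` argument is an optional safety check that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values)
toy_products = pd.DataFrame({"product_id": [1, 2], "aisle_id": [61, 104]})
toy_aisles = pd.DataFrame({"aisle_id": [61, 104],
                           "aisle": ["cookies cakes", "spices seasonings"]})
toy_order_products = pd.DataFrame({"order_id": [10, 10, 11, 12],
                                   "product_id": [1, 2, 1, 2]})
# Order 12 is missing here, as if it had been dropped by the subsampling
toy_orders = pd.DataFrame({"order_id": [10, 11], "user_id": [7, 8]})

# Each product belongs to exactly one aisle; validate raises if that is violated
catalog = pd.merge(toy_products, toy_aisles, on="aisle_id", validate="many_to_one")

# The default inner join drops order lines whose order_id is not in toy_orders
lines = pd.merge(toy_order_products, toy_orders, on="order_id", validate="many_to_one")
merged = pd.merge(catalog, lines, on="product_id")

print(merged.shape)
```

The order line for order 12 disappears from the result, which mirrors how the real merge keeps only the lines of sampled orders.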


In [6]:
print(all_data.head())
all_data.to_pickle('all_data.pkl')

   product_id                product_name  aisle_id  department_id  \
0           1  Chocolate Sandwich Cookies        61             19   
1           1  Chocolate Sandwich Cookies        61             19   
2           1  Chocolate Sandwich Cookies        61             19   
3           1  Chocolate Sandwich Cookies        61             19   
4           1  Chocolate Sandwich Cookies        61             19   

           aisle department  order_id  add_to_cart_order  reordered  user_id  \
0  cookies cakes     snacks      9273                 30          0    50005   
1  cookies cakes     snacks     11140                  1          1    63782   
2  cookies cakes     snacks     14668                 13          1   106519   
3  cookies cakes     snacks     19479                  5          0   110984   
4  cookies cakes     snacks     19879                 12          0   142388   

  eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order  
0    prior       

### Feature Selection

In [7]:
all_columns = list(all_data.columns)
features = [column for column in all_columns 
            if column != 'reordered' if column != 'product_name']
print(features)

X = all_data[features]
Y = all_data['reordered']

print(X.head())
print()
print(Y.head())

['product_id', 'aisle_id', 'department_id', 'aisle', 'department', 'order_id', 'add_to_cart_order', 'user_id', 'eval_set', 'order_number', 'order_dow', 'order_hour_of_day', 'days_since_prior_order']
   product_id  aisle_id  department_id          aisle department  order_id  \
0           1        61             19  cookies cakes     snacks      9273   
1           1        61             19  cookies cakes     snacks     11140   
2           1        61             19  cookies cakes     snacks     14668   
3           1        61             19  cookies cakes     snacks     19479   
4           1        61             19  cookies cakes     snacks     19879   

   add_to_cart_order  user_id eval_set  order_number  order_dow  \
0                 30    50005    prior             1          1   
1                  1    63782    prior             4          1   
2                 13   106519    prior            18          0   
3                  5   110984    prior             2          2 

In [8]:
# Drop the string columns: aisle and department already have ID columns,
# and eval_set is always 'prior' in this merged data
relevant_features = [feature for feature in features 
                        if feature != 'eval_set'
                        if feature != 'aisle'
                        if feature != 'department']

# days_since_prior_order is NaN for a user's first order; replace NaN with 0
X = X.fillna(0)
X = X[relevant_features]
print(relevant_features)
print(X.shape)
print(X.dtypes)

['product_id', 'aisle_id', 'department_id', 'order_id', 'add_to_cart_order', 'user_id', 'order_number', 'order_dow', 'order_hour_of_day', 'days_since_prior_order']
(9737270, 10)
product_id                  int64
aisle_id                    int64
department_id               int64
order_id                    int64
add_to_cart_order           int64
user_id                     int64
order_number                int64
order_dow                   int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object
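
The `fillna(0)` call above is applied to the whole frame, even though only `days_since_prior_order` is expected to contain missing values. A possible, more targeted variant is sketched below on toy data. The `is_first_order` indicator column is a hypothetical addition, not something used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X (hypothetical values)
toy_X = pd.DataFrame({
    "order_number": [1, 2, 3],
    "days_since_prior_order": [np.nan, 15.0, 21.0],
})

# Hypothetical indicator so a true 0-day gap stays distinguishable
# from "no previous order"
toy_X["is_first_order"] = toy_X["days_since_prior_order"].isna().astype(int)

# Fill only the one column that is known to contain NaN
toy_X["days_since_prior_order"] = toy_X["days_since_prior_order"].fillna(0)

print(toy_X)
```

Filling a single column avoids silently masking unexpected NaN values elsewhere in the data.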


We drop all string variables here because they add no information beyond what the ID columns already carry. We could easily have one-hot encoded aisle and department to use them as categorical variables, but aisle_id and department_id already identify them, just as product_name is redundant when product_id is available. Thus we chose to simply remove these variables.
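
For reference, the one-hot encoding alternative mentioned above could look roughly like the following sketch. It uses `pd.get_dummies` on a toy `department` column with hypothetical values, and it is not part of our pipeline.

```python
import pandas as pd

# Toy stand-in for the department column (hypothetical values)
toy = pd.DataFrame({"department": ["snacks", "produce", "snacks", "beverages"]})

# One indicator column per distinct department
encoded = pd.get_dummies(toy, columns=["department"])

print(encoded.columns.tolist())
```

Each distinct string becomes its own column, so encoding the real aisle and department columns would widen the feature matrix considerably.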

In [9]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

skb = SelectKBest(f_classif,k='all').fit(X,Y)
scores = skb.scores_
all_features = X.columns.values
sort_index = np.argsort(scores)[::-1]
ranked_features = []
print("Features in order of importance: ")
for x in sort_index:
    print("Score of ", all_features[x], " is ", scores[x])
    ranked_features.append(all_features[x])
print(all_features)

Features in order of importance: 
Score of  order_number  is  1010198.3543936191
Score of  add_to_cart_order  is  176875.69895921985
Score of  department_id  is  15341.135016103908
Score of  days_since_prior_order  is  5973.363510578231
Score of  order_hour_of_day  is  4789.690540638999
Score of  order_dow  is  486.3513015426374
Score of  aisle_id  is  144.44622792698704
Score of  product_id  is  131.87103173926747
Score of  user_id  is  16.03905835234687
Score of  order_id  is  3.8480886129524037
['product_id' 'aisle_id' 'department_id' 'order_id' 'add_to_cart_order'
 'user_id' 'order_number' 'order_dow' 'order_hour_of_day'
 'days_since_prior_order']


As we can see here, the scores for order_id and user_id are extremely low compared with the other variables, so we will be removing those from our features.
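
To show why an identifier column tends to score so low, here is a small sketch on synthetic data. The two toy features and the noise level are assumptions chosen for illustration: one feature shifts with the label, the other is a plain row counter that is unrelated to it.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)          # synthetic binary label

informative = y + rng.normal(scale=0.5, size=n)  # mean differs between classes
identifier = np.arange(n)                         # row counter, unrelated to y
X_toy = np.column_stack([informative, identifier])

# f_classif computes an ANOVA F-score per feature; k=1 keeps the best one
selector = SelectKBest(f_classif, k=1).fit(X_toy, y)

print(selector.scores_)
print(selector.get_support())
```

The F-score only measures how far the class means are apart relative to the spread within each class, so a column whose values do not vary with the label ends up near the bottom of the ranking.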

In [12]:
best_features = ranked_features[0:8]
print(best_features)
X_best = X[best_features]
print(X_best.shape)

X_best.to_pickle('X.pkl')

['order_number', 'add_to_cart_order', 'department_id', 'days_since_prior_order', 'order_hour_of_day', 'order_dow', 'aisle_id', 'product_id']
(9737270, 8)


In [13]:
Y.to_pickle('Y.pkl')

print(Y.shape)

(9737270,)


We save the final dataframes as pickle files as we go along. That way, when we run our models on Google Colab, we only need to upload the pickle files to our Google Drive, since our laptops do not have enough computing power to run the models.
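
A minimal sketch of the save-and-reload round trip is shown below. It writes a toy frame to a temporary directory, which is an assumption for illustration; in the other notebooks the path would point to wherever the pickle files were uploaded.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for X_best (hypothetical values)
toy_X = pd.DataFrame({"order_number": [1, 4], "add_to_cart_order": [30, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X.pkl")
    toy_X.to_pickle(path)             # same call as used in this notebook
    restored = pd.read_pickle(path)   # counterpart used when loading

print(restored.equals(toy_X))
```

Pickle preserves dtypes and the index, so the reloaded frame can be passed to a model without repeating the preprocessing.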

Another thing to note is that we have not used PCA. While PCA might make our model a little more accurate, it would also turn it into a "black box": if we wanted to gauge which variables drive the result, i.e. what matters if we want a certain product to be reordered, we could not do so, because PCA replaces the original features with completely new dimensions. Moreover, PCA is not very useful here since the number of features is already very small.
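
The interpretability point can be seen in a small sketch on synthetic data. The three correlated toy features below are an assumption for illustration and have nothing to do with the real columns.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))

# Three correlated toy features built from the same underlying signal
X_toy = np.hstack([base + 0.1 * rng.normal(size=(500, 1)) for _ in range(3)])

pca = PCA(n_components=2).fit(X_toy)

# Each row is one new dimension, expressed as weights on the original features
print(pca.components_)
print(pca.explained_variance_ratio_)
```

The first component puts a sizeable weight on every original feature, so a model trained on it can no longer attribute its predictions to a single column such as order_number.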