# Python version of [Fabienvs's Instacart XGBoost starter](https://www.kaggle.com/fabienvs/instacart-xgboost-starter-lb-0-3791)
To see his R code, click [here](https://www.kaggle.com/fabienvs/instacart-xgboost-starter-lb-0-3791).

# Instacart

Let's get an understanding of what our problem is:

## What is our target variable?

We have to predict the list of products in each user's next order.  
The task only covers products that the user has already bought in prior orders.  
So, for each product a user has bought before, we need to predict whether the user will reorder it or not.  
Our target variable is **reordered**, which is either 0 or 1.  
So, this is a **binary classification problem**.
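
A minimal sketch of how this framing becomes a labelled table, using made-up toy rows rather than the competition files. Every (user, product) pair seen in prior orders becomes a candidate row, and `reordered` is 1 only if the product appears in the user's next order. The frame names `prior_pairs` and `next_order` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): products each user bought in prior orders
prior_pairs = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2],
    'product_id': [10, 11, 12, 10, 13],
})

# Toy data (hypothetical): products in each user's next order
next_order = pd.DataFrame({
    'user_id':    [1, 1, 2],
    'product_id': [10, 12, 13],
    'reordered':  [1, 1, 1],
})

# Left join keeps every candidate pair; pairs missing from the next order get NaN
labelled = prior_pairs.merge(next_order, on=['user_id', 'product_id'], how='left')

# A missing label means the product was not reordered
labelled['reordered'] = labelled['reordered'].fillna(0).astype(int)
print(labelled)
```

Each row of `labelled` is one classification example: the features describe the pair, and `reordered` is the label.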

## Building Train and Test data

The data does not come split into ready-made Train and Test sets.  
So, we have to create the Train and Test data using the **eval_set** variable in the **orders** dataset.
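
As a rough sketch of that idea, the toy frame below (hypothetical rows, not the real `orders.csv`) is split into three subsets with one boolean mask per `eval_set` value.

```python
import pandas as pd

# Toy stand-in (hypothetical rows) for the orders table
orders_toy = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5, 6],
    'user_id':  [1, 1, 1, 2, 2, 2],
    'eval_set': ['prior', 'prior', 'train', 'prior', 'prior', 'test'],
})

# One boolean mask per subset
prior_orders = orders_toy[orders_toy['eval_set'] == 'prior']
train_orders = orders_toy[orders_toy['eval_set'] == 'train']
test_orders = orders_toy[orders_toy['eval_set'] == 'test']

print(len(prior_orders), len(train_orders), len(test_orders))
```

In this toy layout each user has several prior orders and exactly one final order that is marked either train or test.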

## 1. Let's begin by loading the required libraries and our data

In [1]:
import os
import glob
from os import listdir
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
os.chdir('../input')
os.getcwd()

In [3]:
for i in glob.glob('*.csv'):
    print(i)

In [4]:
aisles = pd.read_csv('aisles.csv')
departments = pd.read_csv('departments.csv')
order_products__prior = pd.read_csv('order_products__prior.csv')
order_products__train = pd.read_csv('order_products__train.csv')
orders = pd.read_csv('orders.csv')
products = pd.read_csv('products.csv')


## 2. Getting to know our Data Sets

In [5]:
print('__Dimensions of our Data Sets__')
print('aisles'.ljust(30),aisles.shape)
print('departments'.ljust(30),departments.shape)
print('order_products__prior'.ljust(30),order_products__prior.shape)
print('order_products__train'.ljust(30),order_products__train.shape)
print('orders'.ljust(30),orders.shape)
print('products'.ljust(30),products.shape)

In [6]:
print('aisle Overview:')
aisles.head()

### creating a function to get aisle when aisle_id is passed

In [7]:
def get_aisle(id):
    return aisles[aisles['aisle_id'] == id].iloc[0,1]

In [8]:
print('departments Overview:')
departments.head()

### creating a function to get department when department_id is passed

In [9]:
def get_department(id):
    return departments[departments['department_id'] == id].iloc[0,1]

## Now let's take a look at Products data set

In [10]:
print('products overview')
products.head()

Here, we can add the **aisles** and **departments** data sets to the **products** data set using **aisle_id** and **department_id**.
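
One possible way to attach the names is with two left joins instead of a per-row lookup. The sketch below uses small hypothetical tables (`aisles_toy`, `departments_toy`, `products_toy`) with the same key columns as the real files; it is an alternative, not the approach taken in the cells of this notebook.

```python
import pandas as pd

# Toy stand-ins (hypothetical rows) for the three lookup tables
aisles_toy = pd.DataFrame({'aisle_id': [1, 2], 'aisle': ['yogurt', 'fresh fruits']})
departments_toy = pd.DataFrame({'department_id': [4, 16],
                                'department': ['produce', 'dairy eggs']})
products_toy = pd.DataFrame({
    'product_id': [100, 101, 102],
    'product_name': ['Banana', 'Greek Yogurt', 'Apple'],
    'aisle_id': [2, 1, 2],
    'department_id': [4, 16, 4],
})

# Left joins attach the names in one pass and keep the product row order
products_named = (products_toy
                  .merge(aisles_toy, on='aisle_id', how='left')
                  .merge(departments_toy, on='department_id', how='left'))
print(products_named[['product_name', 'aisle', 'department']])
```

On a table with tens of thousands of products, a join like this avoids filtering the lookup table once per row.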

In [11]:
products['aisle'] = np.vectorize(get_aisle)(products['aisle_id'])
products['department'] = np.vectorize(get_department)(products['department_id'])

In [12]:
print('Products overview after adding aisle and department columns')
products.head()

In [13]:
del (aisles, departments)

### orders data set

In [14]:
print('orders Overview')
orders.head()

The **orders** data set shows us the **orders placed by each user and their sequence**.

## Let's find out at which hour of the day people place most of their orders

In [15]:
which_hour = orders.groupby(['order_hour_of_day'])['order_id'].count().reset_index()
plt.figure(figsize = (12,8))
# recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x = which_hour.order_hour_of_day, y = which_hour.order_id)
plt.xlabel('Hour of the Day', fontsize = 12)
plt.ylabel('Orders Count', fontsize = 12)
plt.title('Number of orders placed in each hour of the day', fontsize = 12)
plt.show()
del which_hour

This shows that most orders are placed between **8 AM and 6 PM**.
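
A claim like this can be checked with a number as well as a plot. The sketch below computes the share of orders inside an hour window on hypothetical toy hours; the window bounds 8 and 18 are an assumption about how to read "8 AM to 6 PM".

```python
import pandas as pd

# Toy stand-in (hypothetical rows): one row per order with its hour of day
orders_toy = pd.DataFrame({'order_hour_of_day': [7, 9, 10, 10, 14, 15, 17, 18, 22, 23]})

# Fraction of orders whose hour falls in the 8..18 window (both ends inclusive)
daytime_share = orders_toy['order_hour_of_day'].between(8, 18).mean()
print(f'{daytime_share:.0%} of the toy orders fall inside the window')
```

Applied to the real `orders` frame, the same expression would turn the visual impression into a single percentage.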

In [16]:
which_day = orders.groupby(['order_dow'])['order_id'].count().reset_index()
plt.figure(figsize = (12,8))
# recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x = which_day.order_dow, y = which_day.order_id)
plt.xlabel('Day of the Week', fontsize = 12)
plt.ylabel('Orders Count', fontsize = 12)
plt.title('Number of orders placed on each day of the week', fontsize = 12)
plt.show()
del which_day

This shows that there are **fewer orders on midweek days** and **more orders on weekends**, assuming **order_dow** values 0 and 1 correspond to the weekend.

The **eval_set** variable in the **orders** data frame is our key to separating the **Train**, **Test** and **Prior** data.

In [17]:
orders['eval_set'].value_counts()

## Plotting Number of orders present in Prior, Train and Test

In [18]:
# value_counts() returns a Series indexed by eval_set value
counts = orders['eval_set'].value_counts()

plt.figure(figsize = (12,8))
sns.barplot(x = counts.index, y = counts.values)
plt.xlabel("type", fontsize = 12)
plt.ylabel("Number of Orders", fontsize = 12)
plt.title("Number of orders in prior, train and test", fontsize = 12)
plt.show()
del counts

This bar plot shows the **number of orders** in each of **prior**, **train** and **test**.

In [19]:
orders.groupby(['eval_set'])['user_id'].nunique()

### From this we can say:  
* Total users     = 206209  
* Users in Train  = 131209  
* Users in Test   = 75000
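
These counts are consistent with every user belonging to exactly one of train or test, since 131209 + 75000 = 206209. A small sketch of how that could be verified, on hypothetical toy rows (the helper `users_in` is illustrative only):

```python
import pandas as pd

# Toy stand-in (hypothetical rows) for the orders table
orders_toy = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 2, 3, 3],
    'eval_set': ['prior', 'prior', 'train', 'prior', 'prior', 'test', 'prior', 'train'],
})

def users_in(eval_set):
    # Set of user ids that have at least one order in the given subset
    return set(orders_toy.loc[orders_toy['eval_set'] == eval_set, 'user_id'])

train_users, test_users, prior_users = users_in('train'), users_in('test'), users_in('prior')

# No user should appear in both train and test
overlap = train_users & test_users
# Every user with prior history should be in exactly one of the two
covered = (train_users | test_users) == prior_users
print(len(overlap), covered)
```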

## Plotting Number of Users present in Prior, Train and Test

In [20]:
user_counts = pd.DataFrame(orders.groupby(['eval_set'])['user_id'].nunique())

plt.figure(figsize = (12,8))
sns.barplot(x = user_counts.index, y = user_counts.user_id)
plt.xlabel("type", fontsize = 12)
plt.ylabel("Number of Users", fontsize = 12)
plt.title("counts of prior,train,test", fontsize = 12)
plt.show()
del user_counts

In [21]:
print('order_products__prior Overview:')
order_products__prior.head()


## 3. Missing Value analysis

In [23]:
orders.isnull().sum()

The **days_since_prior_order** variable in the **orders** data set has **206209 null values**, one for each user's **first order**.

In [24]:
order_products__prior.isnull().sum()

In [25]:
order_products__train.isnull().sum()

In [26]:
products.isnull().sum()

So, there are no missing values except in orders['days_since_prior_order'].  
But we should not remove these missing values, because they indicate that **the order is the user's first order**, and we can use this information later.
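
A short sketch of why the nulls can stay, on hypothetical toy rows: it checks that the nulls line up with `order_number == 1` and shows that a per-user mean ignores them.

```python
import numpy as np
import pandas as pd

# Toy stand-in (hypothetical rows) for the orders table
orders_toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 7.0, 14.0, np.nan, 30.0],
})

# Nulls should line up exactly with each user's first order
is_first = orders_toy['order_number'] == 1
nulls_match_first = orders_toy['days_since_prior_order'].isnull().eq(is_first).all()

# pandas skips NaN in mean(), so the nulls do not distort per-user averages
mean_gap = orders_toy.groupby('user_id')['days_since_prior_order'].mean()
print(nulls_match_first, mean_gap.to_dict())
```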

## 4. Feature Engineering  

### Adding User ID to order_products_train

In [27]:
order_products_train = order_products__train.merge(orders[['order_id','user_id']], on = 'order_id', how = 'left')

In [28]:
order_products_train.head()

## Getting all the information of prior data we have together.

In [29]:
orders_products = orders.merge(order_products__prior, on = 'order_id', how = 'inner')

In [None]:
del order_products__prior

In [None]:
prd = orders_products.sort_values(['user_id', 'order_number', 'product_id'])
prd['product_time'] = prd.groupby(['user_id','product_id'])['order_number'].cumcount(ascending=True)

In [1]:
prd.head()

## Creating new and useful features from prior data
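
Before the real computation, here is a toy walk-through of the product features built in this section. The rows are hypothetical, and the plain row count stands in for the distinct order count because the toy frame has no `order_id`; treat it as a sketch of the idea rather than the exact cells that follow.

```python
import pandas as pd

# Toy prior purchases (hypothetical): product 10 is bought by two users
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 2, 1],
    'order_number': [1, 2, 3, 1, 2],
    'product_id':   [10, 10, 10, 10, 20],
    'reordered':    [0, 1, 1, 0, 0],
})
toy = toy.sort_values(['user_id', 'order_number', 'product_id'])

# 0 for a user's first purchase of a product, 1 for the second, and so on
toy['product_time'] = toy.groupby(['user_id', 'product_id']).cumcount()
toy['is_first'] = toy['product_time'] == 0
toy['is_second'] = toy['product_time'] == 1

prd_toy = toy.groupby('product_id').agg(
    prod_orders=('reordered', 'size'),        # row count, a stand-in for distinct orders
    prod_reorders=('reordered', 'sum'),
    prod_first_orders=('is_first', 'sum'),    # number of users who ever bought it
    prod_second_orders=('is_second', 'sum'),  # number of users who bought it twice
)

# Share of first-time buyers who bought the product a second time
prd_toy['prod_reorder_probability'] = prd_toy['prod_second_orders'] / prd_toy['prod_first_orders']
# Average number of purchases per buyer
prd_toy['prod_reorder_times'] = 1 + prd_toy['prod_reorders'] / prd_toy['prod_first_orders']
# Share of all purchases that were reorders
prd_toy['prod_reorder_ratio'] = prd_toy['prod_reorders'] / prd_toy['prod_orders']
print(prd_toy)
```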

In [None]:
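def uni(a):
    # number of distinct values
    return a.nunique()
def zer(a):
    # rows where a user bought the product for the first time
    return (a == 0).sum()
def one(a):
    # rows where a user bought the product for the second time
    return (a == 1).sum()
prd = prd.groupby('product_id').agg({'order_id':uni, 'reordered':'sum', 'product_time':[zer,one]}).reset_index()
# flatten the two-level column names, e.g. ('order_id', 'uni') becomes 'order_id uni'
prd.columns = [' '.join(col).strip() for col in prd.columns.values]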

In [None]:
prd.rename(columns = {'order_id uni':'prod_orders',
                      'product_time zer':'prod_first_orders',
                      'product_time one':'prod_second_orders',
                      'reordered sum':'prod_reorders'},inplace = True)

In [48]:
prd['prod_reorder_probability'] = prd.prod_second_orders / prd.prod_first_orders
prd['prod_reorder_times'] = 1 + prd.prod_reorders / prd.prod_first_orders
prd['prod_reorder_ratio'] = prd.prod_reorders / prd.prod_orders

In [49]:
prd.drop(columns = ['prod_reorders','prod_first_orders','prod_second_orders'], inplace = True)

In [51]:
prd.head()

# Creating Users Data

In [52]:
users = orders[orders['eval_set'] == 'prior']

In [53]:
def mean(a):
    return a.mean()

# 'max' and 'sum' are passed by name so pandas uses its own NaN-aware aggregations
users = users.groupby(['user_id']).agg({'order_number':'max','days_since_prior_order':['sum',mean]}).reset_index()

users.columns = [' '.join(col).strip() for col in users.columns.values]
users.rename(columns = {'order_number max':'user_orders',
                        'days_since_prior_order sum':'user_period',
                        'days_since_prior_order mean':'user_mean_days_since_prior'},inplace = True)

In [54]:
users.head()

In [55]:
def count(a):
    return a.count()
def unique(a):
    return a.nunique()
def equal(a):
    return (a == 1).sum()
def greater(a):
    return (a > 1).sum()

us = orders_products.groupby(['user_id']).agg({'order_id':count,'reordered':equal,'order_number':greater,'product_id':unique}).reset_index()

# share of the products bought from the second order onward that were reorders
us['user_reorder_ratio'] = us.reordered/us.order_number
us.drop(columns = ['reordered','order_number'], inplace = True)
us.rename(columns = {'order_id':'user_total_products','product_id':'user_distinct_products'}, inplace = True)

In [56]:
us.head()

In [57]:
users = users.merge(us, on = 'user_id', how = 'inner')
users['user_average_basket'] = users.user_total_products / users.user_orders

In [58]:
users.head()

In [59]:
us = orders[orders['eval_set'] != 'prior']
us = us[['user_id', 'order_id', 'eval_set', 'days_since_prior_order']]
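# columns= is required here; without it rename() targets the row index and the column keeps its old name
us = us.rename(columns = {'days_since_prior_order':'time_since_last_order'})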

In [60]:
users = users.merge(us, on = 'user_id', how = 'inner')

In [61]:
del us

# Creating Database with all information we have
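
A toy illustration of the user-product features assembled in this section, on hypothetical rows for a single user. It uses named aggregation for brevity, which differs from the cells below, so treat it as a sketch of what the columns mean rather than the exact method.

```python
import pandas as pd

# Toy prior purchases (hypothetical): one user, four orders, two products
toy = pd.DataFrame({
    'user_id':           [1, 1, 1, 1, 1],
    'order_number':      [1, 1, 2, 3, 4],
    'product_id':        [10, 20, 10, 20, 10],
    'add_to_cart_order': [1, 2, 1, 1, 3],
})

# One row per (user, product) pair
up = toy.groupby(['user_id', 'product_id']).agg(
    up_orders=('order_number', 'size'),
    up_first_order=('order_number', 'min'),
    up_last_order=('order_number', 'max'),
    up_average_cart_position=('add_to_cart_order', 'mean'),
).reset_index()

# Total number of prior orders per user, joined back onto each pair
user_orders = toy.groupby('user_id')['order_number'].max().rename('user_orders').reset_index()
up = up.merge(user_orders, on='user_id')

# Fraction of the user's orders that contained the product
up['up_order_rate'] = up['up_orders'] / up['user_orders']
# How many orders ago the user last bought the product
up['up_orders_since_last_order'] = up['user_orders'] - up['up_last_order']
print(up)
```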

In [62]:
def count(a):
    return a.count()
def first(a):
    return a.min()
def last(a):
    return a.max()
data = orders_products.groupby(['user_id','product_id']).agg({'order_id':count,
                                                              'order_number':[first,last],
                                                              'add_to_cart_order':mean}).reset_index()
data.columns = [' '.join(col).strip() for col in data.columns.values]

In [63]:
data.head()

In [64]:
data.rename(columns = {'order_id count':'up_orders',
                       'add_to_cart_order mean':'up_average_cart_position',
                       'order_number first':'up_first_order',
                       'order_number last':'up_last_order'}, inplace = True)

In [65]:
data.head()

In [66]:
data = data.merge(prd, on = 'product_id', how = 'inner')
data = data.merge(users, on = 'user_id', how = 'inner')

In [67]:
data.head()

In [68]:
data['up_order_rate'] = data.up_orders / data.user_orders
data['up_orders_since_last_order'] = data.user_orders - data.up_last_order

In [69]:
data = data.merge(order_products_train[['user_id','product_id','reordered']], on = ['user_id','product_id'], how = 'left')

In [70]:
data.head()

In [71]:
del (order_products__train, prd, users)

We have created the data set with all the feature information we have. Now, let's create the Train and Test data sets from it.

## Creating Train and Test data sets

In [72]:
train = data[data['eval_set'] == 'train']

In [73]:
# reassign instead of dropping in place, because train is a filtered slice of data
train = train.drop(columns = ['eval_set','user_id','product_id','order_id'])

In [74]:
train = train.fillna({'reordered':0})

In [75]:
train.head()

In [76]:
test = data[data['eval_set'] == 'test']

In [77]:
test = test.drop(columns = ['eval_set','user_id','reordered'])

In [78]:
test.head()

In [79]:
del data

In [80]:
train.columns

In [81]:
test.columns

## Preparing Model
### Converting Train and Test into arrays

In [82]:
ind_columns = train.columns.drop('reordered')

In [83]:
# as_matrix() was removed from pandas; to_numpy() is its replacement
train_ind = train[ind_columns].to_numpy()

In [84]:
# 1-D target array, the shape scikit-learn expects for y
train_dep = train['reordered'].to_numpy()

In [85]:
test_columns = test.columns.drop(['order_id','product_id'])

In [86]:
test_ind = test[test_columns].to_numpy()

## Decision Tree
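
A minimal sketch of the fit and predict pattern used in this section, on synthetic data that is deliberately easy to separate. The arrays and the name `clf_toy` are hypothetical, and `random_state` is fixed only to make the sketch repeatable.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic, clearly separable data (hypothetical): one feature, binary target
X_train = np.array([[0.1], [0.2], [0.8], [0.9]])
y_train = np.array([0, 0, 1, 1])  # 1-D target array
X_new = np.array([[0.15], [0.85]])

clf_toy = DecisionTreeClassifier(random_state=0)
clf_toy.fit(X_train, y_train)

# predict() returns one hard 0/1 label per row of X_new
pred = clf_toy.predict(X_new)
print(pred)
```

The feature matrix is two-dimensional (rows by features) while the target is one-dimensional, which is the same shape convention the real training arrays need to follow.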

In [87]:
from sklearn import tree

In [88]:
clf = tree.DecisionTreeClassifier()

In [89]:
clf = clf.fit(train_ind,train_dep)

In [90]:
prediction = clf.predict(test_ind)

In [91]:
prediction

In [92]:
test['reordered'] = prediction

In [93]:
test.head()

In [94]:
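result_file = test[test['reordered'] == 1]
# join each order's predicted product ids into one space-separated string for the submission file
result_file = result_file.groupby(['order_id'])['product_id'].apply(lambda p: ' '.join(map(str, p.unique()))).reset_index()
result_file = result_file.rename(columns = {'product_id':'products'})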

In [95]:
result_file.head()

In [96]:
final_result = pd.DataFrame({'order_id':orders[orders['eval_set'] == 'test'].order_id})
final_result = final_result.merge(result_file, on = 'order_id', how = 'left')
final_result = final_result.sort_values(['order_id'])
final_result = final_result.fillna({'products':'None'})

### Saving the final result
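
A small sketch of the shape the saved file is meant to have, on hypothetical toy predictions: one row per test order, with the predicted product ids joined by spaces and the string `None` for orders with no predicted product. The space-separated layout is my understanding of the competition's sample submission, so treat it as an assumption.

```python
import pandas as pd

# Toy predictions (hypothetical): one row per (order, product) with a 0/1 label
pred = pd.DataFrame({
    'order_id':   [17, 17, 17, 34, 34],
    'product_id': [13107, 21463, 38777, 16083, 2596],
    'reordered':  [1, 0, 1, 0, 0],
})
test_order_ids = pd.DataFrame({'order_id': [17, 34]})

# Keep predicted reorders and turn the ids into strings so they can be joined
kept = pred[pred['reordered'] == 1].assign(product_id=lambda d: d['product_id'].astype(str))
products = (kept.groupby('order_id')['product_id']
                .agg(' '.join)
                .rename('products')
                .reset_index())

# Orders with no predicted product still need a row; mark them with 'None'
submission = test_order_ids.merge(products, on='order_id', how='left')
submission['products'] = submission['products'].fillna('None')
print(submission)
```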

In [97]:
final_result.to_csv('insta_result.csv', index = False)