### Nitin Shukla
#### The competition aims to predict which previously purchased products will be in a user’s next order. For further reading, please visit:
<https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2>

The dataset contains information on 3.4 million grocery orders, spread across 6 CSV files. Let's play with them!

1. aisles.csv
2. departments.csv
3. order_products__prior.csv
4. order_products__train.csv
5. orders.csv
6. products.csv

In [None]:
# Common Imports
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

### Import Data

In [None]:
# op stands for order_products
orders_df = pd.read_csv('../input/orders.csv')
op_train_df = pd.read_csv('../input/order_products__train.csv')
op_prior_df = pd.read_csv('../input/order_products__prior.csv')
aisles_df = pd.read_csv('../input/aisles.csv')
products_df = pd.read_csv('../input/products.csv')
departments_df = pd.read_csv('../input/departments.csv')

### Peak or Peek into the data? :-/
#### 1. Orders_df:
This file tells which set (prior, train, test) each order belongs to.

In [None]:
print(orders_df.shape)
orders_df.head()

We know the eval_set consists of three sets (prior, train, test). Wanna check their counts...?

In [None]:
count_eval = orders_df.eval_set.value_counts()

#Plot it
plt.figure(figsize=(8,6))
# Pass x and y by keyword; recent seaborn releases no longer accept them positionally
sns.barplot(x=count_eval.index, y=count_eval.values, alpha=0.9)
plt.xlabel('Eval_Set', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.show()

There are no clear blueprints to be discovered in history that can help us shape the future as we wish. Each historical event is a **_unique_** congeries of factors, people, or chronology. (Margaret MacMillan, Dangerous Games, 2008)

The word **_unique_** is of concern to us in the present context. Let's see!

In [None]:
def unique_count(column):
    return len(np.unique(column))

grouped_df = orders_df.groupby('eval_set')['user_id'].aggregate(unique_count)

In [None]:
grouped_df
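
The same per-set count of distinct users can also be obtained with pandas' built-in `nunique`. A minimal sketch on a made-up toy frame (`toy_orders` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for orders_df; the values are made up for illustration.
toy_orders = pd.DataFrame({
    "eval_set": ["prior", "prior", "prior", "train", "test"],
    "user_id":  [1, 1, 2, 1, 2],
})

# nunique counts the distinct user_id values within each eval_set.
unique_users = toy_orders.groupby("eval_set")["user_id"].nunique()
print(unique_users)
```

On the real data, `orders_df.groupby('eval_set')['user_id'].nunique()` should give the same numbers as the custom `unique_count` helper.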

In [None]:
# Plot it
plt.figure(figsize=(8,6))
sns.barplot(x=grouped_df.index, y=grouped_df.values, alpha=0.9)
plt.title('Number of unique customers', fontsize = 12)
plt.xlabel('Eval_Set', fontsize=12)
plt.ylabel('Count', fontsize = 12)
plt.show()

**When do customers buy?**
- Hour of day
- Day of week

In [None]:
count_hour = orders_df.order_hour_of_day.value_counts()

# Plot it
fig = plt.figure(figsize=(12,6))
sns.barplot(x=count_hour.index, y=count_hour.values, alpha=0.9, color='b')
plt.xlabel('order_hour_of_day', fontsize=14)
plt.ylabel('count', fontsize=14)
plt.xticks(rotation='vertical')
plt.show()

The majority of orders are placed between **8 am and 6 pm**, with the highest count at 10 am.
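
A claim about a time window like this can be backed with a number by computing the share of orders whose hour falls inside it. A rough sketch on made-up hours (`toy_orders` is hypothetical, and treating both 8 and 18 as inside the window is an assumption):

```python
import pandas as pd

# Toy stand-in for orders_df.order_hour_of_day; the hours are made up.
toy_orders = pd.DataFrame({"order_hour_of_day": [7, 8, 10, 10, 14, 18, 21, 23]})

# Boolean mask: True where the hour is in [8, 18] (inclusive on both ends, an assumption).
in_window = toy_orders["order_hour_of_day"].between(8, 18)

# The mean of a boolean mask is the fraction of True values.
share = in_window.mean()

# The most frequent hour is the index of the largest value count.
peak_hour = toy_orders["order_hour_of_day"].value_counts().idxmax()

print(share, peak_hour)
```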

In [None]:
count_day = orders_df.order_dow.value_counts()

# Plot it
fig = plt.figure(figsize=(8,6))
sns.barplot(x=count_day.index, y=count_day.values, alpha=0.9, color='pink')
plt.xlabel('day_of_week', fontsize=14)
plt.ylabel('count', fontsize=14)
plt.xticks(rotation='vertical')
plt.show()

The day of the week has a visible effect: most orders are placed on days 0 and 1.

**How often do they order again?**

In [None]:
count_day = orders_df.days_since_prior_order.value_counts()

# Plot it
fig = plt.figure(figsize=(12,6))
sns.barplot(x=count_day.index, y=count_day.values, alpha=0.9, color='r')
plt.xlabel('days_since_prior_order', fontsize=14)
plt.ylabel('count', fontsize=14)
plt.xticks(rotation='vertical')
plt.show()

#### 2. OP_train_df:
This file tells which products were purchased in each order. The **'reordered'** column indicates that the customer has a previous order that contains the product.
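
To make the meaning of the flag concrete, here is a small sketch that derives such a flag from a made-up order history. It only illustrates the definition above and is not how Instacart built the column; the frame `toy` and its values are hypothetical, and it assumes a product appears at most once per order.

```python
import pandas as pd

# Toy purchase history for a single user; the values are made up.
toy = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 1],
    "order_number": [1, 1, 2, 2, 3],
    "product_id":   [10, 20, 10, 30, 20],
})

# Process each user's orders chronologically.
toy = toy.sort_values(["user_id", "order_number"])

# cumcount numbers the occurrences of each (user, product) pair starting at 0,
# so any occurrence after the first means the product was bought before.
toy["reordered"] = (toy.groupby(["user_id", "product_id"]).cumcount() > 0).astype(int)

print(toy)
```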

In [None]:
print(op_train_df.shape)
op_train_df.head()

Don't you think we must take a peek at how many items are in the orders?

In [None]:
grouped_df = op_train_df.groupby('order_id')['add_to_cart_order'].aggregate('max').reset_index()

# Plot it
count = grouped_df.add_to_cart_order.value_counts()
fig = plt.figure(figsize=(15,8))
sns.barplot(x=count.index, y=count.values, alpha=0.9)
plt.xlabel('Number of items in the order_id')
plt.ylabel('Number of occurrences')
plt.xticks(rotation='vertical')
plt.show()

It's clear that customers most often order around 5 items.
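
The most common basket size can also be read off numerically rather than from the plot. A minimal sketch on made-up rows (`toy_op` is a hypothetical stand-in for `op_train_df`):

```python
import pandas as pd

# Toy stand-in for op_train_df; the values are made up.
toy_op = pd.DataFrame({
    "order_id":          [1, 1, 1, 2, 2, 3, 3, 3, 4],
    "add_to_cart_order": [1, 2, 3, 1, 2, 1, 2, 3, 1],
})

# The largest add_to_cart_order within an order is the number of items in it.
basket_size = toy_op.groupby("order_id")["add_to_cart_order"].max()

# Most frequent basket size (the mode) and the median, for comparison.
most_common_size = basket_size.value_counts().idxmax()
median_size = basket_size.median()

print(most_common_size, median_size)
```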

**Man's Search for Meaning** is a 1946 book by Viktor Frankl chronicling his experiences as an Auschwitz concentration camp inmate during World War II. He describes in detail his psychotherapeutic method, which involved identifying a purpose in life to feel positive about and then immersively imagining that outcome. According to Frankl, the way a prisoner imagined the future affected his longevity.

At the time of his death in 1997, the book had sold over 10 million copies and had been translated into 24 languages, making it one of the **_bestsellers_** of its time.

The word **_bestsellers_** is of concern to us in the present context. Let's see!

In [None]:
# Merge the products_df to op_train_df in order to get product names
grouped_df = pd.merge(op_train_df, products_df, on='product_id', how='left')

In [None]:
grouped_df.head()

In [None]:
# Count the number of times each product was bought
new_df = grouped_df.groupby('product_name')['product_id'].aggregate('count')

In [None]:
# Plotting the top 10 bestsellers
plt.figure(figsize=(8,6))
new_df.nlargest(10).plot(kind='bar', color='r')
plt.xlabel('product_name', fontsize=14)
plt.ylabel('count', fontsize=14)
plt.title('Bestsellers', fontsize=14)
plt.xticks(fontsize=12)
plt.show()

**Reordered or Not? Let's check!**

In [None]:
grouped_df.reordered.value_counts()

In [None]:
op_train_df.reordered.astype(float).sum()/op_train_df.shape[0]

**59.85%** of the items are reordered in the train set.

In [None]:
# Plot it
plt.figure(figsize=(8,6))
sns.countplot(x=grouped_df.reordered)
plt.xlabel('reordered', fontsize=14)
plt.ylabel('count', fontsize=14)
plt.show()

**Proportion Reordered**

In [None]:
new_df = grouped_df.copy()

In [None]:
# Count the number of times each product was reordered
product_reordered_df = new_df.groupby('product_name')['reordered'].aggregate('sum').reset_index()

In [None]:
# Count the number of times each product was bought
product_bought_df = grouped_df.groupby('product_name')['product_id'].aggregate('count').reset_index()

In [None]:
proportion_reordered_df = pd.merge(product_reordered_df, product_bought_df, on='product_name', how='left')

In [None]:
proportion_reordered_df['proportion_reordered'] = (proportion_reordered_df['reordered']/
                                                  proportion_reordered_df['product_id'])

In [None]:
proportion_reordered_df = proportion_reordered_df[proportion_reordered_df['product_id']>40]

In [None]:
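# Reassign instead of dropping in place: the frame is a filtered slice, and
# an in-place drop on it can trigger pandas' SettingWithCopyWarning
proportion_reordered_df = proportion_reordered_df.drop(['reordered', 'product_id'], axis=1)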
pr_df = proportion_reordered_df.nlargest(10, 'proportion_reordered')
pr_df

Among products bought more than 40 times in the train set, these 10 have the highest proportion of reorders.
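
The count, sum, merge, and divide steps can be collapsed into a single grouped aggregation, because the mean of a 0/1 column is the proportion of ones. A sketch on made-up data (`toy`, `stats`, and the `min_purchases` threshold are hypothetical; the notebook itself uses a threshold of 40):

```python
import pandas as pd

# Toy stand-in for the merged order/product frame; the values are made up.
toy = pd.DataFrame({
    "product_name": ["Banana", "Banana", "Banana", "Milk", "Milk", "Kale"],
    "reordered":    [1, 1, 0, 1, 0, 1],
})

# One pass: how often each product was bought, and what fraction were reorders.
stats = toy.groupby("product_name")["reordered"].agg(
    times_bought="count",
    proportion_reordered="mean",
)

# Drop rarely bought products so a single lucky reorder does not top the list.
min_purchases = 1  # hypothetical threshold for the toy data
top = stats[stats["times_bought"] > min_purchases].nlargest(10, "proportion_reordered")

print(top)
```

Keeping `times_bought` next to the proportion makes it easier to judge how much evidence stands behind each value.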

In [None]:
# Plot it
plt.figure(figsize=(8,6))
sns.barplot(x=pr_df.product_name, y=pr_df.proportion_reordered, color='r', alpha=0.9)
plt.xlabel('product_name', fontsize=10)
plt.xticks(rotation='vertical', fontsize=12)
plt.ylabel('proportion_reordered', fontsize=12)
plt.title('Probability of being reordered', fontsize=14)
plt.show()

**STAY TUNED!**