# Instacart Exploratory Data Analysis
> To gather a data driven intuition about Instacart's customers and business. 
> Prework to building a deep learning recommendation system

* This notebook is inspired by and draws from Serigne

## Table of Contents
- [Average Order Amount](#AOA)
- [Most Popular Products](#MPP)
- [Reordered Products](#ROP)
- [Ordering Behavior](#OB)
    *  [Time of Day](#ToD)
    *  [Day of Week](#DoW)
    *  [Rate of Reorder](#RoR)
- [Products, Departments, & Aisles Oh My!](#PDA)
- [Departments with The Most Products](#DMP)
- [Most Products Per Aisle](#MPPA)
- [Aisles with The Most Products (Top 10)](#AMP)
- [Best Selling Department](#BSD)
- [Best Selling Aisles in Best Selling Department](#BSABSD)
- [Best Selling Aisles Overall](#BSAO)
- [What's Next?](#WN)

In [None]:
import numpy as np 
import pandas as pd 
pd.set_option('display.float_format', lambda x: '%.3f' % x)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from subprocess import check_output
print(check_output(['ls', '../input']).decode('utf8'))

In [None]:
import glob, re
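# Load every CSV in ../input into a dict keyed by file name (without extension)
dfs = {re.search(r'/([^/.]*)\.csv', fn).group(1):
      pd.read_csv(fn) for fn in glob.glob('../input/*.csv')}
# Expose each dataframe as a notebook-level variable (e.g. `orders`, `products`)
for k, v in dfs.items(): globals()[k] = v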

<a name="AOA"></a>
## Average Order Amount 
Let's start by getting an idea of how many products customers order on average

In [None]:
opp = order_products__prior
opp.tail()

In [None]:
opt = order_products__train
opt.tail()

In [None]:
opa = pd.concat([opp, opt], axis = 0) # 'order products all'
opa.tail()

In [None]:
# sanity check
print(len(opp))
print(len(opt))
print(len(opa))
print(opa.isnull().sum())

Now we can take a look at the distribution of number of products ordered

In [None]:
# The highest add_to_cart_order in an order equals the number of products in it
order_total = opa.groupby('order_id')['add_to_cart_order'].max().reset_index()
order_total = order_total.add_to_cart_order.value_counts()[:20] # 20 most common order sizes

sns.set_style('whitegrid')
f, ax = plt.subplots(figsize=(15,12))
plt.xticks(rotation = 'vertical')
sns.barplot(x = order_total.index, y = order_total.values)

plt.xlabel('Number of Products in Order', fontsize = 14)
plt.ylabel('Number of Orders', fontsize = 14)
plt.show()

From the bar chart we can see the most common order size is 5 products, with order sizes of 3-8 each totaling over 200k orders. From this we can deduce we'll want to recommend our customers somewhere in the range of 3-8 products.
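
A bar chart of counts shows the most common order size, which is not necessarily the mean. As a rough sketch, the three usual summaries can be compared side by side. The `toy_opa` frame below is made-up data standing in for `opa`, and only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for `opa` (assumption: same column names as the notebook's frame)
toy_opa = pd.DataFrame({
    "order_id":          [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4],
    "add_to_cart_order": [1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 1, 2],
})

# Number of products per order = highest cart position within the order
basket_size = toy_opa.groupby("order_id")["add_to_cart_order"].max()

summary = {
    "mean": basket_size.mean(),
    "median": basket_size.median(),
    "mode": basket_size.mode().iloc[0],  # most common order size
}
print(summary)
```

Running the same three lines on the real `opa` would show how far the mean sits from the peak of the chart.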

<a name='MPP'></a>
## Most Popular Products 
Now that we know how many products to recommend our customers, we need to figure out what those products will be.

Like any recommendation system, for a new customer we won't know what they like, so we can't recommend them products based on what they've bought previously. Instead we'll take a marketing approach and recommend what's popular, since people tend to buy what they see other people buying.
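
To make the idea concrete, here is a minimal sketch of a popularity fallback. The helper names `top_k_popular` and `recommend` are hypothetical and the data is made up; this is not the recommender the follow-up kernel will build, only an illustration of the cold-start reasoning.

```python
import pandas as pd

def top_k_popular(order_lines, k=5):
    """Return the k most frequently ordered product_ids, most popular first."""
    counts = order_lines["product_id"].value_counts()
    return counts.head(k).index.tolist()

def recommend(user_history, order_lines, k=3):
    """Hypothetical helper: use past purchases first, then pad with popular items."""
    # Known customer: start from their own purchases (deduplicated, order kept)
    picks = list(dict.fromkeys(user_history))[:k]
    # New customer (or short history): fill the remaining slots with popular products
    for pid in top_k_popular(order_lines, k=k + len(picks)):
        if len(picks) == k:
            break
        if pid not in picks:
            picks.append(pid)
    return picks

# Toy order lines: product 40 is ordered 4 times, 10 three times, 20 twice, 30 once
toy_lines = pd.DataFrame({"product_id": [10, 10, 10, 20, 20, 30, 40, 40, 40, 40]})

print(recommend([], toy_lines))    # new customer: purely popularity based
print(recommend([30], toy_lines))  # returning customer: own history comes first
```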

So what are our most popular items? 

In [None]:
# Named aggregation; dict-based renaming in .aggregate() was removed in pandas 1.0
pop = opa.groupby('product_id')['add_to_cart_order'].agg(total_ordered = 'count').reset_index()
pop = pop.merge(products[['product_id', 'product_name']], how = 'left', on = ['product_id'])
pop = pop.sort_values(by = 'total_ordered', ascending = False)[:10]
pop

These top items look a lot like my grocery list. Just replace whole milk with pea milk!

Let's visualize the data to get a better idea of how our top 10 compare.

In [None]:
pop = pop.groupby(['product_name']).sum()['total_ordered'].sort_values(ascending = False)

f, ax = plt.subplots(figsize=(12,10))
sns.set_style('darkgrid')
sns.barplot(x = pop.index, y = pop.values)

plt.xticks(rotation = 'vertical')
plt.ylabel('Number of Orders', fontsize = 14)
plt.xlabel('Most Popular Products', fontsize = 14)
plt.show()

<a name='ROP'></a>
## Reordered Products 
Next we'll want to understand which products are most likely to be reordered so we can recommend products customers are likely to enjoy and buy again. Let's begin by making sure reordering is a normal consumer decision to begin with. For all we know our customers enjoy novelty and look to buy new things often. 

In [None]:
# Count products by reordered flag (0 = first purchase, 1 = reorder) and compute each share
reorder_ratio = opa.groupby('reordered')['product_id'].agg(total_products = 'count').reset_index()
reorder_ratio['ratio'] = reorder_ratio['total_products'] / reorder_ratio['total_products'].sum()
reorder_ratio

~60% of all products ordered are reorders. In other words, the bulk of our sales volume comes from customers reordering products they like. Let's get a visual on that.

In [None]:
reorder_ratio = reorder_ratio.groupby(['reordered']).sum()['total_products'].sort_values(ascending = False)

f, ax = plt.subplots(figsize = (5,8))
sns.set_style('whitegrid')
sns.barplot(x = reorder_ratio.index, y = reorder_ratio.values, palette = 'RdBu')

plt.xlabel('New Item/Reordered', fontsize = 14)
plt.ylabel('Total Number of Orders', fontsize = 14)
plt.ticklabel_format(style = 'plain', axis = 'y')
plt.show()


That's a solid 6 million orders difference! It's safe to say reordered products deserve our attention. Let's take a look at the products most likely to be reordered. Since we're ranking by a percentage, products that were ordered only a couple of times can easily top the list with a misleadingly high reorder rate, so I'm going to set a minimum total order count of 75.
I tried setting the minimum to 1000, but 9 of the top 10 results were milk. That isn't very useful: store owners already know milk is a top repeat purchase, and it is so commonplace that customers would likely buy it whether or not we recommended it.
I settled on 75 because that was the point at which milk stopped showing up in the top results.
Ultimately this number doesn't matter, as we'll aim to recommend the product the customer is most likely to buy regardless of how many other people bought it. A sale's a sale!
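
The effect of the minimum-order threshold is easy to see on a toy example. The data below is made up, and `min_orders = 5` is an arbitrary value chosen for the toy frame (the notebook uses 75 on the real data).

```python
import pandas as pd

# Toy data: product 1 was ordered twice (both reorders),
# product 2 was ordered ten times (seven reorders)
toy = pd.DataFrame({
    "product_id": [1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    "reordered":  [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0],
})

stats = toy.groupby("product_id")["reordered"].agg(
    reorder_total="sum", order_total="count"
)
stats["reorder_probability"] = stats["reorder_total"] / stats["order_total"]

min_orders = 5  # hypothetical threshold for this toy frame

# Without a threshold the rarely ordered product "wins" with a perfect rate
ranked_all = stats.sort_values("reorder_probability", ascending=False)
# With the threshold only products with enough orders are ranked
ranked_filtered = stats[stats["order_total"] > min_orders].sort_values(
    "reorder_probability", ascending=False
)

print(ranked_all)
print(ranked_filtered)
```

Product 1 tops the unfiltered ranking on the strength of only two orders, which is exactly the skew the threshold is meant to remove.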

In [None]:
# reorder_total = times the product was reordered, order_total = times it was ordered at all
product_rr = opa.groupby('product_id')['reordered'].agg(reorder_total = 'sum', order_total = 'count').reset_index()
product_rr['reorder_probability'] = product_rr['reorder_total'] / product_rr['order_total']
product_rr = product_rr.merge(products[['product_name', 'product_id']], how = 'left', on = 'product_id')
product_rr = product_rr[product_rr.order_total > 75].sort_values(['reorder_probability'], ascending = False)[:10]
product_rr

In [None]:
product_rr = product_rr.sort_values('reorder_probability', ascending = False)

plt.subplots(figsize = (12, 10))
sns.set_style('darkgrid')
sns.barplot(x = product_rr.product_name, y = product_rr.reorder_probability)
plt.ylim([.85, .95])
plt.xticks(rotation = 'vertical')
plt.xlabel('Most Reordered Products')
plt.ylabel('Reorder Probability')
plt.show()

<a name='OB'></a>
## Ordering Behavior

Now that we have a good feel about our top products, let's shift focus to how our customers tend to order

### Time of Day <a name='ToD'></a>

We'll start by understanding the time of day orders occur.

In [None]:
time_of_day = orders.groupby('order_hour_of_day')['order_id'].agg('count').reset_index()

f, ax = plt.subplots(figsize = (15, 10))
sns.barplot(x = time_of_day['order_hour_of_day'], y = time_of_day['order_id'])
plt.xlabel('Orders by Hour', fontsize = 14)
plt.ylabel('Total Orders', fontsize = 14)
plt.show()

Considering the most popular times of day fall between 9am and 5pm, I'm guessing our customers mostly shop at work. Next we'll check that guess by looking at which days of the week see the most shopping.

### Day of Week <a name='DoW'></a>

In [None]:
grouped = orders.groupby('order_dow')['order_id'].agg('count').reset_index()

f, ax = plt.subplots(figsize = (15,10))
sns.set_style('whitegrid')
current_palette = sns.color_palette('colorblind')
sns.set_palette(current_palette)
sns.barplot(x = 'order_dow', y = 'order_id', data = grouped)
plt.xlabel('Day of Week')
plt.ylabel('Total Orders')
plt.show()

Unfortunately, we don't have information on how Instacart labeled their day of week data, so we're not sure which days these numbers line up with. Even so, the graph shows two days of the week clearly dominating.

From my experience running an online store, Monday and Tuesday were always the highest sales days. That inference would also line up with the hourly chart above: people want to put off their work at the beginning of the week and do their online shopping instead.

However, those two days could also be the weekend, since people behave differently on a weekday than on a weekend. The 9-5 peak could simply reflect the hours people are most commonly on their computers.

We'll tentatively conclude the weekend is when most people buy their groceries.
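
One way to probe the weekday-versus-weekend question further is to cross the two columns and compare the hourly profile of each day. The sketch below uses made-up rows with the same column names as `orders`. It cannot tell us which number maps to which day, only whether the busiest days share the 9-5 pattern of the others.

```python
import pandas as pd

# Toy stand-in for `orders` (assumption: same column names, made-up values)
toy_orders = pd.DataFrame({
    "order_id":          [1, 2, 3, 4, 5, 6],
    "order_dow":         [0, 0, 0, 1, 1, 3],
    "order_hour_of_day": [10, 10, 15, 10, 20, 9],
})

# Rows = day of week, columns = hour of day, cells = number of orders
dow_hour = pd.crosstab(toy_orders["order_dow"], toy_orders["order_hour_of_day"])

# Share of each day's orders placed between 9:00 and 16:59
daytime = toy_orders["order_hour_of_day"].between(9, 16)
daytime_share = daytime.groupby(toy_orders["order_dow"]).mean()

print(dow_hour)
print(daytime_share)
```

On the real data, `sns.heatmap(dow_hour)` would give a quick visual of the same table.
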
<a name='RoR'></a>
### Rate of Reorder 

Next we'll examine how often a typical customer orders from us

In [None]:
grouped = orders.groupby('days_since_prior_order')['order_id'].agg('count').reset_index()

from matplotlib.ticker import FormatStrFormatter
f, ax = plt.subplots(figsize = (15,19))
sns.barplot(x = 'days_since_prior_order', y = 'order_id', data = grouped)
ax.xaxis.set_major_formatter(FormatStrFormatter('%.0f'))
plt.xlabel('Days to Reorder')
plt.ylabel('Total Orders')
plt.show()

The chart shows a clear spike at 7 days, so many customers order about once a week. Assuming 30 is not an aggregate meaning 30+, we can also conclude that customers who are not weekly shoppers with us tend to be monthly shoppers.
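
The assumption about 30 can be sanity-checked rather than taken on faith. A minimal sketch, on made-up values: if the count at 30 is far larger than at 28 or 29, that suggests (but does not prove) that 30 is acting as a catch-all bucket for longer gaps.

```python
import pandas as pd

# Toy stand-in for orders['days_since_prior_order']; None mimics a first order
toy_days = pd.Series(
    [7, 7, 7, 14, 28, 29, 30, 30, 30, 30, None],
    name="days_since_prior_order",
)

counts = toy_days.value_counts().sort_index()

# How much taller is the 30-day bar than its neighbour at 29 days?
tail_jump = counts.loc[30.0] / counts.loc[29.0]

# Median gap between orders (missing values are skipped)
median_gap = toy_days.median()

print(counts)
print(tail_jump, median_gap)
```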

The last insight we can mine from 'orders' is how many orders a typical customer makes. 
>Put more strategically, after how many orders can we expect a customer to churn? 

In [None]:
grouped = orders.groupby('user_id')['order_id'].agg('count')
grouped = grouped.value_counts()

f, ax = plt.subplots(figsize=(15, 12))
sns.barplot(x = grouped.index, y = grouped.values)
plt.xlabel('Number of Orders')
plt.ylabel('Number of Customers')
plt.show()

The most likely time for a new customer to churn is after the 4th order. Perhaps we should give them something free if they make a 5th order, and maybe combine that with an online punch card that rewards them again on their 10th order.
As you can see from the 100 bar at the far right, if we can keep them around long enough, the shopping experience will become a habit and customers will continue to buy from us, every 7 days or so as our earlier chart showed.
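
The same counts can be turned into a simple retention curve: the share of customers who placed at least n orders. This is a sketch on made-up data. Note that order counts alone cannot distinguish a customer who churned from one whose later orders simply fall outside the dataset.

```python
import pandas as pd

# Toy stand-in for `orders`: four customers with 4, 4, 5 and 8 orders
toy_orders = pd.DataFrame({
    "user_id":  [1] * 4 + [2] * 4 + [3] * 5 + [4] * 8,
    "order_id": range(21),
})

orders_per_user = toy_orders.groupby("user_id")["order_id"].count()

# Share of customers with at least n orders, for every n in the observed range
n_values = range(int(orders_per_user.min()), int(orders_per_user.max()) + 1)
retention = pd.Series({n: (orders_per_user >= n).mean() for n in n_values})

print(retention)
```

A steep drop between two consecutive values of n marks the order count after which the most customers stop appearing.
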
<a name='PDA'></a>
## Products, Departments & Aisles Oh My!
At this point we have a good idea of our customers' buying behavior and favorite products.

Now, we'll move the focus inward and explore our store inventory. In which department do we offer the most products? Which aisle generates the most sales? We'll cover these questions and more right after we merge our products, departments, and aisles tables together into one nice and clean dataframe. 
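
Before relying on a merged table it can help to confirm the joins did not duplicate or drop products. The sketch below uses tiny made-up tables with the same column names and pandas' `validate` argument to `merge`, which raises an error if a key that should be unique on the right-hand side is not.

```python
import pandas as pd

# Toy stand-ins for the three tables (made-up rows, notebook column names)
toy_products = pd.DataFrame({
    "product_id":    [1, 2, 3],
    "product_name":  ["Tortilla Chips", "Sea Salt Chips", "Banana"],
    "aisle_id":      [10, 10, 20],
    "department_id": [100, 100, 200],
})
toy_departments = pd.DataFrame({
    "department_id": [100, 200],
    "department":    ["snacks", "produce"],
})
toy_aisles = pd.DataFrame({
    "aisle_id": [10, 20],
    "aisle":    ["chips pretzels", "fresh fruits"],
})

# many_to_one: many products may share a department/aisle, but each id is unique on the right
toy_items = (
    toy_products
    .merge(toy_departments, how="left", on="department_id", validate="many_to_one")
    .merge(toy_aisles, how="left", on="aisle_id", validate="many_to_one")
)

# A left join on unique keys should keep exactly one row per product
assert len(toy_items) == len(toy_products)
# Any nulls here would mean a product pointing at an unknown department or aisle
missing = toy_items[["department", "aisle"]].isnull().sum()
print(toy_items)
print(missing)
```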

In [None]:
items = products.merge(departments, how='left', on='department_id')
items = items.merge(aisles, how='left', on='aisle_id')
items.head()

<a name='DMP'></a>
## Departments with The Most Products

In [None]:
# Number of distinct products stocked per department and each department's share of inventory
grouped = items.groupby('department')['department_id'].agg(total_products = 'count').reset_index()
grouped['percent_of_inv'] = grouped['total_products'] / grouped['total_products'].sum()
grouped = grouped.sort_values(by='total_products', ascending=False)
grouped.head()

In [None]:
f, ax = plt.subplots(figsize=(12,10))
sns.barplot(x='department', y='total_products', data=grouped)
plt.xticks(rotation='vertical')
plt.xlabel('Department', fontsize=14)
plt.ylabel('Number of Products', fontsize=14)
plt.show()

Given the variety of affordable personal care and snack items out there, it makes sense that we stock a greater variety of those items than anything else.

<a name='MPPA'></a>
## Most Products Per Aisle
So we know which departments have more products than others, but what are those products? Let's generate plots of each department to see which aisles in each department we stock the most. We'll stick to our top 2 departments so this doesn't get unwieldy to look at. 

In [None]:
# Products per (department, aisle); size() returns a Series that reset_index can name
g1 = items.groupby(["department", "aisle"]).size().reset_index(name="count")
g2 = g1.loc[g1['department'] == 'personal care']
g2 = g2.sort_values(by='count', ascending=False)[:5]
current_palette = sns.color_palette('colorblind')
sns.set_palette(current_palette)
sns.barplot(x='aisle', y='count', data=g2)
plt.xticks(rotation='vertical')
plt.title('Personal Care Department')
plt.show()

In [None]:
g2 = g1.loc[g1['department'] == 'snacks']
g2 = g2.sort_values(by='count', ascending=False)[:5]
current_palette = sns.color_palette('colorblind')
sns.set_palette(current_palette)
sns.barplot(x='aisle', y='count', data=g2)
plt.xticks(rotation='vertical')
plt.title('Snacks Department')
plt.show()

Vitamins and chocolate! It's true there are a lot of varieties in both. 

<a name='AMP'></a>
## Aisles with The Most Products (Top 10)

Now let's remove the top department criteria and look at aisle product count independent of department. 

In [None]:
grouped = items.groupby('aisle')['product_id'].agg(total_products = 'count').reset_index()
grouped['ratio'] = grouped['total_products'] / grouped['total_products'].sum()
grouped = grouped.sort_values(by='total_products', ascending=False)[:10]
grouped

In [None]:
f, ax = plt.subplots(figsize=(12,10))
sns.barplot(x='aisle', y='total_products', data=grouped)
plt.xticks(rotation='vertical')
plt.show()

<a name='BSD'></a>
## Best Selling Department
We might carry a lot of personal care items but does that make it the most profitable department?

We don't have data on revenue so we can't say which department is the most profitable, but by looking at total orders made by department we can at least know which ones are generating the most sales. 
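
One caveat when reading "total orders" by department: counting rows of the merged table counts products sold (line items), not distinct orders, since an order containing three produce items contributes three rows. A small sketch of the difference, on made-up rows with the notebook's column names:

```python
import pandas as pd

# Toy stand-in for the merged order lines (one row per product in an order)
toy_lines = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2, 3],
    "department": ["produce", "produce", "dairy eggs",
                   "produce", "snacks", "dairy eggs"],
})

by_dept = toy_lines.groupby("department")["order_id"].agg(
    line_items="count",         # products sold from the department
    distinct_orders="nunique",  # orders containing at least one such product
)
print(by_dept)
```

Both measures are reasonable proxies for sales; they just answer slightly different questions.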

In [None]:
orders.head()

In [None]:
orders.eval_set.value_counts()

In [None]:
order_products = orders[['user_id', 'order_id']].merge(opa[['order_id', 'product_id']],
                                                 how='inner', on='order_id')
order_products = order_products.merge(items, how='inner', on='product_id')

# One row per product sold, so this counts products sold per department
grouped = order_products.groupby('department')['order_id'].agg(total_orders = 'count').reset_index()
grouped['ratio'] = grouped['total_orders'] / grouped['total_orders'].sum()
grouped = grouped.sort_values(by='total_orders', ascending=False)
grouped

In [None]:
f, ax = plt.subplots(figsize=(12,10))
sns.barplot(x='department', y='total_orders', data=grouped)
ax.yaxis.set_major_formatter(FormatStrFormatter('%.0f'))
plt.xticks(rotation='vertical')
plt.xlabel('Departments', fontsize=14)
plt.ylabel('Number of Orders', fontsize=14)
plt.show()

Happy to see produce as the most ordered department by a long margin. But what produce are we really talking about here? Let's break it down by aisle.
<a name='BSABSD'></a>
## Best Selling Aisles in Best Selling Department

In [None]:
g1 = order_products.groupby(['department', 'aisle']).size().reset_index(name="count")
g2 = g1.loc[g1['department'] == 'produce']
g2 = g2.sort_values(by='count', ascending=False)[:5]
g2

In [None]:
current_palette = sns.color_palette('colorblind')
sns.set_palette(current_palette)
sns.barplot(x='aisle', y='count', data=g2)
plt.xticks(rotation='vertical')
plt.title('Top Produce Aisles')
plt.show()

<a name='BSAO'></a>
## Best Selling Aisles Overall
Our last exploration will be into the top aisles independent of department.

In [None]:
grouped = order_products.groupby("aisle")["order_id"].agg(Total_orders = 'count').reset_index()
grouped['Ratio'] = grouped['Total_orders'] / grouped['Total_orders'].sum()
grouped.sort_values(by='Total_orders', ascending=False, inplace=True )
grouped.head(10)

In [None]:
grouped  = grouped.groupby(['aisle']).sum()['Total_orders'].sort_values(ascending=False)[:15]

f, ax = plt.subplots(figsize=(12, 15))
plt.xticks(rotation='vertical')
sns.barplot(x = grouped.index, y = grouped.values)
plt.ylabel('Number of Orders', fontsize=13)
plt.xlabel('Aisles', fontsize=13)
plt.show()

Wow, fresh fruit and vegetables make up 20% of all orders. Good job customers!

<a name='WN'></a>
## What's Next?
That concludes our exploratory data analysis of Instacart's customers and business. 

Our next kernel will be on applying these insights and the merged dataframes we've built to a neural network system for recommending what customers should order next. Should be fun!