> This is a copy of my blog post [Recreating the Instacart
> Chart](https://onefortheroad.github.io/tutorial/pandas/2017/05/24/recreating-instacart-chart/)
> where I recreated the Instacart Products by Hour chart.  I'm reposting
> it here as a Kaggle kernel as a nod to Nigel Carpenter's original
> [post](https://www.kaggle.com/nigelcarpenter/recreating-the-products-by-hour-chart/)
> recreating the same chart in R.  Thanks Nigel!

# Introduction
On May 3, 2017, Instacart released its first public dataset, **"[The Instacart Online Grocery Shopping Dataset 2017](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)"**.  Amazing!  Over **3 million** Instacart grocery orders from more than **200,000** users!  Take a look at their blog [post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2) for details on this data science bonanza.

One thing that immediately caught my eye was the really cool chart about halfway through the article.  Here's the original:

![Original Instacart Chart][1]

It shows popular products purchased earliest in the day (green) and latest in the day (red).  It's funny to see that 24 of the 25 latest-ordered products are ice cream.

Let's see if we can recreate this chart!


  [1]: https://cdn-images-1.medium.com/max/1000/1*wKfV6OV-_1Ipwrl7AjjSuw.png

# 1. Import Libraries

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

pd.set_option('display.width', 1000)

# 2. Read Data
You can download the original data files [here](https://www.instacart.com/datasets/grocery-shopping-2017).

To keep things simple (and reduce memory usage) when reading the data files, I kept only the columns required for the chart.

In [None]:
path = '../input/'

csv_orders = os.path.join(path, 'orders.csv')
csv_order_products_prior = os.path.join(path, 'order_products__prior.csv')
csv_order_products_train = os.path.join(path, 'order_products__train.csv')
csv_products = os.path.join(path, 'products.csv')

df_orders = pd.read_csv(csv_orders, usecols=['order_id', 'eval_set', 'order_hour_of_day'])
df_order_products_prior = pd.read_csv(csv_order_products_prior, usecols=['order_id', 'product_id'])
df_order_products_train = pd.read_csv(csv_order_products_train, usecols=['order_id', 'product_id'])
df_products = pd.read_csv(csv_products, usecols=['product_id', 'product_name'])
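
As a quick illustration of what `usecols` does, here is a minimal sketch that reads a tiny in-memory CSV.  The sample rows are made up for the example (they are not from the Instacart files); only the column names mirror `orders.csv`.

```python
import io
import pandas as pd

# Hypothetical three-column CSV standing in for orders.csv
csv_text = io.StringIO(
    "order_id,user_id,order_hour_of_day\n"
    "1,10,8\n"
    "2,11,17\n"
)

# Only the listed columns are parsed and kept; user_id is skipped
df_small = pd.read_csv(csv_text, usecols=["order_id", "order_hour_of_day"])
print(list(df_small.columns))
```

Columns left out of `usecols` never make it into the DataFrame, which is where the memory savings come from.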

# 3. Clean & Organize
The dataset is split into multiple files (see the [data dictionary](https://gist.github.com/jeremystan/c3b39d947d9b88b3ccff3147dbcf6c6b) for file and content descriptions).  So first a little clean-up:

In [None]:
# remove any rows referring to the test set
df_orders = df_orders[df_orders.eval_set != 'test']

# drop the eval_set column
df_orders = df_orders.drop(['eval_set'], axis=1)

# concatenate the _prior and _train datasets
df_order_products = pd.concat([df_order_products_prior, df_order_products_train])

# expand every order_id with the list of product_ids in that order_id
df = df_orders.merge(df_order_products, on='order_id')
print(df.head())
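
To see what that merge does to the row count, here is a toy sketch with two made-up orders.  The values are hypothetical; only the column names match the real data.

```python
import pandas as pd

# Two orders, each with an hour of day
orders = pd.DataFrame({"order_id": [1, 2], "order_hour_of_day": [8, 17]})

# Order 1 contains two products, order 2 contains one
order_products = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 200, 100]})

# The default inner merge yields one row per (order, product) pair,
# repeating the order's hour for every product in that order
merged = orders.merge(order_products, on="order_id")
print(merged)
```

Each product row inherits the hour of its order, which is exactly the shape we need for counting products by hour.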

# 4. Reduce the Problem Size

There are almost 50,000 unique `product_id` values in our dataset.  We can shrink the dataset without affecting our chart by keeping only the most common products.  I originally chose to keep the top 5,000 products, but after seeing the original source code (see the [Resources](#Resources)) I changed it to the top 2,000.

In [None]:
## Keep only the top 2000 products
top_products = pd.DataFrame({'total_count': df.groupby('product_id').size()}).sort_values('total_count', ascending=False).reset_index()[:2000]
top_products = top_products.merge(df_products, on='product_id')
print(top_products.head())
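
Another way to build the same kind of top-N table is `value_counts()`, which already sorts by frequency.  This is only a sketch on made-up data, not the code used for the chart; `toy` and `top_n` are names I invented for the example.

```python
import pandas as pd

# Hypothetical purchases: product 7 bought 3 times, 3 twice, 9 once
toy = pd.DataFrame({"product_id": [7, 7, 7, 3, 3, 9]})

top_n = (
    toy["product_id"]
    .value_counts()                   # counts per product, most frequent first
    .head(2)                          # keep the top 2 (2000 in the real data)
    .rename_axis("product_id")        # name the index so reset_index labels it
    .reset_index(name="total_count")  # back to a two-column DataFrame
)
print(top_n)
```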

# 5. Product and Hour of Day Distribution
For each of the top 2,000 products, we need to calculate the mean hour of the day at which that product was purchased.  In other words, we need to figure out at what time each line in the chart peaks.  From there, we can find the products with the earliest and latest "peaks" for our chart.

In [None]:
# keep only observations that have products in top_products
df = df.loc[df['product_id'].isin(top_products.product_id)]

For each `product_id`, count how many orders were placed at each hour and what percentage of that product's orders this count represents.

In [None]:
product_orders_by_hour = pd.DataFrame({'count': df.groupby(['product_id', 'order_hour_of_day']).size()}).reset_index()
# transform keeps the original row index, so the result aligns with the frame on assignment
product_orders_by_hour['pct'] = product_orders_by_hour.groupby('product_id')['count'].transform(lambda x: x/x.sum()*100)
print(product_orders_by_hour.head(24))
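
Here is a toy version of the percentage step, assuming a hypothetical table with two products.  It uses `transform("sum")` to broadcast each product's total back onto its rows.

```python
import pandas as pd

# Made-up hourly counts for two products
toy = pd.DataFrame({
    "product_id": [1, 1, 2, 2, 2],
    "order_hour_of_day": [8, 9, 8, 9, 10],
    "count": [3, 1, 2, 2, 4],
})

# Per-product totals, repeated on every row of that product
product_totals = toy.groupby("product_id")["count"].transform("sum")

# Share of each product's orders that fall in each hour, as a percentage
toy["pct"] = toy["count"] / product_totals * 100
print(toy)
```

Within each product the `pct` values add up to 100, which is what lets us compare products with very different order volumes on one axis.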

Finally, we calculate the mean hour for each product:

In [None]:
mean_hour = pd.DataFrame({'mean_hour': product_orders_by_hour.groupby('product_id').apply(lambda x: sum(x['order_hour_of_day'] * x['count'])/sum(x['count']))}).reset_index()
print(mean_hour.head())
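
The mean hour is just a count-weighted average of the hours.  Here is that arithmetic in plain Python for a single hypothetical product (the counts are made up):

```python
# Hypothetical hourly order counts for one product: {hour: number of orders}
counts_by_hour = {8: 3, 9: 1, 20: 4}

total_orders = sum(counts_by_hour.values())

# Weight each hour by how many orders were placed in it
weighted_hours = sum(hour * count for hour, count in counts_by_hour.items())

mean_hour_example = weighted_hours / total_orders
print(mean_hour_example)  # 14.125
```

Notice that this made-up product has orders at 8, 9 and 20 but a mean hour just after 14, so the mean hour is best read as a rough proxy for where a line peaks rather than the peak itself.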

# 6. Morning and Afternoon Products

With these calculations done, we are ready to find out which products belong in our chart.

In [None]:
morning = mean_hour.sort_values('mean_hour')[:25]
morning = morning.merge(df_products, on='product_id')
print(morning.head())

In [None]:
afternoon = mean_hour.sort_values('mean_hour', ascending=False)[:25]
afternoon = afternoon.merge(df_products, on='product_id')
print(afternoon.head())

Fantastic!  Our morning and afternoon product lists match the Instacart chart.
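
As an aside, pandas also has `nsmallest` and `nlargest`, which pick the extremes without a separate sort and slice.  A minimal sketch on a made-up table (`toy_mean_hour` is a hypothetical stand-in for `mean_hour`):

```python
import pandas as pd

# Hypothetical mean hours for four products
toy_mean_hour = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "mean_hour": [11.2, 14.9, 10.7, 15.3],
})

# Two earliest and two latest products (25 each in the real chart)
earliest = toy_mean_hour.nsmallest(2, "mean_hour")
latest = toy_mean_hour.nlargest(2, "mean_hour")

print(earliest)
print(latest)
```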

# 7. Plotting
Let's break our `product_orders_by_hour` into morning and afternoon groups for ease of plotting.

In [None]:
morning_pct = product_orders_by_hour.merge(morning, on='product_id').sort_values(['mean_hour', 'order_hour_of_day'])
afternoon_pct = product_orders_by_hour.merge(afternoon, on='product_id').sort_values(['mean_hour', 'order_hour_of_day'], ascending=False)

Next we'll get the names of each product in the morning and afternoon groups in order to recreate the product list in the original chart.  For some reason I had to change one of the product names to match the name in the Instacart chart.

In [None]:
# get list of morning and afternoon product names
morning_product_names = list(morning_pct['product_name'].unique())
morning_product_names = '\n'.join(morning_product_names)
afternoon_product_names = list(afternoon_pct['product_name'].unique())
afternoon_product_names = '\n'.join(afternoon_product_names)

# hack to remove 'Variety Pack' from Orange & Lemon Flavor Variety Pack Sparkling Fruit Beverage
morning_product_names = morning_product_names.replace('Variety Pack ', '')

# 8. Final Result

In [None]:
# Figure Size
fig, ax = plt.subplots(figsize=(12, 8))

# Plot
morning_pct.groupby('product_id').plot(x='order_hour_of_day', 
                                       y='pct', 
                                       ax=ax, 
                                       legend=False,
                                       alpha=0.2,
                                       aa=True,
                                       color='darkgreen',
                                       linewidth=1.5,)
afternoon_pct.groupby('product_id').plot(x='order_hour_of_day', 
                                         y='pct', 
                                         ax=ax, 
                                         legend=False,
                                         alpha=0.2,
                                         aa=True,
                                         color='red',
                                         linewidth=1.5,)

# Aesthetics
# Margins
plt.margins(x=0.5, y=0.05)

# Hide spines
for spine in ax.spines.values():
    spine.set_visible(False)

# Labels
label_font_size = 14
plt.xlabel('Hour of Day Ordered', fontsize=label_font_size)
plt.ylabel('Percent of Orders by Product', fontsize=label_font_size)

# Tick Range
tick_font_size = 10
ax.tick_params(labelsize=tick_font_size)
plt.xticks(range(0, 25, 2))
plt.yticks(range(0, 16, 5))
plt.xlim([-2, 28])

# Vertical line at noon
plt.vlines(x=12, ymin=0, ymax=15, alpha=0.5, color='gray', linestyle='dashed', linewidth=1.0)

# Text
text_font_size = 8
ax.text(0.01, 0.95, morning_product_names,
        verticalalignment='top', horizontalalignment='left',
        transform=ax.transAxes,
        color='darkgreen', fontsize=text_font_size)
ax.text(0.99, 0.95, afternoon_product_names,
        verticalalignment='top', horizontalalignment='right',
        transform=ax.transAxes,
        color='darkred', fontsize=text_font_size);

# Conclusion
In recreating the Instacart chart, I had a lot of fun and learned a bunch about `matplotlib`, and I hope you did too.  Many thanks to **Instacart** for releasing this dataset and to **Nigel Carpenter** for sharing his recreation of this chart.  It's nice to see others' workflows, and I learned a bit of **R** too.  I find I quite like the `magrittr %>%` style of writing.

Until next time, keep exploring!

### Resources
- [The Instacart Online Grocery Shopping Dataset 2017](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2) blog post
- Instacart's Jeremy Stanley's [source code](https://gist.github.com/jeremystan/b3be353189dd0a8053e4a4b36991694a)
- Nigel Carpenter's Kaggle [kernel](https://www.kaggle.com/nigelcarpenter/recreating-the-products-by-hour-chart) recreating the chart in R

### Citation
> "The Instacart Online Grocery Shopping Dataset 2017", Accessed from [https://www.instacart.com/datasets/grocery-shopping-2017](https://www.instacart.com/datasets/grocery-shopping-2017) on May 17, 2017