# Market Basket EDA (Exploratory Data Analysis)
-----
### Goal :
1. Given all these datasets, what information can we extract to understand our business better? This is covered in the 'Explore' step. [Reference](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart)
2. Given each customer's past orders, which products will be in their next order? [Reference](https://www.kaggle.com/uomislab/instacart-xgboost-gridsearch-notebook/)

### Why ?
1. To understand our customers better
2. Given a user_id, we can show these products on their 'recommendation page'
    - We already have a recommender in our company, so this second goal is not pursued here

In [None]:
# basic packages
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
pd.options.mode.chained_assignment = None  # default='warn'

# garbage collector to free up memory
import gc
gc.enable()

# remove warnings
import warnings
warnings.filterwarnings('ignore')


# 'ori_' prefix : the DataFrames as originally loaded, to be preserved untouched
ori_aisles = pd.read_csv('../input/d/psparks/instacart-market-basket-analysis/aisles.csv')
ori_dept = pd.read_csv('../input/d/psparks/instacart-market-basket-analysis/departments.csv')
ori_order_prod_prior = pd.read_csv('../input/d/psparks/instacart-market-basket-analysis/order_products__prior.csv')
ori_order_prod_train = pd.read_csv('../input/d/psparks/instacart-market-basket-analysis/order_products__train.csv')
ori_orders = pd.read_csv('../input/d/psparks/instacart-market-basket-analysis/orders.csv')
ori_products = pd.read_csv('../input/d/psparks/instacart-market-basket-analysis/products.csv')
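
Since the cell above is already concerned with memory (garbage collector), here is a minimal sketch of loading with explicit dtypes. The column names follow the order_products files, but the toy rows and the chosen integer widths are assumptions; check the real value ranges before reusing them.

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory stand-in for order_products__prior.csv (toy rows, not real data)
csv_text = """order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
3,33754,1,0
"""

# Assumed dtypes: narrow integers are wide enough for these toy values only
dtypes = {
    "order_id": np.int32,
    "product_id": np.int32,
    "add_to_cart_order": np.int16,
    "reordered": np.int8,
}

default_df = pd.read_csv(io.StringIO(csv_text))                # pandas picks int64
compact_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)  # explicit narrow ints

# Compare the bytes used by the column data (index excluded)
default_bytes = default_df.memory_usage(index=False).sum()
compact_bytes = compact_df.memory_usage(index=False).sum()
print(default_bytes, compact_bytes)
```

On the largest table (order_prod_prior) the same idea could reduce the footprint noticeably, at the cost of having to verify that no column overflows its dtype.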

# Obtain
* Data provided by Kaggle

In [None]:
# working names for the originals (plain assignment only aliases the same DataFrame; use .copy() before mutating in place)
aisles = ori_aisles
depts = ori_dept
order_prod_prior = ori_order_prod_prior
order_prod_train = ori_order_prod_train
orders = ori_orders
products = ori_products

# get shape of each df
print(f" aisles : {aisles.shape} \n depts : {depts.shape} \n order_prod_prior : {order_prod_prior.shape} \n order_prod_train : {order_prod_train.shape} \n orders : {orders.shape} \n products : {products.shape}")

# Scrub
* assumed clean
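
The "assumed clean" step can be backed by a quick check. The sketch below is one possible sanity report (missing values and duplicate keys); the helper name `quick_quality_report` and the toy rows are hypothetical, not part of the original notebook.

```python
import pandas as pd


def quick_quality_report(df: pd.DataFrame, key_cols: list) -> dict:
    """Summarise row count, missing values and duplicate keys for one table."""
    return {
        "rows": len(df),
        "nulls_per_column": df.isna().sum().to_dict(),
        "duplicate_keys": int(df.duplicated(subset=key_cols).sum()),
    }


# Toy stand-in for orders.csv (made-up values). A user's first order has no
# prior order, so days_since_prior_order is expected to be missing there.
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 3],
    "user_id": [10, 10, 11, 11],
    "order_number": [1, 2, 1, 1],
    "days_since_prior_order": [None, 7.0, None, None],
})

report = quick_quality_report(toy_orders, key_cols=["order_id"])
print(report)
```

Running the same helper on `orders` with `key_cols=["order_id"]` would show whether the only missing values are the expected first-order gaps.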

# Explore
1. On which day & at which hour do.....
     * customers purchase our products the most?
     * customers have the highest AVG REORDERED (share of purchased products that are reorders)?

2. After how many days do customers usually come back and buy from us again? (And out of everything they order, what share is reorders?)

3. How many products are there in a single order?
4. Which products....
    * do customers purchase the most?
    * have the highest AVG REORDERED?
5. Which aisles....
    * do customers purchase from the most?
    * have the highest AVG REORDERED?
6. Which depts....
    * do customers purchase from the most?
    * have the highest AVG REORDERED?

## #1

In [None]:
# On which day & at which hour do customers purchase our products the most?


# group data by order_dow, order_hour_of_day, and count order_number
grouped_df = orders.groupby(['order_dow', 'order_hour_of_day'])['order_number'].aggregate('count').reset_index()
grouped_df.head(3)

# pivot into a day-of-week x hour-of-day grid, the format the heatmap expects (pandas 2.0+ requires keyword arguments here)
grouped_df = grouped_df.pivot(index='order_dow', columns='order_hour_of_day', values='order_number')
grouped_df.head(3)

# display result in heatmap 
plt.figure(figsize=(12,6))
sns.heatmap(grouped_df)
plt.title("Number of orders by day of week vs hour of day")
plt.show()


# Saturday 12:00-16:00 & Sunday 09:00-12:00 have the most orders
# (this reads order_dow 0 as Saturday and 1 as Sunday, which is an assumption)

In [None]:
# same procedure as count orders above, except we change 'count order_number' to 'mean reordered'



# merge customer past orders (order_prod_prior) & order in-depth details (orders)
orderprodprior_orders = pd.merge(order_prod_prior, orders, on='order_id', how='left')


# group data by order_dow, order_hour_of_day, and get average of reordered
grouped_df = orderprodprior_orders.groupby(['order_dow', 'order_hour_of_day'])['reordered'].aggregate('mean').reset_index()
grouped_df.head(3)

# pivot into a day-of-week x hour-of-day grid, the format the heatmap expects (pandas 2.0+ requires keyword arguments here)
grouped_df = grouped_df.pivot(index='order_dow', columns='order_hour_of_day', values='reordered')
grouped_df.head(3)

# display result in heatmap 
plt.figure(figsize=(12,6))
sns.heatmap(grouped_df)
plt.title("Average reorder ratio by day of week vs hour of day")
plt.show()


# highest on Sunday between 6am and 9am (nice)
# in general, on any day, highest between 5am and 9am
# interpretation : 0.66 means 66% of the products bought in that slot had been ordered before by the same customer
# (reordered is a per-product flag, so this is a share of purchased items, not a share of orders)
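
To make the meaning of the averaged `reordered` flag concrete, here is a small sketch on made-up line items. It uses the same groupby, mean and pivot steps as the cell above; the toy values are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy line items: one row per product in an order (values are made up)
toy_items = pd.DataFrame({
    "order_id":          [1, 1, 1, 2, 2, 3],
    "order_dow":         [0, 0, 0, 0, 0, 1],
    "order_hour_of_day": [9, 9, 9, 9, 9, 14],
    "reordered":         [1, 1, 0, 1, 0, 0],
})

# Mean of the 0/1 flag = share of line items that are reorders in each slot
ratio = (
    toy_items.groupby(["order_dow", "order_hour_of_day"])["reordered"]
    .mean()
    .reset_index()
)

# Keyword arguments keep the pivot call valid on recent pandas versions
heat = ratio.pivot(index="order_dow", columns="order_hour_of_day", values="reordered")
print(heat)
```

In the (0, 9) slot both orders contain reordered products, yet the ratio is 3 out of 5 line items, which is why the heatmap values should be read per product rather than per order. Slots with no purchases show up as NaN.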

## #2

In [None]:
# After how many days do customers usually come back and buy from us again? (And out of everything they order, what share is reorders?)

# days_since_prior_order column is given, hence just plot
plt.figure(figsize=(12,8))
sns.countplot(x='days_since_prior_order', data=orders, color=color[9])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Days since prior order', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency by days since last order")
plt.show()

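# Customers most often come back after 7 days or 30 days
# (days_since_prior_order appears to be capped at 30, so the 30-day bar likely also includes longer gaps)



print(order_prod_prior.reordered.sum() / order_prod_prior.shape[0])
# Out of all products ordered in the prior set, 58.97% are reorders (items the customer had bought before)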
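
The countplot can also be summarised numerically. The following sketch computes the share of each gap length on toy data (made-up values), dropping first orders, which have no previous order to measure from.

```python
import pandas as pd

# Toy stand-in for orders (made-up values); None marks a user's first order
toy_orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "days_since_prior_order": [None, 7, 7, None, 30, None, 7, 30],
})

gap_counts = (
    toy_orders["days_since_prior_order"]
    .dropna()          # first orders have no gap
    .astype(int)
    .value_counts()
    .sort_index()
)

# Share of returning orders at each gap length
gap_share = gap_counts / gap_counts.sum()
print(gap_share)
```

Applied to the real `orders` table, sorting `gap_share` in descending order would list the most common return intervals directly.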

## #3

In [None]:
# How many products are there in a single order?


# group by order (order_id) and take the max of add_to_cart_order
# (it is the position at which each product was added to the cart, so its max is the basket size)
grouped_df = order_prod_prior.groupby("order_id")["add_to_cart_order"].aggregate("max").reset_index()

# count how many orders have each basket size
cnt_srs = grouped_df.add_to_cart_order.value_counts()

# display result
plt.figure(figsize=(20,8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8)
plt.ylabel('Number of orders', fontsize=12)
plt.xlabel('Basket size (products per order)', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Basket size per order")
plt.show()


# Most orders contain up to about 10 products, and the most common basket size is 5. The count drops off steeply beyond 10
# same result too in training data
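
The remark that the training data gives the same result can be checked side by side. This is a minimal sketch with a hypothetical helper `basket_size_share` and toy tables; it counts rows per `order_id`, which equals the max of `add_to_cart_order` only under the assumption that cart positions run consecutively from 1.

```python
import pandas as pd


def basket_size_share(order_products: pd.DataFrame) -> pd.Series:
    """Share of orders at each basket size (rows per order_id)."""
    sizes = order_products.groupby("order_id").size()
    return sizes.value_counts(normalize=True).sort_index()


# Toy stand-ins for order_prod_prior and order_prod_train (made-up values)
toy_prior = pd.DataFrame({
    "order_id":   [1, 1, 2, 2, 2, 3, 3],
    "product_id": [5, 6, 5, 7, 8, 6, 9],
})
toy_train = pd.DataFrame({
    "order_id":   [10, 10, 11, 11, 11],
    "product_id": [5, 6, 5, 7, 8],
})

# One column per dataset, one row per basket size
comparison = pd.concat(
    {"prior": basket_size_share(toy_prior), "train": basket_size_share(toy_train)},
    axis=1,
).fillna(0.0)
print(comparison)
```

Using shares instead of raw counts keeps the two columns comparable even though the prior set is much larger than the training set.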

## #4

In [None]:
# Which products are bought the most?

# merge past orders, product details, aisles details, dept details to one dataframe
orderprodprior_products_aisles_depts = pd.merge(order_prod_prior, products, on='product_id', how='left')
orderprodprior_products_aisles_depts = pd.merge(orderprodprior_products_aisles_depts, aisles, on='aisle_id', how='left')
orderprodprior_products_aisles_depts = pd.merge(orderprodprior_products_aisles_depts, depts, on='department_id', how='left')
orderprodprior_products_aisles_depts.head(3)

# plot the 15 most purchased products
cnt_srs = orderprodprior_products_aisles_depts['product_name'].value_counts().head(15)
plt.figure(figsize=(12,8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[9])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Product Name', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Products that customers purchased the most")
plt.show()

# mostly fruits (bananas, strawberries) & vegetables (spinach, onions, zucchini)
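
The three left merges in this cell assume every `product_id`, `aisle_id` and `department_id` finds exactly one match. A minimal sketch of how that assumption could be verified with the `validate` and `indicator` options of `pd.merge` is shown here on toy tables (made-up ids and names).

```python
import pandas as pd

# Toy stand-ins for order_prod_prior and products (made-up values)
toy_items = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 101, 100]})
toy_products = pd.DataFrame({
    "product_id": [100, 101],
    "product_name": ["Banana", "Spinach"],
    "aisle_id": [24, 123],
})

merged = pd.merge(
    toy_items,
    toy_products,
    on="product_id",
    how="left",
    validate="many_to_one",  # raises MergeError if product_id repeats in toy_products
    indicator=True,          # adds a _merge column telling where each row matched
)

# Line items whose product_id was not found in the lookup table
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

If the row count after the merge equals the row count before it and `unmatched` is 0, the join neither duplicated nor orphaned any line item.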

In [None]:
# Which products have the highest AVG REORDERED (reorder ratio)?


# get avg reordered per product_name
grouped_df = orderprodprior_products_aisles_depts.groupby(["product_name"])["reordered"].aggregate("mean").reset_index()

# sort from highest to smallest, and get top15 only
grouped_df = grouped_df.sort_values('reordered', ascending=False).head(15)


plt.figure(figsize=(12,8))
sns.barplot(x=grouped_df['product_name'].values, y=grouped_df['reordered'].values, alpha=0.8, color=color[2])
plt.ylabel('Reorder ratio', fontsize=12)
plt.xlabel('Product Names', fontsize=12)
plt.title("Product-wise reorder ratio", fontsize=15)
plt.xticks(rotation='vertical')
plt.show()

# Completely different from the quantity ranking. The top products include vege wrappers, pads, energy shots and a chocolate bar. No fruits & veges at all in the top 15
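
A ranking by mean `reordered` alone may be dominated by products that were bought only a handful of times. The sketch below shows one way to guard against that with a minimum purchase count; the threshold `MIN_PURCHASES` and the toy rows are hypothetical, and this is not the method used in the cell above.

```python
import pandas as pd

# Toy line items (made-up values): a rarely bought product with a perfect ratio
toy = pd.DataFrame({
    "product_name": ["Banana"] * 5 + ["Rare Snack"] * 2 + ["Spinach"] * 4,
    "reordered":    [1, 1, 1, 1, 0] + [1, 1] + [1, 0, 1, 0],
})

MIN_PURCHASES = 4  # hypothetical threshold, to be tuned on the real data

# Reorder ratio and number of purchases per product
stats = toy.groupby("product_name")["reordered"].agg(
    reorder_ratio="mean", purchases="count"
)

# Without a filter, the rarely bought product ranks first
unfiltered_top = stats.sort_values("reorder_ratio", ascending=False).index[0]

# With the filter, only products with enough purchases are ranked
filtered = (
    stats[stats["purchases"] >= MIN_PURCHASES]
    .sort_values("reorder_ratio", ascending=False)
)
print(unfiltered_top)
print(filtered)
```

Comparing the filtered and unfiltered top 15 on the real data would show how much of the surprising ranking comes from low-volume products.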

## #5

In [None]:
# Which aisles do customers purchase from the most?
cnt_srs = orderprodprior_products_aisles_depts['aisle'].value_counts().head(15)
plt.figure(figsize=(12,8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[9])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Aisle', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Aisles that customers purchased from the most")
plt.show()

# As expected from #4, fruits & vege

In [None]:
# Which aisles have the highest AVG REORDERED (reorder ratio)?


# get avg reordered per aisle
grouped_df = orderprodprior_products_aisles_depts.groupby(["aisle"])["reordered"].aggregate("mean").reset_index()

# sort from highest to smallest, and get top15 only
grouped_df = grouped_df.sort_values('reordered', ascending=False).head(15)


plt.figure(figsize=(12,8))
sns.barplot(x=grouped_df['aisle'].values, y=grouped_df['reordered'].values, alpha=0.8, color=color[2])
plt.ylabel('Reorder ratio', fontsize=12)
plt.xlabel('Aisles', fontsize=12)
plt.title("Aisle-wise reorder ratio", fontsize=15)
plt.xticks(rotation='vertical')
plt.show()

# fruits & vege might be highest in qty, but reordering wise, milk & sparkling water are at the top (fruits are 3rd, vegetables aren't even in the top 15)

## #6

In [None]:
# Which depts do customers purchase from the most?
cnt_srs = orderprodprior_products_aisles_depts['department'].value_counts().head(15)
plt.figure(figsize=(12,8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[9])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Departments', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Depts that customers purchased from the most")
plt.show()

# Top 3 are produce, dairy eggs, snacks

In [None]:
# Which depts have the highest AVG REORDERED (reorder ratio)?

# get avg reordered per dept
grouped_df = orderprodprior_products_aisles_depts.groupby(["department"])["reordered"].aggregate("mean").reset_index()

# sort from highest to smallest, and get top15 only
grouped_df = grouped_df.sort_values('reordered', ascending=False).head(15)


plt.figure(figsize=(12,8))
sns.barplot(x=grouped_df['department'].values, y=grouped_df['reordered'].values, alpha=0.8, color=color[2])
plt.ylabel('Reorder ratio', fontsize=12)
plt.xlabel('Depts', fontsize=12)
plt.title("Dept-wise reorder ratio", fontsize=15)
plt.xticks(rotation='vertical')
plt.show()

# Top 3 are dairy eggs, beverages, produce
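
Sections #4 to #6 repeat the same groupby, sort and head steps for products, aisles and depts. A small helper could remove the repetition; the function name `top_reorder_ratio` and the toy rows below are hypothetical.

```python
import pandas as pd


def top_reorder_ratio(df: pd.DataFrame, by: str, n: int = 15) -> pd.DataFrame:
    """Top-n values of column `by`, ranked by mean of the reordered flag."""
    return (
        df.groupby(by)["reordered"]
        .mean()
        .reset_index()
        .sort_values("reordered", ascending=False)
        .head(n)
    )


# Toy stand-in for the merged line-item table (made-up values)
toy = pd.DataFrame({
    "aisle":      ["milk", "milk", "fresh fruits", "fresh fruits", "spices"],
    "department": ["dairy eggs", "dairy eggs", "produce", "produce", "pantry"],
    "reordered":  [1, 1, 1, 0, 0],
})

top_aisles = top_reorder_ratio(toy, "aisle", n=2)
top_depts = top_reorder_ratio(toy, "department", n=2)
print(top_aisles)
print(top_depts)
```

With the real merged table, `top_reorder_ratio(orderprodprior_products_aisles_depts, "aisle")` would return the same frame that the aisle cell builds by hand.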

# Model
* ignored, as we focus on data analysis only

# iNterpret

1. On which day & at which hour do.....
     * customers purchase our products the most?
         * Saturday 12:00-16:00 & Sunday 09:00-12:00 have the most orders

     * customers have the highest AVG REORDERED?
        * highest on Sunday between 6am and 9am (nice)
        * in general, on any day, highest between 5am and 9am
        * interpretation : 0.66 means 66% of the products bought in that slot are reorders (items the same customer had ordered before), not 66% of orders

2. After how many days do customers usually come back and buy from us again? (And out of everything they order, what share is reorders?)
    * Customers most often come back after 7 days or 30 days
    * Out of all products ordered, 58.97% are reorders (items the customer had bought before)

3. How many products are there in a single order?
    * Most orders contain up to about 10 products, and the most common basket size is 5. The count drops off steeply beyond 10
    
4. Which products....
    * do customers purchase the most?
        * Fruits (bananas, strawberries) & vegetables (spinach, onions, zucchini)

    * have the highest AVG REORDERED?
        * Completely different from the quantity ranking. The top products include vege wrappers, pads, energy shots and a chocolate bar. No fruits & veges at all in the top 15
        
5. Which aisles....
    * do customers purchase from the most?
        * Fruits & Vege

    * have the highest AVG REORDERED?
        * fruits & vege might be highest in qty, but reordering wise, milk & sparkling water are at the top (fruits are 3rd, vegetables aren't even in the top 15)
        
6. Which depts....
    * do customers purchase from the most?
        * Top 3 are produce, dairy eggs, snacks

    * have the highest AVG REORDERED?
        * Top 3 are dairy eggs, beverages, produce, quite similar to the quantity ranking

## Call-To-Action
The Call-To-Action part can be found in this [article](https://dwihdyn.github.io/journals/7-retail-eda.html)