

This is a simple exploratory study - Instacart Market Basket Analysis


Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.    

Association Rules are widely used to analyze retail basket or transaction data. The goal is to find strong rules in the transactions, where "strong" is judged by measures of interestingness such as support, confidence, and lift.  
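To make "items that occur together frequently" concrete, here is a minimal sketch that counts how often each pair of items shares a basket. The baskets are hypothetical toy data (not taken from the Instacart files), and the names `baskets`, `pair_counts` and `pair_support` are only for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy baskets (hypothetical data, not taken from the Instacart files)
baskets = [
    {"milk", "butter", "bread"},
    {"milk", "butter"},
    {"bread", "jam"},
    {"milk", "bread"},
]

# Count how often each unordered pair of items appears in the same basket
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = share of baskets that contain both items
pair_support = {pair: n / len(baskets) for pair, n in pair_counts.items()}
for pair, support in sorted(pair_support.items(), key=lambda kv: -kv[1]):
    print(pair, support)
```

Pairs with high support are the candidates from which association rules are built. On millions of baskets a dedicated algorithm such as Apriori would normally be used instead of this brute-force count.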

An example of Association Rules  

Assume there are 100 customers.  
10 of them bought milk, 8 bought butter, and 6 bought both.  
Rule: bought milk => bought butter  
support = P(Milk & Butter) = 6/100 = 0.06  
confidence = support/P(Milk) = 0.06/0.10 = 0.60  
lift = confidence/P(Butter) = 0.60/0.08 = 7.5  
Note: this example is extremely small. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.  
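The three measures can be computed straight from the counts. Below is a minimal sketch for the rule milk => butter, where confidence divides by the count of the antecedent (milk) and lift compares the rule against what independence would predict. The function name `rule_metrics` and its parameters are hypothetical, chosen only for this illustration.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence and lift for the rule antecedent => consequent."""
    # Share of all transactions that contain both items
    support = n_both / n_total
    # Of the transactions with the antecedent, the share that also has the consequent
    confidence = n_both / n_antecedent
    # confidence / P(consequent), written with counts to avoid rounding noise
    lift = (n_both * n_total) / (n_antecedent * n_consequent)
    return support, confidence, lift


# 100 customers: 10 bought milk, 8 bought butter, 6 bought both
support, confidence, lift = rule_metrics(100, 10, 8, 6)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the two items are bought together more often than they would be if the purchases were independent.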

For more information, kindly take a look at this article, which explains the technique using a dataset from the UCI Machine Learning Repository: [link](https://towardsdatascience.com/a-gentle-introduction-on-market-basket-analysis-association-rules-fa4b986a40ce)

Those who do not have much hands-on experience in Python can work through this amazing tutorial to understand Association Rules and Market Basket Analysis in XLMiner: [link](https://www.analyticsvidhya.com/blog/2014/08/effective-cross-selling-market-basket-analysis/)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
color = sns.color_palette()
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

Typically, most Kaggle problems give us 1 or 2 files to deal with. This problem, however, has multiple files associated with it, and it took me a few minutes to understand it thoroughly. I will break the problem down piece by piece and try to explain it in the most comprehensive and easy-to-follow manner.   

Kaggle states the problem as follows:

Instacart’s data science team plays a big part in providing this delightful shopping experience. Currently they use transactional data to develop models that predict which products a user will buy again, try for the first time, or add to their cart next during a session. Recently, Instacart open sourced this data - see their blog post on 3 Million Instacart Orders, Open Sourced.  

In this competition, Instacart is challenging the Kaggle community to use this anonymized data on customer orders over time to predict which previously purchased products will be in a user’s next order. 

This can be simplified as follows: if a customer named "Adam" has bought the product "Apple" in his previous orders, what are the chances that he buys "Apple" again in his next order? For every product Adam has bought before, we predict a decision (buy / don't buy) for his next order, based on his previous buying behaviour / transactions. 
 

In [None]:
#Reading the files
orders = pd.read_csv("../input/orders.csv")
departments = pd.read_csv("../input/departments.csv")
products = pd.read_csv("../input/products.csv")
order_prod_train = pd.read_csv("../input/order_products__train.csv")
order_prod_prior = pd.read_csv("../input/order_products__prior.csv")
aisles = pd.read_csv("../input/aisles.csv")


I would like to explain the relationship between the files and what the different columns mean. A good knowledge of RDBMS concepts helps here. First, I would like to explain the files "orders.csv", "order_products__train.csv", and "order_products__prior.csv".

Basically, these transactions are structured as follows:  

*  A customer has a unique customer ID, and each customer can have multiple orders, each with its own unique order ID.

* An order may contain one or more products, and each product has its own unique product ID. A product can also be bought many times by many customers. Hence, a product ID can occur in multiple orders under multiple customers. Hope this makes sense!
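The two bullet points above can be sketched with a pair of tiny tables. The frames `orders_toy` and `order_products_toy` are hypothetical miniatures that only borrow the column names of the real files.

```python
import pandas as pd

# Hypothetical miniature versions of orders.csv and order_products__*.csv
orders_toy = pd.DataFrame({
    "order_id": [11, 12, 21],
    "user_id": [1, 1, 2],
})
order_products_toy = pd.DataFrame({
    "order_id": [11, 11, 12, 21, 21],
    "product_id": [100, 200, 100, 100, 300],
})

# One user -> many orders
orders_per_user = orders_toy.groupby("user_id")["order_id"].nunique()

# One order -> many products; one product -> many orders and many users
lines = order_products_toy.merge(orders_toy, on="order_id", how="left")
product_reach = lines.groupby("product_id").agg(
    n_orders=("order_id", "nunique"),
    n_users=("user_id", "nunique"),
)

print(orders_per_user)
print(product_reach)
```

Here product 100 shows up in three orders placed by two different users, which is exactly the many-to-many relationship between orders and products.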


In [None]:
datafiles = [orders, departments, products, order_prod_train, order_prod_prior, aisles]

for var in datafiles:
    var.info()  # info() prints its summary itself and returns None
    print("-"*100)
    




Let's take a look at orders.csv (already loaded into the dataframe "orders")

In [None]:
orders.head(10)

As you can see, orders.csv contains all the data pertaining to an order: the user who placed it (user_id), the day of the week it was placed (order_dow), the hour of the day (order_hour_of_day), the evaluation set (eval_set), and the days since the user's previous order (days_since_prior_order). By now you should have noticed that order_id is not sorted in any numerical order. Instead, the rows are sorted by user_id (customer) and, within each user, by order_number.

ORDER_NUMBER COLUMN:-
It is easy to confuse order_id with order_number. order_id identifies an order uniquely across the whole dataset, while order_number is the position of that order in a user's purchase history (1 for the first order, 2 for the second, and so on). The Kaggle data description below gives the background:-

*"The dataset for this competition is a relational set of files describing customers' orders over time. The goal of the competition is to predict which products will be in a user's next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between **4 and 100 of their orders**, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders. For more information, see the blog post accompanying its public release."*

This might be a bit confusing. What does "4 and 100 of their orders" mean? Let's put it this way!

Just for the sake of explanation, I am assuming that Instacart has only 100 customers (**ASSUMPTION**). Out of these 100 customers, *25 customers* might have made **4 orders** in total throughout their entire shopping history with Instacart. Another *25 customers* might have made **15 orders** in total, another *25 customers* might have made **60 orders**, and the last *25 customers* might have made **100 orders** in total.  

Here the minimum is 4 and the maximum is 100. Now we can translate this to our real dataset. 
In our dataset, no Instacart customer has made fewer than 4 orders or more than 100 orders throughout their shopping history with Instacart. Let's verify if that's true with our dataset.
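The check itself is short in pandas. Here is a minimal sketch on hypothetical toy data; the helper name `order_count_range` is made up for this illustration, and it assumes that order_number runs 1, 2, 3, ... for each user, so that a user's maximum order_number equals their total number of orders.

```python
import pandas as pd


def order_count_range(orders_df):
    """Return (min, max) of the number of orders per user.

    Assumes order_number counts a user's orders 1, 2, 3, ... so the
    per-user maximum equals that user's total number of orders.
    """
    per_user = orders_df.groupby("user_id")["order_number"].max()
    return int(per_user.min()), int(per_user.max())


# Toy data: user 1 placed 4 orders, user 2 placed 6 orders
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "order_number": [1, 2, 3, 4, 1, 2, 3, 4, 5, 6],
})
print(order_count_range(toy))
```

Calling the same helper on the real `orders` dataframe should return the two limits quoted in the data description if the statement holds.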

EVAL_SET COLUMN:

eval_set has 3 categorical values: prior, train, and test. The latest (last) order of every customer is assigned to either train or test, and all earlier orders are prior. See the example below for clarity.

In [None]:
orders.head(40)

Kindly note the row indices 10, 25, and 38. These 3 rows belong to 3 different user IDs (customers), and each one is that user's last (latest) order. All the other rows for a user go into the prior dataset.
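Rather than checking rows by eye, the split rule can be tested programmatically. This is a minimal sketch on a hypothetical toy frame that mimics the relevant columns of orders.csv; the names `toy_orders` and `split_is_consistent` are made up for this illustration.

```python
import pandas as pd

# Hypothetical toy frame with the same columns as orders.csv
toy_orders = pd.DataFrame({
    "order_id": [5, 9, 3, 7, 2, 8],
    "user_id": [1, 1, 1, 2, 2, 2],
    "eval_set": ["prior", "prior", "train", "prior", "prior", "test"],
    "order_number": [1, 2, 3, 1, 2, 3],
})

# Highest order_number per user, broadcast back onto every row of that user
last_number = toy_orders.groupby("user_id")["order_number"].transform("max")
is_last = toy_orders["order_number"] == last_number

# Every non-prior row should be a last order, and every last order non-prior
non_prior = toy_orders["eval_set"] != "prior"
split_is_consistent = bool((is_last == non_prior).all())
print(split_is_consistent)
```

The same comparison can be run on the real `orders` dataframe to confirm that only the last order of each user lands in train or test.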

Now, you would have guessed what the "order_products__train.csv" and "order_products__prior.csv" would contain. 

In [None]:
#order_products__prior.csv contains all the prior data
order_prod_prior.head()

In [None]:
#order_products__train.csv contains all the train data
order_prod_train.head()

**Caution: Don't get confused with the 3 previous generated tables (notice the difference between order_id and user_id).**

I am going to use one of the most powerful yet simple methods in pandas: DataFrame.groupby(). To learn more about groupby and its related methods, kindly look into the documentation: [link](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) 

In [None]:
from IPython.utils.text import columnize

# Grouping by one factor
df_id = orders.groupby('user_id')

# Collect the public callable attributes (methods) of the groupby object
attr = [method for method in dir(df_id)
        if not method.startswith('_') and callable(getattr(df_id, method))]

# Printing the result in columns
print(columnize(attr))

There are 59 methods that you could call on a groupby object (the exact count depends on the pandas version)! I am super excited to get started! Let's dive deep!

We will first verify Kaggle's statement that the number of orders per user ranges from 4 to 100.

In [None]:
# Group by "user_id" and take the max of "order_number" for each user,
# which equals the total number of orders that user has placed.
cap = orders.groupby("user_id")["order_number"].max()
cap = cap.value_counts()
sns.set_style("whitegrid")
plt.figure(figsize=(15,12))
sns.barplot(x=cap.index, y=cap.values)
plt.ylabel('Frequency', fontsize=12)
plt.xlabel('Max order number', fontsize=12)
plt.xticks(rotation= 90)
plt.show()

The distribution is right-skewed: the number of customers drops as the maximum order number increases.

In [None]:
# Let's check how many customers there are in total
print("There are {} customers".format(orders["user_id"].nunique()))

In [None]:
cap = orders.groupby("eval_set").size()
plt.figure(figsize = (10,7))
sns.barplot(x=cap.index, y=cap.values, palette = "coolwarm")
plt.ylabel("No of Orders", fontsize = 14)
plt.xlabel("Dataset",fontsize = 14)
plt.title("Number of orders in each dataset", fontsize = 16)
plt.show()

Notice that the prior dataset has the most orders. This is because the partition is done in such a way that only the last order of each customer goes into train or test, whereas all the other orders go into the prior dataset.  

Let's start exploring customer buying patterns with respect to time and day.

<font size = "5"> **Day of the week - Shopping Behaviour** </font>

In [None]:
cap = orders.groupby("order_dow")["order_id"].size()
plt.figure(figsize=(10,8))
ax = sns.barplot(x=cap.index, y=cap.values, alpha=0.8, color=color[9])
ax.set_xlabel('Day of the week', fontsize = 10)
ax.set_ylabel('Total number of orders placed during the day',fontsize = 10)
# Assumption: order_dow 0 = Saturday; the dataset does not document the mapping
ax.set_xticklabels(["Sat", "Sun", "Mon","Tue","Wed","Thu","Fri"], fontsize=10)
plt.show()

<font size = "3">On Saturdays and Sundays, the volume of orders placed is considerably higher than on other days, with the volume on Wednesdays being the lowest!</font>

<font size = "5"> **Hour of the Day Break down - Order Pattern** </font>

In [None]:
cap = orders.groupby("order_hour_of_day")["order_id"].size()
plt.figure(figsize=(10,8))
ax = sns.barplot(x=cap.index, y=cap.values, alpha=0.8, color=color[0])
ax.set_xlabel('Hour of the Day', fontsize = 12)
ax.set_ylabel('Hourly total number of orders', fontsize = 12)
ax.set_title('Hour of the day - Order Pattern', fontsize = 14)
plt.show()

<font size = "4"> **Days since last order** - Let's visualize the ordering pattern of customers </font>

In [None]:
plt.figure(figsize = (12,9))
sns.countplot(x="days_since_prior_order", data = orders, color = color[2])
plt.ylabel('Frequency', fontsize=14)
plt.xlabel('Days since prior order', fontsize=14)
plt.xticks(rotation = 90)
plt.title("Order Patterns of customers", fontsize=15)
plt.show()

One of the best skills that anyone can gain from a profound interest in data visualization is the art of storytelling with data. In the above chart, the overall trend goes downward as the number of days increases. However, the bars at 7.0, 14.0, 21.0, and 28.0 show a sudden surge in the frequency of orders. This is easy to interpret: most people replenish their stock once a week, once every 2 weeks, or once a month (in this case, notice the bar at 30.0 days). Note that days_since_prior_order is capped at 30 in this dataset, so the bar at 30.0 also collects every longer gap.

However, one dominant insight that we can grab from the chart is that ***most customers place their orders once every 7 days or once every 30 days.***

<font size = "5">Hour of the day Vs Day of the week HEATMAP</font>

Though we have generated separate charts for the hour of the day and the day of the week, it would be interesting to know how the volume of orders moves across the hours throughout the week.

In [None]:
df = orders.groupby(["order_dow", "order_hour_of_day"])["order_number"].agg("count").reset_index(name = "count")
# Rows = day of week, columns = hour of day, cells = number of orders
pivoted_df = df.pivot(index='order_dow', columns='order_hour_of_day', values='count')
plt.figure(figsize=(14,8))
sns.heatmap(pivoted_df)
plt.title("Frequency of Orders - Day of week Vs Hour of day")
plt.ylabel("Day of the week", fontsize = 14)
plt.xlabel("Hour of the Day", fontsize = 14)
plt.show()

Wow!!! There is so much traffic, with a huge number of orders being placed from Saturday noon to Saturday evening and on Sunday mornings! It seems most people like to shop on weekends after tiring weekdays.

To continue our exploratory journey with the other 3 data files, it is very useful to see the whole picture rather than treating train and prior separately. Hence, to facilitate the EDA, I am going to combine the "order_products__prior.csv" and "order_products__train.csv" data files into one single dataframe.

In [None]:
concat = pd.concat([order_prod_prior, order_prod_train])

In [None]:
#checking and validating the concatenation of the two dataset
len(concat)

In [None]:
len(order_prod_prior) + len(order_prod_train)

Let's now go ahead with merging the data files products.csv, aisles.csv, and departments.csv with the concatenated dataset (order_train and order_prior). There are various types of joins, such as inner join, outer join, left join, and right join. We have to be wary while making the join, as some rows may get dropped (for example, an inner join keeps only the rows whose keys are present in both datasets). Here, I am going to merge with a left join, as we want to retain all the rows in concat and simply fill in the information from the other datasets. For more information on joins, kindly look into this: [http://www.sql-join.com/sql-join-types](http://www.sql-join.com/sql-join-types)
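The difference between an inner and a left join is easiest to see on a tiny example. Below is a minimal sketch with hypothetical toy frames (`lines` and `catalog` are made-up names), where one product ID is deliberately missing from the catalog. `indicator=True` is a standard `merge` option that adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Three purchased line items; product 99 does not exist in the catalog
lines = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [10, 20, 99]})
catalog = pd.DataFrame({"product_id": [10, 20],
                        "product_name": ["Banana", "Milk"]})

# Inner join: the unmatched line item is silently dropped
inner = lines.merge(catalog, on="product_id", how="inner")

# Left join: every line item is kept, unmatched ones get NaN for product_name
left = lines.merge(catalog, on="product_id", how="left", indicator=True)

print(len(lines), len(inner), len(left))
print(left[left["_merge"] == "left_only"])
```

Comparing the row counts before and after a merge, as done here, is a cheap way to confirm that a join has not lost any rows.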

In [None]:
#left joining the dataset so that none of the rows in the concat dataset gets deleted or ignored during the join
concat = pd.merge(concat, products, on='product_id', how='left')
concat = pd.merge(concat, aisles, on='aisle_id', how='left')
concat = pd.merge(concat, departments, on='department_id', how='left')

Now we have two solid and complete datasets to work with, and we can clearly see how the datasets and the variables within them are connected to each other.

In [None]:
orders.head()

In [None]:
concat.head()

First, we will find out which products customers buy and replenish most frequently.

In [None]:
cap = concat.groupby("product_name").size().sort_values(ascending = False)[:20]
print(cap)

In [None]:
plt.figure(figsize= (14,8))
sns.barplot(x=cap.index, y=cap.values, color=color[1])
plt.ylabel("Frequency", fontsize = 12)
plt.xlabel("Products", fontsize = 12)
plt.xticks(rotation = 90, fontsize = 12)
plt.show()

The most frequently purchased products are perishables (fruits, vegetables, dairy products).  
As we can expect from the above chart, the most visited aisles and departments should be the ones with perishables.

In [None]:
cap = concat.groupby("aisle").size().sort_values(ascending = False)[:20]
print(cap)

In [None]:
plt.figure(figsize= (14,8))
sns.barplot(x=cap.index, y=cap.values, color=color[3])
plt.ylabel("Frequency", fontsize = 12)
plt.xlabel("Aisles", fontsize = 12)
plt.xticks(rotation = 90, fontsize = 12)
plt.show()

In [None]:
cap = concat.groupby("department").size().sort_values(ascending = False)[:20]
print(cap)

In [None]:
plt.figure(figsize= (14,8))
sns.barplot(x=cap.index, y=cap.values, color=color[8])
plt.ylabel("Frequency", fontsize = 12)
plt.xlabel("Departments", fontsize = 12)
plt.xticks(rotation = 90,fontsize = 12)
plt.show()

Next, it would be great to see which department dominates Instacart by variety of products! 

In [None]:
concat.head()

In [None]:
# Attach department and aisle names to every product
items = pd.merge(left=pd.merge(left=products, right=departments, on='department_id', how='left'),
                 right=aisles, on='aisle_id', how='left')
# Number of distinct products listed in each department
group_val = items.groupby("department")["product_id"].count().sort_values(ascending = False)
plt.figure(figsize=(12,12))
labels = (np.array(group_val.index))
sizes = (np.array((group_val / group_val.sum())*100))
plt.pie(sizes, labels=labels, 
        autopct='%1.1f%%', startangle=200)
plt.title("Departments distribution", fontsize=15)
plt.show()

Personal care is the dominating department, with a wide range of cosmetic products stocked in it! 

In most product-based or service-based organizations, the most priority and resources go to the most revenue-generating products or lines of service. Though we don't know the prices and profit margins of these products, we assume that revenue goes up as the number of products sold goes up.

In [None]:
users_flow = orders[['user_id', 'order_id']].merge(concat[['order_id', 'product_id']],how='inner', left_on='order_id', right_on='order_id')
users_flow = users_flow.merge(items, how='inner', left_on='product_id',right_on='product_id')
grouped = users_flow.groupby("department")["order_id"].count().reset_index(name = "Total_orders")


In [None]:
grouped = grouped.sort_values(by = "Total_orders", ascending=False).reset_index(drop = True)
plt.figure(figsize=(12,8))
sns.pointplot(x=grouped['department'], y=grouped['Total_orders'].values, color=color[0])
# Total_orders counts purchased line items, used here as a proxy for sales
plt.ylabel('Sales', fontsize=12)
plt.xlabel('Department', fontsize=12)
plt.title("Departmentwise Sales", fontsize=15)
plt.xticks(rotation='vertical')
plt.show()

Food and related products are the main drivers of Instacart's sales volume. Without perishables, I suppose Instacart's success would perish!

In [None]:
orders.head()

In [None]:
concat.head()

<font size = "5">In Progress</font>

In [None]:
concat_orders = pd.merge(orders, concat, on = "order_id", how = "right")

In [None]:
concat_orders.head(20)