# Data Exploration - Instacart
###### Problem Statement:
To find which products a consumer will reorder from Instacart (an online grocery store).

###### Data:
Over 3 million grocery orders from more than 200,000 Instacart users are provided.

###### List of files :
1. orders <br />
2. order_products_prior <br />
3. order_products_train <br />
4. products <br />
5. aisles <br />
6. departments <br />

Here, <br />
The prior data contains all orders for each user except their last order. <br />
The last order of each user is assigned to either the train or the test set. <br />
So, during training, X is built from the products in order_products_prior and y from the products in order_products_train.

Thus, we need to predict which products each consumer will reorder in their test-set order.
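
The prior/train/test split can be illustrated on a toy frame. This is only a sketch: it assumes `orders.csv` carries an `eval_set` column with the values `prior`, `train` and `test`, which is not shown in this notebook, and all rows below are made up.

```python
import pandas as pd

# Toy stand-in for orders.csv; the `eval_set` column and its values are assumptions
toy_orders = pd.DataFrame({
    'order_id':     [1, 2, 3, 4, 5, 6],
    'user_id':      [10, 10, 10, 20, 20, 20],
    'eval_set':     ['prior', 'prior', 'train', 'prior', 'prior', 'test'],
    'order_number': [1, 2, 3, 1, 2, 3],
})

# Number of orders in each split
split_counts = toy_orders['eval_set'].value_counts()

# The last order of each user is the row with the highest order_number
last_idx = toy_orders.groupby('user_id')['order_number'].idxmax()
last_orders = toy_orders.loc[last_idx]

print(split_counts)
print(last_orders[['user_id', 'order_id', 'eval_set']])
```

In this toy data every user's last order falls in either `train` or `test`, and all earlier orders are `prior`.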

***Let's import the necessary Python libraries***

In [None]:
import numpy as np #linear Algebra
import pandas as pd #Data Exploration 
import seaborn as sns #Data Visualization
import matplotlib.pyplot as plt #Data Visualization - base Library

In [None]:
#show graph outputs
%matplotlib inline 
sns.set_style("whitegrid") #seaborn Styling

***Let's import the data***

In [None]:
df_Orders = pd.read_csv('../input/orders.csv')
df_Prior = pd.read_csv('../input/order_products__prior.csv')
df_Train = pd.read_csv('../input/order_products__train.csv')
df_Products = pd.read_csv('../input/products.csv')
df_aisles = pd.read_csv('../input/aisles.csv')
df_dept = pd.read_csv('../input/departments.csv')

<i>All files are imported.</i>

In [None]:
df_Orders.info()

<b><i>So there are about 3.4 million orders</i></b>

In [None]:
df_Prior.info()

<b><i>df_Train has the same columns, but covers the last order of each user in the train set</i></b>

In [None]:
df_Prior_merge  = df_Orders.merge(df_Prior,on='order_id')

In [None]:
df_Prior_merge.head()

<i>Now we have every product in every order by every customer (in the hierarchy customer > order > product), along with day of week (dow), hour of day, days since the prior order, product ID and, importantly, whether the product was reordered.</i>
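
To make the shape of the merged frame concrete, here is a small sketch on made-up rows (the frames `toy_orders` and `toy_prior` are hypothetical stand-ins, not the real files). It shows that merging on `order_id` yields one row per product per order, and uses the `validate` argument of `merge` to check that `order_id` is unique on the orders side.

```python
import pandas as pd

# Hypothetical miniature versions of the orders and prior tables
toy_orders = pd.DataFrame({
    'order_id':  [1, 2],
    'user_id':   [10, 10],
    'order_dow': [0, 3],
})
toy_prior = pd.DataFrame({
    'order_id':   [1, 1, 2],
    'product_id': [100, 200, 100],
    'reordered':  [0, 0, 1],
})

# validate='one_to_many' raises MergeError if order_id repeats in toy_orders
toy_merged = toy_orders.merge(toy_prior, on='order_id', validate='one_to_many')
print(toy_merged)
```

The order-level columns (`user_id`, `order_dow`) are repeated on every product row, which matters for any later count taken on the merged frame.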

In [None]:
df_Prior_merge.nunique()

<b><i>There are about 200,000 (2 lakh) users</i></b>

***Let's find the total number of items ordered by each user***

In [None]:
df_Prior_merge.columns

In [None]:
# The merged frame has one row per product per order, so count() tallies
# items ordered per user (use .nunique() instead to count distinct orders)
order_count = df_Prior_merge.groupby('user_id')['order_id'].count()

In [None]:
order_count.sort_values(ascending=False).head()

In [None]:
plt.figure(figsize=(12,4))
sns.set_palette("viridis")
# histplot (seaborn >= 0.11) replaces the deprecated distplot(kde=False)
sns.histplot(order_count, bins=100)
plt.title("Total no. of items ordered by each user", fontsize=14)
plt.xlabel('items ordered', fontsize=12)
plt.ylabel('number of users', fontsize=12)
plt.show()

***Looks like most users ordered between 1 and 500 items***
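
Because the merged frame holds one row per product per order, `count()` on `order_id` tallies product rows, whereas `nunique()` tallies distinct orders. A minimal sketch on made-up data (the frame `toy_merged` is hypothetical) shows the difference:

```python
import pandas as pd

# Hypothetical merged frame: user 10 bought 3 items across 2 orders
toy_merged = pd.DataFrame({
    'user_id':    [10, 10, 10, 20],
    'order_id':   [1, 1, 2, 3],
    'product_id': [100, 200, 100, 300],
})

# Rows per user = items ordered
items_per_user = toy_merged.groupby('user_id')['order_id'].count()

# Distinct order_id values per user = orders placed
orders_per_user = toy_merged.groupby('user_id')['order_id'].nunique()

print(pd.DataFrame({'items': items_per_user, 'orders': orders_per_user}))
```

Which of the two is appropriate depends on whether the question is about basket volume or ordering frequency.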

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='order_number',data=df_Prior_merge)
plt.title("Count of each order_number", fontsize=14)
plt.xlabel('order_number', fontsize=12)
plt.ylabel('counts', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

In [None]:
df_Prior_merge.columns

***Now let's check day of week and hour of day***

In [None]:
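# Name both columns explicitly so the result does not depend on the pandas version
dow = df_Prior_merge['order_dow'].value_counts().rename_axis('order_dow').reset_index(name='counts')

In [None]:
plt.figure(figsize=(12,4))
sns.barplot(x='order_dow',y='counts',data=dow)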
plt.title("Total orders on day of week", fontsize=14)
plt.xlabel('day of week', fontsize=12)
plt.ylabel('counts', fontsize=12)
plt.show()

***Looks like most orders are placed on Sunday*** (assuming 0 = Sunday through 6 = Saturday)

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='order_hour_of_day',data=df_Prior_merge)
plt.title("Total order on hours of day", fontsize=14)
plt.xlabel('hours of day', fontsize=12)
plt.ylabel('counts', fontsize=12)
plt.show()

***Most orders are placed between 10 AM and 4 PM***

In [None]:
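# Count rows per (day of week, hour) and reshape to an hour x day matrix;
# pivot() accepts keyword arguments only in pandas 2.0+
df_time_matrix = (df_Prior_merge.groupby(['order_dow','order_hour_of_day'])['order_id']
                  .count().reset_index()
                  .pivot(index='order_hour_of_day', columns='order_dow', values='order_id'))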

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(df_time_matrix,cmap='viridis')
plt.title("Distribution of orders over the week", fontsize=14)
plt.xlabel('day of week', fontsize=12)
plt.ylabel('hour of day', fontsize=12)
plt.show()
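
The hour-by-day count matrix used for the heatmap can also be built in one step. The sketch below uses `pd.crosstab` on made-up rows (the frame `toy` is hypothetical); it is an alternative formulation, not the code this notebook runs.

```python
import pandas as pd

# Hypothetical rows: one per product ordered, with day of week and hour
toy = pd.DataFrame({
    'order_dow':         [0, 0, 1, 1, 1],
    'order_hour_of_day': [9, 10, 9, 9, 10],
})

# Rows are hours, columns are days, cells are row counts
toy_matrix = pd.crosstab(toy['order_hour_of_day'], toy['order_dow'])
print(toy_matrix)
```

Unlike a groupby followed by a pivot, `crosstab` fills hour/day combinations that never occur with 0 rather than NaN.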

Next, we look at the days_since_prior_order column.

In [None]:
prior_order_freq = df_Prior_merge['days_since_prior_order'].value_counts()

In [None]:
plt.figure(figsize=(13,6))
sns.barplot(x=prior_order_freq.index, y= prior_order_freq.values)
plt.title("Time interval between Orders", fontsize=14)
plt.xlabel('days since prior order', fontsize=12)
plt.ylabel('counts', fontsize=12)
plt.show()

***Looks like most orders are placed about one week or about one month after the previous order***
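
One thing to keep in mind when reading such a frequency table: `value_counts()` drops missing values by default. The sketch below assumes that a user's first order has no previous order and therefore a missing `days_since_prior_order`; that is an assumption about the data, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical column: NaN stands for "no previous order" (an assumption)
toy_days = pd.Series([np.nan, 7.0, 7.0, 30.0, np.nan], name='days_since_prior_order')

counted = toy_days.value_counts()                       # NaN rows are dropped
counted_with_nan = toy_days.value_counts(dropna=False)  # NaN rows are kept
missing = toy_days.isna().sum()

print(counted_with_nan)
print("missing values:", missing)
```

If the real column does contain missing values, the bar chart above covers only the rows where an interval is defined.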

***Now let's merge products, departments and aisles with the prior dataset***

In [None]:
df_Prior_merge  = df_Prior_merge.merge(df_Products,on='product_id')

In [None]:
df_Prior_merge = df_Prior_merge.merge(df_dept,how='left', on='department_id')

In [None]:
df_Prior_merge = df_Prior_merge.merge(df_aisles,how='left', on='aisle_id')

In [None]:
df_Prior_merge.head()

In [None]:
product_count = df_Prior_merge['product_id'].nunique()
department_count = df_Prior_merge['department_id'].nunique()
aisle_count = df_Prior_merge['aisle_id'].nunique()
print("So there are %d products from %d departments and %d aisles" %(product_count,department_count, aisle_count))

***Let's check the 10 most ordered products***

In [None]:
df_Prior_merge['product_name'].value_counts().head(10)

***Fruits are the most ordered products***
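
Since the goal is to predict reorders, raw order counts can be complemented by the share of purchases that were reorders. The following is a minimal sketch on made-up rows, assuming `reordered` is a 0/1 flag; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical product rows with a 0/1 reordered flag
toy = pd.DataFrame({
    'product_name': ['Banana', 'Banana', 'Banana', 'Milk', 'Milk', 'Salsa'],
    'reordered':    [1, 1, 0, 1, 0, 0],
})

# Per product: how often it was ordered and what fraction were reorders
product_stats = (
    toy.groupby('product_name')['reordered']
       .agg(times_ordered='count', reorder_rate='mean')
       .sort_values('times_ordered', ascending=False)
)
print(product_stats)
```

A product can be ordered often yet rarely reordered, so the two columns are worth reading together.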

In [None]:
plt.figure(figsize=(13,6))
sns.countplot(x='department', data=df_Prior_merge)
plt.title("Department wise Orders", fontsize=14)
plt.xlabel('Department', fontsize=12)
plt.ylabel('counts', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

In [None]:
df_Prior_merge['aisle'].value_counts().head(10)

***People are so conscious about their health :)***

***Hope this is helpful to beginners. I will keep updating it.*** <br />

***Please post your suggestions.***<br />