This course extends Intermediate Python for Data Science to provide a stronger foundation in data visualization in Python. The course provides a broader coverage of the Matplotlib library and an overview of Seaborn (a package for statistical graphics). Topics covered include customizing graphics, plotting two-dimensional arrays (e.g., pseudocolor plots, contour plots, images, etc.), statistical graphics (e.g., visualizing distributions & regressions), and working with time series and image data.

The examples below use data from the Instacart Market Basket Analysis competition on Kaggle.

**Objective of this competition:**

In this competition, Instacart is challenging the Kaggle community to use this anonymized data on customer orders over time to predict which previously purchased products will be in a user’s next order. Instacart is not only looking for the best model; it is also looking for machine learning engineers to grow its team.

The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.

For each user, between 4 and 100 of their orders are given, with the sequence of products purchased in each order.

**Import libraries**

In [65]:
import matplotlib.pyplot as plt 
import pandas as pd
import numpy as np
import seaborn as sns
color = sns.color_palette()

%matplotlib inline

Data loading:

In [66]:
from subprocess import check_output
print(check_output(['ls','../input']).decode('utf8'))
#print(check_output)

Read each of the CSV files listed above into its own DataFrame:

In [67]:
df_aisles = pd.read_csv('../input/aisles.csv',low_memory=False)
df_departments = pd.read_csv('../input/departments.csv', low_memory=False)
df_order_product_prior = pd.read_csv('../input/order_products__prior.csv',low_memory=False)
df_order_product_train = pd.read_csv('../input/order_products__train.csv',low_memory=False)
df_orders = pd.read_csv('../input/orders.csv',low_memory=False)
df_products = pd.read_csv('../input/products.csv',low_memory=False)


Always remember to view tabular data in tabular form to get an insight into the data.

Let's check the orders.

In [68]:
df_orders.head()

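The .agg() method can be used with a tuple or list of aggregations as input. When applying multiple aggregations on multiple columns, the aggregated DataFrame has a multi-level column index. It also accepts a dictionary that maps each column to its own aggregation, as in the cell below; in that case the result keeps a flat column index.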


In [69]:
crs = df_orders.groupby('user_id')[['order_number','order_dow']].agg({'order_number':'max','order_dow':'sum'})
crs.head()
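To make the difference between the two input styles concrete, here is a minimal sketch on a toy table. The `orders` frame and its values are made up for illustration and are not the Instacart data.

```python
import pandas as pd

# Toy stand-in for df_orders (hypothetical values, not the Instacart data)
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "order_dow": [0, 3, 5, 1, 1],
})

grouped = orders.groupby("user_id")[["order_number", "order_dow"]]

# A list of aggregations is applied to every selected column,
# so the result has a two-level column index: (column, aggregation)
multi = grouped.agg(["max", "sum"])

# A dict maps each column to its own aggregation and keeps flat columns
flat = grouped.agg({"order_number": "max", "order_dow": "sum"})

print(multi)
print(flat)
```

With the list form, a single value is addressed by a tuple such as `multi.loc[1, ("order_dow", "sum")]`; with the dict form, a plain column name is enough.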

In [70]:
df_order_product_prior.head()

In [71]:
df_order_product_train.head()

Inference:

The orders table lists every order along with details such as 'order_hour_of_day' (the hour of the day at which the order was placed), 'days_since_prior_order' (days since the previous order), 'order_id' (the order ID), and so on.

df_order_product_prior vs. df_order_product_train:

Both datasets have the same columns.

The orders table has a column "eval_set" that tells us which of the three datasets (prior, train or test) a given row belongs to.

The EDA below shows how the orders are distributed across the prior, train and test sets.

In [72]:
order_src = df_orders['eval_set'].value_counts()
order_src

The same counts as a bar plot:

In [73]:
order_src = df_orders['eval_set'].value_counts()
#order_src.head()
# Pass x and y as keywords; recent seaborn releases no longer accept them positionally
sns.barplot(x=order_src.index, y=order_src.values)
plt.title('EDA on order dataset to count rows in each orders dataset',fontsize=14)
plt.xlabel('eval set',fontsize=12)
plt.ylabel('Number of Occurrences',fontsize=12)
plt.show()

Let's try to find the pattern in when orders are made:

1. Find on which day of the week orders surge, and on which they drop, by plotting a graph using sns.countplot. 'order_dow' is the day of the week.
2. Similarly, let's find at what time of the day orders surge or go down, using 'order_hour_of_day'. This is also quite predictable based on the region. E.g., people are more likely to shop in the morning than at midnight, for everything from milk to sugar (Indian buying pattern).

In [74]:
plt.figure(figsize=(10,5))
sns.countplot(x='order_dow',data=df_orders)
plt.title('1. Order by week day',fontsize=14)
plt.xlabel('week day',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.show()

0 and 1 could be the weekend: the spike in count suggests that they are Saturday and Sunday respectively. Yes, people shop more on weekends!

In [75]:
plt.figure(figsize=(10,5))
sns.countplot(x='order_hour_of_day',data=df_orders)
plt.title('2. Order by time of the day',fontsize=14)
plt.xlabel('Time of the day',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.show()

So the above graph shows the buying pattern: it is quite visible that orders are mostly placed during the day, between 9 AM and 5 PM (hours 9 to 17).

Now it makes sense that orders surge in the daytime and on Saturdays and Sundays. But let's see the distribution of both in one graph using a heatmap.

groupby: an example with the Titanic dataset. If you have not done the Titanic competition, the column names may be less familiar.

# Group titanic by 'pclass'
by_class = titanic.groupby('pclass')

# Aggregate 'survived' column of by_class by count
count_by_class = by_class['survived'].count()

# Print count_by_class
print(count_by_class)

# Group titanic by 'embarked' and 'pclass'
by_mult = titanic.groupby(['embarked','pclass'])

# Aggregate 'survived' column of by_mult by count
count_mult = by_mult['survived'].count()

# Print count_mult
print(count_mult)
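The snippet above assumes a `titanic` DataFrame is already loaded. As a self-contained sketch, the same calls can be tried on a hypothetical five-passenger sample that only borrows the Titanic column names:

```python
import pandas as pd

# Hypothetical five-passenger sample using the Titanic column names
titanic = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3],
    "embarked": ["S", "C", "S", "S", "Q"],
    "survived": [1, 1, 0, 1, 0],
})

# count() counts non-null 'survived' entries per class, not survivors
count_by_class = titanic.groupby("pclass")["survived"].count()
print(count_by_class)

# Grouping by two columns yields a Series with a two-level index
count_mult = titanic.groupby(["embarked", "pclass"])["survived"].count()
print(count_mult)
```

Note that `count()` reports how many rows fall in each group; to count survivors instead, `sum()` on the 0/1 column would be the aggregation to use.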

**PIVOT**  example:  Please fork the notebook to view the dataframe and sample code in a readable way. 

**users DataFrame before pivoting**

weekday    city  visitors  signups
0     Sun  Austin       139        7
1     Sun  Dallas       237       12
2     Mon  Austin       326        3
3     Mon  Dallas       456        5

print(users)
# Pivot the users DataFrame: visitors_pivot
visitors_pivot = users.pivot(index='weekday',columns='city',values='visitors')

# Print the pivoted DataFrame
print(visitors_pivot)

**users DataFrame after pivoting**

city     Austin  Dallas
weekday                
Mon         326     456
Sun         139     237
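For readers who cannot fork the notebook, the pivot can be reproduced with a short sketch that rebuilds the `users` table by hand from the values printed above:

```python
import pandas as pd

# The users table shown above, rebuilt by hand
users = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [139, 237, 326, 456],
    "signups": [7, 12, 3, 5],
})

# One row per weekday, one column per city, cells filled from 'visitors'
visitors_pivot = users.pivot(index="weekday", columns="city", values="visitors")
print(visitors_pivot)
```

`pivot` requires each (index, columns) pair to be unique; here every weekday and city combination appears exactly once, so the reshape is unambiguous.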



In [76]:
grouped_df_groupby = df_orders.groupby(["order_dow", "order_hour_of_day"])
grouped_df = grouped_df_groupby["order_number"].aggregate("count").reset_index()
print(grouped_df)
print('-----------------------------AFTER PIVOT-------------------------------------')
# pivot takes keyword arguments; positional use fails on pandas 2.x
grouped_df = grouped_df.pivot(index='order_dow', columns='order_hour_of_day', values='order_number')
print(grouped_df)
plt.figure(figsize=(10,5))
sns.heatmap(grouped_df)
plt.title("Frequency of Day of week Vs Hour of day")
plt.show()

So, orders surge on Saturday between roughly 12:30 and 16:00 and on Sunday between roughly 8:00 and 11:30.
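The group, count and pivot steps behind the heatmap can be checked on a small scale. The sketch below uses a hypothetical five-row `orders` frame and also shows `pd.crosstab`, an alternative that is not used in this notebook but builds the same day-by-hour matrix in one call.

```python
import pandas as pd

# Hypothetical mini version of df_orders
orders = pd.DataFrame({
    "order_dow": [0, 0, 0, 1, 1],
    "order_hour_of_day": [9, 9, 10, 9, 9],
    "order_number": [1, 2, 3, 1, 2],
})

# Count rows per (day, hour) pair, then spread hours across the columns
counts = (
    orders.groupby(["order_dow", "order_hour_of_day"])["order_number"]
    .count()
    .reset_index()
)
matrix = counts.pivot(
    index="order_dow", columns="order_hour_of_day", values="order_number"
)

# pivot leaves NaN where a (day, hour) pair never occurs; crosstab writes 0
alt = pd.crosstab(orders["order_dow"], orders["order_hour_of_day"])

print(matrix)
print(alt)
```

Either matrix can be passed to `sns.heatmap`; the only difference is how a missing day and hour combination is represented.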

Now, let's explore the **days_since_prior_order** column.

I would like to know the time interval between an order and the one before it (confusing, isn't it?). Here is an example: I have a habit of not stocking enough milk for the week, as it is a perishable good, so I go to the nearby supermarket to buy milk almost every single morning. The interval between my day 1 purchase and my day 2 purchase at the shop is what we are trying to explore.

The idea is to explore every single column!!

In [77]:
plt.figure(figsize=(17,5))
sns.countplot(x='days_since_prior_order',data=df_orders)
plt.title('DAY SINCE THE PRIOR ORDER',fontsize=14)
plt.xlabel('No of days since the prior order',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.show()

Observation: 7, 30, 8, 9, 14, 21 and 28 days show observable patterns.

*  Once in 7 days -- maybe toilet rolls
*  Once in 30 days -- solvents, detergents, maybe condoms, who knows...
*  Once in 14 days -- ??? I am leaving it to your choice


Alright lads, we have this column 'reordered' in our prior and train datasets.

What can we do about it? Let's have a look!

In [78]:
df_order_product_prior.head()

In [79]:
df_order_product_train.head()

In [80]:
df_order_product_prior['reordered'].sum()

In [81]:
df_order_product_prior.shape

In [82]:
df_order_product_prior['reordered'].sum() /df_order_product_prior.shape[0] * 100

This implies that the reorder percentage in the prior dataset is about 59%.

In [83]:
df_order_product_train['reordered'].sum()/df_order_product_train.shape[0] * 100

In both the prior and the train datasets, roughly 59% of the ordered products are reorders.
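Because 'reordered' is a 0/1 flag, the percentage computed above is simply the column mean times 100. A minimal sketch on made-up rows (the `order_products` frame is hypothetical) shows that the two calculations agree:

```python
import pandas as pd

# Hypothetical rows in the shape of order_products__prior.csv
order_products = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2],
    "product_id": [10, 11, 12, 10, 13],
    "reordered": [0, 1, 1, 1, 0],
})

# Ratio of flagged rows to all rows, as in the cells above
pct_by_sum = order_products["reordered"].sum() / order_products.shape[0] * 100

# The mean of a 0/1 column is the same ratio
pct_by_mean = order_products["reordered"].mean() * 100

print(pct_by_sum, pct_by_mean)
```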

In [84]:
df_order_product_prior.head()

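Let's see the number of products bought in each order.

add_to_cart_order - the position at which a product was added to the cart, so its maximum within an order is the number of products in that order

order_id - ID of each order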

In [91]:
grouped_df = df_order_product_train.groupby('order_id')['add_to_cart_order'].aggregate('max').reset_index()
cnt_srs = grouped_df.add_to_cart_order.value_counts()

plt.figure(figsize=(20,8))
# Pass x and y as keywords; recent seaborn releases no longer accept them positionally
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.5)
plt.title('Number of products bought in each order',fontsize=14)
plt.xlabel('No of products',fontsize=12)
plt.ylabel('No of occurrences',fontsize=12)
plt.show()
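The aggregation behind this plot can be traced on a toy example. The sketch below uses hypothetical rows and assumes, as the Instacart files appear to do, that 'add_to_cart_order' starts at 1 and increases by one for each product in an order.

```python
import pandas as pd

# Hypothetical rows in the shape of order_products__train.csv
order_products = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2, 3],
    "add_to_cart_order": [1, 2, 3, 1, 2, 1],
})

# The highest cart position in an order equals the number of products in it
# (assumption: positions start at 1 and have no gaps)
basket_size = order_products.groupby("order_id")["add_to_cart_order"].max()

# How many orders have each basket size, sorted by size
size_counts = basket_size.value_counts().sort_index()
print(size_counts)
```

If the no-gaps assumption does not hold for some data, `groupby("order_id").size()` counts the rows per order directly and avoids relying on the cart positions.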