# Instacart Market Data Analysis

The Instacart data provides information about customer transactions over time. It consists of multiple data sets that are categorized and, more importantly, related. This research aims to find relations among these data sets, focusing on simple findings from each one.

Research Questions:

1. Which 10 aisles do people order from the most, and which 10 the least?
2. What are the peak hours of the day for orders?
3. On which day do most people order during the peak hour found in question 2?
4. On which day do people order the most?

# 1. Which 10 aisles do people order from the most, and which 10 the least?

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [2]:
aisles = pd.read_csv('../input/aisles.csv')
aisles = aisles.sort_values('aisle_id')
aisles.head(5)

In [3]:
orders = pd.read_csv('../input/orders.csv')

orders = orders[(orders.eval_set == 'prior')]

In [4]:
products = pd.read_csv('../input/products.csv')
products = products.sort_values('aisle_id')
products.head(5)

In [5]:
product_with_aisle = pd.merge(products, aisles, on='aisle_id')
product_with_aisle.head(5)

In [6]:
order_products_prior = pd.read_csv('../input/order_products__prior.csv')
order_products_prior = order_products_prior.sort_values('order_id')
order_products_prior.head(5)

In [7]:
product_aisle_order = pd.merge(product_with_aisle, order_products_prior, on='product_id')
product_aisle_order.head(5)
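
The two merges above can be illustrated on a small made-up example. This is only a sketch: the rows below are invented stand-ins for `aisles.csv`, `products.csv` and `order_products__prior.csv`, and the `validate` argument is an optional check added here for illustration, not something the original cells use. It assumes each product belongs to exactly one aisle.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (made-up rows, not real Instacart data).
aisles = pd.DataFrame({"aisle_id": [1, 2], "aisle": ["fresh fruits", "yogurt"]})
products = pd.DataFrame({
    "product_id": [10, 11, 12],
    "product_name": ["banana", "apple", "greek yogurt"],
    "aisle_id": [1, 1, 2],
})
order_products_prior = pd.DataFrame({
    "order_id": [100, 100, 101, 102],
    "product_id": [10, 12, 10, 11],
})

# Many products map to one aisle; validate raises if aisle_id is duplicated in aisles.
product_with_aisle = pd.merge(products, aisles, on="aisle_id", validate="many_to_one")

# One product can appear on many order lines.
product_aisle_order = pd.merge(
    product_with_aisle, order_products_prior, on="product_id", validate="one_to_many"
)

# The merged table has one row per ordered item, so counting rows per aisle
# counts ordered items, not distinct customers.
items_per_aisle = product_aisle_order.groupby("aisle").size()
print(items_per_aisle)
```

Because the merged table has one row per ordered item, the aisle counts that follow measure how many items were ordered from each aisle rather than how many distinct people ordered from it.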

In [8]:
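# Count ordered items per aisle; a separate column name avoids the index/column
# name clash that makes sort_values ambiguous in recent pandas versions
aisle_table = product_aisle_order.groupby('aisle').size().to_frame('order_count')
aisle_table = aisle_table.sort_values('order_count', ascending=False)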
aisle_table.head(10)

Top 10 aisles:

In [9]:
aisle_table.head(10).plot(kind = 'barh').invert_yaxis()

Bottom 10 aisles:

In [10]:
aisle_table.tail(10).plot(kind = 'barh').invert_yaxis()

# 2. What are the peak hours of the day for orders?

In [13]:
# Count orders per hour; naming the column 'order_count' avoids clashing with the index name
ohod = orders.groupby('order_hour_of_day').size().to_frame('order_count')
ohod = ohod.sort_values('order_count', ascending=False)
ohod

In [14]:
ohod.sort_index().plot(legend = None)

# 3. On which day do most people order during the peak hour found in question 2?

In [15]:
# Keep only orders placed at the peak hour (10), then count them per day of the week
day = orders[orders.order_hour_of_day == 10].groupby('order_dow').size().to_frame('order_count')
day = day.sort_values('order_count', ascending=False)
day

The previous finding shows that the peak hour of the day is 10:00 a.m. The plot below shows how many orders were placed at that hour on each day of the week.
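
The peak hour is hard-coded as 10 in the cell above. As a hedged alternative, the sketch below derives it from the data with `idxmax` and then filters on it. The `orders` frame here is a made-up toy table with the same two column names, and `peak_hour`, `peak_by_day` and `busiest_day_at_peak` are illustrative names that do not appear in the original notebook.

```python
import pandas as pd

# Toy stand-in for orders.csv (made-up rows, not real Instacart data).
orders = pd.DataFrame({
    "order_dow":         [0, 0, 1, 1, 1, 2, 2],
    "order_hour_of_day": [10, 10, 10, 9, 15, 10, 15],
})

# Number of orders per hour; idxmax returns the hour with the highest count.
hour_counts = orders.groupby("order_hour_of_day").size()
peak_hour = hour_counts.idxmax()

# Orders placed at the peak hour, counted per day of the week.
peak_by_day = (
    orders[orders["order_hour_of_day"] == peak_hour].groupby("order_dow").size()
)
busiest_day_at_peak = peak_by_day.idxmax()

print(peak_hour, busiest_day_at_peak)
```

Deriving the hour this way keeps question 3 in step with question 2 if the input data changes.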

In [16]:
day.sort_index().plot(legend = None)

# 4. On which day do people order the most?

In [17]:
# Count orders per day of the week as a one-column DataFrame indexed by order_dow
oppdowdf = orders.groupby('order_dow').size().to_frame('Count')
oppdowdf

In [18]:
oppdowdf.sort_index().plot(legend = None)
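
The per-hour and per-day counts used in questions 2 to 4 can also be read from a single day-by-hour table. The sketch below uses `pd.crosstab` on a made-up toy `orders` frame. It is one possible way to cross-check the results, not part of the original analysis, and `dow_by_hour` is an illustrative name.

```python
import pandas as pd

# Toy stand-in for orders.csv (made-up rows, not real Instacart data).
orders = pd.DataFrame({
    "order_dow":         [0, 0, 1, 1, 2],
    "order_hour_of_day": [10, 15, 10, 10, 9],
})

# Rows are days of the week, columns are hours, cells are order counts.
dow_by_hour = pd.crosstab(orders["order_dow"], orders["order_hour_of_day"])

# Row sums give orders per day (question 4); column sums give orders per hour (question 2).
orders_per_day = dow_by_hour.sum(axis=1)
orders_per_hour = dow_by_hour.sum(axis=0)

print(dow_by_hour)
```

Selecting the column for the peak hour from `dow_by_hour` gives the same per-day counts as the filter used for question 3.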