# Table of Contents

## 1. Data Import and Checks
## 2. Data Visualization: Prices and Day of Week
## 3. Data Sampling
## 4. Data Visualization: Day of Week, Income, Dependants, Hour of Day
## 5. Data Export

# 1. Data Import and Checks

In [None]:
# import libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import scipy
import seaborn as sns

In [None]:
# create path
path = r'C:\Users\18602\Documents\Data Analytics\Data Immersion\Month 4\Instacart Basket Analysis'

In [None]:
# import the prepared orders/products dataset
df = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_large_4_9.pkl'))

In [None]:
df.columns

In [None]:
df.shape

# 2. Data Visualization: Prices and Day of Week

In [None]:
# create bar chart
bar = df['order_day_of_week'].value_counts().sort_index().plot.bar(color = ['red','pink','orange','yellow','green','blue','purple'])

In [None]:
# save bar chart
bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'bar_orders_dow.png'))

In [None]:
# create histogram
hist = df['prices'].plot.hist(bins = 10)

In [None]:
# save histogram
hist.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'hist_prices.png'))

In [None]:
# create scatterplot
scat = sns.scatterplot(x = 'prices', y = 'prices',data = df)

In [None]:
# save scatterplot
scat.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'scat_prices.png'))

# 3. Data Sampling

In [None]:
# create a random sampling assigning each value true or false
np.random.seed(4)
dev = np.random.rand(len(df)) <= 0.7

In [None]:
# store 70% of the data in the label 'big' and 30% in the label 'small'
big = df[dev]
small = df[~dev]

In [None]:
# check length
len(df)

In [None]:
# check length
len(big) + len(small)
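
The split above can be sanity-checked on a small stand-in table. The sketch below is only an illustration: `toy`, `mask`, `big_toy` and `small_toy` are hypothetical names, and the 1,000-row frame is made up, not the Instacart data. It uses the same technique (one uniform random number per row, compared against 0.7) and confirms that the two parts are disjoint and together cover every row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's df; the real data has far more rows.
toy = pd.DataFrame({'prices': np.arange(1000, dtype=float)})

# Same technique as above: seed the generator, then draw one uniform number per row.
np.random.seed(4)
mask = np.random.rand(len(toy)) <= 0.7

big_toy = toy[mask]      # rows where the mask is True (roughly 70%)
small_toy = toy[~mask]   # the remaining rows (roughly 30%)

# The two parts should add up to the full table and share no rows.
rows_match = len(big_toy) + len(small_toy) == len(toy)
share_big = len(big_toy) / len(toy)
no_overlap = big_toy.index.intersection(small_toy.index).empty
print(rows_match, no_overlap, round(share_big, 2))
```

The share is only approximately 0.7, because each row is assigned independently rather than by taking an exact 70% cut.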

In [None]:
# re-create the random mask with a different seed
# (note: 'big' and 'small' were already built from the seed-4 mask above, so this mask is not used below)
np.random.seed(3)
dev = np.random.rand(len(df)) <= .7

In [None]:
# create a df for small sample of data
df_2 = small[['order_day_of_week','prices']]

# 4. Data Visualization: Day of Week, Income, Dependants, Hour of Day

In [None]:
# create line chart for prices over day of week
line = sns.lineplot(data = df_2, x = 'order_day_of_week',y = 'prices')

In [None]:
# save chart
line.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'line_prices_dow.png'))

In [None]:
# create histogram for order hour of day (hours run 0-23, so 24 bins gives one bin per hour)
hist2 = df['order_hour_of_day'].plot.hist(bins = 24)

In [None]:
# save chart
hist2.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'hist_order_hour_of_day.png'))

This histogram shows the number of orders in each hour of the day. The x-axis starts at 0 (midnight) and runs to 23 (11pm). From this, we can see that orders peak around 10am.
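
To read the peak hour off as a number rather than from the chart, the orders can be counted per hour directly. This is a minimal sketch on made-up values; `toy`, `orders_per_hour` and `peak_hour` are hypothetical names, and only the column name `order_hour_of_day` comes from the notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per ordered item, with the hour it was ordered.
toy = pd.DataFrame({'order_hour_of_day': [9, 10, 10, 10, 11, 14, 14, 23, 0]})

# Count rows per hour and sort by hour so the result reads like the histogram.
orders_per_hour = toy['order_hour_of_day'].value_counts().sort_index()

# The hour with the largest count is the peak.
peak_hour = orders_per_hour.idxmax()
print(orders_per_hour)
print('peak hour:', peak_hour)
```

On the real data, the same two lines applied to `df` would show whether the peak really sits at 10 or at a neighbouring hour.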

In [None]:
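# create and save bar chart
# ('bar2' was never defined above; assumed here to be a bar chart of orders per hour of day)
bar2 = df['order_hour_of_day'].value_counts().sort_index().plot.bar()
bar2.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'bar_orders_hour_of_day.png'))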

In [None]:
# create a df for small sample of data
df_3 = small[['order_hour_of_day','prices']]

In [None]:
# create line chart for order hour of day and price
line2 = sns.lineplot(data = df_3, x = 'order_hour_of_day', y = 'prices')

It looks like the average price for items drops during our busiest time of the day (10am) and peaks at a much slower time of the day (8pm).
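
By default, `sns.lineplot` aggregates repeated x values to their mean, so the line is effectively the average price per hour. The same numbers can be computed with a groupby, which makes the lowest and highest hours easy to check. This is a sketch on made-up prices; `toy` and `mean_price` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data; the prices are invented for illustration.
toy = pd.DataFrame({
    'order_hour_of_day': [10, 10, 10, 20, 20],
    'prices':            [5.0, 6.0, 7.0, 9.0, 11.0],
})

# Average price per hour: the quantity the line chart draws.
mean_price = toy.groupby('order_hour_of_day')['prices'].mean()

# Hours with the lowest and highest average price.
cheapest_hour = mean_price.idxmin()
priciest_hour = mean_price.idxmax()
print(mean_price)
print(cheapest_hour, priciest_hour)
```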

In [None]:
# save line chart
line2.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'line_prices_hour_of_day.png'))

In [None]:
# create a df for small sample of data
df_4 = small[['n_dependants','age']]

In [None]:
# create line chart for age and number of dependants
line3 = sns.lineplot(data = df_4, x = 'age', y = 'n_dependants')

In [None]:
# save line chart
line3.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'line_age_dependants.png'))

This chart does not show a clear relationship: there is no obvious connection between age and the average number of dependants. One issue is that slight fluctuations look too dramatic at this scale for the chart to be easy to read. Computing a correlation coefficient would probably be more precise, and a scatterplot with a fitted line would be more helpful.
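
The correlation coefficient mentioned above could be computed along these lines. This is a sketch under assumptions, not a result from the Instacart data: the five rows are invented so that the two columns are uncorrelated, and `toy` and `r` are hypothetical names. `Series.corr` uses the Pearson coefficient by default.

```python
import pandas as pd

# Hypothetical toy data, constructed so that age and dependants are uncorrelated.
toy = pd.DataFrame({
    'age':          [20, 30, 40, 50, 60],
    'n_dependants': [1, 3, 0, 3, 1],
})

# Pearson correlation: near +1 or -1 means a strong linear relationship, near 0 means none.
r = toy['age'].corr(toy['n_dependants'])
print(round(r, 3))
```

Running the same call on `df_4` would give a single number to compare against the visual impression from the line chart.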

In [None]:
# create a df for small sample of data
df_5 = small[['age','income']]

In [None]:
# see if there is a connection between age and income
line4 = sns.lineplot(data = df_5, x = 'income', y = 'age')

There does seem to be an increase in income (i.e. spending power) as age goes up. In the 20s, the highest income is in the 400,000 range, but by the 40s it increases to around 600,000.
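
One way to back up this reading with numbers is to group customers by decade of age and look at the highest income in each group. The sketch below uses invented values that merely mirror the pattern described above; `toy`, `age_decade` and `income_by_decade` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    'age':    [22, 25, 29, 41, 44, 48],
    'income': [90000, 400000, 120000, 600000, 150000, 95000],
})

# Label each row with its decade of age (22 -> 20, 41 -> 40).
toy['age_decade'] = toy['age'] // 10 * 10

# Highest and median income per decade.
income_by_decade = toy.groupby('age_decade')['income'].agg(['max', 'median'])
print(income_by_decade)
```

The median is included alongside the maximum because a few very high earners can raise the top value without the typical income changing much.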

# 5. Data Export

In [None]:
# save line chart
line4.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'line_age_income.png'))

In [None]:
# export to pickle
df.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_large_4_8.pkl'))