# Table of Content
- [Import Libraries](#import-libraries)
- [Resources](#resources)
- [Load Data Sets](#load-data-sets)
- [Data Wrangling - Orders Data](#data-wrangling---orders-data)
- [Explore Orders Data](#explore-orders-data)
- [Data Wrangling - Departments Data](#data-wrangling---departments-data)
- [Explore Products Data](#explore-products-data)
- [Export Wrangled Data Sets](#export-wrangled-data-sets)


## Import Libraries [#](#table-of-content)

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd


## Resources [#](#table-of-content)

In [2]:
# project folder
project_path = Path(r"C:\Users\vynde\Desktop\CareerFoundry Data Analytics\Data Immersion - 4 Python Fundamentals for Data Analysts\Instacart_Basket_Analysis")

# resource folders
original_data_path = project_path / "02_Data" / "Original_Data"
prepared_data_path = project_path / "02_Data" / "Prepared_Data"

# input files
oders_data_path = original_data_path / "orders.csv"
departments_data_path = original_data_path / "departments.csv"

# output files
orders_wrangled_data_path = prepared_data_path / "orders_wrangled.csv"
departments_wrangled_data_path = prepared_data_path / "departments_wrangled.csv"

## Load Data Sets [#](#table-of-content)

Orders

In [3]:
# import orders data set
df_ords = pd.read_csv(original_data_path / "orders.csv")
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


Products

In [4]:
# import products data set
df_prods = pd.read_csv(original_data_path / "products.csv", index_col=False)
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


Departments

In [5]:
# import departments
df_dep = pd.read_csv(departments_data_path, index_col=False)
df_dep.head()

Unnamed: 0,department_id,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
0,department,frozen,other,bakery,produce,alcohol,international,beverages,pets,dry goods pasta,...,meat seafood,pantry,breakfast,canned goods,dairy eggs,household,babies,snacks,deli,missing


## Data Wrangling - Orders Data [#](#table-of-content)

Dropping columns

In [6]:
df_ords = df_ords.drop(columns = ['eval_set'])

Renaming columns

In [7]:
df_ords.rename(columns = {'order_dow' : 'orders_day_of_week'}, inplace = True)

Check for more unecessary variables

In [8]:
df_ords.dtypes

order_id                    int64
user_id                     int64
order_number                int64
orders_day_of_week          int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object

Change data types if needed

In [9]:
df_ords['days_since_prior_order'] = df_ords['days_since_prior_order'].astype('Int64')  # can handle nan
df_ords.dtypes

order_id                  int64
user_id                   int64
order_number              int64
orders_day_of_week        int64
order_hour_of_day         int64
days_since_prior_order    Int64
dtype: object

## Explore Orders Data [#](#table-of-content)

Busiest hour for placing orders

In [10]:
# get counts of hours
df_ords.order_hour_of_day.value_counts().head(3)

10    288418
11    284728
15    283639
Name: order_hour_of_day, dtype: int64

>The busiest order hour is 10 am.

## Data Wrangling - Departments Data [#](#table-of-content)

Reshaping dataframe

In [11]:
# transpose and reset header of departments data frame
df_dep_t = df_dep.T
new_header = df_dep_t.iloc[0]
df_dep_t.reset_index(drop=True, inplace=True)
df_dep_t_new = df_dep_t[1:]
df_dep_t_new.columns = new_header
df_dep_t_new.head()

department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol


Data dictionary for departments in products data - What's department_id 4

In [12]:
# create data dictionary
data_dict = df_dep_t_new.to_dict('index')
data_dict[4]

{'department': 'produce'}

In [13]:
# create data dictionary V2
data_dict2 = {int(dep_id): dep for dep_id, dep in zip(df_dep.columns[1:], df_dep.values[0][1:])}
data_dict2[4]

'produce'

>department_id 4 refers to produce department

## Explore Products Data [#](#table-of-content)

Data subset containing only breakfeast items

In [14]:
# subset containing only breakfast products
df_breakfast_prods = df_prods[df_prods.department_id==14]
df_breakfast_prods

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
27,28,Wheat Chex Cereal,121,14,10.1
33,34,,121,14,12.2
67,68,"Pancake Mix, Buttermilk",130,14,13.7
89,90,Smorz Cereal,121,14,3.9
210,211,Gluten Free Organic Cereal Coconut Maple Vanilla,130,14,3.6
...,...,...,...,...,...
49330,49326,Cereal Variety Fun Pack,121,14,9.1
49395,49391,Light and Fluffy Buttermilk Pancake Mix,130,14,2.0
49547,49543,Chocolate Cheerios Cereal,121,14,10.8
49637,49633,Shake 'N Pour Buttermilk Pancake Mix,130,14,14.2


Data subset containing products used for dinner parties (products from departments: alcohol, deli, beverages, and meat/seafood)

In [15]:
# subset containing alcohol, deli, beverages, and meat/seafood
df_dinner_party_prods = df_prods[df_prods.department_id.isin([5, 20, 7, 12])]
df_dinner_party_prods

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
6,7,Pure Coconut Water With Orange,98,7,4.4
9,10,Sparkling Orange Juice & Prickly Pear Beverage,115,7,8.4
10,11,Peach Mango Juice,31,7,2.8
16,17,Rendered Duck Fat,35,12,17.1
...,...,...,...,...,...
49676,49672,Cafe Mocha K-Cup Packs,26,7,6.5
49679,49675,Cinnamon Dolce Keurig Brewed K Cups,26,7,14.0
49680,49676,Ultra Red Energy Drink,64,7,14.5
49686,49682,California Limeade,98,7,4.3


Total counts of last subset data

> 7650

Examine strange data for customer with user_id 1

In [16]:
# subset for user 1
df_ords[df_ords.user_id=="1"]

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order


>order 8 and 9 are only 2 hours apart

Basic stats of user 1

In [20]:
# basic order stats for user 1
df_ords[df_ords.user_id==1].describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,11.0,11.0,11.0,11.0,11.0,10.0
mean,1923450.0,1.0,6.0,2.636364,10.090909,19.0
std,1071950.0,0.0,3.316625,1.286291,3.477198,9.030811
min,431534.0,1.0,1.0,1.0,7.0,0.0
25%,869017.0,1.0,3.5,1.5,7.5,14.25
50%,2295261.0,1.0,6.0,3.0,8.0,19.5
75%,2544846.0,1.0,8.5,4.0,13.0,26.25
max,3367565.0,1.0,11.0,4.0,16.0,30.0


>The user has made 11 orders.
The orders were made from Mondays to Thursdays between 7am and 4pm.
The average days between 2 orders is 19 days which is also very close to the median of 19.5


## Export Wrangled Data Sets [#](#table-of-content)

Wrangled Orders Data

In [18]:
df_ords.to_csv(orders_wrangled_data_path)

Wrangled Departments Data

In [19]:
df_dep_t_new.to_csv(departments_wrangled_data_path)