Table of Contents:  
1. Import libraries and data.
2. Sort products into categories: low-range, mid-range, and high-range products.
3. Clean the data.
4. If-statements with the loc() function.
5. If-statements with for-loops.
6. Identify the 2 busiest and the 2 slowest days.
7. Identify the busiest hours of the days.
8. Export updated dataframe.  

1. Import libraries and data

In [3]:
# Import libraries
import pandas as pd
import os

In [1]:
# Define path for Python:
path = r'/Users/samlisik/Documents/Instacart Basket Analysis'

In [4]:
# Import the ords_prods_merge dataframe
df_ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge.pkl'))

In [7]:
# Create a subset of the ords_prods_merge dataframe (first 1,000,000 rows)
df = df_ords_prods_merge[:1000000]

In [14]:
df.shape

(1000000, 16)

2. Sort products into categories: low-range, mid-range, and high-range products

In [32]:
# Define a function
def price_label(row):
    if row['prices'] <= 5:
        return 'Low-range product'
    elif (row['prices'] > 5) and (row['prices'] <= 15):
        return 'Mid-range product'
    elif row['prices'] > 15:
        return 'High-range product'
    else:
        return 'Not enough data'

In [33]:
# Apply the function
df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)
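
The warning above appears because df was created by slicing df_ords_prods_merge, so pandas cannot tell whether the assignment is meant to reach the original dataframe. One common way to avoid it is to take an explicit copy when creating the subset. A minimal sketch on toy data (toy_full, toy_subset, and price_flag are made-up names for illustration, not part of this notebook):

```python
import pandas as pd

# Hypothetical stand-in for df_ords_prods_merge
toy_full = pd.DataFrame({'prices': [2.5, 7.0, 18.0, 4.0]})

# .copy() makes the subset an independent dataframe, so adding a column
# to it is unambiguous and leaves the original untouched
toy_subset = toy_full[:3].copy()
toy_subset['price_flag'] = toy_subset['prices'] > 5

print(toy_subset)
print('price_flag' in toy_full.columns)
```

Applied here, that would mean creating the subset with df_ords_prods_merge[:1000000].copy().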


In [15]:
df['price_range'].value_counts(dropna = False)

price_range
Mid-range product     672548
Low-range product     314087
High-range product     12412
Not enough data          953
Name: count, dtype: int64
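
Applying a Python function row by row works, but it is slow on a million rows. The same labels can probably be produced with a vectorized call such as np.select. The sketch below uses a small toy dataframe rather than the real data; the thresholds and labels are copied from price_label, while the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy prices, including the boundary values 5 and 15 and one missing price
toy = pd.DataFrame({'prices': [3.0, 5.0, 9.9, 15.0, 20.0, np.nan]})

conditions = [
    toy['prices'] <= 5,
    (toy['prices'] > 5) & (toy['prices'] <= 15),
    toy['prices'] > 15,
]
labels = ['Low-range product', 'Mid-range product', 'High-range product']

# Rows matching no condition (here: the NaN price) get the default label
toy['price_range'] = np.select(conditions, labels, default='Not enough data')
print(toy)
```

The default argument plays the role of the else branch in price_label.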

In [17]:
# Check for the most expensive product within the subset
df['prices'].max()

99999.0

A price of 99999.0 is implausible for a grocery item, so it is treated as an outlier or error value that needs to be cleaned.

3. Clean the data

In [62]:
# The upper bound of 25 is chosen to keep realistic prices while removing error values such as 99999.0.
# Rows with missing prices are dropped as well, because NaN <= 25 evaluates to False.
df = df[df['prices'] <= 25]

In [63]:
df['prices'].max()

25.0
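
Filtering removes whole rows, which also discards the order information stored in them. An alternative is to keep the rows and mark only the implausible prices as missing. The sketch below shows that idea on toy data; the threshold of 100 and the toy column values are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with one obviously wrong price
toy = pd.DataFrame({'product_id': [1, 2, 3], 'prices': [4.5, 99999.0, 12.0]})

# Hypothetical threshold: anything above 100 is treated as an error value
toy.loc[toy['prices'] > 100, 'prices'] = np.nan

print(toy['prices'].max())  # max() skips NaN by default
print(len(toy))             # the row count is unchanged
```

Which approach fits better depends on whether later steps need the affected rows.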

4. If-statements with the loc() function

In [28]:
df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [29]:
df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [30]:
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [31]:
df['price_range_loc'].value_counts(dropna = False)

price_range_loc
Mid-range product     672548
Low-range product     314087
High-range product     12255
Name: count, dtype: int64

In [38]:
# Apply the same loc() assignments to the entire dataframe
df_ords_prods_merge.loc[df_ords_prods_merge['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [41]:
df_ords_prods_merge.loc[(df_ords_prods_merge['prices'] <= 15) & (df_ords_prods_merge['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [43]:
df_ords_prods_merge.loc[df_ords_prods_merge['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [44]:
df_ords_prods_merge['price_range_loc'].value_counts(dropna = False)

price_range_loc
Mid-range product     21861997
Low-range product     10126366
High-range product      417521
NaN                      30357
Name: count, dtype: int64
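
A plausible explanation for the NaN entries is that those rows have no price: any comparison with NaN is False, so none of the three conditions selects them and the new column is never filled in. The sketch below reproduces the effect on toy data (the toy dataframe and the name both_missing are illustrative assumptions).

```python
import numpy as np
import pandas as pd

# Toy prices with one missing value
toy = pd.DataFrame({'prices': [2.0, 10.0, 20.0, np.nan]})

toy.loc[toy['prices'] > 15, 'price_range_loc'] = 'High-range product'
toy.loc[(toy['prices'] <= 15) & (toy['prices'] > 5), 'price_range_loc'] = 'Mid-range product'
toy.loc[toy['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

# Count rows where both the price and the label are missing
both_missing = (toy['prices'].isna() & toy['price_range_loc'].isna()).sum()
print(toy)
print(both_missing)
```

On the full dataframe, df_ords_prods_merge['prices'].isna().sum() would show whether the number of missing prices matches the NaN count above.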

5. If-statements with for-loops

In [45]:
# Find out the busiest day for orders
df_ords_prods_merge['orders_day_of_week'].value_counts(dropna=False)

orders_day_of_week
0    6210030
1    5666177
6    4500536
2    4218024
5    4209718
3    3844342
4    3787414
Name: count, dtype: int64

In [47]:
# Create a new column busiest_day with 3 values: "Busiest day", "Least busy", "Regularly busy"
result = []

for value in df_ords_prods_merge["orders_day_of_week"]:
    if value == 0:
        result.append("Busiest day")
    elif value == 4:
        result.append("Least busy")
    else:
        result.append("Regularly busy")

In [49]:
df_ords_prods_merge['busiest_day'] = result

In [50]:
df_ords_prods_merge['busiest_day'].value_counts(dropna=False)

busiest_day
Regularly busy    22438797
Busiest day        6210030
Least busy         3787414
Name: count, dtype: int64
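
The for-loop walks through more than 32 million values in plain Python. A dictionary lookup with Series.map should give the same labels without an explicit loop. A minimal sketch on toy data (day_labels and the toy dataframe are assumptions for illustration):

```python
import pandas as pd

# Toy day-of-week codes
toy = pd.DataFrame({'orders_day_of_week': [0, 1, 4, 6, 0]})

# Only the special days are listed; every other day falls through to the default
day_labels = {0: 'Busiest day', 4: 'Least busy'}

# Days missing from the dictionary map to NaN, which fillna turns into the default label
toy['busiest_day'] = toy['orders_day_of_week'].map(day_labels).fillna('Regularly busy')
print(toy)
```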

6. Identify the 2 busiest and the 2 slowest days

In [52]:
# Use a for-loop to create a new column (similar to the previous busiest_day column):
result = []

for value in df_ords_prods_merge['orders_day_of_week']:
    if value in [0, 1]:
        result.append('Busiest days')
    elif value in [3, 4]:
        result.append('Slowest days')
    else:
        result.append('Regularly busy')

In [54]:
# Add the new column
df_ords_prods_merge['busiest_days_v2'] = result

In [55]:
# Check frequencies
df_ords_prods_merge['busiest_days_v2'].value_counts(dropna=False)

busiest_days_v2
Regularly busy    12928278
Busiest days      11876207
Slowest days       7631756
Name: count, dtype: int64

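Observations:

The label “Busiest days” covers days 0 and 1, the two days with the highest number of orders (Sunday and Monday, assuming 0 = Sunday).

The label “Slowest days” covers days 3 and 4, the two days with the lowest number of orders (Wednesday and Thursday under the same assumption).

All other days (2, 5, and 6) are labeled “Regularly busy”, representing mid-range order activity.

The counts match the day-level frequencies from section 5: 6,210,030 + 5,666,177 = 11,876,207 for the busiest days and 3,844,342 + 3,787,414 = 7,631,756 for the slowest days, so the new column summarizes the order activity correctly.

The three counts sum to 32,436,241, the same total as the orders_day_of_week frequencies, so every row in the dataframe has been categorized and no missing values (NaN) are present.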
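
The day codes in the loop were typed in by hand after reading the frequency table. They could also be derived from the counts directly, which keeps the labels correct if the data changes. The sketch below is one possible way to do that on toy data; the names busiest and slowest and the toy frequencies are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: day 0 occurs 6 times, day 1 five times, ..., day 4 once
toy = pd.DataFrame({
    'orders_day_of_week': [0]*6 + [1]*5 + [2]*3 + [3]*2 + [4]*1 + [5]*3 + [6]*4
})

counts = toy['orders_day_of_week'].value_counts()
busiest = counts.nlargest(2).index   # two most frequent day codes
slowest = counts.nsmallest(2).index  # two least frequent day codes

toy['busiest_days_v2'] = np.select(
    [toy['orders_day_of_week'].isin(busiest), toy['orders_day_of_week'].isin(slowest)],
    ['Busiest days', 'Slowest days'],
    default='Regularly busy',
)
print(sorted(busiest), sorted(slowest))
print(toy['busiest_days_v2'].value_counts())
```

Ties at the cut-off would need a deliberate rule; the toy frequencies are chosen so that none occur.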

7. Identify the busiest hours of the days

In [57]:
# Create an empty list to store the results
busiest_period_of_day = []

In [58]:
# Loop through each value in order_hour_of_day
for value in df_ords_prods_merge['order_hour_of_day']:
    if value in [10, 11, 15, 16]:  # Example: busiest hours (10-11am, 3-4pm)
        busiest_period_of_day.append('Most orders')
    elif value in [7, 8, 12, 13, 17, 18]:  # Example: average hours
        busiest_period_of_day.append('Average orders')
    else:  # all other hours
        busiest_period_of_day.append('Fewest orders')

In [59]:
# Assign list to new column
df_ords_prods_merge['busiest_period_of_day'] = busiest_period_of_day

In [60]:
# Check frequencies
df_ords_prods_merge['busiest_period_of_day'].value_counts(dropna=False)

busiest_period_of_day
Average orders    11624059
Most orders       10705629
Fewest orders     10106553
Name: count, dtype: int64
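
The hour groups above are hard-coded, so a quick plausibility check is to compare the average number of rows per hour within each label. With the counts above, and assuming all 24 hours occur in the data, that is roughly 2.7 million rows per hour for “Most orders” (4 hours), 1.9 million for “Average orders” (6 hours), and 0.7 million for “Fewest orders” (14 hours). The sketch below shows how such a summary could be computed; it runs on toy data, and the names toy and summary are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy hours: 10 appears three times, 11 twice, 7 twice, 8 once, 3 once
toy = pd.DataFrame({'order_hour_of_day': [10, 10, 10, 11, 11, 7, 7, 8, 3]})

hours = toy['order_hour_of_day']
toy['busiest_period_of_day'] = np.select(
    [hours.isin([10, 11, 15, 16]), hours.isin([7, 8, 12, 13, 17, 18])],
    ['Most orders', 'Average orders'],
    default='Fewest orders',
)

# Rows and distinct hours per label, then the average rows per hour
summary = toy.groupby('busiest_period_of_day')['order_hour_of_day'].agg(
    rows='size', hours='nunique'
)
summary['rows_per_hour'] = summary['rows'] / summary['hours']
print(summary)
```

A descending rows_per_hour from “Most orders” to “Fewest orders” is consistent with the labels, although it does not show that each individual hour sits in the right group.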

8. Export updated dataframe

In [61]:
df_ords_prods_merge.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge_v2.pkl'))
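
After exporting, it can be worth reading the file back once to confirm that the new columns were written. A minimal round-trip sketch on toy data, using a temporary directory instead of the project path (toy, toy_path, and the file name toy_v2.pkl are made up for illustration):

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for the exported one
toy = pd.DataFrame({'order_id': [1, 2], 'busiest_day': ['Busiest day', 'Least busy']})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_v2.pkl')
    toy.to_pickle(toy_path)
    reloaded = pd.read_pickle(toy_path)

# The reloaded dataframe should match the original in shape and content
print(reloaded.shape == toy.shape)
print(reloaded.equals(toy))
```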