## Table of Contents:
#### 01. Importing libraries
#### 02. Importing the merged dataframe
#### 03. Using If-statements with User-Defined functions
#### 04. Using If-statements with the loc( ) function
#### 05. Using If-statements with For-Loops
#### 06. New column for the Busiest Days of the week
#### 07. New column for the Busiest Period of the day
#### 08. Exporting the dataframe as a Pickle file

## 01. Importing libraries

In [1]:
# importing the libraries

import pandas as pd
import numpy as np
import os

## 02. Importing the merged dataframe

In [2]:
# defining the path

path=r'/Users/sanju/Documents/Jul 2023 Instacart Basket Analysis/02 Data'

In [3]:
# importing the 'orders_products_merged.pkl' dataframe

df_ords_prods_merged=pd.read_pickle(os.path.join(path,'Prepared Data','orders_products_merged.pkl'))

In [4]:
# creating a subset of the first one million rows

df=df_ords_prods_merged[:1000000]

In [5]:
# checking the shape of the subset 

df.shape

(1000000, 14)

## 03. Using If-statements with User-Defined functions

In [6]:
# defining the function

def price_label(row):
    if row['prices']<=5:
        return 'Low-range product'
    elif (row['prices']>5) and (row['prices']<=15):
        return 'Mid-range product'
    elif row['prices']>15:
        return 'High-range product'
    else:
        return 'Not enough data'

In [7]:
# applying the function to the subset

df['price_range']=df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range']=df.apply(price_label, axis=1)
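
The warning above appears because `df` was created by slicing `df_ords_prods_merged`, so pandas cannot tell whether the new column is meant for the slice or for the original dataframe. A minimal sketch of one common way to avoid it, assuming a small made-up dataframe (`toy` and `subset` are hypothetical names, not part of the Instacart data): take an explicit `.copy()` of the slice before adding columns.

```python
import pandas as pd

# hypothetical stand-in for df_ords_prods_merged
toy = pd.DataFrame({'prices': [3.0, 9.0, 14.8, 20.0]})

# an explicit copy makes the subset an independent dataframe
subset = toy[:3].copy()

# adding a column now changes only the copy, not the original
subset['price_range'] = 'placeholder'

print(list(subset.columns))
print(list(toy.columns))
```

In this notebook the equivalent change would be `df=df_ords_prods_merged[:1000000].copy()` when the subset is created.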


In [8]:
# checking the frequency

df['price_range'].value_counts(dropna=False)

Mid-range product    756450
Low-range product    243550
Name: price_range, dtype: int64

In [9]:
# checking the maximum value of the 'prices' column in the subset (it is below 15, so no 'High-range product' label appears above)

df['prices'].max()

14.800000190734863

## 04. Using If-statements with the loc( ) function

In [10]:
# creating the conditions for the subset

df.loc[df['prices']>15,'price_range_loc']='High-range product'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['prices']>15,'price_range_loc']='High-range product'


In [11]:
df.loc[(df['prices']<=15) & (df['prices']>5),'price_range_loc']='Mid-range product'

In [12]:
df.loc[df['prices']<=5,'price_range_loc']='Low-range product'

In [13]:
# checking the frequency

df['price_range_loc'].value_counts(dropna=False)

Mid-range product    756450
Low-range product    243550
Name: price_range_loc, dtype: int64

In [15]:
# applying the same loc conditions to the entire dataframe

df_ords_prods_merged.loc[df_ords_prods_merged['prices']>15,'price_range_loc']='High-range product'

In [16]:
df_ords_prods_merged.loc[(df_ords_prods_merged['prices']<=15) & (df_ords_prods_merged['prices']>5),'price_range_loc']='Mid-range product'

In [17]:
df_ords_prods_merged.loc[df_ords_prods_merged['prices']<=5,'price_range_loc']='Low-range product'

In [18]:
# checking the frequency

df_ords_prods_merged['price_range_loc'].value_counts(dropna=False)

Mid-range product     21860860
Low-range product     10126321
High-range product      417678
Name: price_range_loc, dtype: int64
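
The three `loc` assignments above can also be written as a single binning step. The sketch below is one possible alternative, not the method used in this notebook: it assumes a small made-up dataframe (`toy` and the column name `price_range_cut` are hypothetical) and uses `pd.cut` with the same boundaries of 5 and 15.

```python
import numpy as np
import pandas as pd

# hypothetical prices, including the boundary values and a missing value
toy = pd.DataFrame({'prices': [1.5, 5.0, 5.1, 15.0, 15.1, np.nan]})

# bins are right-inclusive by default: (-inf, 5], (5, 15], (15, inf)
toy['price_range_cut'] = pd.cut(
    toy['prices'],
    bins=[-np.inf, 5, 15, np.inf],
    labels=['Low-range product', 'Mid-range product', 'High-range product'],
)

print(toy['price_range_cut'].value_counts(dropna=False))
```

Missing prices stay missing in the new column, which matches the `loc` approach, where no condition is true for a missing value.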

## 05. Using If-statements with For-Loops

In [19]:
# checking the frequency of the 'orders_day_of_week' column

df_ords_prods_merged['orders_day_of_week'].value_counts(dropna=False)

0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: orders_day_of_week, dtype: int64

In [20]:
# summarizing how busy each day of the week is

result=[]
for value in df_ords_prods_merged['orders_day_of_week']:
    if value == 0:
        result.append('Busiest Day')
    elif value == 4:
        result.append('Least Busy')
    else:
        result.append('Regularly Busy')

In [21]:
# previewing the first ten labels (the full list has one entry per row)

result[:10]

['Regularly Busy',
 'Regularly Busy',
 'Regularly Busy',
 'Least Busy',
 'Least Busy',
 'Regularly Busy',
 'Regularly Busy',
 'Regularly Busy',
 'Regularly Busy',
 'Least Busy']

In [22]:
# adding the result to the dataframe

df_ords_prods_merged['busiest_day']=result

In [23]:
# checking the frequency

df_ords_prods_merged['busiest_day'].value_counts(dropna=False)

Regularly Busy    22416875
Busiest Day        6204182
Least Busy         3783802
Name: busiest_day, dtype: int64
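
The for-loop above visits every row in Python, which can be slow on a dataframe of this size. A minimal sketch of a vectorized alternative, assuming a small made-up dataframe (`toy` and `labels` are hypothetical names): map the two special days with a dictionary and fill every other day with the default label.

```python
import pandas as pd

# hypothetical day-of-week values
toy = pd.DataFrame({'orders_day_of_week': [0, 1, 2, 3, 4, 5, 6, 0]})

# days missing from the dictionary become NaN, then receive the default label
labels = {0: 'Busiest Day', 4: 'Least Busy'}
toy['busiest_day'] = toy['orders_day_of_week'].map(labels).fillna('Regularly Busy')

print(toy['busiest_day'].value_counts(dropna=False))
```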

## 06. New column for the Busiest Days of the week

In [25]:
# creating a new for-loop that labels the two busiest and the two least busy days

result_new=[]
for value in df_ords_prods_merged['orders_day_of_week']:
    if (value == 0) or (value == 1):
        result_new.append('Busiest Day')
    elif (value == 4) or (value == 3):
        result_new.append('Least Busy')
    else:
        result_new.append('Regularly Busy')

In [26]:
# adding the above result to the dataframe as 'busiest_days' column

df_ords_prods_merged['busiest_days']=result_new

In [27]:
# checking the frequency

df_ords_prods_merged['busiest_days'].value_counts(dropna=False)

Regularly Busy    12916111
Busiest Day       11864412
Least Busy         7624336
Name: busiest_days, dtype: int64
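
The day numbers in the loop above (0 and 1, 3 and 4) were read off the frequency table by hand. As a sketch of how they could be derived from the data instead, assuming a small made-up series (`days`, `busiest` and `slowest` are hypothetical names), `value_counts()` already sorts from most to least frequent, so the first and last two index entries give the two groups.

```python
import numpy as np
import pandas as pd

# hypothetical day values: days 0 and 1 are most frequent, 3 and 4 least frequent
days = pd.Series(
    [0] * 7 + [1] * 6 + [6] * 5 + [2] * 4 + [5] * 3 + [3] * 2 + [4] * 1,
    name='orders_day_of_week',
)

counts = days.value_counts()   # sorted from most to least frequent
busiest = counts.index[:2]     # the two most frequent days
slowest = counts.index[-2:]    # the two least frequent days

# the first matching condition wins; all other days get the default label
labels = np.select(
    [days.isin(busiest).to_numpy(), days.isin(slowest).to_numpy()],
    ['Busiest Day', 'Least Busy'],
    default='Regularly Busy',
)

print(pd.Series(labels).value_counts())
```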

In [28]:
# checking the columns

df_ords_prods_merged.head()

| | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | _merge | product_name | aisle_id | department_id | prices | price_range_loc | busiest_day | busiest_days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 196 | 1 | 0 | both | Soda | 77 | 7 | 9.0 | Mid-range product | Regularly Busy | Regularly Busy |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 196 | 1 | 1 | both | Soda | 77 | 7 | 9.0 | Mid-range product | Regularly Busy | Least Busy |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | 196 | 1 | 1 | both | Soda | 77 | 7 | 9.0 | Mid-range product | Regularly Busy | Least Busy |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | 196 | 1 | 1 | both | Soda | 77 | 7 | 9.0 | Mid-range product | Least Busy | Least Busy |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | 196 | 1 | 1 | both | Soda | 77 | 7 | 9.0 | Mid-range product | Least Busy | Least Busy |


## 07. New column for the Busiest Period of the day

In [29]:
# checking the frequency of the 'order_hour_of_day' column

df_ords_prods_merged['order_hour_of_day'].value_counts(dropna=False)

10    2761760
11    2736140
14    2689136
15    2662144
13    2660954
12    2618532
16    2535202
9     2454203
17    2087654
8     1718118
18    1636502
19    1258305
20     976156
7      891054
21     795637
22     634225
23     402316
6      290493
0      218769
1      115700
5       87961
2       69375
4       53242
3       51281
Name: order_hour_of_day, dtype: int64

#### Grouping the 24 hours of the day as follows:
    Most orders - 10, 11, 14, 15, 13, 12, 16, 9
    Average orders - 17, 8, 18, 19, 20, 7, 21, 22
    Fewest orders - 23, 6, 0, 1, 5, 2, 4, 3

In [30]:
# creating a for-loop with the above criteria

hour=[]
for value in df_ords_prods_merged['order_hour_of_day']:
    if value in [10,11,14,15,13,12,16,9]:
        hour.append('Most orders')
    elif value in [17,8,18,19,20,7,21,22]:
        hour.append('Average orders')
    else:
        hour.append('Fewest orders')

In [31]:
# adding the above result to the dataframe as 'busiest_period_of_day' column

df_ords_prods_merged['busiest_period_of_day']=hour

In [32]:
# checking the frequency

df_ords_prods_merged['busiest_period_of_day'].value_counts(dropna=False)

Most orders       21118071
Average orders     9997651
Fewest orders      1289137
Name: busiest_period_of_day, dtype: int64
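
The three hour groups above follow the frequency ranking: the first eight hours, the next eight, and the last eight. A minimal sketch of building the same grouping from the ranking itself, assuming made-up data whose frequencies follow the order shown in the table above (`order`, `hours`, `ranked` and `hour_labels` are hypothetical names):

```python
import pandas as pd

# hour ranking copied from the frequency table, most to least frequent
order = [10, 11, 14, 15, 13, 12, 16, 9, 17, 8, 18, 19,
         20, 7, 21, 22, 23, 6, 0, 1, 5, 2, 4, 3]

# hypothetical data: each hour is repeated fewer times the lower it ranks
hours = pd.Series(
    [h for i, h in enumerate(order) for _ in range(24 - i)],
    name='order_hour_of_day',
)

# value_counts() sorts hours from most to least frequent
ranked = hours.value_counts().index

# assign a label to each hour based on its position in the ranking
hour_labels = {}
for position, h in enumerate(ranked):
    if position < 8:
        hour_labels[int(h)] = 'Most orders'
    elif position < 16:
        hour_labels[int(h)] = 'Average orders'
    else:
        hour_labels[int(h)] = 'Fewest orders'

# look up the label for every row without a row-level loop
period = hours.map(hour_labels)

print(period.value_counts())
```

The loop here runs over 24 hours only, while the labelling of the rows is done by `map`.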

## 08. Exporting the dataframe as a Pickle file

In [33]:
# exporting the dataframe with the derived columns

df_ords_prods_merged.to_pickle(os.path.join(path,'Prepared Data','orders_products_merged_derived.pkl'))
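
After an export like the one above, it can be worth confirming that the file reads back unchanged. The sketch below shows such a round-trip check under stated assumptions: it uses a small made-up dataframe and a temporary folder (`toy`, `toy_path` and the file name `toy_derived.pkl` are hypothetical), not the notebook's real path.

```python
import os
import tempfile

import pandas as pd

# hypothetical dataframe standing in for the derived dataframe
toy = pd.DataFrame({
    'prices': [3.0, 9.0],
    'price_range_loc': ['Low-range product', 'Mid-range product'],
})

# write to a temporary folder, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_derived.pkl')
    toy.to_pickle(toy_path)
    reloaded = pd.read_pickle(toy_path)

# True when shape, values and dtypes all match
same = reloaded.equals(toy)
print(same)
```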