# Deriving New Variables

## This script includes the following points:

1. Importing Libraries and Data
2. Creating a flag - If Statement
3. If Statement with Loc function
4. For Loop Function
5. Create a column "Busiest days" - two busiest and two slowest days
6. Check the values for accuracy
7. Busiest hour of the day
8. Frequency - Busiest hour of the day
9. Export new data frame - pickle format

## 01. Importing Libraries and Data

In [1]:
# importing libraries

import pandas as pd
import numpy as np
import os

In [2]:
# define path

path = r'/Users/robson/Desktop/CareerFoundry/Data Immersion/Achivement 4/19-04-2024 Instacart Basket Analysis'

In [3]:
#import dataframe

df_ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'prepared_data', 'ords_prods_merge.pkl'))

## 02. Creating a Flag - If Statement

### This user-defined function is applied to every row of the dataframe (axis = 1) and returns a price label for that row

In [4]:
# Take only the first 1,000,000 rows as a subset

df = df_ords_prods_merge[:1000000]

In [5]:
# check the number of rows in the subset

df.shape

(1000000, 15)

In [6]:
# define a function 

def price_label(row):
    if row['prices'] <= 5:
        return 'Low range product'
    elif (row['prices'] > 5) and (row['prices'] <= 15):
        return 'Mid-range product'
    elif row['prices'] > 15:
        return 'High range product'
    else:
        # missing prices fall through every comparison above
        return np.nan

In [7]:
# apply the function

df['price_range'] = df.apply(price_label, axis=1)

df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)


Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,both,Mid-range product
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product
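
The SettingWithCopyWarning shown above is raised because `df` was created by slicing `df_ords_prods_merge`, so pandas cannot tell whether the new column should be written to the slice or to the original dataframe. One common way to avoid it is to take an explicit copy of the slice before adding columns. The sketch below uses a small made-up dataframe; the names `toy` and `subset` and the prices are assumptions for illustration, not the Instacart data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_ords_prods_merge
toy = pd.DataFrame({'prices': [3.0, 9.0, 14.8, 20.0, np.nan]})


def price_label(row):
    if row['prices'] <= 5:
        return 'Low range product'
    elif row['prices'] <= 15:
        return 'Mid-range product'
    elif row['prices'] > 15:
        return 'High range product'
    else:
        return np.nan


# .copy() makes the subset an independent dataframe,
# so adding a column to it is unambiguous
subset = toy[:3].copy()
subset['price_range'] = subset.apply(price_label, axis=1)

print(subset)
```

The original dataframe is left untouched, which is usually what is wanted when experimenting on a subset.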


In [8]:
# check frequency

df['price_range'].value_counts(dropna = False)

price_range
Mid-range product    756450
Low range product    243550
Name: count, dtype: int64

In [9]:
# max price in the subset (it is below 15, so no row is labelled 'High range product')

df['prices'].max()

14.8

## 03. If Statement with Loc function

### The loc indexer runs much faster, since it applies the condition to the whole column at once instead of calling a function for every row

##### Applying loc to the first one million rows 

In [10]:
# label all rows with prices above 15 as 'High-range product'

df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'


In [11]:
# label all rows with prices above 5 and up to 15 as 'Mid-range product'

df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [12]:
# label all rows with prices of 5 or below as 'Low-range product'

df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [13]:
# check frequency of the new column

df['price_range_loc'].value_counts(dropna = False)

price_range_loc
Mid-range product    756450
Low-range product    243550
Name: count, dtype: int64

##### Applying the loc function on the entire dataframe

In [14]:
# label all rows with prices above 15 as 'High-range product' in the whole dataframe

df_ords_prods_merge.loc[df_ords_prods_merge['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [15]:
# label all rows with prices above 5 and up to 15 as 'Mid-range product' in the whole dataframe

df_ords_prods_merge.loc[(df_ords_prods_merge['prices'] <= 15) & (df_ords_prods_merge['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [16]:
# label all rows with prices of 5 or below as 'Low-range product' in the whole dataframe

df_ords_prods_merge.loc[df_ords_prods_merge['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [17]:
# check frequency in the whole dataframe

df_ords_prods_merge['price_range_loc'].value_counts(dropna = False)

price_range_loc
Mid-range product     21860860
Low-range product     10126321
High-range product      417678
Name: count, dtype: int64
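
The three `.loc` assignments can also be expressed as a single binning step. The sketch below is one possible alternative, not the approach used in this script: it uses `pd.cut` on a small made-up price column (the `toy` dataframe and the `price_range_cut` column name are assumptions). With the default `right=True`, the bins are (-inf, 5], (5, 15] and (15, inf), which mirrors the conditions above, and missing prices stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical prices chosen to sit on and around the bin edges
toy = pd.DataFrame({'prices': [1.0, 5.0, 5.1, 15.0, 15.1, np.nan]})

toy['price_range_cut'] = pd.cut(
    toy['prices'],
    bins=[-np.inf, 5, 15, np.inf],  # right edges are inclusive by default
    labels=['Low-range product', 'Mid-range product', 'High-range product'],
)

print(toy)
print(toy['price_range_cut'].value_counts(dropna=False))
```

Note that `pd.cut` returns a categorical column rather than plain strings, which may or may not suit later steps.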

## 04. For Loop Function

In [18]:
# example of a for loop

for x in range(30, 45):
    print('My age is %d' % x)

My age is 30
My age is 31
My age is 32
My age is 33
My age is 34
My age is 35
My age is 36
My age is 37
My age is 38
My age is 39
My age is 40
My age is 41
My age is 42
My age is 43
My age is 44


In [19]:
# To create a new column that summarizes how busy each day of the week is, first print the frequency of orders by day of week

df_ords_prods_merge['order_day_of_week'].value_counts()

order_day_of_week
0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: count, dtype: int64

In [20]:
# a for loop builds the values for a new column describing how busy each day of the week is: "Busiest day", "Least busy" or "Regularly busy"

result = []

for value in df_ords_prods_merge["order_day_of_week"]:
    if value == 0:
        result.append("Busiest day")
    elif value == 4:
        result.append("Least busy")
    else:
        result.append("Regularly busy")


In [21]:
# check result

result

['Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 ...

In [22]:
# copy the results into a new column

df_ords_prods_merge['busiest_day'] = result

In [23]:
# check frequency of busiest days

df_ords_prods_merge['busiest_day'].value_counts(dropna = False)

busiest_day
Regularly busy    22416875
Busiest day        6204182
Least busy         3783802
Name: count, dtype: int64

In [24]:
# check that the frequencies match

df_ords_prods_merge['order_day_of_week'].value_counts(dropna = False)

order_day_of_week
0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: count, dtype: int64
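
The loop above visits every row in Python, which can be slow on a dataframe of this size. A possible vectorized alternative, shown here only as a sketch on a made-up series (the `dow` and `day_labels` names are assumptions), maps the special days through a dictionary and fills everything else with the default label.

```python
import pandas as pd

# Hypothetical day-of-week values standing in for the real column
dow = pd.Series([2, 3, 3, 4, 4, 0, 1], name='order_day_of_week')

# Only the special days are listed; unmapped days become NaN ...
day_labels = {0: 'Busiest day', 4: 'Least busy'}

# ... and are then filled with the default label
busiest_day = dow.map(day_labels).fillna('Regularly busy')

print(busiest_day.value_counts(dropna=False))
```

As with the `else` branch of the loop, any value that is not 0 or 4 ends up as 'Regularly busy'.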

## 05. Create a column called "Busiest days" that shows the two busiest and two slowest days

In [25]:
# Drop the old column 'busiest_day'

df_ords_prods_merge.drop(columns = ['busiest_day'], inplace = True)

In [26]:
# Check the frequency of the column 'order_day_of_week' to see which values are the two busiest and the two slowest days
# in this case the two busiest are 0 and 1; the two slowest are 4 and 3.

df_ords_prods_merge['order_day_of_week'].value_counts(dropna = False)

order_day_of_week
0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: count, dtype: int64

In [27]:
# loop over the column and label each row as one of the two busiest days, one of the two slowest days, or a regular day

result = []

for x in df_ords_prods_merge['order_day_of_week']:
    if x in [0,1]:
        result.append('Busiest Days')
    elif x in [3,4]:
        result.append('Slowest Days')
    else: 
        result.append('Regularly Busy')

In [28]:
result

['Regularly Busy',
 'Slowest Days',
 'Slowest Days',
 'Slowest Days',
 'Slowest Days',
 'Regularly Busy',
 'Busiest Days',
 'Busiest Days',
 'Busiest Days',
 'Slowest Days',
 ...

In [29]:
# add the new column 'busiest_days' using the result produced by the loop

df_ords_prods_merge['busiest_days'] = result

In [30]:
# check that the column was added correctly

df_ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_days
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,both,Mid-range product,Regularly Busy
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Slowest Days
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Slowest Days
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Slowest Days
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Slowest Days
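
Instead of reading the busiest and slowest days off the frequency table by hand, they can be derived from `value_counts()`, which sorts from most to least frequent. The following is a minimal sketch under that assumption, using a made-up series in which day 0 is the most frequent and day 6 the least; the names `dow`, `busiest` and `slowest` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: day d occurs (7 - d) times, so 0 is busiest and 6 is slowest
dow = pd.Series(np.repeat(np.arange(7), np.arange(7, 0, -1)),
                name='order_day_of_week')

counts = dow.value_counts()   # sorted from most to least frequent
busiest = counts.index[:2]    # two most frequent days
slowest = counts.index[-2:]   # two least frequent days

busiest_days = pd.Series(
    np.select(
        [dow.isin(busiest), dow.isin(slowest)],
        ['Busiest Days', 'Slowest Days'],
        default='Regularly Busy',
    ),
    index=dow.index,
    name='busiest_days',
)

print(busiest_days.value_counts())
```

If two days have the same count, their order in `value_counts()` is not something to rely on, so ties would need a manual decision.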


## 06. Check the values for accuracy 

In [31]:
# check that these counts match the day-of-week frequencies in the next cell

df_ords_prods_merge['busiest_days'].value_counts(dropna = False)

busiest_days
Regularly Busy    12916111
Busiest Days      11864412
Slowest Days       7624336
Name: count, dtype: int64

In [32]:
# check the frequency 

df_ords_prods_merge['order_day_of_week'].value_counts(dropna = False)

order_day_of_week
0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: count, dtype: int64
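
Comparing the two frequency tables by eye works (for example, 6204182 + 5660230 = 11864412 for 'Busiest Days'), but the same check can be made explicit with a cross-tabulation. This is a sketch on a small made-up dataframe; `toy`, `check` and `labels_per_day` are assumed names.

```python
import pandas as pd

# Hypothetical rows that follow the same labelling rule as above
toy = pd.DataFrame({
    'order_day_of_week': [0, 1, 1, 2, 3, 4, 5, 6, 6],
    'busiest_days': ['Busiest Days', 'Busiest Days', 'Busiest Days',
                     'Regularly Busy', 'Slowest Days', 'Slowest Days',
                     'Regularly Busy', 'Regularly Busy', 'Regularly Busy'],
})

# Rows are days, columns are labels, cells are row counts
check = pd.crosstab(toy['order_day_of_week'], toy['busiest_days'])
print(check)

# Every day should appear under exactly one label
labels_per_day = (check > 0).sum(axis=1)
assert (labels_per_day == 1).all()
```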

## 07. Busiest hour of the day

In [33]:
# check the frequency of the column 'order_hour_of_day' to determine which hours fall into the 'Most orders', 'Average orders' and 'Fewest orders' categories

df_ords_prods_merge['order_hour_of_day'].value_counts(dropna = False)

order_hour_of_day
10    2761760
11    2736140
14    2689136
15    2662144
13    2660954
12    2618532
16    2535202
9     2454203
17    2087654
8     1718118
18    1636502
19    1258305
20     976156
7      891054
21     795637
22     634225
23     402316
6      290493
0      218769
1      115700
5       87961
2       69375
4       53242
3       51281
Name: count, dtype: int64

In [34]:
# Now that the hours are ranked by frequency, the 24 hours are split into three groups of 8 hours each

result_hour = []

for hour in df_ords_prods_merge['order_hour_of_day']:
    if hour in [10, 11, 14, 15, 13, 12, 16, 9]:
        result_hour.append('Most orders')
    elif hour in [17, 8, 18, 19, 20, 7, 21, 22]:
        result_hour.append('Average orders')
    else: 
        result_hour.append('Fewest orders')

In [35]:
result_hour

['Average orders',
 'Average orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 ...

In [36]:
# add the new column 'busiest_period_of_day' using the result_hour list produced by the loop

df_ords_prods_merge['busiest_period_of_day'] = result_hour

In [37]:
# check that the column was added correctly

df_ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_days,busiest_period_of_day
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,both,Mid-range product,Regularly Busy,Average orders
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Slowest Days,Average orders
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Slowest Days,Most orders
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Slowest Days,Average orders
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Slowest Days,Most orders
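
The three hour groups above were typed in from the frequency table. A possible way to build the same kind of grouping programmatically is sketched below: rank the hours with `value_counts()`, pair the ranked hours with three blocks of eight labels, and map the column through the resulting dictionary. The data is made up (hour h occurs h + 1 times) and the names `hours`, `ranked_hours` and `hour_to_label` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: hour h occurs (h + 1) times, so 23 is busiest and 0 is quietest
hours = pd.Series(np.repeat(np.arange(24), np.arange(1, 25)),
                  name='order_hour_of_day')

# Hours ordered from most to least frequent
ranked_hours = hours.value_counts().index.tolist()

# Three groups of eight hours each, in the same order as the ranking
labels = ['Most orders'] * 8 + ['Average orders'] * 8 + ['Fewest orders'] * 8
hour_to_label = dict(zip(ranked_hours, labels))

busiest_period = hours.map(hour_to_label)

print(busiest_period.value_counts())
```

This assumes all 24 hours occur in the data and that there are no ties at the group boundaries.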


## 08. Frequency - Busiest hour of the day

In [38]:
# frequency for the new column

df_ords_prods_merge['busiest_period_of_day'].value_counts(dropna = False)

busiest_period_of_day
Most orders       21118071
Average orders     9997651
Fewest orders      1289137
Name: count, dtype: int64

In [39]:
# number of rows in the new column, for comparison

df_ords_prods_merge['busiest_period_of_day'].size

32404859

In [40]:
# number of rows in 'order_hour_of_day', for comparison

df_ords_prods_merge['order_hour_of_day'].size

32404859

##### Both columns have the same number of rows (32404859), which also equals the sum of the three category counts above

## 09. Export dataframe in pickle format

In [41]:
df_ords_prods_merge.to_pickle(os.path.join(path, '02 Data', 'prepared_data', 'ords_prods_merge_derived.pkl'))
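
After exporting, it can be worth confirming that the pickle file reads back unchanged. The sketch below does this round trip on a tiny made-up dataframe written to a temporary directory; the file name `toy_derived.pkl` and the variable names are assumptions, and the real export path is not touched.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the derived dataframe
toy = pd.DataFrame({
    'order_hour_of_day': [8, 7, 12],
    'busiest_period_of_day': ['Average orders', 'Average orders', 'Most orders'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_derived.pkl')
    toy.to_pickle(out_path)              # export
    reloaded = pd.read_pickle(out_path)  # import again

# Raises an AssertionError if values, dtypes or index differ
pd.testing.assert_frame_equal(toy, reloaded)
print(reloaded.shape)
```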