# Implementing If-Statements with User-Defined Functions in the IC Project

# 1. If you haven’t done so already, complete the instructions in the Exercise for creating the “price_label” and “busiest_day” columns.

Sort your products into price ranges--you’d likely employ different strategies to sell a 2 dollar candy bar than you would a 15 dollar box of Belgian chocolates! It makes sense to categorize your products into price range groups for easy filtering.

You can do this by creating a flag that sorts products in your ords_prods_merged dataframe according to price. Products within different ranges can be given different flags, which are stored within a new column. You’ll need to write a user-defined function to create and assign these flags.

Import the libraries and the dataframe. To avoid potential memory and processing issues, work with a subset of the df.

In [1]:
# Import libraries
import pandas as pd 
import numpy as np
import os

In [2]:
# Defining path
path = r'/Users/renataherrera/Documents/CF RH 2023-2024/CF DATA IMMERSION/CF RH A4 PYTHON/RH_PYTHON_Instacart Basket Analysis'

In [3]:
path

'/Users/renataherrera/Documents/CF RH 2023-2024/CF DATA IMMERSION/CF RH A4 PYTHON/RH_PYTHON_Instacart Basket Analysis'

In [4]:
# import ords_prods_merged df
df_ords_prods_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged.pkl'))

In [5]:
# Create a subset containing the first one million rows of the df
# The number after the colon means the subset includes every row from the beginning of the df up to that position
df = df_ords_prods_merged[:1000000]

In [6]:
# Check output of one million rows
df.shape

(1000000, 15)

In [7]:
# Check import and df merged output
df.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge
0,2539329,1,prior,1,2,8,,196,1,0,Soda,77,7,9.0,both
1,2398795,1,prior,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both
2,473747,1,prior,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both
3,2254736,1,prior,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both
4,431534,1,prior,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both


The complete function covers every condition you want to include in your filters. Here’s how it’s built:

You need to start by defining it using the def syntax at the beginning of the code.

Following this is the name you want to give your new function: price_label. In the parentheses is row, the argument that receives one row of the dataframe each time the function is called.

Finally, everything’s finished off with a colon. The colon separates the head, where you provide the name and argument(s) for your function, from the body, which is what the function will actually do.



TIP: INDENTATION--

Indentation is part of Python’s syntax: it defines which lines belong to which block. Understanding this visual hierarchy will help you determine what code belongs with what, especially within large blocks of code.

You can see that the body of the code follows a hierarchy, with some lines indented further than others. 
The body code, for instance, is all indented two spaces to the right beneath the head; this indicates that these lines belong WITHIN the code that came before it. 
And then, within the body code, some lines are indented even further. For instance, the return 'Low-range product' code is indented further than the line above, indicating that it belongs WITHIN the if-statement.

In [8]:
# If-statements with user-defined functions
# Define the price_label function; it's applied to the subset in a later cell
def price_label(row):
  # Compare the row's price against each range, from lowest to highest
  if row['prices'] <= 5:
    return 'Low-range product'
  elif (row['prices'] > 5) and (row['prices'] <= 15):
    return 'Mid-range product'
  elif row['prices'] > 15:
    return 'High-range product'
  else:
    # Reached only when every comparison is False, e.g. for a missing price
    return 'Not enough data'

# You now have a function called price_label that will apply a string label to every row within your dataframe, designating it as a low-, mid-, or high-range product based on its price.

# The colon at the end of each condition translates to “then,” so the entire statement can be read in plain language:

# If the value within the “prices” column within the given row is less than or equal to 5, then return the string “Low-range product.”

# Or else, if the value within the “prices” column within the given row is greater than 5 and less than or equal to 15, then return the string “Mid-range product.”

# Or else, if the value of the “prices” column within the given row is greater than 15, then return the string “High-range product.”

# Or else, return the string “Not enough data.”

NOTES:

If-statements tell a function to determine “if” something is true and, if so, to perform some operation. In the code above, you can see the line: if row['prices'] <= 5. This translates to “if the value in the ‘prices’ column within the given row is less than or equal to 5.”

Right after this line comes the operation to be performed: return 'Low-range product', which translates to “return the string ‘Low-range product.’” 

Next, a new condition--
The next line has the same indentation as the first if-statement, which lets you know this is a new condition.

The main difference is the use of elif instead of the simple if; this stands for “else if,” and it’s used to add additional conditions to your if-statement. Let’s try translating this into a normal sentence again: “else if the value in the ‘prices’ column within the given row is greater than 5 and the value in the ‘prices’ column within the given row is less than or equal to 15.” There are two conditions stacked into one: the price needs to be greater than five but less than or equal to 15.

After this comes the colon, which, if you’ll remember, acts as the “then,” followed by: “return the string ‘Mid-range product.’”

This continues with another elif, for prices greater than 15, before ending with a simple else. This else encompasses any other possible situations that fall outside the criteria set forth by the three if and elif statements. Think of it as a bucket to catch anything that leaks all the way through.

In this case, the else is important as it will catch missing values (which haven’t been addressed anywhere in your conditions). A comparison against a missing value (NaN) is always False, so such rows fall past every condition and land in the else.
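
To see why the else matters, here is a minimal sketch that applies the same conditions to a few hand-made rows, one of which has a missing price. The three-row sample dataframe is invented for illustration and is not part of the Instacart data.

```python
import numpy as np
import pandas as pd

def price_label(row):
  if row['prices'] <= 5:
    return 'Low-range product'
  elif (row['prices'] > 5) and (row['prices'] <= 15):
    return 'Mid-range product'
  elif row['prices'] > 15:
    return 'High-range product'
  else:
    return 'Not enough data'

# Hypothetical sample: the last price is missing (NaN)
sample = pd.DataFrame({'prices': [2.0, 9.0, np.nan]})

# Every comparison against NaN is False, so the last row reaches the else
labels = sample.apply(price_label, axis=1).tolist()
print(labels)
```

Without the else, the function would return None for the last row instead of a readable label.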

Now that you’ve defined it, you need to use it (just like you used functions like merge() or describe() in the past). Let’s use it on your df dataframe now:

In [9]:
# Create a new price_range column within the df
# and designate it as the location for your labels
# by applying the price_label function to every row of the subset
df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)


The red warning message--
Rather than an outright error, this warning simply informs you that there could be something problematic about the way you executed your procedure. Here, df was created as a slice of df_ords_prods_merged, so pandas can’t be sure whether the new column is being written to the original dataframe or to a copy. While the details are a bit more complex than what you need to understand right now, the message does suggest a possible alternative to try: the loc() method you learned earlier.
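
One common way to avoid this warning is to make the subset an explicit copy when you create it. The sketch below assumes a small invented dataframe named full in place of df_ords_prods_merged; the .copy() call is an addition that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical stand-in for df_ords_prods_merged
full = pd.DataFrame({'prices': [2.0, 9.0, 14.8, 20.0]})

# .copy() makes the subset an independent dataframe rather than a slice
subset = full[:3].copy()

# Adding a column now clearly affects only the subset
subset['price_range'] = 'placeholder'

print(list(subset.columns))  # the subset has the new column
print(list(full.columns))    # the original dataframe is unchanged
```

The trade-off is memory: a copy of one million rows takes up extra space, which is usually acceptable for a working subset.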

On the right side of the equals sign comes the code that runs your new function: df.apply(price_label, axis=1), which tells pandas to apply the price_label function along axis=1. With axis=1, the function is called once per row and receives that row’s values, so this code applies the function to every row within the dataframe. Conversely, axis=0 would call the function once per column.
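
To make the axis argument concrete, here is a small sketch on an invented two-row dataframe (the names small, row_sums and col_sums are only for illustration).

```python
import pandas as pd

# Hypothetical two-row, two-column dataframe
small = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# axis=1: the function receives one row at a time (one call per row)
row_sums = small.apply(lambda row: row['a'] + row['b'], axis=1).tolist()

# axis=0: the function receives one column at a time (one call per column)
col_sums = small.apply(lambda col: col.sum(), axis=0).tolist()

print(row_sums)  # [11, 22]
print(col_sums)  # [3, 30]
```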

Once the price_label function has been applied, run the value_counts() function to check the values in your new column.

In [10]:
# Check the values, including any missing values, in your new price_range column
# Remember you are looking at a one-million row subset you created of your data
df['price_range'].value_counts(dropna = False)

price_range
Mid-range product    756450
Low-range product    243550
Name: count, dtype: int64

Notice that you don’t have any rows with the “High-range product” label. This means there aren’t any products priced above $15 within the subset.

# Because you’re only looking at a one-million-row subset, there could still be more expensive products in your ENTIRE dataframe, even though there aren’t any in the subset.

In [11]:
# using the max function to return the max value within the "prices" column of the df dataframe
# checks what the most expensive product within the subset is
df['prices'].max()

14.8

This confirms your findings from the labels—that there aren’t any high-range products within the subset. If the returned max had been higher than 15, that would have pointed to some sort of error or miscalculation.

You’ve now created your first user-defined function and used it to sort your data into different categories. 

# But remember the warning message you received above, which suggested the loc() method? Using loc() is a preferable way of handling these types of operations.

If-Statement with the loc( ) Function

Let’s see how using loc() would change the workflow. Using loc(), you can apply the conditional logic of an if-statement to a dataframe without explicitly creating an if-else construct.

# When entering the code into Jupyter, use different cells for each condition

In [12]:
# Applying loc() function to df
df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'


In [13]:
df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product' 

In [14]:
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [15]:
#Check value counts--Counting the number of products within each label
# used on the "price_range_loc" column 
df['price_range_loc'].value_counts(dropna = False)

price_range_loc
Mid-range product    756450
Low-range product    243550
Name: count, dtype: int64

With your conditions written, it’s time to use the value_counts() function to count the number of products within each label. This time, it’s used on the “price_range_loc” column, as that’s where you’ve put your new loc()-created labels. Do you get the same results as when you used your user-defined function?
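
One way to answer that question in code is to compare the two columns directly. The sketch below uses an invented five-row dataframe with no missing prices; the names sample and same are assumptions for illustration.

```python
import pandas as pd

def price_label(row):
  if row['prices'] <= 5:
    return 'Low-range product'
  elif (row['prices'] > 5) and (row['prices'] <= 15):
    return 'Mid-range product'
  elif row['prices'] > 15:
    return 'High-range product'
  else:
    return 'Not enough data'

# Hypothetical sample covering all three ranges, including the boundaries 5 and 15
sample = pd.DataFrame({'prices': [1.5, 5.0, 9.0, 15.0, 20.0]})

# Method 1: user-defined function applied row by row
sample['price_range'] = sample.apply(price_label, axis=1)

# Method 2: one loc() statement per condition
sample.loc[sample['prices'] > 15, 'price_range_loc'] = 'High-range product'
sample.loc[(sample['prices'] <= 15) & (sample['prices'] > 5), 'price_range_loc'] = 'Mid-range product'
sample.loc[sample['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

# True only if every row received the same label from both methods
same = bool((sample['price_range'] == sample['price_range_loc']).all())
print(same)
```

Note that the two methods would differ on rows with a missing price: the function labels them “Not enough data,” while the loc() statements leave them empty.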

If both methods arrive at the same results, why would you want to use loc() instead? First, loc() is the approach the warning message itself recommends. Note that the warning still appears above, because df is a slice of df_ords_prods_merged; it’s a sign that pandas can’t tell whether you meant to modify the original dataframe or a copy, and creating the subset with .copy() is the usual way to avoid it. Second, the loc() method runs much faster: the condition is evaluated on the whole “prices” column in one vectorized operation, while your user-defined function is called separately for every row of the dataframe (remember axis = 1?).


# Let’s walk through each step of what this is doing.

You already know that the loc() function locates a particular column in the dataframe it’s been assigned to. Now, a logical operator (smaller than, larger than, equal to, etc.) is being added to the function to create a condition. The difference here is that there’s no explicit if in your if-statement. Instead, it’s all been implied.

From left to right, the loc() function is being called on the df dataframe. And within the brackets, the values in the “prices” column of the df dataframe are being compared to a value, 15, using the > operator. In normal language, you could say, “if the values in the ‘prices’ column of the df dataframe are greater than 15.”

After the comma comes the implied “then.” Here, a new column called “price_range_loc” is being set equal to the string “High-range product.” This is the same as the label you created in your user-defined function.

The two implied halves of the if-statement, then, would be:

if = df.loc[df['prices'] > 15,
then = 'price_range_loc'] = 'High-range product'
Remember—the comma is key! It’s what separates the “if” from the “then.”
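
As a small illustration of the two halves, the sketch below runs only this first line on an invented three-row dataframe. Rows that don't meet the “if” are left empty (NaN) until a later condition fills them in.

```python
import pandas as pd

# Hypothetical sample: only the last price is above 15
sample = pd.DataFrame({'prices': [3.0, 12.0, 20.0]})

# "if" half: rows where prices > 15; "then" half: write the label into price_range_loc
sample.loc[sample['prices'] > 15, 'price_range_loc'] = 'High-range product'

# Rows that did not meet the condition have no label yet
missing = int(sample['price_range_loc'].isna().sum())
print(sample)
print(missing)
```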

Next line--

Let’s move on to the next line, where it becomes a bit more complicated:

df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product'
Start by thinking about what makes up the “if” and what makes up the “then.” Just look for the comma:

if = df.loc[(df['prices'] <= 15) & (df['prices'] > 5),
then = 'price_range_loc'] = 'Mid-range product'
This time, you’re dealing with two conditions: greater than 5 and less than or equal to 15. These two conditions are combined by the & sign in the middle of the “if” half. Additionally, each condition has been placed inside parentheses. This is required, because & is evaluated before comparison operators such as <= and >; without the parentheses, Python would try to evaluate 15 & df['prices'] first. When you’re working with multiple conditions within the same statement, section them off with parentheses!
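
The sketch below shows what each half of the combined condition produces on its own, using an invented Series of four prices.

```python
import pandas as pd

# Hypothetical prices, including the boundary value 15
prices = pd.Series([3.0, 9.0, 15.0, 20.0])

# Each comparison produces its own True/False Series
at_most_15 = prices <= 15
above_5 = prices > 5

# & combines them element by element; the parentheses keep each comparison intact
mid_range = (prices <= 15) & (prices > 5)

print(mid_range.tolist())  # [False, True, True, False]
```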

Last line--

Finally, the last line of code follows the same structure as the first. Try separating it first into an “if” half and a “then” half, then deciphering what each part of the code is doing.

# Now that you’ve seen how much faster loc() works, you can try repeating the process, this time on your entire dataframe. The only difference in the code is the use of df_ords_prods_merged instead of df:

In [16]:
# Apply loc() to the entire dataframe
# rather than just the subset
df_ords_prods_merged.loc[df_ords_prods_merged['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [17]:
df_ords_prods_merged.loc[(df_ords_prods_merged['prices'] <= 15) & (df_ords_prods_merged['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [18]:
df_ords_prods_merged.loc[df_ords_prods_merged['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [19]:
df_ords_prods_merged['price_range_loc'].value_counts(dropna = False)

price_range_loc
Mid-range product     21890146
Low-range product     10126384
High-range product      417682
Name: count, dtype: int64

In [20]:
#Check price_range_loc column
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc
0,2539329,1,prior,1,2,8,,196,1,0,Soda,77,7,9.0,both,Mid-range product
1,2398795,1,prior,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Mid-range product
2,473747,1,prior,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Mid-range product
3,2254736,1,prior,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Mid-range product
4,431534,1,prior,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Mid-range product


If-Statements with For-Loops

# Next, you’ll see how you could use a for-loop in your Instacart dataframe. To do so, you’ll create a new column in your df_ords_prods_merged dataframe that summarizes how busy each day of the week is.

This information is valuable for stakeholders, as it gives them insight into what products are being bought on the busiest and slowest days. They could use it to tailor ads on specific days.

To start, you need to know on which day most orders take place. You can find this out by printing the frequency of the “orders_day_of_week” column.

In [21]:
# Printing the frequency of a column
# counting which values appear most often within that column
df_ords_prods_merged['orders_day_of_week'].value_counts(dropna = False)

orders_day_of_week
0    6209632
1    5665830
6    4500246
2    4217766
5    4209449
3    3844096
4    3787193
Name: count, dtype: int64

# But what do these numbers mean?

# In your project brief, you can see that the value 0 means Saturday. This value has the highest frequency, meaning Saturday is the busiest day. 

# Meanwhile, the 4 value has the lowest frequency. A value of 4, here, refers to Wednesday, meaning Wednesday is the slowest day for Instacart app orders.

You want to use this information to create a new column, “busiest_day,” that will contain one of three different values: “Busiest day,” “Least busy,” and “Regularly busy.” This can be done using a for-loop. The loop will run through every row in the “orders_day_of_week” column, compare its value with what you know are the busiest and slowest days, and assign it the corresponding string value.

In [22]:
# Create a new busiest_day column containing one of three different values
# using a for-loop for these three values: "Busiest day", "Least busy" and "Regularly busy"
# First step: create an empty list "result" to hold the results from your loop
result = []

In [23]:
# Loop through only one column of your dataframe,
# which is much faster than applying a function to every full row
for value in df_ords_prods_merged["orders_day_of_week"]:
  if value == 0:
    result.append("Busiest day")
  elif value == 4:
    result.append("Least busy")
  else:
    result.append("Regularly busy")

if-else structure--

If the value in that row is equal to 0, a “Busiest day” string value is appended to your currently blank result list. 

If the value is equal to 4, a “Least busy” string value is appended to the result list. 

If neither of these conditions has been met (the value is neither 0 nor 4), then a “Regularly busy” string value is appended to the result list.

# Why would you want to follow this method rather than use a user-defined function? 

Well, with this method, you’re only looping through one column of your dataframe, which will greatly speed up the performance (your user-defined function had to process every full row of the dataframe).



the value within the code--

This value is simply acting as a placeholder. It could stand for anything. And you could call it anything, too (oftentimes, an x is used, like in the simple loop you went through earlier). This element in the code represents every entry the loop will check. In the simple loop from earlier, it represented every possible age that could exist between 30 and 45. 

Here, it represents every possible value within the “orders_day_of_week” column. It’s what you want your loop to, well, loop through.
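
To show that the name of the loop variable really is arbitrary, the sketch below runs the same loop twice on an invented three-item list, once with value and once with x.

```python
# Hypothetical list standing in for the "orders_day_of_week" column
days = [0, 4, 2]

# Loop written with the placeholder name "value"
result_a = []
for value in days:
  if value == 0:
    result_a.append("Busiest day")
  elif value == 4:
    result_a.append("Least busy")
  else:
    result_a.append("Regularly busy")

# Same loop written with the placeholder name "x"
result_b = []
for x in days:
  if x == 0:
    result_b.append("Busiest day")
  elif x == 4:
    result_b.append("Least busy")
  else:
    result_b.append("Regularly busy")

print(result_a)
print(result_a == result_b)
```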

In [24]:
#print the result list to see what shows up
result

['Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Busiest day',
 'Regularly busy',
 ...

result--

That’s a long list containing an entry for every row within your dataframe, and on its own it doesn’t do much good.

Only by combining it with your df_ords_prods_merged dataframe can it be used effectively. To do so, create a new column within your df_ords_prods_merged dataframe and set it equal to result, like so:

# By adding the values in result to a new column in your dataframe, you can use your new labels more effectively, for instance, by showing at a glance how many sales occur on each different type of day

In [25]:
# Creating a new busiest_day column within ords_prods_merged df
# and set it equal to "result" 
df_ords_prods_merged['busiest_day'] = result

In [26]:
# Print the frequency of the busiest_day column
# to summarize how many rows fall under each label using value_counts()
df_ords_prods_merged['busiest_day'].value_counts(dropna = False)

busiest_day
Regularly busy    22437387
Busiest day        6209632
Least busy         3787193
Name: count, dtype: int64

In [27]:
# Cross-check with frequency of "orders_day_of_week"
df_ords_prods_merged['orders_day_of_week'].value_counts(dropna = False)

orders_day_of_week
0    6209632
1    5665830
6    4500246
2    4217766
5    4209449
3    3844096
4    3787193
Name: count, dtype: int64

# ANSWER AND CROSS CHECKS

# Summarizing how busy each day of the week is

This information is valuable for stakeholders, as it gives them insight into what products are being bought on the busiest and slowest days. They could use it to tailor ads on specific days.

# CROSS CHECKS

# In your project brief, you can see that the value 0 means Saturday. This value has the highest frequency, meaning Saturday is the busiest day.

The count for orders_day_of_week = 0 is 6209632, which matches the count for busiest_day = “Busiest day” (6209632).

# Meanwhile, the 4 value has the lowest frequency. A value of 4, here, refers to Wednesday, meaning Wednesday is the slowest day for Instacart app orders.

The count for orders_day_of_week = 4 is 3787193, which matches the count for busiest_day = “Least busy” (3787193).
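
The same cross-check can be sketched in code. The counts below are copied by hand from the two value_counts() outputs above; the dictionary names are assumptions for illustration.

```python
# Counts copied from the orders_day_of_week output
day_counts = {0: 6209632, 1: 5665830, 6: 4500246, 2: 4217766,
              5: 4209449, 3: 3844096, 4: 3787193}

# Counts copied from the busiest_day output
label_counts = {'Regularly busy': 22437387, 'Busiest day': 6209632, 'Least busy': 3787193}

# Find the days with the highest and lowest frequency
busiest = max(day_counts, key=day_counts.get)
slowest = min(day_counts, key=day_counts.get)

# All remaining days together should match the "Regularly busy" count
regular_total = sum(count for day, count in day_counts.items()
                    if day not in (busiest, slowest))

print(busiest, slowest, regular_total)
```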

# 2. Suppose your clients have changed their minds about the labels you created in your “busiest_day” column. Now, they want “Busiest day” to become “Busiest days” (plural). This label should correspond with the two busiest days of the week as opposed to the single busiest day. At the same time, they’d also like to know the two slowest days. Create a new column for this using a suitable method.

In [28]:
# Printing the frequency of a column
# counting which values appear most often within that column
df_ords_prods_merged['orders_day_of_week'].value_counts(dropna = False)

orders_day_of_week
0    6209632
1    5665830
6    4500246
2    4217766
5    4209449
3    3844096
4    3787193
Name: count, dtype: int64

In [29]:
# creating a new (now plural) busiest_days column containing one of three different values
# using a for-loop for these three values: "busiest days, slowest days and regularly busy"
#first step is to create an empty list "results2" to place the results from your loop
results2 = []

In [30]:
# Loop through only one column of your dataframe,
# which is much faster than applying a function to every full row
for value in df_ords_prods_merged["orders_day_of_week"]:
  if value == 0 or value == 1:
    results2.append("Busiest days")
  elif value == 4 or value == 3:
    results2.append("Slowest days")
  else:
    results2.append("Regularly busy")

In [31]:
#print the result list to see what shows up
results2

['Regularly busy',
 'Slowest days',
 'Slowest days',
 'Slowest days',
 'Slowest days',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Slowest days',
 'Busiest days',
 'Regularly busy',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Regularly busy',
 'Regularly busy',
 'Slowest days',
 'Slowest days',
 'Slowest days',
 'Slowest days',
 'Slowest days',
 'Slowest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Regularly busy',
 'Regularly busy',
 'Busiest days',
 'Regularly busy',
 'Regularly busy',
 'Busiest days',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Slowest days',
 'Regularly busy',
 'Slowest days',
 'Busiest days',
 'Busiest days',
 'Regularly busy',
 'Slowest days',
 'Slowest days',
 'Regularly busy',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Slowest days',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 ...

In [32]:
# Creating a new busiest_days column and adding it to ords_prods_merged df
# and set it equal to "results2" 
df_ords_prods_merged['busiest_days'] = results2

In [33]:
# Print the frequency of the busiest_days column
# to summarize how many rows fall under each label using value_counts()
df_ords_prods_merged['busiest_days'].value_counts(dropna = False)

busiest_days
Regularly busy    12927461
Busiest days      11875462
Slowest days       7631289
Name: count, dtype: int64

In [34]:
# Cross-check with frequency of "orders_day_of_week"
df_ords_prods_merged['orders_day_of_week'].value_counts(dropna = False)

orders_day_of_week
0    6209632
1    5665830
6    4500246
2    4217766
5    4209449
3    3844096
4    3787193
Name: count, dtype: int64

In [35]:
# Check output
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day,busiest_days
0,2539329,1,prior,1,2,8,,196,1,0,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy
1,2398795,1,prior,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days
2,473747,1,prior,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days
3,2254736,1,prior,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days
4,431534,1,prior,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days


# 2 A. Saturday (0) and Sunday (1) are the two "busiest days"

# Tuesday (3) and Wednesday (4) are the two "slowest days"
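
Rather than reading the two busiest and two slowest days off the output by eye, they could also be picked programmatically. This is a sketch under the assumption that the counts are copied from the value_counts() output above into a Series named day_counts.

```python
import pandas as pd

# Counts copied from the orders_day_of_week output (index = day number)
day_counts = pd.Series({0: 6209632, 1: 5665830, 6: 4500246, 2: 4217766,
                        5: 4209449, 3: 3844096, 4: 3787193})

# The two days with the most rows and the two days with the fewest rows
busiest_days = day_counts.nlargest(2).index.tolist()
slowest_days = day_counts.nsmallest(2).index.tolist()

print(busiest_days)  # [0, 1]
print(slowest_days)  # [4, 3]
```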

# 3. Check the values of this new column for accuracy. Note any observations in markdown format.

In [36]:
# Check frequency of the new busiest_days column
# to summarize how many rows fall under each label using value_counts()
df_ords_prods_merged['busiest_days'].value_counts(dropna = False)

busiest_days
Regularly busy    12927461
Busiest days      11875462
Slowest days       7631289
Name: count, dtype: int64

In [37]:
df_ords_prods_merged['orders_day_of_week'].value_counts(dropna = False)

orders_day_of_week
0    6209632
1    5665830
6    4500246
2    4217766
5    4209449
3    3844096
4    3787193
Name: count, dtype: int64

# 3 A OBSERVATIONS--

# Busiest days (11,875,462) equals the sum of Saturday (0): 6,209,632 and Sunday (1): 5,665,830

# Slowest days (7,631,289) equals the sum of Tuesday (3): 3,844,096 and Wednesday (4): 3,787,193
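
These sums can be verified with a short sketch that maps each day to its label and adds up the counts. The counts are copied from the outputs above; label_for_day and totals are names chosen for illustration.

```python
# Counts copied from the orders_day_of_week output
day_counts = {0: 6209632, 1: 5665830, 6: 4500246, 2: 4217766,
              5: 4209449, 3: 3844096, 4: 3787193}

# Same grouping as the for-loop: days 0 and 1 are busiest, 3 and 4 are slowest
label_for_day = {0: 'Busiest days', 1: 'Busiest days',
                 3: 'Slowest days', 4: 'Slowest days'}

# Add each day's count to the total of its label
totals = {}
for day, count in day_counts.items():
  label = label_for_day.get(day, 'Regularly busy')
  totals[label] = totals.get(label, 0) + count

print(totals)
```

The three totals match the busiest_days frequencies, and together they add up to the 32,434,212 rows of the dataframe.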

# 4. When too many users make Instacart orders at the same time, the app freezes. The senior technical officer at Instacart wants you to identify the busiest hours of the day. Rather than by hour, they want periods of time labeled “Most orders,” “Average orders,” and “Fewest orders.” Create a new column containing these labels called “busiest_period_of_day.”

In [38]:
# Check frequency of all values in the order_hour_of_day column
# to see which hours of the day have the most and the fewest rows using value_counts()
df_ords_prods_merged['order_hour_of_day'].value_counts()

order_hour_of_day
10    2764390
11    2738585
14    2691563
15    2664522
13    2663272
12    2620800
16    2537469
9     2456661
17    2089452
8     1719952
18    1637922
19    1259382
20     977017
7      891928
21     796362
22     634737
23     402612
6      290770
0      218942
1      115786
5       88057
2       69431
4       53283
3       51317
Name: count, dtype: int64

In [39]:
# Create a new busiest_period_of_day column containing one of three different values
# using a for-loop for these three "period of time" values: "Most orders", "Average orders" and "Fewest orders"
# First step: create an empty list "hour_of_day" to hold the results from your loop
hour_of_day = []

NOTE: There are 3 periods of time in a 24-hour day, so each period gets 8 order_hour_of_day values: the 8 most frequent hours are “Most orders,” the 8 least frequent hours are “Fewest orders,” and the remaining 8 are “Average orders.”
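
The grouping described in this note can be derived from the frequencies rather than typed by hand. The sketch below assumes the counts are copied from the order_hour_of_day output above; ranked, most, average and fewest are names chosen for illustration.

```python
# Counts copied from the order_hour_of_day output (hour: number of rows)
hour_counts = {10: 2764390, 11: 2738585, 14: 2691563, 15: 2664522,
               13: 2663272, 12: 2620800, 16: 2537469, 9: 2456661,
               17: 2089452, 8: 1719952, 18: 1637922, 19: 1259382,
               20: 977017, 7: 891928, 21: 796362, 22: 634737,
               23: 402612, 6: 290770, 0: 218942, 1: 115786,
               5: 88057, 2: 69431, 4: 53283, 3: 51317}

# Rank the hours from most to fewest rows
ranked = sorted(hour_counts, key=hour_counts.get, reverse=True)

# Split the 24 ranked hours into three groups of 8
most = ranked[:8]
average = ranked[8:16]
fewest = ranked[16:]

print(most)
print(average)
print(fewest)
```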

In [40]:
# Loop through only one column of your dataframe,
# which is much faster than applying a function to every full row
for value in df_ords_prods_merged["order_hour_of_day"]:
  if value in [10, 11, 14, 15, 13, 12, 16, 9]:
    hour_of_day.append("Most orders")
  elif value in [23, 6, 0, 1, 5, 2, 4, 3]:
    hour_of_day.append("Fewest orders")
  else:
    hour_of_day.append("Average orders")
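
As an alternative to the loop, the same labels could be assigned with loc() and the pandas isin() method, which checks whether each value appears in a list. This is only a sketch on an invented six-row dataframe, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical sample of order hours
sample = pd.DataFrame({'order_hour_of_day': [8, 12, 3, 23, 17, 10]})

most_hours = [10, 11, 14, 15, 13, 12, 16, 9]
fewest_hours = [23, 6, 0, 1, 5, 2, 4, 3]

# Start with the default label, then overwrite the rows that meet each condition
sample['busiest_period_of_day'] = 'Average orders'
sample.loc[sample['order_hour_of_day'].isin(most_hours), 'busiest_period_of_day'] = 'Most orders'
sample.loc[sample['order_hour_of_day'].isin(fewest_hours), 'busiest_period_of_day'] = 'Fewest orders'

print(sample)
```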

In [41]:
#print the hour_of_day result list to see what shows up
hour_of_day

['Average orders',
 'Average orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 ...

In [42]:
# Creating a new busiest_period_of_day column and adding it to ords_prods_merged df
# and set it equal to "hour_of_day" 
df_ords_prods_merged['busiest_period_of_day'] = hour_of_day

In [43]:
# Check the output of new column display for accuracy
df_ords_prods_merged[[ 'order_hour_of_day', 'busiest_period_of_day']].head(10)

Unnamed: 0,order_hour_of_day,busiest_period_of_day
0,8,Average orders
1,7,Average orders
2,12,Most orders
3,7,Average orders
4,15,Most orders
5,7,Average orders
6,9,Most orders
7,14,Most orders
8,16,Most orders
9,8,Average orders


In [44]:
# Also just check the output of new column display for accuracy
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day,busiest_days,busiest_period_of_day
0,2539329,1,prior,1,2,8,,196,1,0,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy,Average orders
1,2398795,1,prior,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Average orders
2,473747,1,prior,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Most orders
3,2254736,1,prior,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Average orders
4,431534,1,prior,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Most orders


In [45]:
#Check dimensions
df_ords_prods_merged.shape

(32434212, 19)

# 5. Print the frequency for this new column.

In [46]:
# Check frequency of all values in the busiest_period_of_day column
# to summarize how many rows fall under each label using value_counts()
df_ords_prods_merged['busiest_period_of_day'].value_counts()

busiest_period_of_day
Most orders       21137262
Average orders    10006752
Fewest orders      1290198
Name: count, dtype: int64

# 5 A. If you add up the hourly frequencies within each of the "Most orders", "Average orders" and "Fewest orders" groups, the three totals match the label frequencies above.

# For example, "Fewest orders"

23     402612

6      290770

0      218942

1      115786

5       88057

2       69431

4       53283

3       51317

# Total = 1290198
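
The same addition can be sketched in code for all three groups. The hourly counts are copied from the order_hour_of_day output above, grouped the same way as in the for-loop.

```python
# Hourly counts for each group, copied from the order_hour_of_day output
most = [2764390, 2738585, 2691563, 2664522, 2663272, 2620800, 2537469, 2456661]
average = [2089452, 1719952, 1637922, 1259382, 977017, 891928, 796362, 634737]
fewest = [402612, 290770, 218942, 115786, 88057, 69431, 53283, 51317]

# Add up each group and the overall total
most_total = sum(most)
average_total = sum(average)
fewest_total = sum(fewest)
grand_total = most_total + average_total + fewest_total

print(most_total, average_total, fewest_total, grand_total)
```

The three totals match the busiest_period_of_day frequencies, and the overall total matches the number of rows in the dataframe.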


# 6. Ensure your notebook is clean and structured and that your code is well commented.

# 7. Export your dataframe as a pickle file (since you added new columns) and store it correctly in your “Prepared Data” folder.

In [47]:
df_ords_prods_merged.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged_updated.pkl'))
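
To confirm that an export like this worked, you could read the file back in and compare it with the dataframe in memory. The sketch below does this with an invented two-row dataframe and a temporary folder, so it doesn't touch the project files; the file name sample.pkl is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small dataframe standing in for df_ords_prods_merged
sample = pd.DataFrame({'prices': [2.0, 9.0],
                       'price_range_loc': ['Low-range product', 'Mid-range product']})

# Export to a temporary folder, then import the file again
with tempfile.TemporaryDirectory() as tmp_dir:
  file_path = os.path.join(tmp_dir, 'sample.pkl')
  sample.to_pickle(file_path)
  reloaded = pd.read_pickle(file_path)

# True if the reloaded dataframe has the same columns, types and values
round_trip_ok = reloaded.equals(sample)
print(round_trip_ok)
```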