# Deriving New Variables 

## Table of Contents
* [01. Importing Libraries](#01.-Importing-Libraries)
* [02. Importing Files](#02.-Importing-Files)
* [03. Deriving Variables](#03.-Deriving-Variables)
    * [Creating subset of data frame](#Creating-subset-of-data-frame)
    * [Creating price label column](#Creating-price-label-column)
    * [Creating busiest day column](#Creating-busiest-day-column)
    * [Creating busiest days column](#Creating-busiest-days-column)
    * [Creating busiest period of day column](#Creating-busiest-period-of-day-column)
* [04. Exporting File](#04.-Exporting-File)

# 01. Importing Libraries

In [1]:
# Import necessary libraries 
import pandas as pd
import numpy as np
import os 

# 02. Importing Files

In [2]:
# Importing pickle file
ords_prods_merge = pd.read_pickle(r'/Users/suzandiab/Documents/Instacart Basket Analysis/02 Data/Prepared Data/ords_prods_merge.pkl')

# 03. Deriving Variables

## Creating subset of data frame 

In [3]:
# Creating subset 
df = ords_prods_merge[:1000000]

In [4]:
df.shape

(1000000, 15)

## Creating price label column

In [5]:
# Defining user-defined function 
def price_label(row):

  if row['prices'] <= 5:
    return 'Low-range product'
  elif (row['prices'] > 5) and (row['prices'] <= 15):
    return 'Mid-range product'
  elif row['prices'] > 15:
    return 'High-range product'
  else:
    return 'Not enough data'

Start by defining a new function named price_label.
The function receives one row at a time; df.apply(..., axis=1) passes it every row of the dataframe.
Indentation shows which statements belong to the definition or condition above them.

The colon (:) ends each condition and can be read as "then".
elif is short for "else if" and adds a further condition, checked only when the earlier ones are false.
else catches any value that satisfies none of the conditions, such as a missing price.
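
To see each branch fire, here is a minimal sketch that applies the same thresholds to a small made-up dataframe. The `sample` frame and its prices are illustrative assumptions, not data from the Instacart set, and the sketch uses the label 'High-range product' for prices above 15.

```python
import numpy as np
import pandas as pd

def price_label(row):
    # Same thresholds as the notebook: <= 5, (5, 15], > 15
    if row['prices'] <= 5:
        return 'Low-range product'
    elif (row['prices'] > 5) and (row['prices'] <= 15):
        return 'Mid-range product'
    elif row['prices'] > 15:
        return 'High-range product'
    else:
        # Reached only when every comparison is False, e.g. a missing (NaN) price
        return 'Not enough data'

# Hypothetical prices chosen to hit every branch, including the boundaries 5 and 15
sample = pd.DataFrame({'prices': [2.5, 5.0, 9.0, 15.0, 20.0, np.nan]})
sample['price_range'] = sample.apply(price_label, axis=1)

labels = sample['price_range'].tolist()
print(labels)
```

The NaN row lands in the else branch because every comparison against NaN evaluates to False.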

In [6]:
# Applying user-defined function
df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)
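
The warning above appears because df was created by slicing ords_prods_merge, so pandas cannot tell whether the assignment is meant for the slice or for the original. One common way to avoid it, not used in this notebook, is to take an explicit copy when creating the subset. A minimal sketch, where `full` is a hypothetical stand-in for ords_prods_merge:

```python
import pandas as pd

# Hypothetical stand-in for ords_prods_merge
full = pd.DataFrame({'prices': [2.5, 9.0, 14.8, 4.0]})

# .copy() makes the subset an independent dataframe
subset = full[:3].copy()
subset['price_range'] = 'unlabelled'

print(list(subset.columns))  # the new column exists on the subset
print(list(full.columns))    # the original is left unchanged
```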


In [7]:
# Checking values in new column
df['price_range'].value_counts(dropna = False)

price_range
Mid-range product    756450
Low-range product    243550
Name: count, dtype: int64

No rows are labelled high-range: no product in this subset costs more than $15.

In [8]:
# Checking for max price
df['prices'].max()

14.8

In [9]:
# Using loc instead of the user-defined function
df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'


In [10]:
# Using loc instead of the user-defined function
df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product' 

In [11]:
# Using loc instead of the user-defined function
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [12]:
# Checking values
df['price_range_loc'].value_counts(dropna = False)

price_range_loc
Mid-range product    756450
Low-range product    243550
Name: count, dtype: int64

loc selects rows and columns by label: the first argument picks the rows, the second names the column to write to.
A logical condition in the row position acts as the filter.
There is no explicit if in the statement.
if = df.loc[df['prices'] > 15,
then = 'price_range_loc'] = 'High-range product'
When combining several conditions in one statement, wrap each one in parentheses and join them with & (and) or | (or).
The warning above appears because df is a slice of ords_prods_merge; running loc on the full dataframe does not raise it.
loc typically runs much faster than apply with a user-defined function.
With loc, the comparison is evaluated on the whole prices column at once and the label is assigned to all matching rows in one step, while apply with axis=1 calls the Python function once for every row.
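
The three loc statements can be tried on a small made-up dataframe to see how the masks partition the rows. This is a sketch under assumptions: `sample` and its prices are illustrative, not taken from the Instacart data.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including the boundaries 5 and 15 and one missing value
sample = pd.DataFrame({'prices': [2.5, 5.0, 9.0, 15.0, 20.0, np.nan]})

# Row position: boolean mask. Column position: name of the column to write to.
sample.loc[sample['prices'] > 15, 'price_range_loc'] = 'High-range product'
sample.loc[(sample['prices'] <= 15) & (sample['prices'] > 5), 'price_range_loc'] = 'Mid-range product'
sample.loc[sample['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

print(sample['price_range_loc'].value_counts(dropna=False))
```

Rows that match none of the masks, such as the missing price here, stay NaN, which is why the notebook checks the result with value_counts(dropna = False).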

In [13]:
# Applying loc to the whole data set
ords_prods_merge.loc[ords_prods_merge['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [14]:
# Applying loc to the whole data set
ords_prods_merge.loc[(ords_prods_merge['prices'] <= 15) & (ords_prods_merge['prices'] > 5), 'price_range_loc'] = 'Mid-range product' 

In [15]:
# Applying loc to the whole data set
ords_prods_merge.loc[ords_prods_merge['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [16]:
# checking values of whole set
ords_prods_merge['price_range_loc'].value_counts(dropna = False)

price_range_loc
Mid-range product     21861558
Low-range product     10130750
High-range product      412551
Name: count, dtype: int64

## Creating busiest day column

In [17]:
# Find which day has the most orders
ords_prods_merge['orders_day_of_week'].value_counts(dropna = False)

orders_day_of_week
0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: count, dtype: int64

Day 0 (Saturday) is the busiest day. Day 4 (Wednesday) is the slowest day.

In [18]:
# Creating a new column called "busiest day"
result = []

for value in ords_prods_merge["orders_day_of_week"]:
  if value == 0:
    result.append("Busiest day")
  elif value == 4:
    result.append("Least busy")
  else:
    result.append("Regularly busy")

The loop runs through every value in the "orders_day_of_week" column, compares it with the known busiest and slowest days, and appends the corresponding string to result.
There are 3 possible labels.
This method loops through a single column instead of passing whole rows to a user-defined function, which greatly speeds up performance.
value is a placeholder name for the current item; any variable name would work.
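
For comparison, the same labels can be produced without an explicit loop. The sketch below is an alternative, not the method used in this notebook: it runs the loop on a small made-up column and checks it against NumPy's np.select, which evaluates the conditions in order and takes the first match. The `days` series is a hypothetical sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of day-of-week codes
days = pd.Series([0, 1, 4, 6, 0, 3], name='orders_day_of_week')

# Loop version, as in the notebook
result = []
for value in days:
    if value == 0:
        result.append('Busiest day')
    elif value == 4:
        result.append('Least busy')
    else:
        result.append('Regularly busy')

# Vectorized sketch: one boolean mask per label, default for everything else
vectorized = np.select(
    [days == 0, days == 4],
    ['Busiest day', 'Least busy'],
    default='Regularly busy',
).tolist()

print(result == vectorized)
```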

In [19]:
# Printing result
result

['Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Busiest day',
 'Regularly busy',
 ...

In [20]:
# Combining results with the ords_prods_merge dataframe
ords_prods_merge['busiest_day'] = result

In [21]:
# Checking values for accuracy
ords_prods_merge['busiest_day'].value_counts(dropna = False)

busiest_day
Regularly busy    22416875
Busiest day        6204182
Least busy         3783802
Name: count, dtype: int64

In [22]:
# Displaying first 5 rows to confirm the new column was added
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices,exists,price_range_loc,busiest_day
0,2539329,1,1,2,8,,196,1,0,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy
1,2398795,1,2,3,7,15.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy
2,473747,1,3,3,12,21.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy
3,2254736,1,4,4,7,29.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Least busy
4,431534,1,5,4,15,28.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Least busy


## Creating busiest days column

In [23]:
# Creating a new column called "busiest days"
result = []

for value in ords_prods_merge["orders_day_of_week"]:
  if value == 0 or value == 1:
    result.append("Busiest days")
  elif value == 4 or value == 3:
    result.append("Slowest days")
  else:
    result.append("Regularly busy")

In [24]:
# Combining results with the ords_prods_merge dataframe
ords_prods_merge['busiest_days'] = result

In [25]:
# Checking values for accuracy
ords_prods_merge['busiest_days'].value_counts(dropna = False)

busiest_days
Regularly busy    12916111
Busiest days      11864412
Slowest days       7624336
Name: count, dtype: int64

Totals for each category have changed. "Regularly busy" is still the largest category, but at roughly 40% of rows it is no longer a majority.
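
To see how the rows split across the three labels, the counts printed above can be turned into shares. A minimal sketch that re-enters those counts by hand; on the real dataframe, value_counts(normalize=True) gives the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts output above
counts = pd.Series({
    'Regularly busy': 12916111,
    'Busiest days': 11864412,
    'Slowest days': 7624336,
})

# Share of all rows carried by each label, rounded to three decimals
shares = (counts / counts.sum()).round(3)
print(shares)
```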

In [26]:
# Displaying first 5 rows to check new column was added
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices,exists,price_range_loc,busiest_day,busiest_days
0,2539329,1,1,2,8,,196,1,0,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy
1,2398795,1,2,3,7,15.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days
2,473747,1,3,3,12,21.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days
3,2254736,1,4,4,7,29.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days
4,431534,1,5,4,15,28.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days


## Creating busiest period of day column 

In [27]:
# Find which hours have the most orders
ords_prods_merge['order_hour_of_day'].value_counts(dropna = False)

order_hour_of_day
10    2761760
11    2736140
14    2689136
15    2662144
13    2660954
12    2618532
16    2535202
9     2454203
17    2087654
8     1718118
18    1636502
19    1258305
20     976156
7      891054
21     795637
22     634225
23     402316
6      290493
0      218769
1      115700
5       87961
2       69375
4       53242
3       51281
Name: count, dtype: int64

Hour 10 has the most orders. Hour 3 has the fewest.
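
One way to group the 24 hours is to rank them by order count and cut the ranking into three tiers of eight hours each. The sketch below assumes equal-sized tiers, re-enters the counts printed above by hand, and uses the hypothetical names `top_hours`, `middle_hours` and `bottom_hours`.

```python
import pandas as pd

# Counts copied from the value_counts output above (hour -> number of rows)
hour_counts = pd.Series({
    10: 2761760, 11: 2736140, 14: 2689136, 15: 2662144,
    13: 2660954, 12: 2618532, 16: 2535202, 9: 2454203,
    17: 2087654, 8: 1718118, 18: 1636502, 19: 1258305,
    20: 976156, 7: 891054, 21: 795637, 22: 634225,
    23: 402316, 6: 290493, 0: 218769, 1: 115700,
    5: 87961, 2: 69375, 4: 53242, 3: 51281,
})

# Rank hours from busiest to quietest, then cut into three tiers of eight
ranked = hour_counts.sort_values(ascending=False).index.tolist()
top_hours, middle_hours, bottom_hours = ranked[:8], ranked[8:16], ranked[16:]

print(sorted(top_hours))
print(sorted(middle_hours))
print(sorted(bottom_hours))
```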

In [28]:
# Creating a new column called "busiest_period_of_day"
result = []

for value in ords_prods_merge["order_hour_of_day"]:
  # The eight hours with the highest order counts
  if value in (9, 10, 11, 12, 13, 14, 15, 16):
    result.append("Most orders")
  # The eight hours in the middle of the ranking
  elif value in (7, 8, 17, 18, 19, 20, 21, 22):
    result.append("Average Orders")
  # The remaining eight hours have the lowest order counts
  else:
    result.append("Fewest Orders")

In [29]:
# Combining results with the ords_prods_merge dataframe
ords_prods_merge['busiest_period_of_day'] = result

In [30]:
# Checking values for accuracy
ords_prods_merge['busiest_period_of_day'].value_counts(dropna = False)

busiest_period_of_day
Most orders       21118071
Average Orders     9997651
Fewest Orders      1289137
Name: count, dtype: int64

In [31]:
# Checking columns
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices,exists,price_range_loc,busiest_day,busiest_days,busiest_period_of_day
0,2539329,1,1,2,8,,196,1,0,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy,Average Orders
1,2398795,1,2,3,7,15.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Average Orders
2,473747,1,3,3,12,21.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Most orders
3,2254736,1,4,4,7,29.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Average Orders
4,431534,1,5,4,15,28.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Most orders


In [32]:
# Checking dimensions
ords_prods_merge.shape

(32404859, 19)

# 04. Exporting File

In [33]:
ords_prods_merge.to_pickle(r'/Users/suzandiab/Documents/Instacart Basket Analysis/02 Data/Prepared Data/ords_prods_merge_derived.pkl')