# 00. Table of contents
 - Importing libraries
 - Importing dataset
 - Creating a subset to avoid memory issues
 - Creating price labels (with if statements) for the subset
 - Creating price labels (with loc) for the entire merged dataframe
 - Creating a new column flagging how busy each day is (2 versions: busiest day and busiest days)
 - Creating a new column labeling periods of the day by number of orders
 - Exporting the dataframe

# 01. Importing libraries

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import os

# 02. Importing dataset

In [2]:
path = r'C:\Users\viki\Documents\Data Analytics\Immersion\Achievement 4\Instacart Basket Analysis'

In [3]:
df_ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged.pkl'))

In [4]:
df_ords_prods_merge.shape

(32404859, 13)

# 03. Creating a subset to avoid memory issues

In [5]:
#creating subset of the first 1 million rows
df = df_ords_prods_merge[:1000000]

# 04. Creating price labels with IF statements


In [6]:
# defining a function that assigns a price label to each row
def price_label(row):

  if row['prices'] <= 5:
    return 'Low-range product'
  elif (row['prices'] > 5) and (row['prices'] <= 15):
    return 'Mid-range product'
  elif row['prices'] > 15:
    return 'High-range product'
  else:
    return 'Not enough data'

In [7]:
# creating new column for price range
df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)
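
The warning above appears because `df` was created by slicing `df_ords_prods_merge`, so pandas cannot tell whether the new column is meant for the slice or for the original. A minimal sketch of one common way to avoid it, taking an explicit copy of the slice. The toy frame and the names `full` and `subset` are made up for illustration and are not part of this notebook:

```python
import pandas as pd

# Toy stand-in for df_ords_prods_merge (values are made up for illustration)
full = pd.DataFrame({'prices': [1.5, 7.0, 12.3, 20.0]})

# .copy() makes the subset an independent DataFrame, so adding a column
# does not raise SettingWithCopyWarning and never touches `full`
subset = full[:3].copy()
subset['price_range'] = 'placeholder'

print(subset.columns.tolist())
print(full.columns.tolist())
```

The trade-off is memory: the copy holds its own data, which is usually acceptable for a subset of this size.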


In [8]:
#checking frequency in our new column
df['price_range'].value_counts()

Mid-range product    756450
Low-range product    243550
Name: price_range, dtype: int64

In [9]:
#checking the highest price in the subset (explains why no high-range label appears above)
df['prices'].max()

14.8

# 05. Creating price labels with loc for the entire df

In [10]:
df_ords_prods_merge.loc[df_ords_prods_merge['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [11]:
df_ords_prods_merge.loc[(df_ords_prods_merge['prices'] <= 15) & (df_ords_prods_merge['prices'] > 5), 'price_range_loc'] = 'Mid-range product' 

In [12]:
df_ords_prods_merge.loc[df_ords_prods_merge['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [13]:
#checking frequency on the newly created column
df_ords_prods_merge['price_range_loc'].value_counts()

Mid-range product     21860860
Low-range product     10126321
High-range product      417678
Name: price_range_loc, dtype: int64
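
The three `loc` assignments above can also be written as a single vectorized step. A minimal sketch using `np.select`, assuming a small made-up `prices` column; the fallback label mirrors the `price_label` function from section 04:

```python
import numpy as np
import pandas as pd

# Toy data covering each boundary plus a missing price (illustrative only)
df = pd.DataFrame({'prices': [2.0, 5.0, 9.9, 15.0, 15.1, np.nan]})

# Conditions are checked in order; the first one that is True picks the label
conditions = [
    df['prices'] <= 5,
    (df['prices'] > 5) & (df['prices'] <= 15),
    df['prices'] > 15,
]
labels = ['Low-range product', 'Mid-range product', 'High-range product']

# Rows matching no condition (here: missing prices) get the default label
df['price_range_loc'] = np.select(conditions, labels, default='Not enough data')

print(df)
```

One difference to be aware of: with the `loc` version, rows with a missing price stay `NaN`, while this sketch labels them 'Not enough data'.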

# 06. Analysing how busy each day of the week is

In [14]:
# checking frequency of dow
df_ords_prods_merge['orders_day_of_week'].value_counts(dropna=False)

0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: orders_day_of_week, dtype: int64

In [15]:
#creating a list to categorise how busy a day is
result = []

for value in df_ords_prods_merge["orders_day_of_week"]:
  if value == 0:
    result.append("Busiest day")
  elif value == 4:
    result.append("Least busy")
  else:
    result.append("Regularly busy")

In [16]:
#creating a new column with the result list
df_ords_prods_merge['busiest_day'] = result

In [17]:
#checking frequency on the new column
df_ords_prods_merge['busiest_day'].value_counts(dropna=False)

Regularly busy    22416875
Busiest day        6204182
Least busy         3783802
Name: busiest_day, dtype: int64
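
Looping over 32 million values in Python works but can be slow. A minimal sketch of a vectorized alternative, assuming the same day codes as above (0 = busiest, 4 = least busy); the toy frame and the name `day_labels` are illustrative:

```python
import pandas as pd

# Toy day-of-week values (illustrative only)
df = pd.DataFrame({'orders_day_of_week': [0, 1, 4, 6, 0, 3]})

# Only the special days are listed; every other day becomes NaN after map()
day_labels = {0: 'Busiest day', 4: 'Least busy'}

# fillna() supplies the label for all days not in the dictionary
df['busiest_day'] = df['orders_day_of_week'].map(day_labels).fillna('Regularly busy')

print(df['busiest_day'].value_counts(dropna=False))
```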

# 07. Task
## Step 2
### Suppose your clients have changed their minds about the labels you created in your “busiest_day” column. Now, they want “Busiest day” to become “Busiest days” (plural). This label should correspond with the two busiest days of the week as opposed to the single busiest day. At the same time, they’d also like to know the two slowest days. Create a new column for this using a suitable method.

In [18]:
#creating a list to categorise busyness
result2 = []

for value in df_ords_prods_merge["orders_day_of_week"]:
  if value == 0 or value == 1:
    result2.append("Busiest days")
  elif value == 3 or value == 4:
    result2.append("Least busy days")
  else:
    result2.append("Regularly busy")

In [19]:
#creating a new column with the result2 list
df_ords_prods_merge['busiest_days'] = result2
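
The day codes in the loop above are hard-coded from the frequency table. A minimal sketch of deriving them from the data instead, assuming a toy frame with distinct counts per day (ties would make the top and bottom two ambiguous); the names `busiest` and `slowest` are illustrative:

```python
import numpy as np
import pandas as pd

# Toy data: day 0 appears most often, day 4 least often (illustrative only)
days = [0] * 6 + [1] * 5 + [2] * 4 + [3] * 2 + [4] * 1
df = pd.DataFrame({'orders_day_of_week': days})

# value_counts() sorts by frequency, highest first
counts = df['orders_day_of_week'].value_counts()
busiest = counts.index[:2]    # two most frequent days
slowest = counts.index[-2:]   # two least frequent days

df['busiest_days'] = np.select(
    [df['orders_day_of_week'].isin(busiest), df['orders_day_of_week'].isin(slowest)],
    ['Busiest days', 'Least busy days'],
    default='Regularly busy',
)

print(df['busiest_days'].value_counts())
```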

## Step 3
### Check the values of this new column for accuracy. Note any observations in markdown format.

In [20]:
#checking frequency on the new column regarding busyness - output is shown as proportions
df_ords_prods_merge['busiest_days'].value_counts(dropna=False, normalize=True, ascending=True)

Least busy days    0.235284
Busiest days       0.366131
Regularly busy     0.398586
Name: busiest_days, dtype: float64

In [21]:
#checking frequency on the "old" column regarding busyness - output is shown as proportions
df_ords_prods_merge['busiest_day'].value_counts(dropna=False, normalize=True, ascending=True)

Least busy        0.116767
Busiest day       0.191458
Regularly busy    0.691775
Name: busiest_day, dtype: float64

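<font color=blue>It is not entirely clear what checking the accuracy means here, so I did two things. First, I checked the shares against the day-of-week counts from section 06: days 0 and 1 together account for 11,864,412 of the 32,404,859 rows (0.366131), and days 3 and 4 for 7,624,336 rows (0.235284), which matches the "Busiest days" and "Least busy days" proportions above. Second, I compared how much the result changes depending on how many days we count as busiest / regular / least busy. The weighting among the categories has changed significantly: "Regularly busy" drops from about 69% of rows to about 40%, so it no longer dominates.</font>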
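
One further way to check the labels is to cross-tabulate them against the day codes: each day should fall under exactly one label. A minimal sketch with a made-up frame; the names `check` and `labels_per_day` are illustrative:

```python
import pandas as pd

# Toy frame with the day code and its label (illustrative only)
df = pd.DataFrame({
    'orders_day_of_week': [0, 1, 2, 3, 4, 5, 6, 0, 3],
    'busiest_days': [
        'Busiest days', 'Busiest days', 'Regularly busy',
        'Least busy days', 'Least busy days', 'Regularly busy',
        'Regularly busy', 'Busiest days', 'Least busy days',
    ],
})

# Rows are day codes, columns are labels, cells are row counts
check = pd.crosstab(df['orders_day_of_week'], df['busiest_days'])

# Number of distinct labels each day received; should be 1 everywhere
labels_per_day = (check > 0).sum(axis=1)

print(check)
print(labels_per_day)
```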

## Step 4
### When too many users make Instacart orders at the same time, the app freezes. The senior technical officer at Instacart wants you to identify the busiest hours of the day. Rather than by hour, they want periods of time labeled “Most orders,” “Average orders,” and “Fewest orders.” Create a new column containing these labels called “busiest_period_of_day.”

In [22]:
#checking what the order hour column is called
df_ords_prods_merge.columns

Index(['order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'product_id',
       'add_to_cart_order', 'reordered', 'product_name', 'aisle_id',
       'department_id', 'prices', 'price_range_loc', 'busiest_day',
       'busiest_days'],
      dtype='object')

In [23]:
#checking frequency for hours
df_ords_prods_merge['order_hour_of_day'].value_counts(dropna=False)

10    2761760
11    2736140
14    2689136
15    2662144
13    2660954
12    2618532
16    2535202
9     2454203
17    2087654
8     1718118
18    1636502
19    1258305
20     976156
7      891054
21     795637
22     634225
23     402316
6      290493
0      218769
1      115700
5       87961
2       69375
4       53242
3       51281
Name: order_hour_of_day, dtype: int64

In [24]:
#creating list for busiest hour categories
result3 = []

for value in df_ords_prods_merge["order_hour_of_day"]:
  if value in [10,11,14,15,13,12,16,9]:
    result3.append("Most orders")
  elif value in [3,4,2,5,1,0,6,23]:
    result3.append("Fewest orders")
  else:
    result3.append("Average orders")

In [25]:
#creating a new column with the result3 list
df_ords_prods_merge['busiest_period_of_day'] = result3

## Step 5
### Print the frequency for this new column.

In [26]:
#printing frequency
df_ords_prods_merge['busiest_period_of_day'].value_counts(dropna=False)

Most orders       21118071
Average orders     9997651
Fewest orders      1289137
Name: busiest_period_of_day, dtype: int64
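
The hour lists above split the 24 hours into three groups of eight by frequency. A minimal sketch of deriving the same kind of split from the data, assuming a toy frame with six hours and distinct counts; ranking the hours and cutting the ranks into thirds with `pd.qcut` is one possible approach, not necessarily the intended one:

```python
import pandas as pd

# Toy data: hour 10 is the most frequent, hour 2 the least (illustrative only)
hours = [10] * 6 + [11] * 5 + [8] * 4 + [19] * 3 + [3] * 2 + [2] * 1
df = pd.DataFrame({'order_hour_of_day': hours})

# Count rows per hour, then rank the hours (1 = fewest orders)
counts = df['order_hour_of_day'].value_counts()
ranks = counts.rank(method='first')

# Cut the ranks into three equal-sized groups, lowest ranks first
hour_labels = pd.qcut(
    ranks, q=3, labels=['Fewest orders', 'Average orders', 'Most orders']
)

# Turn the per-hour labels into a plain dict and map it onto every row
mapping = hour_labels.astype(str).to_dict()
df['busiest_period_of_day'] = df['order_hour_of_day'].map(mapping)

print(df['busiest_period_of_day'].value_counts())
```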

## Step 7
### Export your dataframe as a pickle file (since you added new columns) and store it correctly in your “Prepared Data” folder.

In [27]:
#checking all columns in the altered df
df_ords_prods_merge.columns

Index(['order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'product_id',
       'add_to_cart_order', 'reordered', 'product_name', 'aisle_id',
       'department_id', 'prices', 'price_range_loc', 'busiest_day',
       'busiest_days', 'busiest_period_of_day'],
      dtype='object')

In [28]:
df_ords_prods_merge.shape

(32404859, 17)

In [29]:
#exporting final df with new columns in pickle format
df_ords_prods_merge.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_merged_with_new_columns.pkl'))
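
After an export like the one above, it can be worth confirming that the file reads back unchanged. A minimal sketch of such a round-trip check, using a temporary folder and a made-up frame rather than the real project path:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the exported dataframe (illustrative only)
df = pd.DataFrame({
    'order_id': [1, 2, 3],
    'busiest_period_of_day': ['Most orders', 'Average orders', 'Fewest orders'],
})

# Write to a temporary folder, then read the file straight back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'example.pkl')
    df.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# equals() compares shape, column names, dtypes and values
same = reloaded.equals(df)
print(same)
```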