In [3]:
import pandas as pd
import numpy as np
from os import path as pth


In [4]:
path = r'/Users/polusa/Library/Mobile Documents/com~apple~CloudDocs/my_DA_2024/CareerFoundry_Data_Analytics_Bootcamp/4-Python_Fundamentals_for_DA/04-2024_Instacart_Basket_Analysis/02-Data'
prepared_data_folder = r'02-Prepared_Data'
raw_data_folder = r'01-Raw_Data'

In [5]:
ords_prods_merge = pd.read_pickle(pth.join(path, prepared_data_folder, 'ords_prods_merge_4.7.pkl'))
ords_prods_merge.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_name,aisle_id,department_id,prices,price_label,busiest_days,busiest_period_of_day
0,2,33120,1,1,202279,3,5,9,8.0,Organic Egg Whites,86,16,11.3,Mid-Range,regular_days,average_orders
1,26,33120,5,0,153404,2,0,16,7.0,Organic Egg Whites,86,16,11.3,Mid-Range,busiest_days,most_orders
2,120,33120,13,0,23750,11,6,8,10.0,Organic Egg Whites,86,16,11.3,Mid-Range,regular_days,average_orders
3,327,33120,5,1,58707,21,6,9,8.0,Organic Egg Whites,86,16,11.3,Mid-Range,regular_days,average_orders
4,390,33120,28,1,166654,48,0,12,9.0,Organic Egg Whites,86,16,11.3,Mid-Range,busiest_days,most_orders


In [6]:
# take a subset (the first 1_000_000 rows) of the dataset ords_prods_merge
df = ords_prods_merge[:1_000_000]
df.shape

(1000000, 16)

In [7]:
# check that the subset looks ok
df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_name,aisle_id,department_id,prices,price_label,busiest_days,busiest_period_of_day
0,2,33120,1,1,202279,3,5,9,8.0,Organic Egg Whites,86,16,11.3,Mid-Range,regular_days,average_orders
1,26,33120,5,0,153404,2,0,16,7.0,Organic Egg Whites,86,16,11.3,Mid-Range,busiest_days,most_orders
2,120,33120,13,0,23750,11,6,8,10.0,Organic Egg Whites,86,16,11.3,Mid-Range,regular_days,average_orders
3,327,33120,5,1,58707,21,6,9,8.0,Organic Egg Whites,86,16,11.3,Mid-Range,regular_days,average_orders
4,390,33120,28,1,166654,48,0,12,9.0,Organic Egg Whites,86,16,11.3,Mid-Range,busiest_days,most_orders


### Group By and Aggregation  

Grouping and aggregating in pandas involves two steps.  
__Group By__ puts together the rows that share an identical value in a column. It is most commonly applied to qualitative (categorical) data, but it also works on quantitative data.  
 
__Aggregation__ requires two arguments: the column containing the quantitative data to aggregate, and the type of aggregation to apply.  
We can additionally specify a name for the new column that will hold the aggregated result.
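
As a minimal sketch of the naming option mentioned above, pandas supports "named aggregation", where each keyword argument to `agg()` becomes the name of an output column. The toy dataframe and the output column names (`mean_order_number`, `total_order_number`) are made up for illustration and are not part of the Instacart data.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Instacart dataset)
toy = pd.DataFrame({
    'department_id': [4, 4, 16, 16, 16],
    'order_number': [1, 3, 2, 4, 6],
})

# Named aggregation: new_column_name=(source_column, aggregation)
summary = toy.groupby('department_id').agg(
    mean_order_number=('order_number', 'mean'),
    total_order_number=('order_number', 'sum'),
)
print(summary)
```

The result has one row per `department_id` and flat, readable column names instead of a two-level column header.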

In [8]:
# group by department_id and calculate the mean and the sum of order_number for each department
df.groupby('department_id').agg({'order_number':['mean','sum']})

department_id,order_number (mean),order_number (sum)
3,14.898875,206558
4,18.847057,15109120
7,16.462291,533691
11,17.419048,1829
12,19.215747,323132
13,16.473586,184916
14,13.120454,85506
16,18.867191,1905058
19,14.369608,235834


The result above shows that the mean `order_number` in department 4 (about 18.8) is clearly higher than in department 14 (about 13.1). While this seems clear-cut, it would be useful to run a hypothesis test for the difference between the means of two samples to verify that the difference is statistically significant.  
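
One possible way to sketch such a test is Welch's t-statistic, which compares two sample means without assuming equal variances. The helper `welch_t` and the toy values below are hypothetical and only illustrate the calculation. Note that rows belonging to the same user or order are not independent, so on the real data this would be a rough check at best.

```python
import numpy as np
import pandas as pd

def welch_t(a: pd.Series, b: pd.Series) -> float:
    """Welch's t-statistic for two samples with possibly unequal variances."""
    var_a = a.var(ddof=1) / len(a)  # squared standard error of mean(a)
    var_b = b.var(ddof=1) / len(b)  # squared standard error of mean(b)
    return (a.mean() - b.mean()) / np.sqrt(var_a + var_b)

# Toy data (hypothetical values, not taken from the Instacart dataset)
toy = pd.DataFrame({
    'department_id': [4, 4, 4, 4, 14, 14, 14, 14],
    'order_number': [18, 20, 22, 20, 12, 14, 13, 13],
})

group_4 = toy.loc[toy['department_id'] == 4, 'order_number']
group_14 = toy.loc[toy['department_id'] == 14, 'order_number']

t_stat = welch_t(group_4, group_14)
print(round(t_stat, 3))
```

If SciPy is available, `scipy.stats.ttest_ind(group_4, group_14, equal_var=False)` computes the same statistic together with a p-value.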



In [9]:
# for a single standard aggregation function we can use a shorter form:
df.groupby('department_id')['order_number'].mean()

department_id
3     14.898875
4     18.847057
7     16.462291
11    17.419048
12    19.215747
13    16.473586
14    13.120454
16    18.867191
19    14.369608
Name: order_number, dtype: float64

#### Applying multiple statistics, such as `min` and `max`, at once

In [10]:
df.groupby('department_id').agg({'order_number': ['mean', 'min', 'max']})

department_id,order_number (mean),order_number (min),order_number (max)
3,14.898875,1,99
4,18.847057,1,99
7,16.462291,1,99
11,17.419048,1,71
12,19.215747,1,99
13,16.473586,1,99
14,13.120454,1,99
16,18.867191,1,99
19,14.369608,1,99
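
Passing a list of statistics to `agg()` produces a two-level column header (the column name on top, the statistic below). A minimal sketch of one way to flatten it into single-level names follows; the toy data and the `stats` variable are assumptions made for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Instacart dataset)
toy = pd.DataFrame({
    'department_id': [3, 3, 4],
    'order_number': [1, 5, 2],
})

stats = toy.groupby('department_id').agg({'order_number': ['mean', 'min', 'max']})

# Each column label is a tuple such as ('order_number', 'mean');
# join the two levels into one string per column.
stats.columns = ['_'.join(col) for col in stats.columns]
print(stats)
```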


After you’ve grouped and aggregated your data, it’s time to create flags for that data. In this example, you’ll be creating a loyalty flag column in your `ords_prods_merge` dataframe.  

To create your flag, you’ll need some criteria. You can use the following:

- If the maximum number of orders a user has made is over 40, the customer is labeled a “Loyal customer.”
- If the maximum number of orders a user has made is over 10 but less than or equal to 40, the customer is labeled a “Regular customer.”
- If the maximum number of orders a user has made is less than or equal to 10, the customer is labeled a “New customer.”

#### `agg()` vs `transform()`  

The main difference is that `agg()` reduces the size of the dataframe, while `transform()` keeps the original size.  
Think of `agg()` as a way to create the equivalent of an Excel pivot table.

Explanation:

The `agg()` function calculates a summary statistic for each group of data defined in the `groupby()` call. The resulting dataframe is smaller because the data were aggregated to one row per group.  

The `transform()` function also calculates a summary statistic for each group, but it doesn't aggregate: it "puts" the calculated statistic in every row of the group, leaving the original dataframe size intact. 
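
The difference in output size can be seen on a small toy dataframe. This is only a sketch with made-up values; the variable names `aggregated` and `transformed` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Instacart dataset)
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'order_number': [1, 2, 1, 2, 3],
})

# agg(): one value per group, indexed by user_id
aggregated = toy.groupby('user_id')['order_number'].agg('max')

# transform(): one value per original row, aligned with toy's index
transformed = toy.groupby('user_id')['order_number'].transform('max')

print(aggregated.shape)   # one entry per user
print(transformed.shape)  # one entry per row of toy
```

Because the transformed result keeps the original index, it can be assigned directly as a new column of the same dataframe.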

In [11]:
# Split the data into groups based on the “user_id” column
df.groupby('user_id')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x16c3a3dd0>

In [12]:
# Apply the transform() function on the “order_number” column to generate the maximum orders for each user
df.groupby('user_id')['order_number'].transform('max')
# Alternative: pass NumPy's max function instead of the string 'max':
# ords_prods_merge.groupby('user_id')['order_number'].transform(np.max)

0          8
1         32
2         11
3         30
4         62
          ..
999995    38
999996    20
999997    18
999998    17
999999    96
Name: order_number, Length: 1000000, dtype: int64

In [13]:
# let's save the result in a new column
df['max_order'] = df.groupby('user_id')['order_number'].transform('max')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['max_order'] = df.groupby('user_id')['order_number'].transform('max')
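
The warning above appears because `df` was created by slicing `ords_prods_merge`, so pandas cannot tell whether the assignment is meant to affect the original dataframe as well. One common way to avoid it is to take an explicit copy of the slice before adding columns. The sketch below uses a made-up toy dataframe; `source` and `subset` are hypothetical names.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Instacart dataset)
source = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'order_number': [1, 2, 1, 5],
})

# .copy() makes the subset an independent dataframe,
# so adding a column to it is unambiguous.
subset = source[:3].copy()
subset['max_order'] = subset.groupby('user_id')['order_number'].transform('max')

print(subset)
print('max_order' in source.columns)  # the original dataframe is left untouched
```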


In [14]:
df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_name,aisle_id,department_id,prices,price_label,busiest_days,busiest_period_of_day,max_order
0,2,33120,1,1,202279,3,5,9,8.0,Organic Egg Whites,86,16,11.3,Mid-Range,regular_days,average_orders,8
1,26,33120,5,0,153404,2,0,16,7.0,Organic Egg Whites,86,16,11.3,Mid-Range,busiest_days,most_orders,32
2,120,33120,13,0,23750,11,6,8,10.0,Organic Egg Whites,86,16,11.3,Mid-Range,regular_days,average_orders,11
3,327,33120,5,1,58707,21,6,9,8.0,Organic Egg Whites,86,16,11.3,Mid-Range,regular_days,average_orders,30
4,390,33120,28,1,166654,48,0,12,9.0,Organic Egg Whites,86,16,11.3,Mid-Range,busiest_days,most_orders,62


In [15]:
# remove the limit on the number of rows pandas displays in Jupyter (None = show all rows)
pd.options.display.max_rows = None

In [16]:
df.sort_values('user_id', ascending=True).head(20)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_name,aisle_id,department_id,prices,price_label,busiest_days,busiest_period_of_day,max_order
901005,2398795,13176,4,0,1,2,3,7,15.0,Bag of Organic Bananas,24,4,10.3,Mid-Range,slowest_days,average_orders,5
682851,431534,13176,8,1,1,5,4,15,28.0,Bag of Organic Bananas,24,4,10.3,Mid-Range,slowest_days,most_orders,5
181401,1199898,33754,13,0,2,6,2,9,13.0,Total 2% with Strawberry Lowfat Greek Strained...,120,16,11.8,Mid-Range,regular_days,average_orders,14
185909,1718559,33754,10,1,2,9,2,9,8.0,Total 2% with Strawberry Lowfat Greek Strained...,120,16,11.8,Mid-Range,regular_days,average_orders,14
183113,1402090,33754,8,1,2,11,1,10,30.0,Total 2% with Strawberry Lowfat Greek Strained...,120,16,11.8,Mid-Range,busiest_days,most_orders,14
178263,839880,33754,8,1,2,14,3,10,13.0,Total 2% with Strawberry Lowfat Greek Strained...,120,16,11.8,Mid-Range,slowest_days,most_orders,14
198845,3186735,33754,15,1,2,12,1,9,28.0,Total 2% with Strawberry Lowfat Greek Strained...,120,16,11.8,Mid-Range,busiest_days,average_orders,14
875405,2168274,13176,12,0,2,1,2,11,-1.0,Bag of Organic Bananas,24,4,10.3,Mid-Range,regular_days,most_orders,14
169859,2037211,1819,1,0,3,4,2,18,20.0,All Natural No Stir Creamy Almond Butter,88,13,11.5,Mid-Range,regular_days,average_orders,12
465029,3002854,21903,3,1,3,3,3,16,21.0,Organic Baby Spinach,123,4,8.2,Mid-Range,slowest_days,most_orders,12


#### Deriving Columns with `apply()` and `loc`  

With your new column ready to go, all that’s left is to create a flag that assigns a “loyalty” label to each user ID based on its corresponding max order value.

In [17]:
# function to calculate the level of loyalty from the maximum number of orders
def loyalty_level(num_of_orders):
    if num_of_orders <= 10:
        return 'new customer'
    elif num_of_orders <= 40:
        return 'regular customer'
    elif num_of_orders > 40:
        return 'loyal customer'
    else:
        # reached only when every comparison is False, e.g. for NaN
        return 'loyalty level not found'

In [18]:
df['loyalty_flag'] = df['max_order'].apply(loyalty_level)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['loyalty_flag'] = df['max_order'].apply(loyalty_level)


In [19]:
# Equivalent alternative using loc (not run here):
# df.loc[df['max_order'] <= 10, 'loyalty_flag'] = 'new customer'
# df.loc[(df['max_order'] > 10) & (df['max_order'] <= 40), 'loyalty_flag'] = 'regular customer'
# df.loc[df['max_order'] > 40, 'loyalty_flag'] = 'loyal customer'
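
A third option, not used in this notebook, is `pd.cut`, which assigns labels from a list of bin edges in a single vectorized call. The sketch below assumes the same thresholds (10 and 40) and runs on made-up boundary values rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy data covering the boundary values of each rule (hypothetical values)
toy = pd.DataFrame({'max_order': [5, 10, 11, 40, 41, 99]})

# Bins are right-closed by default: (-inf, 10], (10, 40], (40, inf)
toy['loyalty_flag'] = pd.cut(
    toy['max_order'],
    bins=[-np.inf, 10, 40, np.inf],
    labels=['new customer', 'regular customer', 'loyal customer'],
)
print(toy)
```

Unlike the `apply()` version, the resulting column has a categorical dtype rather than plain strings.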

In [20]:
df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_name,aisle_id,department_id,prices,price_label,busiest_days,busiest_period_of_day,max_order,loyalty_flag
0,2,33120,1,1,202279,3,5,9,8.0,Organic Egg Whites,86,16,11.3,Mid-Range,regular_days,average_orders,8,new customer
1,26,33120,5,0,153404,2,0,16,7.0,Organic Egg Whites,86,16,11.3,Mid-Range,busiest_days,most_orders,32,regular customer
2,120,33120,13,0,23750,11,6,8,10.0,Organic Egg Whites,86,16,11.3,Mid-Range,regular_days,average_orders,11,regular customer
3,327,33120,5,1,58707,21,6,9,8.0,Organic Egg Whites,86,16,11.3,Mid-Range,regular_days,average_orders,30,regular customer
4,390,33120,28,1,166654,48,0,12,9.0,Organic Egg Whites,86,16,11.3,Mid-Range,busiest_days,most_orders,62,loyal customer


In [21]:
df['loyalty_flag'].value_counts()

loyalty_flag
regular customer    466784
loyal customer      329051
new customer        204165
Name: count, dtype: int64

In [22]:
df[['user_id', 'max_order', 'loyalty_flag']].sort_values('max_order', ascending=False).head(60)

Unnamed: 0,user_id,max_order,loyalty_flag
935701,121860,99,loyal customer
546537,119835,99,loyal customer
219766,172294,99,loyal customer
546593,191820,99,loyal customer
320521,112044,99,loyal customer
546615,172806,99,loyal customer
546620,44182,99,loyal customer
997476,38720,99,loyal customer
546657,5449,99,loyal customer
219720,8664,99,loyal customer


Let's verify the distribution of the "loyalty_flag" variable, including any missing values:

In [29]:
df['loyalty_flag'].value_counts(dropna = False)

loyalty_flag
regular customer    466784
loyal customer      329051
new customer        204165
Name: count, dtype: int64

dtype('O')