# 4.8: Grouping Data & Aggregating Variables

### Points for this Script
1. Imports & Set up
2. Grouping Data
3. Flagging loyal customers
4. Comparing spending habits
5. Flagging customer frequency
6. Exports

### 1. Imports

In [None]:
# Import libraries

import pandas as pd
import numpy as np
import os

In [None]:
# Import dataframes

path = r'C:\Users\walls\Documents\Coding\Data Analysis\CareerFoundry\Data Immersion A4\Instacart Basket Analysis 01-25'
df_op = pd.read_pickle(os.path.join(path, 'Data','Prepared Data', 'orders_products_merged_derived.pkl'))

In [None]:
df_op.head()

In [None]:
df_op.shape

### 2. Grouping Data

In [None]:
# grouping data by the "department_id" column with a mean aggregate of the "order_count" column:
# the mean order_count across all rows in each department

df_op.groupby('department_id').agg({'order_count': ['mean']})

In [None]:
# Subset: the first 1,000,000 rows of the df_op dataframe

df_op_subset = df_op[:1000000]

In [None]:
# Calculating the same grouped mean on the subset to compare with the full dataframe

df_op_subset.groupby('department_id').agg({'order_count': ['mean']})

##### Question: How do the results for the entire dataframe differ from those of the subset?

##### Most departments have a higher mean in the subset than in the full dataframe, while a few have a lower mean.
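
A minimal sketch of one way to line up the two results side by side. The `toy` frame and its values are hypothetical stand-ins for `df_op`, chosen only to show the mechanics; the slice of the first rows plays the role of `df_op_subset`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Instacart files
toy = pd.DataFrame({
    'department_id': [1, 1, 2, 2, 1, 2],
    'order_count':   [2, 4, 10, 20, 30, 6],
})

# Grouped mean on the full frame and on the first four rows only
full_mean = toy.groupby('department_id')['order_count'].mean()
subset_mean = toy[:4].groupby('department_id')['order_count'].mean()

# Align both results on department_id and compute the gap
comparison = pd.DataFrame({'full': full_mean, 'subset': subset_mean})
comparison['difference'] = comparison['subset'] - comparison['full']
print(comparison)
```

Because a slice of the first rows is not a random sample, the gap may reflect row order rather than a real pattern in the data.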

### 3. Flagging Customers

In [None]:
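# creating a new column, max_order: group by user_id and broadcast each user's maximum order_count to all of that user's rows
# (the string alias 'max' is used instead of np.max, which recent pandas versions warn about)

df_op['max_order'] = df_op.groupby(['user_id'])['order_count'].transform('max')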
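
To illustrate why `transform` is used here rather than `agg`, the sketch below runs both on a small hypothetical frame (`toy` is an assumption, not part of the project data). `agg` collapses to one row per user, while `transform` returns one value per original row, so it can be assigned straight back as a column.

```python
import pandas as pd

# Hypothetical toy data standing in for df_op
toy = pd.DataFrame({
    'user_id':     [1, 1, 1, 2, 2],
    'order_count': [1, 2, 3, 1, 2],
})

# agg: one value per group (index is user_id)
per_user = toy.groupby('user_id')['order_count'].agg('max')

# transform: the group maximum repeated on every row of that group
toy['max_order'] = toy.groupby('user_id')['order_count'].transform('max')

print(per_user)
print(toy)
```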

In [None]:
df_op.head()

In [None]:
# flagging users as loyal, regular, or new based on max orders placed

df_op.loc[df_op['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'
df_op.loc[(df_op['max_order'] <= 40) & (df_op['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'
df_op.loc[df_op['max_order'] <= 10, 'loyalty_flag'] = 'New customer'
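
As an alternative sketch, the same three bands can be expressed with `pd.cut`. This is not the method used in this script. It assumes every `max_order` is at least 1, since a value of 0 would fall outside the first bin and become NaN. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical max_order values covering each boundary
toy = pd.DataFrame({'max_order': [3, 10, 11, 40, 41, 99]})

# Bins are right-inclusive by default: (0, 10], (10, 40], (40, inf]
toy['loyalty_flag'] = pd.cut(
    toy['max_order'],
    bins=[0, 10, 40, float('inf')],
    labels=['New customer', 'Regular customer', 'Loyal customer'],
)

print(toy)
```

The result is a categorical column, which can use less memory than plain strings on a frame with tens of millions of rows.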

In [None]:
# Examining Value Counts on 'loyalty_flag'

df_op['loyalty_flag'].value_counts(dropna = False)

##### Observations: 
1. There are more "regular" customers than new or loyal -- Could this group also be spending the most money?

In [None]:
# Checking the Updated Output
df_op.head(15)

#### Loyalty Flag

In [None]:
# Gathering basic stats based on product price

df_op.groupby('loyalty_flag')['prices'].describe()

##### Question: Is there a difference between the spending habits of the three types of customers?

##### Prices differ only slightly between loyal, regular, and new customers: all three groups buy products at about the same price level. Further investigation is needed to determine why. For example, are new or regular customers receiving promos that increase purchases? Is there no loyalty program, so loyal customers have no incentive to spend beyond their budget? Are any differentiated pricing tools in effect?

#### Spending Flag

In [None]:
# Displaying the average price per user_id

df_op.groupby('user_id').agg({'prices': ['mean']})

In [None]:
# Creating a column, avg_price, holding each user's average product price

df_op['avg_price'] = df_op.groupby(['user_id'])['prices'].transform('mean')

In [None]:
# flagging users as low or high spenders based on their average price spent

df_op.loc[df_op['avg_price'] < 10, 'spending_flag'] = 'Low spender'
df_op.loc[df_op['avg_price'] >= 10, 'spending_flag'] = 'High spender'

In [None]:
# Examining Value Count on 'spending_flag' Column
df_op['spending_flag'].value_counts(dropna = False)

##### Observations:
##### Low spenders outnumber high spenders. This is consistent with the data above on customer spending habits: most customers stay within a budget for what they purchase. Further investigation is needed to determine whether this is related to customer loyalty, promos, etc.
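
One caveat worth checking: `value_counts` on a row-level frame counts rows, so users with many purchases carry more weight. The sketch below, on a hypothetical `toy` frame, contrasts row counts with distinct-user counts per flag.

```python
import pandas as pd

# Hypothetical toy data: user 1 has three rows, users 2 and 3 have one each
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 3],
    'spending_flag': ['Low spender', 'Low spender', 'Low spender',
                      'High spender', 'High spender'],
})

# Counts rows per flag
row_counts = toy['spending_flag'].value_counts()

# Counts distinct users per flag
user_counts = toy.groupby('spending_flag')['user_id'].nunique()

print(row_counts)
print(user_counts)
```

In this toy case the low-spender flag has more rows but fewer users, so the two views can point in opposite directions.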

### 5. Flagging Customer Frequency

In [None]:
# Creating a column, order_frequency, holding each user's median days_since_prior_order

df_op['order_frequency'] = df_op.groupby(['user_id'])['days_since_prior_order'].transform('median')

In [None]:
df_op.head()

In [None]:
# flagging users as non-frequent, regular, or frequent based on their median days_since_prior_order

df_op.loc[df_op['order_frequency'] > 20, 'frequency_flag'] = 'Non-frequent customer'
df_op.loc[(df_op['order_frequency'] > 10) & (df_op['order_frequency'] <= 20), 'frequency_flag'] = 'Regular customer'
df_op.loc[df_op['order_frequency'] <= 10, 'frequency_flag'] = 'Frequent customer'

In [None]:
df_op['frequency_flag'].value_counts(dropna = False)

##### Observations: 

1. Although prices paid are about the same across customers, a fair number of customers order more frequently than others.
2. Basing loyalty criteria on order frequency rather than amount spent may be better for promos and marketing.
3. NaN values are present in frequency_flag.
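
A small sketch of how NaN flags can arise: when all of a user's `days_since_prior_order` values are missing, the median is NaN, every comparison against it is False, and no flag is assigned. The `toy` frame is hypothetical; whether this is the cause in `df_op` is an assumption that should be checked against the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: user 2 has only a missing value
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'days_since_prior_order': [np.nan, 7.0, np.nan, np.nan, 30.0],
})

# Median skips NaN within a group, but an all-NaN group yields NaN
toy['order_frequency'] = toy.groupby('user_id')['days_since_prior_order'].transform('median')

# Same three conditions as in the cell above
toy.loc[toy['order_frequency'] > 20, 'frequency_flag'] = 'Non-frequent customer'
toy.loc[(toy['order_frequency'] > 10) & (toy['order_frequency'] <= 20), 'frequency_flag'] = 'Regular customer'
toy.loc[toy['order_frequency'] <= 10, 'frequency_flag'] = 'Frequent customer'

# Rows that matched none of the conditions
unflagged = toy[toy['frequency_flag'].isna()]
print(unflagged)
```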

In [None]:
df_op.shape

##### Summary
1. df_op_merged now df_op
2. 6 columns added to df_op -- max_order, loyalty_flag, avg_price, spending_flag, order_frequency, frequency_flag
3. df_op shape (32404859, 24)

### 6. Exports

In [None]:
# Exporting df as a pickle
df_op.to_pickle(os.path.join(path, 'Data', 'Prepared Data', 'ords_prods_merge_agg.pkl'))