**4.10 Coding Etiquette & Excel Reporting - Task - Part 1**

Content:
- #01 Task 1 - Importing the most up-to-date dataset
- #02 Task 2 - Consider any security implications of the data and address any PII data
- #03 Task 3 - Create a regional segmentation of the data 
- #04 Task 4 - Create an exclusion flag for low-activity customers and exclude them from the data
- #05 Task 5 - Create a profiling variable based on age, income, certain goods in the 'department_id' column, and number of dependents

***

#01 **Task 1** - Import the dataset 

***

In [2]:
#import libraries
import pandas as pd
import numpy as np
import os

In [3]:
#set path
path=r'C:\Users\EliteMini HX90\OneDrive\Documents\CareerFoundry\Instacart Project Analysis'

In [4]:
#import dataset
ords_prods_merge = pd.read_pickle(os.path.join(path,'02_Data','02_Prepared_Data','20231107_orders_products_merged_new.pkl'))

***

#02 **Task 2** - Consider any security implications and address any PII data

***

In [5]:
#check data
ords_prods_merge.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,...,frequency_flag,first_name,surname,gender,state,age,date_joined,n_dependants,fam_status,income
0,1,Chocolate Sandwich Cookies,61,19,5.8,3139998,138,28,6,11,...,Frequent Customer,Charles,Cox,Male,Minnesota,81,8/1/2019,1,married,49620
1,1,Chocolate Sandwich Cookies,61,19,5.8,1977647,138,30,6,17,...,Frequent Customer,Charles,Cox,Male,Minnesota,81,8/1/2019,1,married,49620
2,907,Premium Sliced Bacon,106,12,20.0,3160996,138,1,5,13,...,Frequent Customer,Charles,Cox,Male,Minnesota,81,8/1/2019,1,married,49620
3,907,Premium Sliced Bacon,106,12,20.0,2254091,138,10,5,14,...,Frequent Customer,Charles,Cox,Male,Minnesota,81,8/1/2019,1,married,49620
4,1000,Apricots,18,10,12.9,505689,138,9,6,12,...,Frequent Customer,Charles,Cox,Male,Minnesota,81,8/1/2019,1,married,49620


In [6]:
#check all column names in df
ords_prods_merge.columns

Index(['product_id', 'product_name', 'aisle_id', 'department_id', 'prices',
       'order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'add_to_cart_order',
       'reordered', 'price_label', 'busiest_day', 'busiest_days',
       'busiest_period_of_day', 'max_order', 'loyalty_flag', 'avg_price',
       'spending_flag', 'median_days_since_order', 'frequency_flag',
       'first_name', 'surname', 'gender', 'state', 'age', 'date_joined',
       'n_dependants', 'fam_status', 'income'],
      dtype='object')

The 'first_name' and 'surname' columns contain personally identifiable information (PII). Since each customer is already identified by a 'user_id', these columns are not needed for the analysis and can be deleted to help comply with GDPR regulation.

In a real-world scenario, this should be discussed with a senior colleague and/or someone responsible for data security in the organisation.

In [7]:
#delete 'first_name' and 'surname'
ords_prods_merge = ords_prods_merge.drop(columns=['first_name','surname'])

In [8]:
#check all column names in df
ords_prods_merge.columns

Index(['product_id', 'product_name', 'aisle_id', 'department_id', 'prices',
       'order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'add_to_cart_order',
       'reordered', 'price_label', 'busiest_day', 'busiest_days',
       'busiest_period_of_day', 'max_order', 'loyalty_flag', 'avg_price',
       'spending_flag', 'median_days_since_order', 'frequency_flag', 'gender',
       'state', 'age', 'date_joined', 'n_dependants', 'fam_status', 'income'],
      dtype='object')
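
The column removal above can also be written so that the cell is safe to re-run after the columns are already gone. This is a minimal sketch on toy data; the `PII_COLUMNS` list and the `drop_pii` helper are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical list of columns treated as PII in this sketch
PII_COLUMNS = ['first_name', 'surname']

def drop_pii(df, pii_columns=PII_COLUMNS):
    """Return a new frame without the listed PII columns (missing ones are ignored)."""
    return df.drop(columns=pii_columns, errors='ignore')

# Toy data standing in for the merged orders/products/customers frame
toy = pd.DataFrame({
    'user_id': [138, 709],
    'first_name': ['Charles', 'Ann'],
    'surname': ['Cox', 'Lee'],
    'state': ['Minnesota', 'Ohio'],
})

toy_clean = drop_pii(toy)
print(toy_clean.columns.tolist())
```

With `errors='ignore'`, calling `drop_pii` a second time on the cleaned frame simply returns the same columns instead of raising a `KeyError`.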

***

#03 **Task 3** - Create a regional segmentation of the data

***

In [9]:
#check frequency table for state
ords_prods_merge['state'].value_counts().sort_index()

state
Alabama                 638003
Alaska                  648495
Arizona                 653964
Arkansas                636144
California              659783
Colorado                639280
Connecticut             623022
Delaware                637024
District of Columbia    613695
Florida                 629027
Georgia                 656389
Hawaii                  632901
Idaho                   607119
Illinois                633024
Indiana                 627282
Iowa                    625493
Kansas                  637538
Kentucky                632490
Louisiana               637482
Maine                   638583
Maryland                626579
Massachusetts           646358
Michigan                630928
Minnesota               647825
Mississippi             632675
Missouri                640732
Montana                 635265
Nebraska                625813
Nevada                  636139
New Hampshire           615378
New Jersey              627692
New Mexico              654494
Ne

In [10]:
#create a for-loop list to assign each state to a region
result_1 =  []

for value in ords_prods_merge['state']:
    if value in ['Maine','New Hampshire','Vermont','Massachusetts','Rhode Island','Connecticut','New York','Pennsylvania','New Jersey']:
        result_1.append('Northeast')
    elif value in ['Wisconsin','Michigan','Illinois','Indiana','Ohio','North Dakota','South Dakota','Nebraska','Kansas','Minnesota','Iowa','Missouri']:
        result_1.append('Midwest')
    elif value in [ 'Delaware','Maryland','District of Columbia','Virginia','West Virginia','North Carolina','South Carolina','Georgia','Florida','Kentucky','Tennessee','Mississippi','Alabama','Oklahoma','Texas','Arkansas','Louisiana']:
        result_1.append('South')
    else:
        result_1.append('West')

In [11]:
#create a new column in the df to display the results of the for-loop list
ords_prods_merge['region'] = result_1

In [12]:
#print frequency of newly created column to check for any errors or missing values
ords_prods_merge['region'].value_counts(dropna=False)

region
South        10791885
West          8292913
Midwest       7597325
Northeast     5722736
Name: count, dtype: int64
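
On a frame of more than 32 million rows, a Python loop over every value is slow. A possible alternative, sketched here on a toy Series, is to build a state-to-region dictionary from the same lists and apply it with `Series.map`. The names `regions`, `state_to_region` and `toy_states` are illustrative only; the fallback to 'West' mirrors the `else` branch of the loop.

```python
import pandas as pd

# Region lists copied from the loop above; any state not listed falls back to 'West'
regions = {
    'Northeast': ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island',
                  'Connecticut', 'New York', 'Pennsylvania', 'New Jersey'],
    'Midwest': ['Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio', 'North Dakota',
                'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri'],
    'South': ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia',
              'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky',
              'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas',
              'Louisiana'],
}

# Invert the lists into a flat {state: region} lookup
state_to_region = {state: region for region, states in regions.items() for state in states}

# Toy column standing in for ords_prods_merge['state']
toy_states = pd.Series(['Maine', 'Texas', 'Ohio', 'California'], name='state')

# Unmapped states become NaN after map(), then default to 'West'
toy_region = toy_states.map(state_to_region).fillna('West')
print(toy_region.tolist())
```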

In [13]:
#create crosstab of region and spending flag to check for differences in spending habits
crosstab_region_spending_1 = pd.crosstab(ords_prods_merge['region'], ords_prods_merge['spending_flag'], margins=True, dropna = False)
#print crosstab
crosstab_region_spending_1

region,High Spender,Low Spender,All
Midwest,155975,7441350,7597325
Northeast,108225,5614511,5722736
South,209691,10582194,10791885
West,160354,8132559,8292913
All,634245,31770614,32404859


In every region, low spenders account for far more rows than high spenders.

In [14]:
#create crosstab of region and spending flag normalised for easier comparison
crosstab_region_spending_2 = pd.crosstab(ords_prods_merge['region'], ords_prods_merge['spending_flag'], normalize='index', dropna = False)
#print crosstab
crosstab_region_spending_2

region,High Spender,Low Spender
Midwest,0.02053,0.97947
Northeast,0.018911,0.981089
South,0.01943,0.98057
West,0.019336,0.980664


The share of high spenders is lowest in the Northeast (about 1.9%) and highest in the Midwest (about 2.1%), so the gap between low spenders and high spenders is widest in the Northeast and narrowest in the Midwest. However, the difference is comparable across regions, with no particular region standing out from the others.

In [15]:
#create crosstab of spending flag by region, normalised by row to show how each spending group is distributed across regions
crosstab_region_spending_3 = pd.crosstab(ords_prods_merge['spending_flag'], ords_prods_merge['region'], normalize='index', dropna = False)
#print crosstab
crosstab_region_spending_3

spending_flag,Midwest,Northeast,South,West
High Spender,0.245922,0.170636,0.330615,0.252827
Low Spender,0.234221,0.17672,0.333081,0.255977


When it comes to impact on Instacart's US market, the South holds the largest share of both high spenders and low spenders (about 33% of each). This is easily explained by the fact that the region has the most rows in the dataset.
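
The two normalised crosstabs answer different questions, which a small toy frame makes easier to see. This is only an illustrative sketch; the data and the names `within_region` and `within_group` are made up for the example.

```python
import pandas as pd

# Toy data: 3 rows in the South, 4 rows in the West
toy = pd.DataFrame({
    'region': ['South', 'South', 'South', 'West', 'West', 'West', 'West'],
    'spending_flag': ['High Spender', 'Low Spender', 'Low Spender',
                      'High Spender', 'Low Spender', 'Low Spender', 'Low Spender'],
})

# Region as the index: each row shows how a region splits into spending groups
within_region = pd.crosstab(toy['region'], toy['spending_flag'], normalize='index')

# Spending flag as the index: each row shows how a spending group splits across regions
within_group = pd.crosstab(toy['spending_flag'], toy['region'], normalize='index')

print(within_region)
print(within_group)
```

In both tables every row sums to 1, because `normalize='index'` divides each cell by its row total.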

***

#04 **Task 4** - Create an exclusion flag for low-activity customers and exclude them from the data.

***

In [16]:
#set condition for low-activity flag
ords_prods_merge.loc[ords_prods_merge['max_order'] <5, 'activity_flag'] = 'Low Activity'


In [17]:
#print frequency of newly created column 
ords_prods_merge['activity_flag'].value_counts(dropna=False)

activity_flag
NaN             30964564
Low Activity     1440295
Name: count, dtype: int64

In [18]:
#create subset df excluding low-activity customers
ords_prods_active_only = ords_prods_merge[ords_prods_merge['activity_flag'].isnull()]

In [19]:
#print frequency of column that should have only null values
ords_prods_active_only['activity_flag'].value_counts(dropna=False)

activity_flag
NaN    30964564
Name: count, dtype: int64

In [20]:
#delete activity flag column from this subset
ords_prods_active_only = ords_prods_active_only.drop(columns=['activity_flag'])

In [21]:
ords_prods_active_only.columns

Index(['product_id', 'product_name', 'aisle_id', 'department_id', 'prices',
       'order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'add_to_cart_order',
       'reordered', 'price_label', 'busiest_day', 'busiest_days',
       'busiest_period_of_day', 'max_order', 'loyalty_flag', 'avg_price',
       'spending_flag', 'median_days_since_order', 'frequency_flag', 'gender',
       'state', 'age', 'date_joined', 'n_dependants', 'fam_status', 'income',
       'region'],
      dtype='object')

In [22]:
#create subset df including only low-activity customers
ords_prods_low_activ = ords_prods_merge[ords_prods_merge['activity_flag']=='Low Activity']

In [23]:
#print frequency of column that should have no null values
ords_prods_low_activ['activity_flag'].value_counts(dropna=False)

activity_flag
Low Activity    1440295
Name: count, dtype: int64

In [24]:
#export low activity customers sample
ords_prods_low_activ.to_pickle(os.path.join(path,'02_Data','02_Prepared_Data','ords_prods_low_activity_customers.pkl'))

In [25]:
#export sample excluding low activity customers
ords_prods_active_only.to_pickle(os.path.join(path,'02_Data','02_Prepared_Data','ords_prods_exclud_low_activity_customers.pkl'))
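
The same split can be sketched with a single boolean mask instead of a flag column that is later dropped. This is an alternative under the same threshold (fewer than 5 orders), shown on toy data; the variable names are illustrative and not taken from the notebook.

```python
import pandas as pd

# Toy data: users 1 has 3 orders in total, users 2 and 3 have 5 and 40
toy = pd.DataFrame({'user_id': [1, 1, 2, 3], 'max_order': [3, 3, 5, 40]})

# True for rows of customers with fewer than 5 orders, matching the threshold used above
is_low_activity = toy['max_order'] < 5

# .copy() makes each subset an independent frame that can be modified safely later
low_activity = toy[is_low_activity].copy()
active_only = toy[~is_low_activity].copy()

# The two subsets should add up to the original number of rows
print(len(low_activity), len(active_only), len(toy))
```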

***

#05 **Task 5** - Create a profiling variable

***

In [26]:
#check columns
ords_prods_active_only.columns

Index(['product_id', 'product_name', 'aisle_id', 'department_id', 'prices',
       'order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'add_to_cart_order',
       'reordered', 'price_label', 'busiest_day', 'busiest_days',
       'busiest_period_of_day', 'max_order', 'loyalty_flag', 'avg_price',
       'spending_flag', 'median_days_since_order', 'frequency_flag', 'gender',
       'state', 'age', 'date_joined', 'n_dependants', 'fam_status', 'income',
       'region'],
      dtype='object')

In [27]:
#create subset with only relevant columns for task
df_profile = ords_prods_active_only[['user_id','age','income','department_id','n_dependants','fam_status']]

In [28]:
#check df
df_profile.head()

Unnamed: 0,user_id,age,income,department_id,n_dependants,fam_status
0,138,81,49620,19,1,married
1,138,81,49620,19,1,married
2,138,81,49620,12,1,married
3,138,81,49620,12,1,married
4,138,81,49620,10,1,married


In [29]:
#import departments dataset to act as data dictionary
df_deps = pd.read_csv(os.path.join(path,'02_Data','02_Prepared_Data','departments_wrangled.csv'),index_col=False)

In [30]:
#setting the index to start at 1 and not 0
df_deps.index = np.arange(1, len(df_deps)+1)

In [31]:
df_deps.head()

Unnamed: 0,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol


In [32]:
#creating data dictionary
data_dict = df_deps.to_dict('index')
data_dict

{1: {'department': 'frozen'},
 2: {'department': 'other'},
 3: {'department': 'bakery'},
 4: {'department': 'produce'},
 5: {'department': 'alcohol'},
 6: {'department': 'international'},
 7: {'department': 'beverages'},
 8: {'department': 'pets'},
 9: {'department': 'dry goods pasta'},
 10: {'department': 'bulk'},
 11: {'department': 'personal care'},
 12: {'department': 'meat seafood'},
 13: {'department': 'pantry'},
 14: {'department': 'breakfast'},
 15: {'department': 'canned goods'},
 16: {'department': 'dairy eggs'},
 17: {'department': 'household'},
 18: {'department': 'babies'},
 19: {'department': 'snacks'},
 20: {'department': 'deli'},
 21: {'department': 'missing'}}

In [33]:
#check statistics for age
df_profile['age'].describe()

count    3.096456e+07
mean     4.946803e+01
std      1.848528e+01
min      1.800000e+01
25%      3.300000e+01
50%      4.900000e+01
75%      6.500000e+01
max      8.100000e+01
Name: age, dtype: float64

In [34]:
#set condition for young adults under 30 years old
ords_prods_active_only.loc[ords_prods_active_only['age'] <30, 'age_group'] = 'Young Adult'


In [35]:
#set condition for older adults at or over 60 years old
ords_prods_active_only.loc[ords_prods_active_only['age'] >=60, 'age_group'] = 'Older Adult'

In [36]:
#set condition for adults between 30 and 59 years old
ords_prods_active_only.loc[(ords_prods_active_only['age'] >=30)&(ords_prods_active_only['age']<60), 'age_group'] = 'Adult'

In [37]:
ords_prods_active_only['age_group'].value_counts(dropna=False)

age_group
Adult          14572457
Older Adult    10574504
Young Adult     5817603
Name: count, dtype: int64
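
The three age conditions can also be expressed in one step with `pd.cut`. A minimal sketch on toy ages, using the same cut-offs (under 30, 30 to 59, 60 and over); note that `pd.cut` returns a categorical column rather than plain strings, which differs from the cells above.

```python
import numpy as np
import pandas as pd

# Toy ages chosen to sit on both sides of each boundary
ages = pd.Series([18, 29, 30, 59, 60, 81], name='age')

# right=False makes the bins left-closed: [0, 30), [30, 60), [60, inf)
age_group = pd.cut(
    ages,
    bins=[0, 30, 60, np.inf],
    right=False,
    labels=['Young Adult', 'Adult', 'Older Adult'],
)
print(age_group.tolist())
```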

In [38]:
#check statistics for income
df_profile['income'].describe()

count    3.096456e+07
mean     9.967587e+04
std      4.314187e+04
min      2.590300e+04
25%      6.729200e+04
50%      9.676500e+04
75%      1.281020e+05
max      5.939010e+05
Name: income, dtype: float64

In [39]:
df_profile['income'].median()

96765.0

In [40]:
#set condition for high income over or equal to 128102
ords_prods_active_only.loc[ords_prods_active_only['income'] >=128102, 'income_group'] = 'High Income'


In [41]:
#set condition for low income below 67292
ords_prods_active_only.loc[ords_prods_active_only['income'] <67292, 'income_group'] = 'Low Income'

In [42]:
#set condition for average income between 67292 and 128102
ords_prods_active_only.loc[(ords_prods_active_only['income'] >=67292)&(ords_prods_active_only['income']<128102), 'income_group'] = 'Average Income'

In [43]:
ords_prods_active_only['income_group'].value_counts(dropna=False)

income_group
Average Income    15482298
High Income        7741261
Low Income         7741005
Name: count, dtype: int64
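
The cut-offs 67292 and 128102 are the 25th and 75th percentiles from the `describe()` output above. Rather than typing them in by hand, they could be computed from the data, as in this sketch on toy incomes. The use of `np.select` and the variable names are assumptions of the example, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy incomes standing in for the 'income' column
income = pd.Series([30000, 50000, 70000, 90000, 110000, 130000, 150000, 170000], name='income')

# Quartile boundaries computed from the data instead of hard-coded
low_cut = income.quantile(0.25)
high_cut = income.quantile(0.75)

# Below the 25th percentile is low, at or above the 75th is high, the rest is average
conditions = [income < low_cut, income >= high_cut]
labels = ['Low Income', 'High Income']
income_group = pd.Series(
    np.select(conditions, labels, default='Average Income'),
    index=income.index,
    name='income_group',
)
print(income_group.value_counts())
```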

In [44]:
#check statistics for number of dependants
df_profile['n_dependants'].describe()

count    3.096456e+07
mean     1.501819e+00
std      1.118896e+00
min      0.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      3.000000e+00
max      3.000000e+00
Name: n_dependants, dtype: float64

In [45]:
#set condition for no dependants
ords_prods_active_only.loc[ords_prods_active_only['n_dependants'] ==0, 'dependant_group'] = 'No Dependants'


In [46]:
#set condition for dependants
ords_prods_active_only.loc[ords_prods_active_only['n_dependants'] >=1, 'dependant_group'] = 'Dependants'

In [47]:
ords_prods_active_only['dependant_group'].value_counts(dropna=False)

dependant_group
Dependants       23224883
No Dependants     7739681
Name: count, dtype: int64

In [48]:
#set condition for likely to have children (based on previous purchases from the babies department)
ords_prods_active_only.loc[ords_prods_active_only['department_id'] ==18, 'children_flag'] = 'With Children'


In [49]:
ords_prods_active_only['children_flag'].value_counts(dropna=False)

children_flag
NaN              30554172
With Children      410392
Name: count, dtype: int64

In [50]:
ords_prods_active_only.columns

Index(['product_id', 'product_name', 'aisle_id', 'department_id', 'prices',
       'order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'add_to_cart_order',
       'reordered', 'price_label', 'busiest_day', 'busiest_days',
       'busiest_period_of_day', 'max_order', 'loyalty_flag', 'avg_price',
       'spending_flag', 'median_days_since_order', 'frequency_flag', 'gender',
       'state', 'age', 'date_joined', 'n_dependants', 'fam_status', 'income',
       'region', 'age_group', 'income_group', 'dependant_group',
       'children_flag'],
      dtype='object')

In [51]:
#create subset with only the relevant columns to create different profiles
df_1 = ords_prods_active_only[['user_id','gender','fam_status','age_group','income_group','dependant_group','children_flag']]

In [52]:
df_1.head()

Unnamed: 0,user_id,gender,fam_status,age_group,income_group,dependant_group,children_flag
0,138,Male,married,Older Adult,Low Income,Dependants,
1,138,Male,married,Older Adult,Low Income,Dependants,
2,138,Male,married,Older Adult,Low Income,Dependants,
3,138,Male,married,Older Adult,Low Income,Dependants,
4,138,Male,married,Older Adult,Low Income,Dependants,


In [53]:
#apply children flag to every row of a user_id with at least one purchase from the babies department
#assign() returns a new dataframe, which avoids setting values on a slice of ords_prods_active_only
has_baby_purchase = df_1['children_flag'].eq('With Children').groupby(df_1['user_id']).transform('any')
df_1 = df_1.assign(children_flag=df_1['children_flag'].where(~has_baby_purchase, 'With Children'))


In [54]:
df_1['children_flag'].value_counts(dropna=False)

children_flag
NaN              21154311
With Children     9810253
Name: count, dtype: int64
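
To illustrate how the per-customer propagation above works, here is a small sketch on toy data. `groupby(...).transform('any')` returns one value per row, set to True whenever any row of the same user is True. The column values and variable names are made up for the example.

```python
import pandas as pd

# Toy data: user 1 bought one item from department 18 (babies), user 2 bought none
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'department_id': [18, 4, 19, 4, 7],
})

# Row-level check: was this particular item from the babies department?
bought_baby_item = toy['department_id'].eq(18)

# User-level check broadcast back to every row of that user
user_has_baby_item = bought_baby_item.groupby(toy['user_id']).transform('any')

# Keep the label where the user-level check is True, NaN elsewhere
toy['children_flag'] = pd.Series('With Children', index=toy.index).where(user_has_baby_item)
print(toy)
```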

In [55]:
df_1.shape

(30964564, 7)

In [56]:
df_1.head()

Unnamed: 0,user_id,gender,fam_status,age_group,income_group,dependant_group,children_flag
0,138,Male,married,Older Adult,Low Income,Dependants,
1,138,Male,married,Older Adult,Low Income,Dependants,
2,138,Male,married,Older Adult,Low Income,Dependants,
3,138,Male,married,Older Adult,Low Income,Dependants,
4,138,Male,married,Older Adult,Low Income,Dependants,


In [57]:
# Create subset of df with only unique values for user_id column
df_2 = df_1.drop_duplicates(subset=['user_id'])

In [58]:
df_2.shape

(162631, 7)
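
Keeping one row per 'user_id' is only safe if the profiling columns are identical on every row of a customer. A quick way to check this, sketched on toy data, is to count distinct values per user. The names `distinct_per_user` and `is_consistent` are illustrative; note that `nunique()` ignores missing values by default.

```python
import pandas as pd

# Toy data: two users, each with two order rows
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'age_group': ['Adult', 'Adult', 'Young Adult', 'Young Adult'],
    'income_group': ['Low Income', 'Low Income', 'High Income', 'High Income'],
})

# Number of distinct values per user for each attribute
distinct_per_user = toy.groupby('user_id').nunique()

# True when every attribute has exactly one value per user
is_consistent = bool((distinct_per_user == 1).all().all())

# One row per customer, keeping the first occurrence
customers = toy.drop_duplicates(subset=['user_id'])
print(is_consistent, customers.shape)
```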

In [59]:
df_2.head()

Unnamed: 0,user_id,gender,fam_status,age_group,income_group,dependant_group,children_flag
0,138,Male,married,Older Adult,Low Income,Dependants,
148,709,Female,married,Older Adult,High Income,Dependants,
398,777,Female,married,Adult,Low Income,Dependants,
511,825,Male,living with parents and siblings,Young Adult,Low Income,Dependants,
544,910,Female,divorced/widowed,Older Adult,Low Income,No Dependants,


In [60]:
df_2.describe()

Unnamed: 0,user_id,gender,fam_status,age_group,income_group,dependant_group,children_flag
count,162631,162631,162631,162631,162631,162631,30230
unique,162631,2,4,3,3,2,1
top,138,Male,married,Adult,Average Income,Dependants,With Children
freq,1,81998,114296,76482,76250,121904,30230


In [61]:
#drop gender column to facilitate profiling combinations
df_2 = df_2.drop(columns=['gender'])

In [62]:
df_2.head()

Unnamed: 0,user_id,fam_status,age_group,income_group,dependant_group,children_flag
0,138,married,Older Adult,Low Income,Dependants,
148,709,married,Older Adult,High Income,Dependants,
398,777,married,Adult,Low Income,Dependants,
511,825,living with parents and siblings,Young Adult,Low Income,Dependants,
544,910,divorced/widowed,Older Adult,Low Income,No Dependants,


In [63]:
df_2['fam_status'].value_counts(dropna=False)

fam_status
married                             114296
single                               26896
divorced/widowed                     13831
living with parents and siblings      7608
Name: count, dtype: int64

In [64]:
df_2['age_group'].value_counts(dropna=False)

age_group
Adult          76482
Older Adult    55773
Young Adult    30376
Name: count, dtype: int64

In [65]:
df_2['income_group'].value_counts(dropna=False)

income_group
Average Income    76250
Low Income        48270
High Income       38111
Name: count, dtype: int64

In [66]:
df_2['dependant_group'].value_counts(dropna=False)

dependant_group
Dependants       121904
No Dependants     40727
Name: count, dtype: int64

In [67]:
df_2['children_flag'].value_counts(dropna=False)

children_flag
NaN              132401
With Children     30230
Name: count, dtype: int64

In [68]:
#create customer_profile conditions

df_2.loc[(df_2['age_group']=='Adult') & (df_2['income_group']=='Average Income'), 'customer_profile'] = 'Adult with Average Income'
df_2.loc[(df_2['age_group']=='Adult') & (df_2['income_group']=='Average Income') & (df_2['dependant_group']=='Dependants')& (df_2['children_flag']=='With Children'), 'customer_profile'] = 'Adult with Average Income and Children'
df_2.loc[(df_2['age_group']=='Older Adult') & (df_2['income_group']=='Average Income'), 'customer_profile'] = 'Older Adult with Average Income'
df_2.loc[(df_2['age_group']=='Young Adult') & (df_2['income_group']=='Average Income'), 'customer_profile'] = 'Young Adult with Average Income'
df_2.loc[(df_2['age_group']=='Young Adult') & (df_2['income_group']=='Average Income') & (df_2['dependant_group']=='Dependants') & (df_2['children_flag']=='With Children'), 'customer_profile'] = 'Young Adult with Average Income and Children'

df_2.loc[(df_2['age_group']=='Adult') & (df_2['income_group']=='Low Income'), 'customer_profile'] = 'Adult with Low Income'
df_2.loc[(df_2['age_group']=='Adult') & (df_2['income_group']=='Low Income') & (df_2['dependant_group']=='Dependants') & (df_2['children_flag']=='With Children'), 'customer_profile'] = 'Adult with Low Income and Children'
df_2.loc[(df_2['age_group']=='Older Adult') & (df_2['income_group']=='Low Income'), 'customer_profile'] = 'Older Adult with Low Income'
df_2.loc[(df_2['age_group']=='Young Adult') & (df_2['income_group']=='Low Income'), 'customer_profile'] = 'Young Adult with Low Income'
df_2.loc[(df_2['age_group']=='Young Adult') & (df_2['income_group']=='Low Income') & (df_2['dependant_group']=='Dependants') & (df_2['children_flag']=='With Children'), 'customer_profile'] = 'Young Adult with Low Income and Children'

df_2.loc[(df_2['age_group']=='Adult') & (df_2['income_group']=='High Income'),  'customer_profile'] = 'Adult with High Income'
df_2.loc[(df_2['age_group']=='Adult') & (df_2['income_group']=='High Income') & (df_2['children_flag']=='With Children'), 'customer_profile'] = 'Adult with High Income and Children'
df_2.loc[(df_2['age_group']=='Older Adult') & (df_2['income_group']=='High Income'), 'customer_profile'] = 'Older Adult with High Income'
df_2.loc[(df_2['age_group']=='Young Adult') & (df_2['income_group']=='High Income'), 'customer_profile'] = 'Young Adult with High Income'
df_2.loc[(df_2['age_group']=='Young Adult') & (df_2['income_group']=='High Income') & (df_2['dependant_group']=='Dependants') & (df_2['children_flag']=='With Children'), 'customer_profile'] = 'Young Adult with High Income and Children'



The 'children_flag' is combined with the 'dependant_group' so that the "and Children" profiles more accurately represent customers with children in their household, rather than customers who bought items from the 'babies' department as gifts, for example.

In [69]:
df_2['customer_profile'].value_counts(dropna=False)

customer_profile
Adult with Average Income                       30164
Older Adult with Average Income                 25507
Adult with Low Income                           20562
Older Adult with High Income                    20068
Adult with High Income                          14197
Young Adult with Low Income                     13255
Young Adult with Average Income                 12734
Older Adult with Low Income                     10198
Adult with Average Income and Children           5561
Adult with High Income and Children              3604
Adult with Low Income and Children               2394
Young Adult with Average Income and Children     2284
Young Adult with Low Income and Children         1861
Young Adult with High Income                      204
Young Adult with High Income and Children          38
Name: count, dtype: int64
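
Because every profile label follows the pattern "age group with income group", optionally followed by "and Children", the labels could also be built by string concatenation. The sketch below assumes the "and Children" suffix applies only to Adults and Young Adults who are in the 'Dependants' group and carry the children flag; this is a simplification of the conditions above, not a line-for-line equivalent, and the toy data is made up.

```python
import numpy as np
import pandas as pd

# Toy customers covering the main cases
toy = pd.DataFrame({
    'age_group': ['Adult', 'Older Adult', 'Young Adult', 'Adult'],
    'income_group': ['Average Income', 'Low Income', 'High Income', 'Low Income'],
    'dependant_group': ['Dependants', 'Dependants', 'Dependants', 'No Dependants'],
    'children_flag': ['With Children', 'With Children', np.nan, 'With Children'],
})

# Base label shared by all profiles
base = toy['age_group'] + ' with ' + toy['income_group']

# Assumed rule for the suffix (see lead-in): not Older Adult, has dependants, flagged
has_children = (
    toy['age_group'].isin(['Adult', 'Young Adult'])
    & toy['dependant_group'].eq('Dependants')
    & toy['children_flag'].eq('With Children')
)

# Keep the base label unless the suffix rule applies
toy['customer_profile'] = base.where(~has_children, base + ' and Children')
print(toy['customer_profile'].tolist())
```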

In [70]:
df_2.head()

Unnamed: 0,user_id,fam_status,age_group,income_group,dependant_group,children_flag,customer_profile
0,138,married,Older Adult,Low Income,Dependants,,Older Adult with Low Income
148,709,married,Older Adult,High Income,Dependants,,Older Adult with High Income
398,777,married,Adult,Low Income,Dependants,,Adult with Low Income
511,825,living with parents and siblings,Young Adult,Low Income,Dependants,,Young Adult with Low Income
544,910,divorced/widowed,Older Adult,Low Income,No Dependants,,Older Adult with Low Income


In [71]:
df_2.shape

(162631, 7)

In [72]:
#merge dfs back together
ords_prods_profile = ords_prods_active_only.merge(df_2,on = ['user_id','fam_status','age_group','income_group','dependant_group'])

In [73]:
ords_prods_profile.shape

(30964564, 37)

In [74]:
ords_prods_profile.columns

Index(['product_id', 'product_name', 'aisle_id', 'department_id', 'prices',
       'order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'add_to_cart_order',
       'reordered', 'price_label', 'busiest_day', 'busiest_days',
       'busiest_period_of_day', 'max_order', 'loyalty_flag', 'avg_price',
       'spending_flag', 'median_days_since_order', 'frequency_flag', 'gender',
       'state', 'age', 'date_joined', 'n_dependants', 'fam_status', 'income',
       'region', 'age_group', 'income_group', 'dependant_group',
       'children_flag_x', 'children_flag_y', 'customer_profile'],
      dtype='object')

In [75]:
#check now duplicated column 1 to find the most up to date one
ords_prods_profile['children_flag_x'].value_counts(dropna=False)

children_flag_x
NaN              30554172
With Children      410392
Name: count, dtype: int64

In [76]:
#check now duplicated column 2 to find the most up to date one
ords_prods_profile['children_flag_y'].value_counts(dropna=False)

children_flag_y
NaN              21154311
With Children     9810253
Name: count, dtype: int64

In [77]:
#delete children_flag_x (before it was applied to all orders by same customer)

ords_prods_profile=ords_prods_profile.drop(columns='children_flag_x')

In [78]:
#rename children_flag_y

ords_prods_profile=ords_prods_profile.rename(columns={'children_flag_y':'children_flag'})

In [79]:
#check columns after deleting and renaming
ords_prods_profile.columns

Index(['product_id', 'product_name', 'aisle_id', 'department_id', 'prices',
       'order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'add_to_cart_order',
       'reordered', 'price_label', 'busiest_day', 'busiest_days',
       'busiest_period_of_day', 'max_order', 'loyalty_flag', 'avg_price',
       'spending_flag', 'median_days_since_order', 'frequency_flag', 'gender',
       'state', 'age', 'date_joined', 'n_dependants', 'fam_status', 'income',
       'region', 'age_group', 'income_group', 'dependant_group',
       'children_flag', 'customer_profile'],
      dtype='object')

In [80]:
#check df shape
ords_prods_profile.shape

(30964564, 36)
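
The duplicated 'children_flag_x' and 'children_flag_y' columns appear because 'children_flag' exists in both frames but is not a merge key. One way to avoid the extra drop and rename steps, sketched here on toy frames, is to drop the outdated column first and merge only the customer-level columns on 'user_id'. The use of `how='left'` and `validate='m:1'` is an assumption of this example; the cells above use the default inner merge.

```python
import pandas as pd

# Toy order-level frame with the outdated row-level flag
orders = pd.DataFrame({
    'user_id': [1, 1, 2],
    'product_id': [10, 11, 10],
    'children_flag': ['With Children', None, None],
})

# Toy customer-level frame with the updated flag and the profile
customers = pd.DataFrame({
    'user_id': [1, 2],
    'children_flag': ['With Children', None],
    'customer_profile': ['Adult with Low Income and Children', 'Adult with Low Income'],
})

# validate='m:1' raises an error if user_id is not unique in the customer frame
merged = (
    orders.drop(columns=['children_flag'])
    .merge(customers[['user_id', 'children_flag', 'customer_profile']],
           on='user_id', how='left', validate='m:1')
)
print(merged.columns.tolist())
```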

In [81]:
#export dataframe

ords_prods_profile.to_pickle(os.path.join(path,'02_Data','02_Prepared_Data','ords_prods_with_customer_profile.pkl'))