## Repeat Purchase Project
#### This project uses Dunnhumby data to predict the chance that existing customers will purchase again.
The approach is machine learning: most of the work here is building a data set and fitting a model to it. RFM (recency, frequency, monetary value) matters because those measures are the indicators/features we look out for.
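
As a rough illustration of what RFM features could look like on this data, here is a minimal sketch. It assumes the transaction columns shown later in this notebook (`household_key`, `BASKET_ID`, `DAY`, `SALES_VALUE`); the toy rows, the `rfm_features` helper, and the `as_of_day` parameter are hypothetical and only there for illustration.

```python
import pandas as pd

# Toy rows (made up) using the transaction_data.csv column names
toy_transactions = pd.DataFrame({
    'household_key': [1, 1, 1, 2, 2, 3],
    'BASKET_ID':     [10, 10, 11, 20, 21, 30],
    'DAY':           [1, 1, 40, 5, 90, 60],
    'SALES_VALUE':   [2.0, 3.0, 4.0, 1.5, 2.5, 10.0],
})

def rfm_features(transactions, as_of_day):
    """Return one row per household with recency, frequency and monetary value."""
    grouped = transactions.groupby('household_key')
    return pd.DataFrame({
        # days since the household's most recent purchase
        'recency': as_of_day - grouped['DAY'].max(),
        # number of distinct baskets (visits), not number of rows
        'frequency': grouped['BASKET_ID'].nunique(),
        # total spend across all rows
        'monetary': grouped['SALES_VALUE'].sum(),
    })

rfm = rfm_features(toy_transactions, as_of_day=100)
print(rfm)
```

Frequency is counted in distinct baskets rather than rows, because one basket spans several product rows.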

Link to XGBoost Tutorial on Towards Data Science:
https://towardsdatascience.com/getting-started-with-xgboost-in-scikit-learn-f69f5f470a97

In [90]:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt

### Now we load the transaction, product, and household CSVs

In [91]:
transaction_df = pd.read_csv('../Resources/dunnhumby/transaction_data.csv')
product_df = pd.read_csv('../Resources/dunnhumby/product.csv')
hh_df = pd.read_csv('../Resources/dunnhumby/hh_demographic.csv')

In [92]:
transaction_df.head()

Unnamed: 0,household_key,BASKET_ID,DAY,PRODUCT_ID,QUANTITY,SALES_VALUE,STORE_ID,RETAIL_DISC,TRANS_TIME,WEEK_NO,COUPON_DISC,COUPON_MATCH_DISC
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0
2,2375,26984851472,1,1036325,1,0.99,364,-0.3,1631,1,0.0,0.0
3,2375,26984851472,1,1082185,1,1.21,364,0.0,1631,1,0.0,0.0
4,2375,26984851472,1,8160430,1,1.5,364,-0.39,1631,1,0.0,0.0


In [93]:
product_df.head()

Unnamed: 0,PRODUCT_ID,MANUFACTURER,DEPARTMENT,BRAND,COMMODITY_DESC,SUB_COMMODITY_DESC,CURR_SIZE_OF_PRODUCT
0,25671,2,GROCERY,National,FRZN ICE,ICE - CRUSHED/CUBED,22 LB
1,26081,2,MISC. TRANS.,National,NO COMMODITY DESCRIPTION,NO SUBCOMMODITY DESCRIPTION,
2,26093,69,PASTRY,Private,BREAD,BREAD:ITALIAN/FRENCH,
3,26190,69,GROCERY,Private,FRUIT - SHELF STABLE,APPLE SAUCE,50 OZ
4,26355,69,GROCERY,Private,COOKIES/CONES,SPECIALTY COOKIES,14 OZ


In [94]:
hh_df.head()

Unnamed: 0,AGE_DESC,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,household_key
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7
2,25-34,U,25-34K,Unknown,2 Adults Kids,3,1,8
3,25-34,U,75-99K,Homeowner,2 Adults Kids,4,2,13
4,45-54,B,50-74K,Homeowner,Single Female,1,None/Unknown,16


#### Some quick bucketed summaries show what the data looks like as a whole. Let's look at the consumer data first.

In [95]:
hh_df.groupby(['AGE_DESC']).count()

Unnamed: 0_level_0,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,household_key
AGE_DESC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
19-24,46,46,46,46,46,46,46
25-34,142,142,142,142,142,142,142
35-44,194,194,194,194,194,194,194
45-54,288,288,288,288,288,288,288
55-64,59,59,59,59,59,59,59
65+,72,72,72,72,72,72,72


#### We can tell from the above data that the largest group of consumers is 45-54, with 35-44 as the runner-up. These are household counts per age bracket, so they show group size rather than spending.

In [96]:
hh_df.groupby(['INCOME_DESC']).count()

Unnamed: 0_level_0,AGE_DESC,MARITAL_STATUS_CODE,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,household_key
INCOME_DESC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
100-124K,34,34,34,34,34,34,34
125-149K,38,38,38,38,38,38,38
15-24K,74,74,74,74,74,74,74
150-174K,30,30,30,30,30,30,30
175-199K,11,11,11,11,11,11,11
200-249K,5,5,5,5,5,5,5
25-34K,77,77,77,77,77,77,77
250K+,11,11,11,11,11,11,11
35-49K,172,172,172,172,172,172,172
50-74K,192,192,192,192,192,192,192


In [97]:
hh_df.groupby(['INCOME_DESC']).count()['household_key'].sum()

801

#### The two largest income brackets are 50-74K and 35-49K, and a large share of our consumers are under 50K a year. We could compare income to number of people per house, kids, marital status, etc., but we can leave those as features for the model we will end up building.

Before modeling, we have to organize the households into one-time purchasers and repeat purchasers. We can do this through the transaction dataframe: for each unique household key, count how many times that household occurs, gather the data for those households, and assemble a numpy array that can be passed into an XGBClassifier. The classifier is given the behavior of customers who have purchased multiple times, and is then used to predict on the households who have purchased only once to see which of them are most likely to return. The XGBClassifier's predict_proba method returns values between 0 and 1, which reflect a probability or confidence of repurchase (predict on its own returns class labels). A follow-up step can then process the predictions and report the households that exceed whatever confidence threshold we choose.
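
The split and the threshold step described above can be sketched as follows. This is a minimal sketch with made-up data: the `households_above_threshold` helper and the probability values are hypothetical, standing in for something like the positive-class column of a classifier's `predict_proba` output.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): one row per purchased product
toy_transactions = pd.DataFrame({
    'household_key': [1, 1, 2, 3, 3, 3],
    'BASKET_ID':     [10, 11, 20, 30, 30, 31],
})

# Visits per household = number of distinct baskets, not number of rows
visits = toy_transactions.groupby('household_key')['BASKET_ID'].nunique()
repeat_households = visits[visits > 1].index.tolist()
one_time_households = visits[visits == 1].index.tolist()

def households_above_threshold(household_keys, probabilities, threshold=0.5):
    """Return the household keys whose predicted probability meets the threshold."""
    household_keys = np.asarray(household_keys)
    probabilities = np.asarray(probabilities)
    return household_keys[probabilities >= threshold].tolist()

# Made-up probabilities standing in for classifier output
likely = households_above_threshold([2, 7, 9], [0.2, 0.8, 0.55], threshold=0.5)
print(repeat_households, one_time_households, likely)
```

Keeping the threshold as a parameter lets us ask for more or less confidence without retraining anything.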

In [98]:
transaction_df.head()

Unnamed: 0,household_key,BASKET_ID,DAY,PRODUCT_ID,QUANTITY,SALES_VALUE,STORE_ID,RETAIL_DISC,TRANS_TIME,WEEK_NO,COUPON_DISC,COUPON_MATCH_DISC
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0
2,2375,26984851472,1,1036325,1,0.99,364,-0.3,1631,1,0.0,0.0
3,2375,26984851472,1,1082185,1,1.21,364,0.0,1631,1,0.0,0.0
4,2375,26984851472,1,8160430,1,1.5,364,-0.39,1631,1,0.0,0.0


In [99]:
#number of unique households
len(transaction_df['household_key'].unique())

2500

In [122]:
#number of actual transactions
len(transaction_df.groupby(['BASKET_ID']).mean())

276484

In [115]:
'''
Because of the way the data was collected, a household shows up once per purchased product, so many rows are
really part of the same single transaction. BASKET_ID is unique to a transaction, and household_key does not
change when we take the average within a basket, so we can group by BASKET_ID to consolidate each transaction.
Counting baskets per household_key then gives a true count of the number of transactions per household.
This is all to determine whether households have come back to purchase.
We will also want to separate purchases by store. If we consolidate all purchases and assume they are all with a
single store or brand, we muddy the waters and ignore the fact that customers make return purchases to
specific stores, which in and of itself is a decision the consumer is making.'''
transaction_df.groupby(['BASKET_ID']).mean()

Unnamed: 0_level_0,household_key,DAY,PRODUCT_ID,QUANTITY,SALES_VALUE,STORE_ID,RETAIL_DISC,TRANS_TIME,WEEK_NO,COUPON_DISC,COUPON_MATCH_DISC
BASKET_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
26984851472,2375.0,1.0,2.463398e+06,1.000000,1.182000,364.0,-0.258000,1631.0,1.0,0.0,0.0
26984851516,2375.0,1.0,3.328273e+06,1.166667,2.071667,364.0,-0.543333,1642.0,1.0,0.0,0.0
26984896261,1364.0,1.0,9.160190e+05,1.000000,2.274000,31742.0,-0.436000,1520.0,1.0,0.0,0.0
26984905972,1130.0,1.0,9.686606e+05,1.800000,0.510000,31642.0,-0.416000,1340.0,1.0,0.0,0.0
26984945254,1173.0,1.0,9.599073e+05,1.333333,1.176667,412.0,0.000000,2042.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
42302712006,2262.0,711.0,4.290221e+06,1.000000,2.445000,446.0,-0.447500,1652.0,102.0,0.0,0.0
42302712189,1369.0,711.0,2.661998e+06,1.200000,6.728000,446.0,-0.220000,1730.0,102.0,0.0,0.0
42302712298,2225.0,711.0,3.756731e+06,1.000000,3.700000,446.0,-0.035455,1754.0,102.0,0.0,0.0
42305362497,1598.0,711.0,7.094158e+06,1.200000,1.122000,3228.0,-0.444000,1516.0,102.0,0.0,0.0
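
Taking the mean of every column works for `household_key` and `STORE_ID` because they are constant within a basket, but it also averages `PRODUCT_ID`, which has no meaning. A possible alternative, sketched here on toy rows with partly made-up values, is to name one aggregation per column. The output column names `items` and `basket_value` are hypothetical.

```python
import pandas as pd

# Toy rows: IDs follow the head() output above, values are partly made up
toy_transactions = pd.DataFrame({
    'household_key': [2375, 2375, 2375, 1364, 1364],
    'BASKET_ID':     [26984851472, 26984851472, 26984851516,
                      26984896261, 26984896261],
    'DAY':           [1, 1, 1, 1, 1],
    'STORE_ID':      [364, 364, 364, 31742, 31742],
    'QUANTITY':      [1, 1, 2, 1, 3],
    'SALES_VALUE':   [1.39, 0.82, 2.50, 0.99, 1.21],
})

# One row per basket: keys are taken as-is, amounts are summed
baskets = toy_transactions.groupby('BASKET_ID').agg(
    household_key=('household_key', 'first'),
    STORE_ID=('STORE_ID', 'first'),
    DAY=('DAY', 'first'),
    items=('QUANTITY', 'sum'),
    basket_value=('SALES_VALUE', 'sum'),
)

# True number of transactions per household
baskets_per_household = baskets.groupby('household_key').size()
print(baskets)
print(baskets_per_household)
```

This keeps the ID columns as integers and gives a per-basket total that could later feed a monetary feature.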


In [130]:
#basket counts per store, based on STORE_ID
store_count_df = transaction_df.groupby(['BASKET_ID']).mean().groupby(['STORE_ID']).count()
#We just want to make sure there are enough data points to train our model with. We can feed the same model
#multiple stores' data if we assume that customer decisions can be classified the same way no matter the store.
#The only reason we'd separate the stores into multiple data sets is that it assigns each purchase to its own
#store: if a customer shopped only twice, at two different stores, they didn't repurchase at the same store
#and thus can't be classified as a returning customer.
#Keep the stores with more than 500 baskets, sorted by basket count.
store_count_df.loc[store_count_df['DAY'] > 500,:].sort_values(by=['DAY'])

Unnamed: 0_level_0,household_key,DAY,PRODUCT_ID,QUANTITY,SALES_VALUE,RETAIL_DISC,TRANS_TIME,WEEK_NO,COUPON_DISC,COUPON_MATCH_DISC
STORE_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
366.0,583,583,583,583,583,583,583,583,583,583
286.0,595,595,595,595,595,595,595,595,595,595
352.0,610,610,610,610,610,610,610,610,610,610
415.0,691,691,691,691,691,691,691,691,691,691
322.0,783,783,783,783,783,783,783,783,783,783
...,...,...,...,...,...,...,...,...,...,...
381.0,5047,5047,5047,5047,5047,5047,5047,5047,5047,5047
343.0,5141,5141,5141,5141,5141,5141,5141,5141,5141,5141
361.0,5146,5146,5146,5146,5146,5146,5146,5146,5146,5146
406.0,5422,5422,5422,5422,5422,5422,5422,5422,5422,5422
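
The point in the comments above, that two visits to two different stores is not a repurchase, can be checked by counting baskets per household and store. This is a minimal sketch on made-up data; the `visits` and `returned_to_store` column names are hypothetical.

```python
import pandas as pd

# Toy data (made up): one row per basket, already consolidated
toy_baskets = pd.DataFrame({
    'household_key': [1, 1, 2, 2, 3],
    'BASKET_ID':     [10, 11, 20, 21, 30],
    'STORE_ID':      [364, 364, 364, 406, 406],
})

# Count distinct baskets for every (household, store) pair
visits = (toy_baskets
          .groupby(['household_key', 'STORE_ID'])['BASKET_ID']
          .nunique()
          .rename('visits')
          .reset_index())

# A household counts as returning only if it came back to the same store
visits['returned_to_store'] = visits['visits'] > 1
print(visits)
```

Household 2 shopped twice overall but at two different stores, so neither of its rows is flagged as returning.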


In [126]:
#number of stores
len(transaction_df.groupby(['BASKET_ID']).mean().groupby(['STORE_ID']).mean())

582

In [121]:
individual_transactions = transaction_df.groupby(['BASKET_ID']).mean()
individual_transactions.groupby('household_key').count()['QUANTITY']

household_key
1.0        86
2.0        45
3.0        47
4.0        30
5.0        40
         ... 
2496.0     63
2497.0    221
2498.0    172
2499.0     90
2500.0    113
Name: QUANTITY, Length: 2500, dtype: int64

In [None]:
transaction_count = transaction_df

In [100]:
transaction_df.groupby(['household_key']).count()['QUANTITY']
#the numbers here are the number of rows (purchased product lines) for each household in transaction_df,
#not the number of transactions and not the total units bought

household_key
1       1727
2        714
3        922
4        301
5        222
        ... 
2496    1489
2497    1962
2498     859
2499    1166
2500    1503
Name: QUANTITY, Length: 2500, dtype: int64

In [101]:
hh_df.groupby('household_key').count()
#this shows that the hh_df is a list of each unique household

Unnamed: 0_level_0,AGE_DESC,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC
household_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,1,1,1,1,1,1,1
7,1,1,1,1,1,1,1
8,1,1,1,1,1,1,1
13,1,1,1,1,1,1,1
16,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...
2494,1,1,1,1,1,1,1
2496,1,1,1,1,1,1,1
2497,1,1,1,1,1,1,1
2498,1,1,1,1,1,1,1


In [104]:
merged_hh = pd.merge(hh_df, transaction_df.groupby(['household_key']).count()['QUANTITY'], on = "household_key")
merged_hh.head()

Unnamed: 0,AGE_DESC,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,household_key,QUANTITY
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1,1727
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7,1286
2,25-34,U,25-34K,Unknown,2 Adults Kids,3,1,8,1979
3,25-34,U,75-99K,Homeowner,2 Adults Kids,4,2,13,2348
4,45-54,B,50-74K,Homeowner,Single Female,1,None/Unknown,16,517


In [105]:
'''
So we have matched each household with the number of product rows it purchased (the 'QUANTITY' column here
is a row count, not a sum of units). This should allow some decent visualization/organization. Our XGBClassifier
should also be able to take this 'QUANTITY' value into account and evaluate likelihood of repurchase based on it.
'''
#for the next step we are going to swap the KID_CATEGORY_DESC 'None/Unknown' value for 0
merged_hh['KID_CATEGORY_DESC'] = merged_hh['KID_CATEGORY_DESC'].replace({'None/Unknown':0})

In [106]:
merged_hh['HOMEOWNER_DESC'].unique()

array(['Homeowner', 'Unknown', 'Renter', 'Probable Renter',
       'Probable Owner'], dtype=object)

In [107]:
#converting the homeowner category to one-hot encoding
from keras.utils import to_categorical
#map each category to an integer code first, then expand the codes into one-hot rows
merged_hh['HOMEOWNER_DESC_TO_INT'] = merged_hh['HOMEOWNER_DESC'].map({'Homeowner':0,
                                                                      'Unknown':1,
                                                                      'Renter':2,
                                                                      'Probable Renter':3,
                                                                      'Probable Owner':4})
HOMEOWNER_NP = merged_hh['HOMEOWNER_DESC_TO_INT'].to_numpy()
#call to_categorical as imported above from keras.utils
HOMEOWNER_TF = to_categorical(HOMEOWNER_NP, num_classes = 5)
HOMEOWNER_TF

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]], dtype=float32)
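
The same one-hot encoding can also be done in pandas alone, which avoids the integer mapping step. This is an alternative sketch rather than what the notebook runs; the toy rows and the `HOMEOWNER_` column prefix are assumptions for illustration.

```python
import pandas as pd

# Toy rows (made up) using the hh_demographic.csv column names
toy_hh = pd.DataFrame({
    'household_key': [1, 7, 8],
    'HOMEOWNER_DESC': ['Homeowner', 'Homeowner', 'Unknown'],
})

# Fix the category order so every level gets a column, even if unobserved
categories = ['Homeowner', 'Unknown', 'Renter',
              'Probable Renter', 'Probable Owner']
homeowner = toy_hh['HOMEOWNER_DESC'].astype(
    pd.CategoricalDtype(categories=categories))

# One 0/1 column per category, named HOMEOWNER_<category>
onehot = pd.get_dummies(homeowner, prefix='HOMEOWNER').astype('int32')
encoded = pd.concat([toy_hh, onehot], axis=1)
print(encoded)
```

Declaring the categories up front keeps the column layout stable across subsets of households.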

In [108]:
#store each one-hot column as its own integer flag, in the same order as the integer codes
onehot_columns = ['HOMEOWNER_BINARY', 'UNKNOWN_BINARY', 'RENTER_BINARY',
                  'PROBABLE_RENTER_BINARY', 'PROBABLE_OWNER_BINARY']
for i, column in enumerate(onehot_columns):
    merged_hh[column] = HOMEOWNER_TF[:, i].astype('int32')

In [109]:
merged_hh['KID_CATEGORY_DESC'].unique()

array([0, '1', '2', '3+'], dtype=object)

In [110]:
merged_hh['HOUSEHOLD_SIZE_DESC'].unique()

array(['2', '3', '4', '1', '5+'], dtype=object)

In [112]:
merged_hh.head()

Unnamed: 0,AGE_DESC,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,household_key,QUANTITY,HOMEOWNER_DESC_TO_INT,HOMEOWNER_BINARY,UNKNOWN_BINARY,RENTER_BINARY,PROBABLE_RENTER_BINARY,PROBABLE_OWNER_BINARY
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,0,1,1727,0,1,0,0,0,0
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,0,7,1286,0,1,0,0,0,0
2,25-34,U,25-34K,Unknown,2 Adults Kids,3,1,8,1979,1,0,1,0,0,0
3,25-34,U,75-99K,Homeowner,2 Adults Kids,4,2,13,2348,0,1,0,0,0,0
4,45-54,B,50-74K,Homeowner,Single Female,1,0,16,517,0,1,0,0,0,0


In [113]:
#'5+' and '3+' are open-ended buckets; capping them at 5 and 3 makes ADULTS approximate for those households
merged_hh['HOUSEHOLD_SIZE_DESC'] = merged_hh['HOUSEHOLD_SIZE_DESC'].replace({'5+':5}).astype('int32')
merged_hh['KID_CATEGORY_DESC'] = merged_hh['KID_CATEGORY_DESC'].replace({'3+':3}).astype('int32')
#adults = household size minus kids
merged_hh['ADULTS'] = merged_hh['HOUSEHOLD_SIZE_DESC'] - merged_hh['KID_CATEGORY_DESC']
merged_hh = merged_hh.rename(columns = {'KID_CATEGORY_DESC':'KIDS',
                                       'HOUSEHOLD_SIZE_DESC':'HOUSEHOLD_SIZE'})
merged_hh.head()

Unnamed: 0,AGE_DESC,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE,KIDS,household_key,QUANTITY,HOMEOWNER_DESC_TO_INT,HOMEOWNER_BINARY,UNKNOWN_BINARY,RENTER_BINARY,PROBABLE_RENTER_BINARY,PROBABLE_OWNER_BINARY,ADULTS
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,0,1,1727,0,1,0,0,0,0,2
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,0,7,1286,0,1,0,0,0,0,2
2,25-34,U,25-34K,Unknown,2 Adults Kids,3,1,8,1979,1,0,1,0,0,0,2
3,25-34,U,75-99K,Homeowner,2 Adults Kids,4,2,13,2348,0,1,0,0,0,0,2
4,45-54,B,50-74K,Homeowner,Single Female,1,0,16,517,0,1,0,0,0,0,1


In [None]:
'''
Next we want to merge this household data with transactions per store. One option is to keep a dictionary of
store data, separated by store, so that each store's data can be fed into the model to train it further. That
would give us a database of transactions by store, where each transaction has consumer data.
Still to figure out: do we feed the model consumers and decide whether those consumers should be classified as
returning customers, OR do we feed it transaction data, in which case the model predicts which consumer will
make the next transaction at a store?
Thinking out loud, it seems best to feed it a consumer database with the number of purchases made at each store,
i.e. consumer data together with all the purchases each consumer made at a specific store. That way we wouldn't
have to one-hot encode every store (that would be a lot of columns).
Another question: should we combine this with product information? One-hot encoding it would add quite a lot of
columns but could be interesting, at least for what type of products each transaction focused on. We have that data.'''
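
One way the consumer-by-store table described above could be assembled is sketched below, assuming baskets have already been consolidated to one row each. The toy values and the `store_visits` and `model_table` names are hypothetical.

```python
import pandas as pd

# Toy data (made up): one row per consolidated basket
toy_baskets = pd.DataFrame({
    'household_key': [1, 1, 7, 7, 8],
    'BASKET_ID':     [10, 11, 20, 21, 30],
    'STORE_ID':      [364, 364, 364, 406, 406],
})
# Toy demographics using the renamed columns from merged_hh
toy_hh = pd.DataFrame({
    'household_key': [1, 7, 8],
    'HOUSEHOLD_SIZE': [2, 2, 3],
    'KIDS': [0, 0, 1],
})

# Number of purchases each household made at each store
store_visits = (toy_baskets
                .groupby(['household_key', 'STORE_ID'])['BASKET_ID']
                .nunique()
                .rename('store_visits')
                .reset_index())

# One row per (household, store), with the household's demographics attached
model_table = store_visits.merge(toy_hh, on='household_key', how='left')
print(model_table)
```

Because the store is a row key here rather than a set of columns, no per-store one-hot encoding is needed.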