## Repeat Purchase Project
#### This project uses Dunnhumby data to predict how likely existing customers are to purchase again.
Much of the work is building a data set and fitting a machine learning model to it. RFM (recency, frequency, monetary value) metrics will be important here, as these are the indicators/features we look out for.
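
As a rough illustration of what RFM features could look like for this data, here is a minimal sketch. It assumes the transaction columns `household_key`, `BASKET_ID`, `DAY`, and `SALES_VALUE`; the toy rows and the helper name `build_rfm` are made up for illustration, and treating the last observed day as "today" is an assumption rather than part of the original analysis.

```python
import pandas as pd

# Toy stand-in for transaction_df; the values are invented for illustration.
toy_transactions = pd.DataFrame({
    "household_key": [1, 1, 1, 2, 2, 3],
    "BASKET_ID":     [10, 10, 11, 20, 21, 30],
    "DAY":           [1, 1, 40, 5, 90, 12],
    "SALES_VALUE":   [1.39, 0.82, 2.50, 4.00, 1.25, 3.10],
})

def build_rfm(transactions: pd.DataFrame) -> pd.DataFrame:
    """Return one row per household with recency, frequency, and monetary columns."""
    # Assumption: the last day seen in the data plays the role of "today".
    last_day = transactions["DAY"].max()
    rfm = transactions.groupby("household_key").agg(
        last_purchase_day=("DAY", "max"),
        frequency=("BASKET_ID", "nunique"),  # distinct baskets, i.e. shopping trips
        monetary=("SALES_VALUE", "sum"),     # total spend
    )
    rfm["recency"] = last_day - rfm["last_purchase_day"]  # days since the last trip
    return rfm[["recency", "frequency", "monetary"]]

rfm = build_rfm(toy_transactions)
print(rfm)
```

The resulting frame is indexed by `household_key`, so it could be joined onto the household demographics later.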

Link to XGBoost Tutorial on Towards Data Science:
https://towardsdatascience.com/getting-started-with-xgboost-in-scikit-learn-f69f5f470a97

In [15]:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt

### Now we load the transaction, product, and household demographic CSVs

In [2]:
transaction_df = pd.read_csv('../Resources/dunnhumby/transaction_data.csv')
product_df = pd.read_csv('../Resources/dunnhumby/product.csv')
hh_df = pd.read_csv('../Resources/dunnhumby/hh_demographic.csv')

In [3]:
transaction_df.head()

Unnamed: 0,household_key,BASKET_ID,DAY,PRODUCT_ID,QUANTITY,SALES_VALUE,STORE_ID,RETAIL_DISC,TRANS_TIME,WEEK_NO,COUPON_DISC,COUPON_MATCH_DISC
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0
2,2375,26984851472,1,1036325,1,0.99,364,-0.3,1631,1,0.0,0.0
3,2375,26984851472,1,1082185,1,1.21,364,0.0,1631,1,0.0,0.0
4,2375,26984851472,1,8160430,1,1.5,364,-0.39,1631,1,0.0,0.0


In [4]:
product_df.head()

Unnamed: 0,PRODUCT_ID,MANUFACTURER,DEPARTMENT,BRAND,COMMODITY_DESC,SUB_COMMODITY_DESC,CURR_SIZE_OF_PRODUCT
0,25671,2,GROCERY,National,FRZN ICE,ICE - CRUSHED/CUBED,22 LB
1,26081,2,MISC. TRANS.,National,NO COMMODITY DESCRIPTION,NO SUBCOMMODITY DESCRIPTION,
2,26093,69,PASTRY,Private,BREAD,BREAD:ITALIAN/FRENCH,
3,26190,69,GROCERY,Private,FRUIT - SHELF STABLE,APPLE SAUCE,50 OZ
4,26355,69,GROCERY,Private,COOKIES/CONES,SPECIALTY COOKIES,14 OZ


In [5]:
hh_df.head()

Unnamed: 0,AGE_DESC,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,household_key
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7
2,25-34,U,25-34K,Unknown,2 Adults Kids,3,1,8
3,25-34,U,75-99K,Homeowner,2 Adults Kids,4,2,13
4,45-54,B,50-74K,Homeowner,Single Female,1,None/Unknown,16


#### A few quick grouped summaries show what the data looks like as a whole. Let's take a look at the household demographic data first.

In [6]:
hh_df.groupby(['AGE_DESC']).count()

AGE_DESC,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,household_key
19-24,46,46,46,46,46,46,46
25-34,142,142,142,142,142,142,142
35-44,194,194,194,194,194,194,194
45-54,288,288,288,288,288,288,288
55-64,59,59,59,59,59,59,59
65+,72,72,72,72,72,72,72


#### We can tell from the above data that the largest group of households is 45-54 (288), with 35-44 (194) as the runner-up. Note that these are counts of households, not amounts spent.

In [7]:
hh_df.groupby(['INCOME_DESC']).count()

INCOME_DESC,AGE_DESC,MARITAL_STATUS_CODE,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,household_key
100-124K,34,34,34,34,34,34,34
125-149K,38,38,38,38,38,38,38
15-24K,74,74,74,74,74,74,74
150-174K,30,30,30,30,30,30,30
175-199K,11,11,11,11,11,11,11
200-249K,5,5,5,5,5,5,5
25-34K,77,77,77,77,77,77,77
250K+,11,11,11,11,11,11,11
35-49K,172,172,172,172,172,172,172
50-74K,192,192,192,192,192,192,192


#### The largest income brackets are 50-74K (192 households) and 35-49K (172), and about half of the households shown earn under 50K a year. We could compare income to household size, kids, marital status, etc., but we can leave those as features for the model that we will end up building.
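
If we did want to compare income against household size, a cross-tabulation would be one way to do it. The sketch below assumes the `INCOME_DESC` and `HOUSEHOLD_SIZE_DESC` columns from the household table; the toy rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for hh_df; rows are invented for illustration.
toy_hh = pd.DataFrame({
    "INCOME_DESC":         ["35-49K", "50-74K", "25-34K", "50-74K", "35-49K"],
    "HOUSEHOLD_SIZE_DESC": ["2", "2", "3", "1", "2"],
})

# Rows are income brackets, columns are household sizes, cells are household counts.
income_by_size = pd.crosstab(toy_hh["INCOME_DESC"], toy_hh["HOUSEHOLD_SIZE_DESC"])
print(income_by_size)
```

The same call works for any pair of categorical columns, such as income against `KID_CATEGORY_DESC` or `MARITAL_STATUS_CODE`.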

Before modeling, we have to organize the households into one-time purchasers and repeat purchasers. We can do this through the transaction dataframe by counting how many times each unique household key occurs. We then gather the data for those households and assemble a numpy array that can be passed into an XGBClassifier. The classifier will be given the data and behavior of customers who have purchased multiple times, and will then be used to predict which of the households that have only purchased once are most likely to purchase again. The XGBClassifier's `predict_proba` method returns values between 0 and 1, which reflect a probability or confidence of repurchase. We could then have an algorithm process the predictions and report the households that exceed a chosen confidence threshold.
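
The final thresholding step could look something like the sketch below. The function name `households_above_threshold`, the default threshold, and the probabilities are all hypothetical; in practice the scores would come from something like `model.predict_proba(X)[:, 1]` on a fitted classifier.

```python
import numpy as np
import pandas as pd

def households_above_threshold(household_keys, probabilities, threshold=0.5):
    """Return households whose predicted repurchase probability meets the threshold."""
    scores = pd.Series(
        np.asarray(probabilities, dtype=float),
        index=pd.Index(household_keys, name="household_key"),
        name="repurchase_probability",
    )
    # Keep only confident predictions, most confident first.
    return scores[scores >= threshold].sort_values(ascending=False)

# Made-up probabilities standing in for the classifier's output.
keys = [7, 8, 13, 16]
probs = [0.91, 0.42, 0.77, 0.10]
likely = households_above_threshold(keys, probs, threshold=0.75)
print(likely)
```

Keeping the threshold as a parameter makes it easy to ask for more or less confidence without retraining anything.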

In [12]:
# number of unique households
transaction_df['household_key'].nunique()

2500

In [36]:
# Count the transaction rows per household. Each row is one product line in a basket,
# so this is how many times each household shows up in transaction_df, not the number of trips.
transaction_df.groupby('household_key')['QUANTITY'].count()

household_key
1       1727
2        714
3        922
4        301
5        222
        ... 
2496    1489
2497    1962
2498     859
2499    1166
2500    1503
Name: QUANTITY, Length: 2500, dtype: int64
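
Since every row is a product line rather than a visit, a household with a single large basket would still show a high row count. One possible way to separate one-time from repeat purchasers is to count distinct `BASKET_ID` values instead, as sketched below. The toy data and the "more than one basket" rule are assumptions for illustration, not something the analysis above has settled on.

```python
import pandas as pd

# Toy stand-in for transaction_df; the values are invented for illustration.
toy_transactions = pd.DataFrame({
    "household_key": [1, 1, 1, 2, 2, 3],
    "BASKET_ID":     [10, 10, 11, 20, 20, 30],
    "QUANTITY":      [1, 1, 2, 1, 3, 1],
})

# Rows per household: what the count above measures.
rows_per_household = toy_transactions.groupby("household_key")["QUANTITY"].count()

# Distinct baskets per household: a proxy for the number of shopping trips.
trips_per_household = toy_transactions.groupby("household_key")["BASKET_ID"].nunique()

# Assumed labeling rule: 1 = repeat purchaser (more than one basket), 0 = one-time purchaser.
is_repeat_purchaser = (trips_per_household > 1).astype(int)

print(pd.DataFrame({
    "rows": rows_per_household,
    "trips": trips_per_household,
    "repeat": is_repeat_purchaser,
}))
```

In the toy data, household 2 has two rows but only one basket, so it is labeled as a one-time purchaser.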

In [29]:
hh_df.groupby('household_key').count()
# every demographic column has a count of 1, which shows that hh_df holds one row per unique household

household_key,AGE_DESC,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,QUANTITY PURCHASED
1,1,1,1,1,1,1,1,0
7,1,1,1,1,1,1,1,1
8,1,1,1,1,1,1,1,1
13,1,1,1,1,1,1,1,1
16,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...
2494,1,1,1,1,1,1,1,1
2496,1,1,1,1,1,1,1,1
2497,1,1,1,1,1,1,1,1
2498,1,1,1,1,1,1,1,1


In [33]:
# Inner merge: only households that appear in both hh_df and the transaction counts are kept
merged_hh = pd.merge(hh_df, transaction_df.groupby('household_key')['QUANTITY'].count(), on="household_key")

In [37]:
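# Drop the redundant 'QUANTITY PURCHASED' column if present (errors='ignore' avoids a KeyError when it is absent)
merged_hh.drop(columns=['QUANTITY PURCHASED'], inplace=True, errors='ignore')
merged_hh.head()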

Unnamed: 0,AGE_DESC,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,household_key,QUANTITY
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1,1727
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7,1286
2,25-34,U,25-34K,Unknown,2 Adults Kids,3,1,8,1979
3,25-34,U,75-99K,Homeowner,2 Adults Kids,4,2,13,2348
4,45-54,B,50-74K,Homeowner,Single Female,1,None/Unknown,16,517
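
Because the merge above is an inner join, households without a demographic record drop out of `merged_hh`. A quick way to see how many rows match on each side is an outer merge with `indicator=True`, sketched here on toy data (the names `toy_hh`, `toy_counts`, and `check` are made up for illustration).

```python
import pandas as pd

# Toy stand-ins: demographics for three households, transaction counts for three households.
toy_hh = pd.DataFrame({
    "household_key": [1, 7, 8],
    "AGE_DESC": ["65+", "45-54", "25-34"],
})
toy_counts = pd.Series(
    [1727, 714, 1286],
    index=pd.Index([1, 2, 7], name="household_key"),
    name="QUANTITY",
)

# The outer merge keeps every key; the _merge column records which side each row came from.
check = pd.merge(
    toy_hh,
    toy_counts.reset_index(),
    on="household_key",
    how="outer",
    indicator=True,
)
print(check["_merge"].value_counts())
```

Rows marked `right_only` are households with transactions but no demographics, which is the group an inner merge silently discards.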


In [None]:
''' 
So we have matched each household with the number of product line items it purchased. This should allow some
decent visualization/organization. Our XGBClassifier should also be able to take this 'QUANTITY' value into
account and evaluate the likelihood of repurchase based on that as well.
'''
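
Before a frame like `merged_hh` can be handed to a classifier, its text columns need a numeric encoding. The sketch below shows one common option, one-hot encoding with `pd.get_dummies`, on a toy frame that borrows a few values from the output above. Choosing one-hot encoding, and the names `toy_merged`, `features`, and `X`, are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for merged_hh with two of its categorical columns.
toy_merged = pd.DataFrame({
    "AGE_DESC":      ["65+", "45-54", "25-34"],
    "INCOME_DESC":   ["35-49K", "50-74K", "25-34K"],
    "household_key": [1, 7, 8],
    "QUANTITY":      [1727, 1286, 1979],
})

# household_key is an identifier, not a feature, so it is left out of the matrix.
features = pd.get_dummies(
    toy_merged.drop(columns=["household_key"]),
    columns=["AGE_DESC", "INCOME_DESC"],
    dtype=float,
)

# Numeric array with one row per household, ready to pass to a model's fit or predict call.
X = features.to_numpy(dtype=float)
print(features.columns.tolist())
print(X.shape)
```

Each category becomes its own 0/1 column, while `QUANTITY` passes through unchanged as a numeric feature.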

In [None]:
# from IPython.display import Image
# Image(filename='Resources/Screenshots/')