# Build Ranker Dataset

**The purpose of this notebook is to:**
- generate candidate articles for purchase for each customer in each weekly window
- generate features for each customer/article/week candidate pairing

**This notebook uses the following candidate generation methods:**
- All articles that were purchased in the last 30 days (return policy)
- All articles given at least 50% likelihood of repurchase by the repeat model
- Top 12 association articles (what articles are most commonly also purchased by the customers who purchased X?)
- Top 12 general predictions for each age group (used for cold start customers)
- 7 "neighbor articles" (i.e. articles whose article ids are within 7 of the ids of articles purchased in the last 30 days)
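
As an illustration of the "neighbor articles" idea, here is a minimal sketch in plain Python. It assumes article ids are 10-digit zero-padded strings, as elsewhere in this notebook; the helper name `neighbor_ids` and the toy catalogue are hypothetical and not part of the pipeline.

```python
def neighbor_ids(article_id, catalogue, radius=7):
    """Return catalogue ids whose integer value is within `radius` of article_id."""
    base = int(article_id)
    found = []
    for offset in range(-radius, radius + 1):
        if offset == 0:
            continue  # skip the purchased article itself
        candidate = str(base + offset).zfill(10)
        if candidate in catalogue:  # keep only ids that exist in the catalogue
            found.append(candidate)
    return found

# Hypothetical toy catalogue
catalogue = {"0108775015", "0108775044", "0108775051", "0110065001"}
neighbors = neighbor_ids("0108775044", catalogue)
```

Only ids that actually exist in the catalogue survive, so sparse regions of the id space yield few or no neighbors.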

**This model uses the following features:**
- Which candidate generation methods did this article come from? How many lists was it in?
- How many days since the article was last purchased by the customer?
- How many times has the article been purchased by the customer?
- Age + subscription status of customer
- How many weeks removed are we from the peak sales week of the given article?
- Are there any current sales for the article? (Average price sold last week / Average price sold overall)
    - What was the sale 2 weeks ago?
    - What was the sale a month ago?
- What is the propensity for someone in the customer's age/gender bracket to purchase the article?
- What percent of sales of this article occur in the given week?
- How popular are articles in the given garment group/product type/section in the last week?
- How often does the customer repurchase articles? (Total number of purchases / Total number of unique purchases)
- How many days since the article was first sold + last sold at H&M?
- How often is the article repurchased? (Total number of purchases / total number of unique customers who purchased it)
- What is the median age of the article purchaser? How far off is the median age from the customer in question?
- What was the average sale price in the last week vs the customer's average article purchase price? How many standard deviations is it away? (e.g. is it out of their usual price range)
- How similar is this article's metadata to previous purchases? (Color, index group, garment group)
- How popular is the article last week/month? (average sales per week/month vs sales last week/month)
- How often is the article purchased online vs in store? How does that compare to the customer's likelihood to purchase online vs in store?
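
Two of the features above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the notebook's implementation: the function names `weeks_from_peak` and `price_z_score` are hypothetical, a year is treated as 52 weeks, and customers with fewer than two purchases (or no price spread) get a z-score of 0.

```python
from statistics import mean, stdev

def weeks_from_peak(peak_week, current_week, weeks_per_year=52):
    """Circular distance between two week numbers (week 51 and week 2 are 3 apart)."""
    diff = abs(peak_week - current_week) % weeks_per_year
    return min(diff, weeks_per_year - diff)

def price_z_score(article_price, customer_prices):
    """Standard deviations between an article's price and the customer's mean purchase price."""
    if len(customer_prices) < 2:
        return 0.0  # assumption: no spread can be estimated from a single purchase
    spread = stdev(customer_prices)
    if spread == 0:
        return 0.0  # assumption: identical prices carry no usable price range
    return (article_price - mean(customer_prices)) / spread

wrap_distance = weeks_from_peak(51, 2)
z = price_z_score(40.0, [10.0, 20.0, 30.0])
```

The circular distance matters for seasonal articles whose peak falls near the turn of the year.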

## Import statements

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import datetime as dt

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

from scipy import sparse 
from pandas.api.types import CategoricalDtype 

from sklearn.neighbors import NearestNeighbors
from scipy.spatial import KDTree

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,roc_curve,roc_auc_score,f1_score,precision_score,recall_score
from sklearn.model_selection import GridSearchCV,GroupKFold
from sklearn.calibration import CalibratedClassifierCV

import xgboost as xgb

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

from tqdm import tqdm
tqdm.pandas()

## Read in data + fix data types

In [None]:
sample = ''

# Read in articles, customers, and transactions data
df_art = pd.read_csv('../Data/articles/articles'+sample+'.csv')
df_cust = pd.read_csv('../Data/customers/customers'+sample+'.csv')
df_trans = pd.read_csv('../Data/transactions_train/transactions_train'+sample+'.csv')

In [None]:
# Fix format of article IDs
df_art['article_id'] = df_art['article_id'].astype(str).str.zfill(10)
df_art['detail_desc'] = df_art['detail_desc'].astype(str)
df_trans['article_id'] = df_trans['article_id'].astype(str).str.zfill(10)

# Fix datetime type
df_trans['t_dat'] = pd.to_datetime(df_trans['t_dat'])

# Build df_cust age brackets
df_cust['Age_Bracket'] = pd.cut(df_cust['age'],[1,19,29,39,49,59,200],labels=[1,2,3,4,5,6]).fillna(2)

# Update the color column for df_art
df_art['color'] = np.where(df_art['perceived_colour_master_name'].isin(['Blue','Turquoise','Bluish Green']),'Blue',\
                  np.where(df_art['perceived_colour_master_name'].isin(['Green','Yellowish Green','Khaki green']),'Green',\
                  np.where(df_art['perceived_colour_master_name'].isin(['Brown','Beige','Mole']),'Brown',\
                  np.where(df_art['perceived_colour_master_name'].isin(['Grey','Metal']),'Grey',\
                           df_art['perceived_colour_master_name']))))

In [None]:
# Build train/test split by holding out the last 7 days of data

test_start_date = '2020-09-09'
test_end_date = '2020-09-15'

df_trans_train = df_trans.loc[df_trans['t_dat'] < test_start_date,:].copy()
df_trans_test = df_trans.loc[(df_trans['t_dat'] >= test_start_date)&(df_trans['t_dat'] <= test_end_date),:].copy()

# TEST EFFICACY OF CANDIDATE LISTS

### Metrics to test:

- Recall: what percent of unique purchases show up in the candidates?
- How many times more total candidates do I have than true candidates?
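
A toy example of the two metrics, written in plain Python so it is self-contained. The helper name `candidate_metrics` and the sample pairs are hypothetical; candidates and purchases are treated as sets of (customer, article) pairs.

```python
def candidate_metrics(candidates, purchases):
    """Return (recall, multiplier) for candidate pairs against true purchase pairs."""
    candidates, purchases = set(candidates), set(purchases)
    hits = candidates & purchases
    # Recall: share of unique purchases that appear among the candidates
    recall = len(hits) / len(purchases) if purchases else 0.0
    # Multiplier: how many candidates are generated per true candidate
    multiplier = len(candidates) / len(hits) if hits else float("inf")
    return recall, multiplier

candidates = [("c1", "A"), ("c1", "B"), ("c2", "A"), ("c2", "C")]
purchases = [("c1", "A"), ("c2", "D")]
recall, multiplier = candidate_metrics(candidates, purchases)
```

Here one of the two purchases is covered (recall 0.5) at a cost of four candidates per hit. A good candidate list pushes recall up while keeping the multiplier low.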

In [None]:
def build_candidate_metrics(df,date1,date2):
    
    # Unique customer/article purchases in the evaluation window [date1, date2]
    df_trans_test = df_trans.loc[(df_trans['t_dat'] >= date1)&(df_trans['t_dat'] <= date2),\
                                ['customer_id','article_id']].drop_duplicates()
    
    # Candidates that were actually purchased in the window
    df_joined = pd.merge(df,df_trans_test,on=['customer_id','article_id']).drop_duplicates()
    
    # Recall = share of true purchases covered; multiplier = candidates per true candidate
    recall = len(df_joined) / len(df_trans_test)
    mult = len(df) / len(df_joined) if len(df_joined) > 0 else np.inf
    print('Recall:',recall)
    print('Candidate Multiplier:',mult)
    print('Num Candidates:',len(df))
    print('Overall Score:',len(df_joined)/len(df))
    print()

# Building the full TRAINING dataset

# Build the dataset - completed features

In [None]:
def get_general_pred(dfx):
    
    df_build = dfx.copy()
    
    last_ts = df_build['t_dat'].max()
    last_day = last_ts.strftime('%Y-%m-%d')

    df_build['subdays'] = (last_ts - df_build['t_dat'])
    df_build['temp'] = df_build['subdays'].dt.floor('7D')
    df_build['ldbw'] = last_ts - df_build['temp']

    del df_build['subdays']
    del df_build['temp']
    
    weekly_sales = df_build.drop('customer_id', axis=1).groupby(['ldbw', 'article_id']).count().reset_index()
    weekly_sales = weekly_sales.rename(columns={'t_dat': 'count'})
    weekly_sales = weekly_sales[['ldbw','article_id','count']].copy()

    df_build = pd.merge(df_build,weekly_sales, on=['ldbw', 'article_id'])
    weekly_sales = weekly_sales.reset_index().set_index('article_id')
    
    df = pd.merge(df_build,
        weekly_sales.loc[weekly_sales['ldbw']==last_day, ['count']],
        on='article_id', suffixes=('','_targ'))

    # Assign back instead of chained inplace fillna, which is unreliable in recent pandas
    df['count_targ'] = df['count_targ'].fillna(0)
    df['quotient'] = df['count_targ'] / df['count']
    
    target_sales = df.drop('customer_id', axis=1).groupby('article_id')['quotient'].sum()
    
    return target_sales
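
The bucketing and quotient logic in `get_general_pred` can be illustrated on toy data. This is a minimal sketch of my reading of the cell above, in plain Python: the helper name `window_end` and the sample transactions are hypothetical, and articles with no sales in the latest window score 0 here.

```python
import datetime as dt
from collections import Counter, defaultdict

def window_end(t_dat, last_day):
    """End date of the 7-day window (counted back from last_day) that contains t_dat."""
    weeks_back = (last_day - t_dat).days // 7
    return last_day - dt.timedelta(days=7 * weeks_back)

# Hypothetical toy transactions: (article_id, purchase date)
transactions = [
    ("X", dt.date(2020, 9, 8)),
    ("X", dt.date(2020, 9, 7)),
    ("X", dt.date(2020, 9, 1)),
    ("Y", dt.date(2020, 8, 30)),
]
last_day = max(d for _, d in transactions)

# Sales per (window, article)
weekly_sales = Counter((window_end(d, last_day), art) for art, d in transactions)

# Each transaction contributes (latest-window sales) / (sales in its own window)
popularity = defaultdict(float)
for art, d in transactions:
    target = weekly_sales.get((last_day, art), 0)
    popularity[art] += target / weekly_sales[(window_end(d, last_day), art)]
```

Because each window's transactions together contribute exactly the latest-window count, the score equals the latest-window sales multiplied by the number of windows in which the article sold at all.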

In [None]:
# BUILD THE LIST OF OPTIONS:

def build_options(date1,date2,generate_test=True,num_general = 0,num_assoc=12,num_neighbors=6):
    
    # Create dataset for building metrics
    df_avail = df_trans.loc[df_trans['t_dat'] < date1].copy()
    df_trans_test = df_trans.loc[(df_trans['t_dat'] >= date1)&(df_trans['t_dat'] <= date2),\
                                ['customer_id','article_id']].drop_duplicates()
    curr_arts = df_trans.loc[df_trans['t_dat'] >= '2020-09-08','article_id'].unique()
    
    # Build the "repeats"
    daterange = ('FULL' if date1=='2020-09-23' else date1[-5:-3]+date1[-2:]+'_'+date2[-5:-3]+date2[-2:])
    filename = '../Datasets/Outputs'+sample+'/RepeatFULL_'+daterange+'.feather'
    df_repeat = pd.read_feather(filename)
    df_repeat = df_repeat.loc[(df_repeat['predict_prob'] >= 0.5)].copy()
    df_repeat['article_id'] = df_repeat['article_id'].astype(str).str.zfill(10)
    df_repeat = df_repeat[['customer_id','article_id']].drop_duplicates()
    df_repeat['IsRepeat'] = 1
    
    # Build the "purchased in the last 30 days" articles
    df_last30days = df_avail.loc[df_avail['t_dat'] >= dt.datetime.strptime(date1,'%Y-%m-%d')\
                                       +dt.timedelta(days=-30),['customer_id','article_id']].drop_duplicates()
    df_last30days['IsLast30Days'] = 1
    
    # Build the "associations"
    df_artdict = pd.read_csv('../Datasets/association_v2.csv')
    art_dict2 = {}
    for x in df_artdict.itertuples():
        art_dict2[str(x[1]).zfill(10)] = [str(j).zfill(10) for j in list(x[2:])]

    cust_list = []
    art_list = []
    for x in df_last30days.itertuples():
        if x[2] in art_dict2:
            arts = art_dict2[x[2]][:num_assoc]
            cust_list += [x[1]]*len(arts)
            art_list += arts
    df_association = pd.DataFrame()
    df_association['customer_id'] = cust_list
    df_association['article_id'] = art_list
    df_association = df_association.drop_duplicates()
    del df_artdict
    del art_dict2
    del cust_list
    del art_list
    df_association['IsAssociation'] = 1
    df_association = df_association.loc[df_association['article_id'].isin(curr_arts)].copy()
    
    # Build the "general_pred" articles
    gen_pred_dict = pd.read_feather('../Datasets/gen_pred_dict.feather')
    gen_pred_dict[list(range(12))] = pd.DataFrame(gen_pred_dict.prediction.tolist(), index= gen_pred_dict.index)
    del gen_pred_dict['prediction']
    g2 = pd.melt(gen_pred_dict,id_vars = 'Age_Bracket')
    g2.columns = ['Age_Bracket','rank','article_id']
    g2 = g2[['Age_Bracket','article_id']]

    df_gen_articles = pd.merge(df_cust[['customer_id','Age_Bracket']],g2,on='Age_Bracket')
    del g2
    del df_gen_articles['Age_Bracket']
    del gen_pred_dict
    df_gen_articles['IsGenPred'] = 1
    
    # Build neighbors dataframe
    df_last30days['article_num'] = df_last30days['article_id'].astype('int64')
    df_art['article_num'] = df_art['article_id'].astype('int64')
    df_neighbor = pd.DataFrame()
    for inc in range(1,num_neighbors+1):
        df_last30days['offset'] = df_last30days['article_num'] + inc
        df_last30days['offset_id'] = df_last30days['offset'].astype(str).str.zfill(10)
        df_offset = pd.merge(df_last30days[['customer_id','offset_id']],df_art[['article_id']],\
                        left_on=['offset_id'],right_on=['article_id'])
        df_neighbor = pd.concat([df_neighbor,df_offset[['customer_id','article_id']]])
        del df_offset

        df_last30days['offset'] = df_last30days['article_num'] - inc
        df_last30days['offset_id'] = df_last30days['offset'].astype(str).str.zfill(10)
        df_offset = pd.merge(df_last30days[['customer_id','offset_id']],df_art[['article_id']],\
                        left_on=['offset_id'],right_on=['article_id'])
        df_neighbor = pd.concat([df_neighbor,df_offset[['customer_id','article_id']]])
        del df_offset
    del df_art['article_num']
    del df_last30days['article_num']
    del df_last30days['offset']
    del df_last30days['offset_id']
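    # De-duplicate so IsNeighbor stays 0/1 after the groupby-sum below
    df_neighbor = df_neighbor.drop_duplicates()
    df_neighbor['IsNeighbor'] = 1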
    df_neighbor = df_neighbor.loc[df_neighbor['article_id'].isin(curr_arts)].copy()

    
    # Build the full dataset
    df_full = pd.concat([df_repeat,\
                         df_last30days,\
                         df_association,\
                         df_gen_articles,\
                         df_neighbor,\
                        ]).fillna(0)
    df_full = df_full.groupby(['customer_id','article_id']).sum().reset_index()
    df_full['NumListsAppeared'] = df_full[\
                ['IsRepeat','IsAssociation','IsGenPred','IsLast30Days','IsNeighbor']].sum(axis=1)
    
    
    # Generate response column if appropriate
    if generate_test:
        df_trans_test['Response'] = 1
        df_full = pd.merge(df_full,df_trans_test,how='left',on=['customer_id','article_id']).fillna(0)
        yes_custs = df_full.loc[df_full['Response']==1,'customer_id']
        df_full = df_full.loc[df_full['customer_id'].isin(yes_custs)].copy()
        del yes_custs
    
    del df_repeat
    del df_last30days
    del df_association
    del df_gen_articles
    del df_avail
    del df_trans_test
    
    return df_full
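
The way `build_options` merges candidate lists into one table of indicator flags can be shown on a toy example. This is a sketch with two hypothetical candidate lists; it assumes each list is de-duplicated beforehand so that the summed flags remain 0 or 1.

```python
import pandas as pd

# Hypothetical candidate lists, each carrying its own indicator column
repeat = pd.DataFrame({"customer_id": ["c1"], "article_id": ["A"], "IsRepeat": [1]})
last30 = pd.DataFrame({"customer_id": ["c1", "c1"], "article_id": ["A", "B"],
                       "IsLast30Days": [1, 1]})

# Stack the lists; a flag missing from a list becomes 0
full = pd.concat([repeat, last30]).fillna(0)

# Collapse to one row per customer/article pair, summing the flags
full = full.groupby(["customer_id", "article_id"]).sum().reset_index()

# Count how many lists each pair appeared in
flag_cols = ["IsRepeat", "IsLast30Days"]
full["NumListsAppeared"] = full[flag_cols].sum(axis=1)

counts = full.set_index(["customer_id", "article_id"])["NumListsAppeared"].to_dict()
```

The pair (c1, A) appears in both lists and (c1, B) in one, which is exactly the signal `NumListsAppeared` is meant to carry to the ranker.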

In [None]:
# BUILD FULL DATASET

def build_full_dataset(df_full,date1,date2,generate_test=True):
    
    # Create dataset for building metrics
    print('Build datasets for metrics')
    df_avail = df_trans.loc[df_trans['t_dat'] < date1].copy()
    df_unique = df_avail.groupby(['customer_id','article_id','t_dat']).agg({'price':'mean'}).reset_index()
    df_lastpurchase = df_avail.groupby(['customer_id','article_id']).agg({'t_dat':'max'}).reset_index()
    
    # BEGIN BUILDING FEATURES
    
    # Total number of times customer X purchased article Y
    print('Step 1: num times purchased')
    df_num_times = df_avail.groupby(['customer_id','article_id']).agg(\
                            {'t_dat':'nunique'}).reset_index().rename(columns={'t_dat':'UniqueDaysCustBoughtArt'})
    df_full = pd.merge(df_full,df_num_times,how='left',on=['customer_id','article_id']).fillna(0)
    del df_num_times
    
    # Find number of days since last time this article was purchased by this customer
    print('Step 2: days since last purchase')
    df_full['thres'] = date1
    df_full['thres'] = pd.to_datetime(df_full['thres'])
    df_full = pd.merge(df_full,df_lastpurchase,on=['customer_id','article_id'],how='left')
    df_full['DaysSinceCustLastPurchasedArt'] = ((df_full['thres'] - df_full['t_dat']).dt.days).fillna(9999)
    del df_full['thres']
    del df_full['t_dat']
    
    # Age of customer + subscription status
    print('Step 3: age + subscription status of customer')
    df_full = pd.merge(df_full,df_cust[['customer_id','age']],how='left',on='customer_id')
    df_full['age'] = df_full['age'].fillna(32)
    
    # How far removed from the article's peak?
    print('Step 4: how far from article peak?')
    test_week = dt.datetime.strptime(date1,'%Y-%m-%d').isocalendar()[1]

    df_avail['weekNum'] = df_avail.t_dat.dt.isocalendar().week
    df_week_count = df_avail.groupby(['article_id','weekNum']).size().reset_index()
    df_week_count['rank'] = df_week_count.groupby('article_id')[0].rank('first',ascending=False)
    df_week_count = df_week_count.loc[df_week_count['rank']==1,['article_id','weekNum']]

    df_full = pd.merge(df_full,df_week_count,how='left',on='article_id')

    df_full['TestWeekNum'] = test_week
    # Circular distance between week numbers, so week 51 and week 2 are 3 weeks apart
    df_full['Try1'] = (df_full['weekNum'] - df_full['TestWeekNum']).abs()
    df_full['Try2'] = (52 - df_full['Try1']).abs()
    df_full['WeeksFromPeak'] = df_full[['Try1','Try2']].min(axis=1)
    del df_full['weekNum']
    del df_full['TestWeekNum']
    del df_full['Try1']
    del df_full['Try2']
    
    # Add feature identifying potential discounts!
    print('Step 5: Identify current price discounts')
    
    weekBeforeStart = dt.datetime.strptime(date1,'%Y-%m-%d') - dt.timedelta(days=7)
    weekBeforeEnd = dt.datetime.strptime(date1,'%Y-%m-%d') - dt.timedelta(days=1)
    
    df_mean_price = df_avail.loc[df_avail['t_dat'] <= weekBeforeEnd].groupby('article_id').agg({'price':'mean'})

    trainPurchases = df_avail.loc[(df_avail['t_dat'] <= weekBeforeEnd)&(df_avail['t_dat'] >= weekBeforeStart)].groupby(\
                                                            'article_id').agg({'price':'mean'})

    trainPurchases = pd.merge(trainPurchases,df_mean_price,left_index=True,right_index=True)
    trainPurchases['PriceLastWeekVsMean'] = trainPurchases['price_x'] / trainPurchases['price_y']
    trainPurchases = trainPurchases[['PriceLastWeekVsMean']].copy().reset_index()

    df_full = pd.merge(df_full,trainPurchases,how='left',on='article_id').fillna(1)
    
    sale_behavior = pd.merge(df_avail,df_mean_price,on='article_id',suffixes=('','_mean'))
    sale_behavior['BoughtOnSale'] = np.where(sale_behavior['price'] < sale_behavior['price_mean'],1,0)
    sale_behavior = sale_behavior.groupby('customer_id').agg({'BoughtOnSale':['count','sum']}).reset_index()

    sale_behavior.columns = ['customer_id','num_purchases','num_sale_purchases']
    sale_behavior['CustomerPropensityToBuySales'] = sale_behavior['num_sale_purchases'] / sale_behavior['num_purchases']
    
    df_full = pd.merge(df_full,sale_behavior[['customer_id','CustomerPropensityToBuySales']],\
                                           how='left',on='customer_id').fillna(0)
    df_full['ArtCustSalePropensity'] = np.where(df_full['PriceLastWeekVsMean']<1,\
                                                df_full['CustomerPropensityToBuySales'],0)
    
    del df_mean_price
    del trainPurchases
    del sale_behavior
    
    # SALE LAG FEATURE
    print('Step 6: Identify price discount level of week before!')
    weekBeforeStart = dt.datetime.strptime(date1,'%Y-%m-%d') - dt.timedelta(days=14)
    weekBeforeEnd = dt.datetime.strptime(date1,'%Y-%m-%d') - dt.timedelta(days=8)
    
    df_mean_price = df_avail.loc[df_avail['t_dat'] <= weekBeforeEnd].groupby('article_id').agg({'price':'mean'})

    trainPurchases = df_avail.loc[(df_avail['t_dat'] <= weekBeforeEnd)&(df_avail['t_dat'] >= weekBeforeStart)].groupby(\
                                                            'article_id').agg({'price':'mean'})

    trainPurchases = pd.merge(trainPurchases,df_mean_price,left_index=True,right_index=True)
    trainPurchases['Price2WeeksAgoVsMean'] = trainPurchases['price_x'] / trainPurchases['price_y']
    trainPurchases = trainPurchases[['Price2WeeksAgoVsMean']].copy().reset_index()

    df_full = pd.merge(df_full,trainPurchases,how='left',on='article_id').fillna(1)
    del df_mean_price
    del trainPurchases
    
    # SALE LAG FEATURE
    print('Step 7: Identify price discount level of month before!')
    weekBeforeStart = dt.datetime.strptime(date1,'%Y-%m-%d') - dt.timedelta(days=30)
    weekBeforeEnd = dt.datetime.strptime(date1,'%Y-%m-%d') - dt.timedelta(days=8)
    
    df_mean_price = df_avail.loc[df_avail['t_dat'] <= weekBeforeEnd].groupby('article_id').agg({'price':'mean'})

    trainPurchases = df_avail.loc[(df_avail['t_dat'] <= weekBeforeEnd)&(df_avail['t_dat'] >= weekBeforeStart)].groupby(\
                                                            'article_id').agg({'price':'mean'})

    trainPurchases = pd.merge(trainPurchases,df_mean_price,left_index=True,right_index=True)
    trainPurchases['PriceLastMonthVsMean'] = trainPurchases['price_x'] / trainPurchases['price_y']
    trainPurchases = trainPurchases[['PriceLastMonthVsMean']].copy().reset_index()

    df_full = pd.merge(df_full,trainPurchases,how='left',on='article_id').fillna(1)
    del df_mean_price
    del trainPurchases
    
    # What % of customer purchases match the article age/gender?
    print('Step 8: customer age/gender purchase habits')
    df_avail = pd.merge(df_avail,df_art[['article_id','Gender_Category','Age_Category']],on='article_id').rename(\
                                        columns={'Gender_Category':'ArtGender','Age_Category':'ArtAge'})
    
    df_avail['Purchase']=1
    df_pivot = pd.pivot_table(df_avail,index='customer_id',columns='ArtGender',values='Purchase',\
                                                                          aggfunc='sum',fill_value=0)
    df_pivot['Total'] = df_pivot[['F','M','U']].sum(axis=1)
    df_pivot = df_pivot.reset_index()
    df_pivot.columns = ['customer_id','F','M','U','Total']

    df_pivot['F'] = df_pivot['F'] / df_pivot['Total']
    df_pivot['M'] = df_pivot['M'] / df_pivot['Total']
    df_full = pd.merge(df_full,df_art[['article_id','Gender_Category','Age_Category']],how='left',on='article_id')
    df_full = pd.merge(df_full,df_pivot[['customer_id','F','M']],how='left',on='customer_id')
    df_full['GenderPropensity'] = np.where(df_full['Gender_Category']=='U',0.5,\
                                          np.where(df_full['Gender_Category']=='F',df_full['F'],df_full['M']))
    del df_full['F']
    del df_full['M']
    del df_full['Gender_Category']
    del df_pivot
    
    df_pivot = pd.pivot_table(df_avail,index='customer_id',columns='ArtAge',values='Purchase',\
                                                                          aggfunc='sum',fill_value=0)
    df_pivot['Total'] = df_pivot[['Adult','Kids','YA']].sum(axis=1)
    df_pivot = df_pivot.reset_index()
    df_pivot.columns = ['customer_id','Adult','Kids','YA','Total']

    df_pivot['Adult'] = df_pivot['Adult'] / df_pivot['Total']
    df_pivot['Kids'] = df_pivot['Kids'] / df_pivot['Total']
    df_pivot['YA'] = df_pivot['YA'] / df_pivot['Total']
    df_full = pd.merge(df_full,df_pivot[['customer_id','Adult','Kids','YA']],how='left',on='customer_id')
    df_full['AgePropensity'] = np.where(df_full['Age_Category']=='Adult',df_full['Adult'],\
                                          np.where(df_full['Age_Category']=='YA',df_full['YA'],df_full['Kids']))
    del df_full['Adult']
    del df_full['YA']
    del df_full['Kids']
    del df_full['Age_Category']
    del df_pivot
    
    # What percent of sales have been done in this time in history?
    print('Step 9: Sales by time of year')
    df_total = df_avail.groupby('article_id').size().reset_index().rename(columns={0:'OverallSales'})

    currweek = dt.datetime.strptime(date1,'%Y-%m-%d').isocalendar()[1]
    df_avail['weekNum'] = df_avail.t_dat.dt.isocalendar().week

    # Wrap neighbouring weeks into 1..52 (a plain %52 maps week 52 to 0, which never matches)
    df_curr_week = df_avail.loc[df_avail['weekNum'].isin([currweek,(currweek%52)+1,((currweek-2)%52)+1])]\
                            .groupby('article_id').size().reset_index().rename(columns={0:'TimeFrameSales'})
    df_total = pd.merge(df_total,df_curr_week,on='article_id',how='left').fillna(0)
    df_total['PctTimeFrame'] = df_total['TimeFrameSales'] / df_total['OverallSales']
    
    df_full = pd.merge(df_full,df_total[['article_id','PctTimeFrame']],how='left',on='article_id').fillna(1/52)
    del df_total
    del df_curr_week
    
    # What is the weekly popularity of the garment group? Section?
    print('Step 10: Popularity of different sections')
    for section in ['garment_group_no','product_type_no','section_no']:
        df_transart = pd.merge(df_avail,df_art[['article_id',section]],on='article_id',how='left')
        df_transart_overall = df_transart.groupby(section).size().reset_index().rename(\
                                                                                columns={0:'OverallGroupSales'})
        df_transart_last_wk = df_transart.loc[df_transart['t_dat'] >= (dt.datetime.strptime(date1,'%Y-%m-%d')-dt.timedelta(\
                        days=7))].groupby(section).size().reset_index().rename(columns={0:'LastWkGroupSales'})

        df_transart_overall = pd.merge(df_transart_overall,df_transart_last_wk,how='left',on=section).fillna(0)
        del df_transart_last_wk
        df_transart_overall['OverallGroupSalesPerWeek'] = df_transart_overall['OverallGroupSales']/\
                        ((dt.datetime.strptime(date1,'%Y-%m-%d')-dt.datetime.strptime('2018-09-23','%Y-%m-%d')).days/7)
        df_transart_overall[section+'PopularityLastWeek'] = df_transart_overall['LastWkGroupSales']/\
                                                                            df_transart_overall['OverallGroupSalesPerWeek']

        df_full = pd.merge(df_full,df_art[['article_id',section]],how='left',on='article_id')
        df_full = pd.merge(df_full,df_transart_overall[[section,section+'PopularityLastWeek']],\
                                                                   how='left',on=section)
        del df_full[section]
        del df_transart
        del df_transart_overall
    
    # What percent of articles are returned - customer repurchase factor
    print('Step 11: Customer propensity to repurchase')
    df_returns = df_unique.groupby('customer_id').agg({'article_id':['count','nunique']}).reset_index()
    df_returns.columns = ['customer_id','num_articles','num_unique']
    df_returns['RepurchaseFactor_cust'] = df_returns['num_articles'] / df_returns['num_unique']
    df_full = pd.merge(df_full,df_returns[['customer_id','RepurchaseFactor_cust']],\
                           how='left',on='customer_id')
    df_full['RepurchaseFactor_cust'] = df_full['RepurchaseFactor_cust'].fillna(1)
    del df_returns
    
    # Article repurchase factor
    print('Step 12: Article propensity to be repurchased')
    df_art_rep = df_unique.groupby('article_id').agg({'customer_id':['count','nunique']}).reset_index()
    df_art_rep.columns = ['article_id','num_cust','num_unique']
    df_art_rep['RepurchaseFactor_art'] = df_art_rep['num_cust'] / df_art_rep['num_unique']
    df_full = pd.merge(df_full,df_art_rep[['article_id','RepurchaseFactor_art']],\
                           how='left',on='article_id')
    df_full['RepurchaseFactor_art'] = df_full['RepurchaseFactor_art'].fillna(1)
    
    # What is the median age of the article purchasers?
    print('Step 13: Median age of article purchasers')
    df_lastpurchase = pd.merge(df_lastpurchase,df_cust[['customer_id','age']],how='left',on='customer_id').fillna(32)
    df_midage = df_lastpurchase.groupby('article_id').agg({'age':'median'}).reset_index().rename(\
                                                                                columns={'age':'MedianAge'})
    df_full = pd.merge(df_full,df_midage,how='left',on='article_id').fillna(32)
    df_full['YearsFromMedianAge'] = df_full['age'] - df_full['MedianAge']
    del df_midage
    del df_lastpurchase
#     del df_full['age']
#     df_full['AbsYearsFromMedianAge'] = (df_full['age'] - df_full['MedianAge']).abs()
    
    # How many days since the article was first sold at H&M?
    print('Step 14: days since article was first/last sold at H&M')
    df_sold = df_avail.groupby('article_id').agg({'t_dat':['min','max']}).reset_index()
    df_sold.columns = ['article_id','FirstSold','LastSold']
    df_sold['DaysSinceFirstSold'] = (dt.datetime.strptime(date1,'%Y-%m-%d') - df_sold['FirstSold']).dt.days
    df_sold['DaysSinceLastSold'] = (dt.datetime.strptime(date1,'%Y-%m-%d') - df_sold['LastSold']).dt.days
    df_full = pd.merge(df_full,df_sold[['article_id','DaysSinceFirstSold','DaysSinceLastSold']],\
                                                   how='left',on='article_id').fillna(0)
    del df_sold
    
    # What is the average sale price, and what was it last week?
    print('Step 15: Last week + average sale price')
    df_avg_price = df_avail.groupby('article_id').agg({'price':'mean'}).reset_index().rename(\
                                                                    columns={'price':'AverageSalePriceOverall'})
    df_full = pd.merge(df_full,df_avg_price,how='left',on='article_id')
    
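    # Recompute the 7-day cutoff: weekBeforeStart was reassigned to a 30-day lag in Step 7
    lastWeekStart = dt.datetime.strptime(date1,'%Y-%m-%d') - dt.timedelta(days=7)
    df_avg_price = df_avail.loc[df_avail['t_dat'] >= lastWeekStart].groupby('article_id').agg(\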
                     {'price':'mean'}).reset_index().rename(columns={'price':'AverageSalePriceLastWk'})
    df_full = pd.merge(df_full,df_avg_price,how='left',on='article_id')
    df_full['AverageSalePriceLastWk'] = np.where(pd.isnull(df_full['AverageSalePriceLastWk']),\
                                                df_full['AverageSalePriceOverall'],df_full['AverageSalePriceLastWk'])
    del df_avg_price
    
    df_avg_cust_price = df_avail.groupby('customer_id').agg({'price':['mean','std']}).reset_index()
    df_avg_cust_price.columns = ['customer_id','CustPurchasePriceAvg','CustPurchasePriceStd']
    df_full = pd.merge(df_full,df_avg_cust_price,on='customer_id',how='left')
    del df_avg_cust_price
    
    df_full['PriceOverallStdFromCustomerMean'] = ((df_full['AverageSalePriceOverall']-df_full['CustPurchasePriceAvg'])/\
                                                            df_full['CustPurchasePriceStd']).fillna(0).replace(\
                                                                                    {np.inf:0,-np.inf:0})
    
    df_full['PriceLastWkStdFromCustomerMean'] = ((df_full['AverageSalePriceLastWk']-df_full['CustPurchasePriceAvg'])/\
                                                            df_full['CustPurchasePriceStd']).fillna(0).replace(\
                                                                                    {np.inf:0,-np.inf:0})
    
    del df_full['CustPurchasePriceAvg']
    del df_full['CustPurchasePriceStd']
    
    
    
    # What is the weekly/monthly/overall popularity of the article?
    print('Step 15b: Average weekly article sales, last week article sales, ratio')
    df_avail['year'] = df_avail['t_dat'].dt.year
    df_avail['week'] = df_avail['t_dat'].dt.isocalendar().week

    num_weeks = (dt.datetime.strptime(date1,'%Y-%m-%d') - dt.datetime(2018,9,23)).days / 7

    df_weekly = df_avail.groupby(['article_id','year','week']).size().reset_index().rename(columns={0:'count'})
    
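    df_weeklymax = df_weekly.groupby('article_id').agg({'count':'max'}).reset_index()
    # 'year' is an integer column, so compare against an int (the string '2020' never matches)
    df_currweek = df_weekly.loc[(df_weekly['year'] == dt.datetime.strptime(date1,'%Y-%m-%d').year) & (df_weekly['week'] == \
                                        dt.datetime.strptime(date1,'%Y-%m-%d').isocalendar()[1]),['article_id','count']]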
    df_weeklymax = pd.merge(df_weeklymax,df_currweek,how='left',on='article_id',suffixes=('_Top','_Curr')).fillna(0)
    df_weeklymax['PctOfPeakWeeklySales'] = df_weeklymax['count_Curr'] / df_weeklymax['count_Top']
    df_full = pd.merge(df_full,df_weeklymax[['article_id','PctOfPeakWeeklySales']],how='left',on='article_id').fillna(0)
    
    del df_currweek
    del df_weeklymax
    del df_avail['week']
    del df_avail['year']
    
    df_avg = df_weekly.groupby('article_id').agg({'count':'sum'}).reset_index()
    df_avg.columns = ['article_id','PurchaseRatePerWeek']
    df_avg['PurchaseRatePerWeek'] = df_avg['PurchaseRatePerWeek'] / num_weeks
    df_full = pd.merge(df_full,df_avg,how='left',on='article_id').fillna(0)
    del df_weekly
    del df_avg
    
    df_avail_last_week = df_avail.loc[df_avail['t_dat'] >= \
                            (dt.datetime.strptime(date1,'%Y-%m-%d') - dt.timedelta(days=7))].copy()
    df_avail_last_week = df_avail_last_week.groupby('article_id').size().reset_index().rename(\
                                                                    columns={0:'PurchaseRateLastWeek'})
    df_full = pd.merge(df_full,df_avail_last_week,on='article_id',how='left').fillna(0)
    df_full['LastWeekPopularity'] = np.where(df_full['PurchaseRatePerWeek'] == 0,0,\
                                    df_full['PurchaseRateLastWeek']/df_full['PurchaseRatePerWeek'])
    del df_avail_last_week
    
    df_avail_last_month = df_avail.loc[df_avail['t_dat'] >= \
                            (dt.datetime.strptime(date1,'%Y-%m-%d') - dt.timedelta(days=28))].copy()
    df_avail_last_month = df_avail_last_month.groupby('article_id').size().reset_index().rename(\
                                                                    columns={0:'PurchaseRateLastMonth'})
    df_avail_last_month['PurchaseRateLastMonth'] = df_avail_last_month['PurchaseRateLastMonth']/4
    df_full = pd.merge(df_full,df_avail_last_month,on='article_id',how='left').fillna(0)
    df_full['LastMonthPopularity'] = np.where(df_full['PurchaseRatePerWeek'] == 0,0,\
                                    df_full['PurchaseRateLastMonth']/df_full['PurchaseRatePerWeek'])
    del df_avail_last_month
    
    # What are the similarities between article + customer online vs instore?
    print('Step 16: Similarities in online vs in store')

    df_avail['purchase'] = 1
    df_cust_online = pd.pivot_table(df_avail,columns='sales_channel_id',index='customer_id',values='purchase',\
                                    aggfunc='sum',fill_value=0).reset_index()
    df_art_online = pd.pivot_table(df_avail,columns='sales_channel_id',index='article_id',values='purchase',\
                                    aggfunc='sum',fill_value=0).reset_index()
    del df_avail['purchase']
    
    df_cust_online['PctCustInStore'] = df_cust_online[1]/(df_cust_online[1]+df_cust_online[2])
    df_cust_online['PctCustOnline'] = 1 - df_cust_online['PctCustInStore']
    df_cust_online.columns = ['customer_id','NumStore','NumOnline','PctCustInStore','PctCustOnline']

    df_art_online['PctArtInStore'] = df_art_online[1]/(df_art_online[1]+df_art_online[2])
    df_art_online['PctArtOnline'] = 1 - df_art_online['PctArtInStore']
    df_art_online.columns = ['article_id','NumStore','NumOnline','PctArtInStore','PctArtOnline']
    
    df_full = pd.merge(df_full,df_cust_online[['customer_id','PctCustInStore','PctCustOnline']],\
                                                                   how='left',on='customer_id').fillna(0)
    df_full = pd.merge(df_full,df_art_online[['article_id','PctArtInStore','PctArtOnline']],\
                                                                   how='left',on='article_id').fillna(0)
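    # Dot product of the customer's and the article's channel shares:
    # high when both lean toward the same channel
    df_full['SalesChannelSimilarity'] = (df_full['PctCustInStore'] * df_full['PctArtInStore']) + \
                                                (df_full['PctCustOnline'] * df_full['PctArtOnline'])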
    
    del df_full['PctCustInStore']
    del df_full['PctArtInStore']
    del df_full['PctCustOnline']
    del df_full['PctArtOnline']
    del df_cust_online
    del df_art_online
    
    # Popularity measure from online metric
    print('Step 17: new popularity measure')
    df_pop = pd.DataFrame(get_general_pred(df_avail)).reset_index()
    df_pop.columns = ['article_id','popularity_quotient']
    df_full = pd.merge(df_full,df_pop,how='left',on='article_id').fillna(0)
    del df_pop
    
    # Compare article metadata to customer historical metadata purchases
    print('Step 18: how similar is this product metadata to previous purchases')
    df_dummies = pd.get_dummies(df_art[['product_group_name','perceived_colour_value_name','color','index_code','garment_group_name']])
    df_dummies.index = df_art['article_id']
    # Drop dummy columns for placeholder categories (Unknown / Undefined / undefined)
    df_dummies = df_dummies[[i for i in list(df_dummies.columns) if 'Unknown' not in i and 'Undefined' not in i and \
                            'undefined' not in i]]
    df_dummies = df_dummies.loc[:,df_dummies.sum() > 500].reset_index()

    df_trans_dummy = pd.merge(df_avail[['customer_id','article_id']].drop_duplicates(),df_dummies,on='article_id')
    del df_dummies

    # Per-customer count of each metadata dummy across distinct purchased articles
    # (article_id is an identifier, so it is left out of the sum)
    df_groups = df_trans_dummy.drop(columns='article_id').groupby('customer_id').sum()
    df_num_purchases = pd.DataFrame(df_trans_dummy.groupby('customer_id').size()).rename(\
                                                                columns={0:'num_purchases'}).reset_index()
    df_groups = pd.merge(df_groups,df_num_purchases,on='customer_id')
    del df_num_purchases

    for col in [i for i in df_groups.columns if i not in ['customer_id','num_purchases']]:
        df_groups[col] = df_groups[col] / df_groups['num_purchases']
    del df_groups['num_purchases']

    df_trans_full = pd.merge(df_trans_dummy,df_groups,on='customer_id',suffixes = ('','_cust'))
    df_trans_join = df_trans_full[['customer_id','article_id']].copy()
    del df_trans_dummy
    del df_groups

    for colType in ['product_group_name','perceived_colour_value_name','color','index_code','garment_group_name']:
        df_trans_join[colType+'_similarity'] = 0
        for col in [j for j in df_trans_full.columns if j.startswith(colType) and not j.endswith('_cust')]:
            df_trans_join[colType+'_similarity'] += (df_trans_full[col] * df_trans_full[col + '_cust'])
    df_trans_join['overall_metadata_similarity'] = df_trans_join.iloc[:,2:].sum(axis=1)

    del df_trans_full
    df_full = pd.merge(df_full,df_trans_join,how='left',on=['customer_id','article_id']).fillna(0)
    del df_trans_join
    
    return df_full
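
The metadata similarity in Step 18 multiplies each one-hot article column by the customer's share of purchases in that column and sums within an attribute group. Because an article has at most one active dummy per group, this reduces to "what share of the customer's distinct purchased articles have the same value as this article?". The sketch below illustrates that idea in plain Python on made-up data; `value_shares`, `metadata_similarity`, and the toy `history` are hypothetical names for illustration and are not part of the notebook. It also ignores the `> 500` column filter applied above.

```python
from collections import Counter

def value_shares(values):
    """Share of the customer's purchased articles carrying each attribute value."""
    counts = Counter(values)
    return {value: n / len(values) for value, n in counts.items()}

def metadata_similarity(article, history):
    """Per-attribute similarity between one article and a customer's purchase history.

    article: dict mapping attribute name -> value for the candidate article
    history: list of such dicts, one per distinct article the customer bought
    """
    sims = {}
    for attr, value in article.items():
        shares = value_shares([past[attr] for past in history])
        # One-hot article vector dotted with the share vector = share of the matching value
        sims[attr + '_similarity'] = shares.get(value, 0.0)
    sims['overall_metadata_similarity'] = sum(sims.values())
    return sims

# Toy data (assumed, not taken from the H&M dataset)
history = [
    {'color': 'Black', 'index_code': 'A'},
    {'color': 'Black', 'index_code': 'D'},
    {'color': 'White', 'index_code': 'A'},
    {'color': 'Blue', 'index_code': 'A'},
]
result = metadata_similarity({'color': 'Black', 'index_code': 'A'}, history)
print(result)
```

Each per-attribute score lies between 0 and 1, so the overall score is bounded by the number of attribute groups (five in the notebook).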

In [None]:
full_train_months = [('2020-06-10','2020-06-16'),('2020-06-17','2020-06-23'),\
              ('2020-06-24','2020-06-30'),('2020-07-01','2020-07-07'),\
              ('2020-07-08','2020-07-14'),('2020-07-15','2020-07-21'),\
              ('2020-07-22','2020-07-28'),('2020-07-29','2020-08-04'),\
              ('2020-08-05','2020-08-11'),('2020-08-12','2020-08-18'),\
              ('2020-08-19','2020-08-25'),('2020-08-26','2020-09-01'),\
              ('2020-09-02','2020-09-08'),('2020-09-09','2020-09-15'),\
              ('2020-09-16','2020-09-22'),('2020-09-23','2020-09-29')]
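
The window list above is written out by hand. As a sketch, the same 16 consecutive 7-day windows could be generated from the first start date; `weekly_windows` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import datetime as dt

def weekly_windows(first_start, n_weeks):
    """Return n_weeks consecutive (start, end) date strings, each spanning 7 days."""
    start = dt.datetime.strptime(first_start, '%Y-%m-%d')
    windows = []
    for k in range(n_weeks):
        week_start = start + dt.timedelta(weeks=k)
        week_end = week_start + dt.timedelta(days=6)  # inclusive end date
        windows.append((week_start.strftime('%Y-%m-%d'),
                        week_end.strftime('%Y-%m-%d')))
    return windows

windows = weekly_windows('2020-06-10', 16)
print(windows[0], windows[-1])
```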

In [None]:
# Create the train and test sets
print('Sampling:',sample)

start = dt.datetime.now()
print(start)

for dates in full_train_months:
    print(dates,dt.datetime.now()-start)
    gt = dates[0] != '2020-09-23'
    date_range = ('FULL' if dates[0] == '2020-09-23' else \
                  dates[0][-5:-3]+dates[0][-2:]+'_'+dates[1][-5:-3]+dates[1][-2:])
    
    df_full = build_options(dates[0],dates[1],generate_test=gt,num_general=0,num_assoc=5,num_neighbors=12)
    print('Built Candidates',len(df_full))
    df_train_set = build_full_dataset(df_full,dates[0],dates[1],generate_test = gt)
    df_train_set.to_feather('../Datasets/Full_Test'+sample+'/Repurchase_'+date_range+'_yes.feather')
    del df_full
    del df_train_set
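
For reference, the `date_range` label used in the output filename is `MMDD_MMDD` for each window, with the final window (starting 2020-09-23) labelled `FULL`. A minimal sketch of the same naming rule using date parsing instead of string slicing follows; `window_label` and its `final_start` parameter are hypothetical names introduced here.

```python
import datetime as dt

def window_label(start, end, final_start='2020-09-23'):
    """Build the filename label for a weekly window, e.g. '0902_0908' or 'FULL'."""
    if start == final_start:
        return 'FULL'
    fmt = '%Y-%m-%d'
    start_md = dt.datetime.strptime(start, fmt).strftime('%m%d')
    end_md = dt.datetime.strptime(end, fmt).strftime('%m%d')
    return start_md + '_' + end_md

label = window_label('2020-09-02', '2020-09-08')
print('Repurchase_' + label + '_yes.feather')
```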

## Metric top scores:

### 5% - OLD
- 9/2-9/8:   48.620 (1644, 2718823)
- 9/9-9/15:  47.275 (1561, 2626308)

### 5% - NEW
- 8/19-8/25: 55.336 (1707, 2619766, 20100)
- 8/26-9/1:  60.392 (1895, 2623053, 22684)
- 9/2-9/8:   52.824 (1711, 2710538, 20446)
- 9/9-9/15:  51.490 (1624, 2609838, 19626)

### Overall - OLD
- 8/19-8/25: 39.066 (18785, 39132721, 230825)
- 8/26-9/1:  39.748 (20158, 40063013, 255172)
- 9/2-9/8:   40.709 (19671, 39925366, 238074)
- 9/9-9/15:  40.521 (18933, 38814497, 227910)
- 9/16-9/22: 37.278 (17368, 37860385, 213728)

### Overall - NEW
- 8/19-8/25: 44.197 (21245, 44241584, 230825)
- 8/26-9/1:  43.548 (22451, 45359030, 255172)
- 9/2-9/8:   44.042 (21723, 45004618, 238074)
- 9/9-9/15:  44.172 (20923, 43484765, 227910)
- 9/16-9/22: 41.371 (19303, 42139636, 213728)