# H&M Recommender

In this competition, we must develop product recommendations based on data from previous transactions, as well as customer and product metadata. Below, we import the necessary libraries and read the data to see what we can do with it.

In [None]:
import pandas as pd

In [None]:
# Path depends on where data is
data_path = '../input/h-and-m-personalized-fashion-recommendations/'

### Importing and Reading Data

- Transaction Data

In [None]:
df_trans = pd.read_csv(data_path+"transactions_train.csv")
df_trans.head()

This will be our main training data. We can play around with dates and articles to see which articles are trending at the moment.
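
One detail worth checking at load time: later cells prepend a `0` to every article_id, which suggests the IDs in the CSV carry a leading zero that is lost when pandas infers an integer column. The sketch below uses a made-up two-row CSV (not the real file) to show the effect and an alternative, assuming you would rather keep the IDs as strings from the start.

```python
import io
import pandas as pd

# Toy CSV standing in for transactions_train.csv (values are made up).
csv_text = (
    "t_dat,customer_id,article_id\n"
    "2020-09-01,c1,0123456789\n"
    "2020-09-02,c2,0987654321\n"
)

# Default parsing infers an integer column and drops the leading zero.
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps the ID exactly as written in the file.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"article_id": str})

# The notebook's approach: re-add the zero after converting back to str.
restored = "0" + as_int["article_id"].astype(str)
```

Both routes end with the same strings here; the integer route uses less memory, which is presumably why the notebook keeps it.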

- Articles metadata

In [None]:
df_articles = pd.read_csv(data_path+"articles.csv")
df_articles.head()

In [None]:
df_articles["index_group_name"].value_counts()

Some features might be useful, such as index_group_name, which indicates whether an article is meant for women, men, babies, etc. This could help us recommend articles with the same index_group_name to customers who already bought something in that group (a rough way of inferring the customer's sex).

- Customer Data

In [None]:
df_customers = pd.read_csv(data_path+"customers.csv")
df_customers.head()

I don't see anything here that might be useful for increasing performance.

## 1. Setup training data

We will use df_trans as our main training set following these steps:
- Format t_dat column as datetime to be able to sort and filter by date
- Keep only the important columns

In [None]:
df_trans["t_dat"] = pd.to_datetime(df_trans["t_dat"])
df_trans = df_trans[['t_dat','customer_id','article_id']]
df_trans.head()

Now, we will add the index_group_name feature from df_articles to our training set.

In [None]:
product_merge = df_trans.merge(df_articles,on=['article_id'],how='left')
product_merge = product_merge[['t_dat','customer_id','article_id','index_group_name']]
product_merge.head()
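
As a small illustration of what the left merge does, here is a sketch on toy frames (all values are made up). The `validate="many_to_one"` argument is an optional safeguard, not something the notebook uses: it makes pandas raise if article_id is duplicated on the articles side, which would otherwise silently multiply transaction rows.

```python
import pandas as pd

# Toy stand-ins for df_trans and df_articles.
trans = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "article_id": [1, 2, 3],
})
articles = pd.DataFrame({
    "article_id": [1, 2],
    "index_group_name": ["Ladieswear", "Menswear"],
})

# A left merge keeps every transaction; articles without metadata get NaN.
merged = trans.merge(
    articles[["article_id", "index_group_name"]],
    on="article_id",
    how="left",
    validate="many_to_one",  # each article_id may appear only once in articles
)
```

Selecting the needed columns before merging, as done here, also avoids carrying the full articles table through the join.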

### Find Customers' Sex

Finding a customer's sex directly will be difficult, but with the table above we can count transactions by index_group_name.

In [None]:
product_merge["index_group_name"].value_counts()

In [None]:
# Share of all transactions per index group; look groups up by name rather than by position
group_share = product_merge["index_group_name"].value_counts(normalize=True)
print("Ladieswear Transactions: {:.0%}".format(group_share["Ladieswear"]))
print("Divided Transactions: {:.0%}".format(group_share["Divided"]))

- We can see that Ladieswear transactions account for 64% of all transactions! "Divided" transactions (H&M's youth-oriented line), the second-largest group, account for only 22%. Since Ladieswear is by far the most common group, we will split our customer base into "female" users and "others".

To do this, we first group all of a customer's purchases by index_group_name. We will assume that a customer who purchased at least one Ladieswear article is female.

In [None]:
sex_user = pd.DataFrame(product_merge.groupby(['customer_id'])['index_group_name'].apply(list)).reset_index()
sex_user.head()

In [None]:
sex_user = sex_user.explode('index_group_name')
sex_user = sex_user.loc[sex_user["index_group_name"]=="Ladieswear"]
sex_user = sex_user.drop_duplicates(['customer_id'])
sex_user.head()

This final sex_user table holds the customers we label as female for this experiment. Next, we can see how many female customers are in our dataset and how that compares with all our customers.

In [None]:
sex_user["customer_id"].value_counts().sum()

In [None]:
print("Female Customers: {:.0%}".format(sex_user["customer_id"].value_counts().sum()/df_customers["customer_id"].value_counts().sum()))

WOW, 87% are female customers. Does this make any sense? If you go to an H&M, I think 9 out of 10 people are usually women, so it definitely sounds possible. Keep in mind that the label only means the customer bought at least one Ladieswear article.
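
The labelling rule used in this section (any Ladieswear purchase means "female") can be written more directly, and it is easy to compare with a stricter rule. The sketch below uses made-up toy data; the majority-share variant is purely an assumption for illustration and is not what the notebook does.

```python
import pandas as pd

# Toy purchases (made-up values).
toy = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c"],
    "index_group_name": ["Ladieswear", "Menswear", "Menswear", "Ladieswear"],
})

# Notebook rule: at least one Ladieswear purchase.
female_ids = toy.loc[
    toy["index_group_name"] == "Ladieswear", "customer_id"
].drop_duplicates()

# Hypothetical stricter rule: Ladieswear is more than half of the purchases.
ladies_share = (
    toy["index_group_name"].eq("Ladieswear").groupby(toy["customer_id"]).mean()
)
majority_female = ladies_share[ladies_share > 0.5].index.tolist()
```

Customer `a` is female under the first rule but not under the second, which shows how sensitive the 87% figure may be to the definition.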

## 2. Find Each Customer's Last Week of Purchases

Now we will find each customer's last week of purchases. This can easily be done as follows:
- Keep the max t_dat per customer_id
- Merge it back into our main training table
- Compute the difference in days between each transaction date and the customer's max date
- Keep only the transactions made within 6 days of that max date

In [None]:
tmp = product_merge.groupby('customer_id').t_dat.max().reset_index()
tmp.columns = ['customer_id','max_dat']
train = product_merge.merge(tmp,on=['customer_id'],how='left')
train.head()

In [None]:
print('Train shape before:',train.shape[0])

In [None]:
train['diff_dat'] = (train.max_dat - train.t_dat).dt.days
train = train.loc[train['diff_dat']<=6]
print('Train shape after:',train.shape[0])

We can see how significantly this reduced the training set size. Apart from faster computation, this also suits fast fashion: we want to recommend products customers will actually buy, which tend to be trendy items or ones that are new in store and being marketed.
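
For reference, the same per-customer "last week" filter can be sketched without the intermediate merge by using `groupby(...).transform("max")`. The data below is made up, and this is an equivalent formulation rather than the code used above.

```python
import pandas as pd

# Toy transactions (made-up values).
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "t_dat": pd.to_datetime(
        ["2020-09-01", "2020-09-10", "2020-09-14", "2020-08-01"]
    ),
})

# transform("max") broadcasts each customer's latest date back to every row.
last_purchase = toy.groupby("customer_id")["t_dat"].transform("max")

# Keep rows at most 6 days before that customer's latest purchase.
recent = toy.loc[(last_purchase - toy["t_dat"]).dt.days <= 6]
```

Note that the window is relative to each customer, so customer `b` keeps a row from August even though it is far from the end of the data.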

## 3. Most Often Previously Purchased Items

Here, we will do the main part of our recommender. The steps are the following:
- Count how many times each customer purchased each article, so we can sort by the most frequently bought items
- Get pairs of articles bought frequently with each other

### Count of previously purchased items

In [None]:
tmp = train.groupby(['customer_id','article_id'])['t_dat'].agg('count').reset_index()
tmp.columns = ['customer_id','article_id','count']
tmp.head()

In [None]:
train = train.merge(tmp,on=['customer_id','article_id'],how='left')
train = train.sort_values(['count','t_dat'],ascending=False)
train = train.drop_duplicates(['customer_id','article_id'])
train = train.sort_values(['count','t_dat'],ascending=False)
train.head()
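
The count-then-deduplicate step can be illustrated on toy data. The sketch below uses `groupby(...).size()`, which yields one row per (customer, article) pair directly; it is an assumed alternative formulation with made-up values, not the exact code above.

```python
import pandas as pd

# Toy transactions: customer "a" bought article 1 twice.
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "article_id": [1, 2, 1, 3],
})

# One row per (customer, article) with the number of purchases.
counts = (
    toy.groupby(["customer_id", "article_id"])
    .size()
    .reset_index(name="count")
)

# Most frequently bought articles first within each customer.
ranked = counts.sort_values(["customer_id", "count"], ascending=[True, False])
```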

### Pairs of items frequently purchased together

This is the most important part of the recommendation. Here we will do the following:
- Use the main transactional training dataset (with all the transactions) to calculate paired articles
- Get the value counts of female (Ladieswear) articles and of all others
- Create a dictionary of the items most frequently bought together with each article, for female and for others

In [None]:
df_trans1 = product_merge[['customer_id','article_id','index_group_name']]

In [None]:
#This will sort articles by purchase count 
vc_female = df_trans1.loc[df_trans1["index_group_name"]=="Ladieswear"].article_id.value_counts()
vc_else = df_trans1.loc[df_trans1["index_group_name"]!="Ladieswear"].article_id.value_counts()

In [None]:
pairs_f = {}
for j,i in enumerate(vc_female.index.values[1000:1032]):
    #if j%10==0: print(j,', ',end='')
    USERS = df_trans1.loc[df_trans1.article_id==i.item(),'customer_id'].unique()
    vc2 = df_trans1.loc[(df_trans1.customer_id.isin(USERS))&(df_trans1.article_id!=i.item()),'article_id'].value_counts()
    pairs_f[i.item()] = [vc2.index[0], vc2.index[1], vc2.index[2]]

In [None]:
pairs_e = {}
for j,i in enumerate(vc_else.index.values[1000:1032]):
    #if j%10==0: print(j,', ',end='')
    USERS = df_trans1.loc[df_trans1.article_id==i.item(),'customer_id'].unique()
    vc2 = df_trans1.loc[(df_trans1.customer_id.isin(USERS))&(df_trans1.article_id!=i.item()),'article_id'].value_counts()
    pairs_e[i.item()] = [vc2.index[0], vc2.index[1], vc2.index[2]]

In [None]:
pairs_f

In [None]:
pairs_e

Above, we can see the pairs of articles. Now we can merge the two dictionaries and map them onto our training set.

In [None]:
pairs_f.update(pairs_e)

In [None]:
pairs_f

In [None]:
train['article_id2'] = train.article_id.map(pairs_f)
train.head()
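
To make the pairing idea in this section concrete, here is a minimal pure-Python sketch on made-up baskets. The function name `top_copurchased` and the data are hypothetical. Unlike the loops above, it counts each customer once rather than each transaction row, and it returns fewer than `k` items when an article has few co-purchases instead of raising an IndexError.

```python
from collections import Counter

# Hypothetical toy data: customer -> set of purchased article ids.
baskets = {
    "u1": {1, 2, 3},
    "u2": {1, 2},
    "u3": {1, 4},
    "u4": {5},
}

def top_copurchased(baskets, article, k=3):
    """Return up to k articles most often bought by buyers of `article`."""
    counts = Counter()
    for items in baskets.values():
        if article in items:
            # Count every other article this buyer also purchased.
            counts.update(items - {article})
    return [other for other, _ in counts.most_common(k)]

pairs = {article: top_copurchased(baskets, article) for article in (1, 5)}
```

Article 2 ranks first for article 1 because two of its three buyers also bought it; article 5 has no co-purchases, so its list is empty.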

## 4. Recommendation of paired items

Now we will filter our data and keep only the important features. This training dataset will become our prediction submission. We will do the following:
- Keep only the customer id and our paired articles feature
- Remove null values
- Format our new paired articles feature to match the submission format

In [None]:
train2 = train[['customer_id','article_id2']].copy()
train2 = train2.loc[train2.article_id2.notnull()]
train2 = train2.rename({'article_id2':'article_id'},axis=1)
train2.head()

In [None]:
train2[['team1','team2', "team3"]] = pd.DataFrame(train2.article_id.tolist(), index= train2.index)
train2["join"] = train2.team1.astype(str) + " 0" + train2.team2.astype(str) + " 0" + train2.team3.astype(str)
train2 = train2.drop(['article_id', 'team1','team2', "team3"], axis=1)
train2 = train2.rename({'join':'article_id'},axis=1)
train2 = train2.drop_duplicates(['customer_id','article_id'])
train2.head()
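
The string concatenation with `" 0"` works because every integer ID is assumed to be nine digits long. A slightly more defensive way to get ten-character IDs is `str.zfill(10)`, sketched below on illustrative values; this is an alternative, not the notebook's method.

```python
import pandas as pd

# Illustrative integer article ids (leading zero already lost).
ids = pd.Series([108775015, 706016001])

# Left-pad with zeros to a fixed width of 10 characters.
padded = ids.astype(str).str.zfill(10)

# Space-separated string, as expected in the prediction column.
joined = " ".join(padded)
```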

In [None]:
train = train[['customer_id','article_id']]
train = pd.concat([train,train2],axis=0,ignore_index=True)
train = train.drop_duplicates(['customer_id','article_id'])
train

In [None]:
train.article_id = ' 0' + train.article_id.astype('str')

In [None]:
train

In [None]:
preds = pd.DataFrame(train.groupby('customer_id').article_id.sum().reset_index())
preds.columns = ['customer_id','prediction']
preds.head()

Almost there. The table is already in its final format. Now we need to add the 12 most popular items for each customer group (female or others).
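
The `groupby(...).sum()` call above relies on string addition to concatenate each customer's articles, which is why every ID was first prefixed with a space. An equivalent sketch that joins already-padded IDs explicitly is shown below; the data is made up and the column names follow the notebook.

```python
import pandas as pd

# Toy per-customer recommendations with ids already zero-padded.
toy = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "article_id": ["0111111111", "0222222222", "0333333333"],
})

# Join each customer's ids with single spaces; row order within a group is kept.
toy_preds = (
    toy.groupby("customer_id")["article_id"]
    .agg(" ".join)
    .reset_index(name="prediction")
)
```

This avoids the leading space that the sum-based version has to strip later.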

## 5. Recommend Last Week's Most Popular Items

Here, we will use our first training set with all the data and keep only the last week (2020-09-16 onward).

In [None]:
df_trans2 = product_merge.loc[product_merge.t_dat >= pd.to_datetime('2020-09-16')]
df_trans2.head()

We now split by index_group_name, using Ladieswear to separate female articles from the rest.

In [None]:
df_trans2_female = df_trans2.loc[df_trans2["index_group_name"]=="Ladieswear"]
df_trans2_else = df_trans2.loc[df_trans2["index_group_name"]!="Ladieswear"]

Next, we get our top 12 articles for females and for others, in the format needed.

In [None]:
top12_female = '0' + ' 0'.join(df_trans2_female["article_id"].value_counts().index.astype('str')[:12])
top12_female

In [None]:
top12_else = '0' + ' 0'.join(df_trans2_else["article_id"].value_counts().index.astype('str')[:12])
top12_else
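
The cutoff date in this section is hard-coded. A sketch of deriving it from the data instead is shown below, on made-up transactions; the six-day offset is an assumption chosen to mirror a one-week window ending on the latest date in the data.

```python
import pandas as pd

# Toy transactions (made-up values).
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(
        ["2020-09-01", "2020-09-20", "2020-09-21", "2020-09-22"]
    ),
    "article_id": [5, 7, 7, 9],
})

# Start of the final week, relative to the latest transaction date.
cutoff = toy["t_dat"].max() - pd.Timedelta(days=6)
last_week = toy.loc[toy["t_dat"] >= cutoff]

# Most purchased articles in that week, formatted as zero-padded ids.
top = last_week["article_id"].value_counts().index[:2]
top_str = " ".join(str(article).zfill(10) for article in top)
```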

## Submission

First, let's read the sample submission, keep all customers, and merge our preds table using customer_id as the key.

In [None]:
sub = pd.read_csv(data_path+"sample_submission.csv")
sub.head()

In [None]:
sub = sub[['customer_id']]
sub = sub.merge(preds,on='customer_id', how='left').fillna('')
sub.head()

Now we can split the customers into female and others (using the sex_user table we created in step 1).

In [None]:
sex_female = sex_user[["customer_id"]]
sub_female = sub.loc[(sub.customer_id.isin(sex_female.customer_id))]
sub_else = sub.loc[~sub.customer_id.isin(sex_female.customer_id)]

Now, let's append the top 12 predictions for females and for others, then concatenate them to get the submission with all customers.

In [None]:
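# assign() returns new frames, avoiding SettingWithCopyWarning on the filtered slices
sub_female = sub_female.assign(prediction=sub_female.prediction + " " + top12_female)
sub_else = sub_else.assign(prediction=sub_else.prediction + " " + top12_else)
sub_final = pd.concat([sub_female,sub_else])
sub_final["prediction"] = sub_final["prediction"].str.strip()
# 12 ten-character IDs plus 11 separating spaces = 131 characters
sub_final["prediction"] = sub_final["prediction"].str[:131]
sub_final.to_csv('submission.csv',index=False)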
sub_final.head()
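
Truncating to 131 characters keeps the first 12 IDs only if every ID is exactly ten characters long, and it does not remove articles that appear both in a customer's own list and in the popular fallback. The helper below is a hypothetical pure-Python sketch (the name `finalize` is not from the notebook) that deduplicates while preserving order and then keeps the first `k` items.

```python
def finalize(personal, fallback, k=12):
    """Merge two space-separated id strings, drop repeats, keep the first k."""
    seen = set()
    ordered = []
    for article in (personal + " " + fallback).split():
        if article not in seen:
            seen.add(article)
            ordered.append(article)
    return " ".join(ordered[:k])

# Personal recommendations come first, then the popular fallback.
example = finalize(
    "0000000001 0000000002",
    "0000000002 0000000003",
    k=3,
)
```

It could be applied row by row, for example with `Series.apply`, at the cost of being slower than the vectorized string slice.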

Disclaimer - Most of the notebook methodology was taken from: https://www.kaggle.com/cdeotte/recommend-items-purchased-together-0-021.