## Jashanjot Singh Bindra
### 101903159
### 3COE16

## Table of Contents
**Problem Name** -> H&M Personalized Fashion Recommendations

**Problem Link** -> https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/overview

**Problem Type** -> Classification and Data Analysis

**Libraries used** -> cuDF, cuPy, cuML

**Models Implemented** -> K-Nearest Neighbours Classifier (using Minkowski distance)

**Evaluation Metrics Used** -> Mean Average Precision @ 12

**Kaggle Rank Achieved with total number of teams (if applicable)** -> Rank 482 out of 1231 teams

**Tasks done in code:-**
* Loading training datasets
* Data Evaluation
* Pre-processing Training dataset
* Finding Items that are purchased most often and then sorting them by date
* Finding Items that were most popular last week
* Recommending items by age of customer and other features of article
* Applying KNN
* Creating Submission File
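
The evaluation metric named above, Mean Average Precision @ 12, can be sketched in plain Python as follows. This is a minimal illustration rather than the competition's scoring code: the function names are hypothetical, repeated predictions are assumed to count only once, and customers without any actual purchases are assumed to be skipped.

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer: sum of precision at each rank holding a new hit,
    divided by min(number of actual items, k)."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    seen = set()
    hits = 0
    score = 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        if item in actual_set and item not in seen:
            hits += 1
            score += hits / rank  # precision at this rank
        seen.add(item)
    return score / min(len(actual_set), k)


def mean_average_precision_at_k(actuals, predictions, k=12):
    """Mean of AP@k over customers that have at least one actual purchase."""
    scores = [
        average_precision_at_k(a, p, k)
        for a, p in zip(actuals, predictions)
        if a
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Because a hit at an early rank contributes more than the same hit at a later rank, the order of the 12 predictions matters, which is why the steps below put the most likely items first.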


In [None]:
import cudf
import cupy as cp
import cuml

## Loading training datasets

In [None]:
df_train = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv')
df_train.head()

In [None]:
cust = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/customers.csv')
cust.head()

In [None]:
articles=cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/articles.csv')
articles.head()

## Preprocessing Training dataset
* Here we reduce the memory consumption of the training dataset by storing customer id as int64, which takes 8 bytes, instead of as a string, which takes 64 bytes
* We further reduce memory consumption by storing article id as int32, which takes 4 bytes, instead of as a string, which takes 10 bytes
* We also remove unnecessary columns from the training dataset
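
The customer id conversion can be illustrated without a GPU. The sketch below is plain Python, not the cuDF call used in this notebook: it keeps the last 16 hex digits (64 bits) of the id and wraps the result into the signed int64 range, which is assumed to be how the cast behaves. The function name and the example id are made up.

```python
def customer_id_to_int64(customer_id: str) -> int:
    """Map a hex customer id to a signed 64-bit integer using its last 16 hex digits."""
    value = int(customer_id[-16:], 16)  # 16 hex digits = 64 bits
    if value >= 1 << 63:                # wrap into the signed int64 range (assumed behaviour)
        value -= 1 << 64
    return value


example_id = "ab" * 24 + "000000000000000f"  # made-up 64-character id
print(customer_id_to_int64(example_id))
print(len(example_id), "bytes as a string vs 8 bytes as int64")
```

Only the tail of the id is kept, so two different ids could in principle map to the same integer; the notebook relies on this not happening in practice.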

In [None]:
df_train['customer_id'] = df_train['customer_id'].str[-16:].str.hex_to_int().astype('int64')
df_train['article_id'] = df_train.article_id.astype('int32')
df_train.t_dat = cudf.to_datetime(df_train.t_dat)
df_train = df_train[['t_dat','customer_id','article_id']]
df_train_original = df_train
print( df_train.shape )
df_train.head()

In [None]:
cust = cust[['customer_id','age']]
cust['customer_id'] = cust['customer_id'].str[-16:].str.hex_to_int().astype('int64')
cust.head()

In [None]:
articles=articles[['article_id','product_type_no','graphical_appearance_no','colour_group_code']]
articles['article_id'] = articles.article_id.astype('int32')
articles.head()

### Finding Customer's Last 2 weeks Purchases
* We keep only those purchases of each customer that were made within 14 days of that customer's most recent purchase date
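
As a rough CPU-only illustration of this filter, the sketch below finds each customer's latest purchase date and keeps the rows within 14 days of it. The transactions and the helper name `keep_recent` are made up for the example.

```python
from datetime import date

# Made-up transactions: (purchase date, customer id, article id)
transactions = [
    (date(2020, 9, 1), 1, 111),
    (date(2020, 9, 10), 1, 222),
    (date(2020, 9, 20), 1, 333),
    (date(2020, 6, 1), 2, 444),
]


def keep_recent(transactions, window_days=14):
    """Keep rows at most `window_days` older than the same customer's latest purchase."""
    last_purchase = {}
    for t_dat, customer, _ in transactions:
        if customer not in last_purchase or t_dat > last_purchase[customer]:
            last_purchase[customer] = t_dat
    return [
        row for row in transactions
        if (last_purchase[row[1]] - row[0]).days <= window_days
    ]


recent = keep_recent(transactions)
```

Note that the window is relative to each customer, so customer 2's single purchase from June is kept even though it is old in absolute terms.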

In [None]:
temp = df_train.groupby('customer_id').t_dat.max().reset_index() #Finding most recent purchase of the customer
temp.columns = ['customer_id','max_dat']
temp

In [None]:
df_train = df_train.merge(temp,on=['customer_id'],how='left')
df_train['diff_dat'] = (df_train.max_dat - df_train.t_dat).dt.days
df_train = df_train.loc[df_train['diff_dat']<=14]

In [None]:
df_train['diff_dat'].unique() # checking whether all differences are present or not

In [None]:
df_train

## 1) Finding Items that are purchased most often and then sorting them by date
* If a person purchases an item quite often he is more likely to purchase it again
* Further we store the most recent of the most frequently purchased items first to further improve predictions
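
The ranking described above can be sketched in plain Python: count how often each customer bought each article, remember the latest purchase date of each pair, and sort by count first and date second, both descending. The data and the name `rank_articles` are hypothetical.

```python
from collections import Counter
from datetime import date

# Made-up purchases: (purchase date, customer id, article id)
purchases = [
    (date(2020, 9, 10), 1, 111),
    (date(2020, 9, 12), 1, 222),
    (date(2020, 9, 15), 1, 111),
    (date(2020, 9, 18), 1, 333),
]


def rank_articles(purchases):
    """Return unique (customer, article) pairs, most frequent first, ties broken by recency."""
    counts = Counter((customer, article) for _, customer, article in purchases)
    latest = {}
    for t_dat, customer, article in purchases:
        key = (customer, article)
        latest[key] = max(latest.get(key, t_dat), t_dat)
    return sorted(counts, key=lambda key: (counts[key], latest[key]), reverse=True)


ranked = rank_articles(purchases)
```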

In [None]:
temp = df_train.groupby(['customer_id','article_id'])['t_dat'].agg('count').reset_index() # Finding number of times a particular item is purchased by a particular customer
temp.columns = ['customer_id','article_id','count']
temp

In [None]:
df_train = df_train.merge(temp,on=['customer_id','article_id'],how='left')
df_train = df_train.sort_values(['count','t_dat'],ascending=False)
df_train

In [None]:
df_train = df_train.drop_duplicates(['customer_id','article_id'])
df_train = df_train.sort_values(['count','t_dat'],ascending=False)
df_train=df_train.reset_index(drop=True)
df_train

In [None]:
df_train=df_train.reset_index(drop=False)
df_train

## 2) Find Items that were most popular last week
* We will recommend the 12 most popular items to all the users
* Extra items will be removed later while creating the submission file, so there is no harm in adding them now
* Also, the problem description says that predicting 12 items for every customer is beneficial

In [None]:
print('Latest Date ',df_train_original['t_dat'].max())

In [None]:
print('Last Week\'s Date ',df_train_original['t_dat'].max()-518400000000000)
# 518400000000000 ns = 6 days; subtracting it from the latest date gives the first day of a 7-day window that includes the latest date

In [None]:
df_train_original = df_train_original.loc[df_train_original.t_dat >= cudf.to_datetime('2020-09-16')]
top12 = ' 0' + ' 0'.join(df_train_original.article_id.value_counts().to_pandas().index.astype('str')[:12])
print("Last week's top 12 popular items:")
print( top12 )
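
The window arithmetic and the popularity count can be checked with a small CPU-only sketch. It assumes a 7-day window that includes the latest date, which corresponds to the 6-day offset used above; the helper names and the transactions are made up.

```python
from collections import Counter
from datetime import date, timedelta

NANOSECONDS_PER_DAY = 24 * 60 * 60 * 10**9
assert 518_400_000_000_000 == 6 * NANOSECONDS_PER_DAY  # the offset used above is 6 days


def top_articles(transactions, n=12, window_days=7):
    """Most purchased article ids in the last `window_days` days, latest date included."""
    latest = max(t_dat for t_dat, _ in transactions)
    start = latest - timedelta(days=window_days - 1)
    counts = Counter(article for t_dat, article in transactions if t_dat >= start)
    return [article for article, _ in counts.most_common(n)]


def to_prediction_suffix(article_ids):
    """Format ids the way the notebook does: a space and a leading zero before each id."""
    return ''.join(' 0' + str(article) for article in article_ids)


# Made-up transactions: (purchase date, article id)
sample = [
    (date(2020, 9, 22), 706016001),
    (date(2020, 9, 20), 706016001),
    (date(2020, 9, 16), 108775015),
    (date(2020, 9, 15), 108775015),
    (date(2020, 9, 15), 108775015),
]
popular = top_articles(sample)
```

In this sample the second article was bought more often overall, but two of its purchases fall before the window, so it ranks second.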

## 3) Recommending items by age of customer and other features of article
* Here we use KNN to predict the most similar article that the customer is likely to buy
* The idea is that if a person has bought a product with a certain colour, product type (like T-shirt or shorts) and graphical appearance, they are more likely to buy another product with similar characteristics
* We also take the customer's age into consideration, since people of a similar age tend to buy similar clothes

In [None]:
age_train=cudf.merge(df_train, cust, on='customer_id')
age_train.head()

In [None]:
age_train=age_train[['index','customer_id','age','article_id']]
age_train=age_train.fillna({'age':18})
age_train

In [None]:
arti_age_train = cudf.merge(age_train, articles, on='article_id')
arti_age_train

In [None]:
X=arti_age_train[['age','product_type_no','graphical_appearance_no','colour_group_code']]
X.head()

In [None]:
Y=arti_age_train[['article_id']]
Y.head()

## Applying KNN
* K is set to 10
* The distance metric is the Minkowski distance
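
To make the idea concrete, here is a minimal pure-Python sketch of a K-nearest-neighbours classifier with the Minkowski distance. It is not the cuML implementation used below; the order `p=2` (which makes the distance Euclidean) is an assumption, and the function names and toy feature rows are hypothetical.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance of order p; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_X, train_y, query, k=10, p=2):
    """Return the most common label among the k training rows closest to `query`."""
    neighbours = sorted(
        zip(train_X, train_y),
        key=lambda pair: minkowski(pair[0], query, p),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy rows of (age, product_type_no); labels stand in for article ids
toy_X = [(20, 1), (21, 1), (22, 1), (60, 9), (61, 9)]
toy_y = ['a', 'a', 'b', 'c', 'c']
prediction = knn_predict(toy_X, toy_y, query=(21, 1), k=3)
```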

In [None]:
from cuml.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=10,metric='minkowski')

knn.fit(X[:50000], Y[:50000]) # fit on a subsample only, to save GPU memory

In [None]:
# ans=knn.predict(X) wanted to do this but gpu is going out of memory
ans=knn.predict(X[:100000])

In [None]:
ls = ans.to_arrow().to_pylist()

In [None]:
arti_age_trai = arti_age_train[['index','customer_id','article_id']]
arti_age_trai

In [None]:
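# DataFrame.append is unavailable in recent cuDF releases, so build the new rows explicitly and concatenate
new_rows = cudf.DataFrame({
    'index': cp.arange(len(arti_age_trai), len(arti_age_trai) + len(ls)),  # positions after all existing rows
    'customer_id': arti_age_trai['customer_id'][:100000].values,           # customers the predictions were made for
    'article_id': ls,                                                      # KNN-predicted articles
})
arti_age_trai = cudf.concat([arti_age_trai, new_rows], ignore_index=True)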

In [None]:
arti_age_trai

### The `index` column is kept so that the recent, most frequently purchased items computed earlier are recommended first, followed by the KNN predictions
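
A minimal sketch of this ordering, in plain Python with made-up data: the purchase-history rows come first, the KNN rows are placed after them, and when a (customer, article) pair appears twice only its first, higher-priority occurrence is kept. The helper name `merge_recommendations` is hypothetical.

```python
def merge_recommendations(history, knn_predictions):
    """Both arguments are lists of (customer_id, article_id) in priority order."""
    combined = history + knn_predictions  # KNN rows get the larger positions
    seen = set()
    merged = []
    for pair in combined:
        if pair not in seen:              # keep the first occurrence only
            seen.add(pair)
            merged.append(pair)
    return merged


history = [(1, 111), (1, 222)]
knn_rows = [(1, 222), (1, 333)]
merged = merge_recommendations(history, knn_rows)
```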

In [None]:
arti_age_trai = arti_age_trai[['index','customer_id','article_id']].sort_values('index')
arti_age_trai = arti_age_trai[['customer_id','article_id']]
arti_age_trai=arti_age_trai.reset_index(drop=True)
arti_age_trai

In [None]:
arti_age_trai = arti_age_trai.drop_duplicates(['customer_id','article_id'])
arti_age_trai

In [None]:
df_train = arti_age_trai.sort_index()

## Creating Submission File
* In this step we concatenate all article ids for each customer into a single space-separated string, as required by the submission rules
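
The grouping step can be sketched as follows, assuming the rows are already in priority order. The helper name and the short ids are made up; each id gets a space and a leading zero in front of it, mirroring the formatting used in this notebook.

```python
def build_predictions(pairs):
    """Concatenate each customer's article ids into one string, in the given order."""
    predictions = {}
    for customer, article in pairs:
        predictions[customer] = predictions.get(customer, '') + ' 0' + str(article)
    return predictions


grouped = build_predictions([(1, 111), (1, 222), (2, 333)])
```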

In [None]:
df_train.article_id = ' 0' + df_train.article_id.astype('str')
df_train

In [None]:
p_df_train = df_train[['customer_id','article_id']].to_pandas() #cudf does not support sum of str in group by and loop is expensive so we convert to pandas
p_df_train

In [None]:
temp = p_df_train.groupby('customer_id').sum().reset_index()
temp.columns = ['customer_id','prediction']
df_train=cudf.DataFrame(temp)

In [None]:
df_train

In [None]:
df_train.rename(columns={'customer_id':'customer_id_edited'},inplace=True)
df_train

In [None]:
submission = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv')
submission = submission[['customer_id']]
submission['customer_id_edited'] = submission['customer_id'].str[-16:].str.hex_to_int().astype('int64')
submission = submission.merge(df_train, on='customer_id_edited', how='left').fillna('')
del submission['customer_id_edited']
submission

In [None]:
submission.prediction = submission.prediction + top12
submission.prediction = submission.prediction.str.strip()
submission.prediction = submission.prediction.str[:131] # 10 * 12 = 120 plus 11 spaces is 131, we do this to only keep 12 predictions for each customer
submission.to_csv('submission.csv',index=False)
submission.head()
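
The padding and truncation in the last cell can be checked with a small plain-Python sketch. It assumes every article id is 10 characters long once the leading zero is added; the helper name `finalize` and the ids are made up.

```python
def finalize(prediction, top12, n_items=12, id_width=10):
    """Append the popular items, then cut the string down to `n_items` ids."""
    max_len = n_items * id_width + (n_items - 1)  # 12 * 10 + 11 spaces = 131
    return (prediction + top12).strip()[:max_len]


# Twelve made-up popular ids, formatted like the notebook's top12 string
popular = ''.join(' 0' + str(100000000 + i) for i in range(12))

cold_start = finalize('', popular)             # customer with no personal predictions
with_history = finalize(' 0999999999', popular)  # one personal prediction comes first
```

The cut only lands on an id boundary because all ids have the same width. Items that appear both in a customer's own list and in the popular list are not removed by this step, so a prediction string can contain repeats.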