# H&M Personalized Fashion Recommendations
<img src="https://images.unsplash.com/photo-1578983662508-41895226ebfb?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1211&q=80" width=600></img>


##### Goal:  To predict what articles each customer will purchase in the 7-day period immediately after the training data ends. 

<img src="https://images.unsplash.com/photo-1607160199580-1b0c9b736b66?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=2070&q=80" width=600></img>


##### Files:
*    articles.csv - detailed metadata for each article_id available for purchase <br>
*    customers.csv - metadata for each customer_id in dataset<br>
*    sample_submission.csv - a sample submission file in the expected format<br>
*    transactions_train.csv - the training data, consisting of the purchases made by each customer on each date, as well as additional information<br>

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import gc

In [None]:
# Importing the files
transactions_train = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")
submissions = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv")
articles = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/articles.csv")
customers = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/customers.csv")

**Apache Parquet** is a free and open-source column-oriented data storage format from the Apache Hadoop ecosystem. It works well with large volumes of complex data and is known for efficient compression and support for a wide variety of encodings.

In [None]:
# Converting to Parquet to shrink size and save memory                                                                                                       
articles.to_parquet('articles.parquet')
customers.to_parquet('customers.parquet')
transactions_train.to_parquet('transactions_train.parquet')
submissions.to_parquet('submission.parquet')

#### Cross Validation Scheme
The first step is to build a reliable local validation scheme. We then use the local validation score to evaluate experiment ideas and tune hyperparameters.

**Train Data**

The last day in the transactions dataframe is 2020-09-22, and the public LB contains 1 week of transactions after this date. To create a local validation that mimics Kaggle's train/test relationship, we train on all transactions up to and including 2020-09-15 and validate on the last week of the train data (2020-09-16 to 2020-09-22).
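
As a minimal sketch of this time-based split, the toy frame below borrows the column names of transactions_train.csv, but its rows are made up for illustration. The names `toy`, `cutoff`, `train_part` and `valid_part` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy transactions: column names mirror transactions_train.csv,
# the rows themselves are invented for illustration.
toy = pd.DataFrame({
    "t_dat": ["2020-09-10", "2020-09-15", "2020-09-16", "2020-09-22"],
    "customer_id": ["c1", "c2", "c1", "c3"],
    "article_id": [111, 222, 333, 444],
})
toy["t_dat"] = pd.to_datetime(toy["t_dat"])

cutoff = pd.Timestamp("2020-09-15")  # last day kept for training

train_part = toy[toy["t_dat"] <= cutoff]  # everything up to the cutoff
valid_part = toy[toy["t_dat"] > cutoff]   # the held-out final week

print(len(train_part), len(valid_part))
```

Splitting on a single cutoff date keeps every validation row strictly later than every training row, which is what prevents future purchases from leaking into training.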

In [None]:
# Creating a local cross validation scheme
train = pd.read_parquet('transactions_train.parquet')
train.t_dat = pd.to_datetime( train.t_dat )
train = train.loc[ train.t_dat <= pd.to_datetime('2020-09-15') ]

**Validation Data**

The code below creates a dataframe with only the customers who made purchases during the last week of train (the only customers that affect the competition metric).
It formats their purchases as space-separated strings, like the prediction column in sample_submission.csv.
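
The formatting step can be sketched on a few made-up rows. This version uses `str.zfill(10)` rather than prepending `'0'`, on the assumption that article IDs are 10 characters long once the leading zero is restored. The names `toy_valid` and `toy_truth` are hypothetical.

```python
import pandas as pd

# Made-up purchases from the validation week
toy_valid = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "article_id": [108775015, 108775044, 706016001],
})

# Collect each customer's purchases and join them into one space-separated
# string. zfill(10) restores the leading zero lost when article_id is read
# as an integer (assumption: IDs are 10 characters long).
toy_truth = (
    toy_valid.groupby("customer_id")["article_id"]
    .apply(lambda ids: " ".join(str(i).zfill(10) for i in ids))
    .reset_index(name="prediction")
)
print(toy_truth)
```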

In [None]:
valid = pd.read_parquet('transactions_train.parquet')
valid.t_dat = pd.to_datetime( valid.t_dat )
test = valid.loc[ valid.t_dat >= pd.to_datetime('2020-09-16') ]
test = test.groupby('customer_id').article_id.apply(list).reset_index()
test = test.rename({'article_id':'prediction'},axis=1)
test['prediction'] = test.prediction.apply(lambda x: ' '.join(['0'+str(k) for k in x]))

In [None]:
articles = pd.read_parquet("./articles.parquet")
customers = pd.read_parquet("./customers.parquet")
transactions = pd.read_parquet("./transactions_train.parquet")
submission = pd.read_parquet("./submission.parquet")


### **Understanding the data**

#### **Articles.csv**
##### The Articles file contains information about all the products that are sold. The article_id is the primary key. There are 105,542 products in the data, and 24 columns that describe each product.
* **Primary key**: 
    * 'article_id' <br>
<br>
* **Basic information about the Product and its Category**: 
    * product_code, prod_name, product_type_no, product_type_name, product_group_name  <br>
<br>
* **Visual Classification of the products**:
    * Graphical Appearance: <br>
    'graphical_appearance_no','graphical_appearance_name' <br>
    
    * Colour: <br>
    'colour_group_code', 'colour_group_name','perceived_colour_value_id', 'perceived_colour_value_name', 'perceived_colour_master_id', 'perceived_colour_master_name'

* **Classification of Products by Department:** 
    * Department: <br>
    'department_no', 'department_name', <br>

    * Shop Index : <br>
    'index_code', 'index_name', 'index_group_no', 'index_group_name'

    * Section : <br>
    'section_no', 'section_name'<br>
    
    * Garment group:  <br>
    'garment_group_no', 'garment_group_name', 'detail_desc'<br>


In [None]:
print(articles.shape)
articles.tail()

#### **Customers.csv**
##### The Customers file contains information about all customers. The customer_id is the primary key. There are 1,371,980 customers in the data, and 6 columns that describe each customer.
* **Primary key:** 
    * 'customer_id' <br>
<br>
* **Information about whether the customer has subscribed to the Fashion News Updates:**
    * FN, Active, club_member_status, fashion_news_frequency  <br>
<br>
* **Age** <br>

* **Postal_code**

In [None]:
print(customers.shape)
customers.head()

#### **Transactions.csv**
##### The Transactions file contains all past customer transactions, from 2018 to 2020. There are 31,788,324 transactions in the data, described by 5 columns.
##### It has `customer_id` and `article_id`, which are foreign keys into the customers and articles tables.
Besides these, each transaction also contains `sales_channel_id`.

* **Transaction Date:** 
    * 't_dat' <br>
<br>
* **Customer and Product information:**
    * customer_id, article_id, price  <br>
<br>
* **Sales Channel (Online/ Offline)** <br>
    * sales_channel_id
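
To illustrate how the foreign keys are used, here is a small sketch that joins made-up transactions to made-up article metadata on `article_id`. The frames `toy_articles`, `toy_transactions` and `joined` are hypothetical and only borrow column names from the real files.

```python
import pandas as pd

# Made-up article metadata and transactions
toy_articles = pd.DataFrame({
    "article_id": [1, 2],
    "product_type_name": ["Trousers", "Sweater"],
})
toy_transactions = pd.DataFrame({
    "customer_id": ["c1", "c2", "c1"],
    "article_id": [1, 1, 2],
    "price": [0.05, 0.05, 0.03],
})

# A left join keeps every transaction. validate= raises an error if
# article_id is not unique on the articles side, i.e. not a proper key.
joined = toy_transactions.merge(
    toy_articles, on="article_id", how="left", validate="many_to_one"
)
print(joined)
```

The same pattern would apply to `customer_id` and the customers table.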

In [None]:
print(transactions.shape)
transactions.head()

In [None]:
# Defining Helper Functions

def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

def unique_values(data):
    total = data.count()
    tt = pd.DataFrame(total)
    tt.columns = ['Total']
    uniques = []
    for col in data.columns:
        unique = data[col].nunique()
        uniques.append(unique)
    tt['Uniques'] = uniques
    return tt    

##### Check for Missing Values

In [None]:
# 416 products have missing descriptions (Only 0.4% of the data)
missing_data(articles)

In [None]:
# Only customer_id and postal_code are completely filled. age and fashion_news_frequency have around 1% missing data, FN has 65% missing and Active has 66% missing.
missing_data(customers)

In [None]:
# No missing data from transactions data
missing_data(transactions)

##### Unique Values

In [None]:
unique_values(articles)

We observe that some pairs of features that we would expect to have the same number of unique values, such as:
* product_type_no and product_type_name,  
* department_no and department_name,  
* section_no and section_name 

have different numbers of unique values, which suggests that some distinct codes share the same name.
Other pairs, such as:
* index_code and index_name,
* garment_group_no and garment_group_name

have the same number of unique values.
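
One way to test the "same name, different codes" hypothesis is to count distinct codes per name. The sketch below does this on invented section data. The values and the names `codes_per_name` and `shared_names` are hypothetical.

```python
import pandas as pd

# Invented section data in which two codes share one name
toy_articles = pd.DataFrame({
    "section_no": [10, 11, 12, 13],
    "section_name": ["Kids Basics", "Kids Basics", "Denim", "Swimwear"],
})

# Number of distinct codes attached to each name
codes_per_name = toy_articles.groupby("section_name")["section_no"].nunique()

# Names that map to more than one code
shared_names = codes_per_name[codes_per_name > 1]
print(shared_names)
```

Running the same groupby on the real `articles` frame for each number/name pair would show which names are reused.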

In [None]:
unique_values(customers)

In [None]:
unique_values(transactions)

We observe that not all the customers in the customer data have transactions in the transaction data, and not all articles are represented either. It is interesting that the number of distinct prices is quite small, given 31.8M transactions by 1.3M customers buying 104K different articles. The same holds for the dates: there are only 734 distinct dates. Let's check some stats here.

In [None]:
print(f"Percentage of articles present in the transactions data: {round(transactions.article_id.nunique()/articles.article_id.nunique()*100, 2)}%")
print(f"Percentage of customers present in the transactions data: {round(transactions.customer_id.nunique()*100/customers.customer_id.nunique(), 2)}%")

#### **Preprocessing**

In [None]:
# There is a mistake in data entry where NONE and None have been entered differently
print(customers.fashion_news_frequency.value_counts())
customers.loc[customers["fashion_news_frequency"] == "NONE","fashion_news_frequency"] = "None"
customers.fashion_news_frequency.value_counts()

#### **EDA**

In [None]:
# Product names shared by the most articles (variants), not the best sellers
articles.prod_name.value_counts()[0:10]

In [None]:
# Number of Product Types per Product Group
prod_grps = articles.groupby(["product_group_name"])["product_type_name"].nunique()
df = pd.DataFrame({'Product Group': prod_grps.index,
                   'Product Types': prod_grps.values
                  })
df = df.sort_values(['Product Types'], ascending=False)
plt.figure(figsize = (8,6))
plt.title('Number of Product Types per Product Group')
sns.set_color_codes("pastel")
s = sns.barplot(x="Product Types", y = 'Product Group', data=df)
locs, labels = plt.xticks()
plt.show()

In [None]:
# Number of Articles per Product Type
prod_types = articles.groupby(["product_type_name"])["article_id"].nunique()
df = pd.DataFrame({'Product Type': prod_types.index,
                   'Articles': prod_types.values
                  })
total_types = len(df['Product Type'].unique())
df = df.sort_values(['Articles'], ascending=False)[0:50]
plt.figure(figsize = (20,15))
plt.title(f'Number of Articles per Product Type (top 50 from total: {total_types})')
sns.set_color_codes("pastel")
s = sns.barplot(y = 'Product Type', x="Articles", data=df)
locs, labels = plt.xticks()
plt.show()

In [None]:
# Number of Articles per Department
temp = articles.groupby(["department_name"])["article_id"].nunique()
df = pd.DataFrame({'Department Name': temp.index,
                   'Articles': temp.values
                  })
total_depts = len(df['Department Name'].unique())
df = df.sort_values(['Articles'], ascending=False).head(50)
plt.figure(figsize = (16,15))
plt.title(f'Number of Articles per each Department (top 50 from total: {total_depts})')
sns.set_color_codes("pastel")
s = sns.barplot(y = 'Department Name', x="Articles", data=df)
locs, labels = plt.xticks()
plt.show()

In [None]:
# Number of Articles per Graphical Appearance Name
temp = articles.groupby(["graphical_appearance_name"])["article_id"].nunique()
df = pd.DataFrame({'Graphical Appearance Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False).head(50)
plt.figure(figsize = (16,15))
plt.title(f'Number of Articles per each Graphical Appearance Name')
sns.set_color_codes("pastel")
s = sns.barplot(y = 'Graphical Appearance Name', x = "Articles", data=df)
locs, labels = plt.xticks()
plt.show()

In [None]:
# Number of Articles per Index Name
temp = articles.groupby(["index_name"])["article_id"].nunique()
df = pd.DataFrame({'Index Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)
plt.figure(figsize = (8,6))
plt.title(f'Number of Articles per each Index Name')
sns.set_color_codes("bright")
s = sns.barplot(y = 'Index Name', x="Articles", data=df)
locs, labels = plt.xticks()
plt.show()

In [None]:
# Number of Articles per Colour Group Name
temp = articles.groupby(["colour_group_name"])["article_id"].nunique()
df = pd.DataFrame({'Colour Group Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)
plt.figure(figsize = (12,16))
plt.title(f'Number of Articles per each Colour Group Name')
sns.set_color_codes("pastel")
s = sns.barplot(y = 'Colour Group Name', x="Articles", data=df)
locs, labels = plt.xticks()
plt.show()

In [None]:
# Number of Articles per Perceived Colour Name
temp = articles.groupby(["perceived_colour_value_name"])["article_id"].nunique()
df = pd.DataFrame({'Perceived Colour Group Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)
plt.figure(figsize = (6,6))
plt.title(f'Number of Articles per each Perceived Colour Group Name')
sns.set_color_codes("pastel")
s = sns.barplot(y ='Perceived Colour Group Name', x="Articles", data=df)
locs, labels = plt.xticks()
plt.show()

In [None]:
# Distribution of Customers across Age
temp = customers.groupby(["age"])["customer_id"].count()
df = pd.DataFrame({'Age': temp.index,
                   'Customers': temp.values
                  })
df = df.sort_values(['Age'], ascending=False)
plt.figure(figsize = (20,6))
plt.title(f'Number of Customers per each Age')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Age', y="Customers", data=df)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

In [None]:
# Number of customers and their Fashion News Frequency
print(customers.fashion_news_frequency.value_counts())
temp = customers.groupby(["fashion_news_frequency"])["customer_id"].count()
df = pd.DataFrame({'Fashion News Frequency': temp.index,
                   'Customers': temp.values
                  })                 
df = df.sort_values(['Customers'], ascending=False)
plt.figure(figsize = (6,6))
plt.title(f'Number of Customers per each Fashion News Frequency')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Fashion News Frequency', y="Customers", data=df)
locs, labels = plt.xticks()
plt.show()

In [None]:
# Daily Sales (Transactions)
df = transactions.groupby(["t_dat"])["article_id"].count().reset_index()
df["t_dat"] = df["t_dat"].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
df.columns = ["Date", "Transactions"]
fig, ax = plt.subplots(1, 1, figsize=(20,6))
plt.plot(df["Date"], df["Transactions"], color="Darkblue")
plt.xlabel("Date")
plt.ylabel("Transactions")
plt.title(f"Daily Sales (Transactions)")
plt.show()

In [None]:
# Distribution of the transactions over time
df = transactions.groupby(["t_dat", "sales_channel_id"])["article_id"].count().reset_index()
df["t_dat"] = df["t_dat"].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
df.columns = ["Date", "Sales Channel Id", "Transactions"]
fig, ax = plt.subplots(1, 1, figsize=(16,6))
g1 = ax.plot(df.loc[df["Sales Channel Id"]==1, "Date"], df.loc[df["Sales Channel Id"]==1, "Transactions"], label="Sales Channel 1", color="Darkblue")
g2 = ax.plot(df.loc[df["Sales Channel Id"]==2, "Date"], df.loc[df["Sales Channel Id"]==2, "Transactions"], label="Sales Channel 2", color="Magenta")
plt.xlabel("Date")
plt.ylabel("Transactions")
ax.legend()
plt.title(f"Transactions per day, grouped by Sales Channel")
plt.show()

In [None]:
# Distribution of Number of Unique articles sold each day
df = transactions.groupby(["t_dat", "sales_channel_id"])["article_id"].nunique().reset_index()
df["t_dat"] = df["t_dat"].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
df.columns = ["Date", "Sales Channel Id", "Unique Articles"]
fig, ax = plt.subplots(1, 1, figsize=(16,6))
g1 = ax.plot(df.loc[df["Sales Channel Id"]==1, "Date"], df.loc[df["Sales Channel Id"]==1, "Unique Articles"], label="Sales Channel 1", color="Blue")
g2 = ax.plot(df.loc[df["Sales Channel Id"]==2, "Date"], df.loc[df["Sales Channel Id"]==2, "Unique Articles"], label="Sales Channel 2", color="Green")
plt.xlabel("Date")
plt.ylabel("Unique Articles / Day")
ax.legend()
plt.title(f"Unique articles per day, grouped by Sales Channel")
plt.show()

#### **Modelling**

In [None]:
# Defining the scoring metrics

def apk(actual, predicted, k=10):
    """
    Computes the average precision at k.
    This function computes the average precision at k between two lists of
    items.
    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The average precision at k over the input lists
    """
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=12):
    """
    Computes the mean average precision at k.
    This function computes the mean average precision at k between two lists
    of lists of items.
    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted 
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The mean average precision at k over the input lists
    """
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])
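
As a worked example of the metric, the sketch below restates average precision at k in compact form and evaluates it on a tiny made-up case. The function name `average_precision_at_k` and the item labels are hypothetical; the logic is intended to match `apk` above. With hits at ranks 1 and 3, the precisions at those ranks are 1/1 and 2/3, and their sum is divided by min(3, 4) = 3.

```python
def average_precision_at_k(actual, predicted, k):
    """Compact restatement of AP@k, for illustration only."""
    hits, score, seen = 0, 0.0, set()
    for rank, item in enumerate(predicted[:k], start=1):
        # Count an item only the first time it appears in the predictions
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank  # precision at this rank
        seen.add(item)
    return score / min(len(actual), k) if actual else 0.0

actual = ["a", "b", "c"]          # items the customer really bought
predicted = ["a", "x", "b", "y"]  # ranked recommendations

score = average_precision_at_k(actual, predicted, k=4)
print(round(score, 4))  # (1/1 + 2/3) / 3
```

Because earlier ranks contribute larger precision terms, placing correct items first raises the score.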

### **Top Ranking Recently Bought Items based Model:**
* If a customer has past purchases, pick their most recent buys.
* If a customer has no past purchases, fall back to the most frequently bought articles.
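
The two rules above can be sketched on toy data as follows. This is an illustrative variant, not the notebook's exact code: it removes duplicate articles, takes the fallback list from all toy rows rather than from the last training day, and uses `K = 3` instead of 12 to stay readable. The names `toy`, `popular` and `recommend` are hypothetical.

```python
import pandas as pd

K = 3  # the competition uses 12; 3 keeps the toy example readable

# Made-up purchase history
toy = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c2", "c2", "c3"],
    "article_id": [10, 13, 12, 12, 13, 12],
    "t_dat": pd.to_datetime([
        "2020-09-01", "2020-09-10", "2020-09-12",
        "2020-09-13", "2020-09-14", "2020-09-14",
    ]),
})

# Fallback list: most frequently bought articles overall
popular = toy["article_id"].value_counts().index.tolist()[:K]

def recommend(customer_id):
    """Most recent distinct purchases first, padded with popular articles."""
    history = toy[toy["customer_id"] == customer_id]
    history = history.sort_values("t_dat", ascending=False)
    recent = history["article_id"].drop_duplicates().tolist()
    padded = recent + [a for a in popular if a not in recent]
    return padded[:K]

print(recommend("c1"))  # customer with history
print(recommend("c9"))  # unseen customer gets the popular list
```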

In [None]:
transactions2 = train.sort_values(["customer_id", "t_dat"], ascending=False)
# Capturing the 12 most frequently bought articles on the last day of the training split
last_date = transactions2.t_dat.max()
most_frequent_articles = list(transactions2.loc[transactions2.t_dat==last_date].article_id.value_counts()[0:12].index)
art_list = []
for art in most_frequent_articles:
    art = "0"+str(art)
    art_list.append(art)
art_str = " ".join(art_list)
print("Frequent articles bought recently: ", art_str)

In [None]:
agg_df = transactions2.groupby(["customer_id"])["article_id"].agg(lambda x: str(x.values[0:12])[1:-1]).reset_index()

def padding_articles(x):
    if x:
        xl = x.split()
        x = []
        for xi in xl:
            x.append("0"+xi)
        dimm_x = len(x)
        if dimm_x < 12:
            x.extend(art_list[:12-dimm_x])
        return(" ".join(x))

agg_df["article_id"] = agg_df["article_id"].apply(lambda x: padding_articles(x))

recommendations = agg_df.merge(submission[["customer_id"]], how="right")
recommendations.columns = ["customer_id", "prediction"]
recommendations.loc[recommendations.prediction.isna(), ["prediction"]] = art_str
recommendations.head()

In [None]:
# Keep only the customers who bought something in the validation week, in the same order as `test`
sub = recommendations.set_index('customer_id').loc[test.customer_id].reset_index()
round(mapk(test.prediction.str.split(), sub.prediction.str.split(), k=12), 3)