# H&M Competition: A Simple Baseline  
I did [EDA](https://www.kaggle.com/code/kaicho0504/h-m-eda-2022-4-20) and found that the popular product types do not differ much from year to year. In addition, I assume that products popular in a given season sell well in that season. Therefore I use the data from September 2018, 2019, and 2020.  

1. Count sales per product type in the past data (September 2018 and 2019) and select the candidate product types.  

2. Count sales per product in the data closest to the target period (September 2020) and recommend to users the best-selling products whose product type is among the candidates.  
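
The two steps above can be sketched on a tiny made-up table. This is only an illustration under stated assumptions: the `toy` data, the `N_TYPES` cutoff, and the variable names are hypothetical and are not the values used in the notebook (which keeps 12 types per year).

```python
import pandas as pd

# Toy stand-in for the merged transactions table; all values are made up.
toy = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
    "product_type_name": ["Trousers", "Trousers", "Dress", "Trousers", "Sweater",
                          "Sweater", "Trousers", "Trousers", "Dress", "Sweater"],
    "article_id": [1, 1, 2, 3, 4, 4, 5, 5, 6, 7],
})

N_TYPES = 2  # hypothetical cutoff for this toy example

# Step 1: candidate product types, ranked by sales in the past years.
past = toy[toy["year"].isin([2018, 2019])]
candidate_types = past["product_type_name"].value_counts().head(N_TYPES).index.tolist()

# Step 2: best-selling article of each candidate type in the most recent period.
recent = toy[toy["year"] == 2020]
recommended = []
for product_type in candidate_types:
    counts = recent.loc[recent["product_type_name"] == product_type, "article_id"].value_counts()
    if not counts.empty:  # skip types with no recent sales
        recommended.append(int(counts.index[0]))

print(candidate_types)
print(recommended)
```

The recent period is only used to pick concrete articles; the product types themselves come from the earlier years.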

## Import Libraries  

In [None]:
import os
import warnings

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

warnings.filterwarnings("ignore")
%matplotlib inline

## Load Data  

In [None]:
DIR = "../input/h-and-m-personalized-fashion-recommendations"
articles = pd.read_csv(os.path.join(DIR, "articles.csv"))
customers = pd.read_csv(os.path.join(DIR, "customers.csv"))
transactions = pd.read_csv(os.path.join(DIR, "transactions_train.csv"))
submission = pd.read_csv(os.path.join(DIR, "sample_submission.csv"))

print(f"articles data shape: {articles.shape}")
print(f"customers data shape: {customers.shape}")
print(f"transactions data shape: {transactions.shape}")
print(f"sample submission shape: {submission.shape}")

display(articles.head())
display(customers.head())
display(transactions.head())
display(submission.head())

## Preprocessing data  
**add columns**  
- year  
- month  
- day  
- binned age (`bin_age`)  

**merge data**
- transactions + product_type_name(articles) + bin_age(customers)  

**extract data**
1. transactions in September 2018.  
2. transactions in September 2019.  
3. transactions in September 2020.  

In [None]:
# add columns to transactions
transactions["t_dat"] = pd.to_datetime(transactions["t_dat"])

# add year, month, day
transactions["year"] = transactions["t_dat"].dt.year
transactions["month"] = transactions["t_dat"].dt.month
transactions["day"] = transactions["t_dat"].dt.day

# add binning age
bins = [i for i in range(10, 101, 10)]
labels = [i for i in range(1, len(bins))]

customers["bin_age"] = pd.cut(customers["age"], bins=bins, labels=labels)

# merge data
transactions = transactions.merge(articles[["article_id", "product_type_name"]],
                                  on="article_id")
transactions = transactions.merge(customers[["customer_id", "bin_age"]],
                                  on="customer_id")

# extract data
transactions_2018sep = transactions.query("year == 2018 and month == 9").copy().reset_index(drop=True)
transactions_2019sep = transactions.query("year == 2019 and month == 9").copy().reset_index(drop=True)
transactions_2020sep = transactions.query("year == 2020 and month == 9").copy().reset_index(drop=True)

# to save memory
del transactions

display(transactions_2018sep)
display(transactions_2019sep)
display(transactions_2020sep)

In [None]:
# select candidate product types by using 2018 and 2019 data
# count sold items per product type
prod_count_2018 = transactions_2018sep["product_type_name"].value_counts()
prod_count_2019 = transactions_2019sep["product_type_name"].value_counts()

# initialize a zero-filled dataframe to store the counts
prod_count = pd.DataFrame(
    0,
    index=[2018, 2019],
    columns=articles["product_type_name"].unique(),
)

# assign count values to dataframe
for year, df in zip([2018, 2019], [prod_count_2018, prod_count_2019]):
    for prod in df.index:
        prod_count.loc[year, prod] = df.loc[prod]

display(prod_count)

In [None]:
# extract the top 12 product types in each year
prod_count_2018 = prod_count.loc[2018]
prod_count_2019 = prod_count.loc[2019]

# descending sort
indices_2018 = prod_count_2018.values.argsort()[::-1]
indices_2019 = prod_count_2019.values.argsort()[::-1]

# extract candidate product type names and values
cadidate_2018 = prod_count_2018.index[indices_2018][:12]
values_2018 = prod_count_2018.loc[cadidate_2018]

cadidate_2019 = prod_count_2019.index[indices_2019][:12]
values_2019 = prod_count_2019.loc[cadidate_2019]

print("2018 candidates and values")
for prod in cadidate_2018:
    print(f"{prod}: {values_2018.loc[prod]}")

print("\n2019 candidates and values")
for prod in cadidate_2019:
    print(f"{prod}: {values_2019.loc[prod]}")

In [None]:
# concatenate candidates and eliminate duplication
# (use a list, not a set: recent pandas versions reject a set as an indexer or index)
cadidate = sorted(set(cadidate_2018.tolist() + cadidate_2019.tolist()))

# add 2018 values and 2019 values
prod_cadidate = pd.DataFrame(
    data=(prod_count_2018[cadidate].values+prod_count_2019[cadidate].values).reshape(-1, 1),
    index=cadidate,
    columns=["num of sold product"],
)

# descending sort to prioritize
prod_cadidate = prod_cadidate.sort_values(by="num of sold product", ascending=False)
cadidate = prod_cadidate.index.tolist()

print(cadidate)
display(prod_cadidate)

## Select recommended product  
A user is unlikely to buy again a product that the same user bought recently, so such products should not be recommended to that user.  

First, extract the best-selling product in each product type.  

In [None]:
# list to store top product
top_products = []

for prod in cadidate:
    top_product_count = transactions_2020sep.query("product_type_name == @prod")["article_id"].value_counts()
    top_products.append(top_product_count.index[0])

    # display the five best-selling products of this product type
    print(f"{prod}:")
    for article_id, count in top_product_count.head(5).items():
        print(f"{article_id}: {count}")

    print("\n")

print(top_products)

No single product sells far more than the others, so I should select several candidates in each product type.  
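
Selecting several candidates per product type could look like the following sketch. The `sales` table and `TOP_K` are made-up assumptions for illustration; the notebook itself keeps only one product per type.

```python
import pandas as pd

# Made-up recent sales: one row per sold item.
sales = pd.DataFrame({
    "product_type_name": ["Trousers"] * 6 + ["Dress"] * 6,
    "article_id": [11, 11, 11, 12, 12, 13, 21, 21, 21, 22, 22, 23],
})

TOP_K = 2  # hypothetical number of candidates kept per product type

# Count sales per (product type, article) and sort best sellers first within each type.
counts = (
    sales.groupby(["product_type_name", "article_id"])
    .size()
    .reset_index(name="n_sold")
    .sort_values(["product_type_name", "n_sold"], ascending=[True, False])
)

# Keep the first TOP_K rows of each product type.
top_k = counts.groupby("product_type_name").head(TOP_K)
top_k_by_type = top_k.groupby("product_type_name")["article_id"].apply(list).to_dict()
print(top_k_by_type)
```

Ties in `n_sold` are not handled specially here; a real version would need an explicit tie-breaking rule.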

In [None]:
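# concatenate article ids into one space-separated string
# (article_id was read as an integer, so restore the leading zero by padding to 10 digits)
top_products_str = " ".join(str(a).zfill(10) for a in top_products[:12])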
submission["prediction"] = top_products_str
display(submission)
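
The note at the start of this section, that products a user bought recently should not be recommended to the same user, is not applied in the submission above, which gives every user the same list. A minimal sketch of that filter follows; the function name `recommend_excluding_purchased`, the article ids, and the customer ids are all hypothetical.

```python
def recommend_excluding_purchased(popular, purchased, k=12):
    """Return up to k popular article ids that the customer has not bought recently."""
    bought = set(purchased)
    return [article for article in popular if article not in bought][:k]


# Made-up popularity ranking (best seller first) and recent purchases per customer.
popular = [101, 102, 103, 104]
recent_purchases = {"cust_a": [102], "cust_b": []}

predictions = {
    customer: " ".join(str(a) for a in recommend_excluding_purchased(popular, items, k=3))
    for customer, items in recent_purchases.items()
}
print(predictions)
```

The popularity list needs more than `k` entries, otherwise customers with many recent purchases end up with fewer than `k` recommendations.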

## Discussion  
I didn't expect to get a good score with this baseline at first, but Kaggle is a big community, so I hope someone will help and we can discover a good method together.  

My thoughts and what I plan to do next are as follows.  
- I tried to apply a similar method to each customer individually, but it would take more than a month to run. So I'll group customers into classes and apply this method to each class.  
- I saw LGBMRanker (from LightGBM, which has a scikit-learn-style API) used in another user's notebook. I'll try this model to improve the score.  
- I didn't use the image data or the natural language data, but features could be extracted from them with a CNN, ViT, BERT, etc. I am still thinking about how to use such features. Execution time must be considered because this dataset is very large.  
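
Grouping customers and applying the popularity method per group can avoid a slow per-customer loop. The sketch below uses the age bin as the group, which is only one possible choice; the `tx` table, the integer `bin_age` values, and `TOP_K` are made up for illustration.

```python
import pandas as pd

# Made-up transactions that already carry the customer's age bin.
tx = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "d"],
    "bin_age": [1, 1, 1, 3, 3, 3],
    "article_id": [10, 11, 10, 20, 20, 21],
})

TOP_K = 2  # hypothetical number of recommendations per group

# Count sales per (age bin, article), best sellers first within each bin.
seg_counts = (
    tx.groupby(["bin_age", "article_id"])
    .size()
    .reset_index(name="n_sold")
    .sort_values(["bin_age", "n_sold"], ascending=[True, False])
)

# One space-separated prediction string per age bin.
seg_top = (
    seg_counts.groupby("bin_age")
    .head(TOP_K)
    .groupby("bin_age")["article_id"]
    .apply(lambda s: " ".join(s.astype(str)))
)

# Map each customer to the prediction of their age bin without looping over customers.
customers_seg = tx[["customer_id", "bin_age"]].drop_duplicates()
preds = customers_seg.assign(prediction=customers_seg["bin_age"].map(seg_top))
print(preds)
```

The work scales with the number of groups rather than the number of customers, since the per-customer step is a single `map`.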

Thank you for reading my notebook. I'm looking forward to your advice!