# H&M - Exploration & Baseline

### Data (CSV files)

In [None]:
import pandas as pd

df_articles = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/articles.csv', dtype={'article_id': str})
df_customers = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/customers.csv')
df_train = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv', dtype={'article_id': str})
df_sample_submission = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv',
                                   dtype={'article_id': str})

In [None]:
df_articles.info()

In [None]:
df_customers.info()

In [None]:
df_train.info()

In [None]:
df_sample_submission.info()

In [None]:
n = len(list(set(df_train.customer_id.unique()) - set(df_sample_submission.customer_id.unique())))
print('Customers that have bought at least once during training and do not appear in the submission file:', n)

In [None]:
n = len(list(set(df_sample_submission.customer_id.unique()) - set(df_train.customer_id.unique())))
print('Customers that have not bought during training and appear in the submission file:', n)

Every customer in the training dataset also appears in the submission file. However, the submission file contains some customers (~0.7%) that do not appear in the training dataset, so we will need to deal with a few customers with no historical transactions.
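A minimal sketch of how the share of customers without history could be computed directly. The frames `toy_train` and `toy_submission` are hypothetical stand-ins for `df_train` and `df_sample_submission`, not the real data:

```python
import pandas as pd

# Hypothetical toy data standing in for df_train and df_sample_submission
toy_train = pd.DataFrame({'customer_id': ['a', 'a', 'b', 'c']})
toy_submission = pd.DataFrame({'customer_id': ['a', 'b', 'c', 'd']})

# Customers in the submission file without any transaction in training
has_history = toy_submission['customer_id'].isin(toy_train['customer_id'])
cold_start_share = 1 - has_history.mean()
print(f'Customers without history: {cold_start_share:.1%}')
```

Using `isin` avoids building two Python sets and gives a boolean mask that can be reused later to select the customers who need a default recommendation.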


#### Adding sales information to each article

In [None]:
sales_product = df_train.article_id.value_counts()

In [None]:
df_articles = df_articles.merge(sales_product.rename('sales'), left_on='article_id', right_index=True, how='left')

In [None]:
df_articles['sales'] = df_articles['sales'].fillna(value=0)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


sns.set(rc={'figure.figsize':(20.7,12.27)})

sns.violinplot(data=df_articles, x='index_group_name', y='sales')
plt.title('Sales distribution per index group')

In [None]:
sns.boxplot(data=df_articles, x='index_group_name', y='sales', showfliers=False)
plt.title('Sales distribution per index group - no outliers')

### Adding article information to transactions data

In [None]:
df_train = df_train.merge(on='article_id', 
                          how='left', 
                          right=df_articles[['article_id', 'index_group_name', 'product_group_name']])

In [None]:
df_train['t_dat'] = pd.to_datetime(df_train['t_dat'])
print(f'Training period: {df_train.t_dat.min().date()} to {df_train.t_dat.max().date()}')

On the Evaluation section we can read:
> For each customer_id observed in the training data, you may predict up to 12 labels for the article_id, which is the predicted items a customer will buy in the next 7-day period after the training time period.


This means that we are going to predict the customers' purchases for the 7 days from 2020-09-23 to 2020-09-29. Can we use seasonality to make more accurate predictions? It looks worth taking into account.
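As a rough sketch, the prediction window can be derived from the last training date. The date 2020-09-22 is an assumption here (it stands in for `df_train.t_dat.max()`), and both endpoints of the window are counted:

```python
import pandas as pd

# Assumption: the last training date, i.e. df_train.t_dat.max() in this notebook
last_train_day = pd.Timestamp('2020-09-22')

# The 7-day test window starts the day after training ends
test_start = last_train_day + pd.Timedelta(days=1)
test_end = last_train_day + pd.Timedelta(days=7)
test_days = pd.date_range(test_start, test_end, freq='D')
print(test_start.date(), test_end.date(), len(test_days))
```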

In [None]:
df_train['month_year'] = df_train['t_dat'].dt.to_period('M').astype(str)   # Adding YYYY-MM

In [None]:
sns.lineplot(data=df_train.month_year.value_counts().sort_index())
plt.title('Monthly sales during training period')

I wonder if the distribution of sales across categories in September differs from the distribution over the whole period.

In [None]:
# Keep the September transactions of every year covered by the training data
df_september = df_train.query('month_year in ["2018-09", "2019-09", "2020-09"]')

In [None]:
# Use the same category order for both series so that the paired bars line up
train_share = df_train.index_group_name.value_counts(normalize=True)
september_share = df_september.index_group_name.value_counts(normalize=True).reindex(train_share.index)
september_share.plot(kind='bar', color='skyblue', position=1, width=.25, label='September')
train_share.plot(kind='bar', color='red', position=0, width=.25, label='train')
plt.title('Index group sales distribution - Train VS September')
plt.legend()

In [None]:
# Use the same category order for both series so that the paired bars line up
train_share = df_train.product_group_name.value_counts(normalize=True)
september_share = df_september.product_group_name.value_counts(normalize=True).reindex(train_share.index)
september_share.plot(kind='bar', color='skyblue', position=1, width=.25, label='September')
train_share.plot(kind='bar', color='red', position=0, width=.25, label='train')
plt.title('Product group sales distribution - Train VS September')
plt.legend()
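The bar charts can be complemented with the numeric difference in share per category. This is only a sketch on hypothetical toy frames (`toy_train`, `toy_september`); the category names and counts are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data standing in for df_train and df_september
toy_train = pd.DataFrame({'index_group_name': ['Ladieswear'] * 6 + ['Menswear'] * 2 + ['Sport'] * 2})
toy_september = pd.DataFrame({'index_group_name': ['Ladieswear'] * 2 + ['Menswear'] * 1 + ['Sport'] * 1})

# Building the frame from two Series aligns them on the category name
comparison = pd.DataFrame({
    'train': toy_train['index_group_name'].value_counts(normalize=True),
    'september': toy_september['index_group_name'].value_counts(normalize=True),
}).fillna(0)
comparison['difference'] = comparison['september'] - comparison['train']
print(comparison.sort_values('difference'))
```

A positive difference means the category takes a larger share of sales in September than over the whole period.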

### Customer interactions with articles
Are products purchased more than once by the same customers?

In [None]:
df_customer_articles_count = df_train.groupby(['customer_id', 'article_id'])['article_id'].count()

In [None]:
pd.set_option('display.float_format', lambda x: '%.5f' % x)
df_customer_articles_count.describe()

Wow, one article has been bought 570 times by the same customer. That person must love this product :).

In [None]:
sns.histplot(df_customer_articles_count[df_customer_articles_count < 5], stat="percent", discrete=True)
plt.title('Number of times a product is bought by the same customer')

There is a fair amount of repurchases. We are probably interested in recommending products that the customer has already purchased before.
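To put a number on "a fair amount", one option is the share of (customer, article) pairs that were bought more than once. A minimal sketch on hypothetical toy transactions (`toy_train` stands in for `df_train`):

```python
import pandas as pd

# Hypothetical toy transactions standing in for df_train
toy_train = pd.DataFrame({
    'customer_id': ['a', 'a', 'a', 'b', 'b', 'c'],
    'article_id':  ['x', 'x', 'y', 'x', 'z', 'z'],
})

# Purchases per (customer, article) pair, as in df_customer_articles_count
pair_counts = toy_train.groupby(['customer_id', 'article_id']).size()

# Share of pairs bought more than once
repurchase_share = (pair_counts > 1).mean()
print(f'Pairs bought more than once: {repurchase_share:.1%}')
```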

In [None]:
df_enriched = df_customer_articles_count[df_customer_articles_count < 5].rename('purchases').reset_index(level=[0, 1])

In [None]:
df_enriched = df_enriched.merge(on='article_id', 
                                how='left', 
                                right=df_articles[['article_id', 'index_group_name']])

In [None]:
df_enriched = df_enriched.set_index('index_group_name')

In [None]:
df_enriched.groupby('index_group_name').purchases\
                                      .value_counts(normalize=True)\
                                      .plot.bar(title='Number of times a product is bought by the same customer')

It looks like more or less all index groups share the same distribution of repurchases.


#### Baseline

Let's build a baseline model to see how this competition works. We are going to use some of the insights explored above.

First, let's get the most popular products of September. This is probably not the best solution, since it includes September sales from previous years and some of those products may no longer be popular. Anyway, this is just a baseline.

In [None]:
most_popular_september = df_september['article_id'].value_counts()
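One possible way to address the concern about stale products is to rank popularity over the last days of training only. The sketch below uses hypothetical toy transactions, and the 7-day window is an arbitrary assumption rather than a tuned value:

```python
import pandas as pd

# Hypothetical toy transactions standing in for df_train
toy_train = pd.DataFrame({
    't_dat': pd.to_datetime(['2018-09-20', '2018-09-21', '2018-09-25',
                             '2020-09-20', '2020-09-21', '2020-09-22']),
    'article_id': ['old', 'old', 'old', 'new', 'new', 'old'],
})

# Keep only the last `window_days` days of training (7 is an arbitrary assumption)
window_days = 7
cutoff = toy_train['t_dat'].max() - pd.Timedelta(days=window_days)
recent = toy_train[toy_train['t_dat'] > cutoff]

most_popular_recent = recent['article_id'].value_counts()
print(most_popular_recent)
```

In the toy data `old` has more sales overall, but `new` ranks first once only recent transactions are counted.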

Let's recommend the products each customer has repurchased the most. Then, we will fill the remaining recommendations with the most popular products of September.

In [None]:
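def get_recommendation(most_popular, customer_sales_count):
    # .loc returns a Series when the customer has bought a single distinct article
    if isinstance(customer_sales_count, pd.Series):
        recommendation = [customer_sales_count.article_id]
    else:
        recommendation = list(customer_sales_count.sort_values(by='purchases', ascending=False).article_id[0:12])
    # Fill the remaining slots with popular articles that are not already recommended
    for article_id in most_popular.index:
        if len(recommendation) >= 12:
            break
        if article_id not in recommendation:
            recommendation.append(article_id)
    return recommendation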

In [None]:
df_customer_articles_count = df_customer_articles_count.rename('purchases').reset_index(level=[1])

In [None]:
# Customers without purchase history, kept in a set for fast membership tests
newcomers = set(df_sample_submission.customer_id.unique()) - set(df_train.customer_id.unique())
default_recc = set(newcomers)

How fast is our recommendation function?

In [None]:
%%timeit
get_recommendation(most_popular_september,
                   df_customer_articles_count.loc['0000f1c71aafe5963c3d195cf273f7bfd50bbf17761c9199e53dbb81641becd7'])

Fair enough. I will not spend time on optimizing it since it is already acceptable.
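If the per-customer lookup ever became a bottleneck, one alternative would be to compute every customer's top articles in a single grouped pass. This is only a sketch on hypothetical toy data; `toy_counts`, `most_popular`, `n_items` and `pad_with_popular` are illustrative names, not part of the notebook:

```python
import pandas as pd

# Hypothetical toy counts standing in for df_customer_articles_count
toy_counts = pd.DataFrame({
    'customer_id': ['a', 'a', 'a', 'b'],
    'article_id':  ['x', 'y', 'z', 'x'],
    'purchases':   [1, 3, 2, 1],
})
most_popular = ['p1', 'p2', 'p3']  # hypothetical popularity ranking
n_items = 3                        # 12 in the competition; 3 keeps the toy example readable

# Each customer's most repurchased articles, computed for all customers at once
top_per_customer = (toy_counts.sort_values('purchases', ascending=False)
                              .groupby('customer_id')['article_id']
                              .apply(lambda s: list(s.head(n_items))))

def pad_with_popular(items):
    # Fill remaining slots with popular articles that are not already recommended
    extra = [a for a in most_popular if a not in items]
    return (items + extra)[:n_items]

recommendations = top_per_customer.apply(pad_with_popular)
print(recommendations)
```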

In [None]:
df_sample_submission_original = df_sample_submission.copy()

In [None]:
from tqdm import tqdm

tqdm.pandas()


df_sample_submission['prediction'] = df_sample_submission\
                                        .progress_apply(lambda x: get_recommendation(most_popular_september,
                                                                                     df_customer_articles_count.loc[x.customer_id])
                                                                  if x.customer_id not in default_recc 
                                                                  else list(most_popular_september.index[0:12]),
                                                        axis=1)

#### Preparing data for submission

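The submission file expects each prediction as a single string of article ids separated by spaces. We could integrate this formatting step into the recommendation function if we wanted to avoid looping over the customers twice.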

In [None]:
def prepare_list_submission(recommendations):
    # Join the article ids into one space-separated string
    return ' '.join(map(str, recommendations))

In [None]:
df_sample_submission['prediction'] = df_sample_submission.progress_apply(lambda x: prepare_list_submission(x.prediction),
                                                                         axis=1)
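For reference, a sketch of what merging both steps into one function might look like. The name `format_prediction`, its parameters and the article ids below are hypothetical and only serve as illustration:

```python
def format_prediction(article_ids, most_popular, n_items=12):
    """Pad with popular articles and return a space-separated string."""
    items = list(article_ids)[:n_items]
    for article_id in most_popular:
        if len(items) >= n_items:
            break
        # Skip popular articles the customer already has in the list
        if article_id not in items:
            items.append(article_id)
    return ' '.join(items)

# Hypothetical article ids; n_items=3 keeps the example short
example = format_prediction(['0108775015'],
                            ['0706016001', '0108775015', '0372860001'],
                            n_items=3)
print(example)
```

With a function like this, a single pass over the customers would produce the final submission strings.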

In [None]:
df_sample_submission.iloc[0].prediction

In [None]:
df_sample_submission_original.iloc[0].prediction

Looks **OK**! Let's save it and submit.

In [None]:
df_sample_submission.to_csv('submission.csv', index=False)

Thanks for reading.