In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# I. Overview
## 1. Problem
Recent work often reports that deep models (deep learning, tree-based models, ...) outperform traditional models in recommendation. But looking at the [best-scoring notebooks](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/code?competitionId=31254&sortBy=scoreDescending), I was surprised that traditional approaches (trending items, most purchased items for each group, ...) outperform deep models on this data. I also tried building some deep models in [this notebook](https://www.kaggle.com/astrung/recbole-lstm-sequential-for-recomendation-tutorial) and [this notebook](https://www.kaggle.com/astrung/lstm-sequential-modelwith-item-features-tutorial), but they still score lower than the traditional approaches. Isn't that strange?

I think our data may differ in some way from the datasets used in publications, so I started this notebook to investigate. My hypothesis: our data contains too many customers whose preferences cannot be predicted from past data, or the data per user is too small for deep models.

EDA on customers shows that most users in the test data are cold-start users, or users who have been inactive for a long time, so their data is either insufficient or doesn't reflect their current interests:

* Inactive users: In our test data (1371980 users), **509256 users (37% of all users) have been inactive for a year**: they have not bought anything since before 2020. In addition, **373171 users (27% of all users) have been inactive for the last 3 months** (Sep, Aug, July) or more in 2020. In other words, **882427 users (64% of all users) in our test data have been inactive**: they left a long time ago and then reappear in our test set. In most scenarios, customers leave a system because their interests or priorities have changed, so their past data isn't enough to infer what they want now. Example: I left H&M a year ago because I changed my fashion style, but now I have seen a sale-off/advertising campaign/hot trend at H&M, so I come back. For this type of user, my advice is to avoid relying directly on their past data for recommendation. Use items from sale-off/advertising campaigns/hot trends instead, because these are likely the reason they came back. The longer a user has been absent from the past data, the harder it is to predict their interests correctly.

* Cold-start users: In recommendation, cold-start users are users who have no past data, or whose past data is too small to infer correct recommendations. In this notebook, I show that most users have very little transaction data. In detail, of the 862724 users who have transactions in 2020, 547161 users (63% of them) have fewer than 10 transactions. In other words, we have too many cold-start users in the test data, and deep models don't work for this type of user. Deep models only work for users with a large number of transactions, not for cold-start or inactive users.

* Another challenge is the low transaction frequency. On average, each user buys only 4 items a month, or about 1 item a week, so we need to pick out the one correct item for the single test week. This is very challenging, but it is usual in practice.
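
As a quick check, the percentages quoted in the bullets above follow directly from the counts. The variable names in this small sketch are only illustrative.

```python
# Counts quoted in the text above.
n_test_users = 1371980
n_inactive_year = 509256        # no transactions in 2020
n_inactive_3_months = 373171    # bought in 2020, but not in Jul-Sep
n_few_transactions = 547161     # fewer than 10 transactions in 2020

# Users with at least one transaction in 2020.
n_users_2020 = n_test_users - n_inactive_year
n_inactive_total = n_inactive_year + n_inactive_3_months

print(n_users_2020)                                    # 862724
print(n_inactive_total)                                # 882427
print(round(100 * n_inactive_year / n_test_users))     # 37
print(round(100 * n_inactive_3_months / n_test_users)) # 27
print(round(100 * n_inactive_total / n_test_users))    # 64
print(round(100 * n_few_transactions / n_users_2020))  # 63
```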

## 2. Solution
Deep models only work for active, non-cold-start users. But in our test data, only 9% of users satisfy this condition. So should we give up on deep models?

**In this case, my advice is to use a hybrid approach. Use sale-off/advertising campaign/hot trending items for the 91% of users who are inactive or cold-start in the test data, then use a deep model for the remaining users**. We should only use deep models for the right use cases. By combining a deep model with the general approach, I got a somewhat higher score than with the original general approach alone. After cleaning up my notebook, I will publish it as soon as possible.
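
A minimal sketch of how such a hybrid routing could look, not the actual pipeline behind the score above. The helper `hybrid_recommend`, the `trending_items` list, and the `model_recs` dictionary are hypothetical; only the `active_status` and `cold_start_status` labels come from this notebook. The default `k=12` is an assumption meant to match the number of predictions per customer in the competition.

```python
import pandas as pd

def hybrid_recommend(metadata, trending_items, model_recs, k=12):
    """Return {customer_id: ranked item list} (hypothetical helper).

    Active, non-cold-start users get their model recommendations first;
    everyone else (and any leftover slots) is filled with trending items.
    """
    use_model = (metadata["active_status"] == "active") & (
        metadata["cold_start_status"] == "non_cold_start")
    recs = {}
    for customer_id, eligible in zip(metadata["customer_id"], use_model):
        personal = list(model_recs.get(customer_id, [])) if eligible else []
        # Fill the remaining slots with trending items, skipping duplicates.
        filler = [item for item in trending_items if item not in personal]
        recs[customer_id] = (personal + filler)[:k]
    return recs

# Toy data, made up for illustration only.
metadata = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "active_status": ["active", "inactive_in_year", "active"],
    "cold_start_status": ["non_cold_start", "cold_start", "cold_start"],
})
trending = ["t1", "t2", "t3"]
model_recs = {"a": ["m1", "t2"], "c": ["m9"]}

recs = hybrid_recommend(metadata, trending, model_recs, k=3)
print(recs)
```

Only customer `a` is both active and non-cold-start here, so only `a` receives model items; `b` and `c` fall back to the trending list.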

# Dataset for anyone who wants to use the inactive/cold-start user labels directly
I have already published the inactive/cold-start users in this customer metadata dataset, so you can use the inactive/cold-start information directly in your hybrid approach:
* https://www.kaggle.com/astrung/hm-customer-metadata

If you want to find more ideas about the hybrid approach, you can check the comments in my thread:
https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/312653

I also created a dataset for items, which extracts sale-off/advertising campaigns from the transaction data. You can take the sale-off information from this dataset for your general recommendation. If you want more information about articles and campaigns, please check my notebook below:
* sale-off item dataset: https://www.kaggle.com/astrung/hm-article-capaign
* sale-off item notebook: https://www.kaggle.com/astrung/eda-extract-campaign-from-transactions



In [None]:
df = pd.read_csv(r'../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv')
df.head()

In [None]:
df['t_dat'] = pd.to_datetime(df['t_dat'], format="%Y-%m-%d")
df['month'] = df['t_dat'].dt.strftime('%m')
df['year'] = df['t_dat'].dt.strftime('%Y')
df.head()

In [None]:
df = df[df['year'] == '2020']
df.shape

In [None]:
df_test_user = pd.read_csv(r'../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv')
df_test_user.shape

# Find inactive user

First, let's count the number of transactions in each month for every user.
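
To illustrate what the `groupby` plus `unstack` in the next cell produces, here is a small sketch on made-up data. The toy frame and the name `monthly_counts` are assumptions for the example only.

```python
import pandas as pd

# Toy transactions: customer "a" buys once in August and twice in September,
# customer "b" buys once in September.
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "month": ["08", "09", "09", "09"],
    "price": [0.01, 0.02, 0.03, 0.05],
})

# One row per customer, one column per month, values = number of transactions.
# Months without any transaction become NaN after unstack().
monthly_counts = (
    toy.groupby(["customer_id", "month"])["price"].count().unstack().reset_index()
)
print(monthly_counts)
```

Customer `b` has no August purchases, so the `08` cell for `b` is NaN. These NaN cells are what the later steps count as missing months.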

In [None]:
df_month_avg_item_per_u = df.groupby(['customer_id', 'month'])['price'].count().unstack().reset_index()
df_month_avg_item_per_u

Then merge with the test data. The test data has more rows than our per-user transaction table, which means some users in the test data don't have any transactions in 2020. Let's check how many such users there are.

In [None]:
df_month_avg_item_per_u = pd.merge(df_month_avg_item_per_u, df_test_user[['customer_id']], on='customer_id', how='outer')
df_month_avg_item_per_u

In [None]:
df_month_avg_item_per_u['num_missing_months'] = df_month_avg_item_per_u.isnull().sum(axis=1)
df_month_avg_item_per_u

**num_missing_months=9 means the user has no transactions in 2020 (the data covers 9 months of 2020). 37% of users in the test data meet this condition.**
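
On toy data, the outer merge and the missing-month count behave as sketched below. The two-month frame is made up for illustration, so here "no transactions at all" corresponds to 2 missing months rather than 9.

```python
import numpy as np
import pandas as pd

# Per-user monthly counts (NaN = no transaction in that month).
monthly_counts = pd.DataFrame({
    "customer_id": ["a", "b"],
    "08": [1.0, np.nan],
    "09": [2.0, 1.0],
})
# Test users: "c" never appears in the transactions.
test_users = pd.DataFrame({"customer_id": ["a", "b", "c"]})

# The outer merge keeps "c" with NaN in every month column.
merged = pd.merge(monthly_counts, test_users, on="customer_id", how="outer")

# Count NaN month cells per user.
month_cols = ["08", "09"]
merged["num_missing_months"] = merged[month_cols].isnull().sum(axis=1)
print(merged)
```

Restricting the count to `month_cols` is a small design choice in this sketch: it keeps the count correct even if other columns with missing values are added later.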

In [None]:
num_missing_year = len(df_month_avg_item_per_u[df_month_avg_item_per_u['num_missing_months'] == 9])
print(num_missing_year)
print(num_missing_year/len(df_test_user))

In [None]:
df_month_avg_item_per_u = df_month_avg_item_per_u.fillna(0)
df_month_avg_item_per_u

**Users who have been inactive for more than 3 consecutive months are still labeled as 3.**
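
The row-wise function in the next cell can be slow on more than a million rows. A vectorized alternative might look like the sketch below; `trailing_inactive_months` is a hypothetical helper, and it assumes the month columns are passed most recent first.

```python
import numpy as np
import pandas as pd

def trailing_inactive_months(counts, months=("09", "08", "07")):
    """Number of consecutive inactive months, looking back from the most
    recent month and capped at len(months). Hypothetical helper."""
    active = counts[list(months)].fillna(0).to_numpy() > 0
    # argmax gives the position of the first active month (most recent first);
    # rows with no active month at all get the cap instead.
    result = np.where(active.any(axis=1), active.argmax(axis=1), len(months))
    return pd.Series(result, index=counts.index)

# Toy monthly counts, made up for illustration.
toy = pd.DataFrame({
    "07": [1, 0, 2, 0],
    "08": [0, 0, 1, 3],
    "09": [0, 0, 4, 0],
})
inactive = trailing_inactive_months(toy)
print(inactive.tolist())  # [2, 3, 0, 1]
```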

In [None]:
def cal_inactive_months(x):
    """Count consecutive inactive months, looking back from September (capped at 3)."""
    if x['09'] > 0:
        return 0
    elif x['09'] == 0 and x['08'] > 0:
        return 1
    elif x['09'] == 0 and x['08'] == 0 and x['07'] > 0:
        return 2
    elif x['09'] == 0 and x['08'] == 0 and x['07'] == 0:
        return 3
    else:
        # Fallback for unexpected values (e.g. NaN); not reached after fillna(0)
        return 4

# Apply row-wise over the month columns only
month_columns = df_month_avg_item_per_u.columns.difference(['customer_id', 'num_missing_months'])
df_month_avg_item_per_u['lastest_inactive_months'] = df_month_avg_item_per_u[month_columns].apply(
    cal_inactive_months, axis=1)
df_month_avg_item_per_u

**In the cell below, we see that 50% of users had no transactions in 8 or more of the 9 months before reappearing. This is another challenge in our data.**

In [None]:
print(df_month_avg_item_per_u.num_missing_months.value_counts())
print(df_month_avg_item_per_u.num_missing_months.describe())
df_month_avg_item_per_u.num_missing_months.hist()

***In the following cell, we see that 64% of users were inactive during the most recent 3 months (Sep, Aug, July) before reappearing in the test data.***

In [None]:
print(df_month_avg_item_per_u.lastest_inactive_months.value_counts())
print(df_month_avg_item_per_u.lastest_inactive_months.describe())
df_month_avg_item_per_u.lastest_inactive_months.hist()

In [None]:
# Count and share of test users for each number of most recent inactive months
for n_months in (3, 2, 1):
    num_missing = (df_month_avg_item_per_u['lastest_inactive_months'] == n_months).sum()
    print(f"Missing {n_months} months")
    print(num_missing)
    print(num_missing/len(df_test_user))

Create a dataframe of inactive users so that it can be merged with other user information.
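
The labeling rules in the next cell can also be written as a single `np.select` call. This is only an alternative sketch on toy data, with conditions checked in order and the first match winning.

```python
import numpy as np
import pandas as pd

# Toy values, made up for illustration (9 months of data, as in this notebook).
users = pd.DataFrame({
    "num_missing_months": [9, 6, 3, 1, 0],
    "lastest_inactive_months": [3, 3, 2, 1, 0],
})

conditions = [
    users["num_missing_months"] == 9,        # no transactions in 2020
    users["lastest_inactive_months"] == 3,   # absent Jul-Sep, but bought earlier in 2020
    users["lastest_inactive_months"] == 2,
    users["lastest_inactive_months"] == 1,
]
labels = [
    "inactive_in_year",
    "inactive_in_3_months_or_more",
    "inactive_in_2_months",
    "inactive_in_1_month",
]
# Anything that matches no condition stays "active".
users["active_status"] = np.select(conditions, labels, default="active")
print(users)
```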

In [None]:
df_month_avg_item_per_u['active_status'] = 'active'
df_month_avg_item_per_u.loc[(df_month_avg_item_per_u.num_missing_months == 9),'active_status']='inactive_in_year'
df_month_avg_item_per_u.loc[(df_month_avg_item_per_u.num_missing_months < 9) &
                            (df_month_avg_item_per_u.lastest_inactive_months == 3),
                            'active_status']='inactive_in_3_months_or_more'
df_month_avg_item_per_u.loc[
    (df_month_avg_item_per_u.lastest_inactive_months == 2),'active_status']='inactive_in_2_months'
df_month_avg_item_per_u.loc[
    (df_month_avg_item_per_u.lastest_inactive_months == 1),'active_status']='inactive_in_1_month'
df_month_avg_item_per_u

In [None]:
df_active_user = df_month_avg_item_per_u[['customer_id', 'num_missing_months', 'lastest_inactive_months', 'active_status']].copy()
df_active_user

# Find cold-start customers

**First, count the number of transactions per user. We label users with fewer than 10 transactions as cold-start users: they have too little data for correct recommendations.**

In [None]:
df_avg_item_per_u = df.groupby(['customer_id'])['price'].count().reset_index()
df_avg_item_per_u.columns = ['customer_id', 'num_transactions']
df_avg_item_per_u

In the test data, some users don't have any transactions in 2020. Let's add them to our dataframe and set their number of transactions to 0.

In [None]:
df_avg_item_per_u = pd.merge(df_avg_item_per_u, df_test_user[['customer_id']], on='customer_id', how='outer')
df_avg_item_per_u = df_avg_item_per_u.fillna(0)
df_avg_item_per_u

**In the plots below, we see that most users have a small number of transactions.**

In [None]:
df_avg_item_per_u.num_transactions.hist(bins=100)
plt.show()
plt.close()
df_avg_item_per_u.boxplot('num_transactions')
plt.show()
plt.close()

In [None]:
df_avg_item_per_u.num_transactions.value_counts(bins=[-1, 0, 10, 100, 1000])

In [None]:
df_avg_item_per_u.num_transactions.describe()

**We label users with fewer than 10 transactions as cold-start users.**
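
A small sketch of the threshold rule on toy data, mainly to show the boundary case: a user with exactly 10 transactions is not cold-start. The constant name `COLD_START_THRESHOLD` is an assumption for this example.

```python
import numpy as np
import pandas as pd

COLD_START_THRESHOLD = 10  # same cut-off as in this notebook

# Toy users, made up for illustration.
users = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "num_transactions": [0, 9, 10, 57],
})
users["cold_start_status"] = np.where(
    users["num_transactions"] < COLD_START_THRESHOLD, "cold_start", "non_cold_start")
print(users)
```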

In [None]:
df_avg_item_per_u['cold_start_status'] = 'cold_start'
df_avg_item_per_u.loc[(df_avg_item_per_u.num_transactions >= 10),'cold_start_status']='non_cold_start'
df_coldstart_user = df_avg_item_per_u.copy()
df_coldstart_user

# Find how frequently users buy per month
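
The cells below compute, for each customer, the mean number of transactions over the months in which they bought something. A compact sketch of the same idea on toy data relies on `DataFrame.mean` skipping NaN by default; the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Monthly transaction counts (NaN = no purchase in that month).
monthly_counts = pd.DataFrame({
    "customer_id": ["a", "b"],
    "07": [np.nan, 6.0],
    "08": [1.0, np.nan],
    "09": [3.0, 2.0],
})

# mean(axis=1) ignores NaN, so only active months enter the average.
month_cols = ["07", "08", "09"]
monthly_counts["mean_transactions_in_active_month"] = monthly_counts[month_cols].mean(axis=1)
print(monthly_counts)
```

Customer `a` is active in two months with 1 and 3 transactions, giving a mean of 2.0; customer `b` gets (6 + 2) / 2 = 4.0.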

In [None]:
df_month_avg_item_per_u = df.groupby(['customer_id', 'month'])['price'].count().unstack().reset_index()
df_month_avg_item_per_u

In [None]:
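def find_active_month(x):
    """Return the monthly transaction counts for months with at least one purchase."""
    # Skip customer_id (first value); NaN marks months without transactions
    float_x = x.values[1:].astype(float)
    return float_x[~np.isnan(float_x)]

df_month_avg_item_per_u['transactions_in_active_month'] = df_month_avg_item_per_u.apply(
    find_active_month, axis=1)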
df_month_avg_item_per_u

In [None]:
df_month_avg_item_per_u['mean_transactions_in_active_month'] = df_month_avg_item_per_u.apply(
    lambda x: x['transactions_in_active_month'].mean(), axis=1)
df_month_avg_item_per_u

On average, each user buys only 4 items a month, or about 1 item a week. This is another challenge.

In [None]:
print(df_month_avg_item_per_u.mean_transactions_in_active_month.describe())
df_month_avg_item_per_u.mean_transactions_in_active_month.hist(bins=100)

# Create a dataframe with all user metadata: active status/cold-start status

In [None]:
df_transaction_frequent = df_month_avg_item_per_u[['customer_id', 'mean_transactions_in_active_month']].copy()
df_transaction_frequent

In [None]:
result = pd.merge(df_active_user, df_coldstart_user, on='customer_id', how='outer')
result = pd.merge(result, df_transaction_frequent, on='customer_id', how='outer')
result

In [None]:
result[(result.active_status == 'active') & (result.cold_start_status == 'non_cold_start')].shape

In [None]:
result.to_csv('metadata_customer_id.csv', index=False)

In [None]:
result.shape

In [None]:
# Share of test users who are both active and non-cold-start (about 9%)
121843/1371980