`article_id`s are not random integers. They are strongly tied to article release dates: smaller ids belong to older articles and larger ids to newer ones. I also estimate that the id value is roughly proportional to time.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

di = '/kaggle/input/h-and-m-personalized-fashion-recommendations/'

In [None]:
transactions = pd.read_csv(di + 'transactions_train.csv')
transactions['t_dat'] = pd.to_datetime(transactions['t_dat'])

articles = pd.read_csv(di + 'articles.csv')

## Time notation

Time is measured in days since the first date in the transaction data:

* **t = date - 2018-09-20**  in days

In [None]:
# Define time t by the time from the first transaction in days
t0 = pd.to_datetime('2018-09-20')
transactions['t'] = (transactions['t_dat'] - t0).dt.days
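
To make the notation concrete, here is a minimal sketch of converting between calendar dates and `t`. The helper names `date_to_t` and `t_to_date` are hypothetical and not part of the notebook; only the origin `2018-09-20` comes from the cell above.

```python
import pandas as pd

t0 = pd.Timestamp('2018-09-20')  # same origin as in the notebook

def date_to_t(date):
    """Days elapsed from t0 to the given date (hypothetical helper)."""
    return (pd.Timestamp(date) - t0).days

def t_to_date(t):
    """Calendar date corresponding to day t (hypothetical helper)."""
    return t0 + pd.Timedelta(days=int(t))

print(date_to_t('2020-09-22'))  # 733
print(t_to_date(0).date())      # 2018-09-20
```

With this origin, the last training date 2020-09-22 corresponds to t = 733.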

## Time of first purchase

In [None]:
# First t that each article was purchased
# t_release <= t_first
t_first = transactions.groupby(['article_id'])[['t']].min()

plt.title('When is each article purchased for the first time?')
plt.xlabel('article index / 1000')
plt.ylabel('t_first [day]')
plt.plot(t_first.values[::1000])
plt.show()

Clearly, articles that appear later in the data are first sold later in time. The plot is noisy because unpopular items are not sold immediately after release.

Article index ~40 x 1000 seems to correspond to the beginning of the training data period (2018-09-20).


## Release date estimate

Assuming the article data are in order of release, we can clean up the plot by requiring the estimate `t_release_est` to be monotonically non-decreasing.

```
t_release <= t_release_est <= t_first
t_release_est[i] <= t_release_est[i + 1]
```

In [None]:
len(articles), len(transactions['article_id'].unique())

In [None]:
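# Include items never purchased. Do we have items not released by the end of the training period?

df = articles.merge(t_first, how='left', left_on='article_id', right_index=True)
# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df.
df['t'] = df['t'].fillna(999)  # ~1000 items never purchased
df.rename(columns={'t': 't_first'}, inplace=True)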

In [None]:
t_first = df['t_first'].values

t_min = 999
t_release_est = np.zeros(len(t_first), dtype=int)
n = len(t_first)

for i in range(n - 1, -1, -1):
    t = t_first[i]
    t_min = min(t, t_min)  # t_release_est[i] = min(t_first[i], t_release_est[i + 1])
    t_release_est[i] = t_min
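
The backward loop above is a running minimum taken from the right, so it can also be written as a reversed cumulative minimum. The sketch below checks this on a small made-up array (the values are illustrative, not taken from the data), assuming as in the notebook that missing values have already been filled with 999.

```python
import numpy as np

# Toy first-purchase days in article order; 999 marks "never purchased".
t_first = np.array([0, 5, 3, 999, 10, 8, 999, 999])

# Loop version, mirroring the notebook cell.
t_min = 999
loop_est = np.zeros(len(t_first), dtype=int)
for i in range(len(t_first) - 1, -1, -1):
    t_min = min(t_first[i], t_min)
    loop_est[i] = t_min

# Vectorized version: reverse, take the cumulative minimum, reverse back.
vec_est = np.minimum.accumulate(t_first[::-1])[::-1]

assert np.array_equal(loop_est, vec_est)
print(vec_est)  # [  0   3   3   8   8   8 999 999]
```

The two versions agree here because no value exceeds the loop's initial `t_min = 999`; with a different fill value the loop would additionally cap the result at 999.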

In [None]:
plt.xlabel('article index')
plt.ylabel('release date [day]')
plt.plot(t_release_est)
plt.show()

In [None]:
t_release_est[-8:]

Only 2 items have t_release_est = 999; I conclude that articles released after the training period (2020-09-22) are *not* in our article data, and we do not need to recommend such new items.

However, note that newer items have had less time to be purchased, so they are more likely to have no purchases yet.

In [None]:
idx = df['t_first'] == 999
zero_purchased = df.index[idx]

plt.xlabel('article_id index')
plt.ylabel('number of items never purchased')
plt.hist(zero_purchased, 101)
plt.show()

## Article index vs article_id

In [None]:
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.ylim(0, 733)
plt.xlabel('article index')
plt.ylabel('t_release_est [day]')
plt.plot(t_release_est)

plt.subplot(1, 2, 2)
plt.xlabel('article_id')
plt.ylim(0, 733)  # same y-range as the left panel (t runs from 0 to 733)
plt.plot(df['article_id'], t_release_est)

tt = np.linspace(0.73e9, 0.95e9)
plt.plot(tt, 3.1e-6*(tt - tt[0]))

plt.show()

The relation between article_id and t_release_est looks closer to linear than the relation with article index, and I speculate that article_ids encode release dates.
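
The slope of the overlaid line was chosen by eye. One way to estimate it instead is a least-squares fit, sketched below on synthetic data. The arrays `article_id` and `t_release` here are made-up stand-ins for `df['article_id']` and `t_release_est`, generated with the eyeballed slope plus noise, so the output only shows that the fit recovers a known slope and says nothing about the real data.

```python
import numpy as np

# Synthetic stand-ins (NOT the real data): ids over the plotted range,
# release day linear in id with the eyeballed slope, plus Gaussian noise.
rng = np.random.default_rng(0)
id0 = 730_000_000
article_id = np.sort(rng.integers(id0, 950_000_000, size=5000))
t_release = 3.1e-6 * (article_id - id0) + rng.normal(0, 10, size=article_id.size)

# Shift and rescale the ids so the fit is numerically well conditioned.
x = (article_id - id0) / 1e6  # units of one million ids

slope, intercept = np.polyfit(x, t_release, 1)
print(f"slope ~ {slope:.2f} days per 1e6 ids, intercept ~ {intercept:.1f} days")
```

On the real data it would probably make sense to fit only the articles with ids above about 0.73e9, since older articles all have an estimate near 0.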

Not all digits of article_id encode the date: the last 3 digits are not uniformly distributed, so they must be some index assigned starting from 1.

In [None]:
articles['article_id'].head()

In [None]:
last3 = articles['article_id'].apply(lambda x: int(str(x)[-3:]))

plt.xlabel('Last 3 digits in article_id')
plt.hist(last3, 41);
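
The string slicing above can also be done with integer arithmetic, which avoids a Python-level `apply`. A minimal sketch, using a few illustrative ids rather than values read from `articles.csv`:

```python
import numpy as np

# Illustrative ids only, not read from articles.csv.
article_id = np.array([108775015, 108775044, 110065001])

prefix = article_id // 1000  # leading digits, the part speculated to track release time
last3 = article_id % 1000    # trailing three digits, same as int(str(x)[-3:])

print(prefix)  # [108775 108775 110065]
print(last3)   # [15 44  1]
```

On the notebook's data the same expressions should work directly on the column, e.g. `articles['article_id'] % 1000`, provided the ids are read as integers.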