# Data Sets / Structure

I created a notebook that modifies the provided datasets in order to simplify model training afterwards. Additional attributes are created and explained below.

## transactions.pqt

- `t_dat`
- `customer_id_int` - converted `customer_id`
- `article_id`
- `price` - multiplied by 590, units are probably Euro
- `price_min_a` - minimum price of article up to timepoint `t_dat` (WITHOUT DISCOUNTED MEMBER PRICES)
- `price_max_a` - maximum price of article up to timepoint `t_dat`
- `price_min_u` - minimum price the customer has paid up to timepoint `t_dat` **TODO**
- `price_mean_u` - average price the customer has paid up to timepoint `t_dat` **TODO**
- `price_max_u` - maximum price the customer has paid up to timepoint `t_dat` **TODO**
- `weekly_quotient` - number of articles sold in last week of transactions divided by number of articles sold in week of `t_dat`
- `art_age_weeks` - the article's number of weeks since its first appearance, e.g. 5 means this is the 5th week of availability (min=1)
- `cust_last_purch_weeks` - the number of weeks between the customer's last purchase and the current transaction **TODO**

## transactions_week.pqt



## customers.pqt

- `customer_id_int` - converted `customer_id`
- `FN`
- `Active`
- `club_member_status`
- `fashion_news_frequency`
- `age`
- `postal_code`
- `t_dat_min` - date of the customer's first transaction in transactions.pqt
- `t_dat_max` - date of the customer's last transaction in transactions.pqt
- `ct_carts` - count of different carts in transactions.pqt (==0 -> cold user)
- `price_min`, `price_mean`, `price_max` - the minimum/average/maximum price the customer has paid in transactions (i.e. for a single article)
- `cart_value_min`, `cart_value_mean`, `cart_value_max` - the minimum/average/maximum cart value of the customer
- `index_group_name_*` - share of the customer's distinct articles (not transactions) bought from each index group name -> proxy for gender

**For model training (e.g. NN), the last attributes of customers (those aggregated over all of transactions.pqt) suffer from leakage, so rolling versions of these attributes still need to be created.**

## articles.pqt

- ...
- `hist` - does article show up in transactions.pqt? (==0 -> cold item) **REDUNDANT? -> t_dat_min != NaN**
- `t_dat_min` - date first appearance in transactions.pqt
- `t_dat_max` - date last appearance in transactions.pqt
- `ct_carts` - count of different carts in transactions.pqt, "How well distributed was this article among different carts?" **TODO**
- `price_min` - minimum price in transactions.pqt (WITHOUT DISCOUNTED MEMBER PRICES)
- `price_max` - maximum price in transactions.pqt

## customers_article_df.pqt

Interaction between customer `c` and article `a`

- `customer_id_int`
- `article_id`
- `ct` - number of times customer `c` bought article `a`
- `ct_carts` - number of carts of customer `c` that contained article `a` 
- `t_dat_min` - first time customer `c` bought article `a`
- `t_dat_max` - last time customer `c` bought article `a`


# Import Packages

In [None]:
import cudf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json
import time
import math

PREFIX = 'm_'

print('RAPIDS version',cudf.__version__)

# Load Data Sets and Convert Data

Note that `transactions_train.csv` is about **3.49GB**, so we use `cudf` to import it and to handle the data manipulation.

We remove `customer_id` and replace/work with `customer_id_int`.

Reference: 
- https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308635
- https://www.kaggle.com/cdeotte/recommend-items-purchased-together-0-021/notebook
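
The conversion to `customer_id_int` keeps only the last 16 hex characters of the id and reinterprets them as a 64-bit integer. Below is a minimal pandas/NumPy sketch of that idea on made-up ids. The helper name `hex_id_to_int64` is hypothetical and not part of this notebook, and it assumes that values of 2^63 or more should wrap around to negative numbers.

```python
import numpy as np
import pandas as pd

def hex_id_to_int64(ids: pd.Series) -> pd.Series:
    """Map the last 16 hex characters of each id to a signed 64-bit integer."""
    # Parse as unsigned 64-bit first, since 16 hex digits can exceed the int64 range
    as_uint = np.array([int(s[-16:], 16) for s in ids], dtype=np.uint64)
    # Reinterpret the same bits as signed (two's complement), so large values become negative
    return pd.Series(as_uint.view(np.int64), index=ids.index)

# Made-up ids: a 2-character prefix followed by 16 hex characters
ids = pd.Series(['aa0000000000000001', 'bbffffffffffffffff', 'cc00000000000000ff'])
result = hex_id_to_int64(ids)
print(result.tolist())
```

The point of the conversion is memory: an `int64` column needs 8 bytes per row instead of a 64-character string.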

## Transactions

Perform memory-saving operations and drop the original `customer_id` column; `price` and `sales_channel_id` are kept.

In [None]:
trans_df = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv')

In [None]:
trans_df['customer_id_int'] = trans_df['customer_id'].str[-16:].str.hex_to_int().astype('int64')
trans_df['article_id'] = trans_df.article_id.astype('int32')
trans_df.t_dat = cudf.to_datetime(trans_df.t_dat)

In [None]:
trans_df['price'] = trans_df['price'].astype('float32')*590 # from discussions on competition
trans_df = trans_df.round({'price':2})

In [None]:
trans_df = trans_df[['t_dat','customer_id_int','article_id', 'price', 'sales_channel_id']]

In [None]:
trans_df.info()

### Trending
Reference:
- https://www.kaggle.com/ebn7amdi/trending/notebook

In [None]:
df = trans_df[['t_dat', 'customer_id_int', 'article_id']].to_pandas()
last_ts = df['t_dat'].max()
df = df.join(df.groupby('t_dat')['t_dat'].max().transform(lambda d: last_ts - (last_ts - d).floor('7D')), on="t_dat", how="left", rsuffix="_ldbw")
df = df.rename(columns={'t_dat_ldbw': 'ldbw'})
weekly_sales = df.drop('customer_id_int', axis=1).groupby(['ldbw', 'article_id']).count()
weekly_sales = weekly_sales.rename(columns={'t_dat': 'count'})
df = df.join(weekly_sales, on=['ldbw', 'article_id'])
weekly_sales = weekly_sales.reset_index().set_index('article_id')
df = df.join(
    weekly_sales.loc[weekly_sales['ldbw']==last_ts, ['count']],
    on='article_id', rsuffix="_targ")
df['count_targ'] = df['count_targ'].fillna(0)  # articles not sold in the last week get a target count of 0
# del weekly_sales
df['quotient'] = df['count_targ'] / df['count'] # quotient is high when lot of particular article is bought in last week, compared to current week

trans_df['weekly_quotient'] = df['quotient'].astype('float32')


# del df

# e.g. can be used like this
# trans_df.groupby('article_id')['weekly_quotient'].sum().nlargest(20).index.to_arrow()
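
The trending cell is dense, so here is a small sketch of the same two steps on made-up data: bucketing dates into 7-day weeks counted backwards from the last transaction date (`ldbw`), and dividing the last week's sales of an article by its sales in the week of `t_dat`. The integer arithmetic on day differences is assumed to be equivalent to the `.floor('7D')` expression above; the toy values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    't_dat': pd.to_datetime(['2020-09-22', '2020-09-21', '2020-09-15',
                             '2020-09-14', '2020-09-14', '2020-09-10']),
    'article_id': [1, 1, 1, 2, 2, 2],
})

# Week bucket: the last day of the 7-day window (counted back from last_ts) containing t_dat
last_ts = toy['t_dat'].max()
weeks_back = (last_ts - toy['t_dat']).dt.days // 7
toy['ldbw'] = last_ts - pd.to_timedelta(7 * weeks_back, unit='D')

# Sales of the article in the week of the transaction
toy['count'] = toy.groupby(['ldbw', 'article_id'])['t_dat'].transform('count')

# Sales of the article in the last week of the data (0 if it was not sold then)
last_week = toy.loc[toy['ldbw'] == last_ts].groupby('article_id').size()
toy['count_targ'] = toy['article_id'].map(last_week).fillna(0)

toy['quotient'] = toy['count_targ'] / toy['count']
print(toy)
```

Article 1 sold twice in the last week and once in the week before, so its older transaction gets a quotient of 2; article 2 was not sold in the last week, so its quotient is 0.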

In [None]:
# Article's age in weeks
# After the merge: t_dat_x = transaction date, t_dat_y = date of the article's first sale
df = df.merge(df.groupby('article_id')['t_dat'].min(), on='article_id', how='left')
# Day difference 0-6 -> week 1, 7-13 -> week 2, ...
df['art_age_weeks'] = ((df['t_dat_x'] - df['t_dat_y']).dt.days // 7 + 1).astype('int16')
trans_df['art_age_weeks'] = df['art_age_weeks']
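
As a quick check of the `art_age_weeks` definition (week 1 is the week of the first sale), here is a sketch on made-up dates. It uses `transform('min')` instead of a merge, which is a swapped-in shortcut and not the approach of the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    't_dat': pd.to_datetime(['2020-01-01', '2020-01-07', '2020-01-08', '2020-01-05']),
    'article_id': [1, 1, 1, 2],
})

# Date of the first sale of each article, broadcast back to every row
first_sale = toy.groupby('article_id')['t_dat'].transform('min')

# 0-6 days after the first sale -> week 1, 7-13 days -> week 2, ...
toy['art_age_weeks'] = ((toy['t_dat'] - first_sale).dt.days // 7 + 1).astype('int16')
print(toy)
```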

In [None]:
# Time since customer's last purchase in weeks cust_last_purch_weeks
#df = df.drop(['count_targ', 'quotient', 't_dat_y', 'art_age_weeks', 'ldbw', 'count'], axis=1)
#df = df.rename(columns={"t_dat_x": "t_dat"})
#df['customer_last_purchase_t_dat'] = df.groupby(['customer_id_int'])['t_dat'].cummax()
#df['okay'] = df.groupby(['customer_id_int'])['t_dat'].shift()
#del df
# df.merge(df.groupby('customer_id_int')['t_dat'].cummax(), on='customer_id_int', how='left')
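
`cust_last_purch_weeks` is still a TODO. One possible way to compute it is sketched below on made-up data, assuming that a "purchase" is one distinct `t_dat` per customer and that a customer's first purchase has no value (NaN). This is an illustration, not the implementation used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'customer_id_int': [1, 1, 1, 2, 1],
    't_dat': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-15',
                             '2020-01-10', '2020-02-05']),
})

# One row per (customer, purchase date), ordered in time within each customer
carts = (toy.drop_duplicates(['customer_id_int', 't_dat'])
            .sort_values(['customer_id_int', 't_dat']))

# Previous purchase date of the same customer (NaT for the first purchase)
carts['prev_t_dat'] = carts.groupby('customer_id_int')['t_dat'].shift()
carts['cust_last_purch_weeks'] = (carts['t_dat'] - carts['prev_t_dat']).dt.days // 7

# Attach the value to every transaction of that purchase date
toy = toy.merge(carts[['customer_id_int', 't_dat', 'cust_last_purch_weeks']],
                on=['customer_id_int', 't_dat'], how='left')
print(toy)
```

Because only earlier purchases enter the value, this attribute does not leak future information.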

In [None]:
trans_df.info()

## Customers

**From customers.csv**

Convert `customer_id` and fill missing values. **Note that filling missing values is questionable: e.g. NaN in `FN` means we do not know the customer's status, not that the customer actively withdrew from the newsletter.**

In [None]:
cust_df = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/customers.csv")
cust_df['customer_id_int'] = cust_df['customer_id'].apply(lambda x: int(x[-16:],16) ).astype('int64') # needed since transaction also converted
cust_df[['FN', 'Active']] = cust_df[['FN', 'Active']].fillna(0).astype(np.int8) # Replace NaNs in Newsletter and Active with 0
cust_df[['club_member_status']] = cust_df[['club_member_status']].fillna('None')
cust_df['fashion_news_frequency'] = cust_df['fashion_news_frequency'].replace('NONE', 'None')
del cust_df['customer_id']

In [None]:
# Fill missing age data
mean_age = cust_df["age"].mean() # TODO: use other attributes to fill age
cust_df["age"] = cust_df["age"].fillna(mean_age).astype(np.int8) # age "intable"
print(f"Customers age isna after replace with mean ({int(mean_age)}): {sum(cust_df.age.isna())}")

In [None]:
# Get first & last transaction of user, number of carts in transactions (if cold user ct_carts=0)
cust_df = cust_df.merge(trans_df.to_pandas().groupby('customer_id_int')['t_dat'].agg(
    t_dat_min='min', t_dat_max='max', ct_carts='nunique'), how="left", left_on="customer_id_int", right_index=True)
cust_df['ct_carts'] = cust_df['ct_carts'].fillna(0).astype('int16')  # int16 avoids overflow if a customer has more than 127 distinct purchase dates

In [None]:
# Get minimum, mean and maximum price the customer pays for an article, and the same statistics for the cart value
cust_df = cust_df.merge(trans_df.to_pandas().groupby('customer_id_int')['price'].agg(price_min='min', price_mean='mean', price_max='max'), on='customer_id_int', how='left')
cust_df = cust_df.merge(trans_df.to_pandas().groupby(['customer_id_int', 't_dat'])['price'].agg(cart_value='sum').groupby('customer_id_int')['cart_value'].agg(cart_value_min='min', cart_value_mean='mean', cart_value_max='max'), on='customer_id_int', how='left')
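
In the cells above, a "cart" is taken to be all transactions of one customer on one date. The following sketch shows the resulting `ct_carts` and `cart_value_*` aggregations on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    'customer_id_int': [1, 1, 1, 2],
    't_dat': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-05', '2020-01-02']),
    'price': [10.0, 5.0, 20.0, 8.0],
})

# Number of carts = number of distinct purchase dates per customer
ct_carts = toy.groupby('customer_id_int')['t_dat'].nunique()

# Cart value = sum of prices per (customer, date), then statistics per customer
cart_values = toy.groupby(['customer_id_int', 't_dat'])['price'].sum()
cart_stats = cart_values.groupby(level='customer_id_int').agg(
    cart_value_min='min', cart_value_mean='mean', cart_value_max='max')

print(ct_carts)
print(cart_stats)
```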

In [None]:
# Simplify postal code
postal_code_map = {pc: pc_simple for pc_simple, pc in enumerate(cust_df.postal_code.unique().tolist())}
cust_df['postal_code'] = cust_df['postal_code'].map(postal_code_map).astype('int32')
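
The mapping above assigns each postal code an integer in order of first appearance. For non-missing values, `pd.factorize` should give the same codes in one call; a short comparison on made-up codes:

```python
import pandas as pd

postal = pd.Series(['x9', 'a1', 'x9', 'k3'])

# Dictionary approach, as in the cell above
mapping = {pc: i for i, pc in enumerate(postal.unique().tolist())}
via_map = postal.map(mapping).astype('int32')

# factorize returns (codes, uniques); codes follow order of first appearance
codes, uniques = pd.factorize(postal)
via_factorize = pd.Series(codes, index=postal.index).astype('int32')

print(via_map.tolist(), via_factorize.tolist())
```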

In [None]:
cust_df.info()

### Transactions Minimum/Maximum Article Price
Get the rolling maximum & minimum price per article in transactions. Excluding discounted member prices from the minimum is not implemented yet (see the TODO in the first cell below).

In [None]:
# TODO: merge cust_df to get club_member_status and exclude from price counting
trans_df['price_min'] = trans_df.groupby('article_id')['price'].cummin()

In [None]:
trans_df['price_max'] = trans_df.groupby('article_id')['price'].cummax()
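
The same cumulative pattern could cover the customer-level TODO attributes `price_min_u`, `price_mean_u` and `price_max_u`. The sketch below uses made-up data and assumes the rows are sorted by `t_dat` and that "up to timepoint `t_dat`" includes the current transaction; it is not the notebook's implementation.

```python
import pandas as pd

toy = pd.DataFrame({
    't_dat': pd.to_datetime(['2020-01-03', '2020-01-01', '2020-01-02', '2020-01-04']),
    'customer_id_int': [7, 7, 7, 8],
    'price': [30.0, 20.0, 10.0, 5.0],
}).sort_values('t_dat', kind='stable')  # cumulative statistics need chronological order

grouped = toy.groupby('customer_id_int')['price']
toy['price_min_u'] = grouped.cummin()
toy['price_max_u'] = grouped.cummax()
# Running mean = running sum divided by the number of transactions seen so far
toy['price_mean_u'] = grouped.cumsum() / (grouped.cumcount() + 1)
print(toy)
```

To keep the current transaction out of its own features, the values could be shifted by one row within each customer.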

## Articles

In [None]:
art_df = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/articles.csv")
art_df.info()

In [None]:
# Check if article has transaction history, i.e. does show up in transactions (0=cold item)
art_df['hist'] = art_df.article_id.isin(trans_df.article_id.to_pandas().unique()).astype(np.int8)

In [None]:
helper = trans_df.to_pandas().groupby('article_id')['t_dat'].agg(t_dat_min='min', t_dat_max='max')
art_df = art_df.merge(helper, on="article_id", how='left')
del helper

### Is Article Relevant Next Week?

In [None]:
weekly_unique_articles = weekly_sales.reset_index().groupby('ldbw')['article_id'].unique() # 105 rows
weekly_unique_articles = pd.DataFrame(
    list(zip(weekly_unique_articles.values, weekly_unique_articles.shift(-1).values)), 
    columns=['articles', 'articles_next'], index=weekly_unique_articles.index)
weekly_unique_articles.loc['2020-09-22', 'articles_next'] = [] # last week

weekly_unique_articles['res'] = np.empty((len(weekly_unique_articles), 0)).tolist()
for index, row in weekly_unique_articles.iterrows():
    for a in row['articles']:
        if a in row['articles_next']:
            weekly_unique_articles.loc[index, 'res'].append((a, 1))
        else:
            weekly_unique_articles.loc[index, 'res'].append((a, 0))
            
weekly_unique_articles = weekly_unique_articles.apply(lambda x: pd.Series(x['res']),axis=1).stack().reset_index(level=1, drop=True)
weekly_unique_articles = pd.DataFrame(list(weekly_unique_articles), 
                                      columns=['article_id', 'rel_next_week'],
                                      index=weekly_unique_articles.index)
weekly_sales = weekly_sales.merge(weekly_unique_articles, how='left', on=['article_id', 'ldbw'])
weekly_sales['rel_next_week'] = weekly_sales['rel_next_week'].astype('int8')
weekly_sales['count'] = weekly_sales['count'].astype('int16')
del weekly_unique_articles
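
The loop above can be slow on the full data. A vectorized alternative is sketched below on made-up weekly counts. It assumes that consecutive `ldbw` values are exactly 7 days apart and that no week is missing entirely; it is a swapped-in technique, not the notebook's code.

```python
import pandas as pd

weekly = pd.DataFrame({
    'ldbw': pd.to_datetime(['2020-09-08', '2020-09-08', '2020-09-15', '2020-09-22']),
    'article_id': [1, 2, 1, 3],
    'count': [4, 2, 1, 5],
})

# Move every (week, article) pair one week back: a match on the original
# pair then means "this article is sold again in the following week"
sold_next = weekly[['ldbw', 'article_id']].copy()
sold_next['ldbw'] = sold_next['ldbw'] - pd.Timedelta(days=7)
sold_next['rel_next_week'] = 1

weekly = weekly.merge(sold_next, on=['ldbw', 'article_id'], how='left')
weekly['rel_next_week'] = weekly['rel_next_week'].fillna(0).astype('int8')
print(weekly)
```

Only article 1 in the week ending 2020-09-08 is sold again one week later, so it is the only row flagged with 1; rows of the last week are always 0.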

In [None]:
weekly_sales.info()

# Helper Datasets

**Customer/Article Interaction**

In [None]:
# Dataframe Index: (cust_id, art_id); Values: how often the combination occurred
MIN_PURCHASES = 0 # Cap, ignore combinations where customer only bought article once (REMOVED)

cust_art_df = trans_df.groupby(['customer_id_int', 'article_id'])['t_dat'].agg(['count', 'nunique', 'min', 'max'])
cust_art_df.columns = ['ct', 'ct_carts', 't_dat_min', 't_dat_max']
cust_art_df = cust_art_df[cust_art_df.ct>MIN_PURCHASES].sort_values(by="ct", ascending=False).to_pandas()
cust_art_df.head()

In [None]:
cust_art_df_h = cust_art_df.reset_index().merge(art_df[['article_id', 'index_group_name']], on='article_id', how='left')
# One row per (customer, article) pair, so the sums count distinct articles per index group.
# Only the id and the dummy columns are kept, because datetime columns cannot be summed.
cust_art_df_h = pd.get_dummies(
    cust_art_df_h[['customer_id_int', 'index_group_name']], columns=['index_group_name']
).groupby('customer_id_int').sum().astype('int16')

# Convert the counts to shares per customer (each row sums to 1)
cust_art_df_h = cust_art_df_h.div(cust_art_df_h.sum(axis=1), axis=0).astype('float32')

cust_df = cust_df.merge(cust_art_df_h, on='customer_id_int', how='left')

del cust_art_df_h

cust_df.info()

In [None]:
cust_df

# Store results

In [None]:
trans_df.to_parquet(f'{PREFIX}transactions.pqt', index=False)
weekly_sales.to_parquet(f'{PREFIX}weekly_sales.pqt', index=False)
cust_df.to_parquet(f"{PREFIX}customers.pqt", index=False)
art_df.to_parquet(f"{PREFIX}articles.pqt", index=False)
cust_art_df.to_parquet(f"{PREFIX}customers_article_df.pqt")