# [H&M Fashion] GPU-accelerated RecSys Dataset Profiler by NVIDIA Merlin

### Context

This notebook is based on a [*RecSys Dataset Profiler* template](https://github.com/NVIDIA-Merlin/competitions/blob/main/SIGIR_eCommerce_Challenge_2021/task1_session_based_rec/0-eda/coveo_retail_recsys_dataset_profiler.ipynb) created by the [NVIDIA Merlin](https://developer.nvidia.com/nvidia-merlin) team. 
It provides a first assessment of a recommender systems dataset: it computes statistics and plots from user interaction data that help answer questions such as:

- Are Collaborative Filtering algorithms (e.g., Matrix Factorization (MF), Neural Collaborative Filtering (NCF)) suitable for this dataset, given the sparsity of the user-item matrix and the long-tail distributions of items and users?
- Should the recommender system recommend items that users have already interacted with or purchased?
- How many new items become available and how many new users are observed every day? The answers indicate how severe the user and item cold-start problems are in the dataset.
- How fast do items lose relevance for users? Should fresher items be favored over older ones?

Feel free to use this notebook for other RecSys datasets :)

### How to use this notebook for a RecSys dataset?
You just need to set some variables from the **Config** section and run the full notebook :)

### Requirements
This notebook uses NVIDIA RAPIDS for GPU-accelerated data analysis, which is pre-installed in Kaggle Kernels (remember to enable the GPU Accelerator).   

`cudf` is a GPU-accelerated dataframe library with a `pandas`-like API. `dask_cudf` distributes dataframe operations across multiple GPUs, for fast processing of large datasets. Kaggle Kernels provide only a single GPU, but we keep the `dask_cudf` dependency so that this EDA can run in a distributed fashion when multiple GPUs are available.

If you are running this notebook in another environment, you can install RAPIDS with `conda` by following the examples [here](https://rapids.ai/).

### NVIDIA Merlin
This notebook was created by the NVIDIA Merlin team. 
NVIDIA Merlin is an open-source framework for building large-scale deep learning recommender systems. You can find more resources about Merlin here:
- https://developer.nvidia.com/nvidia-merlin
- https://medium.com/nvidia-merlin
- https://github.com/NVIDIA-Merlin

## Imports

In [None]:
import os
import shutil
import pandas as pd
import numpy as np
from collections import OrderedDict
from IPython.display import display
import matplotlib.pyplot as plt

In [None]:
# These dependencies require a GPU
import cupy as cp
import cudf
import dask
import dask_cudf

# Config

## Dataset metadata

In [None]:
#If you change the order of the creation of new keys in the "dataset_info" OrderedDict, 
#the order of the final CSV columns with the dataset profile will change accordingly
dataset_info = OrderedDict()
dataset_info['name'] = 'H&B_fashion'
dataset_info['domain'] = 'ecommerce'
dataset_info['description'] = ''
dataset_info['source'] = 'https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations'
dataset_info['event_types'] = 'purchases'

## Data format and path

In [None]:
#Accepted data formats are: csv | tsv | parquet
DATA_FORMAT = 'csv' 
#List of columns names to be used for CSV / TSV files without the header line
HEADLESS_CSV_COLUMN_NAMES = None #Example: ['col1','col2']

In [None]:
DATA_FOLDER = "../input/h-and-m-personalized-fashion-recommendations/"
FILENAME_PATTERN = 'transactions_train.csv'
DATA_PATH = os.path.join(DATA_FOLDER, FILENAME_PATTERN)

!ls $DATA_PATH

## Dataset columns

In [None]:
HAS_TIMESTAMP = True
HAS_USER = True

# Set these column names from your input dataset
COL_ITEM_ID = 'article_id'
COL_USER_ID = 'customer_id'
COL_DATETIME = 't_dat' #Can be None if HAS_TIMESTAMP=False

In [None]:
#Do not change
cols_origin = [COL_ITEM_ID]
if HAS_USER:
    cols_origin.append(COL_USER_ID)
if HAS_TIMESTAMP:
    cols_origin.append(COL_DATETIME)
    
dataset_info['has_timestamp'] = HAS_TIMESTAMP

## Dates config

### Datetime / timestamp conversion

In [None]:
#The rest of the notebook expects the Time column to be in 'datetime64' dtype
#Possible values:
# - None - For a datetime column in a parquet file, keep datetimes as they are (no conversion required)
# - 's' - For timestamp in seconds or general date represented as string like: '2016-04-09' or '2019-10-01 02:15:47 UTC'
# - 'ms' - For timestamp in milliseconds
DATETIME_CONVERTION = 's'
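
As a CPU-side illustration of the units listed above, the sketch below converts epoch seconds, epoch milliseconds and a date string to `datetime64` values. It is only an approximation of what the notebook does: it uses `pd.to_datetime` from pandas rather than the cuDF `astype` call, and the sample values are made up for the example.

```python
import pandas as pd

# Hypothetical sample values (not taken from the H&M dataset).
# All three represent 2018-09-20 00:00:00.
epoch_seconds = pd.Series([1537401600])
epoch_millis = pd.Series([1537401600000])
date_strings = pd.Series(['2018-09-20'])

# The unit tells pandas how to interpret the integer timestamps
from_seconds = pd.to_datetime(epoch_seconds, unit='s')
from_millis = pd.to_datetime(epoch_millis, unit='ms')
# Date strings are parsed directly
from_strings = pd.to_datetime(date_strings)

print(from_seconds[0], from_millis[0], from_strings[0])
```

Choosing the wrong unit does not raise an error for integer timestamps, it silently shifts the dates, so it is worth checking the first rows after conversion.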

### Dates filtering

In [None]:
#Whether to filter the dataset by date times (inclusive)
MIN_DATETIME = None   #pd.Timestamp(2017, 10, 1)
MAX_DATETIME = None #Including hour

### Outputs

In [None]:
OUTPUT_PATH = '/kaggle/working'
#Whether to save the dataset info profile to CSV
SAVE_DATASET_INFO_CSV = True
#Whether to save the item and user frequency cumulative distributions (e.g. for later plotting of multiple datasets distributions in the same chart)
SAVE_USERS_ITEMS_CUM_FREQ_DISTR = True

## GPU config

In [None]:
#The list of GPU devices to be used by Dask-cuDF. Should be comma-separated, e.g., "0,1,2,3"
CUDA_VISIBLE_DEVICES = "0"
# Caches the dataset into GPU memory. Disable this if you have a dataset larger than GPU memory and you start getting CUDA OOM errors
CACHE_DATASET_ON_GPU = True

### Listing the available GPUs

In [None]:
!nvidia-smi

In [None]:
os.environ["CUDA_VISIBLE_DEVICES"]=CUDA_VISIBLE_DEVICES

In [None]:
#Uses the /tmp folder as the Dask workspace (point tmp_directory to a faster disk, such as a RAID folder on a DGX, if available)
tmp_directory = '/tmp'    
dask_workdir = os.path.join(tmp_directory, 'dask-workdir')    
print('Dask dir:', dask_workdir)

In [None]:
# Make sure we have a clean worker space for Dask
if os.path.isdir(dask_workdir):
    shutil.rmtree(dask_workdir)
os.mkdir(dask_workdir)

# Data loading & preproc

In [None]:
#Do not change the dest column names
cols_dest = ['ItemId']
if HAS_USER:
    cols_dest.append('UserId')
if HAS_TIMESTAMP:
    cols_dest.append('Time')

In [None]:
if DATA_FORMAT == 'parquet':
    ddf = dask_cudf.read_parquet(DATA_PATH)
    
elif DATA_FORMAT in ['csv', 'tsv']:
    ddf = dask_cudf.read_csv(DATA_PATH,                              
                             sep='\t' if DATA_FORMAT == 'tsv' else ',',
                             names=HEADLESS_CSV_COLUMN_NAMES
                            )
else:
    raise ValueError('Acceptable data formats are: parquet | csv | tsv')

In [None]:
print(ddf.dtypes)

In [None]:
dataset_info['columns'] = ','.join(list(ddf.columns))
dataset_info['columns']

In [None]:
#Keep only the required columns and renaming to fixed names
ddf = ddf[cols_origin]
ddf.columns = cols_dest

## Datetime processing

In [None]:
if HAS_TIMESTAMP:
    display(ddf['Time'].head())

In [None]:
if HAS_TIMESTAMP:
    #Converts date time if configured to do so
    if DATETIME_CONVERTION is not None:
        ddf['Time'] = ddf['Time'].astype(f'datetime64[{DATETIME_CONVERTION}]')

In [None]:
if HAS_TIMESTAMP:
    display(ddf['Time'].head())

In [None]:
if HAS_TIMESTAMP:
    #Creates a string representation of the dates
    ddf['DateStr'] = ddf['Time'].dt.strftime("%Y-%m-%d")

In [None]:
if HAS_TIMESTAMP:
    #Filtering the dataset based on minimum and maximum datetimes (inclusive)
    if MIN_DATETIME is not None:
        ddf = ddf[ddf['Time'] >= MIN_DATETIME]

    if MAX_DATETIME is not None:
        ddf = ddf[ddf['Time'] <= MAX_DATETIME]    

## Dataset caching

In [None]:
#Caches the dataset into GPU memory (if it fits). With the default local scheduler used here, persist() computes the dataframe right away
if CACHE_DATASET_ON_GPU:
    ddf, = dask.persist(ddf)

## Extracting columns from date

In [None]:
%%time
if HAS_TIMESTAMP:
    min_date = ddf['Time'].min().compute().date()
    dataset_info['first_date'] = min_date.strftime('%Y-%m-%d')
    print(min_date)
else:
    dataset_info['first_date'] = None

In [None]:
%%time
if HAS_TIMESTAMP:
    max_date = ddf['Time'].max().compute().date()
    dataset_info['last_date'] = max_date.strftime('%Y-%m-%d')
    print(max_date)
else:
    dataset_info['last_date'] = None

In [None]:
if HAS_TIMESTAMP:
    dataset_info['num_days'] = (max_date - min_date).days + 1
    print(dataset_info['num_days'])
else:
    dataset_info['num_days'] = None

In [None]:
%%time
if HAS_TIMESTAMP:
    #Creating an auxiliary table with Pandas to extract the ISO week, which is later merged into the cuDF dataframe
    dates_df = pd.DataFrame(pd.date_range(start=min_date, end=max_date), columns=['date'])
    #Series.dt.weekofyear was removed in pandas 2.0; isocalendar() (pandas >= 1.1) replaces it.
    #The ISO year is paired with the ISO week, so that a day like 2018-12-31 (ISO week 1 of 2019)
    #is labeled '2019-w01' instead of '2018-w01'
    iso_calendar = dates_df['date'].dt.isocalendar()
    dates_df['month'] = dates_df['date'].dt.month
    dates_df['year'] = dates_df['date'].dt.year
    dates_df['year-week'] = iso_calendar['year'].astype('str') + "-w" + \
                            iso_calendar['week'].astype('int').apply(lambda x: "{:02d}".format(x))
    dates_df['year-month'] = dates_df['year'].astype('str') +"-" +  \
                             dates_df['month'].apply(lambda x: "{:02d}".format(x))
    dates_df['DateStr'] = dates_df['date'].dt.strftime("%Y-%m-%d")
    del dates_df['month'], dates_df['year'], dates_df['date']
    dates_df.set_index('DateStr', inplace=True)
    dates_df = cudf.from_pandas(dates_df)
    display(dates_df.head())

In [None]:
# Includes year-week and year-month into the main dataframe
if HAS_TIMESTAMP:
    ddf = ddf.merge(dates_df,  left_on='DateStr', right_index=True)
    display(ddf.head(10))
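
Week labels deserve some care around New Year, because the ISO week of a date can belong to the previous or the next ISO year. The sketch below uses only the standard library to show this; `year_week_label` is a hypothetical helper written for this illustration and is not part of the notebook.

```python
from datetime import date

def year_week_label(d):
    """Builds a 'YYYY-wWW' label from the ISO year and ISO week of a date."""
    iso_year, iso_week, _ = d.isocalendar()
    return "{}-w{:02d}".format(iso_year, iso_week)

# Sunday 2018-12-30 still belongs to the last ISO week of 2018
label_sunday = year_week_label(date(2018, 12, 30))
# Monday 2018-12-31 already belongs to ISO week 1 of 2019
label_monday = year_week_label(date(2018, 12, 31))

print(label_sunday, label_monday)
```

Pairing the ISO week with the calendar year instead would label 2018-12-31 as week 1 of 2018, which sorts before every other week in a dataset that starts in September 2018.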

# Basic stats

## Counts

#### Number of interactions

In [None]:
%%time
nrows = len(ddf)
dataset_info['num_interactions'] = nrows
print(nrows)

#### Number of items

In [None]:
%%time
n_items = ddf['ItemId'].nunique().compute()
dataset_info['num_items'] = n_items
n_items

#### Number of users

In [None]:
%%time
if HAS_USER:
    n_users = ddf['UserId'].nunique().compute()
else:
    n_users = 0
dataset_info['num_users'] = n_users
n_users

#### User-Item Matrix Sparsity

In [None]:
%%time
if HAS_USER:
    n_unique_user_item_pairs = len(ddf[['UserId', 'ItemId']].drop_duplicates())
else:
    n_unique_user_item_pairs = 0
n_unique_user_item_pairs

In [None]:
if HAS_USER:
    sparsity = 1.0 - (n_unique_user_item_pairs / (n_items * n_users))
else:
    sparsity = None
dataset_info['sparsity_user_item_matrix'] = sparsity
sparsity
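
To make the sparsity formula concrete, here is a small worked example on a made-up interaction table, using pandas on the CPU instead of Dask-cuDF. All names prefixed with `toy_` are hypothetical and exist only in this sketch.

```python
import pandas as pd

# Hypothetical toy interactions; the pair (u1, a) appears twice
toy_interactions = pd.DataFrame({
    'UserId': ['u1', 'u1', 'u2', 'u3', 'u3'],
    'ItemId': ['a',  'a',  'b',  'a',  'c'],
})

toy_n_users = toy_interactions['UserId'].nunique()   # 3 users
toy_n_items = toy_interactions['ItemId'].nunique()   # 3 items
# Repeated interactions fill the same matrix cell, so count unique pairs only
toy_n_pairs = len(toy_interactions.drop_duplicates())  # 4 filled cells

# Share of empty cells in the 3 x 3 user-item matrix: 1 - 4/9
toy_sparsity = 1.0 - toy_n_pairs / (toy_n_users * toy_n_items)
print(toy_sparsity)
```

Real interaction datasets are usually far sparser than this toy case, with values very close to 1.0.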

## Distributions

In [None]:
def gini_index_cupy(array):
    """Calculate the Gini coefficient of a numpy array."""
    # based on bottom eq:
    # http://www.statsdirect.com/help/generatedimages/equations/equation154.svg
    # from:
    # http://www.statsdirect.com/help/default.htm#nonparametric_methods/gini.htm
    # All values are treated equally, arrays must be 1d:
    array = array.flatten()
    if cp.amin(array) < 0:
        # Values cannot be negative:
        array -= cp.amin(array)
    # Values cannot be 0:
    array += 0.0000001
    # Values must be sorted:
    array = cp.sort(array)
    # Index per array element:
    index = cp.arange(1,array.shape[0]+1)
    # Number of array elements:
    n = array.shape[0]
    # Gini coefficient:
    return float(((cp.sum((2 * index - n  - 1) * array)) / (n * cp.sum(array))))
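
For readers without a GPU, the same formula can be sketched with NumPy and checked on two extreme cases. `gini_index_numpy` is a hypothetical CPU counterpart written for this illustration; it mirrors the steps of the function above but is not used elsewhere in the notebook.

```python
import numpy as np

def gini_index_numpy(values):
    """Gini coefficient of a 1-D collection of non-negative frequencies."""
    array = np.asarray(values, dtype='float64').flatten()
    if array.min() < 0:
        # Shift so that no value is negative
        array = array - array.min()
    # Avoid exact zeros, as in the CuPy version
    array = array + 0.0000001
    array = np.sort(array)
    n = array.shape[0]
    index = np.arange(1, n + 1)
    return float(np.sum((2 * index - n - 1) * array) / (n * np.sum(array)))

# All items equally popular: no inequality
gini_equal = gini_index_numpy([5, 5, 5, 5])
# One item gets every interaction: approaches (n - 1) / n = 0.75 for n = 4
gini_concentrated = gini_index_numpy([0, 0, 0, 100])
print(gini_equal, gini_concentrated)
```

A Gini index close to 0 means interactions are spread evenly, while a value close to 1 means a few items (or users) concentrate most of them.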

In [None]:
def set_distr_percentiles(data, dataset_info, prefix, dtype='int'):
    if data is None:
        #If the dataframe with stats is not available, fill columns with None
        dataset_info[f"min_{prefix}"] = None
        dataset_info[f"p25_{prefix}"] = None
        dataset_info[f"p50_{prefix}"] = None
        dataset_info[f"p75_{prefix}"] = None
        dataset_info[f"p90_{prefix}"] = None
        dataset_info[f"p95_{prefix}"] = None
        dataset_info[f"p99_{prefix}"] = None
        dataset_info[f"max_{prefix}"] = None
    else: 
        dataset_info[f"min_{prefix}"] = data['min'].astype(dtype)
        dataset_info[f"p25_{prefix}"] = data['25%'].astype(dtype)
        dataset_info[f"p50_{prefix}"] = data['50%'].astype(dtype)
        dataset_info[f"p75_{prefix}"] = data['75%'].astype(dtype)
        dataset_info[f"p90_{prefix}"] = data['90%'].astype(dtype)
        dataset_info[f"p95_{prefix}"] = data['95%'].astype(dtype)
        dataset_info[f"p99_{prefix}"] = data['99%'].astype(dtype)
        dataset_info[f"max_{prefix}"] = data['max'].astype(dtype)

In [None]:
#The percentiles that will be extracted for all distributions
PERCENTILES=np.concatenate([np.arange(0.0, 1.1, 0.1), np.array([0.25, 0.75, 0.95, 0.99])])
PERCENTILES
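
The `set_distr_percentiles` helper reads entries such as `'25%'` or `'99%'` from the output of `describe()`. The sketch below shows where those labels come from, using pandas on a made-up series; it assumes the cuDF `describe()` used in this notebook labels its output the same way, which is what the helper relies on.

```python
import numpy as np
import pandas as pd

# Hypothetical frequencies 1..100
toy_freq = pd.Series(np.arange(1, 101))

# Each requested percentile becomes an index entry like '25%'
toy_stats = toy_freq.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
print(toy_stats[['min', '25%', '50%', '75%', '90%', '95%', '99%', 'max']])
```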

### # Interactions per Item distribution

In [None]:
%%time
items_freq_df = ddf.groupby('ItemId').size().to_frame('freq').compute().sort_values('freq', ascending=False)
items_freq_df.head(10)

In [None]:
items_freq_gini_index = gini_index_cupy(items_freq_df['freq'].values.astype('float'))
dataset_info['items_freq_gini_index'] = items_freq_gini_index

In [None]:
%%time
items_freq_cum_perc = (items_freq_df['freq'].cumsum() / items_freq_df['freq'].sum()).to_frame('cum_interactions_by_item_freq')
items_freq_cum_perc['dummy'] = 1
items_freq_cum_perc['cum_perc_items'] = items_freq_cum_perc['dummy'].cumsum() / items_freq_cum_perc['dummy'].sum()
del items_freq_cum_perc['dummy']
items_freq_cum_perc

In [None]:
items_freq_cum_perc_pdf = items_freq_cum_perc.set_index('cum_perc_items').to_pandas()

In [None]:
ax = items_freq_cum_perc_pdf.plot.line(figsize=(15,8))

ax.set_title('Cumulative distribution of Items Frequency')
ax.set_ylabel('% of interactions')
ax.set_xlabel('% of items')

In [None]:
#Reindexing the distribution by 1% increments
items_freq_cum_perc_reindexed_pdf = items_freq_cum_perc_pdf.reindex(np.arange(0.0, 1.01, 0.01), method='pad').fillna(0.)
items_freq_cum_perc_reindexed_pdf.head()

In [None]:
#Extracting some percentiles to save
items_freq_cum_perc_selected_pdf = items_freq_cum_perc_reindexed_pdf.loc[[0.01, 0.05, 0.10, 0.25, 0.50]]
dataset_info['top-01%_item_cum_freq'] = items_freq_cum_perc_selected_pdf.loc[0.01].iloc[0]
dataset_info['top-05%_item_cum_freq'] = items_freq_cum_perc_selected_pdf.loc[0.05].iloc[0]
dataset_info['top-10%_item_cum_freq'] = items_freq_cum_perc_selected_pdf.loc[0.10].iloc[0]
dataset_info['top-25%_item_cum_freq'] = items_freq_cum_perc_selected_pdf.loc[0.25].iloc[0]
dataset_info['top-50%_item_cum_freq'] = items_freq_cum_perc_selected_pdf.loc[0.50].iloc[0]
items_freq_cum_perc_selected_pdf
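
The cumulative distribution above answers questions like "what share of interactions goes to the top 25% most popular items?". The following pandas sketch reproduces the idea on four made-up items; the `toy_` names are hypothetical and the frequencies are invented for the example.

```python
import pandas as pd

# Hypothetical item frequencies, sorted from most to least popular
toy_item_freq = pd.Series({'a': 60, 'b': 25, 'c': 10, 'd': 5}).sort_values(ascending=False)

# Share of all interactions covered by the items seen so far
toy_cum_interactions = toy_item_freq.cumsum() / toy_item_freq.sum()
# Share of the catalog covered by the items seen so far
toy_cum_items = pd.Series(range(1, len(toy_item_freq) + 1),
                          index=toy_item_freq.index) / len(toy_item_freq)

toy_curve = pd.DataFrame({'cum_perc_items': toy_cum_items,
                          'cum_interactions_by_item_freq': toy_cum_interactions})
print(toy_curve)
```

In this toy case the top 25% of items (item `a`) receive 60% of the interactions and the top 50% receive 85%, which is the kind of long-tail pattern the `top-XX%_item_cum_freq` statistics summarize.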

In [None]:
items_freq_percentiles_df = items_freq_df['freq'].describe(percentiles=PERCENTILES)
set_distr_percentiles(items_freq_percentiles_df, dataset_info, prefix='item_freq')
items_freq_percentiles_df

In [None]:
ax = items_freq_df.groupby('freq').size().to_pandas().hist(bins=100, figsize=(15,8))
ax.set_title('Items Frequency histogram')

### # Interactions per User distribution

In [None]:
%%time
if HAS_USER:
    user_freq_series = ddf.groupby('UserId').size().compute()
    display(user_freq_series.head(10))

In [None]:
if HAS_USER:
    users_freq_gini_index = gini_index_cupy(user_freq_series.values.astype('float'))
else:
    users_freq_gini_index = None
dataset_info['users_freq_gini_index'] = users_freq_gini_index

In [None]:
%%time
if HAS_USER:
    users_freq_df = ddf.groupby('UserId').size().to_frame('freq').compute().sort_values('freq', ascending=False)
    display(users_freq_df.head(10))

In [None]:
%%time
if HAS_USER:
    users_freq_cum_perc = (users_freq_df['freq'].cumsum() / users_freq_df['freq'].sum()).to_frame('cum_interactions_by_user_freq')
    users_freq_cum_perc['dummy'] = 1
    users_freq_cum_perc['cum_perc_users'] = users_freq_cum_perc['dummy'].cumsum() / users_freq_cum_perc['dummy'].sum()
    del users_freq_cum_perc['dummy']
    display(users_freq_cum_perc)

In [None]:
if HAS_USER:
    users_freq_cum_perc_pdf = users_freq_cum_perc.set_index('cum_perc_users').to_pandas()

In [None]:
if HAS_USER:
    ax = users_freq_cum_perc_pdf.plot.line(figsize=(15,8))

    ax.set_title('Cumulative distribution of Users Frequency')
    ax.set_ylabel('% of interactions')
    ax.set_xlabel('% of users')

In [None]:
if HAS_USER:
    users_freq_cum_perc_reindexed_pdf = users_freq_cum_perc_pdf.reindex(np.arange(0.0, 1.01, 0.01), method='pad').fillna(0.)
    display(users_freq_cum_perc_reindexed_pdf.head())

In [None]:
if HAS_USER:
    #Extracting some percentiles to save
    users_freq_cum_perc_selected_pdf = users_freq_cum_perc_reindexed_pdf.loc[[0.01, 0.05, 0.10, 0.25, 0.50]]
    dataset_info['top-01%_user_cum_freq'] = users_freq_cum_perc_selected_pdf.loc[0.01].iloc[0]
    dataset_info['top-05%_user_cum_freq'] = users_freq_cum_perc_selected_pdf.loc[0.05].iloc[0]
    dataset_info['top-10%_user_cum_freq'] = users_freq_cum_perc_selected_pdf.loc[0.10].iloc[0]
    dataset_info['top-25%_user_cum_freq'] = users_freq_cum_perc_selected_pdf.loc[0.25].iloc[0]
    dataset_info['top-50%_user_cum_freq'] = users_freq_cum_perc_selected_pdf.loc[0.50].iloc[0]
    display(users_freq_cum_perc_selected_pdf)
else:
    dataset_info['top-01%_user_cum_freq'] = None
    dataset_info['top-05%_user_cum_freq'] = None
    dataset_info['top-10%_user_cum_freq'] = None
    dataset_info['top-25%_user_cum_freq'] = None
    dataset_info['top-50%_user_cum_freq'] = None

In [None]:
if HAS_USER:
    users_freq_percentiles_df = users_freq_df['freq'].describe(percentiles=PERCENTILES)
else:    
    users_freq_percentiles_df = None
    
set_distr_percentiles(users_freq_percentiles_df, dataset_info, prefix='user_freq')
display(users_freq_percentiles_df)

In [None]:
if HAS_USER:
    ax = users_freq_df.groupby('freq').size().to_pandas().hist(bins=100, figsize=(15,8))
    ax.set_title('Users Frequency histogram')

### # User-Item repeated interaction distribution

In [None]:
%%time
if HAS_USER:
    items_per_user_df = ddf.groupby('UserId')['ItemId'].size().to_frame('count')
    display(items_per_user_df)

In [None]:
%%time
if HAS_USER:
    unique_items_per_user_df = ddf[['UserId', 'ItemId']].drop_duplicates().groupby('UserId').size().to_frame('nunique')
    display(unique_items_per_user_df)

In [None]:
%%time
if HAS_USER:
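    #Both dataframes are indexed by UserId, so they are joined on their indexes
    items_per_user_joined_df = items_per_user_df.merge(unique_items_per_user_df, left_index=True, right_index=True)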
    items_per_user_joined_df['user_perc_repeated_interactions'] = (items_per_user_joined_df['count'] - items_per_user_joined_df['nunique']) / items_per_user_joined_df['count']
    display(items_per_user_joined_df)

In [None]:
%%time
if HAS_USER:
    repeat_user_item_interactions_percentiles_df = \
            items_per_user_joined_df['user_perc_repeated_interactions'].compute().describe(percentiles=PERCENTILES) #.compute()
else:
    repeat_user_item_interactions_percentiles_df = None
set_distr_percentiles(repeat_user_item_interactions_percentiles_df, dataset_info, prefix='user_perc_repeated_interactions', dtype='float')
display(repeat_user_item_interactions_percentiles_df)
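
The repeated-interaction ratio can be checked by hand on a tiny example. The sketch below uses pandas with made-up data and, for brevity, `nunique()` in place of the `drop_duplicates()` plus `groupby` combination used above; the `toy_` names are hypothetical.

```python
import pandas as pd

# Hypothetical interactions: u1 buys item 'a' twice, u2 never repeats an item
toy_purchases = pd.DataFrame({
    'UserId': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'ItemId': ['a',  'a',  'b',  'c',  'd'],
})

toy_count = toy_purchases.groupby('UserId')['ItemId'].size()       # u1: 3, u2: 2
toy_nunique = toy_purchases.groupby('UserId')['ItemId'].nunique()  # u1: 2, u2: 2

# Share of each user's interactions that repeat an item the user already had
toy_repeat_ratio = (toy_count - toy_nunique) / toy_count
print(toy_repeat_ratio)
```

A distribution concentrated near 0 suggests that users rarely come back to the same item, so filtering out already-purchased items from recommendations costs little; higher values suggest repeat purchases matter.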

## Temporal aspects

### # Interactions per day, week, month

In [None]:
%%time
if HAS_TIMESTAMP:
    ddf.groupby('DateStr').size().compute().to_pandas().sort_index().plot.line(figsize=(15,8))

In [None]:
%%time
if HAS_TIMESTAMP:
    ddf.groupby('year-week').size().compute().to_pandas().sort_index().plot.line(figsize=(15,8))

In [None]:
%%time
if HAS_TIMESTAMP:
    ddf.groupby('year-month').size().compute().to_pandas().sort_index().plot.line(figsize=(15,8))

### Items lifetime: Items interactions decay over time

In [None]:
%%time
if HAS_TIMESTAMP:
    # Getting the first time each item was seen in the dataset
    first_date_item_df = ddf.groupby('ItemId')[['Time', 'DateStr', 'year-week', 'year-month']].min()
    first_date_item_df = first_date_item_df.compute().rename(
                              {'Time': 'first_Time',
                               'DateStr': 'first_DateStr',
                               'year-week': 'first_year-week',
                               'year-month': 'first_year-month'}, axis=1)

In [None]:
%%time
if HAS_TIMESTAMP:
    #Computing for each interaction how many days since the item was first seen
    item_interactions_time_df = ddf[['ItemId', 'Time']].merge(first_date_item_df, left_on='ItemId', right_index=True)#.compute()
    item_interactions_time_df['item_elapsed_days_since_first_seen'] = (item_interactions_time_df['Time'] - item_interactions_time_df['first_Time']).dt.days
    display(item_interactions_time_df.head())

In [None]:
%%time
if HAS_TIMESTAMP:
    item_elapsed_days_since_available_describe_df = item_interactions_time_df['item_elapsed_days_since_first_seen'] \
                                    .compute().describe(percentiles=PERCENTILES)  #.compute()
    set_distr_percentiles(item_elapsed_days_since_available_describe_df, dataset_info, prefix='item_interactions_by_age_days')
    display(item_elapsed_days_since_available_describe_df)

In [None]:
if HAS_TIMESTAMP:
    item_interactions_time_df.groupby('item_elapsed_days_since_first_seen').size().compute() \
                .to_pandas().sort_index().plot.bar(figsize=(100,15))
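
The item age computation can be illustrated on a few made-up rows. This pandas sketch uses `groupby(...).transform('min')` as a shortcut for the merge with `first_date_item_df` performed above; the `toy_` names and dates are hypothetical.

```python
import pandas as pd

# Hypothetical interactions for two items
toy_events = pd.DataFrame({
    'ItemId': ['a', 'a', 'a', 'b', 'b'],
    'Time': pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-11',
                            '2020-01-05', '2020-01-06']),
})

# First time each item was seen, aligned to every interaction row
toy_first_seen = toy_events.groupby('ItemId')['Time'].transform('min')
# Age of the item, in days, at the moment of each interaction
toy_events['item_elapsed_days_since_first_seen'] = (toy_events['Time'] - toy_first_seen).dt.days
print(toy_events)
```

If most interactions happen at low ages, items lose relevance quickly and fresher items may deserve a boost in recommendations.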

### Item Cold-start

#### How many new items every day / week / month

In [None]:
%%time
if HAS_TIMESTAMP:
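    #cuDF groupby does not sort by key by default, so sort by date to make [1:] skip the first day
    new_items_by_day = first_date_item_df.groupby('first_DateStr').size().to_pandas().sort_index()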
    display(new_items_by_day)

In [None]:
%%time
if HAS_TIMESTAMP:
    #Ignoring first day
    dataset_info['new_items_by_day_p50'] = new_items_by_day[1:].median()
    print(dataset_info['new_items_by_day_p50'])

In [None]:
if HAS_TIMESTAMP:
    display(new_items_by_day[1:].sort_index().plot.bar(figsize=(100,10)))

In [None]:
%%time
if HAS_TIMESTAMP:
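    #cuDF groupby does not sort by key by default, so sort by week to make [1:] skip the first week
    new_items_by_week = first_date_item_df.groupby('first_year-week').size().to_pandas().sort_index()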
    display(new_items_by_week)

In [None]:
if HAS_TIMESTAMP:
    dataset_info['new_items_by_week_p50'] = new_items_by_week[1:].median()
    print(dataset_info['new_items_by_week_p50'])

In [None]:
if HAS_TIMESTAMP:
    if len(new_items_by_week[1:]) > 0:
        new_items_by_week[1:].sort_index().plot.bar(figsize=(20,8))

In [None]:
%%time
if HAS_TIMESTAMP:
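    #cuDF groupby does not sort by key by default, so sort by month to make [1:] skip the first month
    new_items_by_month = first_date_item_df.groupby('first_year-month').size().to_pandas().sort_index()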
    display(new_items_by_month)

In [None]:
%%time
if HAS_TIMESTAMP:
    #Ignoring first month
    dataset_info['new_items_by_month_p50'] = new_items_by_month[1:].median()
    print(dataset_info['new_items_by_month_p50'])

In [None]:
%%time
if HAS_TIMESTAMP:
    if len(new_items_by_month[1:]) > 0:
        new_items_by_month[1:].sort_index().plot.bar(figsize=(15,8))

#### % of interactions on new items first seen in the same day, week, month

In [None]:
%%time
if HAS_TIMESTAMP:
    item_cold_start_df = ddf.merge(first_date_item_df,  left_on='ItemId', right_index=True)[['DateStr', 'first_DateStr',
                                                                        'year-week', 'first_year-week',
                                                                        'year-month', 'first_year-month']] #.compute()
    min_year_week = item_cold_start_df['year-week'].compute().min()
    #Ignoring first week (where most of the items will occur first)
    item_cold_start_df = item_cold_start_df[item_cold_start_df['year-week'] != min_year_week]
    #Checking if the item was created in the same day, week or month of the interaction
    item_cold_start_df['item_created_same_day'] = (item_cold_start_df['DateStr'] == item_cold_start_df['first_DateStr'])
    item_cold_start_df['item_created_same_week'] = (item_cold_start_df['year-week'] == item_cold_start_df['first_year-week'])
    item_cold_start_df['item_created_same_month'] = (item_cold_start_df['year-month'] == item_cold_start_df['first_year-month'])

In [None]:
%%time
if HAS_TIMESTAMP:
    perc_interactions_on_items_created_same_day = item_cold_start_df['item_created_same_day'].mean().compute()
    dataset_info['perc_interact_items_created_same_day'] = perc_interactions_on_items_created_same_day
    print(perc_interactions_on_items_created_same_day)

In [None]:
%%time
if HAS_TIMESTAMP:
    perc_interactions_on_items_created_same_week = item_cold_start_df['item_created_same_week'].mean().compute()
    dataset_info['perc_interact_items_created_same_week'] = perc_interactions_on_items_created_same_week
    print(perc_interactions_on_items_created_same_week)

In [None]:
%%time
if HAS_TIMESTAMP:
    perc_interactions_on_items_created_same_month = item_cold_start_df['item_created_same_month'].mean().compute()
    dataset_info['perc_interact_items_created_same_month'] = perc_interactions_on_items_created_same_month
    print(perc_interactions_on_items_created_same_month)
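
As a sanity check of the cold-start metric, the sketch below computes the share of interactions on items first seen on the same day, for a made-up table in pandas. It drops the first day rather than the first week, only to keep the toy data small; the `toy_` names are hypothetical.

```python
import pandas as pd

# Hypothetical interactions over three days
toy_log = pd.DataFrame({
    'ItemId':  ['a', 'a', 'b', 'b', 'c'],
    'DateStr': ['2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03', '2020-01-03'],
})

# First day each item appears, aligned to every interaction row
toy_first_day = toy_log.groupby('ItemId')['DateStr'].transform('min')
toy_log['item_created_same_day'] = toy_log['DateStr'] == toy_first_day

# Ignore the first period, where every item is trivially "new"
toy_later = toy_log[toy_log['DateStr'] != toy_log['DateStr'].min()]
toy_same_day_share = toy_later['item_created_same_day'].mean()
print(toy_same_day_share)
```

Here two of the four remaining interactions hit an item on its first day, so the share is 0.5. A high share in a real dataset signals a strong item cold-start problem.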

### Users lifetime: User interactions decay over time

In [None]:
%%time
if HAS_USER and HAS_TIMESTAMP:
    first_date_user_df = ddf.groupby('UserId')[['Time', 'DateStr', 'year-week', 'year-month']].min()
    first_date_user_df = first_date_user_df.compute().rename(
                              {'Time': 'first_Time',
                               'DateStr': 'first_DateStr',
                               'year-week': 'first_year-week',
                               'year-month': 'first_year-month'}, axis=1)
    display(first_date_user_df)

In [None]:
%%time
if HAS_USER and  HAS_TIMESTAMP:
    user_interactions_time_df = ddf[['UserId', 'Time']].merge(first_date_user_df, left_on='UserId', right_index=True) #.compute()
    user_interactions_time_df['user_elapsed_days_since_first_seen'] = (user_interactions_time_df['Time'] - user_interactions_time_df['first_Time']).dt.days

In [None]:
%%time
if HAS_USER and HAS_TIMESTAMP:
    user_elapsed_days_since_first_seen_describe_df = \
            user_interactions_time_df['user_elapsed_days_since_first_seen'].compute().describe(percentiles=PERCENTILES) #.compute()
else:
    user_elapsed_days_since_first_seen_describe_df = None
    
set_distr_percentiles(user_elapsed_days_since_first_seen_describe_df, dataset_info, prefix='user_interactions_by_age_days')
display(user_elapsed_days_since_first_seen_describe_df)

In [None]:
%%time
if HAS_USER and HAS_TIMESTAMP:
    user_interactions_time_df.groupby('user_elapsed_days_since_first_seen').size().compute().to_pandas().sort_index().plot.bar(figsize=(100,8))

### User Cold-start

#### How many new users every day / week / month

In [None]:
%%time
if HAS_USER and HAS_TIMESTAMP:
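    #cuDF groupby does not sort by key by default, so sort by date to make [1:] skip the first day
    new_users_by_day = first_date_user_df.groupby('first_DateStr').size().to_pandas().sort_index()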
    display(new_users_by_day)

In [None]:
%%time
if HAS_USER and HAS_TIMESTAMP:
    #Ignoring first day
    dataset_info['new_users_by_day_p50'] = new_users_by_day[1:].median()
    print(dataset_info['new_users_by_day_p50'])
else:
    dataset_info['new_users_by_day_p50'] = None

In [None]:
%%time
if HAS_USER and HAS_TIMESTAMP:
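    #cuDF groupby does not sort by key by default, so sort by week to make [1:] skip the first week
    new_users_by_week = first_date_user_df.groupby('first_year-week').size().to_pandas().sort_index()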
    display(new_users_by_week)

In [None]:
%%time
if HAS_USER and HAS_TIMESTAMP:
    #Ignoring first week
    dataset_info['new_users_by_week_p50'] = new_users_by_week[1:].median()
    print(dataset_info['new_users_by_week_p50'])
else:
    dataset_info['new_users_by_week_p50'] = None

In [None]:
%%time
if HAS_USER and HAS_TIMESTAMP:
    if len(new_users_by_week[1:]) > 0:
        new_users_by_week[1:].sort_index().plot.bar(figsize=(20,8))

In [None]:
%%time
if HAS_USER and HAS_TIMESTAMP:
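    #cuDF groupby does not sort by key by default, so sort by month to make [1:] skip the first month
    new_users_by_month = first_date_user_df.groupby('first_year-month').size().to_pandas().sort_index()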
    display(new_users_by_month)

In [None]:
if HAS_USER and HAS_TIMESTAMP:
    #Ignoring first month
    dataset_info['new_users_by_month_p50'] = new_users_by_month[1:].median()
    print(dataset_info['new_users_by_month_p50'])
else:
    dataset_info['new_users_by_month_p50'] = None

In [None]:
if HAS_USER and HAS_TIMESTAMP:
    if len(new_users_by_month[1:]) > 0:
        new_users_by_month[1:].sort_index().plot.bar(figsize=(15,8))

#### % of interactions by new users first seen in the same day, week, month

In [None]:
%%time
if HAS_USER and HAS_TIMESTAMP:
    user_cold_start_df = ddf.merge(first_date_user_df,  left_on='UserId', right_index=True)[['DateStr', 'first_DateStr',
                                                                        'year-week', 'first_year-week',
                                                                        'year-month', 'first_year-month']] #.compute()
    min_year_week = user_cold_start_df['year-week'].compute().min()
    #Ignoring first week (where most of the users will occur first)
    user_cold_start_df = user_cold_start_df[user_cold_start_df['year-week'] != min_year_week]
    #Checking if the user was first seen in the same day, week or month of the interaction
    user_cold_start_df['user_created_same_day'] = (user_cold_start_df['DateStr'] == user_cold_start_df['first_DateStr'])
    user_cold_start_df['user_created_same_week'] = (user_cold_start_df['year-week'] == user_cold_start_df['first_year-week'])
    user_cold_start_df['user_created_same_month'] = (user_cold_start_df['year-month'] == user_cold_start_df['first_year-month'])

In [None]:
%%time
if HAS_USER and HAS_TIMESTAMP:
    perc_interactions_by_users_first_seen_same_day = user_cold_start_df['user_created_same_day'].mean().compute()
    dataset_info['perc_interact_users_created_same_day'] = perc_interactions_by_users_first_seen_same_day
    print(perc_interactions_by_users_first_seen_same_day)
else:
    dataset_info['perc_interact_users_created_same_day'] = None

In [None]:
%%time
if HAS_USER and HAS_TIMESTAMP:
    perc_interactions_by_users_first_seen_same_week = user_cold_start_df['user_created_same_week'].mean().compute()
    dataset_info['perc_interact_users_created_same_week'] = perc_interactions_by_users_first_seen_same_week
    print(perc_interactions_by_users_first_seen_same_week)
else:
    dataset_info['perc_interact_users_created_same_week'] = None

In [None]:
%%time
if HAS_USER and HAS_TIMESTAMP:
    perc_interactions_by_users_first_seen_same_month = user_cold_start_df['user_created_same_month'].mean().compute()
    dataset_info['perc_interact_users_created_same_month'] = perc_interactions_by_users_first_seen_same_month
    print(perc_interactions_by_users_first_seen_same_month)
else:
    dataset_info['perc_interact_users_created_same_month'] = None

# Exporting

In this last section we generate and export two CSV files to the `/kaggle/working` folder:  
- `H&B_fashion_dataset_profile.csv` - Main statistics collected for this dataset
- `H&B_fashion_dataset_user_item_cum_freq_distr.csv` - The cumulative distributions of interactions per user and per item

## Exporting dataset info profile

In [None]:
dataset_info_df = pd.DataFrame.from_dict(dataset_info, orient='index').T
dataset_info_df

In [None]:
print(f'Creating output path if it does not exist: {OUTPUT_PATH}')
os.makedirs(OUTPUT_PATH, exist_ok=True)

In [None]:
if SAVE_DATASET_INFO_CSV:
    output_path = os.path.join(OUTPUT_PATH, f"{dataset_info['name']}_dataset_profile.csv")
    print(f'Saving the dataset info to: {output_path}')
    dataset_info_df.to_csv(output_path, index=False)

## Exporting items and users freq. cumulative distributions

In [None]:
if HAS_USER:
    user_item_cum_freq_distr_df = items_freq_cum_perc_reindexed_pdf.merge(users_freq_cum_perc_reindexed_pdf, 
                                                                          left_index=True, right_index=True)
else:
    user_item_cum_freq_distr_df = items_freq_cum_perc_reindexed_pdf
    user_item_cum_freq_distr_df['cum_interactions_by_user_freq'] = 0.0
user_item_cum_freq_distr_df

In [None]:
if SAVE_USERS_ITEMS_CUM_FREQ_DISTR:
    output_path = os.path.join(OUTPUT_PATH, f"{dataset_info['name']}_dataset_user_item_cum_freq_distr.csv")
    print(f'Saving the users and items cumulative freq. distribution to: {output_path}')
    user_item_cum_freq_distr_df.to_csv(output_path)