# Feature generators tutorial

This notebook presents the RePlay tools for feature preprocessing and for generating new user and item features from existing features and the interaction history. RePlay offers two classes:

* CatFeaturesTransformer
* LogStatFeaturesProcessor


### Fit 

To train a feature generator, use the `.fit()` method.

### Transform the data

The `.transform()` method transforms the data using the statistics computed on the training dataset.

In [3]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from pyspark.sql import functions as sf

from replay.data_preparator import DataPreparator, Indexer
from replay.utils import convert2spark

## Get started

Download the **MovieLens** dataset and preprocess it with `DataPreparator` and `Indexer`.

In [5]:
ratings = pd.read_csv("./data/ml1m_ratings.dat", sep="\t", names=["userId", "itemId","relevance","timestamp"])

For each interaction, we add the categorical variable `month`, derived from the interaction timestamp.

In [6]:
# Derive the calendar month (1-12) from the Unix timestamp given in seconds
ratings["month"] = pd.to_datetime(ratings["timestamp"], unit="s").dt.month

In [9]:
dp = DataPreparator()
log = dp.transform(data=ratings,
                  columns_mapping={
                      "user_id": "userId",
                      "item_id":  "itemId",
                      "relevance": "relevance",
                      "timestamp": "timestamp"
                  })

log.show(2)

06-Oct-22 17:52:34, replay, INFO: Columns with ids of users or items are present in mapping. The dataframe will be treated as an interactions log.


+-------+-------+---------+-------------------+-----+
|user_id|item_id|relevance|          timestamp|month|
+-------+-------+---------+-------------------+-----+
|      1|   1193|      5.0|2001-01-01 01:12:40|   12|
|      1|    661|      3.0|2001-01-01 01:35:09|   12|
+-------+-------+---------+-------------------+-----+
only showing top 2 rows
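
The first row shows a `timestamp` in January 2001 but a `month` of 12. A likely explanation is a timezone difference: `pd.to_datetime(..., unit='s')` interprets epoch seconds as UTC, while Spark renders timestamps in the session timezone. The sketch below reproduces the effect with the standard library only. The epoch value `978300760` and the UTC+3 offset are assumptions chosen to match the displayed row; they are not read from the notebook.

```python
from datetime import datetime, timedelta, timezone

# Assumed raw value: epoch seconds chosen to match the first displayed row.
ts = 978300760

# The month computed in UTC, which is how pandas treats epoch seconds.
utc_time = datetime.fromtimestamp(ts, tz=timezone.utc)

# Assumed display timezone (UTC+3), inferred from the printed timestamp.
local_time = utc_time.astimezone(timezone(timedelta(hours=3)))

print(utc_time.isoformat(), "month:", utc_time.month)
print(local_time.isoformat(), "month:", local_time.month)
```

Under these assumptions the UTC month is 12 and the displayed local month is 1, so the `month` column and the printed `timestamp` do not contradict each other.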



In [10]:
indexer = Indexer(user_col='user_id', item_col='item_id')
indexer.fit(users=log.select('user_id'),
            items=log.select('item_id'))
log = indexer.transform(df=log)
log.show(2)

+--------+--------+---------+-------------------+-----+
|user_idx|item_idx|relevance|          timestamp|month|
+--------+--------+---------+-------------------+-----+
|    4131|      43|      5.0|2001-01-01 01:12:40|   12|
|    4131|     585|      3.0|2001-01-01 01:35:09|   12|
+--------+--------+---------+-------------------+-----+
only showing top 2 rows



We keep only the first 20 users (`user_idx < 20`) and drop all interactions from month 12, so that month 12 is absent from the training data.

In [11]:
log_20_users = log.where("user_idx < 20 and month != 12")

## class CatFeaturesTransformer()

Transforms the categorical features listed in `cat_cols_list` with one-hot encoding and removes the original columns.
    
Parameters:
* `cat_cols_list` - List of categorical columns
* `alias` - Prefix of the generated column names (default is "ohe")
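
To make the fit/transform behaviour concrete, here is a minimal plain-Python sketch of one-hot encoding with a column prefix. `ToyOneHotEncoder` is a hypothetical stand-in written for illustration; it is not the RePlay implementation and works on lists of dicts rather than Spark dataframes. The column naming `<alias>_<column>_<value>` follows the output shown in this notebook.

```python
from typing import Dict, Iterable, List


class ToyOneHotEncoder:
    """Hypothetical illustration only; not the RePlay class."""

    def __init__(self, column: str, alias: str = "ohe") -> None:
        self.column = column
        self.alias = alias
        self.categories: List = []

    def fit(self, rows: Iterable[Dict]) -> "ToyOneHotEncoder":
        # Remember the distinct values seen in the training rows.
        self.categories = sorted({row[self.column] for row in rows})
        return self

    def transform(self, rows: Iterable[Dict]) -> List[Dict]:
        result = []
        for row in rows:
            # Drop the original column and add one indicator per known value.
            new_row = {k: v for k, v in row.items() if k != self.column}
            for value in self.categories:
                name = f"{self.alias}_{self.column}_{value}"
                new_row[name] = int(row[self.column] == value)
            result.append(new_row)
        return result


train = [{"user_idx": 16, "month": 1}, {"user_idx": 3, "month": 5}]
encoder = ToyOneHotEncoder("month").fit(train)

seen = encoder.transform([{"user_idx": 16, "month": 1}])[0]
unseen = encoder.transform([{"user_idx": 4131, "month": 12}])[0]
print(seen)
print(unseen)
```

Because the set of output columns is fixed at fit time, a value that was never seen during fitting (month 12 here) cannot switch on any indicator and produces a row of zeros.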

In [12]:
from replay.data_preparator import CatFeaturesTransformer
cft = CatFeaturesTransformer(["month"])
cft.fit(log_20_users)

In [13]:
log_trsfrm = cft.transform(log_20_users)

#### Before

In [14]:
log_20_users.show(1, vertical=True)

-RECORD 0------------------------
 user_idx  | 16                  
 item_idx  | 366                 
 relevance | 4.0                 
 timestamp | 2001-01-10 21:07:43 
 month     | 1                   
only showing top 1 row



#### After

In [15]:
log_trsfrm.show(1, vertical=True)

-RECORD 0---------------------------
 user_idx     | 16                  
 item_idx     | 366                 
 relevance    | 4.0                 
 timestamp    | 2001-01-10 21:07:43 
 ohe_month_9  | 0                   
 ohe_month_1  | 1                   
 ohe_month_5  | 0                   
 ohe_month_2  | 0                   
 ohe_month_6  | 0                   
 ohe_month_3  | 0                   
 ohe_month_10 | 0                   
 ohe_month_7  | 0                   
 ohe_month_4  | 0                   
 ohe_month_11 | 0                   
 ohe_month_8  | 0                   
only showing top 1 row



### Processing of cold users and items
If the dataframe passed to `.transform()` contains category values that were not present in the training data, those values are ignored: all encoded columns for such rows are zero.

To show this, add a user to the DataFrame with the value 12 in the "month" attribute. The value was absent in the training data.

In [16]:
user_with_12_month_attribute = log.where("month == 12").limit(1)

user_idx, item_idx = user_with_12_month_attribute.select("user_idx", "item_idx").first()

log_21_users = log_20_users.union(user_with_12_month_attribute)

In [17]:
log_trsfrm_21_users = cft.transform(log_21_users)

As we can see, for the interaction with a month value of 12, all one-hot columns are **0**.

In [18]:
log_trsfrm_21_users.where(f"user_idx == {user_idx} and item_idx == {item_idx}").show(vertical=True)

-RECORD 0---------------------------
 user_idx     | 4131                
 item_idx     | 43                  
 relevance    | 5.0                 
 timestamp    | 2001-01-01 01:12:40 
 ohe_month_9  | 0                   
 ohe_month_1  | 0                   
 ohe_month_5  | 0                   
 ohe_month_2  | 0                   
 ohe_month_6  | 0                   
 ohe_month_3  | 0                   
 ohe_month_10 | 0                   
 ohe_month_7  | 0                   
 ohe_month_4  | 0                   
 ohe_month_11 | 0                   
 ohe_month_8  | 0                   



## class LogStatFeaturesProcessor()

Calculate user and item features based on historical interactions.

**Features can start with:**

1. `u_` - for user

2. `i_` - for item

3. `u_i` - user regarding the items

4. `i_u` - item regarding the users

**Features:**

* `log_num_interact` - logarithm of the number of interactions
* `log_interact_days_count` - logarithm of the number of unique dates with user-item interactions 
* `min_interact_date` - min interaction timestamp
* `max_interact_date` - max interaction timestamp
* `std` - standard deviation of relevance values for a user/item
* `mean` - mean relevance values for a user/item
* `quantile_05` - 0.05 quantile (5th percentile) of relevance values for a user/item
* `quantile_5` - 0.5 quantile (median) of relevance values for a user/item
* `quantile_95` - 0.95 quantile (95th percentile) of relevance values for a user/item
* `history_length_days` - number of days between the first and the last interaction of a user/item
* `last_interaction_gap_days` - number of days between the last interaction of a user/item and the latest interaction in the log
* `mean_log_num_interact` - average logarithm of the number of user/item interactions
* `log_num_interact_diff` - difference in the number of interactions
* `na_u_log_features` - flag, indicating cold user, absent in training log
* `na_i_log_features` - flag, indicating cold item, absent in training log
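
As a rough illustration of several user-level features from the list above, the sketch below computes them for one user in plain Python. The data is made up, and the definitions are assumptions inferred from the sample output in this notebook: the natural logarithm (for example, `4.795790545596741` equals `ln(121)`), day differences counted between calendar dates, and the gap measured against the latest timestamp in the log. The actual RePlay implementation may differ.

```python
import math
from datetime import datetime
from statistics import mean

# Hypothetical interactions of a single user: (timestamp, relevance).
interactions = [
    (datetime(2001, 1, 10, 20, 59), 4.0),
    (datetime(2001, 1, 10, 21, 7), 5.0),
    (datetime(2001, 1, 12, 9, 30), 3.0),
    (datetime(2001, 1, 20, 18, 0), 4.0),
]
# Assumed reference point: the latest timestamp in the whole log.
log_max_date = datetime(2001, 2, 1, 12, 0)

dates = [ts for ts, _ in interactions]
ratings = [rel for _, rel in interactions]
first, last = min(dates), max(dates)

features = {
    # Natural log of the number of interactions (assumption, see lead-in).
    "u_log_num_interact": math.log(len(interactions)),
    # Natural log of the number of distinct calendar dates with interactions.
    "u_log_interact_days_count": math.log(len({ts.date() for ts in dates})),
    "u_min_interact_date": first,
    "u_max_interact_date": last,
    "u_mean": mean(ratings),
    # Differences between calendar dates, ignoring the time of day.
    "u_history_length_days": (last.date() - first.date()).days,
    "u_last_interaction_gap_days": (log_max_date.date() - last.date()).days,
}

for name, value in features.items():
    print(name, value)
```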


<br>
<br>
* `abnormality`:
  $$Abnormality(u) = \frac{\Sigma_{r \in R_u} | n_{u,r} - \overline{n_{r}} | }{\| R_u \|}$$ <br>
  $n_{u,r}$ is the rating that user $u$ assigned to resource $r$<br>$\overline{n_{r}}$ is the average rating of $r$<br>$R_u$ is the set of resources rated by $u$ and $\|R_u\|$ is their number.<br><br>

* `abnormalityCR`:
  $$Abnormality_{CR}(u) = \frac{\Sigma_{r \in R_u} (( n_{u,r} - \overline{n_{r}} ) * contr(r))^2 }{\| R_u \|}$$ <br>
  
  $$contr(r) = 1 - \frac{\sigma_r - \sigma_{min} }{\sigma_{max} - \sigma_{min}}$$<br>
  $\sigma_r$ is the standard deviation of the ratings associated with resource $r$<br>$\sigma_{min}$ and $\sigma_{max}$ are, respectively, the smallest and the largest standard deviation values among resources.
  
[More about abnormality, abnormalityCR](https://hal.inria.fr/hal-01254172/document)
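
The two formulas can be checked on a toy example. The sketch below is a direct plain-Python transcription of them and is not the RePlay code. The ratings are made up, the population standard deviation is an assumption (the text does not say which estimator is used), and returning `contr(r) = 1` when all resources have the same spread is an assumed convention to avoid division by zero.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical ratings: (user, resource, rating).
ratings = [
    ("u1", "a", 5.0), ("u2", "a", 3.0), ("u3", "a", 4.0),
    ("u1", "b", 1.0), ("u2", "b", 5.0), ("u3", "b", 3.0),
    ("u1", "c", 4.0), ("u2", "c", 4.0),
]

by_item = defaultdict(list)
for _, item, rating in ratings:
    by_item[item].append(rating)

item_mean = {item: mean(vals) for item, vals in by_item.items()}
# Assumption: population standard deviation per resource.
item_std = {item: pstdev(vals) for item, vals in by_item.items()}
std_min, std_max = min(item_std.values()), max(item_std.values())


def contr(item):
    # contr(r) = 1 - (sigma_r - sigma_min) / (sigma_max - sigma_min)
    if std_max == std_min:
        return 1.0  # assumed convention when all spreads are equal
    return 1.0 - (item_std[item] - std_min) / (std_max - std_min)


def abnormality(user):
    # Mean absolute deviation of the user's ratings from the resource means.
    rated = [(i, r) for u, i, r in ratings if u == user]
    return sum(abs(r - item_mean[i]) for i, r in rated) / len(rated)


def abnormality_cr(user):
    # Squared deviations, down-weighted for resources with a high spread.
    rated = [(i, r) for u, i, r in ratings if u == user]
    return sum(((r - item_mean[i]) * contr(i)) ** 2 for i, r in rated) / len(rated)


for user in ("u1", "u2", "u3"):
    print(user, abnormality(user), abnormality_cr(user))
```

In this toy data resource `b` has the largest spread, so its `contr` is 0 and disagreement on it does not contribute to `abnormalityCR`, while it still counts fully in `abnormality`.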

In [19]:
from replay.history_based_fp import LogStatFeaturesProcessor
lf = LogStatFeaturesProcessor()
lf.fit(log_20_users)

In [20]:
log_trsfrm = lf.transform(log_20_users)

#### Before

In [21]:
log_20_users.show(1, vertical=True)

-RECORD 0------------------------
 user_idx  | 16                  
 item_idx  | 366                 
 relevance | 4.0                 
 timestamp | 2001-01-10 21:07:43 
 month     | 1                   
only showing top 1 row



#### After

In [23]:
log_trsfrm.show(1, vertical=True)

-RECORD 0-------------------------------------------
 item_idx                    | 366                  
 user_idx                    | 16                   
 relevance                   | 4.0                  
 timestamp                   | 2001-01-10 21:07:43  
 month                       | 1                    
 u_log_num_interact          | 6.736966958001855    
 u_log_interact_days_count   | 4.795790545596741    
 u_min_interact_date         | 2001-01-10 20:59:24  
 u_max_interact_date         | 2003-02-27 15:31:39  
 u_std                       | 0.9460530559956203   
 u_mean                      | 3.561091340450771    
 u_quantile_05               | 2.0                  
 u_quantile_5                | 4.0                  
 u_quantile_95               | 5.0                  
 u_history_length_days       | 778                  
 u_last_interaction_gap_days | 0                    
 abnormality                 | 0.5423858158630311   
 abnormalityCR               | 0.1907112888748

### Processing of cold users and items

There are 3 possible scenarios:

1. Cold user - a user who was not present in the training log.
    All item statistics will be present, but the user statistics will be `0`.<br>The flag `na_u_log_features` will be `1`.
<br>
<br>
2. Cold item - an item that was not present in the training log.
    All user statistics will be present, but the item statistics will be `0`.<br>The flag `na_i_log_features` will be `1`.
<br>
<br>
3. Cold user and cold item - neither the user nor the item was present in the training log.
    All statistics will be `0`.<br>The flags `na_u_log_features` and `na_i_log_features` will both be `1`.
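
The three scenarios can be sketched in plain Python as a lookup with zero defaults plus a flag. The function `add_log_features`, the dictionaries, and the values in them are hypothetical and only illustrate the described behaviour; the real processor works on Spark dataframes and produces many more columns.

```python
# Hypothetical statistics learned at fit time, keyed by index.
user_stats = {16: {"u_mean": 3.56}, 4: {"u_mean": 3.05}}
item_stats = {366: {"i_mean": 3.9}}

# Values used when an id was not seen at fit time.
USER_DEFAULTS = {"u_mean": 0.0}
ITEM_DEFAULTS = {"i_mean": 0.0}


def add_log_features(row):
    """Attach fitted statistics; fill zeros and raise a flag for cold ids."""
    user_known = row["user_idx"] in user_stats
    item_known = row["item_idx"] in item_stats
    features = dict(row)
    features.update(user_stats.get(row["user_idx"], USER_DEFAULTS))
    features.update(item_stats.get(row["item_idx"], ITEM_DEFAULTS))
    features["na_u_log_features"] = int(not user_known)
    features["na_i_log_features"] = int(not item_known)
    return features


warm = add_log_features({"user_idx": 16, "item_idx": 366})
cold_user = add_log_features({"user_idx": 38, "item_idx": 366})
cold_item = add_log_features({"user_idx": 4, "item_idx": 3078})
cold_both = add_log_features({"user_idx": 1335, "item_idx": 3078})

for row in (warm, cold_user, cold_item, cold_both):
    print(row)
```

The flags let a downstream model distinguish a genuine zero statistic from a zero that only stands in for missing history.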

#### Add cold user

In [27]:
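# Left-join the full log with the training subset on the user index.
# Rows without a match (null `i_idx`) belong to users absent from `log_20_users`.
user_cold = (
    log.join(
        log_20_users.select(sf.col("user_idx").alias("u_idx"), sf.col("item_idx").alias("i_idx")),
        on=sf.col("u_idx") == sf.col("user_idx"),
        how="left"
    )
    .filter(sf.col("i_idx").isNull())
    .select(log.columns)
    .limit(1)
)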

user_idx, item_idx = user_cold.select("user_idx", "item_idx").first()

log_21_users  = log_20_users.union(user_cold)
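
The cell above picks an interaction whose user never appears in `log_20_users` by left-joining and keeping only the unmatched rows, which is the logic of an anti-join (PySpark also accepts `how="left_anti"` in `DataFrame.join`). The selection logic can be sketched in plain Python with sets; the pairs below are made up for illustration.

```python
# Hypothetical full log and training subset as (user_idx, item_idx) pairs.
full_log = [(16, 366), (4, 3078), (38, 2001), (1335, 3078)]
train_log = [(16, 366), (4, 12)]

train_users = {user for user, _ in train_log}
train_items = {item for _, item in train_log}

# Rows whose user is absent from the training subset (anti-join on user).
cold_user_rows = [row for row in full_log if row[0] not in train_users]
# Rows whose item is absent from the training subset (anti-join on item).
cold_item_rows = [row for row in full_log if row[1] not in train_items]
# Rows where both the user and the item are unseen.
cold_both_rows = [row for row in cold_item_rows if row[0] not in train_users]

print(cold_user_rows)
print(cold_item_rows)
print(cold_both_rows)
```

Note that a row selected as a cold-user row may also contain a cold item, as with `(38, 2001)` in this toy data, so the item-level features of such a row are not guaranteed to be present.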

In [28]:
log_trsfrm = lf.transform(log_21_users)

In [29]:
log_trsfrm.where(f"user_idx == {user_idx}").show(1, vertical=True)



-RECORD 0------------------------------------------
 item_idx                    | 2001                
 user_idx                    | 38                  
 relevance                   | 3.0                 
 timestamp                   | 2000-07-12 08:05:23 
 month                       | 7                   
 u_log_num_interact          | 0.0                 
 u_log_interact_days_count   | 0.0                 
 u_min_interact_date         | 1970-01-01 03:00:00 
 u_max_interact_date         | 1970-01-01 03:00:00 
 u_std                       | 0.0                 
 u_mean                      | 0.0                 
 u_quantile_05               | 0.0                 
 u_quantile_5                | 0.0                 
 u_quantile_95               | 0.0                 
 u_history_length_days       | 0                   
 u_last_interaction_gap_days | 0                   
 abnormality                 | 0.0                 
 abnormalityCR               | 0.0                 
 u_mean_i_lo



#### Add cold item

In [30]:
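# Left-join on the item index; unmatched rows belong to items absent from `log_20_users`.
# The `user_idx < 20` filter restricts the row to the first 20 users.
item_cold = (
    log.join(
        log_20_users.select(sf.col("user_idx").alias("u_idx"), sf.col("item_idx").alias("i_idx")),
        on=sf.col("i_idx") == sf.col("item_idx"),
        how="left"
    )
    .filter(sf.col("i_idx").isNull())
    .select(log.columns)
    .filter("user_idx < 20")
    .limit(1)
)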

user_idx, item_idx = item_cold.select("user_idx", "item_idx").first()

log_21_users  = log_20_users.union(item_cold)

In [31]:
log_trsfrm = lf.transform(log_21_users)

In [32]:
log_trsfrm.where(f"item_idx == {item_idx}").show(1, vertical=True)

-RECORD 0------------------------------------------
 item_idx                    | 3078                
 user_idx                    | 4                   
 relevance                   | 2.0                 
 timestamp                   | 2000-12-07 03:23:32 
 month                       | 12                  
 u_log_num_interact          | 6.911747300251674   
 u_log_interact_days_count   | 2.6390573296152584  
 u_min_interact_date         | 2000-11-22 03:47:32 
 u_max_interact_date         | 2002-05-13 02:30:58 
 u_std                       | 0.8582482337323498  
 u_mean                      | 3.0468127490039842  
 u_quantile_05               | 2.0                 
 u_quantile_5                | 3.0                 
 u_quantile_95               | 4.0                 
 u_history_length_days       | 537                 
 u_last_interaction_gap_days | 290                 
 abnormality                 | 0.696032092299538   
 abnormalityCR               | 0.30599330777137257 
 u_mean_i_lo

#### Add cold item and user

In [33]:
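# Same join on the item index as for the cold item.
# The `user_idx > 20` filter additionally selects a user absent from `log_20_users`.
item_user_cold = (
    log.join(
        log_20_users.select(sf.col("user_idx").alias("u_idx"), sf.col("item_idx").alias("i_idx")),
        on=sf.col("i_idx") == sf.col("item_idx"),
        how="left"
    )
    .filter(sf.col("i_idx").isNull())
    .select(log.columns)
    .filter("user_idx > 20")
    .limit(1)
)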

user_idx, item_idx = item_user_cold.select("user_idx", "item_idx").first()

log_21_users  = log_20_users.union(item_user_cold)

In [34]:
log_trsfrm = lf.transform(log_21_users)

In [36]:
log_trsfrm.where(f"item_idx == {item_idx} and user_idx == {user_idx}").show(1, vertical=True)

-RECORD 0------------------------------------------
 item_idx                    | 3078                
 user_idx                    | 1335                
 relevance                   | 1.0                 
 timestamp                   | 2000-12-02 02:41:14 
 month                       | 12                  
 u_log_num_interact          | 0.0                 
 u_log_interact_days_count   | 0.0                 
 u_min_interact_date         | 1970-01-01 03:00:00 
 u_max_interact_date         | 1970-01-01 03:00:00 
 u_std                       | 0.0                 
 u_mean                      | 0.0                 
 u_quantile_05               | 0.0                 
 u_quantile_5                | 0.0                 
 u_quantile_95               | 0.0                 
 u_history_length_days       | 0                   
 u_last_interaction_gap_days | 0                   
 abnormality                 | 0.0                 
 abnormalityCR               | 0.0                 
 u_mean_i_lo