# Filtering log data with RePlay

This notebook demonstrates the RePlay functions for filtering an interaction log by interaction count, rating, and time.

The following functions are available:

* filter_by_min_count
* filter_out_low_ratings
* take_num_user_interactions
* take_num_days_of_user_hist
* take_time_period
* take_num_days_of_global_hist


In [1]:
import pandas as pd
from rs_datasets import MovieLens

from replay.data_preparator import DataPreparator, Indexer
from replay.session_handler import State
from replay.utils import get_log_info

from replay.filters import (
    filter_by_min_count,
    filter_out_low_ratings,
    take_num_user_interactions,
    take_num_days_of_user_hist,
    take_time_period,
    take_num_days_of_global_hist
)

In [3]:
spark = State().session
spark.sparkContext.setLogLevel('ERROR')
spark

## Get started

Load the **MovieLens** ratings (read here from a local copy, `./data/ml1m_ratings.dat`) and preprocess them with `DataPreparator` and `Indexer`.

In [4]:
data = pd.read_csv(
    './data/ml1m_ratings.dat',
    sep="\t",
    names=["userId", "itemId", "relevance", "timestamp"],
    engine='python',
)

In [5]:
preparator = DataPreparator()
log = preparator.transform(
    columns_mapping={
        'user_id': 'userId',
        'item_id': 'itemId',
        'relevance': 'relevance',
        'timestamp': 'timestamp'
    }, 
    data=data
)

26-Dec-22 15:57:37, replay, INFO: Columns with ids of users or items are present in mapping. The dataframe will be treated as an interactions log.


In [6]:
indexer = Indexer(user_col='user_id', item_col='item_id')
indexer.fit(users=log.select('user_id'), items=log.select('item_id'))
log_replay = indexer.transform(df=log)

In [7]:
log_replay.show(4)

+--------+--------+---------+-------------------+
|user_idx|item_idx|relevance|          timestamp|
+--------+--------+---------+-------------------+
|    4131|      43|      5.0|2001-01-01 01:12:40|
|    4131|     585|      3.0|2001-01-01 01:35:09|
|    4131|     461|      3.0|2001-01-01 01:32:48|
|    4131|     105|      4.0|2001-01-01 01:04:35|
+--------+--------+---------+-------------------+
only showing top 4 rows
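
Judging by the output above, the `Indexer` replaces the raw ids with integer `user_idx` and `item_idx` columns. The idea can be sketched in plain Python as follows. This is only an illustration: the helper `build_index` is hypothetical, and assigning indices in order of first appearance is an assumption, so RePlay's actual ordering may differ.

```python
# Raw user ids from a toy log; illustrative values only.
raw_user_ids = [4131, 17, 4131, 902, 17]


def build_index(ids):
    """Map each distinct id to an integer index, in order of first appearance.

    Hypothetical helper for illustration, not part of RePlay.
    """
    mapping = {}
    for raw_id in ids:
        # setdefault only inserts when the id has not been seen before.
        mapping.setdefault(raw_id, len(mapping))
    return mapping


user_mapping = build_index(raw_user_ids)
user_idx = [user_mapping[u] for u in raw_user_ids]
print(user_mapping)
print(user_idx)
```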



## function filter_by_min_count()

Keeps only the groups (users or items) that have at least a minimum number of interactions.

Parameters:

* `data_frame` - DataFrame of interactions between users and items

* `num_entries` - minimum number of interactions a group must have to be kept; the threshold is inclusive (`>=`)

* `group_by` - column to group by when counting interactions, for example `user_idx` or `item_idx`
    

![title](data/img/filter_by_min_count.jpg)

Let's keep only the users who have `>= 10` interactions with items.

In [8]:
log_filter = filter_by_min_count(log_replay, 
                                 num_entries=10,
                                 group_by="user_idx")

26-Dec-22 15:57:51, replay, INFO: current threshold removes 0.0% of data


Let's keep only the items that `>= 100` users interacted with.

In [9]:
log_filter = filter_by_min_count(log_replay, 
                                 num_entries=100,
                                 group_by="item_idx")

26-Dec-22 15:57:52, replay, INFO: current threshold removes 0.05797188387626986% of data
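
To make the semantics of the inclusive threshold concrete, here is a minimal plain-Python sketch on a toy log. The helper `min_count_sketch` and its `key_index` argument are hypothetical and only mimic the behavior described above; this is not how RePlay implements the filter on Spark.

```python
from collections import Counter

# Toy interaction log of (user_idx, item_idx) pairs; illustrative data only.
toy_log = [
    (0, 10), (0, 11), (0, 12),
    (1, 10),
    (2, 11), (2, 12),
]


def min_count_sketch(rows, num_entries, key_index):
    """Keep rows whose group has at least `num_entries` rows (inclusive).

    `key_index` selects the grouping field: 0 for user, 1 for item.
    """
    counts = Counter(row[key_index] for row in rows)
    return [row for row in rows if counts[row[key_index]] >= num_entries]


# Group by user: user 1 has a single interaction and is dropped.
kept_by_user = min_count_sketch(toy_log, num_entries=2, key_index=0)
print(kept_by_user)

# Group by item: every item has exactly 2 interactions, so nothing is dropped.
kept_by_item = min_count_sketch(toy_log, num_entries=2, key_index=1)
print(len(kept_by_item))
```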


## function filter_out_low_ratings()

Removes interactions whose relevance is below a minimum threshold.

Parameters:

* `data_frame` - DataFrame of interactions between users and items

* `value` - minimum relevance to keep; the threshold is inclusive (`>=`)

* `rating_column` - name of the relevance column
    

![title](img/filter_by_min_count.jpg)

Let's keep only the interactions with relevance `5` (the highest MovieLens rating, so `>= 5` selects exactly these).

In [10]:
log_filter = filter_out_low_ratings(log_replay, 
                                    value=5)
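
The same rule can be written as a one-line comparison in plain Python. The sketch below uses a hypothetical helper, `low_ratings_sketch`, on made-up rows; it only illustrates the inclusive threshold described above.

```python
# Toy log as dictionaries; illustrative data only.
toy_log = [
    {"user_idx": 0, "item_idx": 10, "relevance": 5.0},
    {"user_idx": 0, "item_idx": 11, "relevance": 3.0},
    {"user_idx": 1, "item_idx": 10, "relevance": 4.0},
    {"user_idx": 1, "item_idx": 12, "relevance": 5.0},
]


def low_ratings_sketch(rows, value, rating_column="relevance"):
    """Keep rows whose rating is greater than or equal to `value`."""
    return [row for row in rows if row[rating_column] >= value]


top_rated = low_ratings_sketch(toy_log, value=5)
print([(r["user_idx"], r["item_idx"]) for r in top_rated])
```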

## function take_num_user_interactions()

Keeps a fixed number of interactions for each user. Interactions are ordered by date.

Parameters:


* `log` - DataFrame of interactions between users and items

* `num_interactions` - number of interactions to keep for each user

* `first` - if `True`, interactions are counted from the beginning of each user's history; if `False`, they are counted from the end

* `date_col` - date column name

* `user_col` - user id column name

* `item_col` - item id column name


![title](img/take_num_user_interactions.jpg)

Let's keep the last `2` interactions for each user.

In [11]:
log_filter = take_num_user_interactions(log_replay, 
                                        num_interactions=2,
                                        first=False)
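
As an illustration of how `first` changes the result, here is a plain-Python sketch that sorts each user's interactions by date and slices from either end. The helper `num_interactions_sketch` is hypothetical, and tie-breaking between equal timestamps is not modeled.

```python
from collections import defaultdict
from datetime import datetime

# Toy log of (user_idx, item_idx, timestamp); illustrative data only.
toy_log = [
    (0, 10, datetime(2001, 1, 1)),
    (0, 11, datetime(2001, 1, 3)),
    (0, 12, datetime(2001, 1, 2)),
    (1, 10, datetime(2001, 1, 5)),
]


def num_interactions_sketch(rows, num_interactions, first=True):
    """Keep the first or last `num_interactions` rows of each user, by date."""
    if num_interactions <= 0:
        return []
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)
    kept = []
    for user_rows in by_user.values():
        ordered = sorted(user_rows, key=lambda r: r[2])
        kept.extend(ordered[:num_interactions] if first
                    else ordered[-num_interactions:])
    return kept


last_two = num_interactions_sketch(toy_log, num_interactions=2, first=False)
print([(u, i) for u, i, _ in last_two])
```

A user with fewer interactions than `num_interactions` (user 1 in the toy log) simply keeps everything they have.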

## function take_num_days_of_user_hist()

Keeps the interactions that fall within a time window of each user's own history.

Parameters:

* `log` - DataFrame of interactions between users and items

* `days` - length of the time window in days

* `first` - if `True`, the window starts at the user's first interaction; if `False`, the window ends at the user's last interaction

* `date_col` - date column name

* `user_col` - user id column name

![title](img/take_num_days_of_user_hist.jpg)

Let's keep the last week (`7` days) of each user's interactions.

In [12]:
log_filter = take_num_days_of_user_hist(log_replay, 
                                        days=7,
                                        first=False)
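
The per-user window can be sketched in plain Python as below. The helper `user_hist_days_sketch` is hypothetical, and the strict comparisons at the window boundary are an assumption; RePlay may treat an interaction exactly on the boundary differently.

```python
from datetime import datetime, timedelta

# Toy log of (user_idx, item_idx, timestamp); illustrative data only.
toy_log = [
    (0, 10, datetime(2001, 1, 1)),
    (0, 11, datetime(2001, 1, 5)),
    (0, 12, datetime(2001, 1, 20)),
    (1, 10, datetime(2001, 2, 1)),
    (1, 11, datetime(2001, 2, 3)),
]


def user_hist_days_sketch(rows, days, first=True):
    """Keep rows inside a `days`-long window of each user's own history."""
    window = timedelta(days=days)
    # Earliest and latest timestamp per user.
    bounds = {}
    for user, _, ts in rows:
        earliest, latest = bounds.get(user, (ts, ts))
        bounds[user] = (min(earliest, ts), max(latest, ts))
    kept = []
    for row in rows:
        user, _, ts = row
        earliest, latest = bounds[user]
        # Boundary handling (strict inequality) is an assumption.
        inside = ts < earliest + window if first else ts > latest - window
        if inside:
            kept.append(row)
    return kept


last_week = user_hist_days_sketch(toy_log, days=7, first=False)
print([(u, i) for u, i, _ in last_week])
```

Because the window is anchored to each user separately, user 0 and user 1 keep interactions from different calendar weeks.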

## function take_num_days_of_global_hist()

Keeps the interactions that fall within a time window of the whole log, shared by all users.

Parameters:

* `log` - DataFrame of interactions between users and items

* `duration_days` - length of the time window in days

* `first` - if `True`, the window starts at the first interaction among all users; if `False`, the window ends at the last interaction among all users

* `date_column` - date column name

![title](img/take_num_days_of_global_hist.jpg)

Let's keep the last `2 weeks` of interactions.

In [14]:
log_filter = take_num_days_of_global_hist(log_replay, 
                                          duration_days=14,
                                          first=False)
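
In contrast to the per-user variant, the window here is anchored to the earliest or latest timestamp of the entire log. A plain-Python sketch, with a hypothetical helper `global_hist_days_sketch` and the boundary comparison again chosen as an assumption:

```python
from datetime import datetime, timedelta

# Toy log of (user_idx, item_idx, timestamp); illustrative data only.
toy_log = [
    (0, 10, datetime(2001, 1, 1)),
    (0, 11, datetime(2001, 1, 10)),
    (1, 10, datetime(2001, 1, 25)),
    (1, 12, datetime(2001, 2, 1)),
    (2, 11, datetime(2001, 2, 3)),
]


def global_hist_days_sketch(rows, duration_days, first=True):
    """Keep rows inside a window anchored to the whole log's first/last date."""
    if not rows:
        return []
    window = timedelta(days=duration_days)
    earliest = min(ts for _, _, ts in rows)
    latest = max(ts for _, _, ts in rows)
    if first:
        # Boundary handling (strict inequality) is an assumption.
        return [row for row in rows if row[2] < earliest + window]
    return [row for row in rows if row[2] > latest - window]


last_two_weeks = global_hist_days_sketch(toy_log, duration_days=14, first=False)
print([(u, i) for u, i, _ in last_two_weeks])
```

Users whose whole history lies outside the global window (user 0 in the toy log) disappear from the result entirely.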

## function take_time_period()

Keeps the interactions that fall within a given time period.

Parameters:

* `log` - DataFrame of interactions between users and items

* `start_date` - beginning of the time period

* `end_date` - end of the time period

* `date_column` - date column name

![title](img/take_time_period.jpg)

Let's keep the data for March 2001 only.

In [15]:
log_filter = take_time_period(log_replay, 
                              start_date="2001-03-01",
                              end_date="2001-04-01")
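
The example above passes the first day of the following month as `end_date`, which suggests a half-open interval: start included, end excluded. The plain-Python sketch below makes that reading explicit. Both the helper `time_period_sketch` and the half-open boundary are assumptions for illustration, not a statement of RePlay's implementation.

```python
from datetime import datetime

# Toy log of (user_idx, item_idx, timestamp); illustrative data only.
toy_log = [
    (0, 10, datetime(2001, 2, 28, 23, 59)),
    (0, 11, datetime(2001, 3, 1, 0, 0)),
    (1, 10, datetime(2001, 3, 31, 23, 59, 59)),
    (1, 12, datetime(2001, 4, 1, 0, 0)),
]


def time_period_sketch(rows, start_date, end_date):
    """Keep rows with start_date <= timestamp < end_date (assumed half-open)."""
    start = datetime.fromisoformat(start_date)
    end = datetime.fromisoformat(end_date)
    return [row for row in rows if start <= row[2] < end]


march_only = time_period_sketch(toy_log, "2001-03-01", "2001-04-01")
print([(u, i) for u, i, _ in march_only])
```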