In [1]:
# run only once
import os
os.chdir('../')

# Creating a SeqDataset

The main class used to create a sequential dataset is `SeqDataset` from `dataset.preprocessing`.

In [2]:
from dataset.preprocessing import SeqDataset

When instantiating the dataset, we have to set the size (in MB) of the chunks that dask will use.

In [3]:
data = SeqDataset(chunksize=120)

## Interactions
The fundamental and mandatory part of `SeqDataset` is the interactions. The interaction file(s) contain interactions between a user and an item and can be loaded using the `load_interactions` function.

Currently, only interaction file(s) in jsonl (JSON Lines) format are supported.
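For reference, a jsonl file stores one JSON object per line. The sketch below parses a two-line sample with the standard library only. The keys mirror the column names used in the next cell and the values are taken from the `head()` output shown further down; how `load_interactions` reads files internally is not shown here and may differ.

```python
import io
import json

# Two sample records in JSON Lines form: one JSON object per line.
sample = io.StringIO(
    '{"asin": "B00YQ6X8EO", "user_id": "AGKHLEW2SOWHNMFQIJGBECAF7INQ", "timestamp": 1588687728923}\n'
    '{"asin": "B081TJ8YS3", "user_id": "AGKHLEW2SOWHNMFQIJGBECAF7INQ", "timestamp": 1588615855070}\n'
)

# Each non-empty line is parsed independently of the others.
records = [json.loads(line) for line in sample if line.strip()]
print(len(records), records[0]["asin"])
```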

In [4]:
# File path of interactions
path = './data/amazon_reviews_2023/reviews/All_Beauty.jsonl'
# Column name indicating item id
c_iid = 'asin'
# Column name indicating user id
c_uid = 'user_id'
# Column name indicating timestamp of interaction in unix format
c_timestamp = 'timestamp'
# Column name indicating score assigned to item by user (Optional)
c_score = 'average_rating'


data.load_interactions(
    path=path,
    c_uid=c_uid,
    c_iid=c_iid,
    c_timestamp=c_timestamp,
    c_score=c_score,
)

2024-11-06 11:40:31,605 - preprocessing.py:104 - INFO - Total size of interactions: 311 MB. Repartitioning interactions to 2 partitions
2024-11-06 11:40:36,844 - preprocessing.py:124 - INFO - Dropping duplicates in interactions
2024-11-06 11:40:38,138 - preprocessing.py:134 - INFO - Interactions loaded
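The partition counts in the loading logs are consistent with dividing the total size by `chunksize` and rounding down: 311 MB with 120 MB chunks gives 2 partitions, and the 203 MB metadata file loaded later in this notebook gives 1. A minimal sketch of that arithmetic follows. `estimate_partitions` is a hypothetical helper, and the rule used inside `SeqDataset` may differ.

```python
def estimate_partitions(total_mb: float, chunksize_mb: int) -> int:
    """Hypothetical helper: number of whole chunks that fit, but at least one."""
    return max(1, int(total_mb // chunksize_mb))

print(estimate_partitions(311, 120))  # 2, matching the interactions log
print(estimate_partitions(203, 120))  # 1, matching the metadata log
```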


In [5]:
data.interactions.head()

Unnamed: 0,item_id,user_id,timestamp
0,B00YQ6X8EO,AGKHLEW2SOWHNMFQIJGBECAF7INQ,1588687728923
1,B081TJ8YS3,AGKHLEW2SOWHNMFQIJGBECAF7INQ,1588615855070
2,B07PNNCSP9,AE74DYR3QUGVPZJ3P7RFWBGIX7XQ,1589665266052
3,B09JS339BZ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,1643393630220
4,B08BZ63GMJ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,1609322563534


## Metadata
We can also load and add metadata to items, if they are available, using the `load_metadata` function. The metadata file(s) contain information regarding an item.

The user can select which columns to consider by passing a list with the names of the columns (`metadata_cols` parameter). All the information will be concatenated into a single string, for example:

```
in: metadata_cols = ['title', 'brand', 'category']
out: 'title: Foobar ~ brand: Foo ~ category: Bar'
```
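The concatenation can be sketched in plain Python as follows. `build_features` is a hypothetical helper, not part of the library. The ` ~ ` default separator follows the `features` values visible in the outputs of this notebook, and skipping missing values is an assumption.

```python
def build_features(item: dict, cols: list, sep: str = " ~ ") -> str:
    """Join 'column: value' pairs for the selected columns, skipping missing values."""
    parts = [f"{col}: {item[col]}" for col in cols if item.get(col) is not None]
    return sep.join(parts)

item = {"title": "Foobar", "brand": "Foo", "category": "Bar", "price": 9.99}
print(build_features(item, ["title", "brand", "category"]))
# title: Foobar ~ brand: Foo ~ category: Bar
```

Columns that are not listed (here `price`) are ignored.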

Currently we support only metadata file(s) in the form of jsonl.



In [6]:
# File path of metadata
path = './data/amazon_reviews_2023/meta/meta_All_Beauty.jsonl'
# Column name indicating item id
c_iid = 'parent_asin'
# List of column names to be loaded from metadata
metadata_cols = ['title', 'store', 'main_category']
# flag on whether to drop rows with missing values in ALL of the specified columns
dropna = True

data.load_metadata(
    path,
    c_iid=c_iid,
    metadata_cols=metadata_cols,
    dropna=dropna,
)

2024-11-06 11:40:42,497 - preprocessing.py:164 - INFO - Total size of metadata: 203MB. Repartitioning metadata to 1 partitions
2024-11-06 11:40:43,343 - preprocessing.py:172 - INFO - Dropping duplicates in metadata
2024-11-06 11:40:45,161 - preprocessing.py:227 - INFO - Metadata loaded and mapped.


In [7]:
data.interactions.head()

Unnamed: 0,item_id,user_id,timestamp,features
0,B00YQ6X8EO,AGKHLEW2SOWHNMFQIJGBECAF7INQ,1588687728923,title: Herbivore - Natural Sea Mist Texturizin...
1,B081TJ8YS3,AGKHLEW2SOWHNMFQIJGBECAF7INQ,1588615855070,title: All Natural Vegan Dry Shampoo Powder - ...
2,B09JS339BZ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,1643393630220,title: muaowig Ombre Body Wave Bundles 1B Grey...
3,B08BZ63GMJ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,1609322563534,title: Yinhua Electric Nail Drill Kit Portable...
4,B00R8DXL44,AGMJ3EMDVL6OWBJF7CA5RGJLXN5A,1598567408138,"title: China Glaze Nail Polish, Wanderlust 138..."


## Loading multiple files
We can select multiple files to load for both `load_interactions` and `load_metadata` functions. However, the files should share a common schema (i.e. same column names).

In [None]:
# The 2018 files use different column names from the 2023 files loaded above
data_multiple_files = SeqDataset(chunksize=120)
data_multiple_files.load_interactions(
    path=[
        './data/amazon_reviews_2018/reviews/All_Beauty.json',
        './data/amazon_reviews_2018/reviews/Video_Games.json',
    ],
    c_uid='reviewerID',
    c_iid='asin',
    c_timestamp='unixReviewTime',
    c_score='overall',
)

data_multiple_files.load_metadata(
    path=[
        './data/amazon_reviews_2018/meta/meta_All_Beauty.json',
        './data/amazon_reviews_2018/meta/meta_Video_Games.json',
    ],
    c_iid='asin',
    metadata_cols=['title', 'brand', 'category'],
    dropna=dropna,
)

## Applying k-core filtering

`k-core` filtering is a common preprocessing method that aims to reduce the sparsity of the dataset by removing users and items that appear fewer than `k` times.

This can be done using the `kcore_filtering` function. The function returns `True` if the filtering was applied successfully; otherwise it returns `False`.
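To illustrate the idea, here is a minimal pure-Python sketch of iterative k-core filtering. It is not the library's implementation (the logs below mention a sparse matrix), and `kcore_filter` is a hypothetical name. Removing an item can push a user below `k` and vice versa, so the filter repeats until nothing changes.

```python
from collections import Counter

def kcore_filter(interactions, k):
    """Repeatedly drop users and items with fewer than k interactions.

    `interactions` is a list of (user, item) pairs; returns the surviving pairs.
    """
    current = list(interactions)
    while True:
        user_counts = Counter(u for u, _ in current)
        item_counts = Counter(i for _, i in current)
        kept = [
            (u, i) for u, i in current
            if user_counts[u] >= k and item_counts[i] >= k
        ]
        if len(kept) == len(current):  # nothing removed: fixed point reached
            return kept
        current = kept

pairs = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a"), ("u3", "c")]
print(kcore_filter(pairs, k=2))
# [('u1', 'a'), ('u1', 'b'), ('u2', 'a'), ('u2', 'b')]
```

In this toy input, item `c` is removed first, which leaves `u3` with a single interaction, so `u3` is removed in the next pass.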

In [8]:
print(f"Number of interactions: {data.interactions.shape[0].compute():,}")

Number of interactions: 633,693


In [9]:
# set k threshold
kcore = 5
kcore_applied = data.kcore_filtering(kcore=kcore)

print(f"k-core filtering applied: {kcore_applied}. Number of interactions after filtering: {data.interactions.shape[0].compute():,}")


2024-11-06 11:41:19,827 - preprocessing.py:298 - INFO - Applying k-core filtering.
2024-11-06 11:41:42,314 - preprocessing.py:310 - INFO - Counted users, items and user-item interactions.
2024-11-06 11:41:42,497 - preprocessing.py:329 - INFO - Filtered out users and items with less than 5 appearances. Creating sparse matrix...
2024-11-06 11:41:43,595 - preprocessing.py:609 - INFO - Iteration 5 - Users: 284 Items: 424
2024-11-06 11:41:43,598 - preprocessing.py:609 - INFO - Iteration 10 - Users: 229 Items: 318
2024-11-06 11:41:43,600 - preprocessing.py:609 - INFO - Iteration 15 - Users: 218 Items: 313
2024-11-06 11:41:43,601 - preprocessing.py:355 - INFO - Filtered matrix computed. Applying filtering to interactions...
2024-11-06 11:42:05,016 - preprocessing.py:375 - INFO - K-core filtering applied


k-core filtering applied: True. Number of interactions after filtering: 2,149


In [10]:
data.interactions.head()

Unnamed: 0,item_id,user_id,timestamp,features
0,B08P2DZB4X,AFSKPY37N3C43SOI5IEXEK5JSIYA,1627391044559,title: NIRA Skincare Laser & Serum Bundle - In...
1,B07RBSLNFR,AFSKPY37N3C43SOI5IEXEK5JSIYA,1621184430697,title: OGANA CELL Peptide Concentrating Amazin...
2,B07SLFWZKN,AFSKPY37N3C43SOI5IEXEK5JSIYA,1619737501209,title: Keratin Secrets Do It Yourself Home Ker...
3,B08JTNQFZY,AFSKPY37N3C43SOI5IEXEK5JSIYA,1617904219785,title: GAINWELL ~ store: GAINWELL ~ main_categ...
4,B07KG1TWP5,AFSKPY37N3C43SOI5IEXEK5JSIYA,1596473351088,title: Organic Bamboo Cotton Ear Swabs by Bali...


## Encoding users and items

If needed we can encode items and users as integers by using the `encode_entries` function.

In [11]:
data.encode_entries()

2024-11-06 11:43:51,194 - preprocessing.py:252 - INFO - Number of users: 218
2024-11-06 11:43:51,199 - preprocessing.py:260 - INFO - Users encoded
2024-11-06 11:43:51,200 - preprocessing.py:271 - INFO - Number of items: 312
2024-11-06 11:43:51,204 - preprocessing.py:279 - INFO - Items encoded


In [12]:
data.interactions.head()

Unnamed: 0,item_id,user_id,timestamp,features,user_id_encoded,item_id_encoded
0,B08P2DZB4X,AFSKPY37N3C43SOI5IEXEK5JSIYA,1627391044559,title: NIRA Skincare Laser & Serum Bundle - In...,104,247
1,B07RBSLNFR,AFSKPY37N3C43SOI5IEXEK5JSIYA,1621184430697,title: OGANA CELL Peptide Concentrating Amazin...,104,35
2,B07SLFWZKN,AFSKPY37N3C43SOI5IEXEK5JSIYA,1619737501209,title: Keratin Secrets Do It Yourself Home Ker...,104,37
3,B08JTNQFZY,AFSKPY37N3C43SOI5IEXEK5JSIYA,1617904219785,title: GAINWELL ~ store: GAINWELL ~ main_categ...,104,216
4,B07KG1TWP5,AFSKPY37N3C43SOI5IEXEK5JSIYA,1596473351088,title: Organic Bamboo Cotton Ear Swabs by Bali...,104,20
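Conceptually, the encoding is a mapping from each distinct id to an integer. The sketch below assigns consecutive integers in sorted order, starting at 0. Both choices are assumptions made for illustration, and `fit_encoder` is a hypothetical helper rather than the library's code.

```python
def fit_encoder(values):
    """Map each distinct value to a consecutive integer, in sorted order."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}

item_ids = ["B08P2DZB4X", "B07RBSLNFR", "B08P2DZB4X", "B07SLFWZKN"]
encoder_items = fit_encoder(item_ids)
encoded = [encoder_items[i] for i in item_ids]
print(encoder_items)  # {'B07RBSLNFR': 0, 'B07SLFWZKN': 1, 'B08P2DZB4X': 2}
print(encoded)        # [2, 0, 2, 1]
```

Keeping the mapping around makes it possible to translate encoded ids back to the original ones later.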


## Creating sequences

We can group all the interactions for each user/session^ using the `create_sequences` function.

`create_sequences` joins all the relevant columns (features, item_id, score, item_id_encoded) for each user/session in chronological order (oldest to most recent).

^*When k-core filtering is applied, each user will have multiple sessions.*
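The grouping step can be sketched in plain Python as below, using only the item id for brevity. `build_sequences` is a hypothetical helper, and the sample rows reuse ids and timestamps from the `head()` output above.

```python
from collections import defaultdict

def build_sequences(interactions):
    """Group (user, item, timestamp) rows per user, oldest interaction first."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    # Sorting the (timestamp, item) tuples orders each user's items chronologically.
    return {user: [item for _, item in sorted(rows)] for user, rows in by_user.items()}

rows = [
    ("u1", "B08P2DZB4X", 1627391044559),
    ("u1", "B07KG1TWP5", 1596473351088),
    ("u1", "B07RBSLNFR", 1621184430697),
]
print(build_sequences(rows))
# {'u1': ['B07KG1TWP5', 'B07RBSLNFR', 'B08P2DZB4X']}
```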

In [13]:
data.create_sequences()

2024-11-06 11:43:55,876 - preprocessing.py:387 - INFO - Setting index to user_id
2024-11-06 11:43:55,878 - preprocessing.py:393 - INFO - Index set to user_id


KeyError: 'Column not found: score'

In [None]:
data.interactions.head()

## Splitting data into train/val/test

The dataset can be split into train, val, and test sets using the `split_data` function.

Currently, only the leave-one-out (LOO) method is supported (train: [:-2], val: [-2], test: [-1]). With this method, the train, val, and test sets may differ in size, depending on the length of the sequences.
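For a single sequence, the slicing above amounts to the following sketch. `leave_one_out` is a hypothetical helper, and raising an error for sequences shorter than three items is an assumption; how `split_data` treats short sequences is not shown here.

```python
def leave_one_out(sequence):
    """Split one sequence into train ([:-2]), val ([-2]) and test ([-1])."""
    if len(sequence) < 3:
        raise ValueError("need at least 3 interactions to fill all three splits")
    return sequence[:-2], sequence[-2], sequence[-1]

train, val, test = leave_one_out([20, 35, 37, 216, 247])
print(train, val, test)  # [20, 35, 37] 216 247
```

Each user contributes exactly one val and one test item, while the train part grows with the length of the sequence.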

In [None]:
data.split_data()

## Saving the dataset

Use the `save` method to save the dataset.

The function will save:

- interactions
- metadata
- train, val, and test splits
- encoder_items.json (mapping used for encoding items)
- encoder_users.json (mapping used for encoding users)
- stats.json (dataset statistics: num_items, num_users, num_interactions)

In [None]:
# directory to save the dataset
path = './sequential_dataset'

data.save(
    path,
    save_metadata=True,
)


In [None]:
!ls ./sequential_dataset/

# Creating a dataset with a yaml file

The user can also use the `preprocessing/create_dataset.py` script to create a SeqDataset from a yaml file that holds all the configuration. The file should be saved under `configs_hydra/dataset`.

Example of yaml file:
```
c_iid: asin
c_uid: reviewerID
c_score: overall
c_timestamp: unixReviewTime
dropna: false
kcore: 5
metadata_cols:
  - title
  - brand
  - category
chunksize: 500
path_reviews: ./data/amazon_reviews_2018/reviews/Video_Games.json
path_metadata: ./data/amazon_reviews_2018/meta/meta_Video_Games.json
path_output: ./data_experiments/amazon_reviews_2018/Video_Games
```

Assuming the file is stored under `configs_hydra/dataset/custom_data.yaml`, run:

`python preprocessing/create_dataset.py dataset=custom_data`

The user can also pass multiple files, e.g.:
```
path_reviews:
  - ./data/amazon_reviews_2018/reviews/Automotive.json
  - ./data/amazon_reviews_2018/reviews/Cell_Phones_and_Accessories.json

path_metadata:
  - ./data/amazon_reviews_2018/meta/meta_Automotive.json
  - ./data/amazon_reviews_2018/meta/meta_Cell_Phones_and_Accessories.json
```


"preprocessing/create_dataset.py" will instantiate a local dask cluster automatically. The default values can be found in "configs_hydra/dask/local.yaml"

```
n_workers: 10
threads_per_worker: 2
memory_limit: '20GB'
local_directory: /tmp/dask
dashboard_address: 8999
```

Custom settings can be passed by adding a new yaml file under `configs_hydra/dask/`. For example, for a config file `my_settings.yaml`, run:

`python preprocessing/create_dataset.py dask=my_settings`