# Custom Dataset class implementation: Case study for sentiment text corpora

* In this document, we implement a Dataset class that preprocesses a sentiment-annotated text corpus.
* The raw data is an NDJSON (newline-delimited JSON) file. Each line represents a single sample with a `text` field and a `sentiment` field.

```
{"text": "This is first text", "sentiment": "positive"}
{"text": "This is second text", "sentiment": "negative"}
{"text": "That's all", "sentiment": "positive"}
```
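
As a rough illustration of the format, the sketch below parses the three records above with only the standard library. It assumes one JSON object per non-empty line; the in-memory `io.StringIO` stream is a stand-in for a real file and is not part of the original data.

```python
import io
import json

# In-memory stand-in for an NDJSON file (assumption: one JSON object per line).
raw = io.StringIO(
    '{"text": "This is first text", "sentiment": "positive"}\n'
    '{"text": "This is second text", "sentiment": "negative"}\n'
    '{"text": "That\'s all", "sentiment": "positive"}\n'
)

# Parse each non-empty line into a dict.
samples = [json.loads(line) for line in raw if line.strip()]

print(len(samples))
print(samples[0]["sentiment"])
```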

* Suppose we want to apply the following preprocessing steps.
    1. tokenization
    2. word-to-index conversion
    3. sentiment class-to-index conversion
* After preprocessing, we expect each sample to look like this.

```
{"text": "This is first text", "sentiment": "positive", "sequence": ["This", "is", "first", "text"], "token_ids": [0, 1, 2, 3], "ground_truth": 0}
{"text": "This is second ...
```

## How to design the Dataset class
* A widely accepted practice is to decouple preprocessing from data loading and modularize it as transformation functions.
* With that in mind, we implement a dataset class with two features: 1) reading the NDJSON data and 2) applying the transformation functions sequentially.

In [1]:
from torch.utils.data import Dataset

* The minimum requirements for a Dataset class are as follows.
    1. Inherit from the `torch.utils.data.Dataset` class.
    2. Implement the `__len__()` method.
        * It returns the number of samples in the dataset. The default options of DataLoader rely on it.
    3. Implement the `__getitem__()` method.
        * It returns the sample at the given index.
* In addition to these requirements, we implement the following methods.
    1. `sample_loader()`: returns an iterable of samples.
    2. `transform()`: applies the given transformation functions sequentially.

In [2]:
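from typing import Any, Callable, Dict, Optional, TextIO, Union
import io
import json

class SentimentDataset(Dataset):
    
    def __init__(self, path_like: Union[str, TextIO],
                 transform_functions: Optional[Dict[str, Callable[..., Any]]] = None):
        # path_like is either a file path or an already opened text stream.
        self.path_like = path_like
        # Read all samples into memory once.
        self._samples = list(self.sample_loader())
        # Maps an output field name to the function that computes it.
        self._transform_functions = transform_functions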
    
    def __len__(self):
        return len(self._samples)
    
    def __getitem__(self, idx: int):
        sample = self._samples[idx]
        preprocessed = self.transform(sample)
        return preprocessed
    
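    def sample_loader(self):
        # Open the file only when a path is given; otherwise read the stream as-is.
        if isinstance(self.path_like, str):
            ifs = io.open(self.path_like, mode="r")
        else:
            ifs = self.path_like
        try:
            for str_json in ifs:
                str_json = str_json.strip()
                if str_json:  # skip blank lines such as a trailing newline
                    yield json.loads(str_json)
        finally:
            # Close the stream even if parsing fails midway.
            ifs.close()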
    
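    def transform(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        if self._transform_functions is None:
            return sample
        
        # Work on a shallow copy so the stored raw sample is not modified.
        sample = dict(sample)
        for output_field_name, transform_function in self._transform_functions.items():
            # Each function receives every field computed so far as keyword arguments.
            sample[output_field_name] = transform_function(**sample)
        return sample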
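
To make the chaining in `transform()` concrete, here is a small torch-free sketch of the same loop. The helper `apply_transforms` and the two toy functions `upper_text` and `text_length` are hypothetical and exist only for this illustration. Each function names the field it needs as a parameter and absorbs the rest through `**kwargs`, so a later function can consume a field produced by an earlier one.

```python
from typing import Any, Callable, Dict


def apply_transforms(sample: Dict[str, Any],
                     transform_functions: Dict[str, Callable[..., Any]]) -> Dict[str, Any]:
    """Apply each function in insertion order and store its result under its key."""
    result = dict(sample)  # shallow copy; the input dict is left untouched
    for output_field_name, function in transform_functions.items():
        result[output_field_name] = function(**result)
    return result


def upper_text(text, **kwargs):
    # Hypothetical toy transform: reads the `text` field.
    return text.upper()


def text_length(upper, **kwargs):
    # Hypothetical toy transform: reads the `upper` field produced by upper_text.
    return len(upper)


# Python dicts preserve insertion order, so `upper` is computed before `length`.
transforms = {"upper": upper_text, "length": text_length}
out = apply_transforms({"text": "That's all", "sentiment": "positive"}, transforms)
print(out)
```

Because the order matters, a function that depends on another one's output has to be registered after it.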

## See if it works as expected

In [3]:
from pprint import pprint

In [4]:
corpora = """
{"text": "This is first text", "sentiment": "positive"}
{"text": "This is second text", "sentiment": "negative"}
{"text": "That's all", "sentiment": "positive"}
""".strip()

In [5]:
p = io.StringIO(corpora)
dataset = SentimentDataset(path_like=p)

* Congratulations! You successfully implemented your own dataset class.

In [6]:
print(f"#samples: {len(dataset)}")
print(f"first sample: {dataset[0]}")

#samples: 3
first sample: {'text': 'This is first text', 'sentiment': 'positive'}


## Implementing transform functions
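* This is fairly straightforward.
* Hint: return numeric values as `np.array` instead of `list`. Otherwise, the default collate function of DataLoader treats each list as a sequence and batches it position by position (as happens to `sequence` in the appendix) instead of stacking it into a single tensor.
* (ToDo: It would be better to parametrize which field is used as the input of each function.)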

In [7]:
import numpy as np

def whitespace_tokenizer(text, **kwargs):
    return text.split(" ")

def token_to_index_converter(sequence, **kwargs):
    mapper = {
        "This": 0,
        "is": 1,
        "first": 2,
        "text": 3
    }
    return np.array([mapper.get(token, -1) for token in sequence])

def sentiment_to_index_converter(sentiment, **kwargs):
    mapper = {
        "positive": 0,
        "negative": 1
    }
    return mapper[sentiment]
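
Regarding the ToDo above, one possible way to parametrize the input field is a small wrapper that binds a plain one-argument function to a field name. This is only a sketch: `on_field` is a hypothetical helper, not part of the dataset class. Note also that `str.split` without arguments splits on any whitespace, which differs slightly from `text.split(" ")`.

```python
from typing import Any, Callable


def on_field(function: Callable[[Any], Any], input_field: str) -> Callable[..., Any]:
    """Wrap a one-argument function so that it reads `input_field` from the sample."""
    def wrapped(**sample):
        return function(sample[input_field])
    return wrapped


# The wrapped functions accept the whole sample as keyword arguments,
# which matches how transform() calls them.
tokenize_text = on_field(str.split, "text")
count_tokens = on_field(len, "sequence")

sample = {"text": "This is first text", "sentiment": "positive"}
sample["sequence"] = tokenize_text(**sample)
sample["n_tokens"] = count_tokens(**sample)
print(sample)
```

With such a wrapper, the transform functions themselves would no longer need a `**kwargs` parameter or a fixed argument name.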

## See if transformation works as expected

In [8]:
from collections import OrderedDict

In [9]:
p = io.StringIO(corpora)
transform_functions = OrderedDict() # order: tokenize > token-to-index > sentiment-to-index
transform_functions["sequence"] = whitespace_tokenizer
transform_functions["token_ids"] = token_to_index_converter
transform_functions["ground_truth"] = sentiment_to_index_converter

dataset = SentimentDataset(path_like=p, transform_functions=transform_functions)

* Congratulations! You have implemented your own preprocessing along with the dataset class.

In [10]:
pprint(dataset[0], compact=True)

{'ground_truth': 0,
 'sentiment': 'positive',
 'sequence': ['This', 'is', 'first', 'text'],
 'text': 'This is first text',
 'token_ids': array([0, 1, 2, 3])}


## Appendix: DataLoader will do almost everything
* It handles shuffling, batching, and conversion to tensor objects.

In [11]:
from torch.utils.data import DataLoader

In [12]:
data_loader = DataLoader(dataset, batch_size=2, shuffle=False)

In [13]:
for batch in data_loader:
    pprint(batch, compact=True)
    print("---")

{'ground_truth': tensor([0, 1]),
 'sentiment': ['positive', 'negative'],
 'sequence': [('This', 'This'), ('is', 'is'), ('first', 'second'),
              ('text', 'text')],
 'text': ['This is first text', 'This is second text'],
 'token_ids': tensor([[ 0,  1,  2,  3],
        [ 0,  1, -1,  3]])}
---
{'ground_truth': tensor([0]),
 'sentiment': ['positive'],
 'sequence': [("That's",), ('all',)],
 'text': ["That's all"],
 'token_ids': tensor([[-1, -1]])}
---
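
Note that the batches above collate cleanly because the samples in each batch happen to have the same number of tokens. If sequences of different lengths end up in one batch (for example with `shuffle=True`), the `token_ids` arrays can no longer be stacked. One common workaround is to pad them to a common length inside a custom function passed to DataLoader's `collate_fn` argument. The following is a minimal, torch-free sketch of the padding step only; `pad_token_ids` and `PAD_ID` are hypothetical names, and the value `-2` is an arbitrary assumption chosen to differ from the `-1` used for unknown tokens.

```python
from typing import Any, Dict, List

PAD_ID = -2  # assumption: any id that does not collide with real or unknown tokens


def pad_token_ids(batch: List[Dict[str, Any]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Right-pad the `token_ids` of every sample to the longest length in the batch."""
    max_len = max(len(sample["token_ids"]) for sample in batch)
    return [
        list(sample["token_ids"]) + [pad_id] * (max_len - len(sample["token_ids"]))
        for sample in batch
    ]


# The first and third samples from above, which have different lengths.
batch = [
    {"token_ids": [0, 1, 2, 3]},
    {"token_ids": [-1, -1]},
]
padded = pad_token_ids(batch)
print(padded)
```

Inside an actual `collate_fn`, the padded lists would then be converted to a tensor and the remaining fields batched as needed.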
