# Assignment #4

**Authors:** *Ryan Ceresani*

**Class:** *605.744 Information Retrieval*

**Term:** *Fall 2021*

This assignment submission is structured slightly differently from my previous ones, as we are using a notebook in place of a "main" application.
The notebook integrates intermediate code output, plotting, results, and experimentation in one place.

## Expected Deliverables
- Report outlining methodologies, tools used, parameter decisions, etc.
- Precision, Recall, F1 score for the `"dev"` dataset using "Title" column as features.
- Metrics for `"dev"` dataset using "Title", "Abstract", and "Keywords" as features.
- Perform an additional non-trivial experiment.
- Predictions for the `"test"` dataset.


## **Overall Methodology**
Before getting into code specifics, we will address some overall elements used during this assignment. The main approach I chose for **Text Classification** is deep learning.

### Open Source Libraries
To facilitate deep learning, this experiment makes heavy use of `PyTorch` and `torchtext` for the underlying operations. Additionally, `scikit-learn` is used for its `metrics` library, which computes a variety of classification metrics whether or not you use one of its estimators.  
- **PyTorch**: This is the heavy-lifter for a number of backing abstract classes performing a variety of functions.
    - `data`: We created custom PyTorch `Dataset` and `DataLoader` classes for batching, sampling, and shuffling our training, validation, and test data as appropriate.
    - `nn.Module`: This is the abstract base class for most things in `PyTorch` but is the foundation for any model or classifier we will use.
    - `optim`: This provides an optimizer and learning rate scheduler for the training loops. One of the powerful utilities of PyTorch is the way it hides gradient interactions and backward propagation from a user.
    - `loss`: Combined with `optim` the loss module provides the ability to generate loss from our predictions. Specifically we use weighted CrossEntropyLoss.
    - **General PyTorch magic:** PyTorch also provides some other nice things when it comes to the gradient operations being attached to tensors or the built-in CUDA support. 
-  **Torchtext**: An official PyTorch extension that provides text-based utilities.
   -  `vocab`: This generates a `Vocabulary` object from an iterator, which can be used to map words to indices, converting tokens into values.
   -  `utils`: It also has some built-in utilities for tokenizing and creating ngrams.
- **scikit-learn**: Used for the `metrics` library to generate scores and reports.
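
To make the `vocab` step concrete, here is a minimal plain-Python sketch of what a vocabulary does: it assigns each token an integer index and falls back to an unknown-token index for unseen words. The names `build_vocab` and `lookup`, the `<unk>` token, and the `min_freq` parameter are illustrative assumptions, not the `torchtext` API.

```python
from collections import Counter
from typing import Dict, Iterable, List

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary tokens


def build_vocab(token_lists: Iterable[List[str]], min_freq: int = 1) -> Dict[str, int]:
    """Map each sufficiently frequent token to an index; index 0 is reserved for UNK."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {UNK: 0}
    # Most frequent tokens get the lowest indices; ties are broken alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def lookup(vocab: Dict[str, int], tokens: List[str]) -> List[int]:
    """Convert tokens to indices, sending unseen tokens to the UNK index."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


corpus = [["systematic", "review", "of", "trials"], ["review", "of", "methods"]]
vocab = build_vocab(corpus)
indices = lookup(vocab, ["review", "of", "unseen"])
```

The resulting index lists are what get wrapped in tensors and fed to the embedding layer.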

### **Dataset**
Systematic Review

- NOTE: The document **hash:a8113f0b-6561-3178-8c2d-7b4ebac229ff** contained an odd sequence of UTF-8 characters at the beginning of the "Article" section, which caused problems for the Python stdlib `csv.reader`. The value was sanitized to remove the characters (`",`), which would be removed in tokenization anyway. 

#### Dataset Challenges
The primary challenge posed by this dataset is the severe class imbalance in the training data (a 30:1 skew toward the negative class).

To counteract this, two approaches were used in tandem:
1. The chosen loss function (`CrossEntropyLoss`) used inverse class frequency weighting to encourage learning on the minority class.
2. The `DataLoader` used a `WeightedRandomSampler` which was weighted to essentially up-sample the minority class and create an artificial balance.
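
Both ideas can be sketched with the standard library alone. The snippet below uses a toy label list and `random.choices` as a stand-in for `WeightedRandomSampler`; the function names and toy data are illustrative assumptions, while the weight formulas follow the same inverse-frequency idea as `_calculate_weights` in `datasets.py`.

```python
import random
from collections import Counter
from typing import Dict, List


def inverse_frequency_weights(labels: List[int]) -> Dict[int, float]:
    """Weight each class by 1 / count so rarer classes get larger weights."""
    counts = Counter(labels)
    return {cls: 1.0 / n for cls, n in counts.items()}


def normalized_class_weights(labels: List[int]) -> Dict[int, float]:
    """Inverse-frequency weights rescaled to sum to 1 (for a weighted loss)."""
    raw = inverse_frequency_weights(labels)
    total = sum(raw.values())
    return {cls: w / total for cls, w in raw.items()}


def weighted_resample(labels: List[int], seed: int = 0) -> List[int]:
    """Draw len(labels) indices with replacement, proportional to each sample's class weight."""
    per_class = inverse_frequency_weights(labels)
    sample_weights = [per_class[y] for y in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=sample_weights, k=len(labels))


# Toy data with the same 30:1 negative skew described above.
labels = [0] * 3000 + [1] * 100
class_weights = normalized_class_weights(labels)
resampled = [labels[i] for i in weighted_resample(labels)]
positive_share = sum(resampled) / len(resampled)
```

With these weights each class contributes roughly half of the draws in expectation, which is the artificial balance the sampler is meant to create.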

### **Major Parameters**
Across all of the libraries and custom code (to be shown below) there are a number of key parameters that influence the results. They will be broken down into categories.

- Data
  - `batch_size`: Batch size when processing data can impact the quality of training depending on application.
    - **A batch size of 32 was used to balance between training speed and overgeneralization.**
  - `data_columns`: This specifies which columns of the *.tsv* file are to be used as features. 
    - **The prompt specifies we use first *title* and then *title + abstract + keywords***
  - `ngrams`: how large the ngram value should be for the dataset. 
    - **Initial value set to `unigram` or 1.**
  - `tokenizer`: `torchtext` comes with a number of pre-made tokenizers, each having slight variations. 
    - **For the initial pass we will use `basic_english`**. 
  - `weighted`: Whether or not the dataset should be weighted to oversample minority classes.
    - **With this dataset, it greatly enhances training so it is turned on.**
  


- Model
  - The model itself counts as a parameter, and in this case it is a very simple architecture with an `EmbeddingBag` and a fully connected linear layer.
    - The hyperparameters for the model are:
      - `vocab_size`: Which is directly tied to the dataset and not exactly a parameter.
      - `embedding_dim`: The size of the embedding and consequently the input to the linear layer.
        - **Initially set to 64**
  
- Training
  - `epochs`: How many times through the dataset to train. Depends on time/resources available. 
    - **Starting with 25 epochs to evaluate training results and will adjust from there.**
  - `learning_rate`: The learning rate directly impacts the optimizer, but due to the decoupled nature, can be changed fluidly.
    - **It is starting at a very high value of 5 to encourage faster convergence.**
  - `learning_rate_scheduler`: The methodology for updating the learning rate. We will be reducing the learning rate during a plateau of metric values.
    - **We chose to reduce the LR when the F1 score plateaus.**
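
The plateau rule can be sketched in plain Python. This is a simplified stand-in for a reduce-on-plateau scheduler, not PyTorch's implementation; the class name, `factor`, `patience`, and the F1 values are illustrative assumptions.

```python
class PlateauLRReducer:
    """Scale the learning rate by `factor` once more than `patience`
    consecutive epochs pass without a new best score."""

    def __init__(self, lr: float, factor: float = 0.1, patience: int = 2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")  # higher is better (e.g. F1)
        self.bad_epochs = 0

    def step(self, metric: float) -> float:
        """Record this epoch's metric and return the (possibly reduced) learning rate."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


reducer = PlateauLRReducer(lr=5.0, factor=0.1, patience=2)
f1_per_epoch = [0.10, 0.20, 0.20, 0.19, 0.20, 0.25]  # made-up scores
lr_history = [reducer.step(f1) for f1 in f1_per_epoch]
```

In this toy run the rate stays at 5 while F1 improves and drops only after three epochs in a row fail to beat the best score so far.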

## **Custom Modules - Source Code**
A number of custom modules were written to support this experiment. 

### datasets.py
```python
import io
from typing import Callable, List

import torch
from torch.utils import data
from torch.utils.data.sampler import WeightedRandomSampler
from torchtext.data.utils import get_tokenizer
from torchtext.utils import unicode_csv_reader
from torchtext.vocab import Vocab

_default_tokenizer = get_tokenizer("basic_english")
DEFAULT_LABEL_TRANSFORM = lambda x: x
DEFAULT_TEXT_TRANSFORM = lambda x: _default_tokenizer(x)


def create_torch_dataloader(
    dataset: data.Dataset,
    vocab: Vocab,
    label_transform: Callable = DEFAULT_LABEL_TRANSFORM,
    text_transform: Callable = DEFAULT_TEXT_TRANSFORM,
    weighted=True,
    **kwargs
):
    """Creates a Pytorch style dataloader using a dataset and a precompiled vocab.

    The dataset returns "model-ready" data.

    Args:
        dataset: The raw text dataset to be used during inference
        vocab: the premade vocabulary used to index words/phrases
        label_transform: any operation used on the datasets label output
        text_transform: operation used on the raw text sentence outputs from the data
        weighted: whether to weight the samples based on class distribution
        **kwargs: any additional kwargs used by Pytorch DataLoaders.

    Returns:
        A PyTorch DataLoader to be used during training, eval, or test.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def _collate_batch(batch):
        label_list, docid_list, text_list, offsets = [], [], [], [0]
        for (_label, _docid, _text) in batch:
            label_list.append(label_transform(_label))
            processed_text = torch.tensor(
                vocab(text_transform(_text)), dtype=torch.int64
            )
            text_list.append(processed_text)
            offsets.append(processed_text.size(0))
            docid_list.append(_docid)
        label_list = torch.tensor(label_list, dtype=torch.int64)
        offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
        text_list = torch.cat(text_list)
        return label_list.to(device), text_list.to(device), offsets.to(device), docid_list

    if weighted:
        weights = dataset.sample_weights
        sampler = WeightedRandomSampler(weights=weights, num_samples=len(weights))
    else:
        sampler = None

    return data.DataLoader(
        dataset,
        collate_fn=_collate_batch,
        shuffle=(sampler is None),
        sampler=sampler,
        **kwargs
    )


class TSVRawTextIterableDataset(data.IterableDataset):
    """Dataset that loads TSV data incrementally as an iterable and returns raw text.

    This dataset must be traversed in order as it only reads from the TSV file as it is called.
    Useful if the size of data is too large to load into memory at once.
    """

    def __init__(self, filepath: str, data_columns: List[int]):
        """Loads an iterator from a file.

        Args:
            filepath: location of the .tsv file
            data_columns: the columns in the .tsv that are used as feature data
        """
        self._number_of_items = _get_tsv_file_length(filepath)
        self._iterator = _create_data_from_tsv(
            filepath, data_column_indices=data_columns
        )
        self._current_position = 0

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._iterator)
        self._current_position += 1
        return item

    def __len__(self):
        return self._number_of_items


class TSVRawTextMapDataset(data.Dataset):
    """Dataset that loads all TSV data into memory and returns raw text.

    This dataset provides a map interface, allowing access to any entry.
    Useful for modifying the sampling or order during training.
    """

    def __init__(self, filepath: str, data_columns: List[int]):
        """Loads .tsv structured data into memory.

        Args:
            filepath: location of the .tsv file
            data_columns: the columns in the .tsv that are used as feature data
        """
        self._records = list(
            _create_data_from_tsv(filepath, data_column_indices=data_columns)
        )
        self._sample_weights, self._class_weights = self._calculate_weights()

    @property
    def sample_weights(self):
        return self._sample_weights
    
    @property
    def class_weights(self):
        return self._class_weights

    def _calculate_weights(self):
        targets = torch.tensor(
            [label if label > 0 else 0 for label, *_ in self._records]
        )
        unique, sample_counts = torch.unique(targets, return_counts=True)
        weight = 1.0 / sample_counts
        sample_weights = weight[targets]  # per-sample weight looked up by class index
        class_weights = weight / weight.sum()
        return sample_weights, class_weights

    def __getitem__(self, index):
        return self._records[index]

    def __len__(self):
        return len(self._records)


def _create_data_from_tsv(data_path, data_column_indices):
    with io.open(data_path, encoding="utf8") as f:
        reader = unicode_csv_reader(f, delimiter="\t")
        for row in reader:
            data = [row[i] for i in data_column_indices]
            yield int(row[0]), row[1], " ".join(data)


def _get_tsv_file_length(data_path):
    with io.open(data_path, encoding="utf8") as f:
        row_count = sum(1 for row in f)

    return row_count
```
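
The `offsets` bookkeeping in `_collate_batch` is easy to miss: all documents in a batch are concatenated into one flat index list, and `offsets` records where each document starts. Below is a plain-Python sketch of the same computation with made-up token indices; the helper name `flatten_with_offsets` is hypothetical.

```python
from itertools import accumulate
from typing import List, Tuple


def flatten_with_offsets(docs: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate per-document index lists and return each document's start position."""
    flat = [idx for doc in docs for idx in doc]
    lengths = [len(doc) for doc in docs]
    # Start of doc i = total length of the docs before it (cumulative sum, shifted by one).
    offsets = list(accumulate([0] + lengths[:-1]))
    return flat, offsets


docs = [[4, 9, 2], [7, 1], [3, 3, 8, 5]]  # three documents of token indices
flat, offsets = flatten_with_offsets(docs)
# Document i is flat[offsets[i]:offsets[i + 1]]; the last one runs to the end.
second_doc = flat[offsets[1]:offsets[2]]
```

This flat-plus-offsets layout is what `nn.EmbeddingBag` expects when it is given a 1-D input, which is why variable-length titles need no padding.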

## Experiment Walkthrough

We will now walk through the experiment notebook to see results in action.

### Imports

The open source and custom modules used are imported first.

In [1]:
%load_ext autoreload

%autoreload 2

In [2]:
import torch
from torch.utils.tensorboard import SummaryWriter
from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

In [3]:
from ir_classification import datasets, models
from ir_classification import vocab as ir_vocab
from ir_classification import train

## Setup the Training and Validation Datasets

In [19]:
datafield_map = {"assessment": 0, "doc_id": 1, "title": 2, "authors": 3, "journal": 4, "issn": 5, "year": 6, "language": 7, "abstract": 8, "keywords": 9}
data_columns = [datafield_map["title"]]
ngrams = 1
batch_size = 64

# Create vocab from the training data.
# vocab = ir_vocab.create_vocab_from_tsv("../datasets/systematic_review/phase1.train.shuf.tsv", data_columns, ngrams=ngrams)
glove = ir_vocab.create_glove_with_unk_vector()
vocab = ir_vocab.create_vocab_from_glove(glove)

# Load the TSV into datasets with the appropriate feature columns.
train_dataset = datasets.TSVRawTextMapDataset("../datasets/systematic_review/phase1.train.shuf.tsv", data_columns)
val_dataset = datasets.TSVRawTextMapDataset("../datasets/systematic_review/phase1.dev.shuf.tsv", data_columns)

# Create the transforms for the dataloader to appropriately format the contents of the files.
label_transform = lambda x: x if x > 0 else 0
tokenizer = get_tokenizer("basic_english")
text_transform = lambda x: list(ngrams_iterator(tokenizer(x), ngrams))

# Instantiate the dataloaders.
train_dataloader = datasets.create_torch_dataloader(train_dataset, vocab,  label_transform, text_transform, weighted=True, batch_size=batch_size)
val_dataloader = datasets.create_torch_dataloader(val_dataset, vocab,  label_transform, text_transform, weighted=False, batch_size=batch_size)
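
For reference, the `text_transform` above expands a token list into unigrams followed by higher-order n-grams. The plain-Python function below is a stand-in intended to mirror the behaviour of `ngrams_iterator` (unigrams first, then space-joined n-grams up to `ngrams`); it is an illustrative sketch under that assumption rather than the library source, and the name `ngrams_sketch` is hypothetical.

```python
from typing import Iterator, List


def ngrams_sketch(tokens: List[str], ngrams: int) -> Iterator[str]:
    """Yield every unigram, then every space-joined n-gram for n = 2..ngrams."""
    yield from tokens
    for n in range(2, ngrams + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


tokens = ["effects", "of", "exercise"]
unigrams = list(ngrams_sketch(tokens, 1))
bigrams = list(ngrams_sketch(tokens, 2))
```

With `ngrams = 1`, as in this cell, the transform reduces to plain tokenization.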

## Instantiate the Model

In [20]:
num_classes = 2
vocab_size = len(vocab) # from vocab created earlier.
embedding_size = 300
hidden_layer_size = 10

# Enable compatibility when training with GPU enabled devices.  
# (Some development work was done in Google Colab with GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# model = models.EmbeddingBagLinearModel(vocab_size, embedding_size, num_classes).to(device)
# Move the model to the same device the dataloader sends its tensors to.
model = models.PretrainedEmbeddingMLPModel(num_classes, hidden_layer_size, glove.vectors).to(device)

# Free up memory
del glove

## Setup the top-level training loop

The custom code was meant to handle the individual `step` and `epoch` levels generically.
This setup should let us change the components experimentally in cells like below without much other hassle.

In [26]:
EPOCHS = 20
learning_rate = 5

# Create the loss function weighted to inverse class distribution
# loss_function = torch.nn.CrossEntropyLoss(weight=train_dataset.class_weights)
loss_function = torch.nn.CrossEntropyLoss()

# Instantiate a Stochastic Gradient Descent optimizer and a step-decay learning rate schedule (LR x 0.95 after every epoch).
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

# Tensorboard writing utility class.
writer = SummaryWriter()

# Perform Training
for i in range(EPOCHS):
    start_iter = len(train_dataloader) * i
    train.train_epoch(i, model, optimizer, loss_function, train_dataloader, start_iter=start_iter, writer=writer)
    validation_results = train.evaluate_epoch(i, model, loss_function, val_dataloader, writer)
    scheduler.step()


#torch.save(model.state_dict(), "model_weights/title_only_state_dict.pth")

           Epoch 0: 100%|██████████| 340/340 [00:02<00:00, 130.29 batch/s, accurracy=0.75, loss=0.265]
     Validation: 0: 100%|██████████| 76/76 [00:00<00:00, 145.98 batch/s, accurracy=0.98, loss=1.23]
           Epoch 1: 100%|██████████| 340/340 [00:02<00:00, 127.46 batch/s, accurracy=0.75, loss=0.326]
     Validation: 1: 100%|██████████| 76/76 [00:00<00:00, 171.24 batch/s, accurracy=0.92, loss=0.141]
           Epoch 2: 100%|██████████| 340/340 [00:02<00:00, 145.09 batch/s, accurracy=0.75, loss=0.43]
     Validation: 2: 100%|██████████| 76/76 [00:00<00:00, 175.75 batch/s, accurracy=0.92, loss=1.78]
           Epoch 3: 100%|██████████| 340/340 [00:02<00:00, 146.72 batch/s, accurracy=1, loss=0.0864]
     Validation: 3: 100%|██████████| 76/76 [00:00<00:00, 170.93 batch/s, accurracy=0.92, loss=0.915]
           Epoch 4: 100%|██████████| 340/340 [00:02<00:00, 151.07 batch/s, accurracy=1, loss=0.0967]
     Validation: 4: 100%|██████████| 76/76 [00:00<00:00, 153.02 batch/s, accurracy=0.98,

In [18]:
# %load_ext tensorboard

# %tensorboard --logdir runs

## Generate Metrics on the Dev Set
Here we recreate the `dev` dataloader to process a single document at a time.

We use the less robust `predict` method so we can explicitly show the values and calculations being performed.

In [27]:
dev_dataloader = datasets.create_torch_dataloader(val_dataset, vocab,  label_transform, text_transform, weighted=False, batch_size=1)

model.eval()
preds = []
labels = []
for batch in dev_dataloader:
    label, text, *_ = batch
    pred_label = train.predict(model, text)
    preds.append(pred_label)
    labels.append(label.cpu().item())

### Generate a Confusion Matrix

We can use the confusion matrix to make it easy to visualize the values for the metrics. Rows are the actual classes (A0, A1) and columns are the predicted classes (P0, P1).  
       

       
|   | P0  | P1  |
|---|----|----|
| A0 | TN | FP |
| A1 | FN | TP |

In [28]:
from sklearn.metrics import confusion_matrix, average_precision_score
cm = confusion_matrix(labels, preds)
print(cm)

[[4388  312]
 [  83   67]]


### Calculate Precision, Recall, F1-Score on Dev Set
We use a confusion matrix to make it easy to map out the values for true positive, true negative, false positive, and false negative.

- Precision = tp / (tp + fp)
- Recall = tp / (tp + fn)
- F1 Score =  2 * (precision * recall) / (precision + recall)

In [29]:
tn, fp, fn, tp = cm.ravel() # Extract the components

# Calculate and print Precision
precision_string = f"{tp} / ({tp} + {fp})"
precision = round(eval(precision_string), 4)
print(f"Precision: {precision_string} = {precision}")

# Calculate and print Recall
recall_string = f"{tp} / ({tp} + {fn})"
recall = round(eval(recall_string), 4)
print(f"Recall: {recall_string} = {recall}")

# Calculate and print F1 Score
f1_string = f"2 * ({precision} * {recall}) / ({precision} + {recall})"
f1 = eval(f1_string)
print(f"F1: {f1_string} = {round(f1, 4)}")

print(f"AP: {average_precision_score(labels, preds)}")

Precision: 67 / (67 + 312) = 0.1768
Recall: 67 / (67 + 83) = 0.4467
F1: 2 * (0.1768 * 0.4467) / (0.1768 + 0.4467) = 0.2533


0.09607558324039568
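
As a cross-check on the arithmetic above, the same metrics can be computed directly from the raw counts, which avoids rounding precision and recall before they feed into F1. The helper name `prf1` is hypothetical; the counts are the title-only dev results printed above.

```python
from typing import Tuple


def prf1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Equivalent to 2 * P * R / (P + R), written in terms of raw counts.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f1


# Title-only dev confusion matrix: TN=4388, FP=312, FN=83, TP=67.
precision, recall, f1 = prf1(tp=67, fp=312, fn=83)
print(f"Precision={precision:.4f} Recall={recall:.4f} F1={f1:.4f}")
```

Rounded to four places, the values agree with the notebook output for this run.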

## Repeat Experiment with "Title", "Abstract", and "Keyword" Data

We will keep everything exactly the same for setup, changing only the things needed.

In [32]:
use_columns = ["title", "abstract", "keywords"]
data_columns = [datafield_map[col] for col in use_columns]
ngrams = 1

# Create vocab from the training data.
vocab = ir_vocab.create_vocab_from_tsv("../datasets/systematic_review/phase1.train.shuf.tsv", data_columns, ngrams=ngrams)

# Load the TSV into datasets with the appropriate feature columns.
train_dataset = datasets.TSVRawTextMapDataset("../datasets/systematic_review/phase1.train.shuf.tsv", data_columns)
val_dataset = datasets.TSVRawTextMapDataset("../datasets/systematic_review/phase1.dev.shuf.tsv", data_columns)

# Instantiate the dataloaders.
train_dataloader = datasets.create_torch_dataloader(train_dataset, vocab,  label_transform, text_transform, weighted=True, batch_size=batch_size)
val_dataloader = datasets.create_torch_dataloader(val_dataset, vocab,  label_transform, text_transform, weighted=False, batch_size=batch_size)

# Create Model
vocab_size = len(vocab)
# model = models.EmbeddingBagLinearModel(vocab_size, embedding_size, num_classes).to(device)

In [34]:
EPOCHS = 20
learning_rate = 5

# Create the loss function weighted to inverse class distribution
# loss_function = torch.nn.CrossEntropyLoss(weight=train_dataset.class_weights)
loss_function = torch.nn.CrossEntropyLoss()

# Instantiate a Stochastic Gradient Descent optimizer and a step-decay learning rate schedule (LR x 0.95 after every epoch).
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, "max")

# Tensorboard writing utility class.
writer = SummaryWriter()

# Perform Training
for i in range(EPOCHS):
    start_iter = len(train_dataloader) * i
    train.train_epoch(i, model, optimizer, loss_function, train_dataloader, start_iter=start_iter, writer=writer)
    validation_results = train.evaluate_epoch(i, model, loss_function, val_dataloader, writer)
    # scheduler.step(validation_results["recall"])
    scheduler.step()

           Epoch 0: 100%|██████████| 340/340 [00:09<00:00, 37.23 batch/s, accurracy=0.5, loss=0.141]
     Validation: 0: 100%|██████████| 76/76 [00:01<00:00, 46.61 batch/s, accurracy=0.02, loss=1.95]
           Epoch 1: 100%|██████████| 340/340 [00:08<00:00, 40.31 batch/s, accurracy=0.25, loss=0.395]
     Validation: 1: 100%|██████████| 76/76 [00:01<00:00, 46.99 batch/s, accurracy=0.04, loss=1.12]
           Epoch 2: 100%|██████████| 340/340 [00:08<00:00, 41.61 batch/s, accurracy=0.75, loss=0.0532]
     Validation: 2: 100%|██████████| 76/76 [00:01<00:00, 41.12 batch/s, accurracy=0.06, loss=1.39]
           Epoch 3: 100%|██████████| 340/340 [00:08<00:00, 40.65 batch/s, accurracy=0.5, loss=0.135]
     Validation: 3: 100%|██████████| 76/76 [00:01<00:00, 47.15 batch/s, accurracy=0.06, loss=1.29]
           Epoch 4: 100%|██████████| 340/340 [00:08<00:00, 40.05 batch/s, accurracy=0.25, loss=0.199]
     Validation: 4: 100%|██████████| 76/76 [00:01<00:00, 42.00 batch/s, accurracy=0.58, loss=0.

In [35]:
model.eval()
preds = []
labels = []
for batch in dev_dataloader:
    label, text, *_ = batch
    pred_label = train.predict(model, text)
    preds.append(pred_label)
    labels.append(label.cpu().item())

cm = confusion_matrix(labels, preds)
print(cm)
tn, fp, fn, tp = cm.ravel() # Extract the components

# Calculate and print Precision
precision_string = f"{tp} / ({tp} + {fp})"
precision = round(eval(precision_string), 4)
print(f"Precision: {precision_string} = {precision}")

# Calculate and print Recall
recall_string = f"{tp} / ({tp} + {fn})"
recall = round(eval(recall_string), 4)
print(f"Recall: {recall_string} = {recall}")

# Calculate and print F1 Score
f1_string = f"2 * ({precision} * {recall}) / ({precision} + {recall})"
f1 = eval(f1_string)
print(f"F1: {f1_string} = {round(f1, 4)}")


print(f"AP: {average_precision_score(labels, preds)}")

[[3556 1144]
 [ 100   50]]
Precision: 50 / (50 + 1144) = 0.0419
Recall: 50 / (50 + 100) = 0.3333
F1: 2 * (0.0419 * 0.3333) / (0.0419 + 0.3333) = 0.0744
