# Text Analytics Coursework -- Data Loader

This notebook helps you get started with the datasets used in the coursework assignment.

For this coursework, we recommend using the virtual environment that you created for the labs.

In [None]:
%load_ext autoreload
%autoreload 2

# Use HuggingFace's datasets library to access the coursework datasets
from datasets import load_dataset
import numpy as np

# Financial News Sentiment Classification

The Financial Phrasebank dataset contains sentences from financial news articles, classified into negative (0), neutral (1), or positive (2) sentiment. See [HuggingFace](https://huggingface.co/datasets/financial_phrasebank) for more information. 

First we need to load the data. The dataset only provides a single training split, so we will split off our own validation and test sets. The _validation_ set (also called the 'development' set or 'devset') can be used to measure the performance of your model when tuning hyperparameters, optimising combinations of features, or examining the errors your model makes before improving it. This allows you to hold out the test set (i.e., not look at it at all while developing your method), so that it gives a fair evaluation of how well the model generalises to new examples and you avoid tuning the model to specific examples in the test set. An alternative to a single fixed validation set is [cross validation](https://scikit-learn.org/stable/modules/cross_validation.html).

In [None]:
from sklearn.model_selection import train_test_split

cache_dir = "./data_cache"

train_dataset = load_dataset(
    "financial_phrasebank",
    name="sentences_50agree",
    split="train",
    cache_dir=cache_dir,
)
print(f"Full dataset with {len(train_dataset)} instances loaded")

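# Hold out 25% of the data as the test set.
# A fixed random_state makes the split reproducible between runs.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    train_dataset["sentence"], train_dataset["label"], test_size=0.25, random_state=42
)

# Optionally, split a validation set out of the remaining training data (25% of what is left)
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.25, random_state=42
)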

print(f"Training dataset with {len(train_labels)} instances loaded")
print(f"Validation dataset with {len(val_labels)} instances loaded")
print(f"Test dataset with {len(test_labels)} instances loaded")
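
As an alternative to the fixed validation split, cross validation can be sketched as follows. This is only an illustration: the toy sentences, the TF-IDF features and the logistic regression classifier are placeholder assumptions, not a recommended method for the coursework. In the notebook you would pass `train_texts` and `train_labels` instead of the toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); replace with train_texts and train_labels.
toy_texts = [
    "profit fell sharply", "losses widened this quarter", "sales dropped again",
    "the company is based in Helsinki", "the meeting is on Monday", "shares are listed in London",
    "profit rose strongly", "sales increased this year", "orders grew rapidly",
]
toy_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Placeholder model. Wrapping the vectorizer in a pipeline means it is
# re-fitted on the training part of each fold only.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(model, toy_texts, toy_labels, cv=cv, scoring="accuracy")

print(f"Accuracy per fold: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")
```

With cross validation, every training example is used for validation exactly once, at the cost of training the model once per fold. The test set is still held out and never passed to `cross_val_score`.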


# MIT Restaurants

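This dataset consists of requests for restaurant recommendations given by users of a dialogue system. They are tagged with named entities in eight categories. The tags follow the BIO scheme, where `B-` marks the first token of an entity, `I-` marks a continuation token and `O` marks tokens outside any entity, giving 17 tags in total: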
{
    "O": 0,
    "B-Rating": 1,
    "I-Rating": 2,
    "B-Amenity": 3,
    "I-Amenity": 4,
    "B-Location": 5,
    "I-Location": 6,
    "B-Restaurant_Name": 7,
    "I-Restaurant_Name": 8,
    "B-Price": 9,
    "B-Hours": 10,
    "I-Hours": 11,
    "B-Dish": 12,
    "I-Dish": 13,
    "B-Cuisine": 14,
    "I-Price": 15,
    "I-Cuisine": 16
}
For further details, see [the HuggingFace page](https://huggingface.co/datasets/tner/mit_restaurant).
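
To make the tag scheme concrete, the sketch below inverts the mapping above and groups BIO-tagged tokens into entities. The helper `extract_entities` and the example sentence are hypothetical and written for illustration; they are not part of the dataset or of any library.

```python
# Tag-to-ID mapping copied from the dataset description above.
label2id = {
    "O": 0, "B-Rating": 1, "I-Rating": 2, "B-Amenity": 3, "I-Amenity": 4,
    "B-Location": 5, "I-Location": 6, "B-Restaurant_Name": 7,
    "I-Restaurant_Name": 8, "B-Price": 9, "B-Hours": 10, "I-Hours": 11,
    "B-Dish": 12, "I-Dish": 13, "B-Cuisine": 14, "I-Price": 15, "I-Cuisine": 16,
}
id2label = {idx: tag for tag, idx in label2id.items()}


def extract_entities(tokens, tag_ids):
    """Group BIO-tagged tokens into (entity_type, text) pairs (hypothetical helper)."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = id2label[int(tag_id)]  # int() also accepts string IDs such as "9"
        if tag == "O":
            prefix, entity_type = "O", None
        else:
            prefix, entity_type = tag.split("-", 1)
        # An entity ends at "O", at a new "B-" tag, or at an "I-" tag of a different type.
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if current_tokens and (prefix == "O" or starts_new):
            entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix != "O":
            current_type = entity_type
            current_tokens.append(token)
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hypothetical sentence, tagged by hand for illustration.
tokens = ["find", "a", "cheap", "thai", "place", "near", "me"]
tag_ids = [0, 0, 9, 14, 0, 5, 6]
print(extract_entities(tokens, tag_ids))
# [('Price', 'cheap'), ('Cuisine', 'thai'), ('Location', 'near me')]
```

Looking at entities rather than individual tags can make it easier to inspect the errors a sequence tagger makes.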

In [None]:
ner_dataset = load_dataset(
    "tner/mit_restaurant", 
)

print(f'The dataset is a dictionary with {len(ner_dataset)} splits: \n\n{ner_dataset}')

In [None]:
# It may be useful to obtain the data in a list format for some sequence tagging methods
train_sentences_ner = [item['tokens'] for item in ner_dataset['train']]
train_labels_ner = [[str(tag) for tag in item['tags']] for item in ner_dataset['train']]

val_sentences_ner = [item['tokens'] for item in ner_dataset['validation']]
val_labels_ner = [[str(tag) for tag in item['tags']] for item in ner_dataset['validation']]

test_sentences_ner = [item['tokens'] for item in ner_dataset['test']]
test_labels_ner = [[str(tag) for tag in item['tags']] for item in ner_dataset['test']]

In [None]:
# Show the different tag values in the dataset:
np.unique(np.concatenate(train_labels_ner))
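
Besides listing the distinct tag values, it can help to count how often each one occurs, since the `O` tag typically accounts for most tokens in NER data. The sketch below uses a small hypothetical stand-in for `train_labels_ner` so that it runs on its own.

```python
from collections import Counter

# Hypothetical stand-in for train_labels_ner: one list of string tag IDs per sentence.
toy_labels_ner = [["0", "0", "9", "14"], ["0", "5", "6"], ["0", "0"]]

# Flatten the nested lists and count each tag.
tag_counts = Counter(tag for sentence in toy_labels_ner for tag in sentence)
total = sum(tag_counts.values())

for tag, count in tag_counts.most_common():
    print(f"tag {tag}: {count} tokens ({count / total:.0%})")
```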

### (Optional) Transformer Sequence Tagger

If you want to use a transformer for task 2, you may want to take a look at the [Token Classification tutorial](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=vc0BSBLIIrJQ) from HuggingFace. You do not need to use a transformer to achieve high marks; it is just one option that you may consider. Feel free to skip this part of the notebook if you are using a different kind of model that does not require it.

A useful function provided by HuggingFace as part of the Token Classification tutorial is `tokenize_and_align_labels`. You can reuse this function if you are working with a method that tokenizes the text in a different way to the NER dataset. The function is provided below:

In [None]:
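def tokenize_and_align_labels(examples):
    """Tokenize pre-split words and align the word-level tags with the resulting sub-word tokens.

    Relies on the global variables `tokenizer` and `label_all_tokens`, which must be defined before calling.
    """
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, max_length=128, is_split_into_words=True)
    # Debugging aid: show the fields returned by the tokenizer (printed once per batch).
    print(tokenized_inputs.keys())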
    labels = []
    for i, label in enumerate(examples["tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [None]:
from transformers import AutoTokenizer

# An example of how to use tokenize_and_align_labels.
# The function reads `tokenizer` and `label_all_tokens` as globals, so define them first:
tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
label_all_tokens = False

tokenized_dataset = ner_dataset.map(tokenize_and_align_labels, batched=True)
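
To see what the alignment step does without downloading a tokenizer, the sketch below applies the same rule to a single sentence. The function `align_labels` is a hypothetical standalone version of the inner loop, and the word IDs are written by hand under the assumption that "sushi" is split into two sub-word tokens; a real tokenizer may split the words differently.

```python
def align_labels(word_ids, word_labels, label_all_tokens=False):
    """Align word-level labels to sub-word tokens given their word IDs (illustrative helper)."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] and [SEP] are ignored by the loss.
            aligned.append(-100)
        elif word_idx != previous_word_idx:
            # The first sub-word token of a word receives the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Later sub-word tokens receive the label or -100, depending on the flag.
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return aligned


# Assumed tokenization of ["cheap", "sushi", "nearby"]:
# [CLS] cheap su ##shi nearby [SEP]
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [9, 14, 5]  # B-Price, B-Cuisine, B-Location

print(align_labels(word_ids, word_labels))
# [-100, 9, 14, -100, 5, -100]
print(align_labels(word_ids, word_labels, label_all_tokens=True))
# [-100, 9, 14, 14, 5, -100]
```

With `label_all_tokens=False`, only the first sub-word token of each word contributes to the loss, so predictions can be mapped back to the original words one-to-one.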