# Introduction to Data Analytics Coursework -- Text Analytics Data Loader

For this coursework, we recommend using the virtual environment you created for the labs. Alternatively, create a fresh environment following the instructions in the [README.md in the intro-labs-public GitHub repository](https://github.com/uob-TextAnalytics/intro-labs-public).

In [1]:
%load_ext autoreload
%autoreload 2

# Use HuggingFace's datasets library to access the datasets used in this coursework
from datasets import load_dataset

# Amazon Reviews

This dataset contains Amazon reviews in different languages along with their star ratings from 1 to 5. You can choose which language to work with. The code below passes the argument 'en' to load English reviews.

In [3]:
dataset = load_dataset(
    "amazon_reviews_multi", 
    'en', # Select language of the dataset
    cache_dir='./data_cache'
)

print(f'The dataset is a dictionary with {len(dataset)} splits: \n\n{dataset}')

Reusing dataset amazon_reviews_multi (./data_cache/amazon_reviews_multi/en/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609)


  0%|          | 0/3 [00:00<?, ?it/s]

The dataset is a dictionary with 3 splits: 

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
})


The dataset already comes with three splits: a training split, a validation split (also called the 'development' set or 'devset'), and a test split, which we can hold out until we have tuned our method(s).

The validation set can be used to measure the performance of your model when tuning hyperparameters, optimising combinations of features, or looking at the errors your model makes before improving it. This allows you to hold out the test set (i.e., not to look at examples from the test set while developing the model) so that it gives a fair evaluation of the model and of how well it generalises to new examples. Holding out the test set in this way avoids tuning the model to specific examples in it.

Using the pre-supplied validation set is not the only approach to validation: you could instead use [cross validation](https://scikit-learn.org/stable/modules/cross_validation.html).
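
To make the idea of cross validation concrete, here is a minimal pure-Python sketch of k-fold splitting. It is only an illustration of the principle, not scikit-learn's implementation: the helper `k_fold_indices`, the toy labels, and the majority-class "model" are all hypothetical stand-ins. In practice you would probably use the ready-made scikit-learn utilities described on the page linked above.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Yield (train_indices, val_indices) pairs for k contiguous folds.

    Hypothetical helper for illustration; every item appears in exactly
    one validation fold.
    """
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, val_idx
        start += size


# Toy stand-in for a list of star ratings (not real data).
toy_labels = [1, 5, 5, 3, 5, 1, 5, 4, 5, 2]

fold_accuracies = []
for train_idx, val_idx in k_fold_indices(len(toy_labels), k=5):
    # "Train" a trivial baseline: always predict the most common training label.
    majority = Counter(toy_labels[i] for i in train_idx).most_common(1)[0][0]
    # Evaluate the baseline on the held-out fold.
    correct = sum(toy_labels[i] == majority for i in val_idx)
    fold_accuracies.append(correct / len(val_idx))

mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(f'Accuracy per fold: {fold_accuracies}')
print(f'Mean accuracy: {mean_accuracy:.2f}')
```

Each fold takes a turn as the validation data while the remaining folds are used for training, and the per-fold scores are averaged. A real experiment would replace the majority-class baseline with your own classifier and would usually shuffle or stratify the data before splitting.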

The code below loads each of the splits into a list of documents and a list of corresponding review score labels.

In [4]:
train_documents = dataset["train"]['review_body']
print(f'Example training document: {train_documents[0]}')
train_labels = dataset["train"]['stars']
print(f'Corresponding review score: {train_labels[0]}')
print(f'Number of training instances: {len(train_documents)}')

val_documents = dataset["validation"]['review_body']
print(f'Number of validation instances: {len(val_documents)}')
val_labels = dataset["validation"]['stars']

test_documents = dataset["test"]['review_body']
print(f'Number of test instances: {len(test_documents)}')
test_labels = dataset["test"]['stars']

Example training document: Arrived broken. Manufacturer defect. Two of the legs of the base were not completely formed, so there was no way to insert the casters. I unpackaged the entire chair and hardware before noticing this. So, I'll spend twice the amount of time boxing up the whole useless thing and send it back with a 1-star review of part of a chair I never got to sit in. I will go so far as to include a picture of what their injection molding and quality assurance process missed though. I will be hesitant to buy again. It makes me wonder if there aren't missing structures and supports that don't impede the assembly process.
Corresponding review score: 1
Number of training instances: 200000
Number of validation instances: 5000
Number of test instances: 5000


The star ratings are our classes in this task:

In [5]:
import numpy as np 
print(np.unique(train_labels))

[1 2 3 4 5]
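
Beyond listing the classes, it can be useful to check how many instances each class has, since a skewed distribution affects which evaluation metrics are informative. The sketch below counts labels with `collections.Counter`; `toy_labels` is a hypothetical stand-in for `train_labels`, not real data from the dataset.

```python
from collections import Counter

# Hypothetical stand-in for train_labels; swap in the real list to inspect it.
toy_labels = [1, 5, 3, 5, 2, 4, 5, 1, 3, 5]

label_counts = Counter(toy_labels)
total = len(toy_labels)

# Report the count and share of each star rating, in ascending order.
for star in sorted(label_counts):
    share = label_counts[star] / total
    print(f'{star} stars: {label_counts[star]} instances ({share:.0%})')
```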


The training set is very large, so you may wish to work with a subset of the training data by using the code below:

In [6]:
from sklearn.model_selection import train_test_split

# Keep a stratified 10% subset of the training data and set the other 90% aside
train_documents, unused_documents, train_labels, unused_labels = train_test_split(
    train_documents, 
    train_labels, 
    test_size=0.9,  # fraction set aside as unused, leaving 10% for training
    stratify=train_labels  # keep the same proportion of labels in the retained and unused subsets
)

In [7]:
# Labels are star ratings from 1 to 5
print(f'How many instances in the train dataset? \n\n{len(train_documents)}')
print('')
print(f'What does one instance look like? \n\n{train_documents[234]}')

How many instances in the train dataset? 

20000

What does one instance look like? 

Other than the missing oil and a few missing attachments, it's great.


# BioCreative V

This dataset contains sentences extracted from scientific articles on PubMed. The sentences are annotated with two types of entities, chemicals and diseases. Let's load the data: 

In [8]:
dataset = load_dataset(
    "tner/bc5cdr", 
    cache_dir='./data_cache'
)

print(f'The dataset is a dictionary with {len(dataset)} splits: \n\n{dataset}')

Reusing dataset bc5_cdr (./data_cache/tner___bc5_cdr/bc5cdr/1.0.0/66ea1a8c1cb0adcf8751ab7993f5d31717f21c2a9b89e21506fe959d904a7bf6)


  0%|          | 0/3 [00:00<?, ?it/s]

The dataset is a dictionary with 3 splits: 

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5228
    })
    validation: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5330
    })
    test: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5865
    })
})


The data is also split into train, validation and test. It may be convenient to reformat these splits into lists of tokens and lists of tags:

In [9]:
train_sentences_ner = [item['tokens'] for item in dataset['train']]
train_labels_ner = [[str(tag) for tag in item['tags']] for item in dataset['train']]

val_sentences_ner = [item['tokens'] for item in dataset['validation']]
val_labels_ner = [[str(tag) for tag in item['tags']] for item in dataset['validation']]

test_sentences_ner = [item['tokens'] for item in dataset['test']]
test_labels_ner = [[str(tag) for tag in item['tags']] for item in dataset['test']]

In [10]:
print(f'Number of training instances: {len(train_sentences_ner)}')
print(f'Number of validation instances: {len(val_sentences_ner)}')
print(f'Number of test instances: {len(test_sentences_ner)}')

print(f'Example training sentence (already tokenised): {train_sentences_ner[0]}')
print(f'...corresponding tags for the same example: {train_labels_ner[0]}')

Number of training instances: 5228
Number of validation instances: 5330
Number of test instances: 5865
Example training sentence (already tokenised): ['Naloxone', 'reverses', 'the', 'antihypertensive', 'effect', 'of', 'clonidine', '.']
...corresponding tags for the same example: ['1', '0', '0', '0', '0', '0', '1', '0']


These are the tags used to annotate the entities:

In [11]:
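# Map tag names to the integer IDs used in the dataset
label2id = {
    "O": 0,
    "B-Chemical": 1,
    "B-Disease": 2,
    "I-Disease": 3,
    "I-Chemical": 4
}

# Invert the mapping to look up tag names from integer IDs
id2label = {v: k for k, v in label2id.items()}
print(id2label)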

{0: 'O', 1: 'B-Chemical', 2: 'B-Disease', 3: 'I-Disease', 4: 'I-Chemical'}
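
The tag names follow the BIO scheme: a `B-` tag marks the first token of an entity, an `I-` tag marks a continuation of it, and `O` marks tokens outside any entity. As a minimal sketch of how the integer tags can be turned back into entity mentions, the code below groups tagged tokens into (type, text) pairs. The names `TAG_NAMES` and `extract_entities` are hypothetical helpers introduced for illustration, and the handling of stray `I-` tags (they are ignored) is an assumption of this sketch.

```python
# Mapping copied from the printed tag list above: integer ID -> tag name.
TAG_NAMES = {0: 'O', 1: 'B-Chemical', 2: 'B-Disease', 3: 'I-Disease', 4: 'I-Chemical'}


def extract_entities(tokens, tag_ids):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = TAG_NAMES[int(tag_id)]  # tags may be stored as strings such as '1'
        if tag.startswith('B-'):
            # A new entity starts: close any entity that is still open.
            if current_tokens:
                entities.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-') and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # 'O' tag, or a stray 'I-' tag that does not continue the open
            # entity: close the open entity, if any, and ignore this token.
            if current_tokens:
                entities.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    # Close an entity that runs to the end of the sentence.
    if current_tokens:
        entities.append((current_type, ' '.join(current_tokens)))
    return entities


# The example training sentence and tags printed earlier in this notebook.
tokens = ['Naloxone', 'reverses', 'the', 'antihypertensive', 'effect', 'of', 'clonidine', '.']
tags = ['1', '0', '0', '0', '0', '0', '1', '0']

entities = extract_entities(tokens, tags)
print(entities)
```

For the example sentence, this yields the two chemical mentions, "Naloxone" and "clonidine". Multi-token entities are joined with spaces, which is a simplification when the original text had different spacing.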
