# Learning with few labels

This notebook examines the effect of two techniques:
1) Semi-supervised learning
2) Label propagation

We work on the NLU Evaluation Data dataset (`nlu_evaluation_data`), a text classification task with 68 intent classes. We formulate it as a multi-label problem: the final classification layer has 68 outputs, and each datapoint's label is one-hot encoded.
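
To make the label format concrete, here is a minimal sketch of one-hot encoding together with a per-class binary cross-entropy on logits, which is a common loss for this kind of multi-label head. The function names are illustrative and not part of the notebook, and whether the classifier ends up using exactly this loss depends on the `transformers` version and configuration.

```python
import math

def one_hot(label, num_classes):
    """Return a float vector with 1.0 at the position of `label`."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over classes, computed from raw logits."""
    total = 0.0
    for logit, target in zip(logits, targets):
        # max(x, 0) - x * t + log(1 + exp(-|x|)) is a numerically stable form
        total += max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))
    return total / len(logits)

targets = one_hot(2, num_classes=4)
# Logits that favor the correct class give a small loss
confident_loss = bce_with_logits([-5.0, -5.0, 5.0, -5.0], targets)
# Logits that favor a wrong class give a much larger loss
wrong_loss = bce_with_logits([5.0, -5.0, -5.0, -5.0], targets)
```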

We split the available training data (25,715 examples) into 1,000 labeled training examples, 5,715 labeled validation examples, and 19,000 examples that we treat as unlabeled. We report loss and accuracy, where the predicted class is the output with the highest logit value.
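
The accuracy described above can be sketched in plain Python as follows. This mirrors the idea of comparing the highest logit with the one-hot label; the helper names and the toy numbers are made up for illustration.

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(logits_batch, one_hot_labels_batch):
    """Fraction of rows where the highest logit matches the one-hot label."""
    hits = sum(
        argmax(logits) == argmax(labels)
        for logits, labels in zip(logits_batch, one_hot_labels_batch)
    )
    return hits / len(logits_batch)

# Toy batch of three predictions over three classes
logits = [[0.1, 2.3, -1.0], [1.5, 0.2, 0.3], [-0.4, 0.0, 0.9]]
labels = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
score = accuracy(logits, labels)
```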

We compare four training scenarios:
1) A classifier trained only on labeled data;
2) A classifier trained on labeled data, on top of a base model further trained on the unlabeled data with masked language modeling (MLM);
3) A classifier trained on labeled data that also benefits from label propagation on the unlabeled data;
4) All together: a classifier trained on labeled data on top of the MLM-trained base model, which also benefits from label propagation on the unlabeled data.
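
For readers new to label propagation, the following is a minimal, generic sketch of the idea on a small graph: labeled nodes keep their labels, and every other node repeatedly takes the weighted average of its neighbors' class scores. It is not the graph-agreement procedure used later in this notebook, and the function name, graph, and iteration count are assumptions chosen for illustration.

```python
def propagate_labels(weights, seed_labels, num_classes, iterations=50):
    """Spread class scores from labeled nodes to their neighbors.

    weights: symmetric adjacency matrix (list of lists of non-negative numbers).
    seed_labels: dict mapping node index -> known class index.
    """
    n = len(weights)
    scores = [[0.0] * num_classes for _ in range(n)]
    for node, label in seed_labels.items():
        scores[node][label] = 1.0
    for _ in range(iterations):
        updated = []
        for node in range(n):
            if node in seed_labels:
                # Clamp labeled nodes to their known label
                updated.append(scores[node][:])
                continue
            degree = sum(weights[node])
            updated.append([
                sum(weights[node][other] * scores[other][c] for other in range(n)) / degree
                if degree else 0.0
                for c in range(num_classes)
            ])
        scores = updated
    # Each node takes the class with the highest propagated score
    return [max(range(num_classes), key=lambda c: row[c]) for row in scores]

# A chain 0 - 1 - 2 - 3 where only the two end nodes are labeled
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
predicted = propagate_labels(chain, seed_labels={0: 0, 3: 1}, num_classes=2)
```

In this toy chain each unlabeled node ends up with the label of the labeled node it is closest to.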


## Importing requirements
You can also add an extra cell to install the needed requirements:
```
!pip install torch
!pip install scikit-learn
!pip install transformers
!pip install datasets
!pip install elasticsearch
```

In [1]:
import random
import torch
import pathlib
import elasticsearch
import numpy as np
from collections import defaultdict
from datasets import list_datasets, load_dataset, concatenate_datasets
from transformers import AutoModelForSequenceClassification, AutoModelForMaskedLM, AutoTokenizer, AutoConfig, AutoModel
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments, EvalPrediction
from transformers.modeling_outputs import SequenceClassifierOutput
from tqdm import tqdm
from sklearn.semi_supervised import LabelPropagation

## Loading dataset

In [2]:
dataset = load_dataset('nlu_evaluation_data', split='train')



In [3]:
# We add the item index as it is useful later
dataset = dataset.map(lambda examples, idx: {'item_number': idx}, with_indices=True)
print(dataset)
print(dataset.features)



Dataset({
    features: ['item_number', 'label', 'scenario', 'text'],
    num_rows: 25715
})
{'item_number': Value(dtype='int64', id=None), 'label': ClassLabel(num_classes=68, names=['alarm_query', 'alarm_remove', 'alarm_set', 'audio_volume_down', 'audio_volume_mute', 'audio_volume_other', 'audio_volume_up', 'calendar_query', 'calendar_remove', 'calendar_set', 'cooking_query', 'cooking_recipe', 'datetime_convert', 'datetime_query', 'email_addcontact', 'email_query', 'email_querycontact', 'email_sendemail', 'general_affirm', 'general_commandstop', 'general_confirm', 'general_dontcare', 'general_explain', 'general_greet', 'general_joke', 'general_negate', 'general_praise', 'general_quirky', 'general_repeat', 'iot_cleaning', 'iot_coffee', 'iot_hue_lightchange', 'iot_hue_lightdim', 'iot_hue_lightoff', 'iot_hue_lighton', 'iot_hue_lightup', 'iot_wemo_off', 'iot_wemo_on', 'lists_createoradd', 'lists_query', 'lists_remove', 'music_dislikeness', 'music_likeness', 'music_query', 'music_settings'

## Creating a tokenizer
We use BERT base uncased.

In [4]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

## Creating splits and sets

In [5]:
# Creating the splits
train_val_datasets = dataset.train_test_split(test_size=5715, shuffle=True, seed=42)
trainl_trainu_datasets = train_val_datasets['train'].train_test_split(test_size=19000, shuffle=True, seed=42)

trainl_dataset = trainl_trainu_datasets['train']
trainu_dataset = trainl_trainu_datasets['test']
val_dataset = train_val_datasets['test']
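
The two nested splits above leave 1,000 labeled, 19,000 unlabeled, and 5,715 validation examples. The sketch below only checks that arithmetic and the disjointness of the three sets; it uses Python's `random` module rather than the shuffling done by `datasets`, so the actual row assignment will differ. `nested_split` is a hypothetical helper.

```python
import random

def nested_split(num_rows, val_size, unlabeled_size, seed=42):
    """Shuffle row indices and carve out validation, unlabeled, and labeled sets."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    val = indices[:val_size]
    unlabeled = indices[val_size:val_size + unlabeled_size]
    labeled = indices[val_size + unlabeled_size:]
    return labeled, unlabeled, val

labeled, unlabeled, val = nested_split(25715, val_size=5715, unlabeled_size=19000)
sizes = (len(labeled), len(unlabeled), len(val))
```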



In [6]:
# Tokenization, one-hot encoding, and formatting the get_item behavior
processed_trainl_dataset = trainl_dataset.map(lambda examples: tokenizer(examples['text'], truncation=True, max_length=256, padding='max_length'), batched=True)
processed_trainl_dataset = processed_trainl_dataset.map(lambda examples: {'labels': [1.0 if i == examples['label'] else 0.0 for i in range(dataset.features['label'].num_classes)]}, batched=False)
processed_trainl_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

processed_trainu_dataset = trainu_dataset.map(lambda examples: tokenizer(examples['text'], truncation=True, max_length=256, padding='max_length'), batched=True)
processed_trainu_dataset = processed_trainu_dataset.map(lambda examples: {'labels': [1.0 if i == examples['label'] else 0.0 for i in range(dataset.features['label'].num_classes)]}, batched=False)
processed_trainu_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

processed_val_dataset = val_dataset.map(lambda examples: tokenizer(examples['text'], truncation=True, max_length=256, padding='max_length'), batched=True)
processed_val_dataset = processed_val_dataset.map(lambda examples: {'labels': [1.0 if i == examples['label'] else 0.0 for i in range(dataset.features['label'].num_classes)]}, batched=False)
processed_val_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
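
As a rough illustration of what `truncation=True, max_length=256, padding='max_length'` does to each example, the sketch below forces a list of token ids to a fixed length and builds the matching attention mask. It deliberately ignores how the real tokenizer keeps special tokens when truncating; `pad_or_truncate` is a hypothetical helper, and the token ids between `101` (`[CLS]`) and `102` (`[SEP]`) are arbitrary.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a token id list to exactly `max_length` and build its attention mask."""
    kept = token_ids[:max_length]
    padding = [pad_id] * (max_length - len(kept))
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return kept + padding, attention_mask

short_ids, short_mask = pad_or_truncate([101, 2275, 2019, 8598, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 13)), max_length=8)
```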


## Creating an Elasticsearch index for later use

In [7]:
# !ES_JAVA_OPTS="-Xms2g -Xmx2g" ./../elasticsearch-7.13.2/bin/elasticsearch

In [8]:
es_client = elasticsearch.Elasticsearch()
es_client.indices.delete(index='nlu_evaluation_data_dataset', ignore=[400, 404])
dataset.add_elasticsearch_index(column='text', es_client=es_client, es_index_name='nlu_evaluation_data_dataset')




Dataset({
    features: ['item_number', 'label', 'scenario', 'text'],
    num_rows: 25715
})
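
Elasticsearch ranks full-text matches with BM25 by default. The sketch below implements a textbook BM25 score over whitespace tokens to give an intuition for what "nearest examples" means here. It is an assumption-laden illustration, not Elasticsearch's implementation: the real index uses its own analyzers, so actual scores will differ. The example texts are made up.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    num_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / num_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + k1 * length_norm)
        scores.append(score)
    return scores

documents = ["set an alarm for seven", "play some music", "what is the weather"]
scores = bm25_scores("set an alarm", documents)
best_match = max(range(len(documents)), key=lambda i: scores[i])
```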

## Creating the MLM trained model
We use the "unlabeled" data to train a model with Masked Language Modeling (MLM). We will use this base model later in some experiments. The training takes several hours of GPU time.
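
As a reminder of what MLM training does to each input, here is a minimal sketch of BERT-style token masking: about 15% of the non-special tokens are selected, and of those 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. Positions that were not selected get the label `-100`, so the loss ignores them. `mask_tokens` is a hypothetical helper written for illustration, not the collator used below.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling."""
    rng = rng or random.Random()
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for position, token in enumerate(token_ids):
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[position] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[position] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[position] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels

# [CLS] = 101, [SEP] = 102, [MASK] = 103 in the bert-base-uncased vocabulary
tokens = [101] + list(range(1000, 1050)) + [102]
inputs, labels = mask_tokens(tokens, mask_id=103, vocab_size=30522,
                             special_ids={101, 102}, rng=random.Random(0))
```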

In [None]:
mlm_model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

mlm_data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

mlm_training_args = TrainingArguments(
    output_dir="./models/nlu_evaluation_data/mlm",
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=8,
    save_strategy='epoch',
    save_total_limit=2,
    evaluation_strategy='epoch'
)

mlm_trainer = Trainer(
    model=mlm_model,
    args=mlm_training_args,
    data_collator=mlm_data_collator,
    train_dataset=processed_trainu_dataset,
    eval_dataset=processed_val_dataset,
)

mlm_trainer.train()

## Training a classifier only on labeled data

In [None]:
classifier_config = AutoConfig.from_pretrained('bert-base-uncased', num_labels=dataset.features['label'].num_classes)
classifier_model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', config=classifier_config)
    
def compute_metrics(p: EvalPrediction):
    accuracy = np.mean(np.argmax(p.predictions, axis=1) == np.argmax(p.label_ids, axis=1))
    return {'accuracy': accuracy}
    
classifier_training_args = TrainingArguments(
    output_dir='./models/nlu_evaluation_data/classifier',
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=64,
    save_strategy='epoch',
    save_total_limit=2,
    evaluation_strategy='epoch'
)

classifier_trainer = Trainer(
    model=classifier_model,
    args=classifier_training_args,
    train_dataset=processed_trainl_dataset,
    eval_dataset=processed_val_dataset,
    compute_metrics=compute_metrics,
)

classifier_trainer.train()

## Training a classifier on labeled data on top of MLM

In [None]:
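# Pick the checkpoint with the highest step number (the Trainer saves to checkpoint-<step>)
mlm_checkpoints = sorted(
    pathlib.Path('./models/nlu_evaluation_data/mlm').glob('checkpoint-*'),
    key=lambda path: int(path.name.split('-')[-1]),
)
last_mlm_checkpoint = str(mlm_checkpoints[-1])
print(f'Using MLM checkpoint {last_mlm_checkpoint}')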

classifier_mlm_config = AutoConfig.from_pretrained('bert-base-uncased', num_labels=dataset.features['label'].num_classes)
classifier_mlm_model = AutoModelForSequenceClassification.from_pretrained(last_mlm_checkpoint, config=classifier_mlm_config)

def compute_metrics(p: EvalPrediction):
    accuracy = np.mean(np.argmax(p.predictions, axis=1) == np.argmax(p.label_ids, axis=1))
    return {'accuracy': accuracy}
    
classifier_mlm_training_args = TrainingArguments(
    output_dir='./models/nlu_evaluation_data/classifier_on_mlm',
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=64,
    save_strategy='epoch',
    save_total_limit=2,
    evaluation_strategy='epoch'
)

classifier_mlm_trainer = Trainer(
    model=classifier_mlm_model,
    args=classifier_mlm_training_args,
    train_dataset=processed_trainl_dataset,
    eval_dataset=processed_val_dataset,
    compute_metrics=compute_metrics,
)

classifier_mlm_trainer.train()

## Graph Agreement Model (GAM)-based training

### Pairing function
We use the function below to create a positive or a negative pair for each datapoint. These pairs are used to train the agreement model.

In [9]:
def pair_up(batch, match):
    """
    Create one positive or one negative pair for every datapoint in a labeled batch.

    A positive pair shares the datapoint's label (and may be the datapoint itself);
    a negative pair comes from a different label present in the same batch.
    """
    if match not in ('positive', 'negative'):
        raise ValueError(f"match must be 'positive' or 'negative', got {match!r}")
    result = {
        'text': [],
        'text_other': [],
        'match': [],
    }
    # Group the batch positions by label so partners can be sampled per label
    label_to_indices = defaultdict(list)
    for index, label in enumerate(batch['label']):
        label_to_indices[label].append(index)
    for label, text in zip(batch['label'], batch['text']):
        if match == 'positive':
            random_positive_index = random.choice(label_to_indices[label])
            text_to_append = batch['text'][random_positive_index]
            match_to_append = 1
        else:
            # Requires at least two distinct labels in the batch
            labels_without_this = [l for l in label_to_indices if l != label]
            random_negative_label = random.choice(labels_without_this)
            random_negative_index = random.choice(label_to_indices[random_negative_label])
            text_to_append = batch['text'][random_negative_index]
            match_to_append = 0
        result['match'].append(match_to_append)
        if random.choice([0, 1]):
            # Swap the texts so the original datapoint is not always first
            result['text'].append(text_to_append)
            result['text_other'].append(text)
        else:
            result['text'].append(text)
            result['text_other'].append(text_to_append)
    return result

### Create agreement training data

In [10]:
paired_dataset_positive = trainl_dataset.map(lambda examples: pair_up(examples, match='positive'), batched=True)
paired_dataset_negative = trainl_dataset.map(lambda examples: pair_up(examples, match='negative'), batched=True)
paired_dataset = concatenate_datasets([paired_dataset_positive, paired_dataset_negative])

paired_dataset = paired_dataset.map(lambda examples: tokenizer(examples['text'], examples['text_other'], truncation=True, max_length=256, padding='max_length'), batched=True)
paired_dataset = paired_dataset.map(lambda examples: {'labels': examples['match']}, batched=True)
# Set the format before splitting so that both splits inherit it
paired_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

paired_train_val_datasets = paired_dataset.train_test_split(test_size=0.2, shuffle=True, seed=42)
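
When the tokenizer receives two texts, as in the cell above, a BERT-style tokenizer joins them into a single sequence of the form `[CLS] text [SEP] text_other [SEP]` and uses `token_type_ids` to mark which segment each token belongs to. The sketch below shows that layout with made-up token ids; `encode_pair` is a hypothetical helper, not a `transformers` function.

```python
def encode_pair(first_ids, second_ids, cls_id=101, sep_id=102):
    """Lay out two token id sequences the way BERT expects a sentence pair."""
    input_ids = [cls_id] + first_ids + [sep_id] + second_ids + [sep_id]
    # Segment 0 covers [CLS], the first text and its [SEP]; segment 1 covers the rest
    token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    return input_ids, token_type_ids

input_ids, token_type_ids = encode_pair([7, 8, 9], [5, 6])
```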



### Train the agreement model

In [None]:
agreement_model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
    
def compute_metrics(p: EvalPrediction):
    accuracy = np.mean(np.argmax(p.predictions, axis=1) == p.label_ids)
    return {'accuracy': accuracy}
    
agreement_training_args = TrainingArguments(
    output_dir='./models/nlu_evaluation_data/agreement',
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    save_strategy='epoch',
    save_total_limit=2,
    evaluation_strategy='epoch'
)

agreement_trainer = Trainer(
    model=agreement_model,
    args=agreement_training_args,
    train_dataset=paired_train_val_datasets['train'],
    eval_dataset=paired_train_val_datasets['test'],
    compute_metrics=compute_metrics,
)

agreement_trainer.train()

### Fish for new confident labels

In [None]:
agreement_model = AutoModelForSequenceClassification.from_pretrained('./models/nlu_evaluation_data/agreement/checkpoint-500')
device = 'cpu'
agreement_model = agreement_model.to(device)
agreement_model.eval()
k = 10  # number of nearest neighbors retrieved per labeled datapoint
max_queries = 10  # debugging limit; set to None to process every labeled datapoint
candidates = defaultdict(list)
for query_number, datapoint_l in enumerate(tqdm(trainl_dataset)):
    if max_queries is not None and query_number >= max_queries:
        break
    text_l = datapoint_l['text']
    # Retrieve the k most similar texts from the Elasticsearch index
    scores, retrieved_examples = dataset.get_nearest_examples(index_name='text', query=text_l, k=k)
    # Pair the labeled text with each retrieved text and score the pairs with the agreement model
    tokenized_pairs = tokenizer([text_l] * len(retrieved_examples['text']), retrieved_examples['text'], truncation=True, max_length=256, padding='max_length')
    batch = {name: torch.tensor(values).to(device) for name, values in tokenized_pairs.items()}
    with torch.no_grad():
        output = agreement_model(**batch)
    candidates[datapoint_l['label']].append((retrieved_examples, output['logits'].cpu().numpy()))

 80%|███████▉  | 798/1000 [15:51<04:02,  1.20s/it]
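
The notebook stops after collecting the agreement logits. One possible next step, sketched below under stated assumptions, is to turn each pair of logits into a match probability and keep only the neighbors above a confidence threshold, proposing the query's label for them. This assumes that logit index 1 corresponds to "match" (as in the `match` column used for training); the threshold of 0.9, the helper name, and the toy values are hypothetical.

```python
import math

def select_confident_neighbors(query_label, neighbor_ids, logits, threshold=0.9):
    """Keep neighbors the agreement model confidently matches to the query.

    logits: one [no_match_logit, match_logit] pair per neighbor.
    Returns a list of (neighbor_id, proposed_label, match_probability).
    """
    selected = []
    for neighbor_id, (no_match, match) in zip(neighbor_ids, logits):
        # A softmax over two logits reduces to a sigmoid of their difference
        match_probability = 1.0 / (1.0 + math.exp(no_match - match))
        if match_probability >= threshold:
            selected.append((neighbor_id, query_label, match_probability))
    return selected

proposals = select_confident_neighbors(
    query_label=2,
    neighbor_ids=[11, 42, 57],
    logits=[[-2.0, 3.0], [1.0, 0.5], [0.0, 1.0]],
)
```

Here only the first neighbor clears the threshold, so it is the only one that would receive a propagated label.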

In [None]:
bool(random.choice([0, 1]))