<a href="https://colab.research.google.com/github/victor-roris/mediumseries/blob/master/NLP/Spacy_Transformers_Train_Text_Categorizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spacy Transformers - Train Text Categorizer

The `spacy-transformers` package makes pretrained Transformer models such as BERT and XLNet available in spaCy. One of the components it adds is `trf_textcat`, which lets you fine-tune one of these models for a specific text classification task.

This notebook is a simplified version of the official spacy-transformers example script: https://github.com/explosion/spacy-transformers/blob/master/examples/train_textcat.py

Here, we train a spaCy BERT model to classify film reviews as positive or negative, using the IMDB dataset.

## Install spacy-transformers

In [0]:
! pip install spacy-transformers

In [0]:
! python -m spacy download en_trf_bertbaseuncased_lg
! python -m spacy download en_trf_xlnetbasecased_lg

Restart the environment once the models have been downloaded.

## Train textcat

Imports

In [1]:
import re
import random
from collections import Counter
import thinc.extra.datasets
import spacy
import torch
from spacy.util import minibatch
import tqdm
import unicodedata
import wasabi
from spacy_transformers.util import cyclic_triangular_rate

Definition of main variables

In [0]:
model = "en_trf_bertbaseuncased_lg" # or "en_trf_xlnetbasecased_lg"
max_wpb=1000
n_texts=100
learn_rate=2e-5
batch_size=8

Use the GPU if one is available

In [0]:
spacy.util.fix_random_seed(0)
is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")

**Load data**: we use the IMDB dataset as an example. The original script also includes an example of loading a dataset from the local filesystem.

In [29]:
# load the IMDB dataset
print("Loading IMDB data...")

white_re = re.compile(r"\s\s+")
def preprocess_text(text):
    text = text.replace("<s>", "<open-s-tag>")
    text = text.replace("</s>", "<close-s-tag>")
    text = white_re.sub(" ", text).strip()
    return "".join(
        c for c in unicodedata.normalize("NFD", text) if unicodedata.category(c) != "Mn"
    )

def _prepare_partition(text_label_tuples, *, preprocess=False):
    texts, labels = zip(*text_label_tuples)
    if preprocess:
        # Preprocessing can mask errors in our handling of noisy text, so
        # we don't want to do it by default
        texts = [preprocess_text(text) for text in texts]
    cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
    return texts, cats

def load_data(*, limit=0, dev_size=2000):
    """Load data from the IMDB dataset, splitting off a held-out set."""
    if limit != 0:
        limit += dev_size
    assert dev_size != 0
    train_data, _ = thinc.extra.datasets.imdb(limit=limit)
    assert len(train_data) > dev_size
    random.shuffle(train_data)
    dev_data = train_data[:dev_size]
    train_data = train_data[dev_size:]
    train_texts, train_labels = _prepare_partition(train_data, preprocess=False)
    dev_texts, dev_labels = _prepare_partition(dev_data, preprocess=False)
    return (train_texts, train_labels), (dev_texts, dev_labels)

(train_texts, train_cats), (eval_texts, eval_cats) = load_data(limit=n_texts)
print(f"Using {len(train_texts)} training docs, {len(eval_texts)} evaluation")

Loading IMDB data...
Using 100 training docs, 2000 evaluation
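The `preprocess_text` helper above is defined but not applied in this run (`preprocess=False`). As a rough illustration of what it would do if enabled, the following standalone sketch repeats the same three steps on a made-up string. The name `preprocess_text_demo` and the sample text are invented for illustration only.

```python
import re
import unicodedata

# Same pattern as above: two or more consecutive whitespace characters.
demo_white_re = re.compile(r"\s\s+")

def preprocess_text_demo(text):
    # 1. Rename literal "<s>" and "</s>" markers found in the raw text.
    text = text.replace("<s>", "<open-s-tag>")
    text = text.replace("</s>", "<close-s-tag>")
    # 2. Collapse runs of whitespace (including blank lines) into one space.
    text = demo_white_re.sub(" ", text).strip()
    # 3. Decompose accented characters (NFD) and drop the combining marks
    #    (Unicode category "Mn"), which leaves the unaccented base letters.
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )

sample = "Caf\u00e9   was <s>great</s>\n\n\nReally."
cleaned = preprocess_text_demo(sample)
print(cleaned)
```

Passing `preprocess=True` to `_prepare_partition` would apply these steps to every review before training.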


In [34]:
print(f'Category : {train_cats[0]}')
print(f' - {train_texts[0][0:100]}...')

Category : {'POSITIVE': False, 'NEGATIVE': True}
 - To make a good movie you either need excellent actors or an excellent director. You need at least on...


Adapt the input data to the format expected for training

In [36]:
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
train_data[0]

("To make a good movie you either need excellent actors or an excellent director. You need at least one of the two. In this Eye of the Needle we have none.\n\n\n\nI don't even remember the name of the director. He mustn't have done much in his career. I like very much Donald Sutherland but he absolutely cannot be the main actor in a movie. He falls short. Sutherland is excellent in a movie when he appears for not more than 15 minutes. I would say for instance that Sutherland was excellent in JFK of Oliver Stone when he talked to Kevin Costner on the bench of a park for 10 minutes non-stop without even taking a breath. Wonderful. But Sutherland being the principal actor in a movie is no good.\n\n\n\nKate Nelligan? She is probably good for TV series. The DVD is awful. Terrible colors. Terrible light. I couldn't even appreciate the scenery of Storm Island for how lousy the photography was.\n\n\n\nThis Ken Follett story was good but it's a pity they turned it into an uninteresting movie.",

**Load spaCy model**

In [17]:
nlp = spacy.load(model)
print(f'The initial pipeline components are:  {nlp.pipe_names}')

['sentencizer', 'trf_wordpiecer', 'trf_tok2vec']


**Define trf_textcat**

We add the `trf_textcat` component provided by spacy-transformers, which uses transfer learning to adapt a pre-trained model to a specific task.

In [0]:
textcat = nlp.create_pipe(
        "trf_textcat",
        config={"architecture": "softmax_last_hidden", "words_per_batch": max_wpb},
    )

In [27]:
# add label to text classifier
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
pos_label = "POSITIVE"
print("Labels:", textcat.labels)
print("Positive label for evaluation:", pos_label)

Labels: ('POSITIVE', 'NEGATIVE')
Positive label for evaluation: POSITIVE


In [28]:
nlp.add_pipe(textcat, last=True)
print(f'The new pipeline components are:  {nlp.pipe_names}')

['sentencizer', 'trf_wordpiecer', 'trf_tok2vec', 'trf_textcat']


**Train model**

Initialize configuration

In [41]:
# Initialize the TextCategorizer, and create an optimizer.
optimizer = nlp.resume_training()
optimizer.alpha = 0.001
optimizer.trf_weight_decay = 0.005
optimizer.L2 = 0.0
learn_rates = cyclic_triangular_rate(
    learn_rate / 3, learn_rate * 3, 2 * len(train_data) // batch_size
)
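`cyclic_triangular_rate` yields an endless sequence of learning rates that the training loop consumes one step at a time. To make the shape concrete, here is a minimal pure-Python sketch of a triangular schedule. It assumes the rate climbs linearly from the minimum to the maximum over `period` steps and falls back over the next `period` steps; the function name `triangular_rates` is hypothetical and this is not the library's implementation.

```python
import itertools

def triangular_rates(min_lr, max_lr, period):
    """Yield learning rates forever, ramping min -> max -> min linearly.

    Hypothetical helper for illustration; `period` is assumed to be the
    number of steps in each ramp (half of a full cycle).
    """
    step = 0
    while True:
        position = step % (2 * period)      # position inside the current cycle
        distance = abs(position - period)   # number of steps away from the peak
        fraction = 1.0 - distance / period  # 0.0 at the cycle ends, 1.0 at the peak
        yield min_lr + (max_lr - min_lr) * fraction
        step += 1

base_lr = 2e-5
demo_period = 2 * 100 // 8  # 100 training docs, batch size 8 -> 25 steps
rates = list(
    itertools.islice(
        triangular_rates(base_lr / 3, base_lr * 3, demo_period),
        2 * demo_period + 1,
    )
)
print(rates[0], rates[demo_period], rates[2 * demo_period])
```

With the values used in this notebook, the rate therefore oscillates between `learn_rate / 3` and `learn_rate * 3` around the base value of `2e-5`.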


Run training

In [0]:
def evaluate(nlp, texts, cats, pos_label):
    tp = 0.0  # True positives
    fp = 0.0  # False positives
    fn = 0.0  # False negatives
    tn = 0.0  # True negatives
    total_words = sum(len(text.split()) for text in texts)
    with tqdm.tqdm(total=total_words, leave=False) as pbar:
        for i, doc in enumerate(nlp.pipe(texts, batch_size=8)):
            gold = cats[i]
            for label, score in doc.cats.items():
                if label not in gold:
                    continue
                if label != pos_label:
                    continue
                if score >= 0.5 and gold[label] >= 0.5:
                    tp += 1.0
                elif score >= 0.5 and gold[label] < 0.5:
                    fp += 1.0
                elif score < 0.5 and gold[label] < 0.5:
                    tn += 1
                elif score < 0.5 and gold[label] >= 0.5:
                    fn += 1
            pbar.update(len(doc.text.split()))
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    if (precision + recall) == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}
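The scores returned by `evaluate` are the usual precision, recall and F-score for the positive label. A small worked example with made-up counts (not results from this notebook) shows how the three numbers relate; the helper name `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    """Compute precision, recall and F-score the same way `evaluate` does."""
    # The small epsilon avoids a division by zero when there are no predictions.
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_score

# Made-up counts: 8 reviews correctly flagged POSITIVE, 2 wrongly flagged
# POSITIVE, and 4 positive reviews that were missed.
p, r, f = prf(tp=8, fp=2, fn=4)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # P=0.800 R=0.667 F=0.727
```

True negatives are counted in `evaluate` but do not enter any of the three scores.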

In [0]:
print("Training the model...")
print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))

pbar = tqdm.tqdm(total=100, leave=False)
results = []
epoch = 0
step = 0
eval_every = 100
patience = 3
while True:
    # Train and evaluate
    losses = Counter()
    random.shuffle(train_data)
    batches = minibatch(train_data, size=batch_size)
    for batch in batches:
        optimizer.trf_lr = next(learn_rates)
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.1, losses=losses)
        pbar.update(1)
        if step and (step % eval_every) == 0:
            pbar.close()
            with nlp.use_params(optimizer.averages):
                scores = evaluate(nlp, eval_texts, eval_cats, pos_label)
            results.append((scores["textcat_f"], step, epoch))
            print(
                "{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format(
                    losses["trf_textcat"],
                    scores["textcat_p"],
                    scores["textcat_r"],
                    scores["textcat_f"],
                )
            )
            pbar = tqdm.tqdm(total=eval_every, leave=False)
        step += 1
    epoch += 1
    # Stop if there has been no improvement in the last `patience` checkpoints
    if results:
        best_score, best_step, best_epoch = max(results)
        if ((step - best_step) // eval_every) >= patience:
            break

  0%|          | 0/100 [00:00<?, ?it/s]

Training the model...
LOSS 	  P  	  R  	  F  


 53%|█████▎    | 53/100 [37:53<28:55, 36.93s/it]
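The stopping rule at the end of the training loop can be read in isolation: training ends once at least `patience` evaluation intervals have passed since the best-scoring checkpoint. The sketch below restates that check as a standalone function. The name `should_stop` and the checkpoint history are made up for illustration.

```python
def should_stop(history, step, eval_every, patience):
    """Return True when the best checkpoint is at least `patience` intervals old.

    `history` holds (f_score, step, epoch) tuples, like `results` above.
    """
    if not history:
        return False
    # Tuples compare element by element, so max() picks the highest F-score
    # and, on a tie, the later step.
    best_score, best_step, best_epoch = max(history)
    return (step - best_step) // eval_every >= patience

# Made-up history: the best F-score (0.82) was reached at step 200.
history = [(0.70, 100, 7), (0.82, 200, 15), (0.80, 300, 23), (0.81, 400, 30)]
print(should_stop(history, step=416, eval_every=100, patience=3))  # False

history.append((0.79, 500, 38))
print(should_stop(history, step=507, eval_every=100, patience=3))  # True
```

Because the check runs only at the end of an epoch, training can continue for a few extra steps after the patience limit is reached.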

In [0]:
msg = wasabi.Printer()
table_widths = [2, 4, 6]
msg.info("Best scoring checkpoints")
msg.row(["Epoch", "Step", "Score"], widths=table_widths)
msg.row(["-" * width for width in table_widths])
for score, step, epoch in sorted(results, reverse=True)[:10]:
    msg.row([epoch, step, "%.2f" % (score * 100)], widths=table_widths)

**Testing the model**

In [0]:
# Test the trained model
test_text = eval_texts[0]
doc = nlp(test_text)
print(test_text, doc.cats)
