<a href="https://colab.research.google.com/github/victor-roris/mediumseries/blob/master/NLP/Spacy_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# spaCy meets Transformers

https://explosion.ai/blog/spacy-transformers

spaCy has a new interface library, `spacy-transformers`, that connects spaCy to Hugging Face's transformers implementations.

This library adds new components to the spaCy pipeline:
 - **trf_wordpiecer**: performs the model's wordpiece pre-processing (BERT or XLNet; for example, BERT splits 'encode' into 'en' and '##code')
 - **trf_tok2vec**: runs the transformer over the doc and saves the results in the built-in `doc.tensor` attribute and in several extension attributes.
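
To make the wordpiece step more concrete, here is a minimal sketch of greedy longest-match-first splitting, the general idea behind BERT-style wordpieces. It is not the code that `trf_wordpiecer` runs: the function name `wordpiece_tokenize` and the tiny `TOY_VOCAB` are assumptions made up for this illustration.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercased word into wordpieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical; a real BERT vocabulary is far larger).
TOY_VOCAB = {"here", "is", "some", "text", "to", "en", "##code", "."}

print(wordpiece_tokenize("encode", TOY_VOCAB))  # ['en', '##code']
print(wordpiece_tokenize("text", TOY_VOCAB))    # ['text']
```

A word that cannot be covered by the vocabulary falls back to the unknown marker, which is why real vocabularies also contain single characters.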

## Installation

In [1]:
! pip install spacy-transformers

Collecting spacy-transformers
  Downloading https://files.pythonhosted.org/packages/16/fb/5dbcf7391d6ba0003fb922737340bff5033729f9c967f08f0468259c4f6a/spacy-transformers-0.5.1.tar.gz (59kB)
Collecting spacy<2.3.0,>=2.2.1
  Downloading https://files.pythonhosted.org/packages/47/13/80ad28ef7a16e2a86d16d73e28588be5f1085afd3e85e4b9b912bd700e8a/spacy-2.2.3-cp36-cp36m-manylinux1_x86_64.whl (10.4MB)
Collecting transformers<2.1.0,>=2.0.0
  Downloading https://files.pythonhosted.org/packages/66/99/ca0e4c35ccde7d290de3c9c236d5629d1879b04927e5ace9bd6d9183e236/transformers-2.0.0-py3-none-any.whl (290kB)
Collecting torchcontrib<0.1.0,>=0.0.2
  Downloading https://files.pythonhosted.org/packages/72/36/45d475035ab35353911e72a03c1c1210eba63b71e5a6917a9e78a046aa10/torchcontrib-0.0.2.tar.gz
Collecting

In [3]:
! python -m spacy download en_trf_bertbaseuncased_lg
! python -m spacy download en_trf_xlnetbasecased_lg

Collecting en_trf_bertbaseuncased_lg==2.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_trf_bertbaseuncased_lg-2.2.0/en_trf_bertbaseuncased_lg-2.2.0.tar.gz (405.8MB)
Building wheels for collected packages: en-trf-bertbaseuncased-lg
  Building wheel for en-trf-bertbaseuncased-lg (setup.py) ... done
  Created wheel for en-trf-bertbaseuncased-lg: filename=en_trf_bertbaseuncased_lg-2.2.0-cp36-none-any.whl size=405819945 sha256=59f61c92d68f3bbd23d333e6f211cd12a2797c7fce48a607d7715df0e924c7f7
  Stored in directory: /tmp/pip-ephem-wheel-cache-zr0h2sfa/wheels/f6/60/8c/c6f517ef9729972f1be15c3aab4b93e7ec9fbeb71d072a84de
Successfully built en-trf-bertbaseuncased-lg
Installing collected packages: en-trf-bertbaseuncased-lg
Successfully installed en-trf-bertbaseuncased-lg-2.2.0
✔ Download and installation successful
You can now load the model via spacy.load('en_trf_bertbaseunc

Restart the runtime after the models have been downloaded so that the newly installed packages can be imported.

## Basic introduction example

In [36]:
import spacy
import torch
import numpy
from numpy.testing import assert_almost_equal

is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")

nlp = spacy.load("en_trf_bertbaseuncased_lg")

text = "Here is some text to encode."
doc = nlp(text)
print(f'Text to analyse: "{text}" \n')

print(f'Tokens in the text ({len(doc)}):')
for token in doc:
  print(f'\t{token.text}')
print()

assert doc.tensor.shape == (7, 768)  # Always has one row per token
print(f'spaCy doc tensor with a row per text token : {doc.tensor.shape} \n')


print('spaCy transformers attributes : ')
print(f'\t - String values of the wordpieces:')  
print(f'\t\t > doc._.trf_word_pieces_ = {doc._.trf_word_pieces_}')  # String values of the wordpieces
print(f'\t - Wordpiece IDs (note: *not* spaCy`s hash values!):')  
print(f'\t\t > doc._.trf_word_pieces = {doc._.trf_word_pieces}')  # Wordpiece IDs (note: *not* spaCy's hash values!)
print(f'\t - Alignment between spaCy tokens and wordpieces:')  
print(f'\t\t > doc._.trf_alignment = {doc._.trf_alignment}')  # Alignment between spaCy tokens and wordpieces
print()

# The raw transformer output has one row per wordpiece.
assert len(doc._.trf_last_hidden_state) == len(doc._.trf_word_pieces)
print(f'doc._.trf_outputs.last_hidden_state - gives you a tensor with one row per wordpiece token. {doc._.trf_last_hidden_state.shape}')
print()

# To avoid losing information, we calculate the doc.tensor attribute such that
# the sum-pooled vectors match (apart from numeric error)
assert_almost_equal(doc.tensor.sum(axis=0), doc._.trf_last_hidden_state.sum(axis=0), decimal=5)
print("The sum-pooled vector from the 'doc.tensor' and the 'trf_last_hidden_state' are practically equals ")
print(f'\t > sum(doc.tensor.sum(axis=0)) = {sum(doc.tensor.sum(axis=0))}')
print(f'\t > sum(doc._.trf_last_hidden_state.sum(axis=0)) = {sum(doc._.trf_last_hidden_state.sum(axis=0))}')
print()

# Access the tensor from Span elements (especially helpful for sentences)
span = doc[2:4]
assert numpy.array_equal(span.tensor, doc.tensor[2:4])
print('Is the same access to a span tensor than the doc tensor limit to the span')
print(f'- Span = doc[2:4] : {span}')
print(f'- span.tensor : {span.tensor}')
print(f'- doc.tensor[2:4] : {doc.tensor[2:4]}')
print()


# .vector and .similarity use the transformer outputs
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")
print('WORD SIMILARITY:')
print(apple1[0].similarity(apple2[0]))  # 0.73428553
print(apple1[0].similarity(apple3[0]))  # 0.43365782

Text to analyse: "Here is some text to encode." 

Tokens in the text (7):
	Here
	is
	some
	text
	to
	encode
	.

spaCy doc tensor with a row per text token : (7, 768) 

spaCy transformers attributes : 
	 - String values of the wordpieces:
		 > doc._.trf_word_pieces_ = ['[CLS]', 'here', 'is', 'some', 'text', 'to', 'en', '##code', '.', '[SEP]']
	 - Wordpiece IDs (note: *not* spaCy`s hash values!):
		 > doc._.trf_word_pieces = [101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 1012, 102]
	 - Alignment between spaCy tokens and wordpieces:
		 > doc._.trf_alignment = [[1], [2], [3], [4], [5], [6, 7], [8]]

doc._.trf_outputs.last_hidden_state - gives you a tensor with one row per wordpiece token. (10, 768)

The sum-pooled vector from the 'doc.tensor' and the 'trf_last_hidden_state' are practically equals 
	 > sum(doc.tensor.sum(axis=0)) = -99.45763402432203
	 > sum(doc._.trf_last_hidden_state.sum(axis=0)) = -99.45762564986944

Is the same access to a span tensor than the doc tensor limit to the 
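
The example above asserts that `doc.tensor` (one row per token) and `trf_last_hidden_state` (one row per wordpiece) have matching sum-pooled vectors. The NumPy sketch below shows one way such a sum-preserving alignment could be computed. It is an assumption for illustration only and not necessarily what spacy-transformers does internally: the helper `align_to_tokens` is hypothetical, and the rule of spreading unaligned rows (such as `[CLS]` and `[SEP]`) evenly over all tokens is my own choice.

```python
import numpy as np


def align_to_tokens(wp_rows, alignment):
    """Build one row per token from wordpiece rows while preserving the column sums."""
    n_wp, width = wp_rows.shape
    # Count how many tokens reference each wordpiece row.
    counts = np.zeros(n_wp)
    for wp_indices in alignment:
        for i in wp_indices:
            counts[i] += 1
    token_rows = np.zeros((len(alignment), width))
    # A wordpiece row shared by several tokens is divided equally among them.
    for t, wp_indices in enumerate(alignment):
        for i in wp_indices:
            token_rows[t] += wp_rows[i] / counts[i]
    # Rows that no token references are spread evenly over all tokens (assumption).
    unaligned = wp_rows[counts == 0].sum(axis=0)
    token_rows += unaligned / len(alignment)
    return token_rows


rng = np.random.default_rng(0)
wp_rows = rng.normal(size=(10, 4))  # 10 wordpieces, toy width 4 instead of 768
alignment = [[1], [2], [3], [4], [5], [6, 7], [8]]  # copied from the output above
token_rows = align_to_tokens(wp_rows, alignment)

print(token_rows.shape)  # (7, 4)
np.testing.assert_almost_equal(token_rows.sum(axis=0), wp_rows.sum(axis=0), decimal=5)
```

The token "encode" is aligned to two wordpieces (`[6, 7]`), so its row is the sum of both wordpiece rows plus its share of the unaligned rows.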

## Transfer learning

For a more advanced example, see https://github.com/explosion/spacy-transformers/blob/master/examples/train_textcat.py

With transfer learning, you load a large generic model pretrained on lots of text and then continue training it on your smaller dataset with labels specific to your problem.

I use definitions of a cat and descriptions of Boris Johnson to train the model. Note: I chose cats as the topic because I was a bit conditioned by the `cats` key, but I now think this was a bad decision, because a reader might think the `cats` key is related to the cat definitions. It is not: the `cats` key in the second entry of each tuple stands for *categories*.
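
As a small illustration of that annotation format, the sketch below builds one training tuple. The helper `make_example` is hypothetical and only shows how the `cats` dictionary maps each category label to a score.

```python
def make_example(text, is_cat_definition):
    """Return a (text, annotations) tuple in the format used by the training data."""
    positive = 1.0 if is_cat_definition else 0.0
    # "cats" means categories: one score per label, here mutually exclusive.
    return (text, {"cats": {"POSITIVE": positive, "NEGATIVE": 1.0 - positive}})


example = make_example("a furry animal that has a long tail and sharp claws.", True)
print(example[1])  # {'cats': {'POSITIVE': 1.0, 'NEGATIVE': 0.0}}
```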

In [0]:
TRAIN_DATA = [
    # CAT
    ("a small domesticated carnivorous mammal with soft fur, a short snout, and retractable claws. It is widely kept as a pet or for catching mice, and many breeds have been developed.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("a small animal with fur, four legs, a tail, and claws, usually kept as a pet or for catching mice", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("a small, furry animal with four legs and a tail, often kept as a pet, or any of a group of related animals that are wild, and some of which are large and fierce, such as the lion", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("a carnivorous mammal (Felis catus) long domesticated as a pet and for catching rats and mice.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("a small domesticated carnivore, Felis domestica or F. catus, bred in a number of varieties", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("a furry animal that has a long tail and sharp claws. Cats are often kept as pets.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),

    # No-CAT (Alexander Boris de Pfeffel Johnson Hon FRIBA)
    ("is a British politician, writer, and former journalist who has served as Prime Minister of the United Kingdom ", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("is a leading Conservative politician, who was elected leader of the Conservative Party in the summer of 2019, becoming Prime Minister", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("is a British politician, popular historian, and journalist who is Prime Minister of the United Kingdom ", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("Born in New York City on June 19, 1964 to British parents, Johnson spent his first five years in Manhattan while his father was studying economics at Columbia University. Johnson renounced his US citizenship in 2016, likely to avoid the capital gains taxes Uncle Sam levies on expat American citizens. He has English, French, Swiss, Russian and Lithuanian Jewish heritage, and his paternal great-grandfather was a prominent Turkish journalist and politician.", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("Prime Minister of the United Kingdom and leader of the Conservative Party.", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("is to be the U.K.'s next prime minister but the charismatic and controversial figure will already divides the party and British public ", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("is one of Britain's most famous politicians and was a leading figure of the successful Brexit campaign.", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("became Prime Minister on 24 July 2019. He was previously Foreign Secretary from 13 July 2016 to 9 July 2018. He was elected Conservative MP", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}})  
]

The `trf_textcat` component is based on spaCy's built-in TextCategorizer and supports using the features assigned by the transformer models, via the `trf_tok2vec` component. This lets you use a model like BERT to predict contextual token representations, and then learn a text categorizer on top as a task-specific "head". 
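
To give a rough idea of what a task-specific "head" means, here is a minimal NumPy sketch: pool the token vectors into one document vector, apply a linear layer, and turn the scores into probabilities with a softmax. This is not the actual architecture of `trf_textcat`. The mean pooling, the function names, and the random stand-in for `doc.tensor` are all assumptions for illustration.

```python
import numpy as np


def softmax(scores):
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def classify(token_vectors, weights, bias, labels):
    """Pool token vectors into one document vector and score each label."""
    doc_vector = token_vectors.mean(axis=0)  # shape: (width,)
    scores = weights @ doc_vector + bias     # shape: (n_labels,)
    probs = softmax(scores)                  # mutually exclusive classes
    return dict(zip(labels, probs.tolist()))


rng = np.random.default_rng(0)
token_vectors = rng.normal(size=(7, 768))        # random stand-in for doc.tensor
weights = rng.normal(scale=0.01, size=(2, 768))  # untrained head parameters
bias = np.zeros(2)

cats = classify(token_vectors, weights, bias, ["POSITIVE", "NEGATIVE"])
print(cats)
```

During training, only a head like this has to be learned from scratch, while the transformer weights start from their pretrained values.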

In [55]:
import spacy
from spacy.util import minibatch
import random
import torch

is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")

nlp = spacy.load("en_trf_bertbaseuncased_lg")
print(nlp.pipe_names) # ["sentencizer", "trf_wordpiecer", "trf_tok2vec"]

['sentencizer', 'trf_wordpiecer', 'trf_tok2vec']


Add the `trf_textcat` component to the pipeline. Because it is a text categorizer, you must also register the labels it should predict.

In [56]:
if "trf_textcat" not in nlp.pipe_names:
    # exclusive_classes=True: each text belongs to exactly one category
    textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": True})
    for label in ("POSITIVE", "NEGATIVE"):
        textcat.add_label(label)
    nlp.add_pipe(textcat)
print(nlp.pipe_names)  # ["sentencizer", "trf_wordpiecer", "trf_tok2vec", "trf_textcat"]

['sentencizer', 'trf_wordpiecer', 'trf_tok2vec', 'trf_textcat']


Train the classifier on the training data with spaCy's `nlp.update` method.

In [57]:
optimizer = nlp.resume_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        texts, cats = zip(*batch)
        nlp.update(texts, cats, sgd=optimizer, losses=losses)
    print(i, losses)

0 {'trf_textcat': 0.0211661821231246}
1 {'trf_textcat': 0.008601571433246136}
2 {'trf_textcat': 0.0005061183619545773}
3 {'trf_textcat': 2.357704215683043e-05}
4 {'trf_textcat': 1.6146023540386523e-06}
5 {'trf_textcat': 0.006042719430276122}
6 {'trf_textcat': 0.04471092019230127}
7 {'trf_textcat': 0.028651805594563484}
8 {'trf_textcat': 0.050194691866636276}
9 {'trf_textcat': 0.03716572746634483}
10 {'trf_textcat': 0.03427313361316919}
11 {'trf_textcat': 0.023082171566784382}
12 {'trf_textcat': 0.026277894154191017}
13 {'trf_textcat': 0.024077199399471283}
14 {'trf_textcat': 0.02029687538743019}
15 {'trf_textcat': 0.021693839691579342}
16 {'trf_textcat': 0.025801432318985462}
17 {'trf_textcat': 0.021130203269422054}
18 {'trf_textcat': 0.022423338145017624}
19 {'trf_textcat': 0.023525687865912914}
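
The training loop relies on two small mechanics: splitting the data into batches and unzipping each batch into texts and annotations. The pure-Python sketch below mimics them on toy data. `simple_minibatch` is a hypothetical stand-in for `spacy.util.minibatch` with a fixed batch size, not the library function itself.

```python
def simple_minibatch(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


toy_data = [
    ("text a", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("text b", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("text c", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
]

for batch in simple_minibatch(toy_data, size=2):
    # zip(*batch) turns a list of (text, annotations) pairs into two parallel tuples.
    texts, annotations = zip(*batch)
    print(len(batch), texts)
# 2 ('text a', 'text b')
# 1 ('text c',)
```

With the 14 training examples above and `size=8`, each epoch therefore consists of only two updates, one with 8 examples and one with 6.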


Test the trained model

In [0]:
EVALUATION_DATA = [
    "a small, furry, carnivorous animal often kept as a pet",
    "a small, lithe, soft-furred animal (Felis cattus) of this family, domesticated since ancient times and often kept as a pet or for killing mice",
    "has been the Prime Minister of the United Kingdom and Leader of the Conservative Party since July 2019. ",
    "is the most popular Conservative politician and the most famous. He is described by fans as: Conservative, Confident, Humorous, ..."
]

In [59]:
for eval_test in EVALUATION_DATA:
  doc = nlp(eval_test)
  print(f' TEXT : {eval_test}')
  print(f' CAT : {doc.cats}')
  print('---')

 TEXT : a small, furry, carnivorous animal often kept as a pet
 CAT : {'POSITIVE': 0.5505148768424988, 'NEGATIVE': 0.44948509335517883}
---
 TEXT : a small, lithe, soft-furred animal (Felis cattus) of this family, domesticated since ancient times and often kept as a pet or for killing mice
 CAT : {'POSITIVE': 0.5505148768424988, 'NEGATIVE': 0.44948509335517883}
---
 TEXT : has been the Prime Minister of the United Kingdom and Leader of the Conservative Party since July 2019. 
 CAT : {'POSITIVE': 0.5505149364471436, 'NEGATIVE': 0.44948509335517883}
---
 TEXT : is the most popular Conservative politician and the most famous. Boris Johnson is described by fans as: Conservative, Confident, Humorous, ...
 CAT : {'POSITIVE': 0.5505148768424988, 'NEGATIVE': 0.44948509335517883}
---


*The results are poor: every text gets the same scores, so the classifier has not learned to separate the two categories. Probably I don't have enough training data.*
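
To put a number on this, the sketch below computes the accuracy from `doc.cats`-style dictionaries. The helper names are hypothetical, the scores are copied (rounded) from the output above, and the gold labels are my reading of the four evaluation texts: two cat definitions followed by two descriptions of Boris Johnson.

```python
def predicted_label(cats):
    """Return the label with the highest score."""
    return max(cats, key=cats.get)


def accuracy(predictions, gold_labels):
    """Fraction of texts whose top-scoring label equals the gold label."""
    hits = sum(predicted_label(c) == g for c, g in zip(predictions, gold_labels))
    return hits / len(gold_labels)


# Every evaluation text received (almost exactly) the same scores.
predictions = [{"POSITIVE": 0.5505, "NEGATIVE": 0.4495}] * 4
gold_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

print(accuracy(predictions, gold_labels))  # 0.5
```

An accuracy of 0.5 on a balanced two-class set is what a classifier that always answers `POSITIVE` would get.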

Store the model

In [0]:
nlp.to_disk("/bert-textcat")

Use the stored model

In [62]:
# from_disk() loads the saved data into the existing pipeline object and returns it
nlp_berttextcat = nlp.from_disk("/bert-textcat")
doc = nlp_berttextcat("Alexander Boris is a British politician, writer, and former journalist who has served as Prime Minister of the United Kingdom ")
print(f' CAT : {doc.cats}')

 CAT : {'POSITIVE': 0.5505148768424988, 'NEGATIVE': 0.44948509335517883}
