<a href="https://colab.research.google.com/github/victor-roris/mediumseries/blob/master/NLP/Spacy_PyTorch_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# spaCy PyTorch Transformers

This notebook collects code found online. The original code is an adaptation of Paco's fantastic work:

https://github.com/DerwenAI/spaCy_tuTorial/blob/master/spaCy_transformers_demo.ipynb 

## Install and import

In [2]:
!pip install spacy
!pip install spacy-pytorch-transformers
!python -m spacy download en_pytt_bertbaseuncased_lg
!python -m spacy download en_pytt_xlnetbasecased_lg
!pip install numpy

Collecting spacy-pytorch-transformers
  Downloading https://files.pythonhosted.org/packages/fd/46/3271586944ee5e0bd493df03b1ad189eb9ccdad1d2476aeb843b0d2f1b47/spacy_pytorch_transformers-0.4.0-py3-none-any.whl (62kB)
Collecting dataclasses<0.7,>=0.6; python_version < "3.7"
  Downloading https://files.pythonhosted.org/packages/26/2f/1095cdc2868052dd1e64520f7c0d5c8c550ad297e944e641dbf1ffbb9a5d/dataclasses-0.6-py3-none-any.whl
Collecting ftfy<6.0.0,>=5.0.0
  Downloading https://files.pythonhosted.org/packages/75/ca/2d9a5030eaf1bcd925dab392762b9709a7ad4bd486a90599d93cd79cb188/ftfy-5.6.tar.gz (58kB)
Collecting pytorch-transformers<1.3.0,>=1.2.0
  Downloading https://files.pythonhosted.org/packages/a3/b7/d3d18008a67e0b968d1ab93ad444fc05699403fa662f634b2f2c318a508b/pytorch_transformers-1.2.0-py3-none-any.whl (176kB)

In [3]:
!pip install -q torch==1.1.0 torchvision

ERROR: torchvision 0.4.2+cu100 has requirement torch==1.3.1, but you'll have torch 1.1.0 which is incompatible.

Use the GPU if one is available

In [0]:
import spacy
import torch
from numpy.testing import assert_almost_equal

is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")


Load the spaCy BERT model

In [0]:
nlp = spacy.load("en_pytt_bertbaseuncased_lg")

Tokenize and encode a text

In [0]:
doc = nlp("Here is some text to encode.")
assert doc.tensor.shape == (7, 768)  # Always has one row per token including punctuation

In [4]:
print(doc._.pytt_word_pieces_)  # String values of the wordpieces
print(doc._.pytt_word_pieces)  # Wordpiece IDs (note: *not* spaCy's hash values!)
print(doc._.pytt_alignment)  # Alignment between spaCy tokens and wordpieces

# The raw transformer output has one row per wordpiece.
assert len(doc._.pytt_last_hidden_state) == len(doc._.pytt_word_pieces)

['[CLS]', 'here', 'is', 'some', 'text', 'to', 'en', '##code', '.', '[SEP]']
[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 1012, 102]
[[1], [2], [3], [4], [5], [6, 7], [8]]
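
To make the alignment easier to read, the sketch below pairs each spaCy token with its wordpieces. It is a plain-Python illustration that copies the printed values above instead of loading the model; the `tokens` list and the helper names `pieces_per_token` and `unaligned_indices` are assumptions introduced here, not part of the library.

```python
# Values copied from the printed output above.
word_pieces = ['[CLS]', 'here', 'is', 'some', 'text', 'to', 'en', '##code', '.', '[SEP]']
alignment = [[1], [2], [3], [4], [5], [6, 7], [8]]

# Assumption: the seven spaCy tokens of "Here is some text to encode."
tokens = ["Here", "is", "some", "text", "to", "encode", "."]


def pieces_per_token(tokens, word_pieces, alignment):
    """Pair each token with the wordpiece strings it is aligned to."""
    return [(tok, [word_pieces[i] for i in idxs]) for tok, idxs in zip(tokens, alignment)]


def unaligned_indices(word_pieces, alignment):
    """Return the wordpiece positions that no token points at."""
    aligned = {i for idxs in alignment for i in idxs}
    return [i for i in range(len(word_pieces)) if i not in aligned]


mapping = pieces_per_token(tokens, word_pieces, alignment)
leftover = unaligned_indices(word_pieces, alignment)

for tok, pieces in mapping:
    print(f"{tok!r:10} -> {pieces}")
print("unaligned:", [word_pieces[i] for i in leftover])
```

In this printed alignment only "encode" spans two wordpieces, and the special `[CLS]` and `[SEP]` pieces are not assigned to any token.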


In [5]:
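# To avoid losing information, doc.tensor is calculated so that its sum-pooled
# values match those of the raw transformer output (apart from numeric error).
# Summing over axis=1 gives one value per row: 7 token rows for doc.tensor and
# 10 wordpiece rows for the last hidden state, so the shapes differ while the
# grand totals of the two arrays agree.
tensor_sum = doc.tensor.sum(axis=1)
last_hidden_state_sum = doc._.pytt_last_hidden_state.sum(axis=1)
print(f"tensor_sum = {tensor_sum}, shape = {tensor_sum.shape}")
print(f"last_hidden_state_sum = {last_hidden_state_sum}, shape={last_hidden_state_sum.shape}")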

tensor_sum = [-12.914995 -15.33556  -11.52346  -10.473546 -10.937901 -24.572096
 -13.70007 ], shape = (7,)
last_hidden_state_sum = [ -8.03236    -9.845612  -12.266172   -8.454072   -7.4041576  -7.868512
 -11.894537   -9.608173  -10.630686  -13.453345 ], shape=(10,)
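
The two printed arrays have different lengths, so it may help to check how they relate. The sketch below is one reading that is consistent with the printed numbers, not the library's documented implementation: each token receives the sum of its aligned wordpieces, plus an equal share of the wordpieces that are aligned to no token (`[CLS]` and `[SEP]`). The function name `pool_to_tokens` is hypothetical, and the check works on the per-row sums copied from the output rather than on the full 768-dimensional rows.

```python
import math

# Values copied from the printed output above.
tensor_sum = [-12.914995, -15.33556, -11.52346, -10.473546,
              -10.937901, -24.572096, -13.70007]
last_hidden_state_sum = [-8.03236, -9.845612, -12.266172, -8.454072, -7.4041576,
                         -7.868512, -11.894537, -9.608173, -10.630686, -13.453345]
alignment = [[1], [2], [3], [4], [5], [6, 7], [8]]


def pool_to_tokens(piece_values, alignment):
    """Sum aligned wordpiece values per token and spread the rest evenly.

    Assumption: unaligned wordpieces are shared equally by all tokens, and
    each wordpiece is aligned to at most one token (true for this example).
    """
    aligned = {i for idxs in alignment for i in idxs}
    leftover = sum(v for i, v in enumerate(piece_values) if i not in aligned)
    share = leftover / len(alignment)
    return [sum(piece_values[i] for i in idxs) + share for idxs in alignment]


reconstructed = pool_to_tokens(last_hidden_state_sum, alignment)

# The grand totals agree up to rounding in the printed values.
totals_match = math.isclose(sum(tensor_sum), sum(last_hidden_state_sum), abs_tol=1e-4)
# Each reconstructed token value agrees with the printed tensor_sum entry.
rows_match = all(math.isclose(a, b, abs_tol=1e-4)
                 for a, b in zip(reconstructed, tensor_sum))
print(totals_match, rows_match)
```

Because this pooling is linear, the same relation would be expected to hold column by column on the full tensors, which is what comparing the sums over `axis=0` would test.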


In [8]:
# Access the tensor from Span elements (especially helpful for sentences).
# span.tensor holds the same rows as the matching slice of doc.tensor.
span = doc[2:4]
print(f'SPAN = {span.text}')
print(f"span.tensor = {span.tensor}, shape = {span.tensor.shape}")
doc_tensor = doc.tensor[2:4]
print(f"doc.tensor = {doc_tensor}, shape = {doc_tensor.shape}")

SPAN = some text
span.tensor = [[ 0.0759325  -0.4535255   0.24927723 ...  0.28133303 -0.45936924
   0.83061355]
 [ 0.19304082  0.32884982  0.38221556 ... -0.2427446  -0.18937269
   0.50180256]], shape = (2, 768)
doc.tensor = [[ 0.0759325  -0.4535255   0.24927723 ...  0.28133303 -0.45936924
   0.83061355]
 [ 0.19304082  0.32884982  0.38221556 ... -0.2427446  -0.18937269
   0.50180256]], shape = (2, 768)


## Similarity comparison

### Testing with BERT

In [0]:
nlp = spacy.load("en_pytt_bertbaseuncased_lg")

In [0]:
def compare_nlp(nlp1, nlp2):
  print(f"comparing '{nlp1}' vs '{nlp2}': {nlp1.similarity(nlp2)}")
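
For reference, spaCy's default `.similarity` is the cosine similarity of the two objects' vectors. Assuming the transformer pipeline keeps that definition (the notebook does not show the hook itself), the scores can be read as the quantity sketched below. The helper `cosine_similarity` is an illustrative stand-in written in plain Python, not a spaCy function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, not model output.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

The score depends only on the direction of the vectors, not their length, so scaling a vector leaves its similarity to any other vector unchanged.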

In [0]:
def test_apples():
  # .vector and .similarity use the transformer outputs
  apple1 = nlp("Apple shares rose on the news.")
  print(f"apple1 shape: {apple1[1].vector.shape} ")
  apple2 = nlp("Apple sold fewer iPhones this quarter.")
  apple3 = nlp("Apple pie is delicious.")

  #Compare sentences
  compare_nlp(apple1, apple2)
  compare_nlp(apple1, apple3)
  
  #Compare words by sentence context
  compare_nlp(apple1[0], apple2[0])
  compare_nlp(apple1[0], apple3[0])

In [16]:
test_apples()

apple1 shape: (768,) 
comparing 'Apple shares rose on the news.' vs 'Apple sold fewer iPhones this quarter.': 0.6986121627144829
comparing 'Apple shares rose on the news.' vs 'Apple pie is delicious.': 0.5404962512345809
comparing 'Apple' vs 'Apple': 0.7342854142189026
comparing 'Apple' vs 'Apple': 0.43365713953971863


In [0]:
def test_bank_sentences():
  bank1 = nlp("The banks of the river burst.")
  bank2 = nlp("The banks are closed today.")
  bank3 = nlp("The boys were fishing along the riverbank.")

  #Compare sentences
  compare_nlp(bank1, bank2)
  compare_nlp(bank1, bank3)

  #Compare words by sentence context
  compare_nlp(bank1[1], bank2[1])
  compare_nlp(bank1[1], bank3[6])

In [18]:
test_bank_sentences()

comparing 'The banks of the river burst.' vs 'The banks are closed today.': 0.6167170223363752
comparing 'The banks of the river burst.' vs 'The boys were fishing along the riverbank.': 0.666211196578021
comparing 'banks' vs 'banks': 0.5826777219772339
comparing 'banks' vs 'riverbank': 0.6132599115371704


### Testing with XLNet

In [0]:
def test_model():
  test_apples()
  test_bank_sentences()

In [22]:
# Redo with XLNet
nlp = spacy.load("en_pytt_xlnetbasecased_lg")

test_model()

apple1 shape: (768,) 
comparing 'Apple shares rose on the news.' vs 'Apple sold fewer iPhones this quarter.': 0.991628388620012
comparing 'Apple shares rose on the news.' vs 'Apple pie is delicious.': 0.9804112023332523
comparing 'Apple' vs 'Apple': 0.9853271842002869
comparing 'Apple' vs 'Apple': 0.9792127013206482
comparing 'The banks of the river burst.' vs 'The banks are closed today.': 0.9783249649400636
comparing 'The banks of the river burst.' vs 'The boys were fishing along the riverbank.': 0.9844713596787769
comparing 'banks' vs 'banks': 0.9692249298095703
comparing 'banks' vs 'riverbank': 0.982304036617279
