In [1]:
import warnings
import numpy as np
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

document = 'Review on the global epidemiological situation and the efficacy of chloroquine ' \
           'and hydroxychloroquine for the treatment of COVID-19. ' \
           'Covid-19 disease is caused by SARS-CoV-2, a virus belonging to the coronavirus family. ' \
           'Covid-19 is so new that there is currently no specific vaccine or treatment. ' \
           'Clinical trials are currently underway. In vitro tests are also being conducted to assess the efficacy of ' \
           'chloroquine and hydroxychloroquine for the treatment of this epidemic, ' \
           'which is considered a pandemic by the WHO. We note that the content of this review is dated. ' \
           'The information it contains is subject to change and modification as the epidemic progresses.'

Load spaCy's English language model

In [2]:
import spacy

english_model = spacy.load('en')

spacy_doc = english_model(document)

I1102 08:53:29.929197 78996 file_utils.py:39] PyTorch version 1.1.0 available.
I1102 08:53:31.307605 78996 modeling_xlnet.py:194] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [3]:
print(list(spacy_doc))

[Review, on, the, global, epidemiological, situation, and, the, efficacy, of, chloroquine, and, hydroxychloroquine, for, the, treatment, of, COVID-19, ., Covid-19, disease, is, caused, by, SARS, -, CoV-2, ,, a, virus, belonging, to, the, coronavirus, family, ., Covid-19, is, so, new, that, there, is, currently, no, specific, vaccine, or, treatment, ., Clinical, trials, are, currently, underway, ., In, vitro, tests, are, also, being, conducted, to, assess, the, efficacy, of, chloroquine, and, hydroxychloroquine, for, the, treatment, of, this, epidemic, ,, which, is, considered, a, pandemic, by, the, WHO, ., We, note, that, the, content, of, this, review, is, dated, ., The, information, it, contains, is, subject, to, change, and, modification, as, the, epidemic, progresses, .]


spaCy assigns each token a code (part-of-speech tag) that identifies its role in the sentence, as in grammatical analysis

In [4]:
for t in spacy_doc:
    print(t.text, t.pos_)

Review NOUN
on ADP
the DET
global ADJ
epidemiological ADJ
situation NOUN
and CCONJ
the DET
efficacy NOUN
of ADP
chloroquine NOUN
and CCONJ
hydroxychloroquine NOUN
for ADP
the DET
treatment NOUN
of ADP
COVID-19 NOUN
. PUNCT
Covid-19 ADJ
disease NOUN
is AUX
caused VERB
by ADP
SARS PROPN
- PUNCT
CoV-2 PROPN
, PUNCT
a DET
virus NOUN
belonging VERB
to ADP
the DET
coronavirus PROPN
family NOUN
. PUNCT
Covid-19 PROPN
is AUX
so ADV
new ADJ
that SCONJ
there PRON
is AUX
currently ADV
no DET
specific ADJ
vaccine NOUN
or CCONJ
treatment NOUN
. PUNCT
Clinical ADJ
trials NOUN
are AUX
currently ADV
underway ADJ
. PUNCT
In ADP
vitro X
tests NOUN
are AUX
also ADV
being AUX
conducted VERB
to PART
assess VERB
the DET
efficacy NOUN
of ADP
chloroquine NOUN
and CCONJ
hydroxychloroquine NOUN
for ADP
the DET
treatment NOUN
of ADP
this DET
epidemic NOUN
, PUNCT
which DET
is AUX
considered VERB
a DET
pandemic NOUN
by ADP
the DET
WHO PROPN
. PUNCT
We PRON
note VERB
that SCONJ
the DET
content NOUN
of ADP
this DET
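
The tags can be used directly to summarize or filter the document. The following is a minimal sketch, assuming the (token, tag) pairs are copied by hand from the first lines of the output above; in the notebook they would come from `(t.text, t.pos_)`. The `CONTENT_TAGS` set is an illustrative choice, not something the notebook defines.

```python
from collections import Counter

# (token, POS) pairs copied from the first lines of the output above.
tagged = [
    ("Review", "NOUN"), ("on", "ADP"), ("the", "DET"), ("global", "ADJ"),
    ("epidemiological", "ADJ"), ("situation", "NOUN"), ("and", "CCONJ"),
    ("the", "DET"), ("efficacy", "NOUN"), ("of", "ADP"),
    ("chloroquine", "NOUN"), ("and", "CCONJ"), ("hydroxychloroquine", "NOUN"),
]

# Frequency of each tag in the sample.
pos_counts = Counter(pos for _, pos in tagged)

# Keep only content words; this tag set is an illustrative assumption.
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ"}
content_words = [text for text, pos in tagged if pos in CONTENT_TAGS]

print(pos_counts)
print(content_words)
```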

Each word can be associated with its word embedding, so I load the word embedding model...

In [5]:
from gensim.models import Word2Vec

word_model = Word2Vec.load('pub_med_retrained_ddi_word_embedding_200.model')

I1102 08:53:32.507402 78996 textcleaner.py:37] 'pattern' package not found; tag filters are not available for English
I1102 08:53:32.512377 78996 utils.py:418] loading Word2Vec object from pub_med_retrained_ddi_word_embedding_200.model
I1102 08:53:32.581598 78996 utils.py:452] loading wv recursively from pub_med_retrained_ddi_word_embedding_200.model.wv.* with mmap=None
I1102 08:53:32.582594 78996 utils.py:487] setting ignored attribute vectors_norm to None
I1102 08:53:32.583622 78996 utils.py:452] loading vocabulary recursively from pub_med_retrained_ddi_word_embedding_200.model.vocabulary.* with mmap=None
I1102 08:53:32.584592 78996 utils.py:452] loading trainables recursively from pub_med_retrained_ddi_word_embedding_200.model.trainables.* with mmap=None
I1102 08:53:32.584592 78996 utils.py:487] setting ignored attribute cum_table to None
I1102 08:53:32.585588 78996 utils.py:424] loaded pub_med_retrained_ddi_word_embedding_200.model


And I build a matrix of word vectors, i.e. the input of the recurrent network

In [6]:
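# One document, padded to 300 tokens, with 200-dimensional embeddings
word_matrix = np.zeros(shape=(1, 300, 200))
for i, t in enumerate(spacy_doc):
    lower_text = t.text.lower()
    # Tokens missing from the vocabulary keep an all-zero vector
    if lower_text in word_model.wv.vocab:
        word_matrix[0][i] = word_model.wv.get_vector(lower_text)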
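
Since tokens that are missing from the embedding vocabulary are left as zero vectors, it may be worth checking how much of the document is actually covered. The sketch below is an assumption-laden illustration: `vocabulary_coverage`, `toy_tokens` and `toy_vocab` are hypothetical names, and in the notebook the inputs would presumably be `[t.text for t in spacy_doc]` and `word_model.wv.vocab`.

```python
def vocabulary_coverage(tokens, vocabulary):
    """Return (fraction of tokens found in vocabulary, list of missing tokens)."""
    lowered = [tok.lower() for tok in tokens]
    missing = [tok for tok in lowered if tok not in vocabulary]
    covered = len(lowered) - len(missing)
    fraction = covered / len(lowered) if lowered else 0.0
    return fraction, missing


# Hypothetical stand-ins for the spaCy tokens and the Word2Vec vocabulary.
toy_tokens = ["Covid-19", "is", "caused", "by", "SARS", "-", "CoV-2"]
toy_vocab = {"is", "caused", "by", "sars"}

fraction, missing = vocabulary_coverage(toy_tokens, toy_vocab)
print(f"covered: {fraction:.2f}, missing: {missing}")
```

A low coverage would mean that many rows of the input matrix are zero, which the network cannot tell apart from padding.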

Define the recurrent network

In [7]:
from keras.layers import Input, LSTM, Bidirectional, Dense
from attention_extraction_layers import AttentionWeights, ContextVector
from keras.models import Model

input_layer = Input(shape=(300, 200))
lstm_layer = Bidirectional(LSTM(80, return_sequences=True))(input_layer)
attention_weights = AttentionWeights(300, name='attention_weights')(lstm_layer)
context_vector = ContextVector()([lstm_layer, attention_weights])
dense = Dense(4, activation='softmax')(context_vector)

model = Model(inputs=input_layer, outputs=attention_weights)
model.summary()

Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 300, 200)          0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 300, 160)          179840    
_________________________________________________________________
attention_weights (Attention (None, 160)               460       
Total params: 180,300
Trainable params: 180,300
Non-trainable params: 0
_________________________________________________________________
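
The parameter count of the bidirectional layer in the summary can be checked by hand. A standard LSTM has four gates, each with an input kernel, a recurrent kernel and a bias, and the bidirectional wrapper holds two independent LSTMs whose outputs are concatenated by default (hence 2 × 80 = 160 output features). The helper names below are hypothetical; the formula is the usual one for a plain LSTM.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_param_count(input_dim, units):
    # The forward and backward LSTMs have separate weights.
    return 2 * lstm_param_count(input_dim, units)


total = bidirectional_lstm_param_count(input_dim=200, units=80)
print(total)  # 179840, matching bidirectional_1 in the summary
```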


In theory I should train it on thousands of documents... but that would take hours.
Assuming it has been trained, let's extract the attention weights.
To do this I have to "cut the last piece off the model".

So I build an "intermediate" model, with the same input as the previous one and the same layers except ContextVector and Dense

In [8]:
intermediate_layer_model = Model(inputs=model.input, outputs=[model.get_layer('attention_weights').output])
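
The `attention_extraction_layers` module is not shown here, so what `AttentionWeights` and `ContextVector` compute exactly is unknown. As a rough illustration only, a common formulation normalizes one score per time step with a softmax and then takes the weighted sum of the hidden states. The sketch below uses hypothetical toy values and plain Python, and should not be read as the implementation of those layers.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(hidden_states, weights):
    # Weighted sum of the hidden states, one weight per time step.
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(dim)]


# Hypothetical toy values: 3 time steps, 2-dimensional hidden states.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # equal scores give uniform weights

weights = softmax(scores)
context = context_vector(hidden_states, weights)
print(weights, context)
```

Under this formulation the weights sum to 1, so each one can be read as the share of the context vector contributed by the corresponding token.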

I pass my test document to the intermediate model:

In [None]:
doc_attention = intermediate_layer_model.predict(word_matrix)

In [None]:
print(doc_attention.shape)
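
Once the weights are available, a natural next step is to line them up with the tokens. The sketch below assumes the layer returns one weight per token position, which the notebook does not confirm, and uses made-up values. `top_attended_tokens`, `toy_tokens` and `toy_weights` are hypothetical names; in the notebook the inputs might be `[t.text for t in spacy_doc]` and `doc_attention[0]`.

```python
def top_attended_tokens(tokens, weights, k=3):
    """Pair each token with its weight and return the k highest-weighted pairs."""
    # zip stops at the shorter sequence, so trailing padded positions are dropped.
    pairs = list(zip(tokens, weights))
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:k]


# Hypothetical values; the last two weights stand for padded positions.
toy_tokens = ["chloroquine", "for", "the", "treatment", "of", "COVID-19"]
toy_weights = [0.30, 0.05, 0.05, 0.20, 0.05, 0.25, 0.05, 0.05]

top = top_attended_tokens(toy_tokens, toy_weights)
for token, weight in top:
    print(f"{token}\t{weight:.2f}")
```

With an untrained network the real weights carry no meaning; the ranking only becomes informative after training.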