In [1]:
# Read the input file; each line is treated as one paragraph
with open('Patient_data.txt', 'r') as f:
    patient = f.readlines()

In [5]:
# en_coref_md: English spaCy model packaged with NeuralCoref coreference resolution
import en_coref_md
nlp = en_coref_md.load()

In [6]:
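# Run coreference resolution on every paragraph
for i, paragraph in enumerate(patient):
    print("Paragraph number...", i)
    doc = nlp(paragraph)
    print(doc._.has_coref)       # True if at least one coreference was found
    print(doc._.coref_clusters)  # mentions grouped by the entity they refer to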

Paragraph number... 0
True
[Crestor: [Crestor, Crestor, Crestor, brand Crestor, Crestor, Crestor, Crestor, Crestor], Patient: [Patient, Patient, Patient, she, she, She, her, Her, She, her, her], Her HCP: [Her HCP, her HCP, HCP], No further information: [No further information, This information], Patient: [Patient, she, she], she is fine with the brand Crestor: [she is fine with the brand Crestor, it]]
Paragraph number... 1
True
[Nexium Rx: [Nexium Rx, the Nexium Rx], This medication: [This medication, the medication, the medication], The caller: [The caller, she, She, she, She, She, she, She, She]]
Paragraph number... 2
True
[a consumer who was no longer on the line: [a consumer who was no longer on the line, The consumer, The consumer, The consumer, The consumer], start date: [start date, her, She, her, her, her], her acid reflux: [her acid reflux, her acid reflux]]
Paragraph number... 3
True
[Patient: [Patient, she, She, her, Patient], Nexium: [Nexium, Nexium]]
Paragraph number... 4


## spaCy modules testing

### Linguistic annotations

spaCy provides a variety of linguistic annotations to give you insights into a text's grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you're analysing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether "google" is used as a verb, or refers to the website or company in a specific context.

In [8]:
for i, paragraph in enumerate(patient):
    print("Paragraph number...", i)
    doc = nlp(paragraph)
    for token in doc:
        # Token text, coarse part-of-speech tag, syntactic dependency label
        print(token.text, token.pos_, token.dep_)

Paragraph number... 0
Call NOUN nsubj
came VERB ROOT
directly ADV advmod
to ADP prep
the DET det
IC PROPN pobj
for ADP prep
medical ADJ amod
inquiry NOUN pobj
and CCONJ cc
adverse ADJ amod
event NOUN conj
noted VERB advcl
for ADP prep
Crestor PROPN pobj
. PUNCT punct
Patient NOUN nsubj
started VERB ROOT
Crestor PROPN dobj
10 NUM nummod
mg NOUN npadvmod
daily ADV advmod
by ADP prep
mouth NOUN pobj
in ADP prep
01 NUM nummod
- SYM punct
2008 NUM pobj
for ADP prep
high ADJ amod
cholesterol NOUN pobj
. PUNCT punct
Patient NOUN nsubj
also ADV advmod
taking VERB csubj
Toprol PROPN compound
XL PROPN dobj
50 NUM nummod
mg NOUN dobj
daily ADV advmod
by ADP prep
mouth NOUN pobj
started VERB ROOT
in ADP prep
11 NUM pobj
- SYM punct
2000 NUM prep
for ADP prep
high ADJ amod
blood NOUN compound
pressure NOUN pobj
. PUNCT punct
Patient NOUN nsubj
reported VERB ROOT
the DET det
following NOUN dobj
that ADP mark
she PRON nsubj
has VERB aux
skipped VERB ccomp
doses NOUN dobj
of ADP prep
Crestor PROPN pob

Email NOUN compound
correspondence NOUN ROOT
received VERB acl
from ADP prep
XX PROPN pobj
and CCONJ cc
Me PRON conj
. PUNCT punct
Adverse ADJ amod
Event NOUN nsubjpass
was VERB auxpass
received VERB ccomp
from ADP prep
XX PROPN pobj
and CCONJ cc
Me PRON conj
by ADP agent
email NOUN pobj
, PUNCT punct
as ADP mark
they PRON nsubj
were VERB advcl
unable ADJ acomp
to PART aux
contact VERB xcomp
the DET det
IC PROPN dobj
by ADP prep
phone NOUN pobj
, PUNCT punct
no DET det
attachments NOUN nsubjpass
are VERB auxpass
associated VERB ROOT
with ADP prep
this DET det
case NOUN pobj
. PUNCT punct
Subject NOUN ROOT
This DET nsubj
is VERB ROOT
a DET det
possible ADJ amod
AE NOUN attr
, PUNCT punct
PQC PROPN conj
, PUNCT punct
or CCONJ cc
Medical PROPN nmod
/ SYM punct
Product PROPN compound
Inquiry PROPN conj
from ADP prep
XXandMe PROPN compound
inVentiv PROPN pobj
. PUNCT punct
Please INTJ intj
treat VERB ROOT
this DET dobj
as ADP prep
an DET det
urgent ADJ amod
case NOUN pobj
. PUNCT punct
AE P

### Tokenization

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas "U.K." should remain one token. Each Doc consists of individual tokens, and we can iterate over them:

In [9]:
for i, paragraph in enumerate(patient):
    print("Paragraph number...", i)
    doc = nlp(paragraph)
    # A Doc is a sequence of Token objects
    for token in doc:
        print(token.text)

Paragraph number... 0
Call
came
directly
to
the
IC
for
medical
inquiry
and
adverse
event
noted
for
Crestor
.
Patient
started
Crestor
10
mg
daily
by
mouth
in
01
-
2008
for
high
cholesterol
.
Patient
also
taking
Toprol
XL
50
mg
daily
by
mouth
started
in
11
-
2000
for
high
blood
pressure
.
Patient
reported
the
following
that
she
has
skipped
doses
of
Crestor
over
the
years
since
she
started
taking
it
and
the
last
time
was
in
12
-
2017
.
She
would
just
take
her
usual
dose
the
next
day
and
not
try
to
make
up
the
missed
dose
.
Her
HCP
is
aware
,
no
treatment
offered
.
Had
high
cholesterol
since
about
2003
and
currently
taking
brand
Crestor
.
She
is
nervous
now
about
taking
new
medications
,
unknown
start
date
and
if
her
HCP
is
aware
.
Patient
to
start
generic
Crestor
when
finish
her
brand
Crestor
.
No
further
information
was
provided
.
Follow
up
received
on
03
-
01
-
2018
.
Updated
on
03
-
01
-
2018
.
The
following
should
have
been
included
in
the
above
narrative
.
Patient
stated
that
she
is
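
The rule-based idea described in this section can be sketched in plain Python. This is only an illustration, not spaCy's tokenizer: the `EXCEPTIONS` set and the `simple_tokenize` helper are hypothetical names, and infix rules (such as the hyphen splits visible in the output above) are deliberately ignored.

```python
import string

# Hypothetical exception list: strings that must stay whole even though
# they end in punctuation.
EXCEPTIONS = {"U.K.", "e.g.", "i.e."}


def simple_tokenize(text):
    """Split on whitespace, then peel trailing punctuation off each chunk."""
    tokens = []
    for chunk in text.split():
        suffixes = []
        # Strip punctuation from the end until we hit a word or an exception.
        while chunk and chunk not in EXCEPTIONS and chunk[-1] in string.punctuation:
            suffixes.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        # Punctuation was collected right-to-left, so restore reading order.
        tokens.extend(reversed(suffixes))
    return tokens


print(simple_tokenize("She moved to the U.K. in 2008."))
```

The final period is split off as its own token, while "U.K." survives intact because the exception check runs before any stripping.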


### Part-of-speech tags and dependencies

After tokenization, spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalise across the language – for example, a word following "the" in English is most likely a noun.

Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name:



In [10]:
for i, paragraph in enumerate(patient):
    print("Paragraph number...", i)
    doc = nlp(paragraph)
    for token in doc:
        # Attributes ending in "_" return readable strings instead of hashes
        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, token.is_alpha, token.is_stop)

Paragraph number... 0
Call call NOUN NN nsubj Xxxx True False
came come VERB VBD ROOT xxxx True False
directly directly ADV RB advmod xxxx True False
to to ADP IN prep xx True False
the the DET DT det xxx True False
IC ic PROPN NNP pobj XX True False
for for ADP IN prep xxx True False
medical medical ADJ JJ amod xxxx True False
inquiry inquiry NOUN NN pobj xxxx True False
and and CCONJ CC cc xxx True False
adverse adverse ADJ JJ amod xxxx True False
event event NOUN NN conj xxxx True False
noted note VERB VBN advcl xxxx True False
for for ADP IN prep xxx True False
Crestor crestor PROPN NNP pobj Xxxxx True False
. . PUNCT . punct . False False
Patient patient NOUN NN nsubj Xxxxx True False
started start VERB VBD ROOT xxxx True False
Crestor crestor PROPN NNP dobj Xxxxx True False
10 10 NUM CD nummod dd False False
mg mg NOUN NN npadvmod xx True False
daily daily ADV RB advmod xxxx True False
by by ADP IN prep xx True False
mouth mouth NOUN NN pobj xxxx True False
in in ADP IN prep xx T

Call call VERB VB ROOT Xxxx True False
direct direct ADJ JJ oprd xxxx True False
to to ADP IN prep xx True False
IC ic PROPN NNP pobj XX True False
with with ADP IN prep xxxx True False
XX xx PROPN NNP pobj XX True False
and and CCONJ CC cc xxx True False
Me -PRON- PRON PRP conj Xx True False
related related ADJ JJ amod xxxx True False
issue issue NOUN NN compound xxxx True False
removal removal NOUN NN dobj xxxx True False
of of ADP IN prep xx True False
product product NOUN NN compound xxxx True False
letter letter NOUN NN pobj xxxx True False
. . PUNCT . punct . False False
Patient patient NOUN NN nsubjpass Xxxxx True False
has have VERB VBZ aux xxx True False
been be VERB VBN auxpass xxxx True False
prescribed prescribe VERB VBN ROOT xxxx True False
Nexium nexium PROPN NNP compound Xxxxx True False
Rx rx PROPN NNP compound Xx True False
capsule capsule NOUN NN dobj xxxx True False
. . PUNCT . punct . False False
Long long ADJ JJ amod Xxxx True False
term term NOUN NN nsubj xxxx Tru

Email email NOUN NN compound Xxxxx True False
correspondence correspondence NOUN NN ROOT xxxx True False
received receive VERB VBD acl xxxx True False
from from ADP IN prep xxxx True False
XX xx PROPN NNP pobj XX True False
and and CCONJ CC cc xxx True False
Me -PRON- PRON PRP conj Xx True False
. . PUNCT . punct . False False
Adverse adverse ADJ JJ amod Xxxxx True False
Event event NOUN NN nsubjpass Xxxxx True False
was be VERB VBD auxpass xxx True False
received receive VERB VBN ccomp xxxx True False
from from ADP IN prep xxxx True False
XX xx PROPN NNP pobj XX True False
and and CCONJ CC cc xxx True False
Me -PRON- PRON PRP conj Xx True False
by by ADP IN agent xx True False
email email NOUN NN pobj xxxx True False
, , PUNCT , punct , False False
as as ADP IN mark xx True False
they -PRON- PRON PRP nsubj xxxx True False
were be VERB VBD advcl xxxx True False
unable unable ADJ JJ acomp xxxx True False
to to PART TO aux xx True False
contact contact VERB VB xcomp xxxx True False
the t

have have VERB VB aux xxxx True False
been be VERB VBN ccomp xxxx True False
due due ADJ JJ acomp xxx True False
to to ADP IN pcomp xx True False
the the DET DT det xxx True False
fillers filler NOUN NNS pobj xxxx True False
. . PUNCT . punct . False False
PS ps PROPN NNP nsubj XX True False
granted grant VERB VBD ROOT xxxx True False
permission permission NOUN NN dobj xxxx True False
to to PART TO aux xx True False
contact contact VERB VB advcl xxxx True False
patient patient ADJ JJ dobj xxxx True False
. . PUNCT . punct . False False
Her -PRON- ADJ PRP$ poss Xxx True False
address address NOUN NN nsubj xxxx True False
is be VERB VBZ ccomp xx True False
as as ADP IN mark xx True False
stated state VERB VBN advcl xxxx True False
above above ADV RB advmod xxxx True False
, , PUNCT , punct , False False
her -PRON- ADJ PRP$ poss xxx True False
email email NOUN NN nsubj xxxx True False
is be VERB VBZ ROOT xx True False
xxx@yahoo.com xxx@yahoo.com X ADD attr xxx@xxxx.xxx False False
, , PUN

### Named Entities

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

In [11]:
for i, paragraph in enumerate(patient):
    print("Paragraph number...", i)
    doc = nlp(paragraph)
    for ent in doc.ents:
        # Entity text, character offsets into the paragraph, predicted label
        print(ent.text, ent.start_char, ent.end_char, ent.label_)

Paragraph number... 0
IC 26 28 ORG
Crestor 77 84 ORG
Patient 86 93 ORG
Crestor 102 109 ORG
10 mg 110 115 CARDINAL
daily 116 121 DATE
01-2008 134 141 DATE
Patient 164 171 ORG
Toprol XL 184 193 ORG
daily 200 205 DATE
11-2000 226 233 DATE
Crestor 320 327 ORG
the years 333 342 DATE
12-2017 396 403 DATE
the next day 440 452 DATE
HCP 497 500 ORG
about 2003 560 570 DATE
Crestor 598 605 PRODUCT
HCP 686 689 ORG
Crestor 725 732 PRODUCT
Crestor 755 762 PRODUCT
03-01-2018 823 833 DATE
03-01-2018 846 856 DATE
Patient 922 929 ORG
Crestor 969 976 PRODUCT
Paragraph number... 1
XX 23 25 PERSON
UNK 183 186 ORG
HCP 794 797 ORG
SJ 799 801 ORG
Paragraph number... 2
XX 0 2 PERSON
Nexium 124 130 ORG
couple days 249 260 DATE
today 472 477 DATE
Paragraph number... 3
Nexium 103 109 ORG
Patient 161 168 ORG
Paragraph number... 4
XX 35 37 PERSON
XX 78 80 PERSON
AE 219 221 ORG
PQC 223 226 ORG
Medical/Product Inquiry 231 254 ORG
XXandMe inVentiv 260 276 PERSON
AE 315 317 ORG
21-09-2015 339 349 DATE
PERD 381 385 ORG
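
The `start_char` and `end_char` values are character offsets into the paragraph, with the end offset exclusive, so slicing the text with them recovers the entity string. The sketch below checks this for two entities from paragraph 0. The sentence is reconstructed from the token output earlier in this notebook, which is an assumption because the input file itself is not shown. `entity_texts` is a hypothetical helper.

```python
# First sentence of paragraph 0, rebuilt from the token listing above
# (assumption: tokens were separated by single spaces in the source file).
text = ("Call came directly to the IC for medical inquiry "
        "and adverse event noted for Crestor.")

# (start_char, end_char, label) triples copied from the entity output.
entities = [(26, 28, "ORG"), (77, 84, "ORG")]


def entity_texts(text, spans):
    """Slice the text with each span's offsets and pair it with its label."""
    return [(text[start:end], label) for start, end, label in spans]


print(entity_texts(text, entities))
```

In the output above, "Crestor" is labelled ORG in some places and PRODUCT in others, and "Patient" is labelled ORG. This is the kind of imperfection the section warns about for statistical models.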


### Word vectors and similarity measures

In [13]:
tokens = nlp(u'dog cat banana')
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

dog dog 1.0
dog cat 0.80168545
dog banana 0.24327643
cat dog 0.80168545
cat cat 1.0
cat banana 0.28154364
banana dog 0.24327643
banana cat 0.28154364
banana banana 1.0


In this case, the model's predictions are pretty on point. A dog is very similar to a cat, whereas a banana is not very similar to either of them. Identical tokens are obviously 100% similar to each other (just not always exactly 1.0, because of vector math and floating point imprecisions).

Similarity is determined by comparing word vectors or "word embeddings", multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec.
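
A common way to compare two vectors is cosine similarity: the dot product divided by the product of the vector lengths. The sketch below uses made-up three-dimensional vectors purely for illustration. Real embeddings have hundreds of dimensions, and the values in `toy_vectors` are not taken from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors, chosen so "dog" and "cat" point in similar directions.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

for word1, vec1 in toy_vectors.items():
    for word2, vec2 in toy_vectors.items():
        print(word1, word2, round(cosine_similarity(vec1, vec2), 3))
```

As in the real output above, a word compared with itself scores 1.0 up to floating point error, and the dog/cat score is much higher than either score involving banana.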

### Vocab, Hashes and Lexemes

Whenever possible, spaCy tries to store data in a vocabulary, the Vocab, that will be shared by multiple documents. To save memory, spaCy also encodes all strings to hash values – in this case for example, "coffee" has the hash 3197928453018144401. Entity labels like "ORG" and part-of-speech tags like "VERB" are also encoded. Internally, spaCy only "speaks" in hash values.

If you process lots of documents containing the word "coffee" in all kinds of different contexts, storing the exact string "coffee" every time would take up way too much space. So instead, spaCy hashes the string and stores it in the StringStore. You can think of the StringStore as a lookup table that works in both directions – you can look up a string to get its hash, or a hash to get its string:



In [14]:
doc = nlp(u'I love coffee')
print(doc.vocab.strings[u'coffee'])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'

3197928453018144401
coffee
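
The two-way lookup can be imitated with a dictionary and a deterministic hash function. The sketch below is not spaCy's implementation: `TinyStringStore` and `hash_string` are hypothetical names, and `hashlib.blake2b` is only a stand-in for spaCy's own hash function, so the numbers it produces will differ from the ones shown above.

```python
import hashlib


def hash_string(text):
    """Stand-in 64-bit hash of a string (spaCy uses a different function)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class TinyStringStore:
    """Minimal lookup table that maps strings to hashes and hashes to strings."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        """Remember the string and return its hash."""
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            # String -> hash is a pure computation and needs no stored state.
            return hash_string(item)
        # Hash -> string only works if the string was added earlier.
        return self._by_hash[item]


store = TinyStringStore()
coffee_hash = store.add("coffee")
print(store["coffee"] == coffee_hash)
print(store[coffee_hash])
```

Looking up a hash in a store that never saw the string raises `KeyError`, because the hash alone carries no way to recover the text.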


Now that all strings are encoded, the entries in the vocabulary don't need to include the word text themselves. Instead, they can look it up in the StringStore via its hash value. Each entry in the vocabulary, also called Lexeme, contains the context-independent information about a word. For example, no matter if "love" is used as a verb or a noun in some context, its spelling and whether it consists of alphabetic characters won't ever change. Its hash value will also always be the same.
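
Context-independent attributes can be computed from the string alone. The sketch below approximates a few of the lexeme attributes printed in this section's output. `word_shape` and `lexical_attributes` are hypothetical helpers modelled on the values visible in that output (for example, runs of the same shape character appear to be capped at four); spaCy's actual implementation may differ in details.

```python
def word_shape(text):
    """Map letters to X/x and digits to d, capping repeated runs at four."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        # Count how many times the same shape character repeats in a row.
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


def lexical_attributes(text):
    """Attributes that depend only on the spelling, never on the context."""
    return {
        "shape": word_shape(text),
        "prefix": text[:1],    # first character
        "suffix": text[-3:],   # last three characters
        "is_alpha": text.isalpha(),
        "is_digit": text.isdigit(),
        "is_title": text.istitle(),
    }


for word in ["Call", "Crestor", "IC", "30"]:
    print(word, lexical_attributes(word))
```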

In [15]:
for i, paragraph in enumerate(patient):
    print("Paragraph number...", i)
    doc = nlp(paragraph)
    for word in doc:
        # Look up the context-independent vocabulary entry for this token
        lexeme = doc.vocab[word.text]
        print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
              lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)

Paragraph number... 0
Call 11104957948539619373 Xxxx C all True False True en
came 15989513710403574752 xxxx c ame True False False en
directly 9431896469908734155 xxxx d tly True False False en
to 3791531372978436496 xx t to True False False en
the 7425985699627899538 xxx t the True False False en
IC 9397940574446756610 XX I IC True False False en
for 16037325823156266367 xxx f for True False False en
medical 10039723232229819975 xxxx m cal True False False en
inquiry 14225684692882789161 xxxx i iry True False False en
and 2283656566040971221 xxx a and True False False en
adverse 10274160298693536259 xxxx a rse True False False en
event 16065740214838660377 xxxx e ent True False False en
noted 6777982591128005566 xxxx n ted True False False en
for 16037325823156266367 xxx f for True False False en
Crestor 3637941411495505866 Xxxxx C tor True False True en
. 12646065887601541794 . . . False False False en
Patient 9416364957002412138 Xxxxx P ent True False True en
started 17976686883172

Call 11104957948539619373 Xxxx C all True False True en
direct 3396428638438620601 xxxx d ect True False False en
to 3791531372978436496 xx t to True False False en
IC 9397940574446756610 XX I IC True False False en
with 12510949447758279278 xxxx w ith True False False en
XX 14968597813295776895 XX X XX True False False en
and 2283656566040971221 xxx a and True False False en
Me 17688641098104964042 Xx M Me True False True en
related 9648789318421011716 xxxx r ted True False False en
issue 4747730822158419200 xxxx i sue True False False en
removal 10259646960330252655 xxxx r val True False False en
of 886050111519832510 xx o of True False False en
product 2104994216896503478 xxxx p uct True False False en
letter 720313458719916916 xxxx l ter True False False en
. 12646065887601541794 . . . False False False en
Patient 9416364957002412138 Xxxxx P ent True False True en
has 1248239241591158246 xxx h has True False False en
been 12517269084653561647 xxxx b een True False False en
prescrib

XX 14968597813295776895 XX X XX True False False en
and 2283656566040971221 xxx a and True False False en
Me 17688641098104964042 Xx M Me True False True en
agent 398 xxxx a ent True False False en
called 3125235652374451650 xxxx c led True False False en
to 3791531372978436496 xx t to True False False en
report 2729752284408055516 xxxx r ort True False False en
an 15099054000809333061 xx a an True False False en
adverse 10274160298693536259 xxxx a rse True False False en
event 16065740214838660377 xxxx e ent True False False en
on 5640369432778651323 xx o on True False False en
behalf 15927626727734925895 xxxx b alf True False False en
of 886050111519832510 xx o of True False False en
a 11901859001352538922 x a a True False False en
consumer 822906622843328326 xxxx c mer True False False en
who 3876862883474502309 xxx w who True False False en
was 9921686513378912864 xxx w was True False False en
no 13055779130471031426 xx n no True False False en
longer 2041435544530518504 xxxx l ger

Email 11010771136823990775 Xxxxx E ail True False True en
correspondence 11310010288497220647 xxxx c nce True False False en
received 8054871301319024843 xxxx r ved True False False en
from 7831658034963690409 xxxx f rom True False False en
XX 14968597813295776895 XX X XX True False False en
and 2283656566040971221 xxx a and True False False en
Me 17688641098104964042 Xx M Me True False True en
. 12646065887601541794 . . . False False False en
Adverse 7092613754313174695 Xxxxx A rse True False True en
Event 5416805174850762202 Xxxxx E ent True False True en
was 9921686513378912864 xxx w was True False False en
received 8054871301319024843 xxxx r ved True False False en
from 7831658034963690409 xxxx f rom True False False en
XX 14968597813295776895 XX X XX True False False en
and 2283656566040971221 xxx a and True False False en
Me 17688641098104964042 Xx M Me True False True en
by 16764210730586636600 xx b by True False False en
email 7320900731437023467 xxxx e ail True False False en


she 6740321247510922449 xxx s she True False False en
was 9921686513378912864 xxx w was True False False en
using 16421957100465448365 xxxx u ing True False False en
the 7425985699627899538 xxx t the True False False en
generic 8919908555459875279 xxxx g ric True False False en
for 16037325823156266367 xxx f for True False False en
30 17750038938330149908 dd 3 30 False True False en
days 18443948407981750281 xxxx d ays True False False en
, 2593208677638477497 , , , False False False en
she 6740321247510922449 xxx s she True False False en
did 13583488448875926965 xxx d did True False False en
not 447765159362469301 xxx n not True False False en
sleep 9840574412351606749 xxxx s eep True False False en
well 4525988469032889948 xxxx w ell True False False en
and 2283656566040971221 xxx a and True False False en
was 9921686513378912864 xxx w was True False False en
shaky 5993504102340924037 xxxx s aky True False False en
. 12646065887601541794 . . . False False False en
Her 45281126029213

One 12491099834491675542 Xxx O One True False True en
25 4522981160172931067 dd 2 25 False True False en
years 9492612516460955585 xxxx y ars True False False en
old 2483095116303079762 xxx o old True False False en
female 194700782214550345 xxxx f ale True False False en
patient 14577208587912523278 xxxx p ent True False False en
received 8054871301319024843 xxxx r ved True False False en
Symbicort 14055695076967182582 Xxxxx S ort True False True en
160 11641201567768325302 ddd 1 160 False True False en
ug 15060655763940145774 xx u ug True False False en
bid 15851398737023972976 xxx b bid True False False en
for 16037325823156266367 xxx f for True False False en
asthma 13804933907730890588 xxxx a hma True False False en
via 17875975611192538924 xxx v via True False False en
prescription 9002369395942733371 xxxx p ion True False False en
since 10066841407251338481 xxxx s nce True False False en
over 5456543204961066030 xxxx o ver True False False en
one 17454115351911680600 xxx o one T

her 4115755726172261197 xxx h her True False False en
lung 14280935251206859071 xxxx l ung True False False en
and 2283656566040971221 xxx a and True False False en
’monilial 6756949303980709534 ’xxxx ’ ial False False False en
infection 12354069391853095836 xxxx i ion True False False en
in 3002984154512732771 xx i in True False False en
her 4115755726172261197 xxx h her True False False en
oropharynx 17795721333051958783 xxxx o ynx True False False en
was 9921686513378912864 xxx w was True False False en
added 9384810846185141164 xxxx a ded True False False en
. 12646065887601541794 . . . False False False en

 962983613142996970 
 
 
 False False False en


The mapping of words to hashes doesn't depend on any state. To make sure each value is unique, spaCy uses a hash function to calculate the hash based on the word string. This also means that the hash for "coffee" will always be the same, no matter which model you're using or how you've configured spaCy.

However, hashes cannot be reversed and there's no way to resolve 3197928453018144401 back to "coffee". All spaCy can do is look it up in the vocabulary. That's why you always need to make sure all objects you create have access to the same vocabulary. If they don't, spaCy might not be able to find the strings it needs.

In [18]:
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = nlp(u'I love coffee') # original Doc
print(doc.vocab.strings[u'coffee'])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee' 👍

empty_doc = Doc(Vocab())  # new Doc with empty Vocab
# empty_doc.vocab.strings[3197928453018144401] will raise an error :(

empty_doc.vocab.strings.add(u'coffee')  # add "coffee" and generate hash
print(empty_doc.vocab.strings[3197928453018144401])  # 'coffee' 👍

new_doc = Doc(doc.vocab)  # create new doc with first doc's vocab
print(new_doc.vocab.strings[3197928453018144401])  # 'coffee' 👍

3197928453018144401
coffee
coffee
coffee


### Serialization

If you've been modifying the pipeline, vocabulary, vectors and entities, or made updates to the model, you'll eventually want to save your progress – for example, everything that's in your nlp object. This means you'll have to translate its contents and structure into a format that can be saved, like a file or a byte string. This process is called serialization. spaCy comes with built-in serialization methods and supports the Pickle protocol.

WHAT'S PICKLE?
Pickle is Python's built-in object persistence system. It lets you transfer arbitrary Python objects between processes. This is usually used to save an object to disk and load it back, but it's also used for distributed computing, e.g. with PySpark or Dask. When you unpickle an object, you're agreeing to execute whatever code it contains. It's like calling eval() on a string – so don't unpickle objects from untrusted sources.
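
A minimal round trip with the standard `pickle` module looks like the sketch below. The `state` dictionary is a hypothetical stand-in for whatever object needs saving. It is not a spaCy object, and nothing here uses spaCy's own serialization methods.

```python
import pickle

# Hypothetical stand-in for an object worth persisting.
state = {"name": "demo", "labels": ["ORG", "PRODUCT"]}

# Serialize to a byte string; this could be written to a file or sent
# to another process.
data = pickle.dumps(state)

# Deserialize. Only do this with bytes from a source you trust.
restored = pickle.loads(data)

print(type(data).__name__, restored == state)
```

The restored object is an equal but separate copy, so changing it does not affect the original.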
