  <div>
    <h1 align="center">Exercise 04 - Medical Information Retrieval 2023</h1>
  </div>
  <br />

## NLP Pipeline - Part 3 <a class="anchor" id="first"></a>

Today's session covers the last step in our preprocessing pipeline: part-of-speech (POS) tagging.

### Part-of-Speech tagging
Some words have an entirely different meaning depending on the context in which they are used. Take the word "watch" in "I watch a movie" vs. "I look at my watch": it is a verb in the first sentence and a noun in the second. Part-of-speech tagging addresses this ambiguity by assigning each word its part of speech based on the surrounding context.

To get more familiar with POS tagging please perform the following steps:

* Import spacy and load `en_core_web_sm`. Create a simple doc object containing an English sentence. 
* POS tags come in two levels: coarse tags (noun, verb, adjective) and fine-grained tags (plural noun, past-tense verb, superlative adjective). Print the sentence stored in the doc object. Then choose a word from your sentence and display its coarse tag and its fine-grained tag (hint: index the doc object by the position of the word). Finally, display the description of both tags.
* Expand this technique to the whole sentence: For each word please display the coarse tag, the fine-grained tag and the description of the fine-grained tag. 
* Use `displacy` to display the dependency parse of your sentence.
* Compare the results for the word `read` in the sentences `I read books on NLP.` and `I read a book on NLP.` - what is different and why? 
* Count the frequency of each POS you can find in your doc object. Display the different POS types with their corresponding frequency. 

Now load the story of Emma in Lübeck (`example_text.txt`, available on Moodle) and repeat some of the above steps for the whole text:

* Create a doc object from the story. 
* For every token in the third sentence, print the token text, the coarse POS tag, the fine-grained tag, and the description of the fine-grained tag.
* Provide a frequency list of POS tags from the entire document. 
* What percentage of tokens are verbs?
* How many sentences are contained in the story? 

In [6]:
### your code ###
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Laura can\'t join us tonight, because she has to go to the hairdresser. Cringe!')


In [7]:
# Print the coarse tag (pos_) and the fine-grained tag (tag_) of every token.
for token in doc:
    print(token.text, '|', 'coarse tag:', token.pos_, '| fine-grained tag:', token.tag_)

Laura | coarse tag: PROPN | fine-grained tag: NNP
ca | coarse tag: AUX | fine-grained tag: MD
n't | coarse tag: PART | fine-grained tag: RB
join | coarse tag: VERB | fine-grained tag: VB
us | coarse tag: PRON | fine-grained tag: PRP
tonight | coarse tag: NOUN | fine-grained tag: NN
, | coarse tag: PUNCT | fine-grained tag: ,
because | coarse tag: SCONJ | fine-grained tag: IN
she | coarse tag: PRON | fine-grained tag: PRP
has | coarse tag: VERB | fine-grained tag: VBZ
to | coarse tag: PART | fine-grained tag: TO
go | coarse tag: VERB | fine-grained tag: VB
to | coarse tag: ADP | fine-grained tag: IN
the | coarse tag: DET | fine-grained tag: DT
hairdresser | coarse tag: NOUN | fine-grained tag: NN
. | coarse tag: PUNCT | fine-grained tag: .
Cringe | coarse tag: NOUN | fine-grained tag: NN
! | coarse tag: PUNCT | fine-grained tag: .


In [8]:
spacy.explain(doc[5].tag_)

'noun, singular or mass'

In [9]:
doc_1 = nlp('Yesterday, he read books on NLP')
doc_2 = nlp('I read a book on NLP')

In [10]:
print('First Sentence:', spacy.explain(doc_1[3].tag_))
print('Second Sentence:', spacy.explain(doc_2[1].tag_))

First Sentence: verb, past tense
Second Sentence: verb, past tense


In [12]:
displacy.serve(doc)




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
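
The task of counting how often each POS occurs can be sketched with `collections.Counter`. This is a minimal sketch, not the reference solution: `pos_frequencies` is a hypothetical helper, and the `Token` tuple is only a stand-in so that the example runs without a language model. The coarse tags are copied from the output of the example sentence above. Since the helper only reads `pos_`, it should work the same way when a real spaCy `doc` is passed in.

```python
from collections import Counter, namedtuple

# Stand-in for a spaCy token: only the `pos_` attribute is used below.
# With spaCy available, pass the real `doc` object instead of this list.
Token = namedtuple("Token", ["text", "pos_"])


def pos_frequencies(tokens):
    """Return (tag, count) pairs, most frequent first (hypothetical helper)."""
    return Counter(token.pos_ for token in tokens).most_common()


# Coarse tags copied from the output of the example sentence.
tokens = [
    Token("Laura", "PROPN"), Token("ca", "AUX"), Token("n't", "PART"),
    Token("join", "VERB"), Token("us", "PRON"), Token("tonight", "NOUN"),
    Token(",", "PUNCT"), Token("because", "SCONJ"), Token("she", "PRON"),
    Token("has", "VERB"), Token("to", "PART"), Token("go", "VERB"),
    Token("to", "ADP"), Token("the", "DET"), Token("hairdresser", "NOUN"),
    Token(".", "PUNCT"), Token("Cringe", "NOUN"), Token("!", "PUNCT"),
]

frequencies = pos_frequencies(tokens)

# Display every POS type with its frequency.
for tag, count in frequencies:
    print(f"{tag:<6} {count}")
```

spaCy also provides `Doc.count_by`, which returns the counts keyed by integer attribute IDs instead of tag names, so the IDs have to be mapped back to strings before display.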


## Exercise 2:

In [13]:
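# Read the whole story; the context manager closes the file automatically.
# Assumption: the file is UTF-8 encoded.
with open('example_text.txt', 'r', encoding='utf-8') as f:
    text = f.read()

doc_emma = nlp(text)

# Print every sentence of the story (not of the earlier example doc).
for sent in doc_emma.sents: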
    print(sent)
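
`doc.sents` is a generator, so a specific sentence cannot be picked with an index such as `doc_emma.sents[2]`. One option is to convert it to a list first; another is to skip ahead with `itertools.islice`, as in the sketch below. `nth_item` is a hypothetical helper, and `fake_sents` merely imitates a sentence generator so that the example runs without a language model.

```python
from itertools import islice


def nth_item(iterable, n, default=None):
    """Return the item at zero-based position n, or default if there are too few."""
    return next(islice(iterable, n, None), default)


# Stand-in for `doc_emma.sents`: a generator that yields sentences one by one.
def fake_sents():
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."


# The third sentence has index 2.
third = nth_item(fake_sents(), 2)
print(third)
```

With a real document, the call would presumably read `nth_item(doc_emma.sents, 2)`. Converting to a list is simpler when the sentences are needed more than once, for example to count them as well.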

In [20]:
# Collect the sentences once so they can be indexed and counted.
sentences = list(doc_emma.sents)

# Print the tags of every token in the third sentence (index 2).
for token in sentences[2]:
    print(token.text,'|', 'coarse tag:', token.pos_, '| fine grained coarse tag:', token.tag_, '| describtion of tag:', spacy.explain(token.tag_))

# Count the tokens tagged as VERB across the whole document.
verbs = sum(1 for token in doc_emma if token.pos_ == 'VERB')

print('Percentage of Verbs:', (verbs/len(doc_emma))*100)
print('Number of sentences in file:', len(sentences))

Emma | coarse tag: PROPN | fine grained coarse tag: NNP | describtion of tag: noun, proper singular
is | coarse tag: AUX | fine grained coarse tag: VBZ | describtion of tag: verb, 3rd person singular present
an | coarse tag: DET | fine grained coarse tag: DT | describtion of tag: determiner
avid | coarse tag: ADJ | fine grained coarse tag: JJ | describtion of tag: adjective (English), other noun-modifier (Chinese)
lover | coarse tag: NOUN | fine grained coarse tag: NN | describtion of tag: noun, singular or mass
of | coarse tag: ADP | fine grained coarse tag: IN | describtion of tag: conjunction, subordinating or preposition
history | coarse tag: NOUN | fine grained coarse tag: NN | describtion of tag: noun, singular or mass
and | coarse tag: CCONJ | fine grained coarse tag: CC | describtion of tag: conjunction, coordinating
culture | coarse tag: NOUN | fine grained coarse tag: NN | describtion of tag: noun, singular or mass
, | coarse tag: PUNCT | fine grained coarse tag: , | describt