<a href="https://colab.research.google.com/github/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/04-information-extraction/1_part_of_speech_tagging_with_spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Part-of-speech tagging with spaCy

Unlike NLTK, which treats different components of language analysis as separate steps, spaCy
builds an analysis pipeline from the very beginning and applies this pipeline to text. Under
the hood, the pipeline already includes a number of useful NLP tools that run on the input
text without you needing to call them separately.

These tools include, among others, a
tokenizer and a POS tagger. You simply apply the whole lot of tools with a single line of code
calling on the spaCy processing pipeline, and then your program stores the result in a
convenient format until you need it. This also ensures that the information is passed between
the tools without you taking care of the input-output formats.

<img src='images/1.png?raw=1' width='800'/>

Each of the three tools –
tokenizer, stemmer, POS tagger – requires a different type of input and produces a different
type of output, so to apply them in sequence we need to know how to represent
information for each of them. That is what spaCy’s processing pipeline does for you: it runs a
sequence of tools and connects their outputs together.

You may notice that the processing tools are contained within a pipeline
called `nlp`. As you will shortly see in the code, calling the `nlp` pipeline makes the program
first invoke all the pre-trained tools and then apply them to the input text. The output of all
the steps gets stored in a “container” called `Doc` – it contains a sequence of tokens extracted
from the input text and processed with the tools. Here is where spaCy’s implementation comes
close to object-oriented programming: the tokens are represented as `Token` objects with a
specific set of attributes.



##Setup

In [1]:
import os
import math
import random
import string

import spacy
from spacy import displacy
from pathlib import Path

In [None]:
!python -m spacy download en_core_web_sm

##POS tagging

This is the approach spaCy takes to represent tokens in text: after tokenization, each
token (word) is packed into a `Token` object that has a number of attributes:
- `token.text` contains the original word itself;
- `token.lemma_` stores the lemma (base form) of the word;
- `token.pos_` – its part-of-speech tag;
- `token.i` – the index position of the word in text;
- `token.lower_` – lowercase form of the word, and so on.

The nlp pipeline aims to fill in the information fields like lemma, pos and others with the
values specific for each particular token. Since different tools within the pipeline provide
different bits of information, the values for the attributes are added on the go.
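
To make the idea of attributes being filled in step by step more concrete, here is a minimal toy sketch in plain Python. It is not spaCy's implementation: `ToyToken`, `toy_tokenizer`, `toy_lowercaser`, `toy_tagger`, `toy_nlp` and the tiny `TOY_TAGS` lookup table are all hypothetical names invented for illustration. The sketch only shows how each component in a pipeline can add its own piece of information to the same token objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyToken:
    """A toy stand-in for a token object (not spaCy's Token class)."""
    text: str
    i: int
    lower_: Optional[str] = None  # filled in by toy_lowercaser
    pos_: Optional[str] = None    # filled in by toy_tagger


def toy_tokenizer(text: str) -> List[ToyToken]:
    # First step: split on whitespace and record each word's position
    return [ToyToken(text=word, i=idx) for idx, word in enumerate(text.split())]


def toy_lowercaser(tokens: List[ToyToken]) -> List[ToyToken]:
    # Second step: add the lowercase form to every token
    for token in tokens:
        token.lower_ = token.text.lower()
    return tokens


# Hypothetical lookup table standing in for a trained tagger
TOY_TAGS = {"on": "ADP", "friday": "PROPN", "members": "NOUN", "meet": "VERB"}


def toy_tagger(tokens: List[ToyToken]) -> List[ToyToken]:
    # Third step: add a tag, relying on the attribute set by the previous step
    for token in tokens:
        token.pos_ = TOY_TAGS.get(token.lower_, "X")
    return tokens


def toy_nlp(text: str) -> List[ToyToken]:
    # Run the components in order; each one enriches the same token objects
    tokens = toy_tokenizer(text)
    for component in (toy_lowercaser, toy_tagger):
        tokens = component(tokens)
    return tokens


toy_doc = toy_nlp("On Friday members meet")
for token in toy_doc:
    print(token.i, token.text, token.lower_, token.pos_)
```

The caller never has to convert one component's output into the next component's input, which is the convenience the pipeline design is meant to provide.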

<img src='images/2.png?raw=1' width='800'/>

Now, there are a couple of points that we did not get to discuss before. Imagine there are a hundred documents in total, and you want to quickly skim through them to filter out the most irrelevant ones: those that mention neither “meetings” nor “management”.




In [4]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("On Friday board members meet with senior managers to discuss future development of the company.")

In [6]:
rows = []
rows.append(["Word", "Position", "Lowercase", "Lemma", "POS", "Alphanumeric", "Stopword"])
# Add the attributes of each token in the processed text to the output for printing
for token in doc:
  rows.append([token.text, str(token.i), token.lower_, token.lemma_, token.pos_, str(token.is_alpha), str(token.is_stop)])

# Each column contains strings of variable length, so compute the maximum string length per column to leave enough space in the printout
columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]
# Print every row with each cell padded to the width of its column
for row in rows:
  print("".join(' {:{width}} '.format(item, width=width) for item, width in zip(row, column_widths)))

 Word         Position  Lowercase    Lemma        POS    Alphanumeric  Stopword 
 On           0         on           on           ADP    True          True     
 Friday       1         friday       Friday       PROPN  True          False    
 board        2         board        board        NOUN   True          False    
 members      3         members      member       NOUN   True          False    
 meet         4         meet         meet         VERB   True          False    
 with         5         with         with         ADP    True          True     
 senior       6         senior       senior       ADJ    True          False    
 managers     7         managers     manager      NOUN   True          False    
 to           8         to           to           PART   True          True     
 discuss      9         discuss      discuss      VERB   True          False    
 future       10        future       future       ADJ    True          False    
 development  11        deve

The POS tagging algorithm, similarly, takes into account two types of expectations: an
expectation that a certain type of word (like a modal verb) may follow a certain other type of
word (like a pronoun), and an expectation that, if the word is a modal verb, that verb may be
“can”. These “expectations” are calculated from data: for example, to find out how likely
it is that a modal verb follows a pronoun, we calculate the proportion of times we see a
modal verb following a pronoun among all the cases where we saw a pronoun.

For
instance, if we saw 10 pronouns like “I” and “we” in data before, and 5 times out of those 10
these pronouns were followed by a modal verb like “can” or “may” (as in “I can” and “we
may”), what would the likelihood, or probability, of seeing a modal verb following a pronoun
be?

<img src='images/3.png?raw=1' width='800'/>

We can calculate it as:
- Probability(modal verb follows pronoun) = 5 / 10

or, in the general case:
- Probability(modal verb follows pronoun) = How_often(pronoun is followed by modal verb) / How_often(pronoun is followed by any type of word, modal verb or not)
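
The calculation above can be sketched as a simple count over tagged data. The ten two-word examples below are made up to match the numbers in the text (10 pronouns, 5 of them followed by a modal verb); the tag labels `PRON`, `MODAL` and `VERB` and the function name `transition_probability` are assumptions for illustration, not part of spaCy.

```python
# Hypothetical hand-tagged data: 10 pronouns, 5 of them followed by a modal verb
tagged_sentences = [
    [("I", "PRON"), ("can", "MODAL")],
    [("we", "PRON"), ("may", "MODAL")],
    [("I", "PRON"), ("can", "MODAL")],
    [("I", "PRON"), ("may", "MODAL")],
    [("we", "PRON"), ("can", "MODAL")],
    [("I", "PRON"), ("run", "VERB")],
    [("I", "PRON"), ("see", "VERB")],
    [("I", "PRON"), ("swim", "VERB")],
    [("I", "PRON"), ("eat", "VERB")],
    [("we", "PRON"), ("go", "VERB")],
]


def transition_probability(sentences, prev_tag, next_tag):
    """Estimate how often next_tag follows prev_tag, as a proportion of counts."""
    prev_count = 0  # times prev_tag is followed by any tag
    pair_count = 0  # times prev_tag is followed by next_tag
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        for first, second in zip(tags, tags[1:]):
            if first == prev_tag:
                prev_count += 1
                if second == next_tag:
                    pair_count += 1
    # Avoid dividing by zero when prev_tag never appears before another word
    return pair_count / prev_count if prev_count else 0.0


p_modal_after_pronoun = transition_probability(tagged_sentences, "PRON", "MODAL")
print(p_modal_after_pronoun)  # 5 / 10
```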

Similarly, to estimate how likely (or how probable) it is that the pronoun is “I”, we need to
take the number of times we’ve seen the pronoun “I” and divide it by the number of times
we’ve seen any pronoun in the data. So, if among those 10 pronouns that we’ve seen in the
data before 7 were “I” and 3 were “we”, the probability of seeing the pronoun “I” would be
estimated as follows:

<img src='images/4.png?raw=1' width='800'/>

- Probability(pronoun being “I”) = 7 / 10

or, in the general case:
- Probability(pronoun being “I”) = How_often(we’ve seen a pronoun “I”) / How_often(we’ve seen any pronoun, “I” or other)
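
The same counting idea gives the probability of a particular word given its tag. The sketch below uses made-up data that matches the numbers in the text (7 occurrences of “I” and 3 of “we”); the modal verb counts, the tag labels and the function name `emission_probability` are assumptions for illustration.

```python
# Hypothetical tagged words: 7 x "I" and 3 x "we" as pronouns, plus a few modal verbs
tagged_words = (
    [("I", "PRON")] * 7
    + [("we", "PRON")] * 3
    + [("can", "MODAL")] * 3
    + [("may", "MODAL")] * 2
)


def emission_probability(tagged, word, tag):
    """Estimate how often a word with the given tag is this particular word."""
    tag_total = sum(1 for _, t in tagged if t == tag)
    word_with_tag = sum(1 for w, t in tagged if w == word and t == tag)
    # Avoid dividing by zero when the tag never occurs in the data
    return word_with_tag / tag_total if tag_total else 0.0


p_i_given_pronoun = emission_probability(tagged_words, "I", "PRON")
print(p_i_given_pronoun)  # 7 / 10
```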

In the end, the algorithm goes through the sequence of tags and words one by one, and
takes all the probabilities into account. Since each decision, that is, each tag
and each word, is a separate step in the process, the individual probabilities are multiplied.
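
As a rough sketch of this multiplication, the code below scores two candidate tag sequences for the words “I can”. Only the values 0.5 (modal verb after pronoun) and 0.7 (pronoun being “I”) come from the examples above; every other number, the `<START>` marker, the `NOUN` alternative and the function name `sequence_probability` are hypothetical. This is a toy illustration of the idea, not spaCy's tagger.

```python
import math

# Probability of a tag given the previous tag (values other than 0.5 are made up)
transition = {
    ("<START>", "PRON"): 1.0,
    ("PRON", "MODAL"): 0.5,
    ("PRON", "NOUN"): 0.1,
}

# Probability of a word given its tag (values other than 0.7 are made up)
emission = {
    ("PRON", "I"): 0.7,
    ("MODAL", "can"): 0.6,
    ("NOUN", "can"): 0.05,
}


def sequence_probability(words, tags):
    """Multiply the tag-to-tag and tag-to-word probabilities along the sequence."""
    factors = []
    previous_tag = "<START>"
    for word, tag in zip(words, tags):
        # One factor for choosing this tag after the previous one...
        factors.append(transition.get((previous_tag, tag), 0.0))
        # ...and one factor for seeing this word under the chosen tag
        factors.append(emission.get((tag, word), 0.0))
        previous_tag = tag
    return math.prod(factors)


words = ["I", "can"]
p_modal_reading = sequence_probability(words, ["PRON", "MODAL"])
p_noun_reading = sequence_probability(words, ["PRON", "NOUN"])
print(p_modal_reading, p_noun_reading)
```

With these made-up numbers the pronoun-plus-modal reading scores higher than the pronoun-plus-noun reading, so a tagger built on this idea would prefer it.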
