# AnTeDe Lab B: using PoS taggers

## Session goal

The goal of this session is to help you familiarize yourself with PoS tagging. We'll be using NLTK, Stanza, and spaCy.

**For spaCy, in addition to `pip install spacy`, you'll need to run `python -m spacy download en_core_web_sm`.**
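
Before running the lab, it can help to confirm that everything is importable. The sketch below uses only the standard library. The helper name `is_installed` is ours, and the check assumes the spaCy model was installed as a regular Python package, which is how `spacy download` installs it.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Packages this lab relies on; en_core_web_sm is the spaCy English model.
for name in ["nltk", "stanza", "spacy", "en_core_web_sm", "pandas"]:
    status = "ok" if is_installed(name) else "MISSING"
    print(f"{name:15s} {status}")
```

Any line marked `MISSING` points at a package to install before moving on.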


In [1]:
import stanza

import nltk
from nltk.tag import PerceptronTagger
from nltk.tokenize import word_tokenize

import spacy


stanza.download("en")

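nltk.download("averaged_perceptron_tagger")
# word_tokenize also needs the Punkt tokenizer models; recent NLTK releases
# may ask for differently named resources (follow the LookupError message).
nltk.download("punkt")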

stanza_pipeline = stanza.Pipeline(lang="en", processors="tokenize,mwt,pos,lemma")
spacy_analyzer = spacy.load("en_core_web_sm")


def run_stanza(text):
    pairs = []
    doc = stanza_pipeline(text)
    for sent in doc.sentences:
        for word in sent.words:
            pairs.append((word.text, word.xpos))
    return pairs


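def run_spacy(text):
    doc = spacy_analyzer(text)
    # For each token in the doc, return a (text, tag) tuple. See https://spacy.io/api/doc
    # token.tag_ is the fine-grained (Penn Treebank) tag, comparable with Stanza's
    # xpos and NLTK's tags; token.pos_ would give the coarse Universal PoS tag instead.
    # BEGIN_YOUR_CODE
    return [(token.text, token.tag_) for token in doc]
    # END_YOUR_CODE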


def run_nltk(text):
    tagger = PerceptronTagger()
    return tagger.tag(word_tokenize(text))

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

2024-04-13 21:52:18 INFO: Downloaded file to C:\Users\Ruben\stanza_resources\resources.json
2024-04-13 21:52:18 INFO: Downloading default packages for language: en (English) ...
2024-04-13 21:52:21 INFO: File exists: C:\Users\Ruben\stanza_resources\en\default.zip
2024-04-13 21:52:26 INFO: Finished downloading models and saved to C:\Users\Ruben\stanza_resources
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Ruben\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
2024-04-13 21:52:27 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

2024-04-13 21:52:27 INFO: Downloaded file to C:\Users\Ruben\stanza_resources\resources.json
2024-04-13 21:52:28 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |

2024-04-13 21:52:28 INFO: Using device: cpu
2024-04-13 21:52:28 INFO: Loading: tokenize
2024-04-13 21:52:32 INFO: Loading: mwt
2024-04-13 21:52:32 INFO: Loading: pos
2024-04-13 21:52:32 INFO: Loading: lemma
2024-04-13 21:52:32 INFO: Done loading processors!


## Visualization

The function below is used to compare the output of the three PoS taggers.


In [2]:
def visualize_pos_results(text):
    """Show the tags assigned by the three taggers side by side."""
    import pandas as pd
    from IPython.display import display

    stanza_pairs = run_stanza(text)
    spacy_pairs = run_spacy(text)
    nltk_pairs = run_nltk(text)

    if len(stanza_pairs) == len(spacy_pairs) == len(nltk_pairs):
        # Same number of tokens: assume the tokenizations line up and build one table.
        df = pd.DataFrame(
            {
                "tokens": [token for token, _ in stanza_pairs],
                "Stanza": [tag for _, tag in stanza_pairs],
                "NLTK": [tag for _, tag in nltk_pairs],
                "Spacy": [tag for _, tag in spacy_pairs],
            }
        )
        display(df)
    else:
        # The tokenizers disagree, so rows cannot be aligned: print each result separately.
        for name, pairs in [("Stanza", stanza_pairs), ("NLTK", nltk_pairs), ("Spacy", spacy_pairs)]:
            print("-" * 30)
            print(name)
            print(pairs)

The following cell shows that, under controlled conditions, the three PoS taggers produce similar results.

You can find the Penn Treebank PoS tagset [here](https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html).
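
Penn Treebank tags are fine-grained, while some tools also report coarser Universal PoS tags. As a rough illustration of how the two relate, the sketch below collapses a few Penn tags into coarse categories. The table `PENN_TO_UNIVERSAL` and the helper `to_universal` are hypothetical names. The mapping is deliberately partial and follows one plausible convention, and other conventions differ for some tags. A tag-only mapping also ignores context, so it cannot tell an auxiliary "is" from a main-verb "is".

```python
# Partial, illustrative mapping from Penn Treebank tags to Universal PoS tags.
# It only covers a handful of tags and is an assumption, not an official table.
PENN_TO_UNIVERSAL = {
    "DT": "DET",
    "NN": "NOUN",
    "MD": "AUX",
    "VB": "VERB",
    "VBD": "VERB",
    "PRP": "PRON",
    "TO": "PART",
    "RP": "ADP",
    "JJ": "ADJ",
    "RB": "ADV",
    ".": "PUNCT",
}


def to_universal(pairs):
    """Map (token, penn_tag) pairs to (token, coarse_tag); unknown tags become 'X'."""
    return [(token, PENN_TO_UNIVERSAL.get(tag, "X")) for token, tag in pairs]


print(to_universal([("The", "DT"), ("can", "NN"), ("can", "MD"), ("hold", "VB")]))
```

Collapsing tags this way makes outputs from different tagsets easier to compare, at the cost of losing distinctions such as `VB` versus `VBD`.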


In [11]:
visualize_pos_results("The can can hold water")
visualize_pos_results("I asked them to back me up.")

| | tokens | Stanza | NLTK | Spacy |
|---|---|---|---|---|
| 0 | The | DT | DT | DT |
| 1 | can | NN | MD | MD |
| 2 | can | MD | MD | MD |
| 3 | hold | VB | VB | VB |
| 4 | water | NN | NN | NN |


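| | tokens | Stanza | NLTK | Spacy |
|---|---|---|---|---|
| 0 | I | PRP | PRP | PRP |
| 1 | asked | VBD | VBD | VBD |
| 2 | them | PRP | PRP | PRP |
| 3 | to | TO | TO | TO |
| 4 | back | VB | VB | VB |
| 5 | me | PRP | PRP | PRP |
| 6 | up | RP | RP | RP |
| 7 | . | . | . | . |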


There are cases where the taggers disagree. To see some of them, run the following cell.
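
When the tables get longer, spotting disagreements by eye becomes tedious. A minimal sketch of a helper that lists only the positions where the taggers differ is given below. The function name `find_disagreements` is hypothetical, and it assumes the tag lists are already aligned token by token. The sample tags are copied from the "The can can hold water" output above.

```python
def find_disagreements(tokens, tags_by_tagger):
    """Return (index, token, {tagger: tag}) for positions where the taggers differ.

    `tags_by_tagger` maps a tagger name to its list of tags, aligned with `tokens`.
    """
    disagreements = []
    for i, token in enumerate(tokens):
        tags_here = {name: tags[i] for name, tags in tags_by_tagger.items()}
        # More than one distinct tag at this position means a disagreement.
        if len(set(tags_here.values())) > 1:
            disagreements.append((i, token, tags_here))
    return disagreements


tokens = ["The", "can", "can", "hold", "water"]
tags = {
    "Stanza": ["DT", "NN", "MD", "VB", "NN"],
    "NLTK": ["DT", "MD", "MD", "VB", "NN"],
    "Spacy": ["DT", "MD", "MD", "VB", "NN"],
}
for index, token, tags_here in find_disagreements(tokens, tags):
    print(index, token, tags_here)
```

Here only the first "can" is reported, since Stanza tags it as a noun while the other two tag it as a modal.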


What's happening in the following example? Which PoS tagger does better? (answer in the provided comment section)


In [12]:
sentences = [
    "An experienced man should always man the ship",
]

for sentence in sentences:
    visualize_pos_results(sentence)


# BEGIN_YOUR_COMMENT

# END_YOUR_COMMENT

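| | tokens | Stanza | NLTK | Spacy |
|---|---|---|---|---|
| 0 | An | DT | DT | DT |
| 1 | experienced | JJ | JJ | JJ |
| 2 | man | NN | NN | NN |
| 3 | should | MD | MD | MD |
| 4 | always | RB | RB | RB |
| 5 | man | VB | NN | VB |
| 6 | the | DT | DT | DT |
| 7 | ship | NN | NN | NN |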


What's happening in the following examples? Elaborate on the differences between the three PoS taggers, highlighting the difficulties of PoS tagging.


In [13]:
sentences = [
    "That much is true.",
    "I don't know that much.",
]
for sentence in sentences:
    visualize_pos_results(sentence)

# BEGIN_YOUR_COMMENT


# END_YOUR_COMMENT

| | tokens | Stanza | NLTK | Spacy |
|---|---|---|---|---|
| 0 | That | DT | DT | DT |
| 1 | much | RB | JJ | JJ |
| 2 | is | VBZ | VBZ | VBZ |
| 3 | true | JJ | JJ | JJ |
| 4 | . | . | . | . |


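| | tokens | Stanza | NLTK | Spacy |
|---|---|---|---|---|
| 0 | I | PRP | PRP | PRP |
| 1 | do | VBP | VBP | VBP |
| 2 | n't | RB | RB | RB |
| 3 | know | VB | VB | VB |
| 4 | that | RB | RB | DT |
| 5 | much | RB | JJ | JJ |
| 6 | . | . | . | . |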

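
One way to put a number on such differences is the fraction of tokens on which each pair of taggers agrees. The sketch below is a hedged illustration: `pairwise_agreement` is a hypothetical helper, it assumes the tag sequences are non-empty and aligned token by token, and the sample tags are copied from the "I don't know that much." output above.

```python
from itertools import combinations


def pairwise_agreement(tags_by_tagger):
    """Return {(tagger_a, tagger_b): fraction of positions with identical tags}."""
    scores = {}
    for (name_a, tags_a), (name_b, tags_b) in combinations(tags_by_tagger.items(), 2):
        if len(tags_a) != len(tags_b):
            raise ValueError("tag sequences must be aligned token by token")
        matches = sum(a == b for a, b in zip(tags_a, tags_b))
        scores[(name_a, name_b)] = matches / len(tags_a)
    return scores


# Tags copied from the output for "I don't know that much."
tags = {
    "Stanza": ["PRP", "VBP", "RB", "VB", "RB", "RB", "."],
    "NLTK": ["PRP", "VBP", "RB", "VB", "RB", "JJ", "."],
    "Spacy": ["PRP", "VBP", "RB", "VB", "DT", "JJ", "."],
}
for pair, score in pairwise_agreement(tags).items():
    print(pair, round(score, 2))
```

Agreement is not accuracy: two taggers can agree on a wrong tag, so a gold-standard annotation would still be needed to say which tagger is right.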