# AnTeDe Lab A: Understanding PoS tags

## Session goal

The goal of this session is to help you become familiar with PoS tags. We begin by importing the NLTK fragments of the Brown corpus and of the Wall Street Journal (Penn Treebank) corpus.


In [1]:
import nltk

nltk.download("brown")
nltk.download("treebank")
nltk.download("universal_tagset")
from nltk.corpus import brown
from nltk.corpus import treebank

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Ruben\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\Ruben\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\Ruben\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


Complete the inner loop of the following function as directed by the comments.

For a given word (token), we study the distribution of PoS tags over a given corpus and corpus tagset.


In [5]:
def get_ground_truth_distribution(token, corpus, corpus_tagset=""):
    sentences = corpus.tagged_sents(tagset=corpus_tagset)
    untagged_sentences = corpus.sents()

    # this is going to be a dict where each key is a tag
    tag_freq = {}

    # sent is an untagged sentence
    # sentences[i] is the corresponding tagged sentence

    for i, sent in enumerate(untagged_sentences):
        # if the token we're looking for is in sent
        if token in sent:
            # for each (token, tag) tuple in the tagged sentence
            for pair in sentences[i]:
                # pair[0] contains the current token
                # pair[1] contains the corresponding tag

                # increase tag_freq[pair[1]] by one unit
                # careful because tag_freq may not yet have a
                # key corresponding to pair[1]!

                # BEGIN_YOUR_CODE
                if pair[0] == token:
                    # Check if the tag is already in tag_freq dictionary
                    if pair[1] in tag_freq:
                        # If the tag is already there, increment its count
                        tag_freq[pair[1]] += 1
                    else:
                        # Otherwise, initialize the count of this tag to 1
                        tag_freq[pair[1]] = 1
                # END_YOUR_CODE
    return tag_freq
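
The same counting logic can also be written with `collections.Counter`, which removes the need to check whether a key already exists. The sketch below is one possible alternative, not the expected lab solution. It runs on a small hand-made list of tagged sentences (`toy_tagged_sents`, a hypothetical stand-in for `corpus.tagged_sents()`), so it needs no NLTK data; the helper name `count_tags` is likewise an assumption made for illustration.

```python
from collections import Counter

# Hypothetical toy data standing in for corpus.tagged_sents(tagset="universal")
toy_tagged_sents = [
    [("I", "PRON"), ("know", "VERB"), ("that", "ADP"), ("he", "PRON"), ("left", "VERB"), (".", ".")],
    [("Give", "VERB"), ("me", "PRON"), ("that", "DET"), ("book", "NOUN"), (".", ".")],
    [("She", "PRON"), ("said", "VERB"), ("that", "ADP"), ("that", "DET"), ("dog", "NOUN"), ("barked", "VERB"), (".", ".")],
]


def count_tags(token, tagged_sents):
    """Count how often each tag is assigned to `token`."""
    tag_freq = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            # only count pairs whose word is exactly the token we want
            if word == token:
                tag_freq[tag] += 1  # missing keys start at 0 in a Counter
    return dict(tag_freq)


toy_distribution = count_tags("that", toy_tagged_sents)
print(toy_distribution)
```

Because this version iterates over the tagged sentences directly, it does not need the parallel list of untagged sentences at all.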

In the following cells, we get the PoS tag distribution of a series of words in the Penn Treebank and the Brown corpus using the universal and the Penn tagset (for Penn) and the Brown tagset (for Brown).


In [3]:
# looking for the distribution of the tag of "that" in the treebank corpus with the universal tagset
get_ground_truth_distribution("that", treebank, "universal")


{'DET': 291, 'ADP': 513, 'ADV': 3}
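
Raw counts are easier to compare across corpora of different sizes once they are turned into proportions. The sketch below does this for the counts shown above; the helper name `to_relative_frequencies` is an assumption made for illustration, not part of the lab code.

```python
# Counts copied from the output above ("that" in the Treebank sample, universal tagset)
that_treebank_universal = {"DET": 291, "ADP": 513, "ADV": 3}


def to_relative_frequencies(tag_freq):
    """Turn raw tag counts into proportions that sum to 1."""
    total = sum(tag_freq.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in tag_freq.items()}


shares = to_relative_frequencies(that_treebank_universal)

# print the tags from most to least frequent
for tag, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{tag}: {share:.1%}")
```

With these counts, `ADP` accounts for about 64% of the occurrences of "that", and `ADV` for well under 1%.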

In [4]:
# looking for the distribution of the tag of "that" in the treebank corpus with its native (Penn) tagset
get_ground_truth_distribution("that", treebank)


{'WDT': 214, 'IN': 513, 'DT': 77, 'RB': 3}
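
The Penn counts and the universal counts for "that" describe the same occurrences at two levels of granularity. As a sanity check, the sketch below collapses the Penn counts into universal tags and compares the result with the universal output. The dictionary `penn_to_universal` is a hand-written excerpt covering only the four tags seen here, on the assumption that it mirrors the mapping NLTK applies; it is not the full mapping.

```python
from collections import Counter

# Counts copied from the two Treebank outputs for "that"
penn_counts = {"WDT": 214, "IN": 513, "DT": 77, "RB": 3}
universal_counts = {"DET": 291, "ADP": 513, "ADV": 3}

# Hand-written excerpt (assumption): only the four Penn tags observed above
penn_to_universal = {"WDT": "DET", "DT": "DET", "IN": "ADP", "RB": "ADV"}

collapsed = Counter()
for penn_tag, count in penn_counts.items():
    # add each Penn count to the universal tag it maps to
    collapsed[penn_to_universal[penn_tag]] += count

print(dict(collapsed))
print(dict(collapsed) == universal_counts)
```

The check shows where information is lost: the universal tag `DET` merges the relative use of "that" (`WDT`) with its use as a plain determiner (`DT`).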

In [8]:
get_ground_truth_distribution("that", brown, "universal")


{'ADP': 6422, 'PRON': 1779, 'DET': 1981, 'ADV': 54, 'X': 1}

In [5]:
# looking for the distribution of the tag of "that" in the brown corpus with the universal tagset
get_ground_truth_distribution("that", brown, "universal")


{'ADP': 6422,
 'PRON': 1779,
 'DET': 1981,
 'ADV': 54,
 'PRT': 9,
 'X': 1,
 'NOUN': 1}

The following function gives you examples for a specific combination of token and tag (over a given corpus and tagset).


In [9]:
def get_ground_truth_example(token, corpus, tag, corpus_tagset=""):
    sentences = corpus.tagged_sents(tagset=corpus_tagset)
    untagged_sentences = corpus.sents()
    count = 0

    for i, sent in enumerate(untagged_sentences):
        # only look at sentences that contain the token
        if token in sent:
            text = ""
            visualize = False
            for pair in sentences[i]:
                # skip empty elements (tagged -NONE- in the treebank)
                if "NONE" not in pair[1]:
                    text = text + " " + pair[0]
                # note: both tests are substring matches, not equality tests
                if (token in pair[0]) and (tag in pair[1]):
                    visualize = True
            if visualize:
                count = count + 1
                # print the running count and the plain text,
                # followed by the tagged sentence
                print(str(count) + " " + text)
                print(str(sentences[i]))
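
The function above leaves out of the printed text every token whose tag contains `NONE`. In the treebank these are empty elements such as `*T*-1`, which appear in the tagged sentence but not in the readable text. The sketch below isolates that filtering step on a fragment of the first treebank sentence shown in the output for `WDT`; the helper name `visible_text` is an assumption made for illustration.

```python
# Fragment of the first tagged treebank sentence from the "WDT" example output
tagged_fragment = [
    ("causing", "VBG"), ("symptoms", "NNS"), ("that", "WDT"), ("*T*-1", "-NONE-"),
    ("show", "VBP"), ("up", "RP"), ("decades", "NNS"), ("later", "JJ"),
]


def visible_text(tagged_sent):
    """Join the tokens of a tagged sentence, skipping -NONE- empty elements."""
    return " ".join(word for word, tag in tagged_sent if "NONE" not in tag)


fragment_text = visible_text(tagged_fragment)
print(fragment_text)
```

Unlike the string concatenation in the function above, `" ".join(...)` does not produce a leading space.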

In [10]:
get_ground_truth_example("that", treebank, "WDT")

1  The asbestos fiber , crocidolite , is unusually resilient once it enters the lungs , with even brief exposures to it causing symptoms that show up decades later , researchers said .
[('The', 'DT'), ('asbestos', 'NN'), ('fiber', 'NN'), (',', ','), ('crocidolite', 'NN'), (',', ','), ('is', 'VBZ'), ('unusually', 'RB'), ('resilient', 'JJ'), ('once', 'IN'), ('it', 'PRP'), ('enters', 'VBZ'), ('the', 'DT'), ('lungs', 'NNS'), (',', ','), ('with', 'IN'), ('even', 'RB'), ('brief', 'JJ'), ('exposures', 'NNS'), ('to', 'TO'), ('it', 'PRP'), ('causing', 'VBG'), ('symptoms', 'NNS'), ('that', 'WDT'), ('*T*-1', '-NONE-'), ('show', 'VBP'), ('up', 'RP'), ('decades', 'NNS'), ('later', 'JJ'), (',', ','), ('researchers', 'NNS'), ('said', 'VBD'), ('0', '-NONE-'), ('*T*-2', '-NONE-'), ('.', '.')]
2  Lorillard Inc. , the unit of New York-based Loews Corp. that makes Kent cigarettes , stopped using crocidolite in its Micronite cigarette filters in 1956 .
[('Lorillard', 'NNP'), ('Inc.', 'NNP'), (',', ','), ('

In [11]:
get_ground_truth_example(
    token="that", corpus=brown, tag="ADV", corpus_tagset="universal"
)


1  While the city council suggested that the Legislative Council might perform the review , Mr. Notte said that instead he will take up the matter with Atty. Gen. J. Joseph Nugent to get `` the benefit of his views '' .
[('While', 'ADP'), ('the', 'DET'), ('city', 'NOUN'), ('council', 'NOUN'), ('suggested', 'VERB'), ('that', 'ADP'), ('the', 'DET'), ('Legislative', 'ADJ'), ('Council', 'NOUN'), ('might', 'VERB'), ('perform', 'VERB'), ('the', 'DET'), ('review', 'NOUN'), (',', '.'), ('Mr.', 'NOUN'), ('Notte', 'NOUN'), ('said', 'VERB'), ('that', 'ADV'), ('instead', 'ADV'), ('he', 'PRON'), ('will', 'VERB'), ('take', 'VERB'), ('up', 'PRT'), ('the', 'DET'), ('matter', 'NOUN'), ('with', 'ADP'), ('Atty.', 'NOUN'), ('Gen.', 'ADJ'), ('J.', 'NOUN'), ('Joseph', 'NOUN'), ('Nugent', 'NOUN'), ('to', 'PRT'), ('get', 'VERB'), ('``', '.'), ('the', 'DET'), ('benefit', 'NOUN'), ('of', 'ADP'), ('his', 'DET'), ('views', 'VERB'), ("''", '.'), ('.', '.')]
2  In 1961 , it is estimated that multiple unit dwellings