# CS410 MP1 -- Getting Familiar with Text

In [None]:
%%capture
!pip install metapy pytoml

In [None]:
import metapy  # import the MeTA python bindings
# You can tell MeTA to log to stderr so you can get progress output when running long-running function calls.
metapy.log_to_stderr()

Let's create a document with some content to experiment on.

In [None]:
doc = metapy.index.Document()
doc.content("I said that I can't believe that it only costs $19.95!")

In [None]:
doc.content()

"I said that I can't believe that it only costs $19.95!"

## Tokenization

MeTA provides a stream-based interface for performing document tokenization. Each stream starts off with a Tokenizer object, and in most cases you should use the ICUTokenizer, which is aware of the Unicode standard.

In [None]:
tok = metapy.analyzers.ICUTokenizer()

Tokenizers operate on raw text and provide an Iterable that spits out the individual text tokens. Let's try running just the ICUTokenizer to see what it does.

In [None]:
tok.set_content(doc.content()) # this could be any string
tokens = [token for token in tok]
print(tokens)

['<s>', 'I', 'said', 'that', 'I', "can't", 'believe', 'that', 'it', 'only', 'costs', '$', '19.95', '!', '</s>']


One thing you will likely notice immediately is the insertion of pseudo-XML tags. These are called “sentence boundary tags”. As a side effect of tokenizing, a default-constructed ICUTokenizer detects the sentences in a document and delimits each one with these tags. Let's try tokenizing a multi-sentence document to see what that looks like.

In [None]:
doc.content("I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.")
tok.set_content(doc.content())
tokens = [token for token in tok]
print(tokens)

['<s>', 'I', 'said', 'that', 'I', "can't", 'believe', 'that', 'it', 'only', 'costs', '$', '19.95', '!', '</s>', '<s>', 'I', 'could', 'only', 'find', 'it', 'for', 'more', 'than', '$', '30', 'before', '.', '</s>']


Most of the information retrieval techniques you have likely been learning about in this class don't need to concern themselves with finding the boundaries between separate sentences in a document, but later today we'll explore a scenario where this might matter more. Let's pass a flag to the ICUTokenizer constructor to disable sentence boundary tags for now.

In [None]:
tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)
tok.set_content(doc.content())
tokens = [token for token in tok]
print(tokens)

['I', 'said', 'that', 'I', "can't", 'believe', 'that', 'it', 'only', 'costs', '$', '19.95', '!', 'I', 'could', 'only', 'find', 'it', 'for', 'more', 'than', '$', '30', 'before', '.']


As mentioned earlier, MeTA treats tokenization as a streaming process that starts with a tokenizer. It is often beneficial to modify the raw underlying tokens of a document, and thus change its representation. The “intermediate” steps in the tokenization stream are represented by objects called Filters. Each filter consumes the output of a previous filter (or a tokenizer) and modifies the tokens coming out of the stream in some way. Let's start with a simple filter that can help eliminate a lot of the noise we might encounter when tokenizing web documents: a LengthFilter.

In [None]:
tok = metapy.analyzers.LengthFilter(tok, min=2, max=30)
tok.set_content(doc.content())
tokens = [token for token in tok]
print(tokens)

['said', 'that', "can't", 'believe', 'that', 'it', 'only', 'costs', '19.95', 'could', 'only', 'find', 'it', 'for', 'more', 'than', '30', 'before']


Here, we can see that the LengthFilter is consuming our original ICUTokenizer. It modifies the token stream by only emitting tokens that are of a minimum length of 2 and a maximum length of 30. This can get rid of a lot of punctuation tokens, but also excessively long tokens such as URLs.
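To make the "each filter consumes the previous stage" idea concrete, here is a minimal plain-Python sketch that mimics the chain with generators. It is only an illustration and not how MeTA implements its filters: the names `length_filter` and `lowercase_filter` are hypothetical, and the sketch assumes both length bounds are inclusive.

```python
# A plain-Python stand-in for a tokenizer -> filter chain.
# These helpers are hypothetical and are NOT part of metapy.

def lowercase_filter(tokens):
    """Yield each incoming token in lowercase."""
    for token in tokens:
        yield token.lower()

def length_filter(tokens, min_len, max_len):
    """Yield only tokens whose length is in [min_len, max_len] (assumed inclusive)."""
    for token in tokens:
        if min_len <= len(token) <= max_len:
            yield token

# The token list printed by the tag-suppressed ICUTokenizer cell above.
raw_tokens = ['I', 'said', 'that', 'I', "can't", 'believe', 'that', 'it',
              'only', 'costs', '$', '19.95', '!', 'I', 'could', 'only',
              'find', 'it', 'for', 'more', 'than', '$', '30', 'before', '.']

# Each stage wraps the previous one, so tokens flow through lazily.
stream = length_filter(lowercase_filter(iter(raw_tokens)), min_len=2, max_len=30)
filtered = list(stream)
print(filtered)
```

Because every stage is a generator, nothing is computed until the final `list(...)` pulls tokens through the chain, which is roughly the streaming behavior described above.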

## Stopword removal and stemming

Another common trick is to remove stopwords. In MeTA, this is done using a ListFilter.

In [None]:
! wget -nc https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt  

--2022-09-01 21:19:00--  https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2747 (2.7K) [text/plain]
Saving to: ‘lemur-stopwords.txt’


2022-09-01 21:19:00 (41.6 MB/s) - ‘lemur-stopwords.txt’ saved [2747/2747]



Note: wget is a command-line tool for downloading files from URLs. A simpler alternative is to open the link in a web browser and download the file manually.

In [None]:
tok = metapy.analyzers.ListFilter(tok, "lemur-stopwords.txt", metapy.analyzers.ListFilter.Type.Reject)
tok.set_content(doc.content())
tokens = [token for token in tok]
print(tokens)

["can't", 'believe', 'costs', '19.95', 'find', '30']


Here we've downloaded a common list of stopwords and created a ListFilter to reject any tokens that occur in that list of words. You can see how much of a difference removing stopwords can make on the size of a document's token stream!
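The rejection step itself is conceptually simple. The sketch below shows the idea in plain Python; it is not MeTA's ListFilter. The helper names are hypothetical, `load_stopwords` assumes the file holds one word per line, and `sample_stopwords` is just the handful of words that the cell above removed, not the full lemur list.

```python
# Hypothetical helpers illustrating stopword rejection; not part of metapy.

def load_stopwords(path):
    """Read a stopword file into a set (assumes one word per line)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}

def reject_filter(tokens, stopwords):
    """Yield only the tokens that do not appear in the stopword set."""
    for token in tokens:
        if token not in stopwords:
            yield token

# A small illustrative subset: the words that were dropped in the output above.
sample_stopwords = {"said", "that", "it", "only", "could",
                    "for", "more", "than", "before"}

# The token list produced by the LengthFilter cell earlier.
tokens = ['said', 'that', "can't", 'believe', 'that', 'it', 'only', 'costs',
          '19.95', 'could', 'only', 'find', 'it', 'for', 'more', 'than',
          '30', 'before']

kept = list(reject_filter(tokens, sample_stopwords))
print(kept)
```

Using a set for the stopwords keeps each membership check cheap, which matters once the stream covers a whole collection rather than one sentence.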

Another common filter is a stemmer (or, relatedly, a lemmatizer). This kind of filter tries to modify individual tokens so that different inflected forms of a word all reduce to the same representation. This lets you, for example, find documents containing “run” when you search for “running” or “runs”. A common stemmer is the Porter2 stemmer, which MeTA implements. Let's try it!

In [None]:
tok = metapy.analyzers.Porter2Filter(tok)
tok.set_content(doc.content())
tokens = [token for token in tok]
print(tokens)

["can't", 'believ', 'cost', '19.95', 'find', '30']
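To see why collapsing inflected forms helps matching, here is a deliberately naive suffix-stripping sketch. It is NOT the Porter2 algorithm and will not reproduce its output in general (for instance, it leaves “believe” untouched where Porter2 gives “believ”). The function `naive_stem` and its three-suffix rule list are assumptions made purely for illustration.

```python
# A toy stemmer for illustration only; this is not Porter2.

def naive_stem(token):
    """Strip one of a few common English suffixes from a token."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Collapse a doubled final consonant, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return token

forms = ["run", "runs", "running"]
stems = [naive_stem(form) for form in forms]
print(stems)
```

All three forms map to the same string, so a query for any one of them would match documents containing the others. A real stemmer such as Porter2 applies many more rules and exceptions than this sketch.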


## N-grams

Finally, after you've got the token stream configured the way you'd like, it's time to analyze the document by consuming each token from its token stream and performing some actions based on these tokens. In the simplest case, our action can simply be counting how many times these tokens occur. For clarity, let's switch back to a simpler token stream first. We will write a token stream that tokenizes with ICUTokenizer, and then lowercases each token.

In [None]:
tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)
tok = metapy.analyzers.LowercaseFilter(tok)
tok.set_content(doc.content())
tokens = [token for token in tok]
print(tokens)

['i', 'said', 'that', 'i', "can't", 'believe', 'that', 'it', 'only', 'costs', '$', '19.95', '!', 'i', 'could', 'only', 'find', 'it', 'for', 'more', 'than', '$', '30', 'before', '.']


Now, let's count how often each individual token appears in the stream. This representation is called a “bag of words” representation, or “unigram word counts”. **In MeTA, classes that consume a token stream and emit a document representation are called Analyzers.**

In [None]:
ana = metapy.analyzers.NGramWordAnalyzer(1, tok)
print(doc.content())
unigrams = ana.analyze(doc)
print(unigrams)

I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.
{"can't": 1, 'believe': 1, 'than': 1, '$': 2, 'more': 1, 'could': 1, 'it': 2, '!': 1, 'before': 1, '.': 1, 'costs': 1, '19.95': 1, '30': 1, 'said': 1, 'find': 1, 'i': 3, 'only': 2, 'that': 2, 'for': 1}


If you noticed the name of the analyzer, you might have realized that you can count not just individual tokens, but groups of them. “Unigram” means “1-gram”, and we count individual tokens. “Bigram” means “2-gram”, and we count adjacent tokens together as a group. Let's try that now.

In [None]:
ana = metapy.analyzers.NGramWordAnalyzer(2, tok)
bigrams = ana.analyze(doc)
print(bigrams)

{('before', '.'): 1, ('it', 'for'): 1, ('than', '$'): 1, ('believe', 'that'): 1, ('that', 'it'): 1, ('19.95', '!'): 1, ('that', 'i'): 1, ('said', 'that'): 1, ('could', 'only'): 1, ('i', 'said'): 1, ('find', 'it'): 1, ('!', 'i'): 1, ('costs', '$'): 1, ('i', 'could'): 1, ("can't", 'believe'): 1, ('only', 'find'): 1, ('more', 'than'): 1, ('$', '19.95'): 1, ('$', '30'): 1, ('30', 'before'): 1, ('it', 'only'): 1, ('i', "can't"): 1, ('only', 'costs'): 1, ('for', 'more'): 1}
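The counting behind these outputs can be approximated in a few lines of plain Python. This is a sketch of the general n-gram idea rather than MeTA's NGramWordAnalyzer: `count_ngrams` is a hypothetical helper, and unlike the unigram output above it keys every n-gram (including n = 1) by a tuple.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n adjacent tokens, keyed by a tuple of those tokens."""
    tokens = list(tokens)
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)

# The lowercased token list printed earlier in this section.
tokens = ['i', 'said', 'that', 'i', "can't", 'believe', 'that', 'it', 'only',
          'costs', '$', '19.95', '!', 'i', 'could', 'only', 'find', 'it',
          'for', 'more', 'than', '$', '30', 'before', '.']

unigram_counts = count_ngrams(tokens, 1)
bigram_counts = count_ngrams(tokens, 2)

print(unigram_counts[('i',)])          # how often the token 'i' occurs
print(bigram_counts[('i', 'said')])    # how often 'i' is followed by 'said'
```

A stream of 25 tokens yields 24 bigrams; in this document each of them happens to occur exactly once, which is why every count in the bigram output above is 1. Passing `list("some text")` instead of a word list would count character n-grams in the same way.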


Now the individual “tokens” we're counting are pairs of tokens. Sometimes looking at n-grams of characters is useful.

In [None]:
tok = metapy.analyzers.CharacterTokenizer()
ana = metapy.analyzers.NGramWordAnalyzer(4, tok)
fourchar_ngrams = ana.analyze(doc)
print(fourchar_ngrams)

{('e', 'v', 'e', ' '): 1, ('t', ' ', 'o', 'n'): 1, ('f', 'o', 'r', ' '): 1, ('!', ' ', 'I', ' '): 1, ('s', 't', 's', ' '): 1, (' ', 'c', 'o', 'u'): 1, ('9', '.', '9', '5'): 1, ('o', 'n', 'l', 'y'): 2, ('y', ' ', 'f', 'i'): 1, (' ', 'o', 'n', 'l'): 2, ('h', 'a', 't', ' '): 2, ('o', 'r', ' ', 'm'): 1, ('b', 'e', 'l', 'i'): 1, ('d', ' ', 'i', 't'): 1, ('t', ' ', 'I', ' '): 1, ('$', '3', '0', ' '): 1, ('c', 'o', 'u', 'l'): 1, ('o', 'r', 'e', '.'): 1, ('o', 'r', 'e', ' '): 1, ('s', ' ', '$', '1'): 1, ('I', ' ', 'c', 'o'): 1, ('$', '1', '9', '.'): 1, ('e', 'l', 'i', 'e'): 1, ('i', 'n', 'd', ' '): 1, ('n', 'd', ' ', 'i'): 1, ('9', '5', '!', ' '): 1, ('5', '!', ' ', 'I'): 1, ('o', 's', 't', 's'): 1, ('n', ' ', '$', '3'): 1, ('a', 't', ' ', 'i'): 1, ('h', 'a', 'n', ' '): 1, (' ', 'c', 'a', 'n'): 1, ('e', 'f', 'o', 'r'): 1, ('n', "'", 't', ' '): 1, ('t', ' ', 'f', 'o'): 1, ('r', 'e', ' ', 't'): 1, ('y', ' ', 'c', 'o'): 1, ('n', 'l', 'y', ' '): 2, (' ', 's', 'a', 'i'): 1, (' ', 'I', ' ', 'c'): 2,

## POS tagging

Now, let's explore something a little bit different. MeTA also has a natural language processing (NLP) component, which currently supports two major NLP tasks: part-of-speech (POS) tagging and syntactic parsing. POS tagging is a task in NLP that involves identifying a type for each word in a sentence. For example, POS tagging can be used to identify all of the nouns in a sentence, or all of the verbs, adjectives, and so on. This is useful as a first step towards developing an understanding of the meaning of a particular sentence. MeTA places its POS tagging component in its “sequences” library. Let's play with some sequences first to get an idea of how they work. We'll start off by creating a sequence.

In [None]:
seq = metapy.sequence.Sequence()

Now, we can add individual words to this sequence. Sequences consist of a list of Observations, which are essentially (word, tag) pairs. If we don't yet know the tags for a Sequence, we can just add individual words and leave the tags unset. Words are called “symbols” in the library terminology.

In [None]:
for word in ["The", "dog", "ran", "across", "the", "park", "."]:
    seq.add_symbol(word)

print(seq)

(I, ???), (said, ???), (The, ???), (dog, ???), (ran, ???), (across, ???), (the, ???), (park, ???), (., ???)


The printed form of the sequence shows that we do not yet know the tags for each word. Let's fill them in by using a pre-trained POS-tagger model that's distributed with MeTA.

In [None]:
! wget -nc https://github.com/meta-toolkit/meta/releases/download/v3.0.1/greedy-perceptron-tagger.tar.gz  
! tar xvf greedy-perceptron-tagger.tar.gz

--2022-09-01 21:42:46--  https://github.com/meta-toolkit/meta/releases/download/v3.0.1/greedy-perceptron-tagger.tar.gz
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/16466317/5becfb4a-07f9-11e7-9984-0b59d0729937?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220901%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220901T214247Z&X-Amz-Expires=300&X-Amz-Signature=6f6321e48ce0f83b863182f062cba6d3eeebcf3ac589481fb4fad5b4d1c5c07a&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=16466317&response-content-disposition=attachment%3B%20filename%3Dgreedy-perceptron-tagger.tar.gz&response-content-type=application%2Foctet-stream [following]
--2022-09-01 21:42:47--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/16466317/5becfb4a-07f9-1

In [None]:
tagger = metapy.sequence.PerceptronTagger("perceptron-tagger/")



In [None]:
tagger.tag(seq)
print(seq)

(I, PRP), (said, VBD)


Each tag indicates the type of a word, and this particular tagger was trained to output the tags present in the Penn Treebank tagset. But what if we want to POS-tag a document?

In [None]:
doc = metapy.index.Document()
doc.content("I said that I can't believe that it only costs $19.95!")
tok = metapy.analyzers.ICUTokenizer() # keep sentence boundaries!
tok = metapy.analyzers.PennTreebankNormalizer(tok)
tok.set_content(doc.content())
tokens = [token for token in tok]
print(tokens)

['<s>', 'I', 'said', 'that', 'I', 'ca', "n't", 'believe', 'that', 'it', 'only', 'costs', '$', '19.95', '!', '</s>']


Now, we will write a function that takes a token stream containing sentence boundary tags and returns a list of Sequence objects. We will not include the sentence boundary tags in the actual Sequence objects.

In [None]:
def extract_sequences(tok):
    sequences = []
    for token in tok:
        if token == '<s>':
            sequences.append(metapy.sequence.Sequence()) # Add a seq for each sentence 
        elif token != '</s>':
            sequences[-1].add_symbol(token)
    return sequences

doc = metapy.index.Document()
doc.content("I said that I can't believe that it only costs $19.95!")
tok.set_content(doc.content())
for seq in extract_sequences(tok):
    tagger.tag(seq)
    print(seq)

(I, PRP), (said, VBD), (that, IN), (I, PRP), (ca, MD), (n't, RB), (believe, VB), (that, IN), (it, PRP), (only, RB), (costs, VBZ), ($, $), (19.95, CD), (!, .)


## Config.toml file: setting up a pipeline

In practice, it is often beneficial to combine multiple feature sets together. We can do this with a MultiAnalyzer. Let's combine unigram words, bigram POS tags, and rewrite rules for our document feature representation. We can certainly do this programmatically, but doing so can become tedious quite quickly. Instead, let's use MeTA's configuration file format to specify our analyzer, which we can then load in one line of code. MeTA uses TOML files for all of its configuration. If you haven't heard of TOML before, don't panic! It's a very simple, readable format. Open a text editor and copy the text below, taking care not to modify the contents. Save it as config.toml.

```
#Add this as a config.toml file to your project directory
stop-words = "lemur-stopwords.txt"

[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-unigram-chain"

[[analyzers]]
method = "ngram-pos"
ngram = 2
filter = [{type = "icu-tokenizer"}, {type = "ptb-normalizer"}]
crf-prefix = "crf"

[[analyzers]]
method = "tree"
filter = [{type = "icu-tokenizer"}, {type = "ptb-normalizer"}]
features = ["subtree"]
tagger = "perceptron-tagger/"
parser = "parser/"
```

Each [[analyzers]] block defines another analyzer to combine for our feature representation. Since “ngram-word” is such a common analyzer, we have defined some default filter chains that can be used with shortcuts. “default-unigram-chain” is a filter chain suitable for unigram words; “default-chain” is a filter chain suitable for bigram words and above.

To run this example, we will need to download some additional MeTA resources:

In [None]:
! wget -nc https://github.com/meta-toolkit/meta/releases/download/v3.0.2/crf.tar.gz
! tar xvf crf.tar.gz

--2022-09-01 22:30:17--  https://github.com/meta-toolkit/meta/releases/download/v3.0.2/crf.tar.gz
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/16466317/b80b3710-8518-11e7-8623-e7289e51f6af?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220901%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220901T223017Z&X-Amz-Expires=300&X-Amz-Signature=f03250bc67b48c756c7df36179f3d39dcad3a56892129cb942d59c3d246e5474&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=16466317&response-content-disposition=attachment%3B%20filename%3Dcrf.tar.gz&response-content-type=application%2Foctet-stream [following]
--2022-09-01 22:30:17--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/16466317/b80b3710-8518-11e7-8623-e7289e51f6af?X-Amz-Algorithm=AWS4

In [None]:
! wget -nc https://github.com/meta-toolkit/meta/releases/download/v3.0.2/greedy-constituency-parser.tar.gz
! tar xvf greedy-constituency-parser.tar.gz

--2022-09-01 22:30:20--  https://github.com/meta-toolkit/meta/releases/download/v3.0.2/greedy-constituency-parser.tar.gz
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/16466317/b80f41f2-8518-11e7-9079-3ff935f6c151?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220901%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220901T223020Z&X-Amz-Expires=300&X-Amz-Signature=475addb1a1092dfb8c967f1ccd4678db59f0c7dceba4ef61598975af62c7eddc&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=16466317&response-content-disposition=attachment%3B%20filename%3Dgreedy-constituency-parser.tar.gz&response-content-type=application%2Foctet-stream [following]
--2022-09-01 22:30:20--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/16466317/b80f41f2-85

We can now load an analyzer from this configuration file:

In [None]:
ana = metapy.analyzers.load('config.toml')
doc = metapy.index.Document()
doc.content("I said that I can't believe that it only costs $19.95!")
print(ana.analyze(doc))



{'VBZ_$': 1, 'VB_IN': 1, 'CD_.': 1, 'subtree-(S (NP) (ADVP) (VP))': 1, 'subtree-(NP ($) (CD))': 1, 'subtree-($)': 1, 'subtree-(CD)': 1, 'PRP_RB': 1, 'subtree-(VP (VB) (SBAR))': 1, 'subtree-(NP (PRP))': 3, 'cost': 1, 'IN_PRP': 2, 'subtree-(VP (MD) (RB) (VP))': 1, 'subtree-(S (NP) (VP) (.))': 1, "can't": 1, 'subtree-(.)': 1, 'MD_RB': 1, 'subtree-(IN)': 2, 'subtree-(RB)': 2, 'subtree-(VBD)': 1, 'subtree-(VP (VBZ) (NP))': 1, 'subtree-(MD)': 1, 'believ': 1, 'subtree-(VP (VBD) (SBAR))': 1, 'RB_VBZ': 1, 'subtree-(ROOT (S))': 1, 'PRP_VBD': 1, 'subtree-(VBZ)': 1, 'subtree-(S (NP) (VP))': 1, 'subtree-(ADVP (RB))': 1, 'VBD_IN': 1, 'PRP_MD': 1, 'subtree-(SBAR (IN) (S))': 2, 'RB_VB': 1, 'subtree-(VB)': 1, 'subtree-(PRP)': 3, '$_CD': 1}
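Keys such as `'IN_PRP'` in the output above look like pairs of adjacent POS tags joined with an underscore. The sketch below reproduces keys of that shape from the tagged sentence printed earlier. It is a guess at the feature format based only on this output, not MeTA's actual "ngram-pos" implementation (which in this configuration uses the CRF tagger); `pos_ngram_features` is a hypothetical helper.

```python
from collections import Counter

def pos_ngram_features(tagged_tokens, n):
    """Count runs of n adjacent POS tags, joined with '_' as in the output above."""
    tags = [tag for _, tag in tagged_tokens]
    return Counter("_".join(tags[i:i + n]) for i in range(len(tags) - n + 1))

# (word, tag) pairs copied from the perceptron tagger output shown earlier.
tagged = [("I", "PRP"), ("said", "VBD"), ("that", "IN"), ("I", "PRP"),
          ("ca", "MD"), ("n't", "RB"), ("believe", "VB"), ("that", "IN"),
          ("it", "PRP"), ("only", "RB"), ("costs", "VBZ"), ("$", "$"),
          ("19.95", "CD"), ("!", ".")]

pos_bigrams = pos_ngram_features(tagged, 2)
print(pos_bigrams["IN_PRP"])   # "that I" and "that it" both give IN followed by PRP
print(sorted(pos_bigrams))
```

For this one sentence, the sketch yields the same twelve tag-pair keys that appear in the combined feature map, with `IN_PRP` counted twice. The `subtree-...` keys come from the parser-based analyzer and are outside the scope of this sketch.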


## Trying it out on your own!

Finally, let's test whether you can do such an analysis on your own! Inside this repository, you will find example.py, where we ask you to fill in your code. You are required to create a function that tokenizes with ICUTokenizer (without the start/end tags, i.e. with the argument "suppress_tags=True"), lowercases, removes words with fewer than 2 or more than 5 characters, performs stemming, and produces trigrams for an input sentence. Once you have edited example.py to fill in the function, you can check whether your submission passes the tests.

In [None]:
def tokens_lowercase(doc):
    # Write a token stream that tokenizes with ICUTokenizer (use the argument "suppress_tags=True"),
    # lowercases, removes words with fewer than 2 or more than 5 characters,
    # performs stemming and creates trigrams (name the result of the final call to ana.analyze "trigrams")
    '''Place your code here'''
    tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)
    tok = metapy.analyzers.LowercaseFilter(tok)
    tok = metapy.analyzers.LengthFilter(tok, min=2, max=5)
    tok = metapy.analyzers.Porter2Filter(tok)

    ana = metapy.analyzers.NGramWordAnalyzer(3, tok)
    trigrams = ana.analyze(doc)
    
    #leave the rest of the code as is
    tok.set_content(doc.content())
    tokens, counts = [], []
    for token, count in trigrams.items():
        counts.append(count)
        tokens.append(token)
    return tokens

In [None]:
doc = metapy.index.Document()
doc.content("I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.")
print(doc.content()) #you can access the document string with .content()

tokens = tokens_lowercase(doc)
print(tokens)

I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.
[('that', "can't", 'that'), ('onli', 'find', 'it'), ('that', 'it', 'onli'), ('19.95', 'could', 'onli'), ("can't", 'that', 'it'), ('could', 'onli', 'find'), ('find', 'it', 'for'), ('said', 'that', "can't"), ('for', 'more', 'than'), ('it', 'for', 'more'), ('it', 'onli', 'cost'), ('onli', 'cost', '19.95'), ('cost', '19.95', 'could'), ('more', 'than', '30')]
