# Experimenting with tokenization

## NLTK tokenization compared to the Brown corpus

In this notebook we will experiment with various tokenizers and see how they handle corner cases.

We start with the NLTK tokenizer.

In [1]:
import numpy
import matplotlib.pyplot as plt
import nltk

We begin with the example sentence from lecture 2. Apply `word_tokenize` from NLTK and observe how it tokenizes the string. Compare to the examples in the lecture.

In [2]:
sent0 = "For example, this isn't a well-formed sentence."
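
As a hint for the call shape, here is a minimal sketch on a different, made-up string (`demo` is a hypothetical name, not part of the exercise). It passes `preserve_line=True`, which skips NLTK's sentence-splitting step so that no extra data download is needed; a plain `word_tokenize(text)` call runs the sentence splitter first and therefore needs the Punkt data installed.

```python
from nltk.tokenize import word_tokenize

# Hypothetical demo string, deliberately not one of the exercise sentences.
demo = "Don't panic."

# preserve_line=True treats the input as a single sentence and skips
# the Punkt sentence splitter, so no downloaded resources are required.
demo_tokens = word_tokenize(demo, preserve_line=True)
print(demo_tokens)  # ['Do', "n't", 'panic', '.']
```

Note how the contraction is split into two tokens; keep this in mind when you look at `sent0`.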

In [3]:
# Your solution

We will now compare to the tokenization used in the Brown corpus. The Brown corpus is distributed in tokenized form. To compare, we have to "detokenize" sentences from Brown, i.e., guess what the sentences might have looked like originally. To be sure to get it right, we will do it manually.
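
To see why detokenization involves guessing, consider a naive sketch that joins tokens with spaces and then removes the space before common punctuation marks. The function `naive_detokenize` and the token list are hypothetical illustrations, not Brown data and not an NLTK API.

```python
import re

def naive_detokenize(tokens):
    """Join tokens with spaces, then drop the space before common punctuation."""
    text = " ".join(tokens)
    return re.sub(r"\s+([.,;:!?])", r"\1", text)

# Hand-written token list for illustration only.
demo_tokens = ["Well", ",", "that", "works", "."]
print(naive_detokenize(demo_tokens))  # Well, that works.
```

A rule like this says nothing about quotation marks, brackets, or tokens that were split inside a word, so the original string cannot always be recovered mechanically. That is the reason for detokenizing by hand here.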

Use NLTK's `word_tokenize` on the following sentence and compare to sentence 1750 in Brown.

In [4]:
sent1 = "Maybe motel-keeping isn't the nation's biggest industry."

In [5]:
# Your solution

In [6]:
from nltk.corpus import brown
# Materialize the corpus view as a plain list of token lists
brown_sentences = list(brown.sents())

In [7]:
# brown_sentences[1750]

Reflect on the effect of the two different schemes for downstream tasks, e.g., tagging.

## Spacy
Spacy is a toolbox for NLP different from NLTK. There are several differences between the two.
- NLTK is a toolbox primarily for educational purposes. It lets you experiment with several alternatives, e.g., for tagging. Spacy is one particular tool for analyzing language. One goal is to be fast.
- NLTK works in pipelines. There is one tool for tokenization, then there is another tool for tagging which you apply next, etc. Spacy uses a model which does all the processes simultaneously. Afterwards you may read out tokenized text with or without tags, and more information as we will see later in the semester.
- To use Spacy, you need to download (or train) a (neural) model for the language in question before it can be put to use. There are models for several languages, including Norwegian.


### Get started
You have to install Spacy. Spacy is already installed if you have run the recommended Anaconda set-up with in4080_2022 on your PC. It is also installed in the environment on the IFI cluster. Then you need to install a model. We have chosen the model `en_core_web_md` which isn't the biggest and best, but it will do for now.

In [8]:
import spacy

In [9]:
spacy.__version__

'3.3.1'

#### Your PC
Follow the instructions at https://spacy.io/usage/models to install the model on your own machine. You may then load it by
```
nlp = spacy.load('en_core_web_md')
```
If you wonder where your model is stored, try
```
nlp.path
```
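
If the model is not installed yet, `spacy.load` fails with an `OSError`. The following is a minimal sketch of a fallback, assuming that a tokenizer-only pipeline is good enough for the tokenization exercises in this notebook. The helper name `load_english` is hypothetical. `spacy.blank("en")` creates an English pipeline with the rule-based tokenizer but no tagger or other trained components.

```python
import spacy

def load_english(name="en_core_web_md"):
    """Load a trained pipeline, or fall back to a tokenizer-only pipeline."""
    try:
        return spacy.load(name)
    except OSError:
        # Raised when the package or path cannot be found.
        # Assumption: tokenization alone is enough for now.
        return spacy.blank("en")

nlp = load_english()
print(nlp.pipe_names)  # an empty list if the blank fallback was used
```

With the fallback you can tokenize, but tags and other annotations will not be available until the real model is installed.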

#### IFI cluster
Models take up disk space, so we have downloaded a model that you can load directly.
```
path = '/projects/nlp/spacy/en_core_web_md-3.3.0'
nlp = spacy.load(path)
```

In [10]:
path = '/projects/nlp/spacy/en_core_web_md-3.3.0'
nlp = spacy.load(path)

### Comparing Brown, NLTK and Spacy

To use Spacy, we first let it process a text (sentence, document), then we can extract information from the processed document.
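
The process-then-extract pattern can be sketched as follows. To keep the sketch self-contained it uses a blank English pipeline (tokenizer only) as a stand-in for the course model, and a made-up string; both are assumptions for illustration. It also shows that the processed document keeps the original whitespace, so the input string can be rebuilt from the tokens.

```python
import spacy

# Assumption: a tokenizer-only pipeline stands in for the course model.
demo_nlp = spacy.blank("en")

demo_text = "Don't panic."  # hypothetical string, not an exercise sentence
demo_doc = demo_nlp(demo_text)

for token in demo_doc:
    # token.i is the position in the document, token.idx the character offset,
    # token.whitespace_ the whitespace that followed the token in the input.
    print(token.i, token.idx, repr(token.text), repr(token.whitespace_))

# Each token remembers its trailing whitespace, so the input can be rebuilt.
rebuilt = "".join(token.text_with_ws for token in demo_doc)
print(rebuilt == demo_text)  # True
```

Because the offsets and whitespace are stored, Spacy output does not need the manual "detokenization" that the Brown sentences required.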

In [11]:
doc0 = nlp(sent0)

In [12]:
doc0

For example, this isn't a well-formed sentence.

In [13]:
tok00 = doc0[4]
tok00

is

In [14]:
tok00.text

'is'

To see the tokenized sentence, we may use
```
for token in doc0:
    print(token.text)
```
Tokenize sent0 and sent1 and compare to Brown and NLTK. Where do NLTK and Spacy agree and where do they disagree?
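
When the token lists get longer, it can help to compare them mechanically. Below is a minimal sketch using `difflib` from the standard library; the helper `token_disagreements` and the two token lists are hypothetical and hand-written for illustration, not actual tokenizer output.

```python
from difflib import SequenceMatcher

def token_disagreements(tokens_a, tokens_b):
    """Return (span_a, span_b) pairs where two tokenizations differ."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Hand-written token lists for illustration only.
scheme_a = ["a", "well-formed", "idea"]
scheme_b = ["a", "well", "-", "formed", "idea"]

diffs = token_disagreements(scheme_a, scheme_b)
print(diffs)  # [(['well-formed'], ['well', '-', 'formed'])]
```

You could pass in the token lists you get from Brown, NLTK and Spacy, two at a time, to list only the spans where the schemes disagree.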

In [15]:
# Your solution

Do you see any consequences for downstream tasks?

Consider now sent2. How is it tokenized by NLTK and by Spacy? How does this compare to sentence 36 in the Brown corpus? In particular, consider the end of the sentence.

In [16]:
sent2 = "It listed his wife's age as 74 and place of birth as Opelika , Ala."

In [17]:
# Your solutions

We have seen that both NLTK and Spacy split, e.g., *what's*, but they handle hyphenated expressions differently. What happens when the two phenomena are intermingled? Consider sentence 55310 from the Brown corpus, here sent3. How is it tokenized in Brown, by NLTK and by Spacy? How would you tokenize it?

In [18]:
sent3 = ("He didn't know what was so tough about Vivian's world, " + 
"slopping around Nassau with what's-his-name.")
sent3

"He didn't know what was so tough about Vivian's world, slopping around Nassau with what's-his-name."

In [19]:
# Your solutions