This notebook was created by Chinchuthakun Worameth as part of Natural Language Processing (ART.T459) at Tokyo Institute of Technology, taught in the Fall semester of 2021 by Prof. Tokunaga, Takenobu.

# 1. Word tokens and word types

This section tokenizes the input text and reports the number of word tokens and word types. We use the **Penn Treebank Tokenizer** from the [**Natural Language Toolkit**](https://www.nltk.org/index.html), which uses regular expressions to tokenize text following the Penn Treebank conventions.




In [None]:
! pip install nltk


In [None]:
from nltk.tokenize import TreebankWordTokenizer


According to the [documentation](https://www.nltk.org/api/nltk.tokenize.treebank.html), the tokenizer performs four steps:

*   split standard contractions, e.g. ```don't``` -> ```do n't``` and ```they'll``` -> ```they 'll```

*   treat most punctuation characters as separate tokens

*   split off commas and single quotes, when followed by whitespace

*   separate periods that appear at the end of line
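
The four steps above can be approximated with a handful of regular expressions. The sketch below is a simplified illustration only, not NLTK's actual implementation: the function name `simple_treebank_like` and the specific patterns are assumptions chosen for this example, and they cover far fewer cases than `TreebankWordTokenizer`.

```python
import re

def simple_treebank_like(text):
    """Rough, hypothetical approximation of the four Treebank steps."""
    # Step 1: split standard contractions (don't -> do n't, they'll -> they 'll).
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    text = re.sub(r"(\w)('(?:ll|re|ve|s|m|d))\b", r"\1 \2", text)
    # Step 2: treat most punctuation characters as separate tokens.
    text = re.sub(r'([;:@#$%&?!"()\[\]{}])', r" \1 ", text)
    # Step 3: split off commas and single quotes when followed by whitespace.
    text = re.sub(r"([,'])(?=\s|$)", r" \1", text)
    # Step 4: separate a period that appears at the end of the line.
    text = re.sub(r"\.(\s*)$", r" .\1", text)
    return text.split()

print(simple_treebank_like("They'll say don't go, Winston."))
# ['They', "'ll", 'say', 'do', "n't", 'go', ',', 'Winston', '.']
```

Applying the rules in a fixed order matters: the contractions are split first so that the apostrophe inside ```n't``` is not mistaken for a standalone single quote in step 3.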

To illustrate, let's tokenize the first line of text (excluding the header lines) in the input file **1984.txt**.

In [None]:
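tokenizer = TreebankWordTokenizer()
with open('1984.txt', "r") as f:
    # Read the first 9 lines: the header lines followed by the first line of text.
    first_lines = [next(f) for _ in range(9)]
# Index 8 is the first line after the header.
print(tokenizer.tokenize(first_lines[8]))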

['It', 'was', 'a', 'bright', 'cold', 'day', 'in', 'April', ',', 'and', 'the', 'clocks', 'were', 'striking', 'thirteen', '.']


Like other rule-based tokenizers, the Penn Treebank Tokenizer will likely perform worse than a classifier-based model trained on a sufficiently large corpus, because it is difficult to define an exhaustive list of rules. On the other hand, its rules are easier for humans to interpret, and since it needs no training, it remains a reasonable choice when training data is limited.

Now, we apply the tokenizer to each line of the input file, storing the word tokens in ```tokens``` and the frequency of each distinct token in ```frequency```.

In [None]:
from collections import defaultdict
tokens = []
frequency = defaultdict(int)  # missing keys default to 0

In [None]:
tokenizer = TreebankWordTokenizer()  # create the tokenizer once and reuse it
with open('1984.txt', "r") as f:  # the file is closed automatically on exit
  for line in f:
    line_token = tokenizer.tokenize(line)
    tokens.extend(line_token)
    for token in line_token:
      frequency[token] += 1

Then, we report the number of word tokens and word types.

In [None]:
print("word tokens = {} tokens".format(sum(frequency.values())))
print("word types = {} words".format(len(frequency)))

word tokens = 115172 tokens
word types = 11831 words
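
To make the distinction between the two counts concrete, here is a small sketch on a made-up sentence (the list `toy_tokens` is an assumption for illustration and is not taken from **1984.txt**). Every occurrence counts as a word token, while each distinct string counts once as a word type. Because no case folding is applied, ```The``` and ```the``` are different types.

```python
from collections import Counter

# Hypothetical, already-tokenized toy sentence.
toy_tokens = ["The", "cat", "saw", "the", "dog", "and", "the", "bird", "."]
toy_frequency = Counter(toy_tokens)

num_tokens = sum(toy_frequency.values())  # every occurrence counts
num_types = len(toy_frequency)            # each distinct string counts once

print("word tokens =", num_tokens)  # word tokens = 9
print("word types =", num_types)    # word types = 8
```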


# 2. Unigram, Bigram, and Trigram

This section generates n-grams.
First, we define a function that returns the frequency of each n-gram given a list of word tokens and an integer ```n```.

In [None]:
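def N_gram(tokens, n):
  # Slide a window of n tokens over the list and count each window.
  frequency = defaultdict(int)
  for i in range(len(tokens)-n+1):
    n_gram = tuple(tokens[i:i+n])  # tuples are hashable, so they can be dictionary keys
    frequency[n_gram] += 1
  return frequency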
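
As a sanity check of the sliding-window idea, the sketch below counts n-grams on a tiny made-up token list. It uses a different but equivalent formulation, zipping `n` shifted copies of the list and feeding the result to `collections.Counter`. The names `ngram_counts` and `toy_tokens` are assumptions for this example and are not part of the notebook.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Hypothetical alternative: count n-grams by zipping n shifted views of tokens."""
    # shifted[k] starts at position k; zip stops at the shortest view,
    # so exactly len(tokens) - n + 1 windows are produced.
    shifted = [tokens[i:] for i in range(n)]
    return Counter(zip(*shifted))

toy_tokens = ["a", "b", "a", "b", "c"]
bigram_counts = ngram_counts(toy_tokens, 2)
print(bigram_counts[("a", "b")])     # 2
print(sum(bigram_counts.values()))   # 4 windows in total
```

A list of 5 tokens yields 5 - 2 + 1 = 4 bigrams, of which ```('a', 'b')``` occurs twice and ```('b', 'a')``` and ```('b', 'c')``` occur once each.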

Then, we generate unigrams, bigrams, and trigrams and examine the five most frequent of each.

In [None]:
unigram = N_gram(tokens, 1)
bigram = N_gram(tokens, 2)
trigram = N_gram(tokens, 3)

In [None]:
import operator
# The loop variable is named `count` so that it does not overwrite the `frequency` dictionary.
print("#"*20, "UNIGRAM", "#"*20)
for n_gram, count in sorted(unigram.items(), key=operator.itemgetter(1), reverse=True)[:5]:
  print(n_gram, count)

print("#"*20, "BIGRAM", "#"*20)
for n_gram, count in sorted(bigram.items(), key=operator.itemgetter(1), reverse=True)[:5]:
  print(n_gram, count)

print("#"*20, "TRIGRAM", "#"*20)
for n_gram, count in sorted(trigram.items(), key=operator.itemgetter(1), reverse=True)[:5]:
  print(n_gram, count)

#################### UNIGRAM ####################
(',',) 6504
('the',) 5785
('of',) 3458
('a',) 2439
('was',) 2299
#################### BIGRAM ####################
('of', 'the') 769
(',', 'and') 762
('in', 'the') 525
(',', 'the') 324
('it', 'was') 318
#################### TRIGRAM ####################
(',', "'", 'said') 105
(',', 'and', 'the') 95
(',', "'", 'he') 95
("'", 'he', 'said') 72
('a', 'sort', 'of') 58


Since we treat punctuation marks as tokens, it is unsurprising that at least one of them appears among the most frequent unigrams. Notably, the comma (,) ranks first, while the period (.) does not appear in the top five. We can therefore tentatively surmise that many sentences in the input file are fairly complex, containing several clauses, although comma counts alone cannot confirm this.
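
As a side note, the top-five listings in this section can also be produced without an explicit sort, because `collections.Counter` provides a `most_common` method. The sketch below uses made-up counts (`toy_bigrams` is an assumption for illustration, not data from **1984.txt**). A plain dictionary of counts, such as the one returned by ```N_gram```, can be wrapped with `Counter(...)` to get the same behavior.

```python
from collections import Counter

# Hypothetical bigram counts, for illustration only.
toy_bigrams = Counter({("of", "the"): 3, ("in", "the"): 2, ("it", "was"): 1})

# most_common(k) returns the k entries with the highest counts, in descending order.
top_two = toy_bigrams.most_common(2)
for n_gram, count in top_two:
    print(n_gram, count)
```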

# 3. Export Report

Now, we export the IPython notebook as a PDF file using a Python script available on [GitHub](https://github.com/brpy/colab-pdf).

In [None]:
!wget -nc https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py

--2021-10-15 05:59:50--  https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1864 (1.8K) [text/plain]
Saving to: ‘colab_pdf.py’


2021-10-15 05:59:50 (35.3 MB/s) - ‘colab_pdf.py’ saved [1864/1864]



In [None]:
from colab_pdf import colab_pdf
colab_pdf('Assignment-week4.ipynb')

Mounted at /content/drive/




Extracting templates from packages: 100%
[NbConvertApp] Converting notebook /content/drive/MyDrive/Colab Notebooks/Assignment-week4.ipynb to pdf
[NbConvertApp] Writing 33230 bytes to ./notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: [u'xelatex', u'./notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: [u'bibtex', u'./notebook']
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 35301 bytes to /content/drive/My Drive/Assignment-week4.pdf


'File ready to be Downloaded and Saved to Drive'