# NLP Exercises

This section contains five exercises:
1. Build your own tokenizer by implementing two functions based on regular expressions.
2. Get part-of-speech tags from Trump's speech.
3. Get the nouns in the last 10 sentences of Trump's speech, grouped by sentence. Use spaCy.
4. Build your own Bag of Words implementation using the tokenizer created earlier.
5. Build a 5-gram model and clean up the results.

## Exercise 1. Build your own tokenizer

Build two different tokenizers:
- ``tokenize_sentence``: function tokenizing text into sentences,
- ``tokenize_words``: function tokenizing text into words.

In [82]:
import re
import nltk

In [83]:
from typing import List

def tokenize_words(text: str) -> List[str]:

    words = re.findall(r"\b\w+(?:'\w+)?|\b[\w.]+\b", text)
    
    # filter stop words
    #words_filtered = [w for w in words if w not in nltk.corpus.stopwords.words("english") and w.isalpha()]
    
    return words

def tokenize_sentence(text: str) -> List[str]:
    
    # this one doesn't split p.m. but it also doesn't split the emoji
    #sentences = re.split(r'(?<!\b\w)\.(?!\d|\s[a-z]|p\.m\.)(?=\s[A-Z])', text)
  
    sentences = re.findall(r"(?<!\b\w)\.(?!\d|\s[a-z]|p\.m\.)(?=\s[A-Z]|:\w)|(?<=:\w)\.|[^\.\?!]+(?:[\.\?!]|$)", text)
    sentences = [sentence.strip() for sentence in sentences]
    
    return sentences

text = "Here we go again. I was supposed to add this text later.\
 Well, it's 10.p.m. here, and I'm actually having fun making this course. :o\
 I hope you are getting along fine with this presentation, I really did try.\
 And one last sentence, just so you can test your tokenizers better."

print("Tokenized sentences:")
print(tokenize_sentence(text))

print("Tokenized words:")
print(tokenize_words(text))

Tokenized sentences:
['Here we go again.', 'I was supposed to add this text later.', "Well, it's 10.", 'p.', 'm.', "here, and I'm actually having fun making this course.", ':o I hope you are getting along fine with this presentation, I really did try.', 'And one last sentence, just so you can test your tokenizers better.']
Tokenized words:
['Here', 'we', 'go', 'again', 'I', 'was', 'supposed', 'to', 'add', 'this', 'text', 'later', 'Well', "it's", '10', '.p.m', 'here', 'and', "I'm", 'actually', 'having', 'fun', 'making', 'this', 'course', 'o', 'I', 'hope', 'you', 'are', 'getting', 'along', 'fine', 'with', 'this', 'presentation', 'I', 'really', 'did', 'try', 'And', 'one', 'last', 'sentence', 'just', 'so', 'you', 'can', 'test', 'your', 'tokenizers', 'better']
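
The output above breaks `10.p.m.` into separate fragments. One possible alternative is to split only on whitespace that follows `.`, `!` or `?` and precedes a capital letter. The sketch below assumes that heuristic; the name `split_sentences` and the sample text are illustrative and not part of the exercise.

```python
import re

# Split on whitespace that follows ., ! or ? and precedes an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text: str) -> list:
    """Hypothetical helper: heuristic sentence splitter (not the exercise solution)."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]

sample = "It's 10.p.m. here, and I'm having fun. Are you? I hope so!"
print(split_sentences(sample))
# ["It's 10.p.m. here, and I'm having fun.", 'Are you?', 'I hope so!']
```

This heuristic still has blind spots: an abbreviation followed by a capitalized word (for example `Mr. Smith`) is split, and an emoticon such as `:o` stays attached to the preceding sentence.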


## Exercise 2. Get tags from Trump's speech using NLTK

Read the ``trump.txt`` file and find the part-of-speech tag for each word using NLTK.

In [86]:
# the context manager closes the file once it has been read
with open("trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()
words = nltk.word_tokenize(trump)

In [88]:
tags = nltk.pos_tag(words)

In [112]:
# map each word to its tag; a word that occurs with several tags keeps only the last one
tags_d = {key: value for key, value in tags}
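
A word-to-tag dictionary stores one tag per distinct word, so a word that is tagged differently in different places keeps only the tag of its last occurrence. The sketch below illustrates this with a hand-written list of tagged tokens (hypothetical data, not taken from the speech) and shows one way to keep every tag seen for a word.

```python
# Hypothetical tagged tokens, in the (word, tag) shape returned by a POS tagger.
tagged = [("I", "PRP"), ("book", "VBP"), ("a", "DT"), ("flight", "NN"),
          ("and", "CC"), ("read", "VBP"), ("a", "DT"), ("book", "NN")]

# Word -> tag: later occurrences overwrite earlier ones.
last_tag = {word: tag for word, tag in tagged}

# Word -> set of all tags seen for that word.
all_tags = {}
for word, tag in tagged:
    all_tags.setdefault(word, set()).add(tag)

print(last_tag["book"])          # NN
print(sorted(all_tags["book"]))  # ['NN', 'VBP']
```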

## Exercise 3. Get the nouns in the last 10 sentences of Trump's speech, grouped by sentence. Use spaCy.

Use Python list features to get the last 10 sentences and display the nouns in them.
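
The list feature in question is negative slicing. A minimal sketch with placeholder sentences (the strings below are made up for illustration):

```python
sentences = [f"Sentence {i}." for i in range(1, 13)]  # 12 placeholder sentences

n = 10
last_n = sentences[-n:]      # the final n items, in their original order
print(len(last_n))           # 10
print(last_n[0])             # Sentence 3.

# When the list holds fewer than n items, the slice returns the whole list.
short = sentences[:4]
print(short[-n:] == short)   # True
```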

In [114]:
import spacy

In [100]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(trump)

In [102]:
n = 10

In [103]:
sentences = list(doc.sents)[-n:]

In [148]:
nouns = []
noun_tags = {'NN', 'NNP', 'NNS'}

for sentence in sentences:
    # keep the words whose NLTK tag (looked up by surface form) marks a noun
    nouns_sentence = [word.text for word in sentence if tags_d.get(word.text) in noun_tags]
    nouns.append(nouns_sentence)

In [146]:
nouns

[['vision', 'years', 'freedom', 'tonight', 'chapter', 'greatness'],
 ['time', 'thinking'],
 ['time', 'fights'],
 ['courage',
  'share',
  'dreams',
  'hearts',
  'bravery',
  'hopes',
  'souls',
  'confidence',
  'hopes',
  'dreams',
  'action'],
 ['America',
  'aspirations',
  'fears',
  'future',
  'failures',
  'past',
  'vision',
  'doubts'],
 ['citizens', 'renewal', 'spirit'],
 ['Members', 'Congress', 'things', 'country'],
 ['everyone', 'tonight', 'moment'],
 ['Believe', 'yourselves', 'future', 'America'],
 ['Thank', 'God', 'God', 'United']]

## Exercise 4. Build your own Bag of Words implementation using the tokenizer created earlier

You need to implement the following methods:

- ``fit_transform`` - takes a list of strings and returns a matrix with its BoW representation
- ``get_feature_names`` - returns the list of words corresponding to the columns of the BoW matrix

In [154]:
import numpy as np

class BagOfWords:
    """Basic BoW implementation."""

    def __init__(self):
        # maps each token to its total count across the corpus;
        # insertion order defines the column order of the matrix
        self.vocab = {}

    def fit_transform(self, corpus: list):
        # tokenize each document once and count words
        tokenized = [tokenize_words(text) for text in corpus]
        for tokens in tokenized:
            for token in tokens:
                self.vocab[token] = self.vocab.get(token, 0) + 1

        # map each token to its column index (avoids a linear search per token)
        index = {token: j for j, token in enumerate(self.vocab)}

        # create a matrix representation of the bow
        bow_matrix = np.zeros((len(corpus), len(self.vocab)), dtype=int)
        for i, tokens in enumerate(tokenized):
            for token in tokens:
                bow_matrix[i, index[token]] += 1

        return bow_matrix

    def get_feature_names(self) -> list:
        return list(self.vocab.keys())

corpus = [
     'Bag Of Words is based on counting',
     'words occurences throughout multiple documents.',
     'This is the third document.',
     'As you can see most of the words occur only once.',
     'This gives us a pretty sparse matrix, see below. Really, see below',
]    
    
vectorizer = BagOfWords()

X = vectorizer.fit_transform(corpus)
print(X)

features=vectorizer.get_feature_names()
print(features)
len(features)

[[1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 1 1 1 1 1 1 2 1]]
['Bag', 'Of', 'Words', 'is', 'based', 'on', 'counting', 'words', 'occurences', 'throughout', 'multiple', 'documents', 'This', 'the', 'third', 'document', 'As', 'you', 'can', 'see', 'most', 'of', 'occur', 'only', 'once', 'gives', 'us', 'a', 'pretty', 'sparse', 'matrix', 'below', 'Really']


33
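
In the output above, `Words` and `words` (and `Of` and `of`) occupy separate columns because tokens are compared case-sensitively. The sketch below shows one way to fold case before counting. It is an illustration under stated assumptions: `simple_tokenize` is a hypothetical stand-in for `tokenize_words`, and plain lists replace the NumPy matrix to keep the example dependency-free.

```python
import re
from collections import Counter

def simple_tokenize(text: str) -> list:
    """Hypothetical stand-in for tokenize_words: lowercased runs of word characters."""
    return re.findall(r"\w+", text.lower())

def bow_lowercased(corpus: list):
    """Return (feature names, one list of counts per document) with case folded."""
    counts = [Counter(simple_tokenize(doc)) for doc in corpus]
    # Column order follows first appearance in the corpus.
    features = list(dict.fromkeys(tok for doc in corpus for tok in simple_tokenize(doc)))
    rows = [[c[tok] for tok in features] for c in counts]
    return features, rows

features, rows = bow_lowercased(["Bag Of Words", "words of a bag of words"])
print(features)  # ['bag', 'of', 'words', 'a']
print(rows)      # [[1, 1, 1, 0], [1, 2, 2, 1]]
```

Whether to fold case depends on the task: it shrinks the vocabulary, but it also merges tokens such as `US` and `us`.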

## Exercise 5. Build a 5-gram model and clean up the results.

There are three tasks:
1. Use a 5-gram model instead of a 3-gram model.
2. Capitalize the first letter of each sentence.
3. Remove the whitespace between the last word of a sentence and the closing ``.``, ``!`` or ``?``.

Hint: for tasks 2 and 3, implement a function called ``clean_generated()`` that takes the generated text and fixes both issues at once. It is likely easier to fix the text after it is generated than to make changes inside the while loop.

In [181]:
from nltk.book import text7

wall_street = text7.tokens
tokens = wall_street

def cleanup():
    compiled_pattern = re.compile("^[a-zA-Z0-9.!?]")
    clean = list(filter(compiled_pattern.match,tokens))
    return clean

tokens = cleanup()

def build_ngrams():
    ngrams = []
    for i in range(len(tokens)-N+1):
        ngrams.append(tokens[i:i+N])
    return ngrams

def ngram_freqs(ngrams):
    counts = {}

    for ngram in ngrams:
        token_seq  = SEP.join(ngram[:-1])
        last_token = ngram[-1]

        if token_seq not in counts:
            counts[token_seq] = {}

        if last_token not in counts[token_seq]:
            counts[token_seq][last_token] = 0

        counts[token_seq][last_token] += 1

    return counts

def next_word(text, N, counts):
    # look up the last N-1 tokens of the text generated so far
    token_seq = SEP.join(text.split()[-(N-1):])

    if token_seq not in counts:
        return None

    choices = counts[token_seq].items()

    # weighted random choice: walk the cumulative counts until they pass r
    total = sum(weight for choice, weight in choices)
    r = random.uniform(0, total)
    upto = 0
    for choice, weight in choices:
        upto += weight
        if upto > r:
            return choice
    assert False # should not reach here
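
To see what the count table looks like and how a next token is drawn from it, here is a small self-contained sketch on a toy token list. The helper names (`toy_counts`, `sample_next`) and the toy data are hypothetical, and the sampling uses the standard library's `random.choices` as an alternative to the manual cumulative loop above.

```python
import random

def toy_counts(tokens: list, n: int, sep: str = " ") -> dict:
    """Hypothetical helper: map each (n-1)-token context to next-token counts."""
    table = {}
    for i in range(len(tokens) - n + 1):
        context = sep.join(tokens[i:i + n - 1])
        nxt = tokens[i + n - 1]
        table.setdefault(context, {})
        table[context][nxt] = table[context].get(nxt, 0) + 1
    return table

def sample_next(context: str, table: dict):
    """Draw the next token in proportion to its count, or None if the context is unseen."""
    if context not in table:
        return None
    choices = table[context]
    return random.choices(list(choices), weights=list(choices.values()), k=1)[0]

toy_tokens = "the cat sat . the cat ran . the dog sat .".split()
toy_table = toy_counts(toy_tokens, n=3)
print(toy_table["the cat"])               # {'sat': 1, 'ran': 1}
print(sample_next("the cat", toy_table))  # 'sat' or 'ran', chosen at random
```

With a larger `n` the contexts become longer and rarer, so more of them have a single continuation and the generated text tends to reproduce the source more literally.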

In [192]:
import random

def clean_generated(generated):
    
    # capitalize the first letter of each sentence and drop the whitespace
    # before each '.'; note that str.capitalize() also lowercases the rest
    # of the sentence, and '!' and '?' are not handled
    generated_clean = '. '.join(sentence.strip().capitalize() for sentence in generated.split('. '))
    
    return generated_clean

N=5

SEP=" "

sentence_count=5

ngrams = build_ngrams()
start_seq="We have"

counts = ngram_freqs(ngrams)

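# with N=5 the keys of counts are 4-token sequences, so a two-word start_seq
# such as "We have" is never found and a random key is used instead
if start_seq is None or start_seq not in counts:
    start_seq = random.choice(list(counts.keys()))
generated = start_seq.lower()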

sentences = 0
while sentences < sentence_count:
    next_word_result = next_word(generated, N, counts)
    if next_word_result is not None:
        generated += SEP + next_word_result
        sentences += 1 if generated.endswith(('.','!', '?')) else 0
    else:
        break
        
print(generated)

difficult or impossible if a person lacks adequate shelter . The interactions between health and homelessness are complex defying sweeping generalizations as to cause or effect . If we look to the future preventing homelessness is an important objective . This will require us to develop a much more sophisticated understanding of the dynamics of homelessness than we currently possess an understanding that can be developed only through careful study and research . William R.


In [193]:
print(clean_generated(generated))

Difficult or impossible if a person lacks adequate shelter. The interactions between health and homelessness are complex defying sweeping generalizations as to cause or effect. If we look to the future preventing homelessness is an important objective. This will require us to develop a much more sophisticated understanding of the dynamics of homelessness than we currently possess an understanding that can be developed only through careful study and research. William r.
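
The cleaned output above ends in `William r.` because `str.capitalize()` lowercases everything after the first character, and the split on `'. '` does not cover `!` or `?`. The sketch below is one possible regex-based variant that avoids both issues; `clean_generated_sketch` is a hypothetical name, and the sample string is made up for illustration.

```python
import re

def clean_generated_sketch(generated: str) -> str:
    """Hypothetical variant of clean_generated that keeps existing capitals intact."""
    # Remove whitespace that sits directly before ., ! or ?
    text = re.sub(r"\s+([.!?])", r"\1", generated.strip())
    # Uppercase a lowercase letter at the start of the text or right after ., ! or ?
    return re.sub(r"(^|[.!?]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)

print(clean_generated_sketch("we saw William R. Smith . is it true ? yes !"))
# We saw William R. Smith. Is it true? Yes!
```

The variant is still a heuristic: a lowercase word that follows an abbreviation (for example `p.m. here`) would be capitalized as if it started a new sentence.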
