# Building a Natural Language Processor manually

In this section I'll use basic Python to build a rudimentary NLP system. We'll build a **corpus of documents** (two small text files, each containing a few sentences), create a **vocabulary** from all the words in both documents, and then demonstrate a *Bag of Words* technique to extract features from each document.

**This is for illustration only, to help your understanding of word vectorization!**

In [1]:
%%writefile sentence1.txt
I really like this NLP module now!
Letterkenny is a great town.
I would like to go for a coffee now.

Overwriting sentence1.txt


In [2]:
%%writefile sentence2.txt
I like to go shopping and to have a coffee.
Is it far to the town from here?
NLP is not complicated in this module.

Overwriting sentence2.txt


## Build a vocabulary

In NLP, we always use a vocabulary. Even when we load spaCy, we also load a dictionary file in a specific language.

In the next Python code, I'm building a dictionary that maps every word appearing in either text document to a unique number. Then I'll create instances (vectors) for each individual document.

In [3]:
vocab_dictionary = {}
counter = 1

with open('sentence1.txt') as file:
    sentence_file = file.read().lower().split()

for word in sentence_file:
    # if the word is already present then skip
    if word in vocab_dictionary:
        continue
    # If the word doesn't exist in the dictionary, add it with a
    # unique counter (index) number
    else:
        vocab_dictionary[word]=counter
        counter+=1

print(vocab_dictionary)

{'i': 1, 'really': 2, 'like': 3, 'this': 4, 'nlp': 5, 'module': 6, 'now!': 7, 'letterkenny': 8, 'is': 9, 'a': 10, 'great': 11, 'town.': 12, 'would': 13, 'to': 14, 'go': 15, 'for': 16, 'coffee': 17, 'now.': 18}


Repeat this step for the second file. Any word not seen before is added to the end of the dictionary and assigned its own unique identifier.

In [4]:
with open('sentence2.txt') as file:
    sentence_file = file.read().lower().split()

for word in sentence_file:
    if word in vocab_dictionary:
        continue
    else:
        vocab_dictionary[word]=counter
        counter+=1

print(vocab_dictionary)

{'i': 1, 'really': 2, 'like': 3, 'this': 4, 'nlp': 5, 'module': 6, 'now!': 7, 'letterkenny': 8, 'is': 9, 'a': 10, 'great': 11, 'town.': 12, 'would': 13, 'to': 14, 'go': 15, 'for': 16, 'coffee': 17, 'now.': 18, 'shopping': 19, 'and': 20, 'have': 21, 'coffee.': 22, 'it': 23, 'far': 24, 'the': 25, 'town': 26, 'from': 27, 'here?': 28, 'not': 29, 'complicated': 30, 'in': 31, 'module.': 32}


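Although the **sentence2.txt** file contains 25 words, only 14 new words were added to the vocabulary dictionary. Note that because we split on whitespace only, punctuation stays attached to words, so **coffee** and **coffee.** are stored as two different entries.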
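
The two cells above repeat the same loop, so the logic could be wrapped in a small helper. The following is only a sketch: the function name `build_vocabulary` is my own, and the two documents are inlined as strings (instead of being read from the files) so the snippet runs on its own.

```python
def build_vocabulary(documents):
    """Map each distinct whitespace-separated token to a 1-based index."""
    vocabulary = {}
    for text in documents:
        for word in text.lower().split():
            # setdefault only assigns an index the first time a word is seen
            vocabulary.setdefault(word, len(vocabulary) + 1)
    return vocabulary

# Same text as sentence1.txt and sentence2.txt, inlined for convenience
doc1 = """I really like this NLP module now!
Letterkenny is a great town.
I would like to go for a coffee now."""
doc2 = """I like to go shopping and to have a coffee.
Is it far to the town from here?
NLP is not complicated in this module."""

vocab = build_vocabulary([doc1, doc2])
print(len(vocab))
print(vocab)
```

This should produce the same word-to-index mapping as the step-by-step version, because words are still numbered in the order they are first seen.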

## Feature Extraction
Now that we've encapsulated our "entire language" in a dictionary, we can perform *feature extraction* on each of our original documents:

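First we create a zero-filled vector for each document, with one slot for every word in the vocabulary dictionary (32 words in total). The first element of each vector holds a label for the document, so the word indices, which start at 1, line up with the list positions.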

In [5]:
one = ['sentence 1']+[0]*len(vocab_dictionary)
print(one)

two = ['sentence 2']+[0]*len(vocab_dictionary)
print(two)

['sentence 1', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
['sentence 2', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


Now I map the frequency of each word in the **sentence1.txt** file into the `one` vector.

In [6]:
# map the frequencies of each word in sentence1.txt to our vector:
with open('sentence1.txt') as file:
    sentence_file = file.read().lower().split()
    
for word in sentence_file:
    one[vocab_dictionary[word]]+=1
    
print(one)

['sentence 1', 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


This process is then repeated for **sentence2.txt**, using the `two` vector.

In [7]:
# Do the same for the second document:
with open('sentence2.txt') as file:
    sentence_file = file.read().lower().split()
    
for word in sentence_file:
    two[vocab_dictionary[word]]+=1
    
print(two)

['sentence 2', 1, 0, 1, 1, 1, 0, 0, 0, 2, 1, 0, 0, 0, 3, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


We end up with vectors that contain the frequency counts of the words in each text file, along with **0** entries for every vocabulary word that does not appear in that file. Stacked together, vectors like these form what is called a **sparse matrix**.

In [8]:
# Compare the two vectors:
print(f'{one}\n{two}')

['sentence 1', 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
['sentence 2', 1, 0, 1, 1, 1, 0, 0, 0, 2, 1, 0, 0, 0, 3, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


By comparing the vectors we see that some words are common to both, some appear only in sentence1.txt, others only in sentence2.txt. Extending this logic to tens of thousands of documents, we would see the vocabulary dictionary grow to hundreds of thousands of words. Vectors would contain mostly zero values, making them sparse matrices.
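
One way to avoid storing all those zeros is to keep only the non-zero counts. Below is a minimal sketch using `collections.Counter` from the standard library; the variable names are illustrative and the first document is inlined as a string so the snippet is self-contained.

```python
from collections import Counter

# Same text as sentence1.txt, inlined for convenience
doc1 = """I really like this NLP module now!
Letterkenny is a great town.
I would like to go for a coffee now."""

# Counter stores only the words that actually occur in the document
counts_one = Counter(doc1.lower().split())

print(counts_one["i"])         # a word that occurs in the document
print(counts_one["shopping"])  # a missing word is reported as 0 without being stored
print(len(counts_one))         # number of entries actually stored
```

With a large vocabulary, a structure like this only grows with the number of distinct words in the document, not with the size of the whole vocabulary.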

## Bag of Words and Tf-idf

Bag of Words (BoW) is a model used in NLP. One aim of BoW is to categorise documents. The idea is to analyse and classify different **bags of words** (corpus). By matching the different categories, we identify which “bag” a certain block of text comes from. For example, in our spam filtering example, the system is trained to differentiate between **spam** and **ham**. We are trying to get our system to identify which bag the document comes from, the **bag of spam** or the **bag of ham**.

In the examples discussed in the lecture notes **Text feature extraction** on Blackboard, each vector can be considered a BoW. By themselves these may not be helpful until we consider **term frequencies**, or how often individual words appear in documents. A simple way to calculate the term frequency of a word is to divide the number of times it occurs by the total number of words in the document. In this way, the number of times a word appears in large documents can be compared to that of smaller documents.
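
As a rough illustration of that calculation, here is a sketch that computes term frequencies for the second document. The function name `term_frequencies` is hypothetical, and tokens are still produced by a plain whitespace split, as in the cells above.

```python
from collections import Counter

def term_frequencies(text):
    """Return {word: occurrences / total number of words} for one document."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

# Same text as sentence2.txt, inlined for convenience
doc2 = """I like to go shopping and to have a coffee.
Is it far to the town from here?
NLP is not complicated in this module."""

tf_two = term_frequencies(doc2)
print(tf_two["to"])  # "to" occurs 3 times out of 25 words
```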

However, it may be hard to differentiate documents based on term frequency if a word shows up in a majority of documents. To handle this we also consider **inverse document frequency**, which is the total number of documents divided by the number of documents that contain the word. In practice we convert this value to a logarithmic scale, as described [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).

Together these terms become [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
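
To make the combination concrete, here is a minimal sketch that multiplies the term frequency by the natural logarithm of the inverse document frequency, exactly as described above and without any smoothing. The function name `tf_idf` is my own, and libraries often use slightly different (smoothed) formulas, so their numbers will generally differ from these.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: tf * log(N / df)} dictionary per document."""
    tokenized = [text.lower().split() for text in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Same text as sentence1.txt and sentence2.txt, inlined for convenience
doc1 = """I really like this NLP module now!
Letterkenny is a great town.
I would like to go for a coffee now."""
doc2 = """I like to go shopping and to have a coffee.
Is it far to the town from here?
NLP is not complicated in this module."""

scores = tf_idf([doc1, doc2])
print(scores[0]["like"])    # appears in both documents, so log(2/2) gives 0
print(scores[0]["really"])  # appears only in the first document
```

A word that occurs in every document gets a score of zero, which is exactly the effect we want for words that do not help to tell documents apart.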

## Stop Words and Word Stems
Some common words such as **the** and **and** appear so frequently in text that they tell us very little about what a particular document is about. These are known as **stop words**, and they are often removed before feature extraction.
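
A minimal sketch of stop-word filtering is shown below. The `STOP_WORDS` set is a tiny hand-picked list that I made up for this illustration; real NLP libraries ship much longer, language-specific lists.

```python
# A tiny hand-picked stop-word list, for illustration only
STOP_WORDS = {"a", "and", "the", "to", "is", "it", "in", "i", "for", "this"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]

tokens = "i like to go shopping and to have a coffee".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['like', 'go', 'shopping', 'have', 'coffee']
```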

It may also make sense to record only the root (or **stem**) of a word, for example **house** instead of **houses**. This helps to shrink the space required for our vocabulary and improves overall performance.
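
The idea can be sketched with a deliberately crude rule that only strips a trailing plural **s**. The function `naive_stem` is hypothetical and far simpler than a real stemming algorithm (such as the Porter stemmer), so treat it purely as an illustration.

```python
def naive_stem(word):
    """Strip a trailing plural 's' from longer words. Deliberately crude."""
    # Skip short words ("this", "is") and words ending in "ss" ("class")
    if len(word) > 4 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

words = ["houses", "house", "towns", "class", "this"]
stems = [naive_stem(word) for word in words]
print(stems)  # ['house', 'house', 'town', 'class', 'this']
```

Because **houses** and **house** now share one entry, the vocabulary needs fewer slots. A rule this simple will also make mistakes (for example, it turns **always** into **alway**), which is why real systems rely on more careful algorithms.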

## Tokenization and Tagging

When we created our vectors the first thing we did was split the incoming text on whitespace with `.split()`. This was a crude form of **tokenization** - that is, dividing a document into individual words or tokens.
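
A slightly less crude tokenizer can be sketched with the standard library's `re` module. The function name `tokenize` is my own; the pattern `\w+` keeps runs of letters, digits and underscores and drops punctuation, so **now!** and **now.** both become **now**.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word-like tokens without punctuation."""
    return re.findall(r"\w+", text.lower())

first = tokenize("I really like this NLP module now!")
second = tokenize("I would like to go for a coffee now.")
print(first)   # ['i', 'really', 'like', 'this', 'nlp', 'module', 'now']
print(second)
```

This is still only an approximation: a contraction such as **don't** would be split into two tokens, which is one reason dedicated NLP libraries use more elaborate tokenization rules.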

Once the text is divided, we can go back and **tag** our tokens with information about **parts of speech**, **grammatical dependencies**, etc. This adds more dimensions to our data and enables a deeper understanding of the context of specific documents. For this reason, vectors become ***high dimensional sparse matrices***.