This was taken from the DSAI unit:

## NLP for the Creative Industries 
### Louis McCallum 2021 
(with minor edits by Teresa Pelinski 2022, Rebecca Fiebrink 2023)

Hopefully, you have followed the Git tutorial and have managed to update your repo and pull in this new notebook! 

If we want to take a document, or a set of documents, and use machine learning techniques or other mathematical operations to uncover new information about them, we can't just use the raw text and characters. We need a new representation, and that needs to be numerical. This week we will take our first steps towards representing collections of documents as numbers. 

- Tokenisation 
- Bags of Words 

### Building a Vocabulary 

The first step to getting a new, better representation of a text document is splitting it up into its constituent parts. We call these **tokens** and deciding _what a token is_ is an important choice. 

### Basic Tokenisation - ``str.split()``
What is the simplest way to split a **String** into **tokens**? 

Introducing the `str.split()` function. Here we take a long string (multiple words) and split it into separate words (or **tokens**) based on spaces. 

We have previously seen **functions** that take a String as an argument and return a new value (e.g., `foo(bar)` where `foo` is a function and `bar` is a string). In this case, the concept is broadly similar, except that we **call the function on the string itself** (e.g., `bar.foo()`).
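To make the difference between the two calling styles concrete, here is a small sketch using the built-in function `len()` and the string methods `upper()` and `split()`. The variable name `bar` is just the placeholder borrowed from the text above.

```python
bar = "learning new things"

# A function takes the string as an argument: foo(bar)
number_of_characters = len(bar)

# A method is called on the string itself: bar.foo()
shouted = bar.upper()
words = bar.split()

print(number_of_characters)
print(shouted)
print(words)
```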

What gets returned is a **List**, containing our split string. We store it in the variable ``tokens``, and print it out. 

In [2]:
sentence = "I'm learning new things every day. Day is nice."
tokens = sentence.split()
print(tokens)

["I'm", 'learning', 'new', 'things', 'every', 'day.', 'Day', 'is', 'nice.']


## New Python Concepts 

As we start working with bigger text documents, sets of tokens and numerical representations, the amount of individual bits of information we need to store and refer back to becomes **massive**. We'll learn some new Python concepts to both handle large collections of data, and to process them.

### Lists 

Previously, we'd used **named variables** to store individual bits of information such as text and numbers:

```
my_diary_entry = "Today I mainly SMASHED IT"
hours_spent_smashing_it = 1000
```

But if we look at the ``split()`` function we have just used, it returns **9 different values**. It would be a lot of effort to have a named variable for each of them:

```
token_1 = "I'm"
token_2 = "learning"
token_3 = "new"
token_4 = "things"
token_5 = "every"
token_6 = "day."
token_7 = "Day"
token_8 = "is"
token_9 = "nice."
```

And what happens when we have 1000 tokens? Or 30,000 (roughly the size of an average English speaker's vocabulary)?

Instead, we can store collections of values in a **single object** known as a ``List``. You may also hear the terms ``array`` and ``vector``. Whilst they do mean specific things in specific circumstances, for our purposes these are all broadly interchangeable.



In [3]:
#Tokens is a variable that contains a list
print(tokens)
print("the third item is ->", tokens[2])

["I'm", 'learning', 'new', 'things', 'every', 'day.', 'Day', 'is', 'nice.']
the third item is -> new


### Indexing Lists

We can access items in a List by using ``numerical indexes`` in `square brackets []`. 

At this point, you might start to find things familiar from Week 1 when we looked at Strings. In fact, **Strings can kind of be considered as Lists of Characters**. 

Now I've blown your minds, one thing to be wary of:

#### In computer science, we start counting at 0

That means the first item in a list is

``my_list[0]``

And the second item in the list is 

``my_list[1]``

If you give an index that is past the end of the list, **you will get an error!** (Python calls this an `IndexError`.)
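The indexing rules above can be tried out with a short sketch. The list contents are made up for illustration; `my_list` is the placeholder name used in the text.

```python
my_list = ["I'm", "learning", "new", "things"]

print(my_list[0])   # first item
print(my_list[1])   # second item
print(my_list[-1])  # a negative index counts back from the end

# The last valid index is always len(my_list) - 1
last_index = len(my_list) - 1
print(last_index)

# Asking for an index past the end raises an IndexError
try:
    my_list[len(my_list)]
except IndexError as error:
    print("IndexError:", error)
```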

Like any other variable, **you can also overwrite** items in a list 

``my_list[0] = 1``

``camera_locations[3] = "hilltop"``

### Adding new values 

We can also **extend** an existing list using the `append()` function 

In [4]:
print(tokens)

tokens.append("!")
print(tokens)

["I'm", 'learning', 'new', 'things', 'every', 'day.', 'Day', 'is', 'nice.']
["I'm", 'learning', 'new', 'things', 'every', 'day.', 'Day', 'is', 'nice.', '!']


Or **overwrite** the whole variable with a new value:

In [5]:
tokens = "hi" # this is a string
print(tokens)
tokens = "I keep on learning things, its almost like I went back to school".split()
print(tokens)

hi
['I', 'keep', 'on', 'learning', 'things,', 'its', 'almost', 'like', 'I', 'went', 'back', 'to', 'school']


### 2-Dimensional ``Lists``

The ``Lists`` we have seen up until this point have mainly been 1-dimensional, that is, all the items in the list are just single objects like numbers or strings. But, it is possible to have lists in multiple dimensions and for the time being we will just move to 2. 

It can help to think of a 1D ``List`` as a queue, or a shopping list. There are only two directions (backwards and forwards) and you only need **1 index** to access an item in it. 

You can think of a 2D ``List`` as a grid, so more like a chess board. You can move in 4 directions (forwards, backwards, left and right) and you need **2 indexes** to access any item.

Technically, in a 2D ``List``, each item of the outer list is itself another 1D ``List``.

Taking from mathematics, these 1D ``Lists`` are often called **vectors** and 2D ``Lists`` are called **matrices** 

In [6]:
m = [[1,2],[3,4]]

#m is a matrix
print("a matrix:",m)

#Get the first row (a vector)
print("a vector (a row of the matrix, in this case, the first row):", m[0])

#Get a specific item [row, col]
print("a scalar (an item in the matrix, in this case, in the first row, second column):",m[0][1])

a matrix: [[1, 2], [3, 4]]
a vector (a row of the matrix, in this case, the first row): [1, 2]
a scalar (an item in the matrix, in this case, in the first row, second column): 2


### For Loops

**For loops** are used when we want to repeat an action a given number of times. The code below shows the standard structure:

- The first line (`for i in array:`) tells us to take every item in the List `array` **in order**, and store it in `i`. 

- The code underneath dictates what repeated actions we do with `i` each time and **must** be indented (by convention with four spaces, although a tab also works), otherwise Python will complain. The actions to be repeated can be a single line of code, or multiple lines. Every line that is indented will be included in the loop and executed each time.

Note: these two blocks of code below are **pseudocode**. Pseudocode is used to communicate (to a human) how an algorithm works, but it is not intended to be executed by a machine. For the first block below to be executed, we would need to declare the array (e.g., `array = [1,2,3]`) and the function `do_something_with_i` (e.g., `def do_something_with_i(i): print(i)`), and then call it inside the loop as `do_something_with_i(i)`.

```
for i in array:
    do_something_with_i 
#end of loop
```

```
for i in array:
     do_something_with_i
     do_something_else_with_i
     do_another_thing
#end of loop
```  
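For comparison, here is one way the first pseudocode block could be turned into code that actually runs. The contents of `array` and the body of `do_something_with_i` are hypothetical stand-ins chosen for illustration.

```python
# Hypothetical stand-ins for the names used in the pseudocode
array = [1, 2, 3]

def do_something_with_i(i):
    # Return the value doubled so we can see the action happen
    return i * 2

results = []
for i in array:
    doubled = do_something_with_i(i)  # runs once per item, in order
    results.append(doubled)
    print(i, "->", doubled)
# end of loop: the next line is not indented, so it runs only once
print(results)
```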

Below are some examples of loops (written in actual python and not pseudocode!):


In [7]:
print(tokens)


['I', 'keep', 'on', 'learning', 'things,', 'its', 'almost', 'like', 'I', 'went', 'back', 'to', 'school']


In [8]:
print("tokens: ",tokens)
for token in tokens:
    print(token)

tokens:  ['I', 'keep', 'on', 'learning', 'things,', 'its', 'almost', 'like', 'I', 'went', 'back', 'to', 'school']
I
keep
on
learning
things,
its
almost
like
I
went
back
to
school


In [9]:
#For loop prints out every item in turn
a = [[1,2],[2,3]]

for number_list in a:
    for number in number_list:
        print(number)

1
2
2
3


In [10]:
a[1][1]

3

In [11]:
tokens

['I',
 'keep',
 'on',
 'learning',
 'things,',
 'its',
 'almost',
 'like',
 'I',
 'went',
 'back',
 'to',
 'school']

In [12]:
len(tokens)

13

In [13]:
indexes = range(len(tokens))

range_tokens = [0,1,2,3,4,5,6,7,8,9,10,11,12] # the same indexes written out by hand: 0 up to len(tokens)-1

In [14]:
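#A hand-written list of indexes (note the repeats, and that indexes 2, 3 and 5 are skipped)
range_tokens = [0,1,0,0,4,4,6,7,8,9,10,11,12]
#range gives a sequence of numbers from 0 up to (but not including) the length of the list
indexes = range(len(tokens))

for i in range_tokens: # to visit every index exactly once, write `for i in range(len(tokens))`
    print(i,":",tokens[i])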

print("end of for loop 2")


0 : I
1 : keep
0 : I
0 : I
4 : things,
4 : things,
6 : almost
7 : like
8 : I
9 : went
10 : back
11 : to
12 : school
end of for loop 2
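Writing a list of indexes by hand is easy to get wrong. When you need both the index and the item, Python's built-in `enumerate()` is a common alternative. The sketch below uses a shorter, made-up token list called `example_tokens` so that it does not overwrite the `tokens` variable used in this notebook.

```python
example_tokens = "I keep on learning things".split()

# enumerate() yields (index, item) pairs, so no separate list of indexes is needed
pairs = []
for i, token in enumerate(example_tokens):
    print(i, ":", token)
    pairs.append((i, token))

# The same indexes come out of range(len(example_tokens))
example_indexes = list(range(len(example_tokens)))
print(example_indexes)
```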


In [15]:
#You can have multiple lines of code in the loop
#They are all indented 
for t in tokens:
    print(t)
    print(len(t)) # t is a string, so len(t) gives the number of characters
    print("this happens every time")
print("end of for loop 3, this only happens once")

I
1
this happens every time
keep
4
this happens every time
on
2
this happens every time
learning
8
this happens every time
things,
7
this happens every time
its
3
this happens every time
almost
6
this happens every time
like
4
this happens every time
I
1
this happens every time
went
4
this happens every time
back
4
this happens every time
to
2
this happens every time
school
6
this happens every time
end of for loop 3, this only happens once


Note: if you are not sure what type of variable you are dealing with, you can do `type(the_variable)`. For example:

In [16]:
print("type of tokens: ",type(tokens))
a="hi" 
print(type(a))

type of tokens:  <class 'list'>
<class 'str'>


## Back to NLP

You probably won't understand all of the code in the following parts, but that's fine. You're Great. What's important is that you get the general idea of what we're doing. The coding specifics will come with practice!

We've picked up some super important new concepts that we'll keep practising throughout the course. We'll start with some new concepts for analysing text:

### How good is our vocabulary?

We have split our text document into tokens (or **tokenised** it, if you're fancy) by splitting every time we see a space.

Does this seem sensible? Does this capture everything that we would consider a separate word in the document? 

What about `isn't`? Is this one token (`isn't`), or two tokens (`is` and `not`)? Or `taxi cab`? Is that two tokens, and do we care that `taxi` and `cab` are both used? Do we need this as a separate concept from `Uber`, `limo`, `Hackney Carriage` or `car`? If we take it as one token, do we miss out on other combinations like `black cab` or `taxi driver`? What about punctuation?
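One way to explore the `taxi cab` question is to look at pairs of neighbouring tokens (often called bigrams) alongside the single tokens. The sketch below uses a made-up sentence and is only an illustration; it is not the approach used in the rest of this notebook.

```python
example = "the black cab driver waved at the taxi cab"
single_tokens = example.split()

# Pair each token with the one that follows it
bigrams = list(zip(single_tokens, single_tokens[1:]))

print(single_tokens)
print(bigrams)

# 'taxi cab' is only visible as a unit in the bigram view
print(("taxi", "cab") in bigrams)
```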

Just using `str.split()` works reasonably well on the sentence below. However, there are some issues with punctuation. Ideally, the brackets and the exclamation mark would be separate tokens. We will deal with this later.

After we have split our sentence into tokens, we can then create a **vocabulary**, which contains every unique token in the sentence.

In [17]:
import numpy as np

#String split
sentence = "I like to think (it has to be!) of a cybernetic ecology where we are free of our labors"
tokenised = sentence.split()
print("Tokenised sentence: ")
print(tokenised)
print(len(tokenised))


Tokenised sentence: 
['I', 'like', 'to', 'think', '(it', 'has', 'to', 'be!)', 'of', 'a', 'cybernetic', 'ecology', 'where', 'we', 'are', 'free', 'of', 'our', 'labors']
19


In [18]:
#Get the unique tokens (removes duplicates)

import numpy as np

sentence = "I like to think (it has to be!) of a cybernetic ecology where we are free of our labors"
tokenised = sentence.split()

vocab = np.unique(tokenised)
print("\n Vocabulary (unique tokens):") # \n adds a line break
print(vocab)
print(len(vocab))


 Vocabulary (unique tokens):
['(it' 'I' 'a' 'are' 'be!)' 'cybernetic' 'ecology' 'free' 'has' 'labors'
 'like' 'of' 'our' 'think' 'to' 'we' 'where']
17


### One-Hot Encoding

Now that we have derived a **vocabulary** (if not exactly perfect yet) for the sentence, we can represent it as a set of numbers. 

To create a **one-hot encoding**, we assign **each token in a document** a vector that is the length of the vocabulary, with each slot in this vector representing a token in the vocabulary. These slots can either be **1 or 0**.

We set the slot that represents that token in the vocabulary to 1. Every other slot is 0. 

This leaves us with a **2-d List**, where each row is a list as long as the vocabulary. Each row only ever has a single 1 in it.

In [19]:
import pandas as pd

In [20]:
#Split into tokens based on spaces
tokenised = sentence.split()


#Get the unique tokens
vocab = np.unique(tokenised)



#Make a matrix of zeros using the lengths of the separated sentence and vocab
one_hot = np.zeros((len(tokenised), len(vocab)))


#Go through the separated sentence and set the appropriate item to 1
for i in range(len(tokenised)):
    #Get the word
    word = tokenised[i]
    #find the index of the word in the vocab
    match = np.where(vocab == word)
    #Set it to 1 (hot)
    one_hot[i, match] = 1
    
print(pd.DataFrame(one_hot, columns = vocab))

    (it    I    a  are  be!)  cybernetic  ecology  free  has  labors  like  \
0   0.0  1.0  0.0  0.0   0.0         0.0      0.0   0.0  0.0     0.0   0.0   
1   0.0  0.0  0.0  0.0   0.0         0.0      0.0   0.0  0.0     0.0   1.0   
2   0.0  0.0  0.0  0.0   0.0         0.0      0.0   0.0  0.0     0.0   0.0   
3   0.0  0.0  0.0  0.0   0.0         0.0      0.0   0.0  0.0     0.0   0.0   
4   1.0  0.0  0.0  0.0   0.0         0.0      0.0   0.0  0.0     0.0   0.0   
5   0.0  0.0  0.0  0.0   0.0         0.0      0.0   0.0  1.0     0.0   0.0   
6   0.0  0.0  0.0  0.0   0.0         0.0      0.0   0.0  0.0     0.0   0.0   
7   0.0  0.0  0.0  0.0   1.0         0.0      0.0   0.0  0.0     0.0   0.0   
8   0.0  0.0  0.0  0.0   0.0         0.0      0.0   0.0  0.0     0.0   0.0   
9   0.0  0.0  1.0  0.0   0.0         0.0      0.0   0.0  0.0     0.0   0.0   
10  0.0  0.0  0.0  0.0   0.0         1.0      0.0   0.0  0.0     0.0   0.0   
11  0.0  0.0  0.0  0.0   0.0         0.0      1.0   0.0  0.0    

This **one-hot encoding** doesn't lose any information. We keep a reference to every token, as well as the sequence in which they appear. As we have seen, small differences and nuance in natural language can have big effects in meaning.

**But it's a lot of numbers** for a small amount of information. That said, it is also super **sparse**, which just means there are lots of 0s, and there are lots of techniques for storing sparse data really efficiently. 
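As a rough illustration of why sparse data can be stored efficiently, a one-hot matrix can be rebuilt from just the position of the single 1 in each row. The sketch below uses a short made-up sentence and plain Python lists; it is not how any particular library stores sparse matrices.

```python
example_tokens = "to be or not to be".split()
example_vocab = sorted(set(example_tokens))  # unique tokens in alphabetical order

dense = []        # full one-hot rows: one row per token, one column per vocabulary entry
hot_indexes = []  # sparse version: only the column index of the 1 in each row
for token in example_tokens:
    position = example_vocab.index(token)
    row = [0] * len(example_vocab)
    row[position] = 1
    dense.append(row)
    hot_indexes.append(position)

print(example_vocab)
print(dense)
print(hot_indexes)
print("numbers stored:", len(dense) * len(example_vocab), "vs", len(hot_indexes))
```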

We've successfully represented our sentence as a matrix of numbers, which we can use with various mathematical techniques moving forwards. 

### Bag of Words

Using **one-hot encoding**, we have a **vector** the length of the vocabulary for every word. However, this can quickly get out of hand with longer documents and bigger vocabularies.

One improvement we can make to this representation is to simply count up every occurrence of each word in the vocabulary for each document, and then store this count for each word in the vocabulary. This is what we call a bag of words, in which we represent a document by its words and the frequency at which they appear.

This means we only have one **word frequency vector** for each document, rather than a one-hot encoding vector for each word. If we had multiple documents (or sentences), we could make a **word frequency vector** for each one and store them as a **matrix** (2D array).

The `Counter` object used below behaves like a dictionary and is an efficient storage method, as absent words are simply not stored.

Even though this is a big compression of the data (we throw away the order of the words), this approach actually ends up not losing much of the meaning of the document. 

In [21]:
#Use a Counter to create a Bag of Words (word-frequency vector)
from collections import Counter
bow = Counter(tokenised)
bow # in a Jupyter notebook, writing a variable's name on the last line of a cell displays its value

Counter({'to': 2,
         'of': 2,
         'I': 1,
         'like': 1,
         'think': 1,
         '(it': 1,
         'has': 1,
         'be!)': 1,
         'a': 1,
         'cybernetic': 1,
         'ecology': 1,
         'where': 1,
         'we': 1,
         'are': 1,
         'free': 1,
         'our': 1,
         'labors': 1})
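The idea of storing one word frequency vector per document as a matrix can be sketched as follows. The two documents are made up for illustration, and the names `documents`, `shared_vocab` and `bow_matrix` are assumptions rather than anything used elsewhere in this notebook.

```python
from collections import Counter

# Two made-up documents standing in for a larger collection
documents = ["the cat sat on the mat", "the dog sat"]

# Build one shared vocabulary across all documents, in sorted order
all_tokens = []
for doc in documents:
    all_tokens.extend(doc.split())
shared_vocab = sorted(set(all_tokens))

# One row per document, one column per vocabulary entry
bow_matrix = []
for doc in documents:
    counts = Counter(doc.split())
    row = []
    for vocab_word in shared_vocab:
        row.append(counts[vocab_word])  # a Counter returns 0 for words it has not seen
    bow_matrix.append(row)

print(shared_vocab)
print(bow_matrix)
```

Because every row uses the same column order, rows from different documents can be compared directly.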

### Looking at Books
Here we have a novel from [Project Gutenberg](https://gutenberg.org/ebooks/). It's about hacking and is copyright free. Let's see what we can find out about it.

What can we find out about each chapter by counting the number of times a word appears in each? Are any similar to each other? How can we adjust our vocabulary to help us out? First we're going to look at the book as a whole.

We're going to use a number of techniques to try and tweak our vocabulary so that it contains the most information for when we start using this bag of words in tasks like topic modelling or classification. 

This means we want to count things that have the same meaning as the same token, but we also don't want to throw away any information that might help our processing. 

In [28]:
with open('hackers_history_steven.txt', 'r', encoding='utf-8') as fs: # the file is closed automatically when the block ends
    book = fs.read()

In [29]:
#tokens is a 1D list
tokens = book.split() 
#Get the unique tokens (our vocabulary)
vocab = np.unique(tokens)
print("total words:",len(tokens), "unique words:", len(vocab))
#Create a Bag of Words using a Counter
Counter(tokens).most_common(200)

total words: 17658 unique words: 5078


[('the', 1074),
 ('of', 564),
 ('to', 482),
 ('a', 457),
 ('and', 394),
 ('in', 279),
 ('was', 244),
 ('that', 208),
 ('with', 165),
 ('you', 139),
 ('would', 137),
 ('had', 125),
 ('for', 118),
 ('on', 118),
 ('or', 115),
 ('it', 112),
 ('The', 97),
 ('were', 97),
 ('be', 94),
 ('at', 93),
 ('by', 91),
 ('this', 87),
 ('who', 87),
 ('computer', 85),
 ('Project', 81),
 ('as', 81),
 ('which', 80),
 ('an', 77),
 ('his', 76),
 ('could', 71),
 ('he', 69),
 ('not', 68),
 ('from', 64),
 ('one', 58),
 ('program', 57),
 ('is', 53),
 ('Gutenberg™', 53),
 ('they', 48),
 ('like', 45),
 ('into', 45),
 ('hackers', 41),
 ('Samson', 41),
 ('do', 41),
 ('any', 41),
 ('work', 41),
 ('people', 40),
 ('their', 39),
 ('its', 39),
 ('out', 38),
 ('no', 36),
 ('all', 36),
 ('hacker', 36),
 ('what', 33),
 ('Peter', 33),
 ('but', 33),
 ('up', 33),
 ('are', 32),
 ('about', 32),
 ('This', 31),
 ('other', 30),
 ('TX-0', 30),
 ('some', 30),
 ('machine', 29),
 ('if', 29),
 ('been', 29),
 ('when', 28),
 ('only', 28

Since the split function splits the document on whitespace and does not consider punctuation, punctuation is included as part of the word, so `park` and `park.` or `park?` are added to the vocabulary as different tokens. We can use a regular expression to identify those duplicates:

In [30]:
# find tokens that duplicate another token apart from trailing punctuation
import re 
ctr = 0 # counter

In [31]:
for word in vocab: 
    if len(re.findall(r'[\.,;!?]$', word)) > 0: # `re.findall(r'[\.,;!?]$', word)` returns an empty list if the word does not end ($) in . , ; ! or ?.
        if word[:-1] in vocab:
            ctr = ctr + 1 # this is also commonly written as `ctr +=1`
print(ctr," duplicate words ending in [.,;!?] out of ", len(vocab))

657  duplicate words ending in [.,;!?] out of  5078
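To see what the pattern in the loop above matches on its own, here is a small sketch on a few made-up words. It uses `re.search`, which returns `None` when there is no match, instead of checking the length of the list from `re.findall`; the two give the same yes/no answer here.

```python
import re

pattern = r'[.,;!?]$'  # one of . , ; ! ? at the very end of the string

ends_with_punctuation = {}
for example_word in ["park", "park.", "park?", "U.S", "wait!"]:
    # re.search returns a match object, or None when nothing matches
    ends_with_punctuation[example_word] = re.search(pattern, example_word) is not None
    print(example_word, "->", ends_with_punctuation[example_word])
```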


In [32]:
word = "hello!"
print(word[-1])
print(word[:-1])

!
hello


### Removing punctuation duplicates
We can see lots of tokens have trailing punctuation, and a good few of these will also exist without the punctuation. This duplication is bad for us!

We want these to be the same token, so we can use a **regex** to split the text differently. The regex below splits on whitespace (represented by `\s`), hyphen (`-`) or punctuation (`.,;!?()`) that appears at least once (using the plus notation `+`). We immediately see the size of the vocabulary drop by about 16% (from 5078 to 4242 unique tokens), showing there was loads of duplication in our bag of words. 

In [33]:
#Use a regex to split based on space AND punctuation
tokens = re.split(r'[-\s.,;!?()]+', book) 
vocab = np.unique(tokens)
print("unique words", vocab.shape)
Counter(tokens).most_common(50)

unique words (4242,)


[('the', 1084),
 ('of', 567),
 ('to', 491),
 ('a', 467),
 ('and', 416),
 ('in', 286),
 ('was', 248),
 ('that', 213),
 ('with', 170),
 ('you', 142),
 ('it', 141),
 ('would', 137),
 ('on', 127),
 ('had', 125),
 ('for', 121),
 ('or', 120),
 ('computer', 120),
 ('The', 97),
 ('were', 97),
 ('be', 95),
 ('at', 93),
 ('by', 93),
 ('this', 92),
 ('who', 88),
 ('as', 85),
 ('which', 82),
 ('Project', 81),
 ('an', 78),
 ('his', 77),
 ('could', 75),
 ('program', 71),
 ('he', 69),
 ('not', 68),
 ('from', 68),
 ('one', 63),
 ('1', 58),
 ('Samson', 57),
 ('work', 57),
 ('hackers', 56),
 ('TX', 56),
 ('Gutenberg™', 55),
 ('is', 53),
 ('0', 52),
 ('machine', 51),
 ('they', 49),
 ('like', 48),
 ('do', 47),
 ('into', 46),
 ('"', 44),
 ('people', 44)]
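One thing to watch for with `re.split`: when the text starts or ends with one of the separator characters, the result contains an empty string at that end. The sketch below shows this on a made-up sentence and filters the empty strings out; whether this matters for the book depends on how the file begins and ends.

```python
import re

example_text = "(It has to be!) Hello, world."
raw_tokens = re.split(r'[-\s.,;!?()]+', example_text)
print(raw_tokens)

# Drop the empty strings left where the text starts or ends with a separator
clean_tokens = []
for t in raw_tokens:
    if t != "":
        clean_tokens.append(t)
print(clean_tokens)
```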

We can now clean the initial example:

In [34]:
#String split
sentence = "I like to think (it has to be!) it's of a cybernetic ecology where we are free of our labors"
_tokenised = sentence.split()
print("Tokenised sentence")
print(_tokenised)

#Use a regex to split based on space AND punctuation
_tokens = re.split(r'[-\s.,;!?()]+', sentence)
vocab = np.unique(_tokens)
print("unique words", vocab.shape)
print(vocab)
#Counter(_tokens).most_common(50)

Tokenised sentence
['I', 'like', 'to', 'think', '(it', 'has', 'to', 'be!)', "it's", 'of', 'a', 'cybernetic', 'ecology', 'where', 'we', 'are', 'free', 'of', 'our', 'labors']
unique words (18,)
['I' 'a' 'are' 'be' 'cybernetic' 'ecology' 'free' 'has' 'it' "it's"
 'labors' 'like' 'of' 'our' 'think' 'to' 'we' 'where']


### Stop words
We can see that the commonly occurring words don't tell us much about this specific book. Traditionally, in NLP it has been useful to remove words that occur commonly. They don't tell us very much about each document because they are contained in almost all the documents. These are known as **stop words**. Examples: the, a, she, he, his, her, to, was...

In contemporary NLP we often don't actually remove stop words, because we have the computing power to deal with the extra vocab size and any information we throw away can affect performance, especially in deep learning, and especially when we start to look at the context of sequences of words. 

Here we see a list from the sklearn library (each library has its own list of stop words). 

We'll now see how removing stop words from our bag affects what we can see. Although our vocabulary size is basically the same (so we're not saving much in efficiency), the list of most common words is much more informative and tells us more about the specific book we're looking at. 

In [None]:
#Install library if not already installed

#!pip install --upgrade pip
#!pip install scikit-learn   







In [35]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS # public import; stop words are in lower case
tokens_without_stop_words = []
for t in tokens:
    if t not in ENGLISH_STOP_WORDS: 
        tokens_without_stop_words.append(t)
stop_vocab = np.unique(tokens_without_stop_words)
print("unique words", len(stop_vocab))
Counter(tokens_without_stop_words).most_common(50)

unique words 4002


[('computer', 120),
 ('The', 97),
 ('Project', 81),
 ('program', 71),
 ('1', 58),
 ('Samson', 57),
 ('work', 57),
 ('hackers', 56),
 ('TX', 56),
 ('Gutenberg™', 55),
 ('0', 52),
 ('machine', 51),
 ('like', 48),
 ('"', 44),
 ('people', 44),
 ('hacker', 41),
 ('MIT', 39),
 ('computers', 36),
 ('works', 34),
 ('Peter', 33),
 ('This', 32),
 ('time', 31),
 ('IBM', 29),
 ('If', 28),
 ('It', 28),
 ('electronic', 28),
 ('called', 27),
 ('things', 26),
 ('use', 25),
 ('way', 25),
 ('Gutenberg', 24),
 ('did', 24),
 ('new', 24),
 ('terms', 23),
 ('copyright', 23),
 ('E', 23),
 ('make', 23),
 ('TMRC', 23),
 ('programs', 23),
 ('Saunders', 23),
 ('world', 22),
 ('code', 22),
 ('using', 20),
 ('working', 20),
 ('instructions', 20),
 ('Foundation', 20),
 ('THE', 19),
 ('Kotok', 19),
 ('set', 19),
 ('He', 19)]
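Because the stop word list is lower case, capitalised tokens such as `The`, `This` and `It` survive the filter above. One possible tweak is to compare the lower-cased token against the list. The sketch below uses a tiny hand-written stop word set (an assumption for illustration, not the sklearn list) so that it runs on its own.

```python
# A tiny, hand-picked stop word set, for illustration only
example_stop_words = {"the", "a", "it", "this", "is", "of"}

example_tokens = ["The", "computer", "is", "a", "machine", "It", "hacks"]

kept_case_sensitive = []
kept_case_insensitive = []
for t in example_tokens:
    if t not in example_stop_words:
        kept_case_sensitive.append(t)
    if t.lower() not in example_stop_words:
        kept_case_insensitive.append(t)

print(kept_case_sensitive)    # capitalised stop words slip through
print(kept_case_insensitive)  # comparing in lower case catches them
```

Note that the tokens themselves keep their original case; only the comparison is done in lower case.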

## These are advanced exercises, do them if you want to explore more. ⬇️⬇️

### Stemming
Stemming attempts to remove suffixes from words that share the same base. This reduces variation and can help when we reduce the documents into a more distilled form (like a bag of words). 

- hacking, hackers, hacked, hacks
- computer, computing, computers

Depending on our task, it might be the case that we only really care about knowing if **any** of these words appear, not whether they each appear individually. For example, I might be doing a search for paragraphs about hacking and it may be that I would miss out on key documents otherwise if I only searched for one of the words. 

Stemming can be quite a challenging task, however. If we want to combine pluralisations, for example, we can't just remove the "s" from the end of all nouns. What about:

- grass (not a plural)
- mice, octopi (plural, no s)
- geniuses (plural, es)
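To see why the "just remove the s" rule falls short, here is a deliberately naive sketch. The function `naive_stem` is hypothetical and written only for this illustration; it is not how a real stemmer works.

```python
def naive_stem(word):
    # Hypothetical rule for illustration: strip a single trailing "s"
    if word.endswith("s"):
        return word[:-1]
    return word

naive_results = {}
for example_word in ["hackers", "computers", "grass", "mice", "geniuses"]:
    naive_results[example_word] = naive_stem(example_word)
    print(example_word, "->", naive_results[example_word])
```

The regular plurals come out fine, but `grass` is damaged, `mice` is untouched and `geniuses` loses only part of its suffix.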

We're going to use the built-in stemmer in the NLTK library. This reduces our vocabulary in the hacking book dramatically!

In [30]:
#Install the nltk library 
%pip install nltk

Note: you may need to restart the kernel to use updated packages.





In [31]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
word_list = ['feet', 'foot', 'foots', 'footing']
for word in word_list:
    print(word, "->", stemmer.stem(word))
#Doesn't always work: 'feet' is not reduced to 'foot' above, and below 'organic' is merged with 'organ'
word_list = ['organise','organises','organised','organisation',"organs","organ","organic"]
for word in word_list:
    print(word, "->", stemmer.stem(word))

feet -> feet
foot -> foot
foots -> foot
footing -> foot
organise -> organis
organises -> organis
organised -> organis
organisation -> organis
organs -> organ
organ -> organ
organic -> organ


In [32]:
#Looking at our hacking book
stem_tokens = []
for t in tokens_without_stop_words:
    stem_tokens.append(stemmer.stem(t))
stem_vocab = np.unique(stem_tokens)
print("unique words", stem_vocab.shape)
Counter(stem_tokens).most_common(50)

unique words (9272,)


[('the', 1408),
 ('comput', 1026),
 ('he', 845),
 ("'", 824),
 ('hacker', 708),
 ('it', 506),
 ('par', 501),
 ('electron', 465),
 ('time', 449),
 ('like', 423),
 ('hack', 402),
 ("didn't", 378),
 ('use', 363),
 ('just', 361),
 ('phoenix', 361),
 ('work', 344),
 ('anthrax', 338),
 ('polic', 326),
 ('i', 321),
 ('mendax', 307),
 ('day', 304),
 ('network', 301),
 ('worm', 299),
 ('phone', 293),
 ('they', 287),
 ('look', 280),
 ('peopl', 277),
 ('want', 263),
 ('system', 251),
 ('machin', 249),
 ('tri', 247),
 ('in', 243),
 ('number', 236),
 ('secur', 236),
 ('new', 234),
 ('call', 229),
 ('account', 228),
 ('thing', 226),
 ('way', 221),
 ('inform', 220),
 ('line', 219),
 ('said', 218),
 ('told', 216),
 ('offic', 214),
 ('program', 211),
 ('but', 210),
 ('a', 207),
 ('pad', 205),
 ('case', 199),
 ('password', 197)]

### Lemmatisation 
Lemmatisation is a technique similar to stemming, except that it attempts to group words by meaning, as opposed to just by shared roots. As with all these _normalisation_ techniques, reducing your vocabulary will reduce precision but may make your model better at generalising and more efficient.

For example, lemmatisation would be able to separate **dogged** and **dog**, which have quite different meanings but would get combined by a stemmer. 

Below we use the WordNetLemmatizer from the NLTK library. 

In [33]:
import nltk
nltk.download("wordnet")
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to C:\Users\James Gibbons-
[nltk_data]     MacGre\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\James Gibbons-
[nltk_data]     MacGre\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [34]:

print("lemmatize \"dogged\": ", lem.lemmatize("dogged", pos="a"))
print("lemmatize \"dog\": ", lem.lemmatize("dog", pos="n"))
print("stem \"dogged\": ", stemmer.stem("dogged"))
print("stem \"dog\": ", stemmer.stem("dog"))
print('\n')

# the `pos` (part-of-speech/grammatical function) tag
print(lem.lemmatize("better")) # pos is by default 'n' --> it will try to find the closest noun which might not be ideal
print(lem.lemmatize("better", pos ="v")) # let's try to find the closest verb --> fails too
print(lem.lemmatize("better", pos ="a")) # returns 'good' which is a suitable adjective

lemmatize "dogged":  dogged
lemmatize "dog":  dog
stem "dogged":  dog
stem "dog":  dog


better
better
good


In [35]:
#Looking at our hacking book
lem_tokens = []
for t in tokens_without_stop_words:
    lem_tokens.append(lem.lemmatize(t))
lem_vocab = np.unique(lem_tokens)
print("unique words", lem_vocab.shape)
Counter(lem_tokens).most_common(50)

unique words (13360,)


[('The', 1401),
 ('computer', 949),
 ('He', 845),
 ("'", 824),
 ('hacker', 681),
 ('Par', 498),
 ('It', 496),
 ('time', 398),
 ('Electron', 382),
 ("didn't", 374),
 ('Phoenix', 360),
 ('Anthrax', 337),
 ('like', 337),
 ('just', 334),
 ('I', 321),
 ('Mendax', 306),
 ('hacking', 293),
 ('worm', 292),
 ('They', 287),
 ('phone', 283),
 ('network', 277),
 ('police', 275),
 ('people', 251),
 ('machine', 247),
 ('In', 237),
 ('number', 235),
 ('day', 225),
 ('way', 219),
 ('said', 217),
 ('told', 213),
 ('thing', 213),
 ('line', 212),
 ('account', 212),
 ('But', 209),
 ('A', 207),
 ('work', 204),
 ('case', 199),
 ('Pad', 198),
 ('password', 194),
 ('security', 190),
 ('program', 188),
 ('information', 182),
 ('And', 180),
 ('When', 176),
 ('began', 169),
 ('NASA', 169),
 ('system', 169),
 ('file', 167),
 ('did', 165),
 ('know', 165)]

### Capitalisation
It may be tempting to just lower-case every token, in the belief that words have the same meaning regardless of case. However, if something is in ALL CAPS it may convey some extra meaning, and the fact that a word is at the start of a sentence can also be important. 

For example 

- John liked to help
- John screamed HELP HELP HELP

Or if one book contained lots of capitalised nouns (like cities and countries), it might tell you it was about Geography.

Some libraries account for this by lower-casing everything, then adding a token that marks the start of a capitalised span and another that marks the end. This gives the best of both worlds. Of course, this only works if your model or vocabulary is able to take sequence and context into account.
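The marker-token idea can be sketched roughly as follows. The marker names `<caps>` and `</caps>` and the function `lower_with_markers` are made up for this illustration (real libraries use their own names and rules), and this version only handles ALL CAPS words, not a capital letter at the start of a word.

```python
# Hypothetical marker tokens; real libraries use their own names
CAPS_START = "<caps>"
CAPS_END = "</caps>"

def lower_with_markers(token_list):
    marked = []
    in_caps = False
    for token in token_list:
        # Count a token as ALL CAPS only if it is longer than one character, so "I" is ignored
        is_caps = token.isupper() and len(token) > 1
        if is_caps and not in_caps:
            marked.append(CAPS_START)
        if not is_caps and in_caps:
            marked.append(CAPS_END)
        in_caps = is_caps
        marked.append(token.lower())
    if in_caps:
        # Close a capitalised run that reaches the end of the list
        marked.append(CAPS_END)
    return marked

marked_tokens = lower_with_markers("John screamed HELP HELP HELP".split())
print(marked_tokens)
```

Every word is now lower case, so `help` and `HELP` share one vocabulary entry, while the markers keep a record of where the shouting started and stopped.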

In [36]:
lower_tokens = []
for word in lem_tokens:
    lower_tokens.append(word.lower())
lower_vocab = np.unique(lower_tokens)
print("unique words", lower_vocab.shape)
bow = Counter(lower_tokens)
bow.most_common(50)

unique words (12026,)


[('the', 1408),
 ('computer', 1017),
 ('he', 845),
 ("'", 824),
 ('hacker', 687),
 ('par', 501),
 ('it', 497),
 ('time', 406),
 ('electron', 383),
 ("didn't", 378),
 ('like', 367),
 ('just', 361),
 ('phoenix', 361),
 ('anthrax', 338),
 ('police', 325),
 ('i', 321),
 ('mendax', 307),
 ('day', 304),
 ('hacking', 302),
 ('network', 298),
 ('worm', 294),
 ('phone', 289),
 ('they', 287),
 ('people', 277),
 ('machine', 249),
 ('in', 243),
 ('system', 240),
 ('number', 235),
 ('new', 221),
 ('way', 221),
 ('said', 218),
 ('told', 216),
 ('line', 216),
 ('account', 214),
 ('thing', 213),
 ('but', 210),
 ('security', 208),
 ('work', 208),
 ('a', 207),
 ('information', 205),
 ('pad', 205),
 ('case', 199),
 ('password', 194),
 ('program', 188),
 ('and', 182),
 ('when', 176),
 ('did', 175),
 ('began', 169),
 ('nasa', 169),
 ('file', 167)]