# Text Analytics Lab 1: Regular Expressions and Vector Representations

### Learning Outcomes
* Be able to set up a Python and Jupyter notebook environment for text analytics.
* Understand how to use regular expressions to preprocess text.
* Know how to carry out text normalisation including lemmatisation.
* Know how to obtain bigram and TF-IDF vector representations of documents and term-document matrices.
* Be able to compute cosine similarity to compare vector representations. 

### Outline

1. Getting started: how to set up your environment, Jupyter notebooks introduction
1. Acquiring raw text data
1. Regular expressions
1. Text normalisation 
1. Term-document matrices
1. Cosine Similarity
1. TF-IDF and bigram vectors

### How To Complete This Lab

Read the text and the code then look for 'TODOs' that instruct you to complete some missing code. Look out for 'QUESTIONS' which you should try to answer before moving on to the next cell. Aim to work through the lab during the scheduled lab hours. To get help, you can talk to TAs or the lecturer during the labs, post questions to the Blackboard discussion board or on Teams, or ask a question in the lectures. 

The labs *will not be marked*. However, they will prepare you for the coursework, so try to keep up with the weekly labs and have fun with the exercises! Check the textbook (Jurafsky and Martin) for more information on the methods implemented here.

### Copilot and other AI tools

If you are using an IDE like Visual Studio, we recommend switching off AI tools like Copilot while you are doing the lab. This is because the AI assistant will attempt to generate the answers for you -- sometimes it will be right, and you won't learn anything, and sometimes it will be wrong, and you'll just be confused!

## 1. Getting Started

### Setting up your environment

We recommend using ```conda``` to create an environment with the correct versions of all the packages you need for these labs. You can install either Anaconda or Miniconda, which will include the ```conda``` program. 

We provide a .yml file that lists all the packages you will need, and the versions that we have tested the labs with. You can use this file to create your environment as follows.

1. Open a terminal. Use the command line to navigate to the directory containing this notebook and the file ```crossplatform_environment.yml```. You can use the command ```cd``` to change directory on the command line.

1. For Lab machines only (e.g., in MVB 2.11 and QB 1.80): Load the Anaconda module: ```module load anaconda/3-2024```.

1. Run the conda program by typing ```conda env create -f crossplatform_environment.yml```, then answer any questions that appear on the command line.

1. Activate the environment by running the command ```conda activate text_analytics```.

1. Install some libraries that are not available through Conda: ```pip install bertopic umap-learn```.

1. Make the kernel available in Jupyter: ```python -m ipykernel install --user --name=text_analytics```.

1. Relaunch Jupyter: shut down any running instances, and then type ```jupyter lab``` into your command line.

1. Find this notebook and open it up again.

1. Go to the top menu and change the kernel: click on 'Kernel'--> 'Change kernel' --> text_analytics.

You should now be ready to go!

The core libraries we will be using in this unit are:

- [Datasets](https://huggingface.co/docs/datasets/), produced by HuggingFace, is a hub for lots of interesting text datasets.
- [NLTK](https://www.nltk.org), a comprehensive NLP library.
- [Scikit-learn](https://scikit-learn.org/stable/user_guide.html), for machine learning and classifier evaluation.
- [Gensim](https://radimrehurek.com/gensim/), for topic modelling.
- [Transformers](https://huggingface.co/docs/transformers/en/index), for state-of-the-art NLP models. 
- [PyTorch](https://pytorch.org/), a framework for deep learning. 
- [BERTopic](https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html) for clustering documents into topics.

The libraries above have good documentation, which is available either online (links above) or via Python itself, e.g. `help(numpy.array)` in the Python interpreter. 

### Refreshers for Python and Jupyter

If you need a refresher on Python, see the [Introduction to Python lab](https://github.com/UoB-COMS21202/lab_sheets_public/tree/master/lab_1) or the University of Bristol [Beginning Python](https://milliams.gitlab.io/beginning_python/) course. If you are a beginner with Python, you might also like to look at Chapter 1 in the NLTK book, which also provides a guide for "getting started with Python": https://www.nltk.org/book/. 

The labs will be run on [Jupyter Notebook](http://jupyter.org/), an interactive coding environment embedded in a webpage supporting various programming languages (Python, R, Lua, etc.) through the concept of kernels. The code in a notebook is arranged in _cells_. To edit an already existing cell simply double-click on it. Cells can be run by hitting `shift+enter` when editing a cell or by clicking on the `Run` button at the top. Create new cells with the keyboard shortcut `esc` followed by `A` or `B`.

**Note**: when you run a code cell, all the created variables, implemented functions and imported libraries will be then available to every other code cell. It is commonly assumed that cells will be run in the correct sequence and running them repeatedly or out-of-order may sometimes cause errors. To reset all variables and functions (for debugging) simply click `Kernel > Restart` from the Jupyter menu.

#### Markdown 

Markdown cells (like this one) allow you to write fancy comments in Markdown format - double click on this cell to see the source. An introduction to Markdown syntax can be found [here](https://daringfireball.net/projects/markdown/syntax). You can also display simple $\LaTeX$ equations in Markdown thanks to `MathJax` support: for inline equations wrap your equation between `$` symbols; for display mode equations use `$$`.

## 1. Acquiring Raw Text Data

Now, let's get some text data! [HuggingFace's datasets hub](https://huggingface.co/datasets) is a repository of many different text datasets: they are useful for experimenting with NLP tasks and training models. For this lab, we'll start with the IMDB dataset, which contains movie reviews along with their classification into "positive" or "negative" sentiment. Run the code below to download the data from [HuggingFace's datasets hub](https://huggingface.co/datasets/imdb):

In [2]:
from datasets import load_dataset
import numpy as np

cache_dir = "./data_cache"

# The data is already divided into training and test sets.
# Load the training set:
train_dataset = load_dataset(
    "imdb", # name of the dataset collection
    split="train",  # train or test
    cache_dir=cache_dir,
)
print(f"Training dataset with {len(train_dataset)} instances loaded")

train_dataset = np.random.choice(train_dataset, 100, replace=False)  # we'll only use a subset of the data in this lab so that the code runs quicker


Training dataset with 25000 instances loaded


We can access the documents in the dataset like elements in a list. For example, the document with index 5 looks like this:

In [3]:
train_dataset[5]

 'label': 0}

**TO-DO 1:** Print the label for document 31. What does the value mean?

In [5]:
#*** WRITE YOUR ANSWER HERE ***
print(train_dataset[31]['label'])  

0


# 2. Regular Expressions

In text analytics, we aim to retrieve or extract information from text documents, or classify or summarise documents to better understand a large amount of text. Typically, we are not just looking for a single word or phrase: that can be useful for retrieving documents given a keyword query, but there are many cases where we want to recognise more complex and variable patterns. For example, if we want to find dates, we cannot list all the possible combinations of digits we want to search for, but we can look for patterns of numbers in date format. To do this, we need a way to represent the patterns we are looking for inside a piece of text. The most direct way to represent text patterns is to use regular expressions. Regular expressions provide a standard language for writing text patterns, which we will learn about below. 

## 2.1 Search

We'll start by trying out some simple regular expressions. Suppose we want to identify reviews where people discuss what they really loved about certain movies. Before we try to look for more general patterns, a first step is just to look for all occurrences of the word 'love'. Review the code below to see how we can do this:

In [6]:
import re  # Python regular expressions library

all_matches = []

for review in train_dataset:
    matches = re.findall('love', review['text'])
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match)

36
love


This has given us a list of matches in the variable `all_matches`, which all contain the string 'love', but not the sentences themselves.
This isn't very useful, but we can do better if we define the right regular expression!

Regular expressions represent patterns, rather than specific strings, allowing us to generalise our search and retrieve many different strings that match the pattern.
In Python, we conventionally write a regular expression as a _raw string_ by putting an 'r' character in front of the opening quote. This stops Python from interpreting backslashes as escape sequences, so they reach the regular expression engine unchanged.
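
To see what the 'r' prefix actually changes, here is a small sketch on a made-up string (the text and variable names are illustrative and not part of the lab data). It compares a normal string with a raw string containing the word-boundary token `\b`:

```python
import re

# A normal string interprets '\b' as a backspace character;
# a raw string keeps the backslash, so the regex engine sees the word-boundary token \b.
normal = '\bcat\b'
raw = r'\bcat\b'

print(len(normal))  # 5: two backspace characters plus 'cat'
print(len(raw))     # 7: each '\b' is stored as two characters

text = 'the cat sat on the catalogue'
raw_matches = re.findall(raw, text)        # matches 'cat' only as a whole word
normal_matches = re.findall(normal, text)  # searches for literal backspace characters
print(raw_matches)
print(normal_matches)
```

Patterns without backslashes, such as `'love'`, behave the same with or without the prefix, which is why the first search above worked without it.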

We can generalise our search by using a _disjunction_, which will match against any one of a set of characters. The disjunction is written inside square brackets. 

Let's try to retrieve instances of the word "love" followed by a space and any lower case letter. We can write a disjunction that matches any lower case letter as `[a-z]`:

In [7]:
all_matches = []

for review in train_dataset:
    matches = re.findall(r'love [a-z]', review['text'])
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match)

23
love y
love t
love s
love l
love i
love a
love c
love h
love w
love b
love m


Our current search only matches a single letter of the word after 'love'. The length of that following word is variable, so how can we write an expression to match the whole word? 

Here, we can use a special character, '\*', which will match against zero or more repetitions of the preceding regular expression. Let's try it out:

In [8]:
all_matches = []

for review in train_dataset:
    matches = re.findall(r'love [a-z]*', review['text'])
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match)

24
love you
love anyone
love to
love slowly
love with
love without
love creepy
love interest
love me
love it
love story
love listening
love 
love her
love them
love interests
love and
love the
love bed
love subplot


Let's say we only want to retrieve the word following 'love', not the string containing 'love ' itself. 
We can do this using parentheses to create _groups_ of characters, such as this: `([a-z]*)`. The resulting matches will be returned as tuples of groups, and any characters not inside parentheses will not be returned as part of any group. Try out the code below to see this, and note that the space character after 'love' is not returned in the matches.

In [9]:
all_matches = []

for review in train_dataset:
    matches = re.findall(r'(love) ([a-z]*)', review['text'])
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match)

for match in set(all_matches):  # print only the word that follows 'love'
    print(match[1])

24
('love', 'it')
('love', 'bed')
('love', 'the')
('love', 'subplot')
('love', 'me')
('love', 'interests')
('love', 'and')
('love', 'listening')
('love', 'interest')
('love', 'her')
('love', 'story')
('love', '')
('love', 'slowly')
('love', 'with')
('love', 'you')
('love', 'anyone')
('love', 'creepy')
('love', 'to')
('love', 'them')
('love', 'without')
it
bed
the
subplot
me
interests
and
listening
interest
her
story

slowly
with
you
anyone
creepy
to
them
without


Now, let's try to retrieve the preceding words as well. It would be better to match capital letters as well as lower case, which we can do with the disjunction `[a-zA-Z]`. 

**TO-DO 2:** complete the code below to retrieve only the words that precede and follow 'love', including capitalised and lower case words.

In [10]:
all_matches = []

for review in train_dataset:
    
    ### WRITE YOUR CODE HERE
    matches = re.findall(r'([a-zA-Z]*) love ([a-zA-Z]*)', review['text'])
    ########
    
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match)

24
('i', 'it')
('in', 'with')
('and', 'slowly')
('i', 'bed')
('I', 'it')
('a', 'subplot')
('I', 'the')
('you', 'creepy')
('the', 'interest')
('the', 'her')
('s', 'interests')
('his', 'without')
('just', 'listening')
('', 'me')
('d', 'to')
('and', 'to')
('I', 'Pride')
('romantic', 'story')
('i', 'you')
('also', 'anyone')
('i', 'them')
('The', 'and')


This is starting to look more useful, but we still want to retrieve whole sentences. 

Sentences in English are usually demarcated by punctuation (this is not the same for languages in other scripts, such as Chinese, Hindi and Thai). As we're working with English text only at the moment, let's use the following punctuation marks to identify sentence boundaries: '.', '!', '?'. In the regular expression language, '.' and '?' are special characters that do not literally represent the symbols themselves ('!' is not special, but escaping it is harmless). To force Python to interpret them literally, we can put the escape character '\\' in front of them. Inside square brackets most special characters lose their special meaning anyway, so the escapes are optional there, but we include them for consistency.

Now, we can write a disjunction that matches against the punctuation like this: `[\.\!\?]`.

So far, we have assumed the text consists only of letters. Can you think of any characters we have excluded here? 

It can be hard to list every character we want to match. A better way to find all matches could be to use _negation_ to match against any character _except_ the punctuation marks that bound the sentences. A negation will match any character except those specified, which we can write like this: `[^\.\!\?]`, where the '^' indicates the negation.
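
To see how a negated disjunction behaves, here is a small sketch on a made-up string (the example text and variable names are illustrative only). It also checks that the escaped and unescaped forms give the same result inside square brackets:

```python
import re

text = "What a film! I love it. Would I watch it again? Yes"

# [^.!?]+ matches one or more characters that are NOT sentence-ending punctuation
chunks = re.findall(r'[^.!?]+', text)
print(chunks)

# The escaped form used in this lab matches exactly the same chunks
chunks_escaped = re.findall(r'[^\.\!\?]+', text)
print(chunks == chunks_escaped)
```

Note that each chunk after the first keeps its leading space, because a space is not one of the excluded characters.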


**TO-DO 3:** Retrieve whole sentences containing 'love'. To do this, modify our previous expression by using negation to match all of the characters except '.', '!', and '?'.

In [11]:
all_matches = []

for review in train_dataset:
    
    ### WRITE YOUR CODE HERE
    matches = re.findall(r'[^\.\!\?]* love [^\.\!\?]*', review['text'])  

    ########
    
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match)  

23
 I love Pride and Prejudice and Sense and Sensibility books and movies, and I'm half way through Mansfield Park
But I love it
 The way the film is narrated: Humanity and love slowly developing between these two outsiders, and contrasted to the simultaneously & continuously ongoing inhumane marching pace of the fascist radio announcer (who happens to be a colleague of Mastroianni's part)and the adherents "going to and coming from the show"
 I love the moment Mr
 a nice start you might say, but then it got a bit greedy, very greedy, it tries to be a science fiction, a drama, a thriller, a possible romantic love story, fairy tale, a comedy and everything under the sun
 Gus, a man besotted and passionately in love, is prepared to give up his love without complaint
<br /><br />The scene that is totally wasted is when both of Cooper's love interests and their respective fathers are cooped up in the same hotel room together
i love bed knobs and broomsticks so much that it makes me cry a th

Look at the results -- does the regular expression correctly return sentences containing 'love'?

There are lots more special characters that you can use to form really powerful regular expressions for segmenting, retrieving and substituting text. For your reference, you can find a complete list [here](https://docs.python.org/3/library/re.html#regular-expression-syntax). You can take a look at this list and try to rewrite the expressions above in different ways using the special characters.

## 2.2 Substitution

Besides matching and retrieving pieces of text, regular expressions can also be used to alter text by substituting one string for another. There are many potential uses, such as filling in templates by replacing placeholders with dates, filenames or other information. For example, imagine a system for sending automated reminders of doctor's appointments. It may contain a sentence "This is to remind you of your appointment on DATE at TIME.". Substitution can be used to replace the strings 'DATE' and 'TIME' with specific values.

Regular expression _substitution_ finds a matching string within a larger piece of text, and replaces it with another string.
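
The appointment reminder example could be sketched as follows. The date and time values are hypothetical, chosen only for illustration:

```python
import re

template = "This is to remind you of your appointment on DATE at TIME."

# Hypothetical values, chosen only for illustration
values = {"DATE": "12 March", "TIME": "10:30"}

# \b marks a word boundary, so 'DATE' inside a longer word such as 'UPDATE' is left alone
message = template
for placeholder, value in values.items():
    message = re.sub(r'\b' + placeholder + r'\b', value, message)

print(message)
```

For fixed placeholders like these, the plain string method `str.replace` would also work; regular expressions become useful once the thing to be replaced is a pattern rather than a fixed string.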

Let's use this to clean up the text by removing the line break characters.

In Python, we can use the re.sub() function, which takes three arguments:
1. The expression to match.
2. The string to replace each match with.
3. The text to apply the substitution to.

Some of the reviews contain some HTML formatting code, `<br />`, which we can try to remove to clean up the text. We can do this by writing an expression for the first argument of re.sub() that matches '<br />'. Take a look at how this works by running the code below:

In [12]:
print('ORIGINAL TEXT: ')
print(train_dataset[5]['text'])
    
clean_article = re.sub(r'<br />', r' ', train_dataset[5]['text'])  # replace HTML breaks with a space
    
print('CLEANER TEXT: ')
print(clean_article)

ORIGINAL TEXT: 
CLEANER TEXT: 


# 3. Text Normalisation 

For most text analytics tasks, such as document classification, we will first need to transform the raw text to a suitable format for input to a method such as a classifier. This process is called _text normalisation_ and is part of the _preprocessing_ stage. There are three common steps:

1. Sentence segmentation: this is needed when we want to process each sentence separately, e.g., to classify its sentiment. We have already tried out a basic approach to obtaining complete sentences using regular expressions. This would need to be modified to return a list of all sentences in a document. 
2. Tokenisation, in which the sentences are split into a sequence of tokens, which include words, numbers and punctuation marks.
3. Word normalisation, in which different forms of a word are replaced by a root form. Many text analytics models, such as document classifiers, can benefit from words being _normalised_ to consistent word forms (e.g., "dog", "Dog" and "dogs" could all be normalised to "dog"), as this can reduce the diversity of the vocabulary and make it easier to find meaningful patterns in the data.
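
As a rough sketch of the modification mentioned in step 1, the negation pattern from Section 2 can be used to return every sentence in a document, not only those containing a keyword. The example text and the helper name `regex_sentences` are made up for illustration, and this naive approach splits wrongly at abbreviations such as 'Mr.':

```python
import re

def regex_sentences(document):
    """Naively split a document into sentences at '.', '!' or '?'."""
    # A sentence is a run of non-terminator characters followed by one or more terminators.
    # Any trailing text without a terminator is dropped by this pattern.
    candidates = re.findall(r'[^.!?]+[.!?]+', document)
    return [s.strip() for s in candidates]

doc = "I love the moment Mr. Darcy appears! Is it a perfect film? Not quite..."
sentences = regex_sentences(doc)
for s in sentences:
    print(s)
```

The first sentence is cut off after 'Mr.', which is the same kind of error seen in the regular expression results in Section 2.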

We are now going to see how to perform these steps using the NLTK library.

## 3.1 Sentence Segmentation

Let's start by using NLTK to split a document into sentences. This should give better results than our regular expressions above.

You may get some errors from NLTK when you try to use sent_tokenize or word_tokenize further down. This is usually because you need to download and install some NLTK data. Please check the error message to find out which package is required. You probably need to install packages called 'punkt', 'punkt_tab' and 'wordnet'. You can install these packages by running the cell below.

In [13]:
import nltk 

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/es1595/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/es1595/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/es1595/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
import nltk

review = train_dataset[5]['text']

sents = nltk.sent_tokenize(review)

for sent in sents:
    print("<SENTENCE>")
    print(sent)  # print the sentences of this document

<SENTENCE>
It is hard to describe this film and one wants to tried hard not to dismiss it too quickly because you have a feeling that this might just be the perfect film for some 12 years old girl...<br /><br />This film has a nice concept-the modern version of Sleeping Beauty with a twist.
<SENTENCE>
It has some rather dreamy shots and some nice sketches of the young boy relationship with his single working mother and his schoolmate... a nice start you might say, but then it got a bit greedy, very greedy, it tries to be a science fiction, a drama, a thriller, a possible romantic love story, fairy tale, a comedy and everything under the sun.
<SENTENCE>
The result just left the audience feeling rather inadequate.
<SENTENCE>
For example, the scene when the girl(played by Risa Goto) finally woken by his(Yuki Kohara) kiss, instead of being romantic, it try's to be scary in order to make us laugh afterwards... it is a cheap trick, because it ruin all the anticipation and emotion which it wa

**TO-DO 4:** Use the regular expression substitution code from section 2.2 to remove the '\<br /\>' tags from the sentences displayed above and print the results.

In [18]:
clean_sents = []

for sent in sents:
    
    ### WRITE YOUR OWN CODE HERE
    sent = re.sub(r'<br />', r' ', sent)
    #######
    
    print("<SENTENCE>")
    print(sent)  # print the sentences of this document
    
    clean_sents.append(sent)  # save the cleaned sentences for later

<SENTENCE>
It is hard to describe this film and one wants to tried hard not to dismiss it too quickly because you have a feeling that this might just be the perfect film for some 12 years old girl...  This film has a nice concept-the modern version of Sleeping Beauty with a twist.
<SENTENCE>
It has some rather dreamy shots and some nice sketches of the young boy relationship with his single working mother and his schoolmate... a nice start you might say, but then it got a bit greedy, very greedy, it tries to be a science fiction, a drama, a thriller, a possible romantic love story, fairy tale, a comedy and everything under the sun.
<SENTENCE>
The result just left the audience feeling rather inadequate.
<SENTENCE>
For example, the scene when the girl(played by Risa Goto) finally woken by his(Yuki Kohara) kiss, instead of being romantic, it try's to be scary in order to make us laugh afterwards... it is a cheap trick, because it ruin all the anticipation and emotion which it was trying t

## 3.2 Tokenisation

NLTK provides a similar function for tokenizing the text at the word level. You can find the documentation [here](https://www.nltk.org/api/nltk.tokenize.html). Most tokenizers use either regular expressions or a machine learning model that was trained on a large dataset to learn token-splitting rules. 
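
To illustrate the regular-expression style of tokenizer, here is a minimal sketch. The pattern and the helper name `simple_tokenize` are assumptions made for illustration and are not the rules that NLTK uses:

```python
import re

def simple_tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # \w+(?:[-']\w+)* : a word, optionally joined by internal hyphens or apostrophes
    # [^\w\s]         : any single character that is neither a word character nor whitespace
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

tokens = simple_tokenize("It's a well-known film, isn't it?")
print(tokens)
```

This sketch keeps "It's" as one token, which is a design choice; other tokenizers may split contractions into separate tokens.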

**TO-DO 5:** Use word_tokenize() to tokenize each of the sentences from the last cell.

In [19]:
tokenized_sents = []

for sent in clean_sents:
    ### WRITE YOUR OWN CODE HERE
    tokens = nltk.word_tokenize(sent)
    #######
    
    print("<TOKENS>")
    print(tokens)
    
    tokenized_sents.append(tokens)

<TOKENS>
['It', 'is', 'hard', 'to', 'describe', 'this', 'film', 'and', 'one', 'wants', 'to', 'tried', 'hard', 'not', 'to', 'dismiss', 'it', 'too', 'quickly', 'because', 'you', 'have', 'a', 'feeling', 'that', 'this', 'might', 'just', 'be', 'the', 'perfect', 'film', 'for', 'some', '12', 'years', 'old', 'girl', '...', 'This', 'film', 'has', 'a', 'nice', 'concept-the', 'modern', 'version', 'of', 'Sleeping', 'Beauty', 'with', 'a', 'twist', '.']
<TOKENS>
['It', 'has', 'some', 'rather', 'dreamy', 'shots', 'and', 'some', 'nice', 'sketches', 'of', 'the', 'young', 'boy', 'relationship', 'with', 'his', 'single', 'working', 'mother', 'and', 'his', 'schoolmate', '...', 'a', 'nice', 'start', 'you', 'might', 'say', ',', 'but', 'then', 'it', 'got', 'a', 'bit', 'greedy', ',', 'very', 'greedy', ',', 'it', 'tries', 'to', 'be', 'a', 'science', 'fiction', ',', 'a', 'drama', ',', 'a', 'thriller', ',', 'a', 'possible', 'romantic', 'love', 'story', ',', 'fairy', 'tale', ',', 'a', 'comedy', 'and', 'everything'

Run the code below to see how NLTK has handled the non-letter characters. 
* What does it do with most punctuation marks? 
* When does it not split tokens based on punctuation?

In [20]:
for sent in tokenized_sents:
    for tok in sent:
        if re.search(r'[^a-zA-Z0-9]', tok):  # find the non-letter and non-digit characters
            print(tok)  # print the entire token containing the non-letter/non-digit character

...
concept-the
.
...
,
,
,
,
,
,
,
,
.
.
,
(
)
(
)
,
,
's
...
,
.
(
well-known
comic-book
)
?
``
''
comic-book
.
,
's
(
bus-ride
)
's
?
(
!
)
.
``
''
``
``
''
...
...
``
''
.
,
.
``
''
,
.


## 3.3 Word Normalisation

Many words can appear in different forms, including: 
* Conjugated verbs like "think", "thinks" and "thought",
* Plural and singular nouns like "dog" and "dogs",
* Common abbreviations and synonyms like "USA" and "US". 

Mapping all of these surface forms to a single root form reduces the size of the vocabulary that we have to deal with and can therefore improve the performance of text classifiers or topic models.

The two most widely used tools for this task in English are the Porter Stemmer and WordNet Lemmatizer. These tools apply a series of regular expression substitutions to tokenised text to convert words to a standard format. 
* The Porter stemmer is much faster but just strips word endings (suffixes) using fixed rules, which leads to some errors. It is often used when real-time or high-volume text processing is needed.
* As well as applying regular expressions, lemmatizers look words up in a dictionary to find their root forms, so are more accurate but much slower. 
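
The difference between the two approaches can be sketched with a toy example. The suffix rules and the tiny lookup dictionary below are invented for illustration; they are far simpler than the real Porter algorithm or WordNet:

```python
import re

# Toy suffix-stripping rules, tried in order (NOT the real Porter rules)
SUFFIX_RULES = [(r'ies$', 'i'), (r'ing$', ''), (r'ed$', ''), (r's$', '')]

# Toy dictionary mapping word forms to lemmas (a stand-in for WordNet)
LEMMA_DICT = {'thought': 'think', 'was': 'be', 'stories': 'story'}

def toy_stem(word):
    """Apply the first matching suffix rule to a lower-cased word."""
    word = word.lower()
    for pattern, replacement in SUFFIX_RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary, falling back to the stemmer."""
    word = word.lower()
    return LEMMA_DICT.get(word, toy_stem(word))

for w in ['dogs', 'stories', 'thought', 'was']:
    print(w, '->', toy_stem(w), '|', toy_lemmatize(w))
```

The stemmer turns 'was' into 'wa' and leaves 'thought' unchanged, while the dictionary lookup recovers 'be' and 'think'. This is the trade-off described above: rules are fast but blunt, whereas lookups are more accurate but need a dictionary.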

Let's start by applying the [Porter Stemmer class](https://www.nltk.org/_modules/nltk/stem/porter.html) to our tokenised text by calling the stem() method. The output may look a bit strange, but note that the aim of the stemmer is *not* to produce readable text, but to quickly and efficiently reduce variations of words to a single form. 

In [21]:
stemmer = nltk.PorterStemmer() 
stemmed_sents = []

for sent in tokenized_sents:
    stemmed_sent = [stemmer.stem(tok) for tok in sent]
    
    stemmed_sents.append(stemmed_sent)
    
    print("<STEMMED TOKENS>")
    print(stemmed_sent)

<STEMMED TOKENS>
['it', 'is', 'hard', 'to', 'describ', 'thi', 'film', 'and', 'one', 'want', 'to', 'tri', 'hard', 'not', 'to', 'dismiss', 'it', 'too', 'quickli', 'becaus', 'you', 'have', 'a', 'feel', 'that', 'thi', 'might', 'just', 'be', 'the', 'perfect', 'film', 'for', 'some', '12', 'year', 'old', 'girl', '...', 'thi', 'film', 'ha', 'a', 'nice', 'concept-th', 'modern', 'version', 'of', 'sleep', 'beauti', 'with', 'a', 'twist', '.']
<STEMMED TOKENS>
['it', 'ha', 'some', 'rather', 'dreami', 'shot', 'and', 'some', 'nice', 'sketch', 'of', 'the', 'young', 'boy', 'relationship', 'with', 'hi', 'singl', 'work', 'mother', 'and', 'hi', 'schoolmat', '...', 'a', 'nice', 'start', 'you', 'might', 'say', ',', 'but', 'then', 'it', 'got', 'a', 'bit', 'greedi', ',', 'veri', 'greedi', ',', 'it', 'tri', 'to', 'be', 'a', 'scienc', 'fiction', ',', 'a', 'drama', ',', 'a', 'thriller', ',', 'a', 'possibl', 'romant', 'love', 'stori', ',', 'fairi', 'tale', ',', 'a', 'comedi', 'and', 'everyth', 'under', 'the', 'su

Now let's compare the stemming results to lemmatisation. For this task, NLTK provides the [class WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html) with the method lemmatize(). This method takes an argument, `pos`, that determines whether the lemmatizer is applied to nouns, verbs, adjectives or adverbs.

**TO-DO 6:** Use the WordNetLemmatizer to lemmatize the nouns in the tokenized sentences. Set the `pos` argument to 'n'. 

**TO-DO 7:** Add a second call to lemmatize() to lemmatize the verbs in the sentences as well. Set the `pos` argument to 'v'. 

How do the results compare with the Porter stemmer? 

How have the verbs in the sentences changed?

In [22]:
lemmatizer = nltk.WordNetLemmatizer() 
lemma_sents = []
for sent in tokenized_sents:
    
    ### WRITE YOUR OWN CODE HERE
    lemma_sent = [lemmatizer.lemmatize(lemmatizer.lemmatize(tok, pos='v'), pos='n') for tok in sent]
    #######
    
    lemma_sents.append(lemma_sent)
    
    print("<LEMMATIZED TOKENS>")
    print(lemma_sent)

<LEMMATIZED TOKENS>
['It', 'be', 'hard', 'to', 'describe', 'this', 'film', 'and', 'one', 'want', 'to', 'try', 'hard', 'not', 'to', 'dismiss', 'it', 'too', 'quickly', 'because', 'you', 'have', 'a', 'feel', 'that', 'this', 'might', 'just', 'be', 'the', 'perfect', 'film', 'for', 'some', '12', 'year', 'old', 'girl', '...', 'This', 'film', 'have', 'a', 'nice', 'concept-the', 'modern', 'version', 'of', 'Sleeping', 'Beauty', 'with', 'a', 'twist', '.']
<LEMMATIZED TOKENS>
['It', 'have', 'some', 'rather', 'dreamy', 'shot', 'and', 'some', 'nice', 'sketch', 'of', 'the', 'young', 'boy', 'relationship', 'with', 'his', 'single', 'work', 'mother', 'and', 'his', 'schoolmate', '...', 'a', 'nice', 'start', 'you', 'might', 'say', ',', 'but', 'then', 'it', 'get', 'a', 'bite', 'greedy', ',', 'very', 'greedy', ',', 'it', 'try', 'to', 'be', 'a', 'science', 'fiction', ',', 'a', 'drama', ',', 'a', 'thriller', ',', 'a', 'possible', 'romantic', 'love', 'story', ',', 'fairy', 'tale', ',', 'a', 'comedy', 'and', 'e

# 5. Vector Representations of Text

Regular expressions are great for tasks such as finding specific patterns in text. However, it is not always possible to write a regular expression that captures all the patterns we want to find. For example, suppose we want to classify social media posts into positive and negative sentiment to see if they are favourable towards a particular famous person. We can't write down a pattern to capture all the ways of saying favourable things about that person -- it's way too diverse. 

Instead, we can use machine learning to learn to recognise a wide range of patterns from a set of examples. To classify a new example, a machine learning classifier requires a *representation* of each piece of text that it can compare against the patterns it has learned. Raw text is not usually a suitable representation, and we usually need a way to turn text data into vectors -- essentially, lists of numbers. Vector representations have several advantages: for example, they map words, sentences and documents to points in a high-dimensional space, so we can learn to separate the space into different regions corresponding to classes; they allow us to compute the similarity between pieces of text by computing distances. 

In this section, we'll look at the simplest way to obtain vector representations of words and documents by constructing a *term-document matrix*. A term-document matrix has rows referring to terms, and columns referring to documents. Each element contains a count of how many times a particular term occurred in a particular document. We can treat rows as vector representations of terms, and columns as vector representations of documents.

To compute term-document matrices, we need to use the text normalisation steps above. Most importantly, we need to tokenise the text into words (and other types of token) so we can count their occurrences. Normalising the words is often helpful too, as it reduces the number of rows in the matrix and makes it less sparse.
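
As a small illustration of why normalisation shrinks the matrix, the sketch below compares vocabulary sizes before and after lowercasing. The token list is a made-up example, and lowercasing stands in for the fuller normalisation steps covered earlier.

```python
# Hypothetical tokens containing several casing variants of the same words
raw_tokens = ["The", "film", "the", "Film", "FILM", "was", "Was", "great"]

# Without normalisation, every casing variant needs its own row in the matrix
raw_vocab = set(raw_tokens)

# After lowercasing, the variants collapse into a single row each
norm_vocab = set(tok.lower() for tok in raw_tokens)

print(f"Rows without normalisation: {len(raw_vocab)}")
print(f"Rows with lowercasing:      {len(norm_vocab)}")
```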

We can compute a term-document matrix using the CountVectorizer class from Scikit-learn. By default, this class takes raw text sequences, lowercases them, and applies a simple regular-expression tokenizer that keeps tokens of two or more alphanumeric characters:

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

input_text = [review['text'] for review in train_dataset]  # use the list of review texts as our documents

vectorizer = CountVectorizer()
vectorizer.fit(input_text)  



Have you seen the method "fit()" before with other Scikit-learn classes? What do you think it does for the CountVectorizer?

Looking inside the vectorizer, we can see the vocabulary it has created from the input text: 

In [24]:
vectorizer.vocabulary_

{'thought': 4409,
 'this': 4404,
 'movie': 2888,
 'was': 4765,
 'really': 3535,
 'great': 1939,
 'helena': 2050,
 'did': 1198,
 'an': 197,
 'amazing': 180,
 'job': 2359,
 'in': 2207,
 'it': 2325,
 'she': 3910,
 'played': 3285,
 'her': 2060,
 'character': 707,
 'very': 4693,
 'well': 4796,
 'awesome': 350,
 'actress': 97,
 'br': 538,
 'the': 4380,
 'also': 173,
 'funny': 1807,
 'too': 4458,
 'jokes': 2370,
 'were': 4798,
 'couldnt': 973,
 'stop': 4156,
 'laughing': 2490,
 'think': 4400,
 'everyone': 1498,
 'should': 3930,
 'see': 3828,
 'dynasty': 1358,
 'revisited': 3660,
 'hawaii': 2024,
 'full': 1803,
 'of': 3053,
 'clichés': 783,
 'highly': 2079,
 'predictable': 3364,
 'unrealistic': 4623,
 'and': 202,
 'sometimes': 4040,
 'even': 1489,
 'stupid': 4206,
 'if': 2172,
 'you': 4924,
 'have': 2021,
 'nothing': 3013,
 'better': 451,
 'to': 4445,
 'do': 1268,
 'however': 2137,
 'does': 1275,
 'provide': 3443,
 '40': 42,
 'minutes': 2808,
 'simple': 3962,
 'unpretensive': 4620,
 'entertain


Next, we need to call "transform()" to get a term-document matrix. Try it out and find out what it produces:

In [25]:
term_doc_mat = vectorizer.transform(input_text).T  # transpose so that rows are terms
print(term_doc_mat)
print(term_doc_mat.shape)

<Compressed Sparse Column sparse matrix of dtype 'int64'
	with 14095 stored elements and shape (4946, 100)>
  Coords	Values
  (97, 0)	1
  (173, 0)	1
  (180, 0)	1
  (197, 0)	2
  (350, 0)	1
  (538, 0)	6
  (707, 0)	1
  (973, 0)	1
  (1198, 0)	1
  (1498, 0)	1
  (1807, 0)	1
  (1939, 0)	2
  (2050, 0)	1
  (2060, 0)	1
  (2207, 0)	1
  (2325, 0)	2
  (2359, 0)	1
  (2370, 0)	1
  (2490, 0)	1
  (2888, 0)	2
  (3285, 0)	1
  (3535, 0)	3
  (3828, 0)	1
  (3910, 0)	2
  (3930, 0)	1
  :	:
  (3749, 99)	1
  (3895, 99)	1
  (3933, 99)	4
  (3938, 99)	2
  (3949, 99)	1
  (3958, 99)	2
  (3962, 99)	1
  (3965, 99)	1
  (4045, 99)	1
  (4071, 99)	1
  (4153, 99)	1
  (4379, 99)	7
  (4380, 99)	15
  (4385, 99)	1
  (4400, 99)	2
  (4404, 99)	3
  (4407, 99)	1
  (4432, 99)	1
  (4593, 99)	1
  (4693, 99)	4
  (4765, 99)	2
  (4804, 99)	1
  (4853, 99)	1
  (4889, 99)	1
  (4924, 99)	1
(4946, 100)
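
The output above lists only the non-zero entries: each `(row, column)` pair is a term index and a document index, followed by the count. With 14095 stored elements in a 4946 x 100 matrix, fewer than 3% of the entries are non-zero, which is why a sparse format is used. The sketch below imitates the idea with a plain dictionary; it is a simplified illustration with made-up counts, not how SciPy stores the matrix internally.

```python
# Hypothetical toy counts: 4 terms x 3 documents, mostly zeros
toy_dense = [
    [0, 2, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 3],
]

# Keep only the non-zero entries, keyed by (term index, document index)
toy_sparse = {
    (r, c): value
    for r, row in enumerate(toy_dense)
    for c, value in enumerate(row)
    if value != 0
}

def lookup(sparse_counts, row, col):
    """Return the stored count, or 0 for any entry that was not stored."""
    return sparse_counts.get((row, col), 0)

print(toy_sparse)
print(lookup(toy_sparse, 0, 1), lookup(toy_sparse, 2, 2))
```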


**TO-DO 8:** Use the term-document matrix above to write a function that returns a term vector. Get the term vector for the word 'happy'. 

In [26]:
# WRITE YOUR ANSWER HERE
def get_term_vector(vectorizer, term_doc_mat, word):
    index = vectorizer.vocabulary_[word]
    term_vec = term_doc_mat[index].toarray()
    
    return term_vec.flatten()

vector = get_term_vector(vectorizer, term_doc_mat, 'happy')
print(vector.shape)
print(vector)


(100,)
[0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


**TO-DO 9:** What do the values in the vector mean? This representation is known as a 'bag of words' because it ignores the word order and document structure. Can you think of any disadvantages of representing documents as bags of words? 

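ANSWER = These are the number of times the chosen word (or other token, such as a number) occurred in each document in the dataset. A disadvantage of bags of words is that they discard word order, so 'not good, but great' and 'not great, but good' receive exactly the same representation.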

# 6. Comparing Vectors with Cosine Similarity

Vector representations allow us to compare documents or terms by computing their similarity. This is useful for tasks such as clustering documents into topics, or finding documents that are similar to a 'query' document. 
In order to compute similarity or distance, we need to represent documents as numerical vectors. 

The most common way to compare vectors is to compute the cosine of the angle between them. This measures how much the vectors point in the same direction. It ignores their magnitude, which means that shorter documents with lower word counts can be directly compared to long documents with more words. 

Let's take a term from the IMDB dataset as a 'query' and compare it to two others using cosine similarity:

In [27]:
# get the first term, which we use as the query
query_vec = get_term_vector(vectorizer, term_doc_mat, 'happy')

# get the second term
term2_vec = get_term_vector(vectorizer, term_doc_mat, 'sad')

# get a third term
term3_vec = get_term_vector(vectorizer, term_doc_mat, 'enjoy')

Cosine similarity is defined as:

$$\text{similarity}(v_1, v_2) = \frac{v_1 \cdot v_2}{\| v_1 \| \, \| v_2 \|}$$

**TO-DO 9:** Complete the function below to compute cosine similarity between two vectors. Hint: use Numpy's dot function for the dot product.

In [28]:
def cossim(vec1, vec2):   
    ### WRITE YOUR OWN CODE HERE
    dot_prod = np.dot(vec1, vec2)
    normaliser = np.sum(vec1**2)**0.5 * np.sum(vec2**2)**0.5
    return dot_prod / normaliser 

**TO-DO 10:** Which term do you expect to have higher similarity to the query? Run the code below to use your cosine similarity function, and see if the results meet your expectations.

ANSWER: -- 'enjoy' should be more similar to 'happy' than 'sad' is. 

In [29]:
cos_sim1 = cossim(query_vec, term2_vec)
print(f'The cosine similarity between the query and term2 is: {cos_sim1}')

cos_sim2 = cossim(query_vec, term3_vec)
print(f'The cosine similarity between the query and term3 is: {cos_sim2}')

The cosine similarity between the query and term2 is: 0.0
The cosine similarity between the query and term3 is: 0.3651483716701107
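
Two properties of cosine similarity are worth checking on small vectors: scaling a vector does not change the result, and vectors with no overlapping non-zero entries score zero. For count vectors, a score of exactly 0.0 (as for 'happy' and 'sad' above) therefore means the two terms never occur in the same document in this sample. The sketch below uses a plain-Python helper, `cosine_demo`, and made-up vectors; both are illustrative names, separate from the `cossim` function above.

```python
import math

def cosine_demo(u, v):
    """Cosine similarity of two equal-length, non-zero lists of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_vec = [1, 2, 0]
long_vec = [10, 20, 0]   # same proportions, ten times the counts
disjoint_vec = [0, 0, 5]  # shares no non-zero positions with short_vec

print(cosine_demo(short_vec, long_vec))      # approximately 1: same direction
print(cosine_demo(short_vec, disjoint_vec))  # zero: no shared non-zero entries
```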


# 7. Bags of N-grams

Our representations above were purely bag-of-words representations: they ignored word order and simply counted single tokens, or 'unigrams'. However, word order is important for understanding the meaning of a piece of text. What if we expand our bags of words to also count other _features_ that help us account for word order? Features are any attributes of the text that we can measure; a simple improvement is to count pairs of consecutive tokens, or _bigrams_, to capture phrases as well as individual words. 
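
To see what counting bigrams involves, here is a minimal sketch that extracts them from a tokenised sentence by pairing each token with its successor. The sentence and the variable names are made up for illustration; the tokenisation is a plain whitespace split.

```python
toy_tokens = "the film was not good".split()

# Pair each token with the one that follows it, then join each pair into one feature
toy_bigrams = [" ".join(pair) for pair in zip(toy_tokens, toy_tokens[1:])]

# A bag of unigrams and bigrams simply counts features from both lists
toy_features = toy_tokens + toy_bigrams

print(toy_bigrams)
print(toy_features)
```

The bigram 'not good' is now a feature in its own right, so the negation is no longer lost as it would be with the separate unigrams 'not' and 'good'.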

In [30]:
bigram_vectorizer = CountVectorizer(ngram_range=(1,2))  # include bigrams as well as unigrams
bigram_vectorizer.fit(input_text)  
bigram_vectorizer.vocabulary_  # show the vocabulary of the bigram vectorizer, which includes both unigrams and bigrams

{'thought': 19368,
 'this': 19196,
 'movie': 12080,
 'was': 20830,
 'really': 15090,
 'great': 7778,
 'helena': 8321,
 'did': 5044,
 'an': 785,
 'amazing': 725,
 'job': 10124,
 'in': 9049,
 'it': 9821,
 'she': 16276,
 'played': 14298,
 'her': 8348,
 'character': 3665,
 'very': 20596,
 'well': 21127,
 'awesome': 2033,
 'actress': 344,
 'br': 2854,
 'the': 18080,
 'also': 667,
 'funny': 7299,
 'too': 19853,
 'jokes': 10165,
 'were': 21164,
 'couldnt': 4436,
 'stop': 17197,
 'laughing': 10566,
 'think': 19168,
 'everyone': 6110,
 'should': 16365,
 'see': 15988,
 'thought this': 19380,
 'this movie': 19262,
 'movie was': 12173,
 'was really': 20921,
 'really really': 15113,
 'really great': 15101,
 'great helena': 7800,
 'helena did': 8322,
 'did an': 5045,
 'an amazing': 794,
 'amazing job': 730,
 'job in': 10128,
 'in it': 9128,
 'it thought': 9968,
 'thought she': 19376,
 'she played': 16293,
 'played her': 14301,
 'her character': 8360,
 'character very': 3681,
 'very well': 20633,
 'w

Now we have a vectorizer that can produce representations including bigrams. Let's apply it to the text to get an expanded term-document matrix:   

In [36]:
bigram_doc_mat = bigram_vectorizer.transform(input_text).T
print("The shape of the bigram term-document matrix is: ", bigram_doc_mat.shape)

The shape of the bigram term-document matrix is:  (22190, 100)


# OPTIONAL: 

This part aims to give you some more understanding of bigrams and n-grams in general, and shows you how to use the lemmatizer with the CountVectorizer class. It is not required to do this part, and we will revisit the use of n-grams and lemmatizers in the later lab on classifiers. 

The code below chooses a document that we can experiment with. 

**TO-DO 11:** Find the top three documents that are most similar to `selected_doc` when using vectors of bigrams+unigrams. Print them out. Hint: numpy contains useful functions such as argsort, for sorting a list or array. 

**TO-DO 12:** Repeat the process with the pure unigram bag of words representations. Does the list change? Can you see why it may be different? 

**TO-DO 13:** Experiment with other choices of `selected_doc`, and try increasing the length of the n-gram features from bigrams to trigrams and beyond.  

In [None]:
selected_doc = 1
scores = []

print(input_text[selected_doc])  # print the document we're using as our query
print("\n")

### WRITE YOUR OWN CODE HERE 
for i in range(len(input_text)):
    scores.append(cossim(bigram_doc_mat[:, selected_doc].toarray().flatten(), bigram_doc_mat[:, i].toarray().flatten()))

most_sim = np.argsort(scores)[-4:-1]

for doc in most_sim:
    print(f"Document {doc} with similarity score {scores[doc]}: ")
    print(input_text[doc])
    print("\n")


print("Now for unigrams only: ")
scores = []
for i in range(len(input_text)):
    scores.append(cossim(term_doc_mat[:, selected_doc].toarray().flatten(), term_doc_mat[:, i].toarray().flatten()))

most_sim = np.argsort(scores)[-4:-1]

for doc in most_sim:
    print(f"Document {doc} with similarity score {scores[doc]}: ")
    print(input_text[doc])
    print("\n")

print("Now for different sizes of n-grams: ")
ngram_vectorizer = CountVectorizer(ngram_range=(3,3))  # trigrams only; change the range to try other n-gram lengths
ngram_vectorizer.fit(input_text)  
ngram_doc_mat = ngram_vectorizer.transform(input_text).T
scores = []
for i in range(len(input_text)):
    scores.append(cossim(ngram_doc_mat[:, selected_doc].toarray().flatten(), ngram_doc_mat[:, i].toarray().flatten()))

most_sim = np.argsort(scores)[-4:-1]

for doc in most_sim:
    print(f"Document {doc} with similarity score {scores[doc]}: ")
    print(input_text[doc])
    print("\n")



Dynasty Revisited in Hawaii... Full of clichés, highly predictable, unrealistic and sometimes even stupid. If you have nothing better to do however, it does provide 40 minutes of simple, unpretensive entertainment, endless looks at great male and female muscles and very good photography of the spectacular Hawaiian scenery. On the other hand, If you are looking for anything more than that, stay away...<br /><br />Oh, and by the way, if you have ever worked in a Hotel or know anything about running one, you have two options: 1. You will feel sick every two minutes at the sheer stupidity and silliness of how the show presents Hotel Business or, 2. Look at it as science fiction comedy as I did, lie back, relax, and laugh about it!


Document 40 with similarity score 0.4372419700686498: 
The recent boom of dating show on U. S. television screens has reached a fevered pitch since the first episode of "The Bachelor." Unsuspecting audiences have since been subjected to countless clones and var
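
The slice `[-4:-1]` in the code above may look odd at first. Sorting indices by score in ascending order puts the best match last, and the best match for any query is the query document itself (similarity 1.0), so the slice keeps the three next-best indices and drops the final one. The sketch below reproduces that logic in plain Python with made-up scores; `sorted` with a key plays the role of `np.argsort` here.

```python
# Hypothetical similarity scores; index 1 is the query compared with itself
toy_scores = [0.2, 1.0, 0.5, 0.1, 0.7]

# Indices ordered from lowest to highest score
toy_order = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i])

# Take the last four indices but drop the final one (the query itself)
toy_top3 = toy_order[-4:-1]

print(toy_order)
print(toy_top3)
```

The selected indices come out in ascending order of similarity, so the closest match is printed last.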

The vocabulary size is probably getting very large, now that we are using bigrams and other n-grams. 

To apply lemmatization, we have to go back to the CountVectorizer and define a new tokenizer class that will carry out the extra step of lemmatization. The code below shows how to apply lemmatization with the CountVectorizer class to reduce the vocabulary size. 

In [59]:
class LemmaTokenizer(object):  # this 'tokenizer' will also do additional preprocessing steps, namely, lemmatize verbs and adjectives
    
    def __init__(self):
        self.wnl = nltk.WordNetLemmatizer()
        
    def __call__(self, docs):
        return [self.wnl.lemmatize(self.wnl.lemmatize(tok, pos='v'), pos='a') for tok in nltk.word_tokenize(docs)]
    
lemm_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(), ngram_range=(1,2), token_pattern=None)  # include bigrams as well as unigrams

lemm_vectorizer.fit(input_text)
lemm_term_doc_mat = lemm_vectorizer.transform(input_text).T

# Print out some of the features in the vocabulary:
print(list(lemm_vectorizer.vocabulary_)[:50])



In [60]:
print(f'Vocabulary size without lemmatization: {len(bigram_vectorizer.vocabulary_)}')
print(f'Size of term document matrix with lemmatization: {lemm_term_doc_mat.shape}')

Vocabulary size without lemmatization: 22190
Size of term document matrix with lemmatization: (22142, 100)


**TO-DO 14:** Run the code below and compare with your previous results. Print out the vocabulary to see how the lemmatizer has changed the results. You can also experiment with the 'pos' parameter to lemmatise different categories of word (verbs, adjectives, nouns). 

In [61]:
### WRITE YOUR OWN CODE HERE
print("Now for unigrams and bigrams with lemmatisation: ")
scores = []
for i in range(len(input_text)):
    scores.append(cossim(lemm_term_doc_mat[:, selected_doc].toarray().flatten(), lemm_term_doc_mat[:, i].toarray().flatten()))

most_sim = np.argsort(scores)[-4:-1]

for doc in most_sim:
    print(f"Document {doc} with similarity score {scores[doc]}: ")
    print(input_text[doc])
    print("\n")

Now for unigrams and bigrams with lemmatisation: 
Document 35 with similarity score 0.5922753813024627: 
What the movie The 60s really represents (to those of us who growled around in the belly of America in those times) is the turbulence and diversity of the decade. Despite the exaggerated, stereotyped characters, the genuineness of the issues remains clear.<br /><br />Not only were those radical times of change, but also very confusing times. Two basic things changed our world then: the 1964 Civil Rights Act, and the overwhelming influence of the media. Those two new freedoms began social changes that soon became institutionalized.<br /><br />From chaos came sensitivity, from disorder came values. Bear in mind however, that the bulk of Americans were not involved in this... they worked, they played, they watched the news... and slowly they became effected by the efforts and struggles of the minorities... the Civil Rights workers, the Political Activists, the Anti-War efforts, the War