<div class="alert alert-danger" style="color:black"><b>Running ML-LV Jupyter Notebooks:</b><br>
    <ol>
        <li>Make sure you are running all notebooks using the <code>adv_ai</code> kernel.</li>
        <li><b>It is very important that you do not create any additional files within the weekly folders on CSCT cloud.</b> Any additional files, or editing the notebooks with a different environment, may prevent submission/marking of your work.
            <ul>
                <li>NBGrader will automatically fetch and create the correct folders and files for you.</li>
                <li>All files that are not Jupyter notebooks should be stored in the 'ML-LV/data' directory.</li>
            </ul>
        </li>
        <li>Please <b>do not pip install</b> any Python packages (or anything else). You should not need to install anything to complete these notebooks other than the packages provided in the Jupyter CSCT Cloud environment.</li>
    </ol>
    <b>If you would like to run this notebook locally you should:</b><br>
    <ol>
        <li>Create an environment using the requirements.txt file provided. <b>Any additional packages you install will not be accessible when uploaded to the server and may prevent marking.</b></li>
        <li>Download a copy of the notebook to your own machine. You can then edit the cells as you wish, and afterwards copy the code into (or edit in place) the corresponding cells on the CSCT cloud.</li>
        <li><b>It is very important that you do not re-upload any notebooks that you have edited locally.</b> This is because NBGrader uses cell metadata to track marked tasks. <b>If you change this format it may prevent marking.</b></li>
    </ol>
</div>

# 2 Language Representation

## 2.0 Import libraries

1. [Sklearn (scikit-learn)](https://scikit-learn.org/stable/) is a comprehensive Python library for machine learning. We will use its text pre-processing features and its PCA implementation.

In [2]:
import os
import spacy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk import ngrams
from collections import Counter
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
%matplotlib inline

# Get the status of NBgrader (for skipping cell execution while validating/grading)
grading = bool(os.getenv('NBGRADER_EXECUTION'))

# Increase pandas display width
pd.set_option('display.width', 500)
# Set seaborn style for matplotlib plots
plt.style.use('seaborn-v0_8-white')

# Get the project directory (should be in ML-LV)
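path = os.path.abspath('')
while os.path.basename(path) != 'ML-LV':
    parent = os.path.dirname(path)
    # Stop at the filesystem root instead of looping forever
    if parent == path:
        raise FileNotFoundError("Could not find the 'ML-LV' project directory")
    path = parent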

# Set the directory to the data folder (should be in ML-LV/data/imdb)
data_dir = os.path.join(path, 'data', 'imdb')

# Load the Spacy language model ('en_core_web_md' should be in shared/models/spacy)
nlp = spacy.load(os.path.join(path, '..', 'shared', 'models', 'spacy'))

2025-03-26 16:44:04.221099: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-26 16:44:04.224108: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-26 16:44:04.233063: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:479] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-03-26 16:44:04.250018: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:10575] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-03-26 16:44:04.250043: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1442] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-26 16:44:04.261446: I tensorflow/core/platform/cpu_feature_guard.cc:

## 2.1 Representation options

The following cells demonstrate each of the language representation options discussed in the lecture. For most options we first demonstrate the process with numpy/plain Python and then with sklearn's built-in functions.

As with pre-processing, the appropriate representation depends on the task and, generally, on the input *shape* that a given model expects.

### One-hot Encoding

One-hot encoding converts a word into an array of length `vocab_size`, with a **1** at the index of the word's position in the vocabulary and **0s** in every other position. Encoding a sentence then produces a 2D array of shape `sequence_length` x `vocab_size` (one row per token), as in the output below.

In [3]:
# Create spacy document object
text = "This is a test sentence which is a very long test sentence."
doc = nlp(text)
print(f"Document: {doc}\n")

# Tokenise the document
tokens = [token.text for token in doc]
print(f"Tokens: {tokens}\n")

# Create simple vocabulary
vocab = list(set(tokens))
print(f"Vocabulary: {vocab}\n")

# Get a list of token indices within vocabulary
token_indices = [vocab.index(token) for token in tokens]
print(f"Token indices: {token_indices}\n")

# Create a one-hot vector with numpy
num_unique = len(vocab) # Need to know how many features there are
one_hot_np = np.eye(num_unique)[token_indices]
print(f"One-hot vector with numpy:\n {one_hot_np}\n")

# Create a one-hot vector with sklearn
token_indices = np.array(token_indices).reshape(-1, 1) # Need to reshape the array to 2D
one_hot_sk = OneHotEncoder(sparse_output=False).fit_transform(token_indices)
print(f"One-hot vector with sklearn:\n {one_hot_sk}\n")

Document: This is a test sentence which is a very long test sentence.

Tokens: ['This', 'is', 'a', 'test', 'sentence', 'which', 'is', 'a', 'very', 'long', 'test', 'sentence', '.']

Vocabulary: ['which', 'test', 'a', '.', 'very', 'is', 'This', 'long', 'sentence']

Token indices: [6, 5, 2, 1, 8, 0, 5, 2, 4, 7, 1, 8, 3]

One-hot vector with numpy:
 [[0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0.]]

One-hot vector with sklearn:
 [[0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 

### Bag-of-words (BOW)

BOW converts a sentence into an array of length `vocab_size`. Each element holds the number of times the corresponding vocabulary word appears in the sentence: for every token, increment the count at that word's index in the vocabulary.

<div class="alert alert-success" style="color:black"><b>Note:</b> The output of sklearn's CountVectorizer() is different to the numpy implementation. Can you work out why?
</div>

In [11]:
# Create spacy document object
text = "This is a test sentence which is a very long test sentence."
doc = nlp(text)
print(f"Document: {doc}\n")

# Tokenise the document
tokens = [token.text for token in doc]
print(f"Tokens: {tokens}\n")

# Create simple vocabulary (add word not in our input text)
vocab = list(set(tokens + ['supercalifragilisticexpialidocious']))
print(f"Vocabulary: {vocab}\n")

# Get a list of token indices within vocabulary
token_indices = [vocab.index(token) for token in tokens]
print(f"Token indices: {token_indices}\n")

# Create a BOW with numpy
bow_np = np.zeros(len(vocab), dtype=np.int32)
for i in range(len(token_indices)):
    bow_np[token_indices[i]] += 1
print(f"BOW with numpy:\n {bow_np}\n")

# Create a BOW with sklearn
bow_vectoriser = CountVectorizer(vocabulary=vocab, lowercase=False)
bow_sk = bow_vectoriser.fit_transform([text])
print(f"BOW with sklearn:\n {bow_sk.toarray()}\n")

Document: This is a test sentence which is a very long test sentence.

Tokens: ['This', 'is', 'a', 'test', 'sentence', 'which', 'is', 'a', 'very', 'long', 'test', 'sentence', '.']

Vocabulary: ['very', 'which', 'is', 'long', 'a', 'test', '.', 'supercalifragilisticexpialidocious', 'sentence', 'This']

Token indices: [9, 2, 4, 5, 8, 1, 2, 4, 0, 3, 5, 8, 6]

BOW with numpy:
 [1 1 2 1 2 2 1 0 2 1]

BOW with sklearn:
 [[1 1 2 1 0 2 0 0 2 1]]



### TF-IDF

TF-IDF converts a corpus into an array of shape `num_documents` x `vocab_size`. TF is the frequency of a word within a *document*, and IDF is the inverse of the word's document frequency, so it down-weights words that appear in many documents of the *corpus*. The TF-IDF for a word is then TF(word) x IDF(word):

$w =$ word/term

$d =$ document

$N =$ number of documents in corpus

$TF(w, d) = \frac{count(w, d)}{len(d)}$

$DF(w) =$ number of documents $d$ in the corpus with $count(w, d) > 0$

$IDF(w) = \log\frac{N}{DF(w) + 1}$

$\text{TF-IDF}(w, d) = TF(w, d) \times IDF(w)$

<div class="alert alert-success" style="color:black"><b>Note:</b> The class TFIDF() mimics the sklearn implementation (as best as possible). Try different normalisations ('l1' or 'l2') and set smoothing True/False.
</div>
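
As a worked example of the sklearn-style weighting that the note refers to, the sketch below computes one row by hand in plain Python. It assumes the variant used by `TfidfVectorizer` with `smooth_idf=False` and `norm='l1'`: raw counts for TF, `1 + ln(N / df)` for IDF, then division by the sum of the row. This differs from the textbook formula above, and the variable names are only illustrative.

```python
import math
from collections import Counter

corpus = ['the car is driven on the road', 'the truck is driven on the highway']
docs = [doc.split() for doc in corpus]
N = len(docs)

# Document frequency: the number of documents that contain each word
df = Counter(word for doc in docs for word in set(doc))

# Unsmoothed sklearn-style IDF: 1 + ln(N / df)
idf = {word: 1.0 + math.log(N / df[word]) for word in df}

# Raw-count TF multiplied by IDF, for the first document only
tf = Counter(docs[0])
raw = {word: tf[word] * idf[word] for word in tf}

# L1 normalisation: divide by the sum of absolute values in the row
total = sum(abs(value) for value in raw.values())
weights = {word: value / total for word, value in raw.items()}

for word in sorted(weights):
    print(f"{word:>7}: {weights[word]:.6f}")
```

With these settings the values for 'car' (about 0.2019) and 'the' (about 0.2385) should agree with the first row printed for this corpus by the cell in this section.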

In [12]:
class TFIDF():
    """TF-IDF vectoriser."""
    
    def __init__(self, tokeniser=None, vocabulary=None, norm=None, smooth_idf=True):
        """ Arguments:
                tokeniser (callable): A function that takes a string and returns a list of tokens
                vocabulary (list): A list of tokens to use as the vocabulary
                norm (str): The normalisation to use when calculating the tf-idf vectors
                smooth_idf (bool): Whether to use Laplace smoothing when calculating the idf
        """
        self.corpus = None
        self.N = None
        self.tokeniser = tokeniser
        self.vocabulary = vocabulary
        self.norm = norm
        self.smooth_idf = smooth_idf

        if not self.tokeniser:
            self.tokeniser = self._tokenise

        # l1 norm is the sum of the absolute values of the vector
        if self.norm and self.norm == 'l1':
            self.norm = 1
        # l2 norm is the square root of the sum of the squared values of the vector
        elif self.norm and self.norm == 'l2':
            self.norm = 2

    def _tokenise(self, s):
        return s.split()

    def get_vocabulary(self):
        vocab = []
        for doc in self.corpus:
            vocab.extend(self.tokeniser(doc))

        vocab = list(set(vocab))
        vocab.sort()
        return vocab

    def _tf(self):
        """Get the term frequency for each document in the corpus."""

        tf = []
        for doc in self.corpus:
            tf.append(Counter(self.tokeniser(doc)))
        return tf

    def _df(self):
        """Get the document frequency of each word in the corpus."""

        df = Counter()
        for doc in self.corpus:
            df.update(set(self.tokeniser(doc)))
        return df

    def _idf(self):
        """Calculate inverse document frequency for each word in the vocabulary."""

        # Calculate the DF
        df = self._df()

        idf = {}
        for word in self.vocabulary:
            if self.smooth_idf:
                idf[word] = 1.0 + np.log((self.N + 1) / (df[word] + 1))
            else:
                idf[word] = 1.0 + np.log(np.divide(self.N, df[word]))
        return idf

    def _tfidf(self):
        """Calculate the TF-IDF for each document in the corpus."""

        # Calculate TF and IDF
        tf = self._tf()
        idf = self._idf()

        # Calculate TF-IDF
        tfidf = np.zeros((self.N, len(self.vocabulary)))

        for i, doc in enumerate(self.corpus):
            for j, word in enumerate(self.vocabulary):
                tfidf[i, j] = tf[i][word] * idf[word]
        
        if self.norm:
            tfidf = tfidf / np.linalg.norm(tfidf, ord=self.norm, axis=1, keepdims=True)
        return tfidf

    def fit(self, corpus):
        # Set corpus/N
        self.corpus = np.array(corpus)
        self.N = len(self.corpus)

        # Set vocabulary
        if not self.vocabulary:
            self.vocabulary = self.get_vocabulary()

        # Calculate TF-IDF
        self.tfidf = self._tfidf()
        return self

    def transform(self, corpus):
        # Update corpus/N
        self.corpus = np.append(self.corpus, corpus, axis=0)
        self.N = len(self.corpus)

        # Calculate TF-IDF
        self.tfidf = self._tfidf()
        return self.tfidf[-len(corpus):]

corpus = ['the car is driven on the road', 'the truck is driven on the highway']

# Create a TF-IDF with numpy
tfidf_numpy = TFIDF(norm='l1', smooth_idf=False).fit(corpus)
terms = tfidf_numpy.get_vocabulary()
matrix_np = tfidf_numpy.transform(corpus)
print(f"TF-IDF with numpy:\n {pd.DataFrame(data=matrix_np, columns=terms)}\n")

# Transform a new sentence
matrix_np = tfidf_numpy.transform(['the car is driven in the sky'])
print(f"{pd.DataFrame(data=matrix_np, columns=terms)}\n")

# Create a TF-IDF with sklearn
tfidf_sklearn = TfidfVectorizer(norm='l1', smooth_idf=False).fit(corpus)
terms = tfidf_sklearn.get_feature_names_out()
matrix_sk = tfidf_sklearn.transform(corpus).toarray()
print(f"TF-IDF with sklearn:\n {pd.DataFrame(data=matrix_sk, columns=terms)}\n")

# Transform a new sentence
matrix_sk = tfidf_sklearn.transform(['the car is driven in the sky']).toarray()
print(f"{pd.DataFrame(data=matrix_sk, columns=terms)}\n")

TF-IDF with numpy:
         car    driven   highway        is        on      road       the     truck
0  0.201895  0.119242  0.000000  0.119242  0.119242  0.201895  0.238484  0.000000
1  0.000000  0.119242  0.201895  0.119242  0.119242  0.000000  0.238484  0.201895

        car    driven  highway        is   on  road       the  truck
0  0.274156  0.181461      0.0  0.181461  0.0   0.0  0.362922    0.0

TF-IDF with sklearn:
         car    driven   highway        is        on      road       the     truck
0  0.201895  0.119242  0.000000  0.119242  0.119242  0.201895  0.238484  0.000000
1  0.000000  0.119242  0.201895  0.119242  0.119242  0.000000  0.238484  0.201895

        car   driven  highway       is   on  road     the  truck
0  0.297401  0.17565      0.0  0.17565  0.0   0.0  0.3513    0.0
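
The two rows for the new sentence differ between the custom class and sklearn. One likely explanation, judging from the `transform` method above, is that the custom class appends every transformed document to its stored corpus and recomputes the IDF, whereas sklearn keeps the IDF learned during `fit`. The sketch below is an illustration under that assumption (the helper `tfidf_l1` is hypothetical, not part of either implementation) and compares the two cases.

```python
import math
from collections import Counter

fit_corpus = ['the car is driven on the road', 'the truck is driven on the highway']
new_doc = 'the car is driven in the sky'
vocab = sorted({w for doc in fit_corpus for w in doc.split()})

def tfidf_l1(doc, idf_corpus):
    """L1-normalised TF-IDF of `doc`, with unsmoothed IDF taken from `idf_corpus`."""
    n = len(idf_corpus)
    df = Counter(w for d in idf_corpus for w in set(d.split()))
    tf = Counter(doc.split())
    # Only words in the fitted vocabulary contribute ('in' and 'sky' are ignored)
    raw = {w: tf[w] * (1.0 + math.log(n / df[w])) for w in vocab if tf[w]}
    total = sum(raw.values())
    return {w: v / total for w, v in raw.items()}

# IDF frozen at fit time (two documents)
frozen = tfidf_l1(new_doc, fit_corpus)

# IDF recomputed over everything stored by the custom class at that point:
# the fit corpus, the same corpus passed to transform(), and the new sentence
grown = tfidf_l1(new_doc, fit_corpus * 2 + [new_doc])

print(f"car, frozen IDF: {frozen['car']:.6f}")
print(f"car, grown IDF:  {grown['car']:.6f}")
```

Under this assumption the frozen value for 'car' is about 0.2974 and the grown value about 0.2742, which correspond to the sklearn and numpy rows respectively.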



### N-grams

N-grams are sequences of N words, typically uni-grams (N=1), bi-grams (N=2) and tri-grams (N=3). Bi-grams and tri-grams (or larger) provide some context to words and can be used as a replacement for uni-grams in many models. Here we use NLTK to create tuples of all bi-grams and tri-grams from the text.

<div class="alert alert-success" style="color:black"><b>Note:</b> The sklearn CountVectorizer() and TfidfVectorizer() have an <code>ngram_range</code> argument which allows you to vectorise N-grams instead of single words.
</div>

In [13]:
# Create spacy document object
text = 'I sat by the riverbank. I went to the bank to withdraw money.'
doc = nlp(text)
print(f"Document: {doc}\n")

# Create N-grams with nltk
for sent in doc.sents:
    print(f"Sentence: {sent}")

    tokens = [token.text for token in sent]

    bi_grams = list(ngrams(tokens, 2))
    print(f"Bi-grams: {bi_grams}")

    tri_grams = list(ngrams(tokens, 3))
    print(f"Tri-grams: {tri_grams}\n")

Document: I sat by the riverbank. I went to the bank to withdraw money.

Sentence: I sat by the riverbank.
Bi-grams: [('I', 'sat'), ('sat', 'by'), ('by', 'the'), ('the', 'riverbank'), ('riverbank', '.')]
Tri-grams: [('I', 'sat', 'by'), ('sat', 'by', 'the'), ('by', 'the', 'riverbank'), ('the', 'riverbank', '.')]

Sentence: I went to the bank to withdraw money.
Bi-grams: [('I', 'went'), ('went', 'to'), ('to', 'the'), ('the', 'bank'), ('bank', 'to'), ('to', 'withdraw'), ('withdraw', 'money'), ('money', '.')]
Tri-grams: [('I', 'went', 'to'), ('went', 'to', 'the'), ('to', 'the', 'bank'), ('the', 'bank', 'to'), ('bank', 'to', 'withdraw'), ('to', 'withdraw', 'money'), ('withdraw', 'money', '.')]



### Word Vectors

Word vectors represent single words as a vector (list) of real numbers which captures some aspect of their meaning and relationships to other words. The best-known method is [Word2Vec](https://arxiv.org/pdf/1301.3781.pdf), which uses either skip-gram (given a target word, predict the surrounding context words) or a continuous bag of words (given the surrounding context words, predict the target word). Word vector models are typically trained on hundreds of millions of words to produce a set of weights, an embedding matrix, of shape `vocab_size` x `embedding_dim`, where the embedding dimension is the length of the vector for each word (usually 50 to 300).
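
To make the two training objectives more concrete, the sketch below only builds the training examples that each objective would see for a short sentence; it does not train a model. The function names and the window size are illustrative assumptions, not part of Word2Vec's own code.

```python
def skipgram_pairs(tokens, window=2):
    """(centre word, neighbour) pairs: one prediction per neighbour in the window."""
    pairs = []
    for i, centre in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(neighbours, centre word) examples: all neighbours jointly predict the centre."""
    examples = []
    for i, centre in enumerate(tokens):
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((neighbours, centre))
    return examples

tokens = "I went to the bank".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces one example per (word, neighbour) combination, whereas the continuous bag of words produces one example per position in the sentence.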

Once trained, these embeddings can be used as semantically rich word representations for other NLP tasks, such as classification. This is called *transfer learning*: the weights of a model trained on one objective (predicting words) are reused as input when training models on a different task (classification, language modelling, etc.). Many pre-trained word vectors are available to download and can be used to map words to vectors for input into your models.

Spacy comes with pre-trained 300 dimensional word vectors, so it is easy to create a document and get the word vector for each token.

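With word vectors we can also use cosine similarity, a measure of how closely two vectors point in the same direction, to calculate a similarity score. In general the score lies in the range [-1, 1]: 1 means the same direction, 0 means orthogonal (unrelated) vectors, and negative values mean opposing directions.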
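
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal numpy sketch on toy vectors (the helper name `cosine` is only illustrative):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Same direction, different length
print(cosine([1, 2, 3], [2, 4, 6]))

# Orthogonal vectors
print(cosine([1, 0], [0, 1]))
```

Because the vectors are divided by their lengths, only their direction matters: scaling a vector does not change its similarity to any other vector.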

In [14]:
# Create spacy document object
raw_text = "dog cat banana apple fish"
doc = nlp(raw_text)
print(f"Document: {doc}\n")

# Print the word, its vectors (first 10 dimensions) and size of vector
for token in doc:
    print(f"{token.text} vector: {token.vector[:10]} shape: {token.vector.shape}\n")

# Print the similarity between two words
print(f"Similarity between '{doc[0].text}' and '{doc[1].text}': {doc[0].similarity(doc[1])}\n")
print(f"Similarity between '{doc[0].text}' and '{doc[2].text}': {doc[0].similarity(doc[2])}\n")

# Similarity of two documents
doc1 = nlp("I like fish and chips.")
doc2 = nlp("I like cats and dogs.")
doc3 = nlp("NLP, it's fun!.")

# Print the similarity between two documents
print(f"Similarity between '{doc1}' and '{doc2}': {doc1.similarity(doc2)}\n")
print(f"Similarity between '{doc1}' and '{doc3}': {doc1.similarity(doc3)}\n")

Document: dog cat banana apple fish

dog vector: [  1.233     4.2963   -7.9738  -10.121     1.8207    1.4098   -4.518
  -5.2261   -0.29157   0.95234] shape: (300,)

cat vector: [  3.7032     4.1982    -5.0002   -11.322      0.031702  -1.0255
  -3.087     -3.7327     0.53875    3.5679  ] shape: (300,)

banana vector: [ 0.20778 -2.4151   0.36605  2.0139  -0.23752 -3.1952  -0.2952   1.2272
 -3.4129  -0.54969] shape: (300,)

apple vector: [-1.0084  -2.0308  -0.64185  2.6928   0.31771 -2.6662  -3.7372   5.4714
 -5.1751   0.51958] shape: (300,)

fish vector: [-1.1278  -4.2107  -4.1088   0.73152  3.3726  -2.538   -1.8874   4.4615
 -5.8596   1.8804 ] shape: (300,)

Similarity between 'dog' and 'cat': 0.8220816850662231

Similarity between 'dog' and 'banana': 0.2090904712677002

Similarity between 'I like fish and chips.' and 'I like cats and dogs.': 0.8804120795837534

Similarity between 'I like fish and chips.' and 'NLP, it's fun!.': 0.6635456799293546



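The classic $King - Man + Woman \approx Queen$ example: subtracting the vector for 'man' from the vector for 'king' and adding the vector for 'woman' should give a vector close to that of 'queen'.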

<div class="alert alert-warning" style="color:black"><b>Bias in word embeddings:</b> As with many problems in machine learning, the models we train tend to pick up the underlying biases within the data we train them on. Word embeddings are no different, and often reflect the human biases present within the huge amounts of text data they were trained on.

This is illustrated by Nissim, M., et al. (2020), who provide example analogies like *"man is to computer programmer as woman is to homemaker"*. They also argue that these biases, when present in embeddings, could be propagated outside of NLP and AI into other domains, misleading those who are less well equipped to understand, or even be aware of the presence of such bias.

*Nissim, M., et al. (2020) Fair Is Better than Sensational: Man Is to Doctor as Woman Is to Doctor. Computational Linguistics. 46 (2), pp. 487–497. Available from: https://aclanthology.org/2020.cl-2.7/.*
</div>

In [15]:
# Get the vectors for each word
doc = nlp('queen king woman man')
queen, king, woman, man = doc[0].vector, doc[1].vector, doc[2].vector, doc[3].vector

# Perform vector arithmetic
new_vec = king - man + woman

# Find the most similar word to the new vector
print("Word Similarities:\n")
for word in doc:
    sim = cosine_similarity(np.expand_dims(word.vector, axis=0), np.expand_dims(new_vec, axis=0))
    print(f"{word} similarity: {sim[0][0]}\n")

Word Similarities:

queen similarity: 0.6178013682365417

king similarity: 0.8489542603492737

woman similarity: 0.3099472224712372

man similarity: 0.07003619521856308



## 2.2 Visualising word vectors

A common use-case for word vectors is to create an embedding matrix, which is used as a lookup table to map words to their vector representations. Here we will do this with the IMDB reviews that we have annotated and processed. The resulting embedding matrix will have shape `vocab_size` x `embedding_dim`.
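
The lookup itself can be sketched as follows, using a hypothetical four-word vocabulary and random 4-dimensional vectors in place of the real 300-dimensional ones: each word is mapped to its row index, and indexing the matrix with a list of indices returns one vector per token.

```python
import numpy as np

vocab = ['the', 'film', 'was', 'great']   # hypothetical toy vocabulary
embedding_dim = 4                          # toy size; the vectors in this notebook have 300
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

# Map each word to its row in the embedding matrix
word_to_index = {word: i for i, word in enumerate(vocab)}

# Look up a whole sentence at once by indexing with a list of row indices
sentence = ['the', 'film', 'was', 'great', 'great']
indices = [word_to_index[word] for word in sentence]
sentence_vectors = embedding_matrix[indices]

print(sentence_vectors.shape)  # (sequence_length, embedding_dim)
```

Repeated words simply select the same row again, so the two occurrences of 'great' receive identical vectors.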

In [16]:
# Load the imdb reviews
imdb_reviews = pd.read_csv(os.path.join(data_dir, 'imdb_reviews.csv'))

# Tokenise the corpus
imdb_corpus = imdb_reviews['review'].apply(lambda x: [token.text for token in nlp.tokenizer(x)])

# Create simple vocabulary
imdb_vocab = imdb_corpus.explode().unique().tolist()
print(f"Vocabulary size: {len(imdb_vocab)}\n")

# Set the dimensionality of the word vectors
embedding_dim = 300

# Create an empty numpy array
embedding_matrix = np.zeros((len(imdb_vocab), embedding_dim))

# For each word in the imdb vocabulary
for i, word in enumerate(imdb_vocab):
    # If the word has a vector
    if nlp.vocab.has_vector(word):
        # Get the vector for the word
        embedding_matrix[i] = nlp.vocab[word].vector
    else:
        # Get a random vector
        embedding_matrix[i] = np.random.uniform(np.min(embedding_matrix), np.max(embedding_matrix), embedding_dim)

# Create dataframe with words and vectors
embedding_df = pd.DataFrame(embedding_matrix, index=imdb_vocab)
print(f"Embedding dataframe shape:\n {embedding_df.shape}\n")
embedding_df.head(10)

Vocabulary size: 4432

Embedding dataframe shape:
 (4432, 300)



word,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
there,1.0923,4.4546,-2.835,2.1846,-0.30957,1.2677,-0.041528,8.6433,-2.4481,2.7436,...,2.0041,-2.7586,4.9892,-4.3195,-1.9144,0.74608,-1.9334,1.4751,-3.5624,1.7234
is,1.475,6.0078,1.1205,-3.5874,3.7638,3.1987,-2.206,3.2128,-2.0816,-0.002931,...,10.955,-2.9619,4.5407,-2.2999,-0.99536,1.2619,-2.3326,-0.22893,-0.85967,9.7466
a,-9.3629,9.2761,-7.2708,4.3879,10.316,-6.8469,1.5755,7.9405,8.0812,2.6194,...,-8.6711,3.6026,0.94914,5.9861,0.14368,9.7066,4.4738,2.6801,-6.816,3.5737
new,2.593,4.3454,-1.466,-1.2656,2.8705,-1.0454,3.9176,4.8105,-4.4885,3.2401,...,2.0445,4.7535,-2.7616,-0.20498,1.4246,-2.5834,-4.1083,-1.261,-5.4478,4.2885
nuclear,3.2159,-1.1824,-0.65593,0.95864,7.4044,-0.60504,1.0876,3.7238,-0.85234,-1.4552,...,3.4866,-3.9436,-1.4454,-0.93429,-1.855,-2.1077,3.7509,2.501,-1.178,2.4227
arms,0.33463,0.007916,-5.5622,3.1729,6.6135,2.3728,-2.5857,2.0095,3.4046,2.6612,...,1.0959,1.1904,5.5969,-1.7531,-1.9094,0.044897,-2.0561,3.2893,-2.9075,0.23962
race,-4.3898,0.36213,-2.7586,1.5399,5.0026,2.8317,5.8762,12.024,-1.3522,-0.39541,...,0.56888,2.2888,-3.0819,0.078426,-1.5772,0.55451,4.1041,1.558,-1.9294,-3.1873
underway,1.9133,1.6525,-0.98616,0.8054,2.703,0.88577,1.5113,2.7556,1.4992,2.5116,...,1.7143,1.9735,0.51199,-1.6483,-3.0546,1.4862,2.0454,-3.3049,1.513,2.0062
superman,-2.2788,0.87593,1.4747,-0.028177,1.5889,-0.44738,1.0005,0.58913,2.4636,1.33,...,0.9449,-3.015,-0.55688,1.6968,1.1284,1.8316,2.1094,-1.7908,1.2269,-0.41504
forbidden,-0.83916,0.56858,0.62002,0.24629,2.5595,0.33772,2.6507,-0.62044,-0.65833,0.60981,...,-1.7237,0.90849,0.88191,-4.8933,0.10066,-1.9159,-1.9116,-0.51103,-2.0317,-0.30917


Calculate the similarity between all words in the embedding matrix.

In [17]:
# Calculate the cosine similarity between the words
similarity_matrix = cosine_similarity(embedding_df)
# Create dataframe with words and similarity
similarity_df = pd.DataFrame(similarity_matrix, columns=imdb_vocab)
# Add word as second index
similarity_df.insert(0, 'word_ind', imdb_vocab)
similarity_df.set_index('word_ind', inplace=True, append=True)
similarity_df.head(10)

,word_ind,there,is,a,new,nuclear,arms,race,underway,superman,forbidden,...,beauty,sludge,mixture,conflict,boorman,grasp,casablanca,vertigo,r,v
0,there,1.0,0.185102,0.268007,0.224541,0.328851,0.185183,0.189985,0.435817,0.056798,0.360909,...,0.254705,0.200327,0.319161,0.426701,-0.022115,0.277568,0.111729,0.412712,0.057114,-0.083122
1,is,0.185102,1.0,0.289764,0.047177,0.158434,0.004547,0.08354,0.081484,0.01727,0.112925,...,0.205849,0.036951,0.213609,0.184534,-0.043874,0.300133,-0.004319,0.174874,0.059598,-0.055503
2,a,0.268007,0.289764,1.0,0.255128,0.22014,0.172469,0.122219,0.215751,0.103133,0.047692,...,0.187148,0.065131,0.242395,0.213781,-0.058815,0.168449,0.124677,0.222516,0.066532,-0.062982
3,new,0.224541,0.047177,0.255128,1.0,0.263856,0.065058,0.055729,0.336416,0.122601,0.129612,...,0.229109,0.116252,0.198524,0.198738,-0.009455,0.081388,0.09397,0.12286,0.05054,-0.02843
4,nuclear,0.328851,0.158434,0.22014,0.263856,1.0,0.306704,0.139159,0.450367,0.239214,0.205328,...,0.063567,0.207092,0.239058,0.492736,-0.034024,0.099839,0.18914,0.166196,-0.034553,-0.045302
5,arms,0.185183,0.004547,0.172469,0.065058,0.306704,1.0,0.171109,0.163449,0.064,0.193794,...,0.096928,-0.02248,0.167853,0.269723,-0.075271,0.157007,0.196318,0.200296,-0.052197,-0.063862
6,race,0.189985,0.08354,0.122219,0.055729,0.139159,0.171109,1.0,0.194672,0.110399,0.095752,...,0.097606,0.031124,0.144889,0.282314,-0.034714,0.106489,0.181087,0.12176,0.025867,-0.016325
7,underway,0.435817,0.081484,0.215751,0.336416,0.450367,0.163449,0.194672,1.0,0.175782,0.159313,...,0.083011,0.277957,0.192092,0.353972,-0.157315,0.091973,0.203828,0.264346,-0.060568,-0.025443
8,superman,0.056798,0.01727,0.103133,0.122601,0.239214,0.064,0.110399,0.175782,1.0,0.055095,...,-0.025244,0.041558,0.007592,0.195978,-0.028031,-0.012031,0.144935,0.040963,-0.086127,-0.025004
9,forbidden,0.360909,0.112925,0.047692,0.129612,0.205328,0.193794,0.095752,0.159313,0.055095,1.0,...,0.221911,0.09958,0.104553,0.286433,-0.034382,0.161164,0.233923,0.083684,0.06722,-0.083448


Now we can use the vectors to visualise the most similar and least similar words to a given target word.

1. We will use [Principal Component Analysis (PCA)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to reduce the dimensionality of the embeddings so we can visualise them.

2. Next find the N most similar and dissimilar words to a target word.

3. Create a 3D plot of the embeddings. With `N=5` and 'reviewers' you should see that, for example, 'review' and 'critics' are very close in the embedding (vector) space.
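
The dimensionality reduction in step 1 can be sketched with plain numpy: centre the data, take a singular value decomposition, and project onto the first three principal directions. The random matrix below is a hypothetical stand-in for the embedding matrix, and up to the sign of each component the projection should correspond to what `PCA(n_components=3).fit_transform` returns.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 300))  # hypothetical stand-in for an embedding matrix

# Centre each dimension, as PCA works on mean-centred data
X_centred = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Project onto the first three principal directions
X_3d = X_centred @ Vt[:3].T

# Fraction of the total variance captured by each retained component
explained = (S[:3] ** 2) / np.sum(S ** 2)

print(X_3d.shape)
print(explained.round(3))
```

Three components usually keep only a small fraction of the variance of 300-dimensional vectors, so distances in the 3D plot are a rough guide rather than an exact picture of the similarities.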

In [18]:
# Set the number of similar/dissimilar words and a target word
N = 5
word = 'reviewers'

# Perform PCA (dimensionality reduction) on the embedding matrix
pca_embeddings = PCA(n_components=3).fit_transform(embedding_matrix)

# Find the N most/least similar words
most_sim = similarity_df[word].sort_values(ascending=False)[0:N + 1]
least_sim = similarity_df[word].sort_values(ascending=True)[0:N + 1]

most_sim_words = [w for ind, w in most_sim.index.values]
least_sim_words = [w for ind, w in least_sim.index.values]

# Get the indices of the most/least similar words from the reduced embedding matrix
most_sim_pca = pca_embeddings[[ind for ind, w in most_sim.index.values]]
least_sim_pca = pca_embeddings[[ind for ind, w in least_sim.index.values]]

# Plot the most/least similar words
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(most_sim_pca[:, 0], most_sim_pca[:, 1],  most_sim_pca[:, 2], linewidths=1, color='blue')
ax.scatter(least_sim_pca[:, 0], least_sim_pca[:, 1],  least_sim_pca[:, 2], linewidths=1, color='red')
# Add words to the plot
for i, word in enumerate(most_sim_words):
    ax.text(most_sim_pca[i, 0]+.02, most_sim_pca[i, 1], most_sim_pca[i, 2], word, size=10, zorder=1)
for i, word in enumerate(least_sim_words):
    ax.text(least_sim_pca[i, 0]+.02, least_sim_pca[i, 1], least_sim_pca[i, 2], word, size=10, zorder=1)

KeyError: 'reviewers'

<div class="alert alert-success" style="color:black"><h3>Before you submit this notebook to NBGrader for marking:</h3> 

1. Make sure you have completed all exercises marked by <span style="color:blue">**blue cells**</span>.
2. For automatically marked exercises ensure you have completed any cells with `# YOUR CODE HERE`. Then click the 'Validate' button above, or ensure all cells run without producing an error.
3. For manually marked exercises ensure you have completed any cells with `"YOUR ANSWER HERE"`.
4. Ensure all cells are run with their output visible.
5. Fill in your student ID (**only**) below.
6. You should now **save and download** your work.

</div>

**Student ID:** 15006280