![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logo/0x150/1643104191/logo-mse.png)

# AnTeDe Lab 6: Applications of word2vec
## Objective
* Compare pre-trained word2vec models with models trained on your workstation, on word similarity and analogy tasks.	

## General instructions
* You can do the lab alone or in groups of two students.
* Please write the required code, but also reply explicitly to the questions, as Python comments in code cells or text in markdown cells. 
* To submit your practical work, please make sure all cells are executed, then save and zip the notebook, and submit it as homework on [Moodle](https://moodle.msengineering.ch/course/view.php?id=1869).
* Useful documentation: [section on word2vec in Gensim](https://radimrehurek.com/gensim/models/word2vec.html) as well as the [section on KeyedVectors in Gensim](https://radimrehurek.com/gensim/models/keyedvectors.html).
* Training can be done locally if you have at least 16 GB of memory (it takes minutes, not hours), or using [Google Colab](https://colab.research.google.com/).

## 1. Testing a word2vec model trained on Google News
a. Install the latest version of Gensim, for instance by running `pip install --upgrade gensim` in your Conda environment.

In [2]:
# If you want to install from here, run the command ('!' indicates a command for the shell)
# !pip install --upgrade gensim

# Please run the following verification:
!pip show gensim

Name: gensim
Version: 4.3.2
Summary: Python framework for fast Vector Space Modelling
Home-page: https://radimrehurek.com/gensim/
Author: Radim Rehurek
Author-email: me@radimrehurek.com
License: LGPL-2.1-only
Location: C:\Users\Ruben\anaconda3\envs\TSM_AnTeDe\Lib\site-packages
Requires: numpy, scipy, smart-open
Required-by: 


In [3]:
import gensim, os
from gensim import downloader
# help(gensim.models.word2vec) # check if you are curious, but don't include output in the final notebook

In [4]:
# Download the model file from Gensim, for the first time only
# gensim.downloader.load("word2vec-google-news-300")
# The returned value could be assigned to a model variable, but it is twice as large as needed.

b. Where is the model stored on your computer?  What is the size of the file?  Please store the absolute path in a variable called `path_to_model_file`, and use `os.path.getsize` to display the size converted in gigabytes with two decimals.

In [5]:
# Please write your Python code below and execute it.

path_to_model_file = "./gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz"
# The model is stored in the 'gensim-data' directory, in a file called 'word2vec-google-news-300.gz'.

# Get the size of the model file

model_size = os.path.getsize(path_to_model_file) / (1024 * 1024 * 1024)  # Convert bytes to gigabytes

print(f"Size of the model file: {model_size:.2f} GB")

Size of the model file: 1.62 GB
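
The question asks for an absolute path, while the cell above uses a relative one. A minimal sketch of how the path could be made absolute and the size formatted is given below; `describe_file` is a hypothetical helper, and the temporary file is only a stand-in for the real model file.

```python
import os
import tempfile

def describe_file(path):
    """Return (absolute path, size in GB rounded to two decimals)."""
    absolute_path = os.path.abspath(path)
    size_gb = os.path.getsize(absolute_path) / 1024**3  # bytes -> GB (binary)
    return absolute_path, round(size_gb, 2)

# Demonstration on a small temporary file (stand-in for the model file).
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"\0" * 1024)
    demo_path = handle.name

absolute_path, size_gb = describe_file(demo_path)
print(absolute_path, f"{size_gb:.2f} GB")
os.remove(demo_path)
```

With the real model, `describe_file(path_to_model_file)` would be called instead of using the temporary file.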


In [6]:
# Load the model into the notebook:
from gensim.models import KeyedVectors
wv_model = gensim.models.KeyedVectors.load_word2vec_format(path_to_model_file, binary=True)  # C bin format

c. What is the memory size of the process corresponding to this notebook?  Please simply write the value you obtain from any OS-specific utility that you wish to use for this purpose.

In [7]:
# Please write your size below.
import os
import psutil

process = psutil.Process(os.getpid())
memory_info = process.memory_info()
print(f"Memory usage: {memory_info.rss / (1024 * 1024)} MB")

Memory usage: 1789.31640625 MB


d. What is the size of the vocabulary of this model?  (I.e., how many words does it know?)

In [8]:
# Please write the Python code needed to display the vocabulary size and execute it.
print(f"Vocabulary size: {len(wv_model.key_to_index)}")

Vocabulary size: 3000000


e. Compare the vocabulary size with the number of words in an English dictionary.  How do you explain the difference?  Illustrate your explanation by showing at least 5 words which are in the model's vocabulary, and 2 that are not.

In [9]:
# Please write your Python code below and execute it.

print("NUMBER OF WORDS IN AN ENGLISH DICTIONARY: 171476")
# SOURCE https://wordcounter.io/blog/how-many-words-are-in-the-english-language#:~:text=The%20English%20Dictionary&text=The%20Second%20Edition%20of%20the,Section%2C%20includes%20some%20470%2C000%20entries.

print("The vocabulary size of the model is 3000000, which is much larger than the number of words in an English dictionary. This is because the model is trained on a large corpus of text, which includes many different words, including proper nouns, acronyms, and other specialized terms. The model is designed to capture the meaning of words in context, so it needs to be able to represent a wide range of vocabulary.")

# Words in the model's vocabulary:
import random
random.seed(42)
words_in_vocab = random.sample(list(wv_model.key_to_index.keys()), 5)

print(f"5 Random words in the model's vocabulary: {words_in_vocab}\n")

# Check whether each word is in the vocabulary or not
words_not_in_vocab = ['Ruben_Yannis', 'Yannis_Ruben']
for word in words_not_in_vocab:
    if word in wv_model.key_to_index:
        print(f"{word} is in the model's vocabulary")
    else:
        print(f"{word} is not in the model's vocabulary")

5 Random words in the model's vocabulary: ['espresso_martinis', 'wherefore', 'HARD', 'courtly_manners', "Hawai'ian"]

Ruben_Yannis is not in the model's vocabulary
Yannis_Ruben is not in the model's vocabulary


f. Determine the size of the vector space for this word2vec model, i.e. the dimensionality of the embedding space, using two methods: either using the vector of a word from the vocabulary, or directly using the shape of the model.

In [10]:
# Please write your Python code below and execute it.

# Method 1: Using the vector of a word from the vocabulary
vector_size_1 = len(wv_model['wherefore'])

# Method 2: Using the shape of the model
print(wv_model.vectors.shape)
#The shape of the model's vectors, which is (3000000, 300), meaning that there are 3,000,000 vectors in the model, each of which has a dimensionality of 300.
vector_size_2 = wv_model.vectors.shape[1]

print(f"Vector size using method 1: {vector_size_1}")
print(f"Vector size using method 2: {vector_size_2}")

(3000000, 300)
Vector size using method 1: 300
Vector size using method 2: 300


## 2. Using word2vec trained on Google News for word similarity
In this section, you are going to use word vectors to compute (cosine) similarity between words (use the [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html) documentation).  You will experiment with three tasks: (a) rank a small number of word pairs by decreasing similarity; (b) test the model on the WordSimilarity-353 set; (c) test the model on the analogy task.

a. Sort the word pairs given below by decreasing similarity (i.e. most similar first).  Display also the similarity value found by word2vec, with 2 decimals.

In [11]:
test_pairs = [('car','automobile'), ('car', 'bike'), ('car', 'oil'), ('car', 'pedal'),  ('bike', 'pedal'), ('bike', 'bicycle'), ('oil', 'gas'), ('car', 'bus')]

# Please write your Python code below and execute it.
sorted_pairs = sorted([(pair, wv_model.similarity(pair[0], pair[1])) for pair in test_pairs], key=lambda x: x[1], reverse=True)

for pair, similarity in sorted_pairs:
    print(f"{pair}: {similarity:.2f}")


('bike', 'bicycle'): 0.85
('oil', 'gas'): 0.71
('car', 'bike'): 0.59
('car', 'automobile'): 0.58
('car', 'bus'): 0.47
('bike', 'pedal'): 0.47
('car', 'pedal'): 0.29
('car', 'oil'): 0.15
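
The similarity values above are cosine similarities between word vectors. As an illustration only, the sketch below computes cosine similarity by hand and ranks pairs with it; the 3-dimensional vectors are made up for the example and are not taken from the Google News model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (assumption: real vectors have 300 dimensions).
toy_vectors = {
    "car": [1.0, 0.0, 0.0],
    "automobile": [0.8, 0.6, 0.0],
    "oil": [0.0, 0.0, 1.0],
}

def pair_similarity(pair):
    return cosine_similarity(toy_vectors[pair[0]], toy_vectors[pair[1]])

pairs = [("car", "oil"), ("car", "automobile")]
ranked = sorted(pairs, key=pair_similarity, reverse=True)
for pair in ranked:
    print(f"{pair}: {pair_similarity(pair):.2f}")
```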


b. What are the five closest words to *car* in the whole vocabulary and their similarity values with *car*? 

In [12]:
wv_model.init_sims(replace=True) # run this to avoid memory footprint doubling with the first call 
# of "most_similar" (which caches unit vectors without replacement, unless told explicitly to do so).
# Will have the same effect on evaluate_word_analogies below.

# Please write your Python code below and execute it.
car_similarities = wv_model.most_similar('car', topn=5)
for word, similarity in car_similarities:
    print(f"{word}: {similarity:.2f}")


vehicle: 0.78
cars: 0.74
SUV: 0.72
minivan: 0.69
truck: 0.67


c. Using the [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html) documentation, evaluate the model on the WordSimilarity-353 task.  This compares similarities assigned to word pairs by word2vec with those assigned by humans.  Please display only the Pearson Correlation Coefficient, with two decimals. 

In [13]:
# Please write your Python code below and execute it.

from gensim.test.utils import datapath

# Load evaluation dataset of word similarities
word_similarity_dataset = datapath('wordsim353.tsv')

# Evaluate the model on the WordSimilarity-353 task
pearson_correlation = wv_model.evaluate_word_pairs(word_similarity_dataset)[0][0]

# Display the Pearson Correlation Coefficient with two decimals
print(f"Pearson Correlation Coefficient: {pearson_correlation:.2f}")

Pearson Correlation Coefficient: 0.62
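
To make the reported number more concrete, here is a small sketch of how a Pearson correlation between human ratings and model similarities can be computed by hand. The scores below are invented for illustration; they are not taken from WordSimilarity-353.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical data: human ratings (0-10) and model cosine similarities.
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.85, 0.60, 0.35, 0.10]

r = pearson(human_scores, model_scores)
print(f"Pearson correlation: {r:.2f}")
```

A value close to 1 means the model orders and spaces the pairs much like the human annotators do; the coefficient is insensitive to the different scales of the two score lists.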


d. Using the [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html) documentation, evaluate the model on the Analogy Tasks (e.g., "*What is to Thailand what Athens is to Greece?*"). The task is specified in a data file called `questions-words.txt`.  Note: this takes around 5 minutes.  Store the output in a variable for future use.

In [14]:
# Please write your Python code below and execute it.
analogy_scores = wv_model.evaluate_word_analogies(datapath('questions-words.txt'))

In [20]:
# Store the output in a variable for future use in case Jupyter crashes
import pickle

with open('analogy_scores.pkl', 'wb') as f:
    pickle.dump(analogy_scores, f)

In [32]:
# Load the output from the variable

with open('analogy_scores.pkl', 'rb') as f:
    analogy_scores = pickle.load(f)

In [33]:
result = wv_model.most_similar(positive=['Thailand', 'Athens'], negative=['Greece'], topn=3)

print(result)

print(f"\n {result[0][0]} is to Thailand what Athens is to Greece")

[('Bangkok', 0.6855700016021729), ('Bangkok_Thailand', 0.5587929487228394), ('Thai', 0.5496402978897095)]

 Bangkok is to Thailand what Athens is to Greece
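
The call above relies on vector arithmetic: the answer is the word whose vector is closest to `Athens - Greece + Thailand`. The sketch below reproduces that idea in a simplified form on made-up 3-dimensional vectors. It is an assumption-laden illustration: `solve_analogy` is a hypothetical helper, and Gensim's `most_similar` additionally normalizes the vectors before combining them.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(vectors, a, b, c):
    """Return the word d such that a is to b what c is to d (target = b - a + c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))

# Hypothetical toy vectors chosen so that the "capital" offset is the second axis.
toy_vectors = {
    "Greece": [1.0, 0.0, 0.2],
    "Athens": [1.0, 1.0, 0.2],
    "Thailand": [0.2, 0.0, 1.0],
    "Bangkok": [0.2, 1.0, 1.0],
    "banana": [0.9, -0.5, -0.3],
}

answer = solve_analogy(toy_vectors, "Greece", "Athens", "Thailand")
print(f"{answer} is to Thailand what Athens is to Greece")
```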


e. Using the output above, please display the accuracy (number of correctly solved analogies), and then pick four categories of your choice, and display for each of them the accuracy, a correctly-solved analogy, and an incorrectly-solved one.  How many analogy tasks are there in total?

In [30]:
# Please write your Python code below and execute it.

# Country neighbors
result_country_neighbors = wv_model.most_similar(positive=['Greece', 'Switzerland'], negative=['France'], topn=1)
print(f"{result_country_neighbors} Result is not a neighbor of Greece so this analogy does not work\n")


# Country currencies
result_currency = wv_model.most_similar(positive=['Euro', 'Japan'], negative=['France'], topn=1)
print(f"{result_currency} Result is the currency of Japan so this analogy works\n")

# Gender relationships
result_gender = wv_model.most_similar(positive=['queen', 'man'], negative=['woman'], topn=1)
print(f"{result_gender} Result matches the gender relationship, so this analogy works\n")


# Singular/plural
result_singular_plural = wv_model.most_similar(positive=['apples', 'horse'], negative=['apple'], topn=1)
print(f"{result_singular_plural} Result is the plural of horse so this analogy works\n")


[('Latvia', 0.5221792459487915), ('Cyprus', 0.5046451687812805), ('Iceland', 0.48725128173828125)]
[('Latvia', 0.5221792459487915), ('Cyprus', 0.5046451687812805), ('Iceland', 0.48725128173828125)] Result is not a neighbor of Greece so this analogy does not work
[('JPY', 0.5418580770492554)]
[('JPY', 0.5418580770492554)] Result is the currency of Japan so this analogy works
[('king', 0.6958590745925903)]
[('king', 0.6958590745925903)] Result matches the gender relationship, so this analogy works
[('horses', 0.7881851196289062)]
[('horses', 0.7881851196289062)] Result is the plural of horse so this analogy works
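
The question also asks to use the stored output of `evaluate_word_analogies`. As a sketch only, the code below shows one way to summarize a result with that shape: a pair made of the overall score and a list of per-section dictionaries with `'section'`, `'correct'` and `'incorrect'` keys, the last section being `'Total accuracy'`. The `toy_result` data is invented; with the real output, `analogy_scores` would be passed instead. As far as I understand, questions containing out-of-vocabulary words are skipped by default, so the total is the number of evaluated items.

```python
def summarize_analogies(result):
    """Map each section name to (number correct, number evaluated)."""
    overall_score, sections = result
    summary = {}
    for section in sections:
        n_correct = len(section["correct"])
        n_total = n_correct + len(section["incorrect"])
        summary[section["section"]] = (n_correct, n_total)
    return overall_score, summary

# Hypothetical miniature result mimicking the structure of the real output.
toy_result = (
    0.5,
    [
        {"section": "family",
         "correct": [("BOY", "GIRL", "KING", "QUEEN")],
         "incorrect": [("BOY", "GIRL", "UNCLE", "AUNT")]},
        {"section": "currency",
         "correct": [("JAPAN", "YEN", "USA", "DOLLAR")],
         "incorrect": [("JAPAN", "YEN", "INDIA", "RUPEE")]},
        {"section": "Total accuracy",
         "correct": [("BOY", "GIRL", "KING", "QUEEN"), ("JAPAN", "YEN", "USA", "DOLLAR")],
         "incorrect": [("BOY", "GIRL", "UNCLE", "AUNT"), ("JAPAN", "YEN", "INDIA", "RUPEE")]},
    ],
)

overall_score, summary = summarize_analogies(toy_result)
print(f"Overall accuracy: {overall_score:.2%}")
for name, (n_correct, n_total) in summary.items():
    print(f"{name}: {n_correct}/{n_total} correct")
```

Examples of solved and unsolved analogies for a category can then be read from the first elements of that section's `'correct'` and `'incorrect'` lists.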


f. Create a short file called `questions-words-NAME.txt` (where `NAME` is your name) with several new test items for analogies (at least 10 lines), following the template of `questions-words.txt`.  For instance, from the three following pairs: (eye, see), (ear, listen), and (foot, walk) you can create 12 test items, varying the item that the system must predict and the initial items.  What is the accuracy of the model on your test set?

In [60]:
# Please write your Python code below and execute it.

analogy_scores = wv_model.evaluate_word_analogies(datapath('questions-words.txt'))

analogy_scores_Ruben = wv_model.evaluate_word_analogies("./questions-words-Ruben.txt")

# TODO: ask Killian

# Display the analogy scores obtained on the custom test set
analogy_scores_Ruben


(0.0,
 [{'section': 'sensory-verbs',
   'correct': [],
   'incorrect': [('EYE', 'SEE', 'EAR', 'EAR'),
    ('SEE', 'EYE', 'LISTEN', 'WALK'),
    ('EAR', 'FOOT', 'EYE', 'SEE'),
    ('LISTEN', 'SEE', 'SEE', 'EYE'),
    ('EYE', 'SEE', 'FOOT', 'EYE'),
    ('SEE', 'EYE', 'WALK', 'WALK'),
    ('FOOT', 'FOOT', 'EYE', 'SEE'),
    ('WALK', 'SEE', 'SEE', 'EYE'),
    ('EAR', 'LISTEN', 'EYE', 'EYE'),
    ('LISTEN', 'EAR', 'SEE', 'LISTEN'),
    ('EYE', 'EAR', 'EAR', 'LISTEN'),
    ('SEE', 'EAR', 'LISTEN', 'EAR'),
    ('EAR', 'LISTEN', 'FOOT', 'SEE'),
    ('LISTEN', 'EAR', 'WALK', 'EYE'),
    ('FOOT', 'LISTEN', 'EAR', 'LISTEN'),
    ('WALK', 'FOOT', 'LISTEN', 'EAR'),
    ('FOOT', 'WALK', 'EYE', 'WALK'),
    ('WALK', 'FOOT', 'SEE', 'EAR'),
    ('EYE', 'LISTEN', 'FOOT', 'WALK'),
    ('SEE', 'FOOT', 'WALK', 'FOOT'),
    ('FOOT', 'WALK', 'EAR', 'WALK'),
    ('WALK', 'FOOT', 'LISTEN', 'EYE'),
    ('EAR', 'SEE', 'FOOT', 'WALK'),
    ('LISTEN', 'LISTEN', 'WALK', 'FOOT')]},
  {'section': 'Total accuracy',
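
Several items listed as incorrect above, such as `('EYE', 'SEE', 'EAR', 'EAR')`, do not look like well-formed "a is to b what c is to d" analogies, which may point to a problem in the test file rather than in the model. The sketch below shows one possible way to generate the 12 items from the three pairs given in the question, using the `: section-name` header line of the `questions-words.txt` template. `build_analogy_lines` is a hypothetical helper, not part of Gensim.

```python
from itertools import permutations

def build_analogy_lines(section_name, pairs):
    """Build a section header plus 'a b c d' lines for every ordered pair of pairs."""
    lines = [f": {section_name}"]
    for (a, b), (c, d) in permutations(pairs, 2):
        lines.append(f"{a} {b} {c} {d}")  # a is to b what c is to d
        lines.append(f"{b} {a} {d} {c}")  # same analogy read in the other direction
    return lines

pairs = [("eye", "see"), ("ear", "listen"), ("foot", "walk")]
lines = build_analogy_lines("sensory-verbs", pairs)
print("\n".join(lines))
```

The resulting lines could then be written to `questions-words-NAME.txt` with `"\n".join(lines)` and evaluated as in the cell above.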
  

## 3. Training a word2vec model from scratch
In this section, you will first use `gensim.downloader` to retrieve a 100-million character corpus ('text8' excerpt from Wikipedia).  You will use this data to train your own word2vec model.  Then, you will test the model on word similarity and analogies tasks.
* [documentation of gensim.downloader](https://radimrehurek.com/gensim/downloader.html)
* [corpora and pre-trained models available from gensim-data](https://github.com/RaRe-Technologies/gensim-data) -- the list can also be accessed with the command `gensim.downloader.info()` 

Please run the following code first.

In [65]:
import gensim.downloader as api
text8_corpus = api.load('text8') # Downloads file once if needed -- if not, loads it from local copy.

a. How many words are there in the 'text8' corpus?

In [63]:
from gensim.models import Word2Vec

# Please write your Python code below and execute it.
# text8_model = Word2Vec(text8_corpus)
# text8_model.save("./text8-word2vec.bin")


In [64]:
text8_model = Word2Vec.load("./text8-word2vec.bin") 

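print(f"Vocabulary size of the model trained on 'text8': {text8_model.wv.vectors.shape[0]}")
# Note: this counts the distinct words kept in the model's vocabulary, not the word tokens in the corpus.


Vocabulary size of the model trained on 'text8': 71290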
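
The figure above is read from the trained model, so it reflects the distinct words retained in its vocabulary (Gensim's `Word2Vec` drops words seen fewer than `min_count` times, 5 by default) rather than the number of word tokens in the corpus. A minimal sketch of counting tokens directly is shown below, assuming the corpus is an iterable of token lists, as `api.load('text8')` appears to provide. The toy corpus is invented for illustration.

```python
def count_tokens(corpus):
    """Return (total number of tokens, number of distinct tokens)."""
    total_tokens = 0
    distinct_tokens = set()
    for sentence in corpus:  # each sentence is a list of string tokens
        total_tokens += len(sentence)
        distinct_tokens.update(sentence)
    return total_tokens, len(distinct_tokens)

# Hypothetical stand-in for text8_corpus.
toy_corpus = [
    ["anarchism", "originated", "as", "a", "term"],
    ["a", "term", "of", "abuse"],
]

total, distinct = count_tokens(toy_corpus)
print(f"Tokens: {total}, distinct words: {distinct}")
```

With the real data, `count_tokens(text8_corpus)` would give the corpus size asked for in the question.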


b. Using the [documentation of Gensim's Word2Vec class](https://radimrehurek.com/gensim/models/word2vec.html), train your own word2vec model using 'text8'.  How many seconds does this take? (Use the difference between start and end times obtained with `time.time()`.)

In [66]:
import time
from gensim.models import Word2Vec
# Please write your Python code below and execute it.

start_time = time.time()

text8_model.train(text8_corpus, total_examples=text8_model.corpus_count, epochs=10)

end_time = time.time()

print(f"Training the model took {end_time - start_time:.2f} seconds")

Training the model took 81.81 seconds


c. Using your code from Section 1, what are the vocabulary size and the dimensionality of the embedding space of this model?

In [None]:
# Please write the Python code needed to display the vocabulary size and execute it.

print(f"Vocabulary size: {text8_model.wv.vectors.shape[0]}\n")

print(f"Dimensionality of the embedding space: {text8_model.wv.vectors.shape[1]}")

d. Please read the "*Usage examples*" of the [Word2Vec class](https://radimrehurek.com/gensim/models/word2vec.html) to understand the difference between saving the full Word2Vec model (which enables future retraining on additional data) or saving only the vectors, an instance of KeyedVectors, which will save space.  Now, (1) save the vectors only, (2) load the vectors into a new variable, and (if everything worked fine), (3) delete the old model variable from the notebook's memory using `del`.  Note: saving the vectors may create one or more files, depending on the size of the model.

In [67]:
# Please write your Python code below and execute it.

# Save the vectors only
text8_model.wv.save("./text8-word2vec-vectors.kv")

# Load the vectors into a new variable
wv_text8 = KeyedVectors.load("./text8-word2vec-vectors.kv")

# Delete the old model variable
#del text8_model




e. Evaluate the new model on WordSimilarity-353 and Analogies tasks, reusing your code from above.  How does this model compare with the one trained on Google News?  Why?

In [68]:
wv_text8.init_sims(replace=True) # see (2b) but less important as the model is much smaller

# Please write your Python code below and execute it.

# Evaluate the model on the WordSimilarity-353 task
pearson_correlation_text8 = wv_text8.evaluate_word_pairs(word_similarity_dataset)[0][0]

# Display the Pearson Correlation Coefficient with two decimals
print(f"Pearson Correlation Coefficient for the 'text8' model: {pearson_correlation_text8:.2f}")

# Evaluate the model on the Analogy Tasks
analogy_scores_text8 = wv_text8.evaluate_word_analogies(datapath('questions-words.txt'))

# Display the accuracy of the model on the analogy tasks
print(f"Accuracy of the 'text8' model on the analogy tasks: {analogy_scores_text8[0]}")


Pearson Correlation Coefficient for the 'text8' model: 0.66
Accuracy of the 'text8' model on the analogy tasks: 0.3280417344477478


In [69]:
# Please write below a short comment to compare the 'Text8' model with the 'Google News' model.

"The 'text8' model performs slightly better on the WordSimilarity-353 task than the 'Google News' model, with a Pearson Correlation Coefficient of 0.66 compared to 0.62. This suggests that, on this small set of fairly common words, the 'text8' vectors capture similarity at least as well. However, the 'Google News' model performs much better on the analogy tasks, with an accuracy of 0.74 compared to 0.33 for the 'text8' model. A likely reason is the amount of training data: 'text8' is a 100-million character excerpt from Wikipedia, far smaller than the Google News corpus, so rarer words and relations (e.g. currencies or world capitals) get much less reliable vectors."

f. Compare the accuracies on the analogy tasks of the two models for each category of tasks.  For which category are accuracies the most similar?  Can you explain this?

In [93]:
# Please write your Python code below and execute it.

def get_accuracies(accuracy_result):
    accuracies = {}
    detailed_accuracies = accuracy_result[1]  # list of per-section dictionaries
    for section in detailed_accuracies:
        section_name = section['section']
        # 'correct' and 'incorrect' are lists of analogy tuples, so their lengths give the counts
        accuracy = len(section['correct']) / (len(section['correct']) + len(section['incorrect']))
        accuracies[section_name] = accuracy
    return accuracies

def print_top_accuracies(accuracies):
    # Sort accuracies in descending order by their score
    sorted_accuracies = sorted(accuracies.items(), key=lambda x: x[1], reverse=True)
    
    for section, accuracy in sorted_accuracies:
        print(f"{section}: {accuracy:.2%}")

    
categories_accuracies_google = get_accuracies(analogy_scores)
categories_accuracies_text8 = get_accuracies(analogy_scores_text8)

print("\nGoogle News accuracies:\n")
print_top_accuracies(categories_accuracies_google)

print("\nText8 accuracies:\n")
print_top_accuracies(categories_accuracies_text8)



Google News accuracies:

gram3-comparative: 91.29%
gram6-nationality-adjective: 90.18%
gram4-superlative: 87.97%
gram8-plural: 87.01%
family: 86.17%
capital-common-countries: 83.20%
capital-world: 81.32%
gram5-present-participle: 78.50%
Total accuracy: 74.01%
city-in-state: 72.11%
gram9-plural-verbs: 68.16%
gram7-past-tense: 65.38%
gram2-opposite: 43.47%
gram1-adjective-to-adverb: 29.23%
currency: 28.47%

Text8 accuracies:

gram3-comparative: 63.51%
family: 62.38%
gram6-nationality-adjective: 60.62%
capital-common-countries: 56.92%
gram8-plural: 39.79%
gram5-present-participle: 36.17%
Total accuracy: 32.80%
gram7-past-tense: 32.12%
gram9-plural-verbs: 30.92%
capital-world: 26.12%
gram4-superlative: 25.10%
city-in-state: 16.35%
gram1-adjective-to-adverb: 14.52%
gram2-opposite: 10.98%
currency: 10.07%
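
To see which category has the most similar accuracies, the two lists above can be compared category by category. The sketch below hard-codes the percentages printed above (leaving out the "Total accuracy" row) and sorts the categories by absolute gap. In the notebook, the dictionaries returned by `get_accuracies` could be used instead.

```python
# Accuracies in percent, copied from the two outputs above.
google_news = {
    "gram3-comparative": 91.29, "gram6-nationality-adjective": 90.18,
    "gram4-superlative": 87.97, "gram8-plural": 87.01, "family": 86.17,
    "capital-common-countries": 83.20, "capital-world": 81.32,
    "gram5-present-participle": 78.50, "city-in-state": 72.11,
    "gram9-plural-verbs": 68.16, "gram7-past-tense": 65.38,
    "gram2-opposite": 43.47, "gram1-adjective-to-adverb": 29.23,
    "currency": 28.47,
}
text8 = {
    "gram3-comparative": 63.51, "family": 62.38,
    "gram6-nationality-adjective": 60.62, "capital-common-countries": 56.92,
    "gram8-plural": 39.79, "gram5-present-participle": 36.17,
    "gram7-past-tense": 32.12, "gram9-plural-verbs": 30.92,
    "capital-world": 26.12, "gram4-superlative": 25.10,
    "city-in-state": 16.35, "gram1-adjective-to-adverb": 14.52,
    "gram2-opposite": 10.98, "currency": 10.07,
}

# Absolute difference in percentage points for each category.
gaps = {name: abs(google_news[name] - text8[name]) for name in google_news}
closest = min(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.2f} points")
print(f"\nMost similar category: {closest}")
```

On these numbers the smallest gap is in `gram1-adjective-to-adverb` (about 14.7 points), followed by `currency`; both are categories where the two models score low.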


In [117]:
def print_words_from_section(section_name):
    # Accessing the list of dictionaries
    for section in analogy_scores[1]:
        # Check if this dictionary is for the section we're interested in
        if section["section"] == section_name:
            # Print both the correctly and the incorrectly solved analogies
            print(f"Correct words in {section_name}: {section['correct']}")
            print(f"Incorrect words in {section_name}: {section['incorrect']}")
            break  # Break after finding and printing the section to avoid unnecessary iterations

# Example usage
print_words_from_section("gram3-comparative")

Correct words in gram3-comparative: [('BAD', 'WORSE', 'BIG', 'BIGGER'), ('BAD', 'WORSE', 'BRIGHT', 'BRIGHTER'), ('BAD', 'WORSE', 'CHEAP', 'CHEAPER'), ('BAD', 'WORSE', 'COLD', 'COLDER'), ('BAD', 'WORSE', 'DEEP', 'DEEPER'), ('BAD', 'WORSE', 'EASY', 'EASIER'), ('BAD', 'WORSE', 'FAST', 'FASTER'), ('BAD', 'WORSE', 'GOOD', 'BETTER'), ('BAD', 'WORSE', 'GREAT', 'GREATER'), ('BAD', 'WORSE', 'HARD', 'HARDER'), ('BAD', 'WORSE', 'HEAVY', 'HEAVIER'), ('BAD', 'WORSE', 'HIGH', 'HIGHER'), ('BAD', 'WORSE', 'HOT', 'HOTTER'), ('BAD', 'WORSE', 'LARGE', 'LARGER'), ('BAD', 'WORSE', 'LONG', 'LONGER'), ('BAD', 'WORSE', 'LOUD', 'LOUDER'), ('BAD', 'WORSE', 'LOW', 'LOWER'), ('BAD', 'WORSE', 'QUICK', 'QUICKER'), ('BAD', 'WORSE', 'SAFE', 'SAFER'), ('BAD', 'WORSE', 'SHARP', 'SHARPER'), ('BAD', 'WORSE', 'SHORT', 'SHORTER'), ('BAD', 'WORSE', 'SIMPLE', 'SIMPLER'), ('BAD', 'WORSE', 'SLOW', 'SLOWER'), ('BAD', 'WORSE', 'SMART', 'SMARTER'), ('BAD', 'WORSE', 'STRONG', 'STRONGER'), ('BAD', 'WORSE', 'TALL', 'TALLER'), ('BAD'

In [ ]:
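"In absolute terms, the accuracies are closest for gram1-adjective-to-adverb (29.23% vs 14.52%), a category where both models do poorly. Comparatives are the top-scoring category for both models (91.29% and 63.51%): the comparative relation seems to be captured as a fairly regular vector offset, so even a less accurate model can still reach good accuracy on this category."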

## 4. Compare the two models on your own analogy tasks
In this section, you will evaluate the new model on the analogy tasks you defined in Section 2f.  You will then try to diagnose the performance by inspecting the word vectors.

a. Reusing the code from above, what is the accuracy of the model trained on Text 8 on your analogy tasks from 2f?

In [None]:
# Please write your Python code below and execute it.
# TODO: ask Killian

b. We are now going to visualize the word vectors for the words in your analogy task.  Store the list of words in a variable and check which ones are in the vocabulary; create a new variable with them.

In [None]:
# Please write your Python code below and execute it.

words = ['eye', 'see', 'ear', 'listen', 'foot', 'walk', 'apples', 'horse']

words_in_vocab = [word for word in words if word in wv_text8.key_to_index]

print(f"Words in the vocabulary: {words_in_vocab}")


# TODO: check which ones are in the vocabulary of the text8 model
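
Instead of typing the word list by hand, the words could be collected from the analogy file itself and then filtered by vocabulary. The sketch below assumes the `questions-words.txt` layout (section headers starting with `:` and four words per line). `words_from_analogy_lines` is a hypothetical helper, and `toy_vocabulary` stands in for a real membership test such as `word in wv_text8.key_to_index`.

```python
def words_from_analogy_lines(lines):
    """Collect the distinct words of an analogy file, in order of first appearance."""
    words = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(":"):  # skip blank lines and section headers
            continue
        for word in line.split():
            if word not in words:
                words.append(word)
    return words

# Hypothetical file content and vocabulary, for illustration only.
analogy_lines = [": sensory-verbs", "eye see ear listen", "ear listen foot walk"]
toy_vocabulary = {"eye", "see", "ear", "foot", "walk"}

all_words = words_from_analogy_lines(analogy_lines)
known_words = [word for word in all_words if word in toy_vocabulary]
missing_words = [word for word in all_words if word not in toy_vocabulary]
print(f"In vocabulary: {known_words}")
print(f"Missing: {missing_words}")
```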

c. The function below will help you plot a 2D representation of the word vectors using [PCA from scikit.learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).  (It is also possible to use [UMAP](https://umap-learn.readthedocs.io/en/latest/basic_usage.html) instead of PCA in display_scatterplot).  Please display the word vectors for your model trained on Text8, and then for the model trained on Google News.  Please comment on the differences. 

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
def display_scatterplot(model, words): # assumes all words are in the vocabulary
    word_vectors = [model[word] for word in words]
    twodim = PCA().fit_transform(word_vectors)[:,:2]
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x + 0.03, y + 0.03, word)

In [None]:
# Please write your Python code below and execute it.

In [None]:
# Please write your Python code below and execute it.

## End of Lab 6
Please make sure all cells have been executed, save this completed notebook, compress it to a *zip* file, and upload it to [Moodle](https://moodle.msengineering.ch/course/view.php?id=1869).