## TC 3007B

### Word Embeddings

<br>

#### Activity 1: Exploring Word Embeddings with GloVe and Numpy

<br>

- Objective:
  - To understand the concept of word embeddings and their significance in Natural Language Processing.
  - To learn how to manipulate and visualize high-dimensional data using dimensionality reduction techniques like PCA and t-SNE.
  - To gain hands-on experience in implementing word similarity and analogies using GloVe embeddings and Numpy.

<br>

- Instructions:

  - Download the pre-trained GloVe vectors using the link provided in Canvas, which points to the official public project:
    Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation
    https://nlp.stanford.edu/data/glove.6B.zip

  - Create a dictionary of the embeddings so that you can carry out fast lookups. Save that dictionary, e.g. as a serialized file, for faster loading in future uses.

  - PCA and t-SNE Visualization: After loading the GloVe embeddings, use Numpy and Sklearn to perform PCA and t-SNE to reduce the dimensionality of the embeddings and visualize them in a 2D or 3D space.

  - Word Similarity: Implement a function that takes a word as input and returns the 'n' most similar words based on their embeddings. You should use Numpy to implement this function. Using libraries that already implement it (e.g. Gensim) will result in zero points.

  - Word Analogies: Implement a function to solve analogies between words. For example, "man is to king as woman is to \_\_\_\_". You should use Numpy to implement this function. Using libraries that already implement it (e.g. Gensim) will result in zero points.

  - Submission: This activity is to be submitted in teams of 3 or 4. Only one person should submit the final work, with the full names of all team members included in a markdown cell at the beginning of the notebook.

<br>

- Evaluation Criteria:

  - Code Quality (10%): Your code should be well-organized, clearly commented, and easy to follow. Also use markdown cells for clarity.
  - Functionality (90%): All functions should work as intended, without errors.
    - Visualization of PCA and t-SNE (15% each, for a total of 30%)
    - Similarity function (30%)
    - Analogy function (30%)



#### Import libraries


In [None]:
# Import libraries
import torch
import torch.nn.functional as F
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
from numpy.linalg import norm
import pickle

plt.style.use("ggplot")

#### Load file


In [None]:
# PATH = '/media/pepe/DataUbuntu/Databases/glove_embeddings/glove.6B.200d.txt'
PATH = "/media/pepe/DataUbuntu/Databases/glove_embeddings/glove.6B.50d.txt"
emb_dim = 50

In [None]:
# Create dictionary with embeddings
def create_emb_dictionary(path):
    pass
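
A minimal sketch of how such a dictionary could be built, assuming the standard GloVe text layout (each line holds a word followed by its space-separated vector components). The name `load_glove` and the two-line in-memory sample are illustrative only and are not part of the assignment template.

```python
import io

import numpy as np


def load_glove(lines):
    """Parse GloVe-formatted lines into a {word: vector} dictionary."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word = parts[0]
        # Everything after the first token is a vector component
        embeddings[word] = np.array([float(x) for x in parts[1:]], dtype=np.float32)
    return embeddings


# Tiny in-memory sample standing in for the real file (values are made up)
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
demo_dict = load_glove(sample)
print(demo_dict["cat"].shape)
```

With the real file, the same function could be fed an open file handle, for example one obtained from `open(PATH, "r", encoding="utf-8")`, since iterating over a file yields its lines.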

In [None]:
# create dictionary
embeddings_dict = create_emb_dictionary(PATH)

In [None]:
# Serialize
with open("embeddings_dict_50D.pkl", "wb") as f:
    pickle.dump(embeddings_dict, f)

# Deserialize
# with open('embeddings_dict_50D.pkl', 'rb') as f:
#     embeddings_dict = pickle.load(f)

#### See some embeddings


In [None]:
# Show some
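def show_n_first_words(path, n_words):
    # GloVe files are UTF-8 text: one word per line, followed by its vector
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            # Stop once exactly n_words lines have been printed
            if i >= n_words:
                break
            tokens = line.split()
            # Print the raw tokens and the vector length (tokens minus the word)
            print(tokens, len(tokens) - 1)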

In [None]:
show_n_first_words(PATH, 5)

#### Plot some embeddings


In [None]:
def plot_embeddings(emb_path, words2show, emb_dim, embeddings_dict, func=PCA):
    pass
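
To illustrate what the reduction step does, here is a sketch of a 2D PCA projection written with NumPy's SVD instead of `sklearn.decomposition.PCA`. This is a swapped-in technique for illustration, not the required solution. The helper `pca_project`, the random `toy_dict`, and the out-of-vocabulary word `notaword` are all hypothetical.

```python
import numpy as np


def pca_project(matrix, n_components=2):
    """Project the rows of `matrix` onto their top principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Random vectors standing in for real GloVe embeddings (assumption)
rng = np.random.default_rng(0)
toy_dict = {w: rng.normal(size=50) for w in ["beer", "wine", "tea", "mexico", "spain"]}

# Keep only the words that actually exist in the dictionary
words_to_show = ["beer", "wine", "tea", "mexico", "spain", "notaword"]
present = [w for w in words_to_show if w in toy_dict]

matrix = np.stack([toy_dict[w] for w in present])
coords = pca_project(matrix)
print(coords.shape)
```

Each row of `coords` pairs with the word at the same index in `present`, so the points can be drawn with `plt.scatter` and labeled with `plt.annotate`. Filtering out missing words first avoids a `KeyError` if a word from the list is not in the vocabulary.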

In [None]:
words = [
    "burger",
    "tortilla",
    "bread",
    "pizza",
    "beef",
    "steak",
    "fries",
    "chips",
    "argentina",
    "mexico",
    "spain",
    "usa",
    "france",
    "italy",
    "greece",
    "china",
    "water",
    "beer",
    "tequila",
    "wine",
    "whisky",
    "brandy",
    "vodka",
    "coffee",
    "tea",
    "apple",
    "banana",
    "orange",
    "lemon",
    "grapefruit",
    "grape",
    "strawberry",
    "raspberry",
    "school",
    "work",
    "university",
    "highschool",
]

In [None]:
# PCA dimensionality reduction for visualization
plot_embeddings(PATH, words, emb_dim, embeddings_dict, PCA)

In [None]:
# t-SNE dimensionality reduction for visualization
embeddings = plot_embeddings(PATH, words, emb_dim, embeddings_dict, TSNE)

#### Compute analogies and similar words


In [None]:
# analogy
def analogy(word1, word2, word3, embeddings_dict):
    pass
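
One common way to approach analogies is vector arithmetic followed by a cosine-similarity search: compute `word2 - word1 + word3` and return the closest remaining word. The sketch below shows that idea on hand-made 3D vectors. The names `cosine_similarity` and `solve_analogy` and the `toy` dictionary are assumptions for illustration, not the expected solution.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def solve_analogy(word1, word2, word3, embeddings):
    """Return the word closest to word2 - word1 + word3, excluding the inputs."""
    target = embeddings[word2] - embeddings[word1] + embeddings[word3]
    best_word, best_score = None, -np.inf
    for word, vec in embeddings.items():
        if word in (word1, word2, word3):
            continue  # the input words would otherwise dominate the ranking
        score = cosine_similarity(target, vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hand-made toy vectors (not real GloVe values)
toy = {
    "man": np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king": np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([-1.0, 0.2, -0.5]),
}
print(solve_analogy("man", "king", "woman", toy))  # prints "queen"
```

The explicit loop keeps the logic easy to read. On the full GloVe vocabulary, a vectorized matrix product would likely be much faster.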

In [None]:
analogy("man", "king", "woman", embeddings_dict)

In [None]:
# most similar
def find_most_similar(word, embeddings_dict, top_n=10):
    pass
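
A vectorized sketch of a nearest-neighbor search by cosine similarity, using only NumPy. The name `top_n_similar` and the 2D `toy` vectors are hypothetical. Returning `(word, score)` tuples is an assumption chosen to match the `w[0]` indexing in the printing cell that follows.

```python
import numpy as np


def top_n_similar(word, embeddings, top_n=10):
    """Return the top_n (word, cosine similarity) pairs closest to `word`."""
    vocab = [w for w in embeddings if w != word]  # exclude the query itself
    matrix = np.stack([embeddings[w] for w in vocab])
    query = embeddings[word]
    # Cosine similarity between the query and every row in a single pass
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)[:top_n]  # indices sorted by descending similarity
    return [(vocab[i], float(sims[i])) for i in order]


# Hand-made toy vectors (not real GloVe values)
toy = {
    "mexico": np.array([1.0, 0.0]),
    "spain": np.array([0.9, 0.1]),
    "italy": np.array([0.5, 0.5]),
    "apple": np.array([-1.0, 0.0]),
}
result = top_n_similar("mexico", toy, top_n=2)
print([w for w, _ in result])  # prints ['spain', 'italy']
```

Stacking the vocabulary into a matrix on every call is wasteful for repeated queries. Building the matrix and its row norms once and reusing them is a reasonable refinement.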

In [None]:
most_similar = find_most_similar("mexico", embeddings_dict)

In [None]:
for i, w in enumerate(most_similar, 1):
    print(f"{i} ---> {w[0]}")