# Module 1 Project 2: Word2Vec and BERT

Implement Word2Vec and BERT and experiment with the embeddings they produce.

Note: `gensim` must be installed for the Word2Vec steps to run.
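
The imports in Step 1 fail with an `ImportError` if a package is missing, so it can help to check up front. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the package list simply mirrors the imports used in this notebook.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the names in `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]

# Packages imported in Step 1 of this notebook
required = ["gensim", "nltk", "pandas"]
missing = missing_packages(required)
if missing:
    print("Install these first, e.g. with pip:", ", ".join(missing))
else:
    print("All required packages are available.")
```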

## STEP 1: IMPORTS
- Import the necessary libraries to get Word2Vec to work properly
- Download the 'punkt' package from nltk to get the word tokenization model

In [None]:
from gensim.models import Word2Vec, KeyedVectors
import pandas as pd
import nltk
import os

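nltk.download('punkt')
# Recent NLTK releases look for 'punkt_tab' instead; requesting both is harmless
nltk.download('punkt_tab')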

## STEP 2: LOAD THE DATA FILE
- Load the file containing text you want to embed
- Norm Macdonald's Wikipedia page is included in `data.txt` as an example here

In [None]:
lines = []

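with open("data.txt", "r", encoding="utf-8") as file:
    for line in file:
        lines.append(line.strip())  # Strip leading/trailing whitespace, including the newline

# Drop duplicate lines (note: set() does not preserve the original order)
lines = list(set(lines))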
print(lines[:10])
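
Because a `set` does not keep insertion order, the preview printed above can differ between runs, and empty lines are kept as a single `''` entry. The following is a minimal sketch of an order-preserving alternative; the helper name `clean_lines` and the sample data are made up for illustration and are not part of the original notebook.

```python
def clean_lines(raw_lines):
    """Strip each line, drop empty ones, and remove duplicates while keeping order."""
    stripped = (line.strip() for line in raw_lines)
    non_empty = (line for line in stripped if line)
    # dict preserves insertion order, so fromkeys() acts as an ordered set
    return list(dict.fromkeys(non_empty))

# Made-up sample standing in for the contents of data.txt
sample = ["First line\n", "\n", "Second line\n", "First line\n", "  \n"]
print(clean_lines(sample))
```

Calling `clean_lines(file)` inside the `with` block would be one way to apply the same idea to `data.txt`.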

## STEP 3: TOKENIZE AND INITIALIZE WORD2VEC MODEL
- Tokenize each line into a list of word tokens using `nltk.word_tokenize()`
- Train a Word2Vec model on the tokenized lines to learn an embedding for each word

In [None]:
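# Each line becomes a list of word tokens; Word2Vec expects a list of such token lists
vector = [nltk.word_tokenize(line) for line in lines]

print(vector[:5])  # Preview a few tokenized lines rather than the whole corpus

# vector_size=32 is an arbitrary choice for this example;
# min_count=1 keeps every token, even those that appear only once
model = Word2Vec(vector, min_count=1, vector_size=32)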
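
The `min_count` argument controls which tokens get an embedding at all: tokens that occur fewer than `min_count` times are dropped before training (gensim's default is 5, which would discard much of a small corpus like this one). The sketch below is a simplified illustration of that filtering idea, not gensim's actual code; the function name `vocabulary` and the toy data are assumptions made for the example.

```python
from collections import Counter

def vocabulary(tokenized_lines, min_count):
    """Return the tokens that occur at least `min_count` times, with their counts."""
    counts = Counter(token for line in tokenized_lines for token in line)
    return {token: n for token, n in counts.items() if n >= min_count}

# Made-up tokenized lines standing in for the output of nltk.word_tokenize
toy = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]
print(vocabulary(toy, min_count=1))  # every token is kept
print(vocabulary(toy, min_count=2))  # only tokens seen at least twice remain
```

With `min_count=1` every word can be looked up in `model.wv`, at the cost of noisy vectors for words seen only once.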

## STEP 4: FIND SIMILARITIES AND DO VECTOR MATH
- Find the words whose embeddings are most similar to a given word from the corpus
- Following the classic word-vector arithmetic examples, add and subtract word vectors and see which words lie closest to the result

In [None]:
# Word similarity
print(model.wv.most_similar("Norm"))

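# Vector math
# Every word used here must be in the model's vocabulary, otherwise model.wv[...] raises a KeyError
vec = model.wv['Norm'] + model.wv['Macdonald'] - model.wv['Donald']
print(model.wv.most_similar(positive=[vec]))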
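
`most_similar` ranks words by the cosine similarity between their embeddings and the query. The sketch below reproduces that idea in plain Python on made-up 2-D vectors, using the classic "king - man + woman" example; the function names and the toy vectors are assumptions for illustration and do not come from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=3):
    """Return the `topn` (word, similarity) pairs closest to `query`."""
    scored = [(word, cosine(query, vec)) for word, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-D vectors for illustration only; real embeddings come from the trained model
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.3],
}

# Vector math: king - man + woman
query = [k - m + w for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])]
print(most_similar(query, toy_vectors))
```

In this toy setup the closest word to the query is "queen". With a raw vector as the query, the words used to build it may themselves appear among the results, so it is worth scanning past them when reading the output of the cell above.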