# NLP - Word2vec Demo

# 1. Train Your Own Word2Vec Model

## Different ways of importing text data

1. Web scraping
2. PDF/Word files

The `requests.get(url)` function fetches the HTML content of the specified **url**. The response text is then passed to the **BeautifulSoup constructor**, which returns a **BeautifulSoup object** that can be used to parse the HTML.

The expression `soup.find_all("p")[:10]` finds all the `<p>` elements in the HTML and keeps only the first ten. The resulting list of elements is stored in the **paragraphs** variable.

In [2]:
#!pip install PyPDF2
#!pip install nltk

In [120]:
import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/title/tt0848228/plotsummary/"

# this helps you to go to the website to fetch the content
response = requests.get(url)

# BeautifulSoup constructor takes the text as input and returns a BeautifulSoup object.
# BeautifulSoup object makes it easier to parse and extract information from the HTML content.
soup = BeautifulSoup(response.text, "html.parser")

# Select the first ten paragraph (<p>) elements in the HTML
paragraphs = soup.find_all("p")[:10]

# Extract the text from the selected paragraphs and concatenate them
combined_paragraph = ' '.join([p.text for p in paragraphs])

print(combined_paragraph)





In [123]:
#Import the library
import PyPDF2


In [124]:
# Open the PDF file
with open("Avengers.pdf", "rb") as file:
    # Create a PDF object
    pdf = PyPDF2.PdfReader(file)
    
    # Initialize a variable to store the extracted text
    corpus = ""
    
    # Extract the text from each page of the PDF. We have only one page.
    for page in pdf.pages:
        corpus += page.extract_text()
        
    # Print the extracted text
    print(corpus)


Nick Fury ( Samuel  L. Jackson ), director of S.H.I.E.L.D. (Strategic Homeland 
Intervention, Enforcement, and Logistics Division), arrives at S.H.I.E.L.D. 
headquarters outside of Santa Fe, New Mexico, during an evacuation. The 
Tesseract, an energy sour ce of unknown potential, has activated. It opens a 
portal through space and the exiled Norse god Loki ( Tom  Hiddleston ) steps 
through, carrying a strange spear w ith a blue glowing tip. Loki takes the 
Tesseract and uses the spear to take control of the minds of several SHIELD 
personnel, including Dr. Erik Selvig ( Stellan  Skarsgård ) , and Agent Clint 
"Hawkeye" Barton ( Jeremy  Renner ), to aid him in his getaway. SHIELD 
personnel pull out of their base when an energy surge from  the Tesseract 
causes the ground beneath the base to collapse and destroying it. A short 
pursuit of Loki fails to capture him.  
 
In response to the attack, Nick Fury issues a state of emergency, telling his top 
agents Phil Coulson ( Clark  Gregg 

In [125]:
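# Perform text preprocessing on the words
# Note: sent_tokenize and word_tokenize also need NLTK's Punkt tokenizer data.
# If it is missing, run nltk.download('punkt') (newer NLTK releases ask for 'punkt_tab').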

import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')

nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/swapnil/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/swapnil/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/swapnil/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [50]:
# create a function to apply pre-processing

def preprocess_text(text):
    
    # Convert the text to lowercase
    text = text.lower()
    
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    
    # Tokenize each sentence, drop punctuation and stop words, then lemmatize
    preprocessed_sentences = []
    
    import string
    
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        words = [token for token in words if token not in string.punctuation]
        words = [token for token in words if token not in stop_words]
        words = [wordnet_lemmatizer.lemmatize(word) for word in words]
        #preprocessed_sentence = ' '.join(words)
        preprocessed_sentences.append(words)
    
    return preprocessed_sentences

In [126]:
# let's pre-process the text and save it in a variable
text = preprocess_text(corpus)

In [127]:
# view the text. Word2Vec expects a list of sentences, where each sentence is itself a list of tokens
text

[['nick', 'fury', 'samuel', 'l.', 'jackson', 'director', 's.h.i.e.l.d'],
 ['strategic',
  'homeland',
  'intervention',
  'enforcement',
  'logistics',
  'division',
  'arrives',
  's.h.i.e.l.d'],
 ['headquarters', 'outside', 'santa', 'fe', 'new', 'mexico', 'evacuation'],
 ['tesseract', 'energy', 'sour', 'ce', 'unknown', 'potential', 'activated'],
 ['open',
  'portal',
  'space',
  'exiled',
  'norse',
  'god',
  'loki',
  'tom',
  'hiddleston',
  'step',
  'carrying',
  'strange',
  'spear',
  'w',
  'ith',
  'blue',
  'glowing',
  'tip'],
 ['loki',
  'take',
  'tesseract',
  'us',
  'spear',
  'take',
  'control',
  'mind',
  'several',
  'shield',
  'personnel',
  'including',
  'dr.',
  'erik',
  'selvig',
  'stellan',
  'skarsgård',
  'agent',
  'clint',
  "''",
  'hawkeye',
  "''",
  'barton',
  'jeremy',
  'renner',
  'aid',
  'getaway'],
 ['shield',
  'personnel',
  'pull',
  'base',
  'energy',
  'surge',
  'tesseract',
  'cause',
  'ground',
  'beneath',
  'base',
  'collapse

In [128]:
# we have 123 sentences in the text
len(text)

123

In [32]:
# install gensim
#!pip install gensim

In [133]:
from gensim.models import Word2Vec
model = Word2Vec(window=10, min_count=1, vector_size=150)

The **`vector_size`** parameter sets the dimensionality of the word vectors produced by the model: with `vector_size=3` each word is a 3-dimensional vector, and with `vector_size=10` a 10-dimensional one. The right value depends on the application and the size of the dataset. A higher `vector_size` generally lets the model capture finer-grained distinctions between words, but it needs more computational resources and more training data. A lower `vector_size` is cheaper to train and needs less data, but may produce less accurate word vectors.

The **`window`** parameter specifies the maximum distance between the target word and its context words (i.e., words that appear in the same context or window as the target word) within a sentence.

For example, if `window=5`, the Word2Vec model considers up to 5 words before and up to 5 words after the target word as its context words. In other words, the context window spans at most 11 words (5 words before the target word, the target word itself, and 5 words after it); it is shorter near the start or end of a sentence.

The window parameter is important because it affects the way the Word2Vec model learns word embeddings (i.e., vector representations of words). A smaller window size may capture more syntactic relationships between words (e.g., plural/singular forms, verb tenses), while a larger window size may capture more semantic relationships between words (e.g., related concepts, topic associations).

In practice, the optimal value for window depends on the specific task and the nature of the text data being analyzed. It may be necessary to experiment with different values of window to find the optimal setting for a given task or dataset.
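To make the window idea concrete, here is a minimal pure-Python sketch that lists the (target, context) pairs a given `window` allows. The helper name `context_pairs` and the toy sentence are illustrative assumptions, not part of gensim. Note also that gensim, by default, samples a smaller effective window for each target word during training, so this shows the maximum context only.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the window at the sentence boundaries
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # the target word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Toy sentence (hypothetical, in the style of the preprocessed corpus)
sentence = ["loki", "take", "tesseract", "us", "spear"]

pairs_w1 = context_pairs(sentence, window=1)
pairs_w2 = context_pairs(sentence, window=2)

print(pairs_w1)
print(len(pairs_w1), len(pairs_w2))
```

A wider window produces more pairs per target word, which is why larger values tend to pull in broader, topic-level associations.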

**`min_count`** specifies the minimum total frequency a word must have across the corpus to be kept: words that occur fewer times than this number are ignored. With `min_count=1`, as used here, every word is kept.
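The effect of `min_count` can be sketched in plain Python: count how often each word occurs in the whole corpus, then drop the words below the threshold. The function `apply_min_count` and the toy sentences are hypothetical and only mimic the filtering idea; gensim applies this filter internally when it builds the vocabulary.

```python
from collections import Counter

def apply_min_count(sentences, min_count):
    """Drop every token whose total corpus frequency is below min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [
        [token for token in sentence if counts[token] >= min_count]
        for sentence in sentences
    ]

# Toy corpus (hypothetical): only 'loki' and 'tesseract' occur twice
toy_sentences = [
    ["loki", "take", "tesseract"],
    ["fury", "chase", "loki"],
    ["tesseract", "open", "portal"],
]

filtered = apply_min_count(toy_sentences, min_count=2)
print(filtered)
```

On a small corpus like the one in this demo, a threshold above 1 would discard most of the vocabulary, which is one reason to keep `min_count=1` here.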

In [134]:
# build vocabulary
model.build_vocab(text)

The **`build_vocab()`** method constructs the model's vocabulary: the collection of unique words for which the model will learn embeddings. It takes an iterable of sentences, each given as a list of tokens, and assigns a unique integer ID to each unique word. It also counts how often each word occurs; these counts are used to apply `min_count` and, during training, to downsample very frequent words and to draw negative samples.
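As a rough illustration of the ID assignment, the sketch below maps each unique word to an integer, with the most frequent word first. The function `build_vocab_sketch` is hypothetical, and breaking ties alphabetically is an assumption made for a deterministic result. The two mappings are named after gensim's `index_to_key` and `key_to_index` attributes.

```python
from collections import Counter

def build_vocab_sketch(sentences):
    """Count words and assign integer IDs, most frequent word first."""
    counts = Counter(token for sentence in sentences for token in sentence)
    # Sort by descending frequency; ties are broken alphabetically (an assumption)
    index_to_key = sorted(counts, key=lambda word: (-counts[word], word))
    key_to_index = {word: idx for idx, word in enumerate(index_to_key)}
    return counts, index_to_key, key_to_index

# Toy corpus (hypothetical)
toy_sentences = [
    ["loki", "take", "tesseract"],
    ["loki", "us", "spear"],
    ["loki", "tesseract"],
]

counts, index_to_key, key_to_index = build_vocab_sketch(toy_sentences)
print(index_to_key)
print(key_to_index["loki"], counts["loki"])
```

Frequency ordering like this would explain why a very common word in the corpus ends up at index 0.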


In [135]:
# check the vocabulary size: the shape is (number of unique words, vector size)
model.wv.get_normed_vectors().shape

(747, 150)

We have 747 unique words in the vocab and each word is represented by a vector of size 150.

In [136]:
#train the model
model.train(text, total_examples=len(text), epochs=5)

(5825, 6975)

Now, we have a word2vec model ready.

In [143]:
#let's find similar words
model.wv.most_similar('avenger')

[('placed', 0.24626438319683075),
 ('spear', 0.241495281457901),
 ('know', 0.21610021591186523),
 ('fury', 0.2143598347902298),
 ('turning', 0.21424345672130585),
 ('plan', 0.20982405543327332),
 ('ro', 0.20566891133785248),
 ('oppose', 0.20075683295726776),
 ('third', 0.20014487206935883),
 ('saying', 0.19851651787757874)]

Our dataset is small, so the similarity scores are low and the neighbours may not be meaningful. Still, the list above shows the words that the model currently places nearest to the word 'avenger'.
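The scores returned by `most_similar` are cosine similarities between word vectors. The following NumPy sketch shows the idea on made-up 3-dimensional vectors; the helper `most_similar_sketch` and the toy vectors are assumptions for illustration and are not the model's actual values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar_sketch(query, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = {
        word: cosine_similarity(vectors[query], vec)
        for word, vec in vectors.items()
        if word != query
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]

# Made-up vectors, chosen so the similarities are easy to check by hand
toy_vectors = {
    "avenger": np.array([1.0, 0.0, 1.0]),
    "hero": np.array([2.0, 0.0, 2.0]),      # same direction as 'avenger'
    "spear": np.array([1.0, 1.0, 0.0]),     # partly overlapping direction
    "portal": np.array([-1.0, 0.0, -1.0]),  # opposite direction
}

ranked = most_similar_sketch("avenger", toy_vectors)
print(ranked)
```

Because cosine similarity ignores vector length, 'hero' scores as a perfect match here even though its vector is twice as long.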

In [144]:
# let's check the vector representation of the word 'avenger'
model.wv['avenger']

array([ 6.46621687e-03, -6.77131815e-03, -4.52126516e-03,  1.88308570e-03,
        4.34773136e-03, -3.80339567e-03,  1.88142364e-03,  6.71937224e-03,
       -4.54686815e-03, -3.86765786e-03, -3.21385590e-03, -2.55719828e-03,
        8.60671862e-04,  6.63531851e-03,  3.93053191e-03,  2.44986819e-04,
        1.87286851e-03,  5.50903007e-03,  6.14729803e-03,  3.96000873e-03,
        3.67340306e-03, -5.07923076e-03, -2.13990291e-03, -3.54535645e-03,
        4.31587594e-03, -1.54591631e-03, -6.35393104e-03,  4.84336400e-03,
        5.53453062e-03, -2.58656358e-03,  6.01981068e-03, -2.93887395e-04,
       -2.52163340e-03, -5.33496612e-04,  1.40032425e-05, -2.14555487e-03,
        2.20195088e-03,  3.92756285e-03,  4.69684787e-03, -6.27568876e-03,
       -1.54334877e-03, -3.65370675e-03,  4.75600315e-03,  4.03339276e-03,
       -2.61034584e-03,  1.57559896e-03, -3.88459931e-03, -7.10575405e-05,
        6.37413468e-03, -1.62970822e-03, -3.60586052e-03, -4.88168513e-03,
       -2.08787806e-03, -

In [145]:
# the vector has 150 entries, since we set vector_size=150
model.wv['avenger'].shape

(150,)

## Let's create a plot to visualize the related words

But first, we need to perform **dimensionality reduction** on the word vectors. Right now each word is represented as a vector of size 150. That is a 150-dimensional space, which we cannot visualize. Therefore, we will reduce it to 3 dimensions using **Principal Component Analysis (PCA)**.
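For intuition about what PCA does, here is a minimal NumPy sketch that centers the data and projects it onto its top principal directions using a singular value decomposition. It is a stand-in for illustration, not scikit-learn's implementation, and it runs on random data rather than the model's vectors; the signs of the components may differ from what scikit-learn returns.

```python
import numpy as np

def pca_reduce(X, n_components=3):
    """Project the rows of X onto the top principal directions."""
    # PCA works on mean-centered data
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Random stand-in for a (words x 150) embedding matrix
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(20, 150))

reduced = pca_reduce(toy_vectors, n_components=3)
print(reduced.shape)
```

Each row of `reduced` is the 3-dimensional position of one row of the input, with the first column carrying the most variance.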

In [146]:
#import PCA
from sklearn.decomposition import PCA

In [147]:
# reduce to 3 dimensions
size = PCA(n_components=3)

In [148]:
# fit PCA on the word vectors and transform them
new = size.fit_transform(model.wv.get_normed_vectors())

In [149]:
# we went from this
model.wv.get_normed_vectors()[:1]

array([[-0.00957275, -0.00236291,  0.06708392,  0.12601295, -0.1281499 ,
        -0.10457395,  0.08964259,  0.14604385, -0.06809499, -0.04481853,
         0.10817556, -0.02562193, -0.0755891 ,  0.09680159, -0.07995019,
        -0.02739835,  0.04630097,  0.01124558, -0.11380569, -0.1259331 ,
         0.09596393,  0.07028997,  0.10540289,  0.01858084,  0.09482995,
        -0.04774164, -0.02828027,  0.07270938, -0.10970478, -0.06506913,
        -0.10696231, -0.0079626 ,  0.1287963 , -0.11016849, -0.03704391,
        -0.01935   ,  0.12584853, -0.07965282,  0.00679659, -0.07715843,
        -0.13847142,  0.07160766, -0.13037744, -0.07104725,  0.00722145,
        -0.00129208, -0.10598   ,  0.12569694,  0.07428049,  0.13197109,
        -0.11851989,  0.06538057, -0.06212136,  0.01621149,  0.11672914,
        -0.05413518,  0.06501617, -0.09173374, -0.04894328,  0.1294983 ,
        -0.0236433 , -0.00800834, -0.05939234, -0.11032293, -0.02393637,
         0.02774317, -0.01815248,  0.05856332, -0.0

In [150]:
# to this
new[:1]

array([[-0.01623621, -0.05289235, -0.24216872]], dtype=float32)

In [151]:
# index_to_key lists the words in the same order as the rows of the vector matrix
word=model.wv.index_to_key

In [152]:
# let's find what word the 1st vector represents
word[:1]

['loki']
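Since row `i` of the reduced matrix belongs to the word at position `i` in `index_to_key`, the two can be paired up in a dictionary. The values below are stand-ins for illustration only and are not real model output.

```python
import numpy as np

# Stand-in values for illustration only, not real model output
index_to_key = ["loki", "fury", "tesseract"]
reduced = np.array([
    [0.5, -0.25, 0.0],
    [0.0, 0.75, -0.5],
    [-1.0, 0.25, 0.5],
])

# Row i of the matrix belongs to the word at position i
coords = {word: reduced[i] for i, word in enumerate(index_to_key)}

print(coords["fury"])
```

The same pairing by position is what lets the plot label each point with its word.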

## Use Plotly Express to Create a 3D Plot of the Vectors

**`Plotly Express`** is a Python data visualization library based on `Plotly.js` that lets you create a wide range of interactive visualizations in a few lines of code. It is a high-level interface to Plotly and works directly with Pandas DataFrames as well as array-like data such as the NumPy array used here.

In [153]:
# import Plotly Express
import pandas as pd
import numpy as np
from plotly import express as ex

In [154]:
# plot the first 100 words; columns 0, 1 and 2 of `new` are the three principal components
fig = ex.scatter_3d(new[:100], x=0, y=1, z=2, color=word[:100])
fig.show()

# 2. Use Google's Pretrained Word2Vec Model

**`word2vec-google-news-300`** is a pre-trained Word2Vec model that was trained on a massive dataset of Google News articles. The model contains 300-dimensional word vectors for over 3 million words and phrases.

The training data for this model consists of about 100 billion words from news articles published between 2006 and 2013. The Word2Vec algorithm was used to learn word embeddings, which are dense vector representations of words that capture their semantic meaning.

In [None]:
import gensim.downloader as api

# Load the Google News pre-trained Word2Vec model
model = api.load('word2vec-google-news-300')

# Get the word vector for a given word
vector = model['apple']
print(vector)


In this case, each word is a 300-dimensional vector.

### Note: This is a 1.5 GB model.