<a href="https://colab.research.google.com/github/sdinesh01/NLP-assignments/blob/main/Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embeddings (Word2Vec, Sent2Vec, and Doc2Vec)

## Description

Implement a semantic search engine using the word2vec algorithm. Use pre-trained word embeddings and build a search engine that can retrieve documents related to a given query based on semantic similarity.

### Objective

1. Familiarize yourself with the word2vec algorithm: Start by reading about the word2vec algorithm and its applications in NLP. 

2. Choose a pre-trained word embedding model: There are many pre-trained word embedding models available online, such as Google's Word2Vec, Stanford's GloVe, and Facebook's fastText. 

3. Preprocess the data: Choose a dataset of documents that you want to use for your search engine. 

4. Map the documents to vectors: Use the pre-trained word embedding model to map the words in each document to vectors. Then either average the vectors of the individual words in each document or use a more sophisticated technique such as doc2vec.

5. Implement the search engine: Given a query, map it to a vector using the same technique you used for the documents. Then, retrieve the documents that are most similar to the query vector based on cosine similarity or another distance metric.

6. Write a brief summary of your algorithm and document its usage with some examples.

### Outcomes

1. Implement a semantic search engine using word embeddings.
2. Use pre-trained word embedding models.
3. Map documents to vectors using word embeddings.
4. Discover how cosine similarity can be used to cluster documents.


## Dataset

This assignment uses the same dataset as the EDA assignment: its input is the output you created there. The preprocessed dataset can be loaded from the URL in the cell below:

In [26]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import spacy

data_source = 'https://raw.githubusercontent.com/JamesMTucker/DATA_340_NLP/master/Notebooks/data/news-2023-02-01.csv'

articles = pd.read_csv(data_source)

# Load the small English spaCy pipeline as NLP
NLP = spacy.load('en_core_web_sm')

tqdm.pandas()

### Dataset description

In [27]:
articles.describe()

Unnamed: 0,source,title,text
count,11587,11586,11419
unique,20,716,1062
top,politicususa,Nicolle Wallace Devastates Trump And Shows Why...,Contact Us\nThis material may not be published...
freq,720,127,698


## Preprocessing

Clean, deduplicate, and tokenize the documents. You should be able to repurpose your code from the EDA assignment to do this.

In [28]:
# Drop rows that repeat an earlier (title, source, text) combination.
# duplicated() marks every repeat after the first occurrence as True, so ~ keeps the first copy.
df = articles[~articles.duplicated(['title', 'source', 'text'])]

# Drop rows with a missing value in any column; copy() so that later column
# assignments modify an independent DataFrame rather than a slice of `articles`.
df = df[~df.isna().any(axis=1)].copy()
df.head()

Unnamed: 0,source,title,text
0,politicususa,Prosecutors Pay Attention: Stormy Daniels Than...,Manhattan prosecutors are likely to notice tha...
1,politicususa,Investigators Push For Access To Trump Staff C...,Print\nInvestigators looking into Donald Trump...
2,politicususa,The End Is Near For George Santos As He Steps ...,The AP reported:\nRepublican Rep. George Santo...
3,politicususa,Rachel Maddow Cuts Trump To The Bone With Stor...,Rachel Maddow showed how Trump committed a cri...
4,vox,Alec Baldwin has been formally charged with in...,Candles are placed in front of a photo of cine...
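
The two filters above can also be sketched in plain Python, which makes the "keep the first occurrence" behaviour explicit. This is only an illustration on made-up rows: the `rows` list and the `dedupe_and_drop_missing` helper are hypothetical and not part of the assignment code, and the sketch checks only the three key columns.

```python
def dedupe_and_drop_missing(rows, keys=("title", "source", "text")):
    """Keep the first row for each key combination and skip rows with a missing value."""
    seen = set()
    kept = []
    for row in rows:
        # Skip rows where any key column is missing
        if any(row.get(k) is None for k in keys):
            continue
        signature = tuple(row[k] for k in keys)
        # Skip rows whose key combination has already been kept
        if signature in seen:
            continue
        seen.add(signature)
        kept.append(row)
    return kept


# Hypothetical rows, for illustration only
rows = [
    {"source": "vox", "title": "A", "text": "first"},
    {"source": "vox", "title": "A", "text": "first"},      # exact duplicate
    {"source": "vox", "title": "B", "text": None},         # missing text
    {"source": "thehill", "title": "A", "text": "first"},  # same title, different source
]
clean_rows = dedupe_and_drop_missing(rows)
print(len(clean_rows))
```

Only the first and last rows survive: the second repeats the first, and the third has no text.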


In [29]:
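# Tokenize with spaCy and keep the lowercased lemma of every token.
# Punctuation and newline tokens are still present at this point.
df['tokens'] = df['text'].progress_apply(lambda text: [token.lemma_.lower() for token in NLP(text) if token.lemma_.lower()])
df.head()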

  0%|          | 0/1185 [00:00<?, ?it/s]

Unnamed: 0,source,title,text,tokens
0,politicususa,Prosecutors Pay Attention: Stormy Daniels Than...,Manhattan prosecutors are likely to notice tha...,"[manhattan, prosecutor, be, likely, to, notice..."
1,politicususa,Investigators Push For Access To Trump Staff C...,Print\nInvestigators looking into Donald Trump...,"[print, \n, investigator, look, into, donald, ..."
2,politicususa,The End Is Near For George Santos As He Steps ...,The AP reported:\nRepublican Rep. George Santo...,"[the, ap, report, :, \n, republican, rep., geo..."
3,politicususa,Rachel Maddow Cuts Trump To The Bone With Stor...,Rachel Maddow showed how Trump committed a cri...,"[rachel, maddow, show, how, trump, commit, a, ..."
4,vox,Alec Baldwin has been formally charged with in...,Candles are placed in front of a photo of cine...,"[candle, be, place, in, front, of, a, photo, o..."


In [30]:
# join each token list in "tokens" into one space-separated string -- ONLY RUN THIS CELL ONCE
df['tokens'] = df['tokens'].apply(' '.join)

In [31]:
# clean tokens
import re
def clean_text(article):
  
    article = article.lower()  # Convert to lowercase
    article = re.sub(r"<[^>]*>", "", article)  # Remove HTML tags
    article = re.sub(r"[^a-z0-9]+", " ", article)  # Remove non-alphanumeric characters
    return article.strip()

df['tokens'] = df['tokens'].progress_apply(clean_text)

df

  0%|          | 0/1185 [00:00<?, ?it/s]

Unnamed: 0,source,title,text,tokens
0,politicususa,Prosecutors Pay Attention: Stormy Daniels Than...,Manhattan prosecutors are likely to notice tha...,manhattan prosecutor be likely to notice that ...
1,politicususa,Investigators Push For Access To Trump Staff C...,Print\nInvestigators looking into Donald Trump...,print investigator look into donald trump s al...
2,politicususa,The End Is Near For George Santos As He Steps ...,The AP reported:\nRepublican Rep. George Santo...,the ap report republican rep george santos of ...
3,politicususa,Rachel Maddow Cuts Trump To The Bone With Stor...,Rachel Maddow showed how Trump committed a cri...,rachel maddow show how trump commit a crime th...
4,vox,Alec Baldwin has been formally charged with in...,Candles are placed in front of a photo of cine...,candle be place in front of a photo of cinemat...
...,...,...,...,...
11500,thehill,"White House bids farewell to Klain, as Zients ...","White House bids farewell to Klain, as Zients ...",white house bid farewell to klain as zient off...
11543,thehill,Lawmakers clash over allowing guns in Natural ...,Lawmakers clash over allowing guns in Natural ...,lawmaker clash over allow gun in natural resou...
11559,westernjournal,Pizza Shop Employee Gets Rude Awakening After ...,Pizza Shop Employee Gets Rude Awakening After ...,pizza shop employee get rude awakening after t...
11560,westernjournal,White House Accused of 'Dishonesty and Evasive...,President Joe Biden boards Air Force One at th...,president joe biden board air force one at the...
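
To see what the cleaning step does to a single string, here is a small self-contained check. It repeats the `clean_text` function from the cell above so that it runs on its own; the sample strings are made up for illustration. It also shows why the `tokens` column contains fragments such as `trump s`: the apostrophe is replaced by a space, and any character outside `a-z0-9` (including accented letters) is dropped in the same way.

```python
import re


def clean_text(article):
    article = article.lower()  # Convert to lowercase
    article = re.sub(r"<[^>]*>", "", article)  # Remove HTML tags
    article = re.sub(r"[^a-z0-9]+", " ", article)  # Replace runs of other characters with one space
    return article.strip()


# Made-up sample strings, for illustration only
sample = "Donald Trump's <b>alleged</b>\n crime: 2023!"
print(clean_text(sample))   # donald trump s alleged crime 2023
print(clean_text("café"))   # caf
```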


## Word embeddings

Load the pre-trained word embedding model. You can use the code provided in the lecture notebooks to load the model. Vectorize the documents using the pre-trained word embedding model. You can do this by averaging the vectors of the individual words in each document or using a more sophisticated technique such as doc2vec (see SpaCy and Gensim packages).
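
The averaging option mentioned above can be sketched as follows. This is a minimal illustration, not the approach used in the rest of this notebook (which trains doc2vec): `toy_embeddings` is a made-up three-dimensional lookup table standing in for a pre-trained model, and `average_vector` is a hypothetical helper. Words missing from the embedding table are skipped.

```python
def average_vector(tokens, embeddings, dim):
    """Mean of the vectors for tokens found in `embeddings`; zero vector if none are found."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every vector together
    return [sum(component) / len(known) for component in zip(*known)]


# Made-up embeddings, for illustration only
toy_embeddings = {
    "court": [1.0, 0.0, 0.0],
    "judge": [0.8, 0.2, 0.0],
    "storm": [0.0, 0.0, 1.0],
}
doc_vector = average_vector("the judge leave the court".split(), toy_embeddings, dim=3)
print(doc_vector)
```

Only `judge` and `court` contribute to the average here; `the` and `leave` are out of vocabulary. With NumPy arrays, the same mean could be taken with `np.mean(known, axis=0)`.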

In [32]:
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from collections import namedtuple

In [76]:
# create tagged documents for doc2vec
def create_tagged_document(list_of_lists):
  '''
  Yield one TaggedDocument per document; Doc2Vec trains on an iterable of these.
  params:
    list_of_lists: list of documents, each given as a list of word tokens
  Each document is tagged with its position in the list.
  '''
  for i, list_of_words in enumerate(list_of_lists):
    yield TaggedDocument(list_of_words, [i])

docs = [i.split() for i in df['tokens']]
documents = list(create_tagged_document(docs))


In [77]:
# put original data in arrays for accessing 
data = np.array(df.text)
titles = np.array(df.title)

In [78]:
# Keep the original (untokenized) documents in their original order so that
# Doc2Vec tags can be mapped back to titles and article text
TaggedDoc = namedtuple('TaggedDocument', 'words tags title index')
n = 0
alldocs = []
for i, line in enumerate(data):
  if isinstance(line, str):
    tokens = line.split()
    tags = [n]  # position in alldocs
    title = titles[i]
    alldocs.append(TaggedDoc(tokens, tags, title, i))  # i is the position in data/titles
    n += 1

In [81]:
# example of accessing documents
index = 410
doc = alldocs[index]
print(doc, '\n')
print(data[doc.index])  # look up the original text by its position in data

TaggedDocument(words=["Lowe's", 'Attempts', 'to', 'Thwart', 'Rampant', 'Theft', 'by', 'Developing', 'High-Tech', 'System', "That's", "'Invisible'", 'to', 'Customers', 'By', 'Jack', 'Davis', 'January', '31,', '2023', 'at', '5:11pm', 'MoreShare', 'Home', 'improvement', 'retailer', 'Lowe’s', 'is', 'rolling', 'out', 'a', 'new', 'concept', 'in', 'retail', 'theft', 'prevention', 'that', 'relies', 'on', 'technology', 'to', 'allow', 'shoppers', 'to', 'touch', 'and', 'not', 'just', 'look', 'when', 'they', 'want', 'to', 'buy', 'power', 'tools.', 'Project', 'Unlock', 'is', 'a', 'proof-of-concept', 'system', 'as', 'Lowe’s', 'looks', 'for', 'ways', 'to', 'stop', 'theft', 'without', 'locking', 'up', 'everything', 'before', 'it', 'walks', 'out', 'the', 'door,', 'Lowe’s', 'Chief', 'Digital', 'and', 'Information', 'Officer', 'Seemantini', 'Godbole', 'said,', 'according', 'to', 'Fox', 'Business', '.', 'The', 'process', 'is', 'essentially', '“invisible', 'for', 'the', 'customer.', 'They', 'should', 'not'

In [93]:
# instantiate model
model = Doc2Vec(vector_size=150, window=2, min_count=1, workers=4)
# build model vocabulary
model.build_vocab(documents)

In [94]:
# Vocabulary size
print('Vocabulary size:', len(model.wv.index_to_key))

Vocabulary size: 16453


In [95]:
# Check how many times a word appears in the training corpus
# e.g.:
print(f"Word 'violence' appeared {model.wv.get_vecattr('violence', 'count')} times in the training corpus.")

Word 'violence' appeared 191 times in the training corpus.
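
The vocabulary size and per-word counts reported above depend on the `min_count` setting, which tells gensim to ignore words whose total frequency is below the threshold. The idea can be sketched with a `Counter`. This is a rough illustration on made-up documents, not gensim's implementation; `build_vocab` here is a hypothetical stand-alone helper and not the `model.build_vocab` method.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_count=1):
    """Count every token and keep those seen at least `min_count` times."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {token: n for token, n in counts.items() if n >= min_count}


# Made-up tokenized documents, for illustration only
toy_docs = [
    ["violence", "rise", "in", "city"],
    ["city", "council", "debate", "violence"],
    ["violence", "report"],
]
print(build_vocab(toy_docs, min_count=1)["violence"])  # 3
print(sorted(build_vocab(toy_docs, min_count=2)))      # ['city', 'violence']
```

With `min_count=1`, as used in this notebook, every word that occurs at least once stays in the vocabulary, including words seen in only one article.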


In [96]:
# train the Doc2Vec model
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
# save the model
model.save('model')
# reload the trained model from disk
model = Doc2Vec.load("model")

## Search engine

Write a search engine that can retrieve documents related to a given query based on semantic similarity. Given a query, map it to a vector using the same technique you used for the documents. Then, retrieve the documents that are most similar to the query vector based on cosine similarity or another distance metric.
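
The ranking step described above can be sketched without any library. This is a minimal illustration under stated assumptions: `toy_doc_vectors` holds made-up three-dimensional document vectors, and `cosine_similarity` and `rank_documents` are hypothetical helpers. In the notebook itself this work is delegated to the trained doc2vec model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, doc_vectors, num_results=5):
    """Return the `num_results` (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine_similarity(query_vector, vec)) for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_results]


# Made-up document vectors, for illustration only
toy_doc_vectors = {
    "court ruling": [0.9, 0.1, 0.0],
    "storm warning": [0.0, 0.1, 0.9],
    "election recap": [0.5, 0.5, 0.0],
}
for doc_id, score in rank_documents([1.0, 0.0, 0.0], toy_doc_vectors, num_results=2):
    print(f"{doc_id}: {score:.3f}")
```

Because cosine similarity normalizes by vector length, documents are ranked by direction only, so a long article and a short one on the same topic can score similarly.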

In [103]:
def Search(query, num_results=5): 
  '''
  Query the articles by search term to find similar documents ranked by cosine similarity. 
  This function utilizes a doc2vec model built with the gensim library.
  ** Note: the Search function as currently written will yield the best results with single word searches **

  Parameters
    query: Search word (str)
    num_results: Number of results you want returned. Default value is 5. 
  '''
  query = clean_text(query)
  # load the trained model from disk
  model = Doc2Vec.load("model")
  # infer a vector for the query with the same model that embedded the documents
  query_vector = model.infer_vector(query.split(), alpha=0.001)
  # retrieve the num_results document tags closest to the query by cosine similarity
  tagsim = model.dv.most_similar([query_vector], topn=num_results)
  # print each match once, skipping titles that were already printed
  seen_titles = []
  for tag, similarity in tagsim:
    docsim = alldocs[tag]
    if docsim.title not in seen_titles:
      print("Title : ", docsim.title)
      print("Similarity : ", similarity)
      print("Document : ", data[docsim.index][:200], "\n")
      seen_titles.append(docsim.title)

In [104]:
# Example case: 1 
Search("politics")

Title :  The Joe and Hunter Biden Scandal Convergence | RealClearPolitics
Similarity :  0.43595951795578003
Document :  The Joe and Hunter Biden Scandal Convergence
Byron York , Washington Examiner January 31, 2023
 

Title :  Media Blows Biden Docs 'Scandal' Out of Proportion | RealClearPolitics
Similarity :  0.42821457982063293
Document :  Media Blows Biden Docs 'Scandal' Out of Proportion
Margaret Sullivan , The Guardian February 1, 2023
 

Title :  DOJ is at Biden's Delaware beach home. So why ignore the treasure trove of documents down the road? | Fox News
Similarity :  0.42474251985549927
Document :  Contact Us
This material may not be published, broadcast, rewritten,
      or redistributed. ©2023 FOX News Network, LLC. All rights reserved.
      Quotes displayed in real-time or delayed by at leas 

Title :  The largest House GOP caucus backs more than half a dozen ideas for fiscal reforms as the White House embarks on debt ceiling talks with Republicans.
Similarity :  0.42461496

In [105]:
# Example case: 2
Search("power")

Title :  Democrat Gov. Andy Beshear's tornado relief fund 'erroneously' sent unknown amounts of money to wrong people | Fox News
Similarity :  0.5419399738311768
Document :  Contact Us
This material may not be published, broadcast, rewritten,
      or redistributed. ©2023 FOX News Network, LLC. All rights reserved.
      Quotes displayed in real-time or delayed by at leas 

Title :  DOJ is at Biden's Delaware beach home. So why ignore the treasure trove of documents down the road? | Fox News
Similarity :  0.53704833984375
Document :  Contact Us
This material may not be published, broadcast, rewritten,
      or redistributed. ©2023 FOX News Network, LLC. All rights reserved.
      Quotes displayed in real-time or delayed by at leas 

Title :  Democrat warns Congress doesn't 'totally understand' how to handle classified documents: 'Tip of the iceberg' | Fox News
Similarity :  0.5335896611213684
Document :  Contact Us
This material may not be published, broadcast, rewritten,
      or redi

In [114]:
# Example case: 3
Search("democrats")

Title :  Republicans Move To Remove Ilhan Omar From House Foreign Affairs Committee | HuffPost Latest News
Similarity :  0.4471786320209503
Document :  Politics ilhan omar Foreign Affairs
Republicans Move To Remove Ilhan Omar From House Foreign Affairs Committee
Republicans say it would create "major problems" for Omar to serve on Foreign Affairs giv 

Title :  Republicans Rip Biden Court Pick For Bungling Questions On Constitution | HuffPost Latest News
Similarity :  0.42530784010887146
Document :  Politics Washington judicial nominees U.S. District Court
Republicans Rip Biden Court Pick For Bungling Questions On Constitution
It wasn’t a great moment for Charnelle Bjelkengren. It's also nothing  

Title :  Alaskans for Honest Elections Now Collecting Signatures to Rid State of Ranked-Choice Voting
Similarity :  0.41397905349731445
Document :  ShareShareShare Email
Democrat Rep. Mary Peltola “won” reelection in Alaska to a full term in the House in November after she defeated Sarah Pal

In [115]:
# Example case: 4 (mixed-case query; clean_text lowercases it before inference)
Search("RePuBlICans")

Title :  Republicans Rip Biden Court Pick For Bungling Questions On Constitution | HuffPost Latest News
Similarity :  0.4385392963886261
Document :  Politics Washington judicial nominees U.S. District Court
Republicans Rip Biden Court Pick For Bungling Questions On Constitution
It wasn’t a great moment for Charnelle Bjelkengren. It's also nothing  

Title :  Arizona's Secretary Of State Says Kari Lake's Tweet Broke State Law | HuffPost Latest News
Similarity :  0.4331142008304596
Document :  Politics 2022 elections Arizona Kari Lake
Arizona's Secretary Of State Says Kari Lake's Tweet Broke State Law
Adrian Fontes said the post shared by Lake, featuring a collage of 16 voter signatures, co 

Title :  George Santos OAN Interview Gets Awkward Fast | HuffPost Latest News
Similarity :  0.4267382025718689
Document :  Politics George Santos One America News Network
'You Seem Angry': George Santos OAN Interview Gets Awkward Fast
The mood soured when the New York lawmaker was asked about showin

### Summary

For this semantic search project, I experimented with Gensim's `Doc2Vec` implementation, which I had not used before. As a next step, I would build a second model that averages pre-trained `word2vec` vectors for each document and compare its results with these.

The most challenging part of developing this workflow was figuring out how to link the `TaggedDocument` objects used to train `Doc2Vec` back to the original data, so that similar documents could be retrieved and displayed.

The `Search` function preprocesses the query, infers a query vector, and retrieves the most similar document vectors. For each match, it prints the title, the cosine similarity score, and the first 200 characters of the article.

In the test examples above, the search function worked best for words that occur frequently in the data. For less frequent words, the results appeared closer to random.