# **Word2Vec Model**

If you peeked at my 'News Headlines and NLP' project, you may have noticed my remark that the pre-trained models used there (NLTK's VADER Sentiment Analyzer and TextBlob) are not particularly good at understanding context. This is where **Word Embeddings** come in.

One of the most famous word embedding models is Word2Vec, which was developed at Google. Word2Vec uses a shallow neural network to learn vector representations of words. Unlike a one-hot encoding, where each word is independent of every other word, Word2Vec builds each representation from the words that surround it. This is a ***Distributed Representation***: the vector for each word depends on the other words it appears with. Because context is taken into account, we end up with vector representations in which similar words are grouped together.
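To make the contrast concrete, here is a minimal sketch comparing one-hot vectors with dense vectors. The three-word vocabulary and the dense values are made up for illustration; they are not output from a trained Word2Vec model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vocab = ["king", "queen", "banana"]

# One-hot encoding: each word gets its own axis, so distinct words are orthogonal
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Made-up dense vectors (NOT real Word2Vec output), chosen so that
# "king" and "queen" point in similar directions
dense = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.2]),
    "banana": np.array([0.1, 0.0, 0.9]),
}

one_hot_sim = cosine(one_hot["king"], one_hot["queen"])
dense_sim_close = cosine(dense["king"], dense["queen"])
dense_sim_far = cosine(dense["king"], dense["banana"])

print(one_hot_sim, dense_sim_close, dense_sim_far)
```

With one-hot vectors every pair of distinct words has a similarity of zero, so the encoding carries no notion of relatedness. Dense vectors can place related words close together and unrelated words far apart.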

**Scope:** *In this project, we create word2vec representations of words from two datasets, Fake News and Real News, and determine whether the representations differ from one another. To test the difference, we examine the top-5 most similar words, as measured by cosine similarity, for each word in our list of query words.*

# **Import Libraries**

In [0]:
import pandas as pd
import numpy as np
from numpy import dot
from numpy.linalg import norm
import gensim as gn
from gensim.models import Word2Vec

# **Import Data**

In [0]:
data_dir = '/content/Fake news/fake_or_real_news.csv'
model_fname = 'word2vec_fake.model'
model_rname = 'word2vec_real.model'

In [0]:
data = pd.read_csv(data_dir)

In [0]:
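# Prepend the title to the article body. Note: no separator is added, so the
# title's last word fuses with the body's first word (e.g. "FearDaniel" below).
data["text"] = data["title"].map(str) + data["text"]
# Keep only the combined text and the label, encoded as 1 = FAKE, 0 = REAL
data = data.loc[:,['text','label']]
data['label'] = data['label'].apply(lambda x: 1 if x=='FAKE' else 0)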

In [8]:
data.head(10)

|   | text | label |
|---|------|-------|
| 0 | You Can Smell Hillary’s FearDaniel Greenfield,... | 1 |
| 1 | Watch The Exact Moment Paul Ryan Committed Pol... | 1 |
| 2 | Kerry to go to Paris in gesture of sympathyU.S... | 0 |
| 3 | Bernie supporters on Twitter erupt in anger ag... | 1 |
| 4 | The Battle of New York: Why This Primary Matte... | 0 |
| 5 | Tehran, USA \nI’m not an immigrant, but my gr... | 1 |
| 6 | Girl Horrified At What She Watches Boyfriend D... | 1 |
| 7 | ‘Britain’s Schindler’ Dies at 106A Czech stock... | 0 |
| 8 | Fact check: Trump and Clinton at the 'commande... | 0 |
| 9 | Iran reportedly makes new push for uranium con... | 0 |


# **Split and Clean the Dataset**

Both the ***Real*** and ***Fake*** news articles are in the same DataFrame, so here we split it into two separate frames, because we will build a word2vec representation of each one separately. In the previous project we wrote our own method to clean the text; this time the ***Gensim*** library offers the '***simple_preprocess***' function, which lower-cases and tokenizes the text for us. We still need a function of our own to remove stop-words. Applying '***simple_preprocess***' to every article gives us a list of token lists, which is the tokenized form that the Gensim Word2Vec model expects.
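To show roughly what this step produces without needing Gensim installed, here is a small sketch. `rough_preprocess` is a hypothetical stand-in that I wrote for illustration. It only approximates `simple_preprocess` (lower-casing, letters-only tokens, and what I understand to be Gensim's default length limits of 2 to 15 characters) and is not Gensim's actual implementation. The stop-word list mirrors the one used in this notebook.

```python
import re

# Same stop-words as the notebook's remove_stop_words function
STOP_WORDS = {'mr', 'mrs', 'ms', 'his', 'her', 'he', 'she', 'himself', 'herself'}

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate stand-in for gensim's simple_preprocess (hypothetical helper)."""
    # Lower-case, then keep runs of letters; digits and punctuation split tokens
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Drop tokens that are too short or too long
    return [t for t in tokens if min_len <= len(t) <= max_len]

def drop_stop_words(tokens):
    """Remove the stop-words from a list of tokens."""
    return [t for t in tokens if t not in STOP_WORDS]

articles = ["Mr. Smith said he would vote.", "She met her 3 advisers in Paris!"]

# One list of tokens per article: the "list of lists" shape that Word2Vec consumes
tokenized = [drop_stop_words(rough_preprocess(a)) for a in articles]
print(tokenized)
```

Each inner list is one article, and the outer list is the corpus.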

In [0]:
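def remove_stop_words(text):
  # Titles and gendered pronouns to drop from each tokenized article
  stop_words = ['mr', 'mrs', 'ms', 'his', 'her', 'he', 'she', 'himself', 'herself']
  # The length check also drops any single-character tokens
  clean = [word for word in text if (word not in stop_words) and (len(word) > 1)]
  return clean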

In [0]:
fake_data = data.loc[data['label']==1]
real_data = data.loc[data['label']==0]

In [0]:
fake_text = list(fake_data['text'])
real_text = list(real_data['text'])

In [0]:
fake_text_list = []
for article in fake_text:
  fake_text_list.append(remove_stop_words(list(gn.utils.simple_preprocess(article))))

In [0]:
real_text_list = []
for article in real_text:
  real_text_list.append(remove_stop_words(list(gn.utils.simple_preprocess(article))))

# **Building the Word2Vec Models**

In [0]:
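# size and iter are the gensim 3.x argument names (gensim 4.x renamed them to vector_size and epochs)
word2vec_fake = gn.models.Word2Vec(fake_text_list, size = 75, window = 3, min_count = 5, iter = 15)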

In [0]:
word2vec_real = gn.models.Word2Vec(real_text_list, size = 75, window = 3, min_count = 5, iter = 15)

In [34]:
print("The 5 Most similar Words to 'Hillary' Using the built-in Function (Fake News):\n")
word2vec_fake.wv.most_similar(positive='hillary', topn = 5)

The 5 Most similar Words to 'Hillary' Using the built-in Function (Fake News):

[('hilary', 0.6196736097335815),
 ('foundation', 0.603364109992981),
 ('trump', 0.5960244536399841),
 ('killary', 0.5486205816268921),
 ('bill', 0.5370841026306152)]

In [36]:
print("The 5 Most similar Words to 'Hillary' Using the built-in Function (Real News):\n")
(word2vec_real.wv.most_similar(positive='hillary', topn = 5))

The 5 Most similar Words to 'Hillary' Using the built-in Function (Real News):

[('sanders', 0.6123138070106506),
 ('trump', 0.5801287293434143),
 ('bill', 0.5583798885345459),
 ('husband', 0.4913654029369354),
 ('chelsea', 0.48292475938796997)]

# **Saving and Loading Model**
This step is optional. It is only needed if you want to reload the trained models later without retraining them.

In [37]:
word2vec_fake.save(model_fname)
word2vec_real.save(model_rname)
word2vec_fake = Word2Vec.load(model_fname)
word2vec_real = Word2Vec.load(model_rname)


# **Loading The Query File**

The query words were stored in a separate text file, so they can be easily changed without editing the code.

In [0]:
query_dir = '/content/query.txt'
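query_words = []
# One query word per line; only the first token on each line is used, lower-cased
with open(query_dir) as f:
  for line in f:
    tokens = line.split()
    if tokens:  # skip blank lines instead of failing on them
      query_words.append(tokens[0].lower())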

# **Cosine Similarity**

***Cosine Similarity*** measures how similar two non-zero vectors are by taking the cosine of the angle between them: the dot product of the vectors divided by the product of their norms. As mentioned earlier, similar words sit close together in a Word2Vec representation. We can therefore use cosine similarity to check whether the two datasets differ from one another, by checking whether a word's nearest neighbors in the vector space differ between the two models.
*(i.e., whether the neighboring words of a particular word, say 'immigration', in the Fake News word2vec representation differ from its neighboring words in the Real News word2vec representation)*

In [0]:
def cosine_similarity(model, query, top_num):
  """Return the top_num words most similar to query as (word, similarity) pairs."""
  # Look vectors up through model.wv; indexing the model directly is deprecated
  a = model.wv[query]
  norma = norm(a)
  cosine_sim = {}
  for v in model.wv.vocab:  # gensim 3.x attribute
    if v != query:
      b = model.wv[v]
      cosine_sim[v] = np.dot(a,b)/(norma*norm(b))
  # Sort by similarity, highest first, and keep the top_num entries
  ranked = sorted(cosine_sim.items(), key = lambda item: item[1], reverse = True)
  return ranked[:top_num]
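The same ranking can also be computed without an explicit Python loop. The sketch below is one possible vectorized variant, not the approach used in this notebook. It works on a plain NumPy matrix rather than a Gensim model, and `top_k_cosine`, the toy vocabulary, and the vectors are all made up for illustration.

```python
import numpy as np

def top_k_cosine(words, vectors, query, k):
    """Return the k words most similar to `query` as (word, similarity) pairs."""
    # Scale every row to unit length so a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = words.index(query)
    # Similarity of the query word to every word in one matrix-vector product
    sims = unit @ unit[q]
    # Rank from most to least similar, skipping the query word itself
    order = [i for i in np.argsort(-sims) if i != q][:k]
    return [(words[i], float(sims[i])) for i in order]

# Toy vocabulary and made-up vectors, for illustration only
words = ["cat", "dog", "car", "truck"]
vectors = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
])

neighbours = top_k_cosine(words, vectors, "cat", 2)
print(neighbours)
```

For a vocabulary of tens of thousands of words, a single matrix-vector product like this is usually much faster than looping over the vocabulary one word at a time.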

## **Normalizing Vectors**
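Cosine similarity already divides by the vector norms, so normalizing is not strictly required for the function above. Scaling every vector to unit length up front simply makes a plain dot product equal to the cosine similarity. Note that `init_sims(replace = True)` overwrites the original vectors with the normalized ones, so the models cannot be trained further afterwards.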

In [0]:
# Normalize Vectors
word2vec_fake.init_sims(replace = True)
word2vec_real.init_sims(replace = True)
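Here is a quick check of how unit-length normalization relates to cosine similarity, using two made-up vectors rather than vectors from the models. Since the cosine formula divides by both norms, rescaling the vectors does not change the result. Once the vectors have unit length, the dot product alone gives the same number.

```python
import numpy as np
from numpy.linalg import norm

# Two made-up vectors, for illustration only
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity on the raw vectors: the norms appear in the denominator
cos_raw = np.dot(a, b) / (norm(a) * norm(b))

# After scaling each vector to unit length, the plain dot product gives the same value
a_unit = a / norm(a)
b_unit = b / norm(b)
cos_unit = np.dot(a_unit, b_unit)

print(cos_raw, cos_unit)
```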

In [31]:
print("The Query Words Are:\t{}".format(query_words))

The Query Words Are:	['hillary', 'trump', 'obama', 'immigration']


In [39]:
fo = open("Output.txt", "w")
print("Fake News Data -- Top 5 Similar Words to Query Words")
fo.write("Fake News Data -- Top 5 Similar Words to Query Words\n")
for word in query_words:
  sim_words = cosine_similarity(word2vec_fake, word, 5)
  print(word)
  fo.write("\t{}\n".format(word))
  for sw in sim_words:
    print("\t{}".format(sw))
    fo.write("\t\t{}\n".format(sw))
print("\nReal News Data -- Top 5 Similar Words to Query Words")
fo.write("\n\nReal News Data -- Top 5 Similar Words to Query Words\n")
for word in query_words:
  sim_words = cosine_similarity(word2vec_real, word, 5)
  print(word)
  fo.write("\t{}\n".format(word))
  for sw in sim_words:
    print("\t{}".format(sw))
    fo.write("\t\t{}\n".format(sw))
fo.close()

Fake News Data -- Top 5 Similar Words to Query Words
hillary
	('hilary', 0.6196736)
	('foundation', 0.603364)
	('trump', 0.59602445)
	('killary', 0.5486205)
	('bill', 0.5370841)
trump
	('hillary', 0.59602445)
	('hrc', 0.5110749)
	('sanders', 0.50072205)
	('bernie', 0.4949449)
	('clinton', 0.49419704)
obama
	('bush', 0.5795401)
	('reagan', 0.5775068)
	('congress', 0.49343568)
	('carter', 0.48429)
	('saakashvili', 0.47779813)
immigration
	('international', 0.57597435)
	('discrimination', 0.53561467)
	('neoliberal', 0.5291665)
	('tax', 0.52200806)
	('domestic', 0.51336336)

Real News Data -- Top 5 Similar Words to Query Words
hillary
	('sanders', 0.61231387)
	('trump', 0.5801288)
	('bill', 0.55838)
	('husband', 0.4913655)
	('chelsea', 0.48292473)
trump
	('sanders', 0.6144249)
	('hillary', 0.5801288)
	('candidacy', 0.5760898)
	('romney', 0.5731019)
	('mogul', 0.5703247)
obama
	('netanyahu', 0.5898408)
	('congress', 0.50710166)
	('hollande', 0.5024687)
	('bush', 0.48952076)
	('sanders', 0.46739402)
immigration
	('abortion', 0.6159248)
	('enti

# **Conclusion and Parting Thoughts**

If you look at the top-5 most similar words to 'Hillary' in the Fake News representation, the 4th most similar word is 'Killary', which suggests that a sizable share of the Fake News articles in our dataset was anti-Hillary rhetoric from the 2016 General Election season. For the other query words, the similarities across the two datasets seem to be consistent.
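One possible way to put a number on how much two neighbor lists agree is the Jaccard overlap: the size of their intersection divided by the size of their union. This is my own addition for illustration, not part of the analysis above, and `neighbour_overlap` is a hypothetical helper. The two lists are the top-5 neighbors of 'hillary' copied from the printed output.

```python
def neighbour_overlap(neighbours_a, neighbours_b):
    """Jaccard overlap: size of the intersection divided by size of the union."""
    a, b = set(neighbours_a), set(neighbours_b)
    return len(a & b) / len(a | b)

# Top-5 neighbours of 'hillary', copied from the output shown earlier
fake_hillary = ['hilary', 'foundation', 'trump', 'killary', 'bill']
real_hillary = ['sanders', 'trump', 'bill', 'husband', 'chelsea']

shared = sorted(set(fake_hillary) & set(real_hillary))
overlap = neighbour_overlap(fake_hillary, real_hillary)

print(shared, overlap)
```

A value of 1.0 would mean the two models return identical neighbors for the word, and 0.0 would mean they share none.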

But note the magic of the word2vec model. In both representations, the words most similar to 'Obama' are previous presidents or other world leaders; the words most similar to 'Trump' are other presidential candidates; the words most similar to 'Immigration' are other hot-button issues. This is because, unlike the models from the previous project, the word2vec model learns from context and surrounding words. *Each word depends on the words around it.*

In this project we used a labeled dataset, but labeled datasets are often hard to come by. Luckily, we can use the properties of word embeddings to our advantage. In word2vec models, semantically similar words sit close to one another in the vector space, and we can use this property to cluster and classify observations in an unlabeled dataset. That is the scope of my next NLP project, unsupervised sentiment analysis, which is actually a sub-project of a larger project on identifying market opportunities.