<div style="display: block; width: 100%; height: 120px;">

<p style="float: left;">
    <span style="font-weight: bold; line-height: 24px; font-size: 16px;">
        DIGHUM160 - Critical Digital Humanities
        <br />
        Digital Hermeneutics 2020
    </span>
    <br >
    <span style="line-height: 22px; font-size: 14px; margin-top: 10px;">
        Week 4-3: Word Embeddings<br />
        Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk)<br />
    </span>
</p>
</div>

# Word Embeddings

Today, we'll have a look at word embeddings using Gensim's `word2vec` and `doc2vec` methods. 

The goal of word vector embedding models is to learn dense, numerical vector representations for each term in a corpus vocabulary. If successful, the vectors for each term encode information about the meaning or concept the term represents, as well as the relationship between it and other terms in the vocabulary. Word vector models are fully unsupervised: they learn all of these meanings and relationships from the way words co-occur in the corpus, without labelled data or any other advance knowledge.
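To make "dense vector" a bit more concrete, here is a toy sketch. The three-dimensional vectors below are invented by hand purely for illustration (a trained model learns hundreds of dimensions from the corpus), and `toy_vectors` and `cosine_similarity` are hypothetical names. Cosine similarity is a common way to compare two word vectors: the closer to 1, the more similar their direction.

```python
import math

# Hypothetical, hand-written vectors; a trained model would learn these values.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

# "cat" should come out closer to "dog" than to "car"
print(round(cat_dog, 3), round(cat_car, 3))
```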

After working through today's notebook, you'll be able to:

1. Use Gensim's word2vec method to create word vectors for a corpus;
2. Use these word vectors to reflect on implicit binaries and normativities in your data;
3. Cluster the resulting word vectors using K-means clustering.

**Note: I encourage you to use your own dataset from here on out to start looking at semantic patterns, regularities, ideologies, myths, and so on, that are relevant to your essay.**

In [None]:
# General
from pprint import pprint
from collections import Counter
import os
import re
import logging
import string
import pickle
import numpy as np
import pandas as pd
import smart_open
import multiprocessing 
from time import time  # To time our operations
from collections import defaultdict  # For word frequency

# Gensim
import gensim
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from gensim.models.phrases import Phrases, Phraser

# NLTK
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer 
wordnet_lemmatizer = WordNetLemmatizer()

# Spacy
import spacy 

# Plotting
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

# Clustering 
from sklearn.cluster import KMeans
from sklearn.neighbors import KDTree
from sklearn.manifold import TSNE

# Suppressing warnings
import warnings
warnings.simplefilter("ignore", DeprecationWarning)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

## Preprocessing

We'll start by cleaning up our data a bit. Let's load it up.

In [None]:
downloaded = drive.CreateFile({'id':"1nY9JtXoGJa7B-OmU6afh4qfcFGPCQIHW"})   
downloaded.GetContentFile('TRP-comments.csv')

In [None]:
# load into df
df_com = pd.read_csv("TRP-comments.csv", lineterminator='\n')

In [None]:
# Get rid of empty values and reset index
df_com = df_com[~df_com['body'].isin(['[removed]', '[deleted]' ])].dropna(subset=['body']).reset_index(drop=True)

Let's create a small function that cleans up our text by replacing tabs with spaces, removing newlines, and replacing every character that is not a letter or a dot with a space. It also collapses runs of spaces into a single space and removes leading and trailing spaces.

In [None]:
def clean_text(text):
    # Replace tabs with spaces and remove newlines
    no_tabs = text.replace('\t', ' ').replace('\n', '')
    # Replace every character except letters (a-z, A-Z) and the dot with a space
    alphas_only = re.sub(r"[^a-zA-Z.]", " ", no_tabs)
    # Collapse runs of spaces into a single space
    single_spaced = re.sub(" +", " ", alphas_only)
    # Strip leading and trailing spaces
    return single_spaced.strip()

In [None]:
df_com

Unnamed: 0.1,Unnamed: 0,idint,idstr,created,author,parent,submission,body,score,subreddit,distinguish,textlen,body_clean
0,1650944,28219675724,t1_cyp9l58,1452171945,Stories_of_Red,t3_3zv11k,t3_3zv11k,"If you marry this woman, do not ever blame her...",1897,TheRedPill,,191,If you marry this woman do not ever blame her ...
1,2616308,30034466365,t1_dspqukd,1516029389,KirthWGersen,t3_7qk2y3,t3_7qk2y3,The scary thing about the Ansari case is that ...,1776,TheRedPill,,1297,The scary thing about the Ansari case is that ...
2,2457822,29621801610,t1_dlw20ei,1503252966,Thotwrecker,t3_6uw3cv,t3_6uw3cv,This subject comes up in many forms and I can ...,1474,TheRedPill,,5553,This subject comes up in many forms and I can ...
3,2630144,30079740149,t1_dtgp81h,1517320675,2comment,t3_7u0msk,t3_7u0msk,">I don't understand her thought process, it's ...",1366,TheRedPill,,616,I don t understand her thought process it s li...
4,2572373,29920059025,t1_dqtmpap,1512512112,bickisnotmyname,t3_7ht6tk,t3_7ht6tk,Haha. Wow lol 😝. I can’t believe what an idiot...,1284,TheRedPill,,262,Haha. Wow lol . I can t believe what an idiot ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,2151848,28973419429,t1_db60xid,1481679010,blacwidonsfw,t1_db5jjjy,t3_5i50pn,Lol if you get rejected at that point then you...,107,TheRedPill,,417,Lol if you get rejected at that point then you...
9996,859738,27425728147,t1_clkkj9v,1414362203,Position5hero,t3_2kenbj,t3_2kenbj,Shoutout to the old granny in the comments sec...,107,TheRedPill,,634,Shoutout to the old granny in the comments sec...
9997,1560210,28097868149,t1_cwoqtth,1446678660,TRPhd,t1_cwoqjmo,t3_3rjch9,"Never, ever, under any circumstances, intentio...",107,TheRedPill,,510,Never ever under any circumstances intentional...
9998,2208740,29076667233,t1_dcvhw0x,1485319263,dammit_redskins,t3_5q09dd,t3_5q09dd,I swear to christ if the progression of humani...,107,TheRedPill,,100,I swear to christ if the progression of humani...


We can now use the Pandas `.apply` method, which applies a function to every value in a column (or along an axis of the DataFrame). We'll pass it a lambda function: a small anonymous function that can take any number of arguments but consists of a single expression. This is what a lambda looks like. Do you see how it works?

In [None]:
df_com["body_clean"] = df_com["body"].apply(lambda x: clean_text(x))
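To unpack the lambda a little: `lambda x: f(x)` simply hands its argument on to `f`, so it behaves exactly like `f` itself. Here is a minimal sketch in plain Python; `shout` and the sample strings are made up for illustration.

```python
def shout(text):
    """A hypothetical stand-in for a cleaning function."""
    return text.upper() + "!"

# An anonymous function that forwards its single argument to shout()
as_lambda = lambda x: shout(x)

comments = ["hello", "word embeddings"]
with_lambda = [as_lambda(c) for c in comments]
with_name = [shout(c) for c in comments]

# Both approaches give the same result
print(with_lambda)
print(with_lambda == with_name)
```

For the same reason, passing the function directly, as in `df_com["body"].apply(clean_text)`, should give the same column as the lambda version.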

We now have an additional column in our DataFrame with cleaned up text.

In [None]:
df_com.head()

Unnamed: 0.1,Unnamed: 0,idint,idstr,created,author,parent,submission,body,score,subreddit,distinguish,textlen,body_clean
0,1650944,28219675724,t1_cyp9l58,1452171945,Stories_of_Red,t3_3zv11k,t3_3zv11k,"If you marry this woman, do not ever blame her...",1897,TheRedPill,,191,If you marry this woman do not ever blame her ...
1,2616308,30034466365,t1_dspqukd,1516029389,KirthWGersen,t3_7qk2y3,t3_7qk2y3,The scary thing about the Ansari case is that ...,1776,TheRedPill,,1297,The scary thing about the Ansari case is that ...
2,2457822,29621801610,t1_dlw20ei,1503252966,Thotwrecker,t3_6uw3cv,t3_6uw3cv,This subject comes up in many forms and I can ...,1474,TheRedPill,,5553,This subject comes up in many forms and I can ...
3,2630144,30079740149,t1_dtgp81h,1517320675,2comment,t3_7u0msk,t3_7u0msk,">I don't understand her thought process, it's ...",1366,TheRedPill,,616,I don t understand her thought process it s li...
4,2572373,29920059025,t1_dqtmpap,1512512112,bickisnotmyname,t3_7ht6tk,t3_7ht6tk,Haha. Wow lol 😝. I can’t believe what an idiot...,1284,TheRedPill,,262,Haha. Wow lol . I can t believe what an idiot ...


Let's turn it into a list.

In [None]:
text_li = df_com['body_clean'].tolist()

Next, we'll create a function that uses NLTK's `sent_tokenize()` to split each text into sentences, and Gensim's `simple_preprocess()` to split each sentence into lowercase tokens. We'll also remove stopwords.

In [None]:
def sentence_tokenize(text):
    sentence_doc = sent_tokenize(text)
    sentences = [gensim.utils.simple_preprocess(str(doc), deacc=True) for doc in sentence_doc]  # simple_preprocess lowercases and drops punctuation; deacc=True also strips accent marks
    stop = set(stopwords.words('english') + ['’', '“', '”', 'nbsp', 'http'])
    no_stop = [[word for word in sentence if word not in stop] for sentence in sentences]
    return no_stop

In [None]:
com_sent_li = [sentence_tokenize(text) for text in text_li]

Note that we now have a list (of comments) of lists (sentences) of lists (tokens). Let's index the first token of the first sentence of the first comment:

In [None]:
com_sent_li[0][0][0]

'marry'

We don't actually need the comment-level demarcation for the rest of our analysis, so we can *flatten* our `com_sent_li` object into a list (of sentences) of lists (of tokens).

In [None]:
sent_li = []
for comment in com_sent_li:
    for sentence in comment:
        sent_li.append(sentence)

Writing the same in a list comprehension looks like this, by the way:

In [None]:
sent_li = [sentence for comment in com_sent_li for sentence in comment]

Next, let's create a trigram model, which joins frequently co-occurring words into single tokens such as `single_mother`, using Gensim's `Phrases` and `Phraser` classes:

In [None]:
bigram = Phrases(sent_li, min_count=5, threshold=80)
trigram = Phrases(bigram[sent_li], threshold=80)  
bigram_mod = Phraser(bigram)
trigram_mod = Phraser(trigram)
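As a rough guide to what `min_count` and `threshold` do here: as far as I know, Gensim's default scorer rates a word pair by how much more often it occurs together than its individual word counts would suggest, and only pairs scoring above `threshold` are joined. The sketch below re-implements that formula by hand; `phrase_score` is a hypothetical helper and all counts are invented.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Collocation score in the style of Gensim's default scorer (a sketch)."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# Hypothetical counts, invented for illustration
vocab_size = 10_000
# Two fairly rare words that almost always appear together
rare_pair = phrase_score(count_a=40, count_b=50, count_ab=30, vocab_size=vocab_size)
# Two very common words that sometimes happen to be adjacent
common_pair = phrase_score(count_a=4000, count_b=5000, count_ab=300, vocab_size=vocab_size)

threshold = 80
print(rare_pair > threshold, common_pair > threshold)
```

A higher `threshold` therefore yields fewer, more distinctive phrases, while a lower one joins more word pairs.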



And let's run that model over our list of lists.

In [None]:
trigrams = [trigram_mod[bigram_mod[sentence]] for sentence in sent_li]

## Word2Vec

Let's create our word embeddings model. Its input is a text corpus (split up into sentences) and its output is a set of vectors in N dimensions, one per term. Words that are used in similar ways end up with vectors that lie close together in vector space. We can then reduce the dimensionality to visualize the results in a way humans can understand (such as in a 2-dimensional space), or perform linear algebra on the vectors to find out how words are related.

Word2vec is one example of a word embeddings model. It learns by trying to predict a word from the words that surround it (or, the other way around, the surrounding words from the word itself). Given enough data, usage and contexts, word2vec can make accurate guesses about a word’s meaning based on its appearances. Those guesses can be used to establish a word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or to cluster documents and classify them by topic.
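To give a feel for what "words and their contexts" means in practice, here is a simplified sketch of the (target, context) pairs a skip-gram model trains on. `skipgram_pairs` and the sample sentence are hypothetical; actual word2vec training adds sampling tricks and a small neural network on top of this.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["never", "blame", "the", "game"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

Words that keep showing up with the same context words are pushed towards similar vectors during training.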

### How many cores?
Word2Vec can work using independent threads doing simultaneous training. In general, you'll never want to use more workers than the number of CPU cores you have in your machine. So let's check out how many you have.



In [None]:
cores = multiprocessing.cpu_count() # Count the number of cores in your computer
cores

2

We now instantiate and train our Word2Vec model, using the parameters below.

In [None]:
num_features = 300        # Word vector dimensionality (how many features each word will be given)
min_word_count = 2        # Minimum word count to be taken into account
num_workers = cores       # Number of threads to run in parallel (equal to your amount of cores)
context = 10              # Context window size
downsampling = 1e-2       # Downsample setting for frequent words
seed_n = 1                # Seed for the random number generator (fully reproducible runs also require a single worker thread)
sg_n = 1                  # Skip-gram = 1, CBOW = 0

model = Word2Vec(trigrams, workers=num_workers,
                 size=num_features, min_count=min_word_count,
                 window=context, sample=downsampling, seed=seed_n, sg=sg_n)

That was it! We have a Word Embeddings model now.
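A note on versions: the cell above uses the Gensim 3 keyword `size`. In Gensim 4 this keyword was renamed to `vector_size`, so if the cell raises a `TypeError` on your machine, that is the likely cause. Below is a small sketch of picking the right keyword; `word2vec_kwargs` is a hypothetical helper, not part of Gensim.

```python
def word2vec_kwargs(gensim_version, num_features):
    """Return the dimensionality keyword that matches the Gensim major version."""
    major = int(gensim_version.split(".")[0])
    key = "vector_size" if major >= 4 else "size"
    return {key: num_features}

print(word2vec_kwargs("3.8.3", 300))
print(word2vec_kwargs("4.3.2", 300))
```

You could then call `Word2Vec(trigrams, **word2vec_kwargs(gensim.__version__, num_features), ...)` with the remaining parameters unchanged. Gensim 4 also renamed `model.wv.vocab` to `model.wv.key_to_index` and `model.wv.index2word` to `model.wv.index_to_key`.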

How many terms are in our vocabulary?

In [None]:
print('{:,} terms in the vocabulary.'.format(len(model.wv.vocab)))

16,814 terms in the vocabulary.


### Getting related terms

With the information in our word embeddings model, we can try to find similarities between words that interest us (i.e. words that have a similar vector). Let's create a function that retrieves the terms most closely related to some input term.

In [None]:
def get_related_terms(token, topn=20):
    """
    Look up the topn most similar terms to token and print each one with its similarity score.
    """
    # Query the trained vectors via model.wv (calling most_similar on the model itself is deprecated)
    for word, similarity in model.wv.most_similar(positive=[token], topn=topn):
        print(word, round(similarity, 3))

In [None]:
get_related_terms(u'man')

woman 0.669
expects 0.641
queen 0.632
worthy 0.628
submit 0.626
deserves 0.625
loves 0.62
strong_independent 0.616
desired 0.615
fails 0.611
youth 0.607
loyal 0.606
passionate 0.604
charming 0.603
dude 0.603
chooses 0.602
single_mother 0.602
lonely 0.601
learns 0.601
guy 0.6




### Word algebra

Word algebra, also known as analogy completion, means doing math with words (like the famous example "king - man + woman = queen"). The core idea is that once words are represented as numerical vectors, you can add and subtract them like any other vectors. The procedure works as follows:

1. Provide a set of words or phrases you want to add or subtract.
2. Look up the vectors that represent those terms in the word vector model.
3. Add and subtract those vectors to produce a new, combined vector.
4. Look up the most similar vector(s) to this new, combined vector via cosine similarity.
5. Return the word(s) associated with the similar vector(s).
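The five steps above can be sketched by hand on a toy vocabulary. The vectors below are invented so that the analogy works out cleanly, and `toy_word_algebra` is a hypothetical helper; Gensim's own implementation differs in its details (it normalizes the vectors first, for instance).

```python
import math

# Hypothetical 4-dimensional vectors: [royalty, male, female, fruit]
vectors = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def toy_word_algebra(add, subtract, topn=1):
    # Steps 1-3: look up the vectors, then add and subtract them
    dims = len(next(iter(vectors.values())))
    combined = [0.0] * dims
    for word in add:
        combined = [c + v for c, v in zip(combined, vectors[word])]
    for word in subtract:
        combined = [c - v for c, v in zip(combined, vectors[word])]
    # Steps 4-5: rank all other words by cosine similarity to the combined vector
    candidates = [w for w in vectors if w not in add and w not in subtract]
    ranked = sorted(candidates, key=lambda w: cosine(combined, vectors[w]), reverse=True)
    return ranked[:topn]

result = toy_word_algebra(add=["king", "woman"], subtract=["man"])
print(result)
```

Note that the input words themselves are left out of the candidates; otherwise the nearest neighbour would often just be one of the words we started with.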

Let's try it out. We'll create a function that does this for us.

In [None]:
def word_algebra(add=None, subtract=None, topn=1):
    """
    Combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s).
    """
    # None defaults avoid sharing one mutable list between calls
    answers = model.wv.most_similar(positive=add or [], negative=subtract or [], topn=topn)

    for term, similarity in answers:
        print(term)

In [None]:
word_algebra(add=['game','dating'])

ltrs




In [None]:
word_algebra(add=['game', 'dating'], subtract=['beta'])

training




## K-means clustering (advanced)
One convenience of word embeddings is that we can cluster them using, for instance, K-Means clustering. Don't worry if you don't understand all of the following, just check out how it works!

K-Means clustering aims to partition N observations into K clusters in which each observation belongs to the cluster with the nearest mean (called the "cluster centre"), which serves as a prototype of the cluster.

Since our words are all represented as vectors, applying K-Means is straightforward: the clustering algorithm simply looks at the distances between the word vectors and the cluster centers.
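If you're curious what happens under the hood, the core of K-Means is a loop of two steps: assign every point to its nearest center, then move each center to the mean of its points. Below is a bare-bones sketch on made-up 2D points; `toy_kmeans` is a hypothetical helper that uses fixed starting centers, whereas scikit-learn's `KMeans` picks them with `k-means++`.

```python
import math

def toy_kmeans(points, centers, n_iter=10):
    """Plain K-Means loop: assign points to the nearest center, then recompute the centers."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centers.append(centers[k])  # leave an empty cluster's center in place
        centers = new_centers
    return centers, labels

# Two obvious groups of made-up points
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
centers, labels = toy_kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
print(labels)
print(centers)
```

With word vectors the idea is the same, only in 300 dimensions instead of 2.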

In [None]:
def clustering_on_wordvecs(word_vectors, num_clusters):
    # Initialize a k-means object and use it to extract centroids
    kmeans_clustering = KMeans(n_clusters=num_clusters, init='k-means++')
    # idx holds the cluster label assigned to each word vector
    idx = kmeans_clustering.fit_predict(word_vectors)
    return kmeans_clustering.cluster_centers_, idx

In [None]:
Z = model.wv.vectors # The raw word vectors, one row per vocabulary term (called syn0 in older Gensim versions)

In [None]:
centers, clusters = clustering_on_wordvecs(Z, 10)
centroid_map = dict(zip(model.wv.index2word, clusters))

Next, we get the words in each cluster that are closest to the cluster center. To do this, we initialize a KDTree on the word vectors and query it for the top K words nearest to each cluster center. Using the `index2word` list, we then map each word vector back to its original word and add the words to a DataFrame for easier printing.

In [None]:
def get_top_words(index2word, k, centers, wordvecs):
    tree = KDTree(wordvecs)
    # Query the tree for the k points closest to each cluster center
    closest_points = [tree.query(np.reshape(x, (1, -1)), k=k) for x in centers]
    closest_words_idxs = [x[1] for x in closest_points]
    # Map each index back to its word and collect the words per cluster in a dictionary
    closest_words = {}
    for i in range(len(closest_words_idxs)):
        closest_words['Cluster #' + str(i)] = [index2word[j] for j in closest_words_idxs[i][0]]
    # Create a DataFrame from the dictionary, with rows numbered from 1
    df = pd.DataFrame(closest_words)
    df.index = df.index + 1
    return df

Let’s get the 5,000 words closest to each cluster center and print the first 10 for each cluster:

In [None]:
top_words = get_top_words(model.wv.index2word, 5000, centers, Z)

In [None]:
top_words[:10]

Unnamed: 0,Cluster #0,Cluster #1,Cluster #2,Cluster #3,Cluster #4,Cluster #5,Cluster #6,Cluster #7,Cluster #8,Cluster #9
1,fortress,alpha_widow,wider,prospects,sexual_marketplace,insightful,climbs,cooked,sandals,punching_bag
2,overreaction,picky,ai,steal,elevate,commenters,puke,cocks,rats,cautious
3,butthole,heh,sciences,worthwhile,involves,depths,tracks,min,predicted,leveraged
4,kbbg,emotional_tampon,transferring,scarcity,pragmatic,tho,wandering,wakes,vulgar,failings
5,stayin,mess,structures,hassle,nurture,agreeing,handy,midnight,affectionate,inconsistent
6,outstanding,cuz,depressants,chores,principle,helpful,abstained,truck,slowing,cake_eat
7,fetuses,enjoys,sci,declining,undermine,misunderstood,shoved,waited,pains,crudely
8,nucleus,shrug,jurisprudence,lifelong,conflict,analogy,crosshairs,boot,duality,situational
9,swimmers,fooled,dogma,companionship,unwilling,askwomen,brad,crawling,nutjobs,victimization
10,dissolve,scumbag,canadian,prioritize,feminine_imperative,mentioning,borrow,lb,researched,condemns
