<div style="display: block; width: 100%; height: 120px;">

<p style="float: left;">
    <span style="font-weight: bold; line-height: 24px; font-size: 16px;">
        DIGHUM160 - Critical Digital Humanities
        <br />
        Digital Hermeneutics 2019
    </span>
    <br >
    <span style="line-height: 22px; font-size: 14px; margin-top: 10px;">
        Week 4-4: WORD EMBEDDINGS<br />
        Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk)<br />
        Partly taken from David Bradway's excellent <a href="https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb">"Modern NLP in Python"</a>.<br />
    </span>
</p>

<img style="width: 240px; height: 120px; float: right; margin: 0 0 0 0;" src="http://www.merritt.edu/wp/histotech/wp-content/uploads/sites/275/2018/08/berkeley-logo.jpg" />
</div>

# Word Embeddings

Today, we'll have a look at word embeddings using Gensim's `word2vec` method. 

The goal of word vector embedding models is to learn dense, numerical vector representations for each term in a corpus vocabulary. If successful, the vectors for each term encode information about the meaning or concept the term represents, as well as the relationship between it and other terms in the vocabulary. Word vector models are fully unsupervised: they learn all of these meanings and relationships from the raw text alone, without any labels or other advance knowledge.

After working through today's notebook, you'll be able to:

1. Use Gensim's word2vec method to create word vectors for a corpus;
2. Use these word vectors to reflect on implicit binaries and normativities in your data;
3. Explore and visualize the word vectors using K-means clustering and t-SNE.

In [1]:
# General
from pprint import pprint
from collections import Counter
import os
import re
import logging
import string
import pickle
import numpy as np
import pandas as pd

# Gensim
import gensim
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# NLTK
from nltk.corpus import stopwords
from nltk import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer 
wordnet_lemmatizer = WordNetLemmatizer()

# Plotting
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

# Clustering 
from sklearn.cluster import KMeans
from sklearn.neighbors import KDTree
from sklearn.manifold import TSNE

# Suppressing warnings
import warnings
warnings.simplefilter("ignore", DeprecationWarning)

## Preprocessing

We'll start by cleaning up our data a bit. Let's load it up. (Make sure it's in the same folder as this notebook. You can find it in the Class Dropbox.)

In [30]:
# load into df
dfSub = pd.read_csv("/Users/tomvannuenen/Jupyter notebooks/My notebooks/TRP-submissions-full.csv", lineterminator='\n')


How many entries do we have?

In [4]:
len(dfSub)

104855

Let's get rid of the empty values (i.e. the submissions that have been deleted) to save us some trouble.

In [5]:
dfSubClean = dfSub[~dfSub['selftext'].isin(['[removed]', '[deleted]'])].dropna(subset=['selftext'])
print(len(dfSubClean))

43788

Let's turn that into a list, just for clarity's sake.

In [7]:
data = dfSubClean.selftext.tolist()

Let's have a look at what our texts actually look like.

In [8]:
data[1][:1000]

"I have yet to see a serious discussion on sexual strategy that goes beyond the pickup. And although I find men's rights to be eye-opening on a moral and legal ground, I have yet to see serious discussion on how to operate in light of the red-pill information they discuss.\n\nThe questions that need to be asked, (and answered) are along these lines:\n\nShould you focus on relationships and/or sex? Is there benefit to exclusivity? How do you keep a marriage together in light of hypergamy? Is there love?"

We've got lots of newline characters here, as well as some single quotes. Let's get rid of them using regex.
If you want to learn more about regex (you should!), see https://www.regular-expressions.info/tutorial.html

In [9]:
# Collapse newlines and other runs of whitespace into a single space
data = [re.sub(r'\s+', ' ', txt) for txt in data]

# Remove distracting single quotes
data = [re.sub("'", "", txt) for txt in data]

# Remove tags
data = [re.sub("&lt;/?.*?&gt;", " &lt;&gt; ", txt) for txt in data]
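
To see what the first two substitutions do, here is a small sketch on a made-up string (the sample text is an assumption, not taken from the corpus):

```python
import re

# Hypothetical sample text, not taken from the corpus
sample = "Don't  panic.\n\nIt's\tfine."

# Collapse every run of whitespace (spaces, tabs, newlines) into one space
collapsed = re.sub(r"\s+", " ", sample)

# Drop single quotes, so "Don't" becomes "Dont"
no_quotes = re.sub(r"'", "", collapsed)

print(no_quotes)
```

Note that dropping the quotes turns contractions into single tokens ("Dont", "Its"), which is why tokens such as "im" show up later on.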

Let's also prepare a stopwords list, which we can use in our tokenizer function further on.

In [10]:
# prepare stopwords
stop = set(stopwords.words('english') + ['’', '“', '”', 'nbsp', 'http'])

Now, let's split our data by sentences, as word2vec expects this input (though as long as you have a lot of data, you can use documents as input as well). We'll also get rid of the stopwords.

In [11]:
def sentenceTokenizer(text):
    """Returns a list of lists with tokenized sentences"""
    sentenceDoc = sent_tokenize(text)
    sentences = [gensim.utils.simple_preprocess(str(doc), deacc=True) for doc in sentenceDoc]  # lowercases, tokenizes and drops punctuation; deacc=True also strips accent marks
    noStop = [[word for word in sentence if word not in stop] for sentence in sentences]
    return noStop

sentenceDocs = [sentenceTokenizer(text) for text in data]

We've got a list (docs) of lists (sentences) of lists (words) now. But we need a list of lists. So let's flatten it (note that this means we are not passing in any information of "documents" into our word embeddings model, just sentences consisting of tokens).

In [14]:
sentences = [item for sublist in sentenceDocs for item in sublist]
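
If the nested comprehension above looks cryptic, this toy sketch shows the same flattening on a tiny, made-up input (the names `toy_docs` and `toy_sentences` are just for illustration):

```python
# Hypothetical toy input: two documents, each a list of tokenized sentences
toy_docs = [
    [["first", "sentence"], ["second", "one"]],
    [["another", "document"]],
]

# Same nested comprehension as above: the outer loop walks the documents,
# the inner loop walks the sentences inside each document
toy_sentences = [sentence for doc in toy_docs for sentence in doc]

print(toy_sentences)
```

The document boundaries are gone afterwards: all that remains is one flat list of tokenized sentences.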

How many total sentences do we have?

In [15]:
len(sentences)

1294276

And what does it look like?

In [16]:
sentences[0][:10]

['im', 'going', 'discuss', 'briefly', 'intention', 'subreddit']

Looking good.

### Pickling

We can save our docs into an object, in case we want to use it later. This object is called a "pickle"; "pickling" is a way to convert a Python object (list, dict, etc.) into a byte stream that can be saved to disk. This byte stream contains all the information necessary to reconstruct the object in another Python script.

*Note that while this is a smart way to save Python objects, if you want to use a different programming environment to work with your data, it's better to save files to JSON or CSV!*

In [17]:
# save in pickle
with open("sentences.sent", "wb") as docP: 
    pickle.dump(sentences, docP)

The "sentences.sent" file is in the Dropbox, in case tokenizing the corpus takes too long for your machine. If so, here's how you import a pickle:

In [18]:
with open("sentences.sent", "rb") as cp: 
    sentences = pickle.load(cp)
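
For comparison, here is a minimal sketch of the JSON route mentioned in the note above. It uses a tiny made-up list instead of the real `sentences`, and the filename is an assumption:

```python
import json
import os
import tempfile

# Hypothetical toy data with the same shape as `sentences` (a list of token lists)
toy_sentences = [["im", "going", "discuss"], ["looking", "good"]]

# Write to a temporary folder so the example leaves no files behind;
# the filename is an assumption
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "sentences_example.json")

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(toy_sentences, f)

    # Read the file back in and rebuild the nested list
    with open(json_path, "r", encoding="utf-8") as f:
        restored = json.load(f)

print(restored == toy_sentences)
```

Unlike a pickle, the resulting file is plain text and can be read from R, JavaScript, or any other environment with a JSON parser.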

## Creating Word Embeddings with word2vec

Let's create our word embeddings model. Its input is a text corpus and its output is a set of "vectors" (mathematical objects with a magnitude and direction) in N dimensions, one for each word in the vocabulary. The purpose and usefulness of such a model is to group the vectors of similar words together in vector space. We can then reduce the dimensionality to visualize the results in a way humans can understand (such as in a 2-dimensional space), or perform linear algebra on the vectors to find out how words are related.

Word2vec is one example of a word embeddings model. It is basically a shallow neural network, originally developed by researchers at Google. Google also released vectors pre-trained on Google News data, but here we train the model ourselves on our own corpus. It learns by taking words and their contexts (e.g. sentences) into account, and tries to predict a word from its neighbours or the other way around. Given enough data, usage and contexts, word2vec can make accurate guesses about a word’s meaning based on its appearances. Those guesses can be used to establish a word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or to cluster documents and classify them by topic.
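
To make "words and their contexts" a bit more concrete, here is a simplified sketch of how (target, context) training pairs can be read off a sentence with a sliding window. The function `skipgram_pairs` is a hypothetical helper written for this illustration; it is not part of Gensim, and the real implementation adds refinements such as varying the window size and downsampling frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        # Neighbours up to `window` positions to the left and right
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical toy sentence
toy_sentence = ["going", "discuss", "briefly", "intention"]
for pair in skipgram_pairs(toy_sentence, window=1):
    print(pair)
```

Words that keep turning up with the same neighbours end up with similar vectors, which is what the rest of this notebook relies on.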

In [19]:
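num_features = 300  # Word vector dimensionality
min_word_count = 2  # Ignore words that occur fewer times than this
num_workers = 4     # Number of threads to run in parallel
context = 10        # Context window size
downsampling = 1e-2 # Threshold for downsampling very frequent words

# sg=1 selects the skip-gram architecture (sg=0 would be CBOW).
# Note: in Gensim 4.x the `size` argument is called `vector_size`.
model = Word2Vec(sentences, workers=num_workers,
                 size=num_features, min_count=min_word_count,
                 window=context, sample=downsampling, seed=1, sg=1)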

That was it! Let's save the model to disk, so we can reload it later.

In [20]:
model.save("word2vec.vec")

How many terms are in our vocabulary?

In [21]:
print('{:,} terms in the vocabulary.'.format(len(model.wv.vocab)))

62,107 terms in the vocabulary.


### Getting related terms

With the information in our word embeddings model, we can try to find similarities between words that interest us (i.e. words that have a similar vector). Let's create a function that retrieves related terms to some input.
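
"Similar vector" here means a high cosine similarity, i.e. two vectors pointing in roughly the same direction. A minimal sketch with made-up 3-dimensional vectors (the vectors and their names are assumptions for illustration only):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "word vectors" (the real ones here have 300 dimensions)
v_cat = np.array([1.0, 2.0, 0.0])
v_kitten = np.array([2.0, 4.0, 0.0])   # same direction as v_cat, twice as long
v_car = np.array([0.0, 0.0, 3.0])      # orthogonal to v_cat

print(cosine_similarity(v_cat, v_kitten))
print(cosine_similarity(v_cat, v_car))
```

Because only the direction matters, the length of a vector does not influence the score.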

In [22]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token and print them as a formatted list
    """

    for word, similarity in model.wv.most_similar(positive=[token], topn=topn):
        print(word, round(similarity, 3))

In [29]:
get_related_terms(u'hamstered')

rationalized 0.615
gloating 0.6
wast 0.595
seing 0.59
rationalised 0.586
enchanted 0.586
leaved 0.585
actualy 0.582
coped 0.582
flabbergasted 0.579


### Word algebra

Word algebra, also known as analogy completion, means doing math with words (like the famous example "king - man + woman = queen"). The core idea is that once words are represented as numerical vectors, you can do math with them. The mathematical procedure works as follows:

1. Provide a set of words or phrases you want to add or subtract.
2. Look up the vectors that represent those terms in the word vector model.
3. Add and subtract those vectors to produce a new, combined vector.
4. Look up the most similar vector(s) to this new, combined vector via cosine similarity.
5. Return the word(s) associated with the similar vector(s).
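
The five steps above can be sketched with a handful of hand-picked toy vectors. Everything in this block (the vectors, `toy_vectors`, `toy_word_algebra`) is a hypothetical illustration; Gensim's own implementation differs in details such as normalizing the vectors first.

```python
import numpy as np

# Hypothetical 2-dimensional vectors, hand-picked so the analogy works;
# a trained model learns 300-dimensional vectors from data instead
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.2, 1.0]),
    "woman": np.array([0.2, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def toy_word_algebra(add, subtract, vectors):
    """Combine the vectors, then return the nearest other word by cosine similarity."""
    combined = sum(vectors[w] for w in add) - sum(vectors[w] for w in subtract)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in add or word in subtract:
            continue  # skip the input words themselves
        score = np.dot(combined, vec) / (np.linalg.norm(combined) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

print(toy_word_algebra(add=["king", "woman"], subtract=["man"], vectors=toy_vectors))
```

With real embeddings the result is rarely this clean, which is exactly what makes the output interesting to interpret.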

Let's try it out. We'll create a function that does this for us.

In [69]:
def word_algebra(add=None, subtract=None, topn=1):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    # None defaults avoid sharing one mutable list between calls
    answers = model.wv.most_similar(positive=add or [], negative=subtract or [], topn=topn)

    for term, similarity in answers:
        print(term)

In [70]:
word_algebra(add=['women','evening'])

rendezvous


Looks like the model picked up the evening activities TRP members engage in.

In [86]:
word_algebra(add=['dating', 'single'], subtract=['alpha'])

moms


### Your turn!

Use some terms to add and/or subtract, and see what happens.

In [None]:
# Your code here...








## Word Vector Visualization with t-SNE

t-Distributed Stochastic Neighbor Embedding, or t-SNE, is a dimensionality reduction technique that helps with visualizing high-dimensional datasets. It attempts to map high-dimensional data onto a two- or three-dimensional representation, while keeping the relative distances between neighbouring points as close as possible to those in the original high-dimensional space.

Scikit-learn provides a convenient implementation of the t-SNE algorithm with its `TSNE` class.
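
Before running it on the real vectors, here is a minimal sketch of the `TSNE` class on random toy data. The data, the variable names and the `perplexity` value are assumptions chosen to keep the example small and fast:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical data: 30 random points in 10 dimensions stand in for word vectors
rng = np.random.RandomState(0)
toy_high_dim = rng.normal(size=(30, 10))

# perplexity must be smaller than the number of samples; 5 is an arbitrary choice
toy_tsne = TSNE(n_components=2, perplexity=5, random_state=0)
toy_low_dim = toy_tsne.fit_transform(toy_high_dim)

# One row per input point, one column per output dimension
print(toy_low_dim.shape)
```

Each of the 30 input points gets an x and a y coordinate, which is the format we need for plotting.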

Our input for t-SNE will be a DataFrame of word vectors, with the terms as the row labels and the 300 dimensions of the word vector model as the columns. Note that the cell below keeps the full vocabulary, which makes t-SNE slow; to save time, you could keep only the most frequent terms (say, the top 5,000).

In [93]:
# build a list of the terms, integer indices, and term counts from the word2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count) for term, voc in model.wv.vocab.items()]

# sort by the term counts; the default ascending order puts the rarest terms first
# (pass reverse=True to sort() to put the most common terms first instead)
ordered_vocab.sort(key = lambda x: x[2])  

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the word2vec vectors as data, and the terms as row labels
word_vectors = pd.DataFrame(model.wv.syn0norm[term_indices, :], index=ordered_terms)

word_vectors.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| untrusting | -0.073328 | -0.049614 | -0.067948 | -0.005526 | -0.035332 | 0.072338 | -0.065346 | -0.013099 | -0.091648 | -0.075509 | ... | 0.002435 | -0.112307 | -0.067031 | -0.032409 | 0.01656 | 0.070389 | 0.068888 | 0.099224 | 0.008335 | -0.038412 |
| monoculture | -0.110704 | -0.060495 | -0.067033 | -0.02325 | -0.013151 | 0.092887 | -0.05625 | 0.025337 | -0.075942 | -0.114726 | ... | 0.00551 | -0.052808 | -0.01716 | -0.028991 | 0.007493 | 0.047888 | 0.048036 | 0.114295 | 0.014057 | -0.056838 |
| dissenters | -0.090997 | -0.070274 | -0.053969 | -0.019231 | -0.024001 | 0.105156 | -0.075457 | 0.001598 | -0.067055 | -0.108417 | ... | 0.016848 | -0.043714 | -0.035371 | -0.030749 | 0.015953 | 0.051638 | 0.061654 | 0.116352 | 0.007557 | -0.069491 |
| heritable | -0.111836 | -0.039076 | -0.080776 | -0.003432 | -0.025344 | 0.068856 | -0.00917 | 0.003824 | -0.07833 | -0.127198 | ... | -0.026814 | -0.061688 | 0.00539 | -0.034024 | 0.039529 | 0.045858 | 0.006605 | 0.054377 | 0.009067 | -0.033266 |
| swes | -0.123966 | -0.05689 | -0.070137 | -0.027416 | -0.045688 | 0.073255 | -0.071537 | -0.004191 | -0.053086 | -0.121544 | ... | -0.000782 | -0.074426 | -0.085196 | -0.036075 | 0.016468 | 0.060584 | 0.062888 | 0.097253 | 0.079487 | -0.029178 |


In [94]:
tsne = TSNE()
tsne_vectors = tsne.fit_transform(word_vectors.values)

In [97]:
# In case you saved the t-SNE results earlier and want to load them again
# with open('tsne-TRP', 'rb') as f:
#    tsne = pickle.load(f)

# tsne_vectors = np.load(tsne_vectors_filepath)

In [98]:
tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(word_vectors.index),
                            columns=['x_coord', 'y_coord'])

In [99]:
tsne_vectors.head()

| | x_coord | y_coord |
|---|---|---|
| untrusting | -16.317139 | 16.218309 |
| monoculture | -21.707802 | 2.384353 |
| dissenters | -21.669952 | 2.417449 |
| heritable | -26.65206 | 8.715102 |
| swes | 0.502706 | -7.871472 |


In [100]:
tsne_vectors['word'] = tsne_vectors.index

In [101]:
output_notebook()

In [102]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title='t-SNE Word Embeddings',
                   plot_width = 800,
                   plot_height = 800,
                   tools= ('pan, wheel_zoom, box_zoom, box_select, reset'),
                   active_scroll='wheel_zoom')

# add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips = '@word'))

# draw the words as circles on the plot
tsne_plot.circle('x_coord', 'y_coord', source=plot_data,
                 color='blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color='black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value('14pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# show the plot
show(tsne_plot);



## K-means clustering
Now, we will analyze the results of word2vec. The first thing we will do is cluster the words using K-Means clustering. Since the words are all represented as vectors, applying K-Means is straightforward: the algorithm simply looks at the distances between the vectors and the cluster centers.
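
As a warm-up, here is a minimal sketch of K-Means on six made-up 2-D points that form two obvious groups. The points and the parameter choices (`n_init`, `random_state`) are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two well-separated groups
toy_points = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],      # group near the origin
    [5.0, 5.1], [5.2, 5.0], [5.1, 5.2],      # group near (5, 5)
])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_points)

print(toy_labels)                    # cluster id for each point
print(toy_kmeans.cluster_centers_)   # one centroid per cluster
```

The cluster ids themselves are arbitrary; what matters is which points share an id.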

In [103]:
def clustering_on_wordvecs(word_vectors, num_clusters):
    # Initialize a k-means object and use it to extract centroids
    kmeans_clustering = KMeans(n_clusters=num_clusters, init='k-means++')
    idx = kmeans_clustering.fit_predict(word_vectors)

    return kmeans_clustering.cluster_centers_, idx

In [104]:
Z = model.wv.syn0

In [105]:
centers, clusters = clustering_on_wordvecs(Z, 10)
centroid_map = dict(zip(model.wv.index2word, clusters))

Next, we get the words in each cluster that are closest to the cluster center. To do this, we initialize a KDTree on the word vectors, and query it for the top K words around each cluster center. Using the `index2word` list, we then map each word vector back to its original word and add the words to a DataFrame for easier printing.
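
Here is a small sketch of such a KDTree lookup, using a made-up vocabulary of four 2-D vectors and a made-up cluster center (all names and values in this block are assumptions):

```python
import numpy as np
from sklearn.neighbors import KDTree

# Hypothetical vocabulary with 2-D vectors
toy_words = ["alpha", "beta", "gamma", "delta"]
toy_vecs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])

toy_tree = KDTree(toy_vecs)

# Query the two vectors nearest to a hypothetical cluster center;
# query expects a 2-D array, hence the reshape
toy_center = np.array([0.1, 0.1])
distances, indices = toy_tree.query(toy_center.reshape(1, -1), k=2)

# Map the returned row indices back to words
nearest = [toy_words[i] for i in indices[0]]
print(nearest)
```

The query returns row indices rather than words, which is why a separate index-to-word lookup is needed.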

In [106]:
def get_top_words(index2word, k, centers, wordvecs):
    tree = KDTree(wordvecs)
    # Query the tree for the k points closest to each cluster center
    closest_points = [tree.query(np.reshape(x, (1, -1)), k=k) for x in centers]
    closest_words_idxs = [x[1] for x in closest_points]
    # Look up the word for each index and store the words per cluster in a dictionary
    closest_words = {}
    for i in range(0, len(closest_words_idxs)):
        closest_words['Cluster #' + str(i)] = [index2word[j] for j in closest_words_idxs[i][0]]
    # Generate a DataFrame from the dictionary, with ranks starting at 1
    df = pd.DataFrame(closest_words)
    df.index = df.index + 1
    return df

Let’s get the 5,000 words closest to each cluster center and print the first 10 for each cluster:

In [107]:
top_words = get_top_words(model.wv.index2word, 5000, centers, Z)

In [108]:
top_words[:10]

| | Cluster #0 | Cluster #1 | Cluster #2 | Cluster #3 | Cluster #4 | Cluster #5 | Cluster #6 | Cluster #7 | Cluster #8 | Cluster #9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | institutionally | xpost | itching | pakistan | gazillionaires | chaddad | snug | fvideo | behooves | physiologically |
| 2 | merciful | fnordsnord | sexworthy | organisations | murmur | fav | bun | yf | introspect | surges |
| 3 | redefinitions | sidebars | supplicative | severance | moo | blondie | fireplace | qp | acquisitive | preworkout |
| 4 | oppresses | articulates | successfull | enroll | recur | neighbour | briefs | beyondchocolate | benevolence | sexdrive |
| 5 | touting | pillscollide | blab | evaluators | referendum | algeria | rim | wo | unwarranted | serotonergic |
| 6 | sociocultural | revisiting | multitudes | traveller | hilterbrand | awoke | compression | perot | ineptitude | painless |
| 7 | cathexis | occamsusername | harboring | supervisory | hewitt | drummer | ribs | ubc | pragmatically | hydration |
| 8 | kupfer | srssucks | overinvested | shareholders | weightlifters | debrief | inserts | wac | conceptualize | toxins |
| 9 | fracturing | formatted | weirding | liquidated | shrank | phoned | doggystyle | filmmakers | holistic | fibers |
| 10 | unchanging | redpills | tlagc | lenders | trivialities | circled | squirming | faa | worsen | memorization |


## Assignment: visualize!

Use Matplotlib or Seaborn to visualize these results!

In [None]:
# Your code here...








# Assignment

Think about how *you* could apply this method of finding similar words through the mapping of concepts (in this example, "men" and "women") onto dimensions. How could you use it to find **ideological binaries** and **normativities** in your data?
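
As a starting point, here is one possible way to map words onto a dimension: take the difference between two pole vectors as an axis, and project other words onto it. This is only a sketch under heavy assumptions: the 2-D vectors below are hand-made and the words are picked purely for illustration, so with a trained model you would look the vectors up instead.

```python
import numpy as np

# Hypothetical 2-D vectors; with a trained model you would look these up instead
toy_vecs = {
    "men":      np.array([1.0, 0.0]),
    "women":    np.array([-1.0, 0.0]),
    "provider": np.array([0.6, 0.8]),
    "nurture":  np.array([-0.8, 0.6]),
}

# The axis points from "women" towards "men", scaled to unit length
axis = toy_vecs["men"] - toy_vecs["women"]
axis = axis / np.linalg.norm(axis)

def project(word):
    """Signed position of a word on the axis: > 0 leans 'men', < 0 leans 'women'."""
    vec = toy_vecs[word]
    return float(np.dot(vec / np.linalg.norm(vec), axis))

for word in ["provider", "nurture"]:
    print(word, round(project(word), 2))
```

Ranking many words by such a score is one way to surface which terms a community places at either end of a binary.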