# Word Embeddings

Word embeddings can capture the "context" of a word. We can analyze word embeddings to understand the ideas associated with a word in a corpus.

In its most basic sense, a word embedding model clusters words by the contexts they appear in. This is how the model determines which words are "similar" or "dissimilar" to one another. What does this look like in practice? If I queried the word "Texas," my model might tell me that "California" and "Illinois" are similar (because each of these is a state). If I instead queried the word "red," I might see that "yellow" and "blue" are similar (because each of these is a color).

We can leverage the insight of a word embeddings model to see how language changed over time or across discourse communities. We can ask questions like: how does the U.S. Congress contextualize "immigration" in 2001 compared to 2015? Or, which words does the State School Board associate with "gay" in Texas compared to New York? 


### Word Embeddings with Gensim

We are going to reuse our code to import our data.

In [2]:
import os
import json
import re
import gensim

In [36]:
def data_import(fname):
    # Read file as list of lists. 
    # Then clean the list of lists 

    with open(fname, newline = '') as f:
        reader = json.loads(f.read()) # read the JSON file as a Python object
        data = list(reader)[1:] # skip the first record
        data = list(map(str, data)) # make sure every record is a string

    data = [re.sub(r'\\\\n|\\\\t', '', word) for word in data] # remove line breaks and tab breaks
    data = [re.sub(r'[^\w\s]|_', '', word) for word in data] # remove punctuation and underscore
    data = [re.sub(r'\b\d{1,3}\b', '', word) for word in data] # remove standalone numbers of 1 to 3 digits (no space inside {1,3}, or Python treats the braces as literal text)
    data = [re.sub(r'\w*\d\w*', '', word) for word in data] # remove character strings that contain a digit
        
    data = [word.lower() for word in data]
    data = [word.split() for word in data]

    for sublist in data:
        if 'sentence' in sublist:
            sublist.remove('sentence')

    return data

In [37]:
directory = os.path.abspath('')

directory

'/home/stephbuon/projects/faha/word-embeddings'

In [38]:
data = data_import(directory + '/congress_2001.json')

data[:5]

[['senator', 'daschle'],
 ['is', 'recognized'],
 ['mr', 'president'],
 ['on', 'behalf', 'of', 'the', 'entire', 'senate'],
 ['but', 'especially', 'this', 'senator']]
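
To see what this kind of cleaning does to a single record, here is a minimal sketch that applies substitutions like those in `data_import()` to one made-up sentence. The sample text and the `clean()` helper are illustrative assumptions, not part of the Congressional data or the original function.

```python
import re

def clean(text):
    # Hypothetical helper mirroring the main cleaning steps in data_import()
    text = re.sub(r'[^\w\s]|_', '', text)   # remove punctuation and underscores
    text = re.sub(r'\w*\d\w*', '', text)    # remove any token that contains a digit
    return text.lower().split()             # lowercase, then split on whitespace

# Made-up sentence for illustration only
sample = "Mr. President, the 107th Congress passed H.R. 1 in 2001!"
tokens = clean(sample)
print(tokens)
```

Tokens such as "107th" and "2001" disappear entirely, while "H.R." collapses to "hr" once its punctuation is removed.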

Now that our data is imported, we can model it. `Word2Vec()` takes a few parameters that may be unfamiliar. Here, `workers` sets the number of worker threads used for training, which lets the work be spread across several of your computer's cores (aka "brains") instead of just one. `vector_size` sets the number of dimensions in each word's vector. `min_count` tells our model to ignore words that appear fewer than 20 times. I chose this threshold because a model can't accurately assess a word if there are not enough examples of its context. For example, it would be hard to understand the context of the word "run" if it is only used once. (Does run refer to an election? Exercise? A tear in some cloth?)

In [5]:
%%time

period_model = gensim.models.Word2Vec(sentences = data, workers = 8, min_count = 20, vector_size = 100) 

CPU times: user 3min 33s, sys: 1.51 s, total: 3min 34s
Wall time: 1min
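
The effect of the `min_count` threshold can be illustrated with plain word counts. The sketch below is not gensim's internal code; it uses a made-up toy corpus and a hypothetical threshold of 2 (rather than 20) to show which words a frequency cutoff would keep and which it would drop.

```python
from collections import Counter

# Toy corpus in the same shape as `data`: a list of tokenized sentences
sentences = [
    ['the', 'senator', 'will', 'run', 'for', 'office'],
    ['the', 'senator', 'is', 'recognized'],
    ['the', 'senate', 'will', 'vote'],
]

min_count = 2  # hypothetical threshold for this toy example; the model above uses 20

# Count how often each word appears across the whole corpus
counts = Counter(word for sentence in sentences for word in sentence)

kept = sorted(word for word, n in counts.items() if n >= min_count)
dropped = sorted(word for word, n in counts.items() if n < min_count)

print('kept:', kept)
print('dropped:', dropped)
```

Here "run" appears only once, so it falls below the cutoff and would never receive a vector.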


### Saving Our Word Embeddings Model

For this introduction we are working with just one year of the U.S. Congressional Records. Modeling word embeddings can take a long time when working with large data sets, like the Congressional Records for 100 years. 

Because generating a model is so intensive we would, ideally, only create a model once, not every time we want to do an analysis. Lucky for us we can save our model for later.

In [None]:
period_model.save(directory + '/congress_2001_word_embeddings_model.gensim')

You can save your models anywhere, but they will be a lot easier to find if you designate directories for them.
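
One way to keep models in a designated directory is to build the save path with a small helper. This is a minimal sketch: the `model_path()` function and the `models` folder name are hypothetical choices, not something the notebook or gensim requires.

```python
import os

def model_path(base_dir, file_name, subdir='models'):
    """Return a path inside base_dir/subdir, creating the folder if needed.

    `subdir='models'` is an assumed folder name chosen for illustration.
    """
    folder = os.path.join(base_dir, subdir)
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, file_name)

# Possible usage with the variables defined earlier in this notebook:
# period_model.save(model_path(directory, 'congress_2001_word_embeddings_model.gensim'))
```

The same helper can be passed to `Word2Vec.load()` later, so the save and load steps always point at the same place.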

### Loading Our Word Embeddings Model

We can load our model whenever we want to work with our word embeddings some more.

In [None]:
period_model = gensim.models.Word2Vec.load(directory + '/congress_2001_word_embeddings_model.gensim')

### Exploring Our Word Embeddings Model

A word embeddings model represents which words are considered similar to one another based on its corpus. By exploring word embeddings, we can gain insight into how members of Congress associated words with one another in the year 2001.

Similarity scores are cosine similarities, so they can range from -1 to 1, although most of the scores we will see fall between 0 and 1. Larger scores are associated with greater similarity.

### Similarity

We can see which words are considered to be "most similar" to one another within the U.S. Congressional Records.

In [6]:
period_model.wv.most_similar('women', topn = 10)

[('adults', 0.6491211652755737),
 ('individuals', 0.6041426658630371),
 ('families', 0.570698618888855),
 ('lesbians', 0.566820502281189),
 ('teens', 0.5661625266075134),
 ('americans', 0.5566259622573853),
 ('children', 0.5461678504943848),
 ('servicewomen', 0.5455303192138672),
 ('people', 0.5398010015487671),
 ('workers', 0.5377645492553711)]

In [7]:
period_model.wv.most_similar('men', topn = 10)

[('servicemen', 0.7953948378562927),
 ('soldiers', 0.6381202340126038),
 ('firefighters', 0.6292278170585632),
 ('heroes', 0.6067168116569519),
 ('patriots', 0.6038002371788025),
 ('firemen', 0.57923823595047),
 ('children', 0.5760605335235596),
 ('brave', 0.5636528730392456),
 ('civilians', 0.5574391484260559),
 ('sailors', 0.5493330955505371)]

In [8]:
period_model.wv.most_similar('towers', topn = 10)

[('twin', 0.7140002250671387),
 ('camps', 0.675162136554718),
 ('tents', 0.6718695759773254),
 ('flames', 0.6657811999320984),
 ('parked', 0.6589339971542358),
 ('rubble', 0.6543099284172058),
 ('khobar', 0.6490092873573303),
 ('crashes', 0.6387389302253723),
 ('crash', 0.6295860409736633),
 ('airliners', 0.6281408667564392)]

In [9]:
period_model.wv.most_similar('soldier', topn = 10)

[('woman', 0.7726640701293945),
 ('man', 0.7626761794090271),
 ('hero', 0.7611467838287354),
 ('journalist', 0.7470694780349731),
 ('statesman', 0.7453933358192444),
 ('veteran', 0.7369786500930786),
 ('aviator', 0.7255291938781738),
 ('athlete', 0.7117759585380554),
 ('fireman', 0.6981242299079895),
 ('politician', 0.6962231993675232)]

In [11]:
period_model.wv.most_similar('global', topn = 10)

[('climate', 0.6193890571594238),
 ('globalization', 0.6128581166267395),
 ('international', 0.6067464351654053),
 ('warming', 0.5759508609771729),
 ('worldwide', 0.5730130076408386),
 ('multilateral', 0.5550599694252014),
 ('rapid', 0.5546202659606934),
 ('evolving', 0.5485063791275024),
 ('strategic', 0.5277345180511475),
 ('hivaids', 0.5258907675743103)]

### Subtracting Vectors

We can subtract vectors to see which words are associated with one term and not the other.

In [12]:
# Which words are associated with woman and not man? 

diff = period_model.wv['woman'] - period_model.wv['man']
period_model.wv.similar_by_vector(diff)

[('female', 0.4222990870475769),
 ('clinics', 0.41148972511291504),
 ('immunizations', 0.40906789898872375),
 ('hospitals', 0.40902236104011536),
 ('exams', 0.39813071489334106),
 ('mental', 0.3952828347682953),
 ('physicians', 0.39134857058525085),
 ('psychiatrists', 0.3881763517856598),
 ('medical', 0.3865758180618286),
 ('screenings', 0.3844527304172516)]

In [13]:
# Which words are associated with man and not woman? 

diff = period_model.wv['man'] - period_model.wv['woman']
period_model.wv.similar_by_vector(diff)

[('man', 0.4958917498588562),
 ('bipartisanship', 0.40711140632629395),
 ('symbolize', 0.3811882436275482),
 ('billand', 0.3657127320766449),
 ('preaching', 0.36058422923088074),
 ('honesty', 0.35733601450920105),
 ('humility', 0.3560566306114197),
 ('gesture', 0.35543292760849),
 ('embodiment', 0.35346508026123047),
 ('mighty', 0.3475883901119232)]
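
To make the subtraction step more concrete, here is a small sketch using made-up two-dimensional vectors instead of the model's 100-dimensional ones. The words and numbers in `toy_vectors` are assumptions chosen for illustration. The sketch subtracts one vector from another and ranks every word by cosine similarity to the result, which is the measure `similar_by_vector()` uses for its ranking.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 2-dimensional "embeddings" (real vectors come from the trained model)
toy_vectors = {
    'woman':  [1.0, 1.0],
    'man':    [1.0, -1.0],
    'female': [0.2, 0.9],
    'senate': [0.9, 0.1],
}

# Subtract element by element: what 'woman' has that 'man' does not
diff = [w - m for w, m in zip(toy_vectors['woman'], toy_vectors['man'])]

# Rank every word by how closely it points in the direction of the difference
ranked = sorted(
    toy_vectors,
    key=lambda word: cosine_similarity(toy_vectors[word], diff),
    reverse=True,
)
print(diff)
print(ranked)
```

In this toy setup the shared first dimension cancels out, so the ranking is driven only by the dimension on which the two words differ.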

### Find Similarity Score

We can also find the score that represents how "similar" any two words are within a corpus.

In [14]:
period_model.wv.similarity('soldiers', 'men')

0.6381203

In [15]:
period_model.wv.similarity('christian', 'rational')

-0.029213697
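
A slightly negative score like this one is possible because `similarity()` returns the cosine similarity between two word vectors, which can fall anywhere between -1 and 1. The sketch below computes cosine similarity by hand on made-up vectors (not vectors from our model) to show the three reference points.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])        # 0: vectors at a right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1

print(same_direction, unrelated, opposite)
```

A score near 0, such as the one for "christian" and "rational", suggests the two words are used in largely unrelated contexts in this corpus rather than in opposite ones.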
