# Word Embeddings

Word embeddings can capture the "context" of a word. We can analyze word embeddings to understand the ideas associated with a word in a corpus.

A well-trained set of word vectors provides insight into which words are closer to each other in meaning within a corpus. For example, if I queried the word "Texas", my model might tell me that "California" and "Illinois" are similar in meaning (because these are all states). If I instead queried the word "red", I might see that "yellow" and "blue" are similar in meaning (because these are all colors).

We can leverage the insight gleaned from a word embeddings model to see how language changes over time or across discourse communities. We can ask questions like: how does the U.S. Congress contextualize "immigration" in 2001 compared to 2021? Or, which words does the state school board associate with "gay" in Texas compared to New York?


In [1]:
import os
import csv
import re
import gensim

In [2]:
def data_import(working_dir, fname):
    # Read the csv file as a list of rows, skipping the first row.
    # Each row is converted to a string so the regex cleaning below can run on it.

    with open(working_dir + fname, newline = '') as f:
        reader = csv.reader(f)
        data = list(reader)[1:]
        data = list(map(str, data))
            
    data = [re.sub(r'\\\\n|\\\\t|\'s', '', word) for word in data] # remove line breaks, tab breaks, and possessive "s"
    data = [re.sub(r'[^\w\s]|_', '', word) for word in data] # remove punctuation and underscore
    data = [re.sub(r'\d{1,3}', '', word) for word in data] # remove runs of 1 to 3 digits (a space inside the braces would stop the quantifier from working)
    data = [re.sub(r'\w*\d\w*', '', word) for word in data] # remove character strings that contain a digit
        
    data = [word.lower() for word in data]
    data = [word.split() for word in data]

    return data
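
To make the cleaning steps easier to follow, here is a minimal sketch that applies a few of them to a single made-up row instead of a whole file. The sample sentence is hypothetical and not taken from the Congressional data; the regular expressions are the ones used in `data_import`.

```python
import re

# Hypothetical sample row, converted to a string the same way data_import does
sample = str(["The Senator's bill passed in 2001, with 98 votes."])

cleaned = re.sub(r"'s", '', sample)           # remove possessive "s"
cleaned = re.sub(r'[^\w\s]|_', '', cleaned)   # remove punctuation and underscores
cleaned = re.sub(r'\w*\d\w*', '', cleaned)    # remove tokens that contain a digit

tokens = cleaned.lower().split()
print(tokens)
```

The brackets and quotation marks introduced by `str()` are stripped by the punctuation step, which is why the result is a clean list of lowercase tokens.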

In [4]:
working_dir = '/home/stephbuon/data/'
fname = 'congress_2001.csv'

In [5]:
data = data_import(working_dir, fname)

data[:5]

[['the', 'majority', 'leader'],
 ['senator'],
 ['is', 'recognized'],
 ['mr', 'president'],
 ['on', 'behalf', 'of', 'the', 'entire', 'senate']]

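We can now model our data. `Word2Vec()` takes a few parameters that may be unfamiliar. `workers` sets the number of worker threads used for training; setting it close to the number of cores (aka "brains") in your machine lets training run in parallel rather than on a single core. `min_count` tells our model to ignore words that appear fewer than 20 times in the corpus. `vector_size` sets the number of dimensions in each word vector.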
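
As a rough illustration of what a frequency threshold like `min_count` does, the sketch below counts words in a tiny made-up corpus and keeps only those that meet the threshold. The corpus and the threshold of 2 are assumptions chosen for the example; this mimics the idea with the standard library and is not gensim's own code. It also prints the number of cores Python can see, which can help when choosing `workers`.

```python
import os
from collections import Counter

# Toy corpus in the same list-of-token-lists shape as `data` (made up for illustration)
sentences = [
    ['the', 'majority', 'leader'],
    ['the', 'senator', 'is', 'recognized'],
    ['the', 'senate', 'is', 'in', 'session'],
]

# Count how often each word occurs across all sentences
counts = Counter(token for sentence in sentences for token in sentence)

min_count = 2  # hypothetical threshold for this tiny corpus
kept = sorted(word for word, n in counts.items() if n >= min_count)
print(kept)

# Number of cores available on this machine
print(os.cpu_count())
```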

In [46]:
%%time

period_model = gensim.models.Word2Vec(sentences = data, workers = 8, min_count = 20, vector_size = 100) 

CPU times: user 3min 11s, sys: 1.15 s, total: 3min 12s
Wall time: 56.4 s


### Saving Our Word Embeddings Model

We are working with just one year of the U.S. Congressional Records. Modeling word embeddings can take a long time when working with large data sets, like the Congressional Records for 100 years. 

Ideally we would only create a word embeddings model from our data once, not every time we want to do an analysis. Lucky for us, we can save our model to our computer for later.

The following code creates a folder named `word_embeddings` in our working directory if the folder does not already exist and then saves our model to it. 

In [None]:
working_folder = working_dir + 'word_embeddings'

if not os.path.exists(working_folder):
    os.mkdir(working_folder)
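
An alternative sketch of the same step: `os.makedirs()` with `exist_ok = True` creates the folder only when it is missing, so the explicit existence check is not needed. The temporary directory below is a hypothetical stand-in for `working_dir` so that the example does not touch real data.

```python
import os
import tempfile

# Hypothetical stand-in for working_dir
base_dir = tempfile.mkdtemp()
model_folder = os.path.join(base_dir, 'word_embeddings')

os.makedirs(model_folder, exist_ok = True)  # creates the folder
os.makedirs(model_folder, exist_ok = True)  # running it again does nothing and raises no error

print(os.path.isdir(model_folder))
```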

In [None]:
period_model.save(working_dir + 'word_embeddings/congress_2001_word_embeddings_model.gensim')

### Loading Our Word Embeddings Model

We can load our model whenever we want to work with our word embeddings some more.

In [49]:
period_model = gensim.models.Word2Vec.load(working_dir + 'word_embeddings/congress_2001_word_embeddings_model.gensim')

### Exploring Our Word Embeddings Model

A word embeddings model places words that appear in similar contexts in its training corpus close to one another. By exploring word embeddings, we gain insight into how members of Congress associated issues with one another in the year 2001.

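Similarity scores are cosine similarities, so they can range from -1 to 1, although the closest matches listed in this notebook all fall between 0 and 1. Larger scores are associated with greater similarity.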

### Similarity

We can see which words are considered to be "most similar" to one another within the U.S. Congressional Records.

In [84]:
period_model.wv.most_similar('women', topn = 10)

[('adults', 0.63141268491745),
 ('individuals', 0.6123476028442383),
 ('patients', 0.5769939422607422),
 ('families', 0.5762478709220886),
 ('nurses', 0.5738139748573303),
 ('immigrants', 0.573233425617218),
 ('americans', 0.569057047367096),
 ('soldiers', 0.5658395290374756),
 ('hispanics', 0.5551348328590393),
 ('africanamericans', 0.5544487237930298)]

In [85]:
period_model.wv.most_similar('men', topn = 10)

[('servicemen', 0.7856674194335938),
 ('soldiers', 0.609350323677063),
 ('firemen', 0.6058236360549927),
 ('heroes', 0.5795339941978455),
 ('firefighters', 0.5795301795005798),
 ('patriots', 0.5669488906860352),
 ('countrymen', 0.5585413575172424),
 ('souls', 0.5551075339317322),
 ('rescuers', 0.5501298904418945),
 ('brave', 0.5463396906852722)]

In [64]:
period_model.wv.most_similar('towers', topn = 10)

[('twin', 0.7170605659484863),
 ('camps', 0.6846996545791626),
 ('crashing', 0.6712399125099182),
 ('khobar', 0.6627557873725891),
 ('airliners', 0.6483810544013977),
 ('crashed', 0.6441882252693176),
 ('tents', 0.6228711009025574),
 ('flames', 0.6196696162223816),
 ('shadows', 0.6154047250747681),
 ('rubble', 0.6119235754013062)]

In [52]:
period_model.wv.most_similar('soldier', topn = 10)

[('journalist', 0.7668976783752441),
 ('man', 0.7554538249969482),
 ('woman', 0.7496949434280396),
 ('hero', 0.7309134006500244),
 ('statesman', 0.7268860936164856),
 ('warrior', 0.7024351358413696),
 ('aviator', 0.6905612349510193),
 ('firefighter', 0.6881330609321594),
 ('politician', 0.6843069791793823),
 ('veteran', 0.6815338134765625)]

In [74]:
period_model.wv.most_similar('army', topn = 10)

[('navy', 0.8557624220848083),
 ('infantry', 0.759527325630188),
 ('commander', 0.7565478682518005),
 ('armys', 0.7555398344993591),
 ('marine', 0.7459832429885864),
 ('battalion', 0.739398717880249),
 ('airborne', 0.73427414894104),
 ('naval', 0.732434868812561),
 ('marines', 0.7063413262367249),
 ('wing', 0.6886097192764282)]

In [140]:
period_model.wv.most_similar('global', topn = 10)

[('international', 0.610640823841095),
 ('climate', 0.6087414026260376),
 ('globalization', 0.5622814297676086),
 ('nearterm', 0.5595552325248718),
 ('systemic', 0.5546252727508545),
 ('warming', 0.545982837677002),
 ('multilateral', 0.5427759289741516),
 ('technological', 0.5351625084877014),
 ('dynamic', 0.5350406765937805),
 ('evolving', 0.5344930291175842)]

### Subtracting Vectors

We can subtract one word vector from another to see which words are associated with one term and not the other.

In [92]:
# Which words are associated with woman and not man? 

diff = period_model.wv['woman'] - period_model.wv['man']
period_model.wv.similar_by_vector(diff)

[('mental', 0.40712496638298035),
 ('medicaid', 0.39979588985443115),
 ('immunizations', 0.3840039372444153),
 ('referrals', 0.3836553394794464),
 ('outpatient', 0.37981879711151123),
 ('admissions', 0.37965041399002075),
 ('female', 0.37700706720352173),
 ('adolescent', 0.3649798333644867),
 ('exams', 0.3635528087615967),
 ('ag', 0.3635350167751312)]

In [93]:
# Which words are associated with man and not woman? 

diff = period_model.wv['man'] - period_model.wv['woman']
period_model.wv.similar_by_vector(diff)

[('man', 0.5116848349571228),
 ('honesty', 0.4083215296268463),
 ('bipartisanship', 0.397760808467865),
 ('gesture', 0.3768903613090515),
 ('affection', 0.3675096035003662),
 ('boundless', 0.36692705750465393),
 ('love', 0.36035898327827454),
 ('humility', 0.35714954137802124),
 ('exemplifies', 0.35624781250953674),
 ('warrior', 0.3549908399581909)]
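
To see what the subtraction does, here is a minimal sketch with made-up three-dimensional vectors (real vectors in this notebook have 100 dimensions, and these numbers are invented for illustration). It subtracts one vector from another and then ranks every word by cosine similarity to the difference, which is the general idea behind querying by a vector.

```python
import math

# Made-up vectors; the values are assumptions chosen for the example
vectors = {
    'woman': [0.9, 0.1, 0.4],
    'man': [0.1, 0.9, 0.4],
    'medicaid': [0.8, 0.0, 0.1],
    'warrior': [0.0, 0.7, 0.2],
}

def cosine(a, b):
    # Cosine of the angle between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Element-wise difference: what points toward 'woman' and away from 'man'
diff = [w - m for w, m in zip(vectors['woman'], vectors['man'])]

# Rank every word by how closely it points in the direction of the difference
ranked = sorted(vectors, key = lambda word: cosine(diff, vectors[word]), reverse = True)
print(ranked)
```

In this toy ranking the query words themselves stay in the list, much as 'man' shows up in its own results above.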

### Find Similarity Score

We can also find the score that represents how "similar" any two words are within a corpus.

In [115]:
period_model.wv.similarity('soldiers', 'men')

0.6093504

In [130]:
period_model.wv.similarity('christian', 'rational')

0.014649789
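
The score returned by `similarity()` is the cosine of the angle between the two word vectors. The following sketch computes it by hand for a few made-up vectors (the numbers are assumptions for illustration, not values from the model) to show how vectors that point the same way, at right angles, or in opposite directions are scored.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: score close to 1
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Right angles: score of 0
unrelated = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

# Opposite directions: score close to -1
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(same_direction, unrelated, opposite)
```

A score near 0, like the one for 'christian' and 'rational' above, therefore means the two vectors are close to unrelated in direction rather than opposite in meaning.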
