# Word Embeddings

Word embeddings represent each word as a vector of numbers learned from the contexts in which the word appears.

A well-trained set of word vectors places words with similar meanings close to one another. For example, "New York," "California," and "Texas" may sit near each other, as may "red," "yellow," and "blue."
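One common way to measure how "close" two word vectors are is cosine similarity. The sketch below uses three made-up, 3-dimensional vectors (real embeddings are learned from text and usually have far more dimensions), so the numbers only illustrate the idea.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, invented for illustration only.
toy_vectors = {
    "texas":      [0.9, 0.1, 0.0],
    "california": [0.8, 0.2, 0.1],
    "red":        [0.1, 0.9, 0.2],
}

# The two states point in nearly the same direction, so they score higher.
print(cosine_similarity(toy_vectors["texas"], toy_vectors["california"]))
print(cosine_similarity(toy_vectors["texas"], toy_vectors["red"]))
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they are nearly unrelated.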

In [5]:
import os
import csv
import re
import gensim

In [26]:
def data_import(dir_path, fname):
    # Read the csv file as a list of rows, skipping the header row.
    # Each row is converted to a single string so it can be cleaned with regular expressions.

    with open(os.path.join(dir_path, fname), newline = '') as f:
        reader = csv.reader(f)
        data = list(reader)[1:]
        data = list(map(str, data))

    data = [re.sub(r'\b[A-Z]+(?:\s+[A-Z]+)*\b', '', ls) for ls in data] # remove words that are all upper case (mostly speaker names)
    data = [re.sub(r'\\\\n|\\\\t|\'s', '', ls) for ls in data] # remove escaped line breaks, escaped tabs, and possessive "s"
    data = [re.sub(r'[^\w\s]|_', '', ls) for ls in data] # remove punctuation and underscores
    data = [re.sub(r'\d{1,3}', '', ls) for ls in data] # remove runs of 1 to 3 digits (a space inside the braces would make the quantifier literal text)
    data = [re.sub(r'\w*\d\w*', '', ls) for ls in data] # remove character strings that contain a digit

    data = [ls.lower() for ls in data] # lowercase each line
    data = [ls.split() for ls in data] # split each line into a list of words

    return data
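To see what the cleaning steps do, here is a small sketch that applies similar patterns to one plain string. `clean_line` and the sample sentence are hypothetical, and the sketch leaves out the escaped newline and tab step, which only matters for the stringified csv rows.

```python
import re

def clean_line(text):
    """Apply cleaning steps similar to data_import to one string (hypothetical helper)."""
    text = re.sub(r'\b[A-Z]+(?:\s+[A-Z]+)*\b', '', text)  # all-uppercase words
    text = re.sub(r"'s", '', text)                         # possessive "s"
    text = re.sub(r'[^\w\s]|_', '', text)                  # punctuation and underscores
    text = re.sub(r'\w*\d\w*', '', text)                   # tokens containing a digit
    return text.lower().split()

print(clean_line("Mr. PRESIDENT, the Senate's budget for 2001 is ready!"))
# ['mr', 'the', 'senate', 'budget', 'for', 'is', 'ready']
```

Note that the all-uppercase rule removes "PRESIDENT" but keeps "Mr", because only words written entirely in capitals match.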

In [44]:
working_dir = '/home/stephbuon/data/'
fname = 'us_congress_2001.csv'

In [45]:
data = data_import(working_dir, fname)

data[:5]

[['the', 'majority', 'leader'],
 ['senator'],
 ['is', 'recognized'],
 ['mr', 'president'],
 ['on', 'behalf', 'of', 'the', 'entire', 'senate']]

We can now model our data. `Word2Vec()` takes a few parameters that may be unfamiliar. `workers` sets the number of worker threads used for training, so the work can be spread across several of your computer's cores rather than just one. `min_count` tells the model to ignore words that appear fewer than 20 times. `vector_size` sets the number of dimensions in each word vector.
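The effect of `min_count` can be sketched without training anything. The code below is an illustration rather than gensim's own implementation: it counts words in a made-up corpus (`toy_sentences` and `frequent_words` are hypothetical names) and keeps only those that meet the threshold. It also shows one way to look up how many cores are available when choosing `workers`.

```python
import os
from collections import Counter

# Hypothetical toy corpus in the same list-of-token-lists shape as `data`.
toy_sentences = [
    ["the", "senate", "is", "in", "session"],
    ["the", "senate", "will", "vote"],
    ["the", "vote", "is", "final"],
]

def frequent_words(sentences, min_count):
    """Return the words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

print(sorted(frequent_words(toy_sentences, min_count=2)))
# ['is', 'senate', 'the', 'vote']

# Number of logical cores reported by the operating system (may be None).
print(os.cpu_count())
```

Words that fall below the threshold get no vector at all, so querying them later raises an error.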

In [46]:
%%time

period_model = gensim.models.Word2Vec(sentences = data, workers = 8, min_count = 20, vector_size = 100) 

CPU times: user 3min 11s, sys: 1.15 s, total: 3min 12s
Wall time: 56.4 s


### Saving Our Word Embeddings Model

We are working with just one year of the U.S. Congressional Records. Modeling word embeddings can take a long time when working with large data sets, like the Congressional Records for 100 years. 

Ideally, we would create a word embeddings model from our data only once, not every time we want to do an analysis. Luckily, we can save our model to our computer for later.

The following code creates a folder named `word_embeddings` in our working directory if the folder does not already exist and then saves our model to it. 

In [None]:
working_folder = working_dir + 'word_embeddings'

if not os.path.exists(working_folder):
    os.mkdir(working_folder)
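A slightly more defensive variant of the folder setup is sketched below. It uses a temporary directory as a stand-in for `working_dir` (an assumption made so the example runs anywhere) and relies only on the standard library.

```python
import os
import tempfile

# A temporary directory stands in for the notebook's `working_dir` (assumption).
working_dir = tempfile.mkdtemp()

# os.path.join inserts the path separator, so it does not matter whether
# working_dir ends with a slash.
working_folder = os.path.join(working_dir, 'word_embeddings')

# exist_ok=True makes the call safe to repeat if the folder already exists.
os.makedirs(working_folder, exist_ok=True)
os.makedirs(working_folder, exist_ok=True)

print(os.path.isdir(working_folder))  # True
```

Plain string concatenation such as `working_dir + 'word_embeddings'` works only when `working_dir` ends with a slash, which is why `os.path.join` is used here.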

In [48]:
period_model.save(os.path.join(working_folder, 'congress_2001_word_embeddings_model.gensim'))

### Loading Our Word Embeddings Model

We can load our model from the `word_embeddings` folder whenever we want to work with our word embeddings some more.

In [49]:
period_model = gensim.models.Word2Vec.load(os.path.join(working_folder, 'congress_2001_word_embeddings_model.gensim'))

### Exploring Our Word Embeddings Model

A word embeddings model places words near one another when they are used in similar contexts in its training corpus. By exploring the embeddings, we gain insight into how members of Congress associated issues with one another in the year 2001.

In [82]:
period_model.wv.most_similar('woman', topn = 10)

[('man', 0.791344404220581),
 ('person', 0.7732507586479187),
 ('soldier', 0.7496948838233948),
 ('citizen', 0.6800270080566406),
 ('politician', 0.6747932434082031),
 ('mother', 0.6740400791168213),
 ('lady', 0.640896737575531),
 ('girl', 0.6405008435249329),
 ('teenager', 0.6390302181243896),
 ('lawyer', 0.63542640209198)]

In [64]:
period_model.wv.most_similar('towers', topn = 10)

[('twin', 0.7170605659484863),
 ('camps', 0.6846996545791626),
 ('crashing', 0.6712399125099182),
 ('khobar', 0.6627557873725891),
 ('airliners', 0.6483810544013977),
 ('crashed', 0.6441882252693176),
 ('tents', 0.6228711009025574),
 ('flames', 0.6196696162223816),
 ('shadows', 0.6154047250747681),
 ('rubble', 0.6119235754013062)]

In [52]:
period_model.wv.most_similar('soldier', topn = 10)

[('journalist', 0.7668976783752441),
 ('man', 0.7554538249969482),
 ('woman', 0.7496949434280396),
 ('hero', 0.7309134006500244),
 ('statesman', 0.7268860936164856),
 ('warrior', 0.7024351358413696),
 ('aviator', 0.6905612349510193),
 ('firefighter', 0.6881330609321594),
 ('politician', 0.6843069791793823),
 ('veteran', 0.6815338134765625)]

In [74]:
period_model.wv.most_similar('army', topn = 10)

[('navy', 0.8557624220848083),
 ('infantry', 0.759527325630188),
 ('commander', 0.7565478682518005),
 ('armys', 0.7555398344993591),
 ('marine', 0.7459832429885864),
 ('battalion', 0.739398717880249),
 ('airborne', 0.73427414894104),
 ('naval', 0.732434868812561),
 ('marines', 0.7063413262367249),
 ('wing', 0.6886097192764282)]

### Subtracting Vectors

We can subtract one word's vector from another's to see which words are associated with the first term and not the second.

In [77]:
# Which words are associated with army and not navy? 

diff = period_model.wv['army'] - period_model.wv['navy']
period_model.wv.similar_by_vector(diff)

[('army', 0.6420791149139404),
 ('infantryman', 0.4506945013999939),
 ('philippine', 0.4179878532886505),
 ('liberation', 0.3949863314628601),
 ('commander', 0.3947765827178955),
 ('infantry', 0.3868419826030731),
 ('invasion', 0.37694051861763),
 ('allied', 0.3742254078388214),
 ('marine', 0.37334108352661133),
 ('filipino', 0.3707565367221832)]

In [79]:
# Which words are associated with woman and not man? 

diff = period_model.wv['woman'] - period_model.wv['man']
period_model.wv.similar_by_vector(diff)

[('mental', 0.40712496638298035),
 ('medicaid', 0.39979588985443115),
 ('immunizations', 0.3840039372444153),
 ('referrals', 0.3836553394794464),
 ('outpatient', 0.37981879711151123),
 ('admissions', 0.37965041399002075),
 ('female', 0.37700706720352173),
 ('adolescent', 0.3649798333644867),
 ('exams', 0.3635528087615967),
 ('ag', 0.3635350167751312)]

In [78]:
# Which words are associated with man and not woman? 

diff = period_model.wv['man'] - period_model.wv['woman']
period_model.wv.similar_by_vector(diff)

[('man', 0.5116848349571228),
 ('honesty', 0.4083215296268463),
 ('bipartisanship', 0.397760808467865),
 ('gesture', 0.3768903613090515),
 ('affection', 0.3675096035003662),
 ('boundless', 0.36692705750465393),
 ('love', 0.36035898327827454),
 ('humility', 0.35714954137802124),
 ('exemplifies', 0.35624781250953674),
 ('warrior', 0.3549908399581909)]

In [80]:
# Which words are associated with christian and not muslim? 

diff = period_model.wv['christian'] - period_model.wv['muslim']
period_model.wv.similar_by_vector(diff)

[('universitys', 0.62779700756073),
 ('pioneering', 0.5837754607200623),
 ('brown', 0.5735412836074829),
 ('marshall', 0.5511050224304199),
 ('christian', 0.5446071028709412),
 ('cal', 0.5439032316207886),
 ('ripken', 0.5431542992591858),
 ('episcopal', 0.536072850227356),
 ('johns', 0.5347297191619873),
 ('william', 0.5316272377967834)]

In [81]:
# Which words are associated with muslim and not christian? 

diff = period_model.wv['muslim'] - period_model.wv['christian']
period_model.wv.similar_by_vector(diff)

[('exports', 0.4480922520160675),
 ('arab', 0.42241916060447693),
 ('imports', 0.4129955768585205),
 ('muslims', 0.4064711928367615),
 ('wool', 0.4007987678050995),
 ('except', 0.3967370390892029),
 ('estates', 0.3811449110507965),
 ('ome', 0.37807828187942505),
 ('somehow', 0.37490472197532654),
 ('are', 0.3723890483379364)]
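
The arithmetic behind these queries can be sketched with plain Python. The example assumes that the difference vector is compared against every word by cosine similarity, which is how `similar_by_vector()` is understood to rank its results. The 2-dimensional vectors and helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def rank_by_similarity(query, vectors, topn=3):
    """Return the topn (word, score) pairs most similar to the query vector."""
    scored = [(word, cosine_similarity(query, vec)) for word, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors: the second coordinate loosely separates land from sea.
toy_vectors = {
    "army":     [1.0, 1.0],
    "navy":     [1.0, -1.0],
    "infantry": [0.8, 0.9],
    "ship":     [0.6, -0.9],
}

# Subtracting cancels what the two words share and keeps what sets them apart.
diff = subtract(toy_vectors["army"], toy_vectors["navy"])
for word, score in rank_by_similarity(diff, toy_vectors):
    print(word, round(score, 3))
```

In this toy space the shared first coordinate cancels out, so the words that rank highest are those on the "land" side of the remaining axis.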

### Find Similarity Score

We can also ask for the similarity score between two specific words.

In [None]:
period_model.wv.similarity('soldiers', 'men')