For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week12-word-context-vectors

# Hist 3368 - Week 12: Word Context Vectors with Gensim

#### By Jo Guldi


#### Word Vectors vs. Word Embeddings 

Wordcount vectors are just what we’ve looked at: a simple count of words, with one integer per every word.  Wordcount embeddings are similar. But they typically add one more row of data or more per document.  That might mean that there’s a count of how many nouns, verbs, or adjectives there are per document. That might mean that there’s a count of bigrams, trigrams, fourgrams, or more – or multi-word phrases, plus or minus a word, called a “skipgram.”  These “hidden layers” in word embedding models mean an even richer model of which documents are like other documents. Because they factor in grammar and sentence structure as well as lexicon, they produce models that are very good at matching rhetorical style in text, and getting at the nuances of grammatical meaning. That is to say, they’re good at noticing when you mean “apple” the fruit (which you might eat or make into pie) or “apple” the computer (which you might turn on or off).  

Functionally, you use word embeddings just the way you use wordcount vectors. You can measure the distance between them, just like we did in our notebook this week.  You can subtract them, just as we did, to get a litmus test of what’s different between two periods of time, or which words are used to signify masculinity and femininity.  
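
As a reminder of what "distance" and "subtraction" mean for vectors, here is a minimal sketch on two tiny count vectors. The vocabulary and the counts are invented for illustration; they do not come from the Congress data.

```python
import numpy as np

# Hypothetical vocabulary and made-up context counts (illustration only)
vocab = ["vote", "family", "army", "home"]
context_a = np.array([4.0, 1.0, 6.0, 2.0])   # counts of words near keyword A
context_b = np.array([3.0, 7.0, 1.0, 2.0])   # counts of words near keyword B

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

similarity = cosine_similarity(context_a, context_b)
distance = 1.0 - similarity                  # cosine distance

# Subtraction: positive entries are more typical of A, negative ones of B
difference = context_a - context_b
most_a = vocab[int(np.argmax(difference))]
most_b = vocab[int(np.argmin(difference))]
print(round(distance, 3), most_a, most_b)
```

The same two operations work unchanged on the dense vectors that a word embedding model produces; only the meaning of the individual numbers differs.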

Previously, you made word count vectors by brute force: 'grouping' your data by each word, creating a word vector for each word, and using log likelihood to measure the most distinctive collocates of each word.

This time, however, we'll use the GENSIM package of word embeddings to work on a larger-scale sample of debates. We'll use GENSIM's pre-built tools to do analysis comparable to what you did with cosine distance and vector subtraction:

     wv.similar_by_vector() -- which allows you to search by vector. It returns the words whose vectors are closest to the vector you supply, and so lets you find the words that commonly appear in the same context as a given keyword.  

#### Skill Building for Historical Analysis

By the end of this notebook, you'll know how to replicate most of the fancy work with vectors in the reading.  You'll be able to:
* use word context vectors to analyze the intellectual history of concept words like "freedom," "gay", or "woman," detecting how their context changed from moment to moment
* visualize changes to word concepts as a dendrogram
* use GENSIM's "most_similar()" to generate a list of the words most similar to any concept (for instance "freedom") at different moments over time
* visualize changes to the context of an individual word over time

### Load programs

In [1]:
!pip install gensim --user

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/ba/b3/668ace2f0517b7fb01f780f93a75cb0592754d6365d808d2adccb2a94b92/gensim-4.1.2-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1MB)
Collecting smart-open>=1.8.1
  Downloading https://files.pythonhosted.org/packages/cd/11/05f68ea934c24ade38e95ac30a38407767787c4e3db1776eae4886ad8c95/smart_open-5.2.1-py3-none-any.whl (58kB)
Collecting dataclasses; python_version < "3.7"
  Downloading https://files.pythonhosted.org/packages/fe/ca/75fac5856ab5cfa51bbbcefa250182e50441074fdc3f803f6e76451fab43/dataclasses-0.8-py3-none-any.whl
Installing collected packages: smart-open, dataclasses, gensim
Successfully installed dataclasses-0

In [85]:
import pandas as pd
import gensim
import string
import csv
import glob
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import scipy.spatial.distance
import matplotlib
import matplotlib.pyplot as plt
import itertools
import multiprocessing
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer
from nltk.corpus import wordnet as wn

In [86]:
def lemmatize_list(sentence):
    result = [wn.morphy(item) for item in sentence]
    return(result)

In [89]:
def structure_data(sentences, lemma, stopwords, stemmed):
   # smoosh everything together
    one_string = ' '.join(sentences)
     
    # break it into sentences 
    sentences =  sent_tokenize(one_string) 
    
    # remove punctuation
    sentences = [''.join(c for c in sentence if not c in string.punctuation) for sentence in sentences]

    # lowercase
    sentences = [sent.lower() for sent in sentences]

    # tokenize each sentence by splitting it on whitespace
    sentences_in_words = [sent.split() for sent in sentences]
    
    # build bigram model
    bigram_mdl = gensim.models.phrases.Phrases(sentences_in_words, min_count=1, threshold=2)

    # lemmatize the tokens in parallel; wn.morphy() returns None for words it does not recognize
    if lemma == True:
        with multiprocessing.Pool() as pool:
            sentences_in_words = pool.map(lemmatize_list, sentences_in_words)
        # drop the words that could not be lemmatized
        sentences_in_words = [[item for item in sentence if item is not None] for sentence in sentences_in_words] 

    # remove stopwords and/or do stemming
    from gensim.parsing.preprocessing import preprocess_string#, remove_stopwords#, #stem_text
    CUSTOM_FILTERS = []
    if stopwords == True:
        from gensim.parsing.preprocessing import remove_stopwords
        CUSTOM_FILTERS.append(remove_stopwords)
    if stemmed == True:
        from gensim.parsing.preprocessing import stem_text
        CUSTOM_FILTERS.append(stem_text)
        
    processed = [preprocess_string(" ".join(sentence), CUSTOM_FILTERS) for sentence in sentences_in_words]
    #processed = [[item for item in list if item] for list in processed]

    # apply bigram model to list
    result = [bigram_mdl[item] for item in processed]
        
    return(result)
   

### Define a Function for Preprocessing

Because Gensim likes data to be organized by sentence, we'll want to break a string of text up into a list of strings, each of which is a sentence.

We'll also want to have identified any bigrams in advance. 

Gensim offers simple built-in commands for both of these things:

    gensim.utils.tokenize(doc, lower=True) -- to break sentences into words
    gensim.models.phrases.Phrases(tokens, min_count=1, threshold=2) -- to find frequent bigrams in token strings.
    
The function 'structure_data' defined above packages tokenization and bigram detection together. (For tokenizing, it uses a plain `.split()` on whitespace after removing punctuation and lowercasing, rather than `gensim.utils.tokenize()`.)
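
To make the target format concrete, here is a simplified sketch of turning one string into a list of sentences, each of which is a list of lowercase words. The function name `split_into_token_lists` is hypothetical, and the regular expression is a crude stand-in for NLTK's `sent_tokenize()`: it would wrongly split after abbreviations such as "Mr.", which is one reason to prefer the real tokenizer.

```python
import re
import string

def split_into_token_lists(text):
    """Return a list of sentences, each a list of lowercase word tokens."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    strip_punct = str.maketrans("", "", string.punctuation)
    token_lists = []
    for sentence in raw_sentences:
        cleaned = sentence.translate(strip_punct).lower()
        tokens = cleaned.split()
        if tokens:                       # skip sentences with no words left
            token_lists.append(tokens)
    return token_lists

example = "I yield the floor. The Senate will come to order!"
print(split_into_token_lists(example))
```

A list of lists like this is the shape of data that `gensim.models.Word2Vec()` expects for its `sentences` argument.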

#### Load some Data

In [10]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


In [11]:
congress = pd.read_csv("congress1967-2010.csv")
# keep only the speeches from 1967 through 1983 (both conditions must hold at once)
all_data = congress[(congress['year'] >= 1967) & (congress['year'] <= 1983)].copy()
#congress = pd.read_csv("eighties_data.csv")

In [12]:
all_data[:5]

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,speech,date,speaker,word_count,year,month,month_year
0,0,0,Those who do not enjoy the privilege of the fl...,1967-01-10,The VICE PRESIDENT,16,1967,1,1967-01-01
1,1,1,Mr. President. on the basis of an agreement re...,1967-01-10,Mr. MANSFIELD,35,1967,1,1967-01-01
2,2,2,The Members of the Senate have heard the remar...,1967-01-10,The VICE PRESIDENT,40,1967,1,1967-01-01
3,3,3,The Chair lays before the Senate the following...,1967-01-10,The VICE PRESIDENT,151,1967,1,1967-01-01
4,4,4,Secretary of State.,1967-01-10,Mrs. AGNES BAGGETT,3,1967,1,1967-01-01


In [82]:
all_data['5yrperiod'] = np.floor(all_data['year'] / 5) * 5 # round each year DOWN to the start of its 5-year period -- by dividing by 5 and "flooring" to the lowest integer
all_data = all_data.drop(columns=['date', 'year', 'speaker','Unnamed: 0', 'Unnamed: 0.1', 'word_count', 'month'])

KeyError: 'year'

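*If you see a `SettingWithCopyWarning` above, it isn't an error. A `KeyError: 'year'`, on the other hand, means the cell has been run a second time, after the 'year' column was already dropped; reload the data and run the cell once.*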
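
The period calculation is easier to check on a handful of years. This small sketch uses integer division, which gives the same periods as `np.floor(year / 5) * 5` except that the results are integers rather than floats such as 1965.0. The function name `five_year_period` is made up for the example.

```python
def five_year_period(year):
    """Round a year DOWN to the first year of its five-year period."""
    return (year // 5) * 5

for year in (1965, 1967, 1969, 1970, 1983):
    print(year, "->", five_year_period(year))
```

Note that 1967 and 1969 both land in the period labeled 1965: the label is the start of the period, not the nearest multiple of five.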

In [14]:
all_data['index'] = np.arange(len(all_data)) # create an 'index' column

In [15]:
all_data.head()

Unnamed: 0,speech,month_year,5yrperiod,index
0,Those who do not enjoy the privilege of the fl...,1967-01-01,1965.0,0
1,Mr. President. on the basis of an agreement re...,1967-01-01,1965.0,1
2,The Members of the Senate have heard the remar...,1967-01-01,1965.0,2
3,The Chair lays before the Senate the following...,1967-01-01,1965.0,3
4,Secretary of State.,1967-01-01,1965.0,4


#### Downsample

In this exercise, the first pass, we're going to do some memory-intensive work on the computer by creating word context vectors 'by hand' -- i.e., using only SKLEARN's CountVectorizer() and .fit_transform() plus cosine distance and subtraction.  

Doing it this way is slower than loading some other packages that have been built specifically for working with large-scale wordcount vectors, where the code is packaged with high-dimensional math designed to make the comparisons run faster.

We're doing it this way, however, so that you can really see for yourself how a word vector is built and what's inside it at every moment.   

When we structure the data, build the vectors, subtract and measure the distance between vectors, we'll be able to inspect what's in the vector at every turn. You'd be able to do the math yourself if you looked more carefully.

Later in the notebook, we'll return to a 'word embedding' software package that uses high-dimensional math and hidden layers to make whip-fast vectors.

However, as we're doing old-fashioned vectors by hand, it'll go best if we "downsample" the data, taking a random sample of 5000 speeches from the Congress data loaded above (1967-1983).

Let's create some downsamples so we don't break the computer.

In [83]:
sample_l = all_data.sample(500000)
sample_m = sample_l.sample(50000)
sample = sample_m.sample(5000)
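
The nested sampling pattern can be sketched without pandas. The example below uses Python's `random` module on a list of stand-in row numbers, with a fixed seed (42, an arbitrary choice) so that the same rows are drawn on every run. With a dataframe, passing `random_state=` to `.sample()` serves the same purpose.

```python
import random

rows = list(range(100_000))          # stand-in for the row numbers of a big table

rng = random.Random(42)              # fixed seed so the draw is repeatable
large = rng.sample(rows, 5000)
medium = rng.sample(large, 500)      # drawn from `large`, not from `rows`
small = rng.sample(medium, 50)       # drawn from `medium`

# Each smaller sample is a subset of the one it was drawn from
print(len(large), len(medium), len(small))
print(set(small) <= set(medium) <= set(large))
```

Because each sample is drawn from the previous one, anything you find in the small sample is also present in the larger ones, which makes it easier to scale an analysis up step by step.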

## Introducing Gensim, a Tool for Studying Word Embeddings

At this point in the code, we're shifting from word vectors made with SKLEARN to word "embeddings" made with the GENSIM package.

GENSIM's Word2Vec does not store one count per word. Instead it trains a small neural network whose "hidden layer" compresses each word's contexts into a short, dense vector (100 numbers by default), meaning that we'll be able to deal with more information than the downsized sample above. Because those vectors are learned from the words that appear within a few positions of each keyword, they are designed to model the way that words are used in sentences more accurately than raw counts do. 

In [17]:
import gensim 

Gensim wants to work with a dataset of texts where each row is a sentence, organized as a list of words.

#### Break the data into sentences

We'll need to break our dataframe of speeches into sentences.

NOTE: the lines below may take a while. Splitting sentences and words can be intensive on a dataset of this scale. If it's not working for you, try *sample_l*, *sample_m* or *sample* where you see *all_data* below:

In [94]:
sentences = structure_data(all_data['speech'], lemma = False, stopwords = True, stemmed = True) # <---- swap all_data for sample_l, sample_m or sample here if this is too slow

In [95]:
sentences[:5]

[['enjoy', 'privilege', 'floor', 'retire', 'chamber'],
 ['mr_president'],
 ['basis', 'agreement', 'reach'],
 ['suggest', 'chamber', 'clear', 'attache'],
 ['absolutely', 'important', 'business', 'attend', 'chamber']]

We're now ready to model our Congress data with the help of GENSIM.

### Setting up GENSIM

The first step is to "train" the GENSIM model with the function `gensim.models.Word2Vec()`. This function has a couple dozen parameters, some of which are more important than others.

Here are a few major ones. In Gensim 4 (the version installed above) every parameter has a default value, so none is strictly mandatory, but the two marked with an asterisk are the ones you will almost always want to think about:

1. `sentences*`: This is where you provide your data. It must be an iterable of iterables: a list of sentences, each of which is a list of words.
2. `sg`: Your choice of training algorithm. There are two standard ways of training W2V vectors -- 'skipgram' and 'CBOW'. If you enter 1 here, skip-gram is applied; otherwise, the default is CBOW.
3. `vector_size*` (called `size` before Gensim 4): This is the length of your resulting word vectors. The default is 100. If you have a large corpus (>few billion tokens) you can go up to 100-300 dimensions. More dimensions can capture finer distinctions, but only when there is enough text to train them.
4. `window`: This is the window of context words you are training on. In other words, how many words come before and after your given word. The default is 5. A good number is 4 here but this can vary depending on what you are interested in. For instance, if you are more interested in embeddings that embody semantic meaning, smaller window sizes work better. 
5. `alpha`: The learning rate of your model. If you are interested in machine learning experimentation with your vectors you may experiment with this parameter.
6. `seed` (int): This is the random seed for your random initialization. All deep learning models initialize the weights with random floats before training. This is a useful field if you want to replicate your experiments because giving this a seed will initialize 'randomly' deterministically. (For a fully repeatable run you also need `workers = 1`, since the order in which several threads finish is not fixed.)
7. `min_count`: This is the minimum frequency threshold. If a given word appears with lower frequency than provided it will be ignored. This is here because words with very low frequency are hard to train.
8. `epochs` (called `iter` before Gensim 4): This is the number of iterations (entire runs) over the corpus. The default is 5, and usually anything between 1-10 is ok. The trade-offs are that if you have higher iterations, it will take longer to train and the model may overfit on your dataset. However, longer training will allow your vectors to perform better on tasks relevant to your dataset.

Most of these settings will not concern us. As you'll see below, we are only going to use three arguments.

In [None]:
congress_model = gensim.models.Word2Vec(
    sentences = sentences,
    workers = 30, # number of worker threads; lower this if you have fewer CPU cores available
    min_count = 10 # ignore words that appear fewer than 10 times
    ) 
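
To make the `window` parameter described above more concrete, here is a minimal sketch of which (keyword, context word) pairs a given window size produces for one sentence. The function `context_pairs` is hypothetical and is not part of Gensim. Real training has refinements that this sketch ignores, such as subsampling very frequent words.

```python
def context_pairs(tokens, window):
    """List (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                   # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["suggest", "chamber", "clear", "attache"]
for pair in context_pairs(sentence, window=1):
    print(pair)
```

A larger window pairs each keyword with more distant words, so the resulting vectors tend to reflect the broader topic of a sentence rather than only its immediate neighbors.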

### Save the model

Let's also save our model in case we want to use it again in a later session.

Change this filename to reflect whatever you are doing now.

In [None]:
filename = 'congress_model-1967-2010-full-lemmatized-stopworded-bigrammed'

In [None]:
congress_model.save(filename)

And you can load a model in the same way (remember this from our topic model).

In [67]:
congress_model = gensim.models.Word2Vec.load(filename) 

## What's in the model?

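The attribute `wv.index_to_key` is a list of the words in our model, ordered from most to least frequent. (But careful! Printing all of `congress_model.wv.index_to_key` or `congress_model.wv.key_to_index` will print out every word in the model's vocabulary -- a very long list! Slice it, as below.)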

In [69]:
congress_model.wv.index_to_key[:25]

['senat',
 'state',
 'amend',
 'committe',
 'time',
 'mr_presid',
 'year',
 'nation',
 'program',
 'hous',
 'gentleman',
 'mr_speaker',
 'congress',
 'feder',
 'presid',
 'peopl',
 'legisl',
 'govern',
 'provid',
 'ask',
 'act',
 'member',
 'need',
 'servic',
 'countri']

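The model itself is -- like the SKLEARN CountVectorizer matrix -- a matrix of vectors. Every row corresponds to one word. Unlike the count matrix, the numbers in each row are learned weights rather than raw counts, which is why they can be negative or fractional. We can call the entire matrix or call up one row at a time.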

In [70]:
congress_model.wv.vectors

array([[-2.4754717e+00, -4.4011908e+00, -4.6082503e-01, ...,
        -4.2563915e-01, -3.5345390e-01,  1.9237659e+00],
       [-2.4642570e+00, -2.7159767e+00, -1.7856932e-01, ...,
        -1.3992382e+00,  1.2022231e+00, -2.0550833e+00],
       [-1.0655633e+00, -2.8813925e-01, -7.5352246e-01, ...,
        -1.5756048e+00,  2.4891226e-01,  7.7211171e-01],
       ...,
       [ 1.9001320e-02, -2.9346248e-02, -3.4601636e-02, ...,
        -1.9130716e-02, -3.1579260e-02,  1.5718418e-03],
       [-1.1716643e-03, -4.0329710e-02, -7.0127897e-02, ...,
        -1.5042054e-02, -1.2530140e-02, -4.7811344e-02],
       [-1.8255347e-02, -2.0252684e-02, -4.9519010e-02, ...,
         7.4810158e-03, -2.2701774e-02, -3.9323349e-02]], dtype=float32)

Here's the fourth row of the model, represented as a word and as a vector:

In [71]:
word = congress_model.wv.index_to_key[3]
word

'committe'

In [72]:
congress_model.wv[word]

array([-2.9128954 , -4.3240027 , -0.44421792,  0.17803319, -1.6887195 ,
       -1.8417858 , -2.382663  , -0.54376405, -0.14213592,  2.4068887 ,
       -0.8008058 , -2.0471299 ,  1.2257737 , -0.813561  ,  0.26949748,
        2.1343286 , -1.7320954 , -2.3215182 , -1.7046514 ,  1.5707619 ,
       -1.5329007 , -1.7563577 , -0.2608846 ,  0.6554528 , -0.6724577 ,
        1.6333065 , -1.309933  ,  0.7698269 ,  3.6186852 ,  1.5208755 ,
       -1.013507  , -0.18529823, -0.35002238,  2.0540264 , -1.26692   ,
        1.6140392 , -0.40102732, -0.7102088 , -1.7623308 , -2.326815  ,
       -3.711199  , -1.7033322 , -0.8530236 ,  3.6067219 ,  0.46784082,
       -0.15662834, -1.1279367 , -0.13018279, -1.5130849 , -1.2077358 ,
        0.1416394 ,  2.8089206 , -2.4010386 , -1.738008  , -0.50861233,
       -2.4322774 , -0.30739444,  0.15793154,  2.0183172 ,  0.8630274 ,
        0.94616544,  1.2690661 , -0.70399517, -2.5525053 , -0.80706865,
        0.2644899 , -0.5465852 , -2.489552  ,  0.8314482 , -0.19

In [73]:
congress_model.wv.vectors[3]

array([-2.9128954 , -4.3240027 , -0.44421792,  0.17803319, -1.6887195 ,
       -1.8417858 , -2.382663  , -0.54376405, -0.14213592,  2.4068887 ,
       -0.8008058 , -2.0471299 ,  1.2257737 , -0.813561  ,  0.26949748,
        2.1343286 , -1.7320954 , -2.3215182 , -1.7046514 ,  1.5707619 ,
       -1.5329007 , -1.7563577 , -0.2608846 ,  0.6554528 , -0.6724577 ,
        1.6333065 , -1.309933  ,  0.7698269 ,  3.6186852 ,  1.5208755 ,
       -1.013507  , -0.18529823, -0.35002238,  2.0540264 , -1.26692   ,
        1.6140392 , -0.40102732, -0.7102088 , -1.7623308 , -2.326815  ,
       -3.711199  , -1.7033322 , -0.8530236 ,  3.6067219 ,  0.46784082,
       -0.15662834, -1.1279367 , -0.13018279, -1.5130849 , -1.2077358 ,
        0.1416394 ,  2.8089206 , -2.4010386 , -1.738008  , -0.50861233,
       -2.4322774 , -0.30739444,  0.15793154,  2.0183172 ,  0.8630274 ,
        0.94616544,  1.2690661 , -0.70399517, -2.5525053 , -0.80706865,
        0.2644899 , -0.5465852 , -2.489552  ,  0.8314482 , -0.19
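
The two lookups above return the same numbers because a word's vector is simply one row of the matrix. The sketch below imitates that arrangement with toy stand-ins for `wv.index_to_key`, `wv.key_to_index` and `wv.vectors`. The four words are taken from the output above, but the vector values are random and the `lookup` function is invented for the example.

```python
import numpy as np

# Toy stand-ins for wv.index_to_key, wv.key_to_index and wv.vectors
index_to_key = ["senat", "state", "amend", "committe"]
key_to_index = {word: i for i, word in enumerate(index_to_key)}
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(index_to_key), 5)).astype(np.float32)

def lookup(word):
    """Return the row of `vectors` that belongs to `word`."""
    return vectors[key_to_index[word]]

word = index_to_key[3]               # fourth row, because counting starts at 0
print(word, np.array_equal(lookup(word), vectors[3]))
```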

#### Inspecting Word Context with the GENSIM model, one word at a time

The GENSIM model has all sorts of tools built in for navigating and inspecting vectors.  We will make use of the

    most_similar()

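command, which calls up the words whose vectors are closest to a given word's vector -- that is, the words used in the most similar contexts -- along with a similarity score for each.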

In [74]:
congress_model.wv.most_similar("women", topn = 20)

[('femal', 0.7471427321434021),
 ('negro', 0.7277549505233765),
 ('young_men', 0.7235952615737915),
 ('men_women', 0.708111584186554),
 ('teenag', 0.6955069303512573),
 ('youth', 0.6834402680397034),
 ('wive', 0.6701415777206421),
 ('mexicanamerican', 0.6671103239059448),
 ('male', 0.6652931571006775),
 ('religi', 0.6620873212814331),
 ('sex', 0.6553176045417786),
 ('depriv', 0.6506348848342896),
 ('young_women', 0.650577962398529),
 ('group', 0.6495906710624695),
 ('hispan', 0.6455756425857544),
 ('black', 0.6396685838699341),
 ('older_worker', 0.638055682182312),
 ('youngster', 0.6380534768104553),
 ('adult', 0.6349009871482849),
 ('parent', 0.6277905702590942)]
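
To see roughly what a similarity search does under the hood, here is a hand-rolled sketch that ranks words by cosine similarity. The five words and their three-dimensional vectors are made up, and this `most_similar` function is an illustration rather than Gensim's own implementation.

```python
import numpy as np

# Made-up 3-dimensional vectors for a hypothetical five-word vocabulary
words = ["women", "femal", "youth", "budget", "tax"]
vectors = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [0.7, 0.5, 0.3],
    [0.1, 0.2, 0.9],
    [0.2, 0.1, 0.8],
])

def most_similar(query, topn=3):
    """Rank all other words by cosine similarity to `query`."""
    # Scale every row to length 1 so a dot product equals the cosine
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ unit[words.index(query)]
    order = np.argsort(-scores)          # highest similarity first
    ranked = [(words[i], float(scores[i])) for i in order if words[i] != query]
    return ranked[:topn]

for word, score in most_similar("women"):
    print(word, round(score, 3))
```

The scores are cosine similarities, so a value near 1 means two words point in almost the same direction in the model's vector space.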

In [75]:
congress_model.wv.most_similar("soldier", topn = 20)

[('american_soldier', 0.8551445007324219),
 ('wound', 0.8450905680656433),
 ('shot', 0.8175531625747681),
 ('maim', 0.8090488314628601),
 ('dy', 0.808813214302063),
 ('dead', 0.8079124093055725),
 ('men', 0.804377555847168),
 ('brutal', 0.8016231060028076),
 ('kill', 0.7979767322540283),
 ('beirut', 0.7927286028862),
 ('vietcong', 0.7925238609313965),
 ('battlefield', 0.7914227247238159),
 ('captur', 0.7881661653518677),
 ('hitler', 0.7857232689857483),
 ('crew', 0.7825039029121399),
 ('fighter', 0.7777116298675537),
 ('enemi', 0.7770015597343445),
 ('their', 0.7758663296699524),
 ('gallant', 0.7749981880187988),
 ('brave_men', 0.7722310423851013)]

In [76]:
congress_model.wv.most_similar("man", topn = 20)

[('knew', 0.8173918724060059),
 ('humor', 0.7942655086517334),
 ('honesti', 0.787624716758728),
 ('love', 0.768494188785553),
 ('god', 0.7611181735992432),
 ('public_servant', 0.7589733600616455),
 ('humbl', 0.7589018940925598),
 ('legaci', 0.7477531433105469),
 ('admir', 0.7477036118507385),
 ('courag', 0.7426941990852356),
 ('reput', 0.731036365032196),
 ('displai', 0.7307682037353516),
 ('warm', 0.7305134534835815),
 ('underdog', 0.7238001823425293),
 ('inspir', 0.719451367855072),
 ('charact', 0.7173877358436584),
 ('vision', 0.7162234783172607),
 ('gentl', 0.7155197262763977),
 ('fellow_man', 0.7104278802871704),
 ('woman', 0.7098684906959534)]

#### Interpreting vector similarity

Let's look at the word context vectors that are most similar to 'men'.

In [None]:
congress_model.wv.most_similar("men", topn = 20)

We find that men are spoken about in almost entirely the same context as women. But if women are spoken about in the same context as children, men are spoken about slightly more often in the same context as their homes. (What you see may vary with a different sample.)

**Remember**: everything the model knows it knows from our corpus. What we're learning are assumptions *immanent* to the corpus.  These aren't FACTS about women or men -- these are data about how women and men were spoken about in Congress in the years covered by our data (1967-1983, as filtered above).

Both `word2vec` and our model have limitations.

Additionally, our training set is selective and, if you are working from one of the samples, small. Therefore, our results can return some wild cards. 


In [None]:
america_vector = congress_model.wv['america']
congress_model.wv.similar_by_vector(america_vector)

So this is pretty straightforward -- America is talked about in terms of Americans, the world, and prosperity.  Nothing to see here.

Our other method gives the same results:

In [None]:
congress_model.wv.most_similar("america", topn = 10)

But when we look for other words that are spoken about with the same language as America -- the answers are quite telling.

Wow. America is spoken about like a few other places in the world.  Some versions of the output suggest that we speak of America with the same language in which we invoke democracy, drugs, and the interests of different peoples, especially workers. (What you see may vary with a different sample.)

Try your own hand at interpreting these outputs. 

In [None]:
congress_model.wv.most_similar("iraq", topn = 10)

How do you interpret these similarities?

In [None]:
congress_model.wv.most_similar("britain", topn = 10)

## Subtracting Vectors

You'll recall that we've used vector subtraction before.  Subtracting the context for "woman" from the context for "man" produces a vector that scores highly for the words that appear around "man" but not around "woman."

In [None]:
diff = congress_model.wv['man'] - congress_model.wv['woman']
congress_model.wv.similar_by_vector(diff)

In [None]:
diff = congress_model.wv['woman'] - congress_model.wv['man']
congress_model.wv.similar_by_vector(diff)

## Adding Vectors to Find Substitutes

In [None]:
congress_model.wv.most_similar("women", topn = 1000)[:5]

Store just the words as a list.

In [None]:
women_context = [word[0] for word in congress_model.wv.most_similar("women", topn = 100)]
women_context[:10]

Let's take the context vectors of each word and add them together.

In [None]:
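# start from the vector of the first context word
# (the name `sum` is avoided because it would shadow Python's built-in function)
context_sum = congress_model.wv[women_context[0]]

# add the vector of every remaining context word
for word in women_context[1:]:
    context_sum = context_sum + congress_model.wv[word]

congress_model.wv.similar_by_vector(context_sum)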

Notice that the output of this process is a different list than we started with.

What have we done? We've taken the word context vectors for the 100 words that most commonly co-occur with the word 'women.'  We've added those word context vectors together.

In essence, we've taken the context for 'women' and asked, 'what other words might substitute for the word 'women,' given the same context?'

Our final list is essentially the *functional synonyms* for the word 'women.'  The words in this list could functionally be substituted for 'women' in most sentences in which the word 'women' is used, with the same meaning -- at least, from the point of view of the speakers.  
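
One detail worth knowing: because cosine similarity ignores the length of a vector, adding the context vectors together and averaging them lead to the same similarity scores. The sketch below checks this on random, made-up vectors; nothing in it comes from the Congress model.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_vectors = rng.normal(size=(50, 8))     # 50 made-up word vectors
neighbours = vocab_vectors[:10]              # pretend these are the nearest words

summed = neighbours.sum(axis=0)              # what adding one by one in a loop builds
averaged = neighbours.mean(axis=0)           # the same direction, one tenth as long

def cosine_scores(query):
    """Cosine similarity between `query` and every row of vocab_vectors."""
    norms = np.linalg.norm(vocab_vectors, axis=1) * np.linalg.norm(query)
    return (vocab_vectors @ query) / norms

same_scores = np.allclose(cosine_scores(summed), cosine_scores(averaged))
print(same_scores)
```

So the choice between summing and averaging is a matter of convenience: the ranking of nearby words depends only on the direction of the combined vector.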

We are looking here at a powerful tool for understanding stereotypes.  Digital scholar Richard Jean So has used a similar process to show that 'homeless' was functionally a synonym for 'black' in the novels written by white people of the twentieth century.

### Distance and Similarity with Vectors in GENSIM

We can also make *quantitative* measurements about how close or far apart any two vectors are based on their usage.

'Similarity' in this case is a mathematical statistic, calculated as the cosine similarity between any two vectors -- it's 1 minus cosine distance.  You've used cosine distance before -- you're a whiz with cosine distance already. 

With similarity, the higher the number, the more alike two terms are in the context in which they are used. 

When we used cosine distance before, we were doing it one vector at a time.  

In [None]:
congress_model.wv.similarity('women', 'men')

In [None]:
congress_model.wv.similarity('soldier', 'men')

In [None]:
congress_model.wv.similarity('women', 'individu')

#### Visualize the similarities

Some researchers have used these similarity 'scores' to study how ideas were related in the past.  Scholars have sometimes produced a "dendrogram" of words related to other words, which we learned was created on the basis of cosine distance scores between word vectors.

This dendrogram was used to compare the meaning of "freedom" in the seventeenth century (when the word was nearest in meaning to "friendship") to the meaning of "freedom" in the eighteenth century (when the word became associated with nations and patriotism).

Let's see if we can make a dendrogram of words for our model.

The 

    linkage()
    
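command performs hierarchical clustering -- in other words, it measures the distance between every pair of vectors (below we ask for 'seuclidean', a standardized Euclidean distance), and then repeatedly merges the closest vectors and clusters into a tree.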

In [None]:
keywords = ['dream', 'war', 'fight', 'racism', 'wealth', 'delight',# 'today', 
            'tomorrow', 'past', 'present', 'futur',
            'america', 'ireland', 'britain', 'iraq', 'china', 'democrat', #'dictator',
            'totalitarian', #'democracy', 
            #'charity', 'socialism','communism', 
            'russia', 'congress', 'riot','protest']

NOTE: if you get an error because any of the words above aren't in your sample corpus, edit the list and try again.

In [None]:
keyword_vectors = congress_model.wv[keywords]
keyword_vectors

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
links = linkage(keyword_vectors, method='complete', metric='seuclidean')
links
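
The `metric='seuclidean'` argument asks for a *standardized* Euclidean distance, in which each dimension is divided by its variance so that no single dimension dominates. Here is a minimal NumPy sketch of that calculation on three made-up vectors. It assumes SciPy's documented default of using the per-dimension sample variance (`ddof=1`) of the data; the helper function is written for this example only.

```python
import numpy as np

# Three made-up vectors; the second column has a much larger spread
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 200.0],
])

variances = X.var(axis=0, ddof=1)            # per-dimension sample variance

def seuclidean(u, v):
    """Euclidean distance after dividing each squared difference by its variance."""
    return float(np.sqrt(np.sum((u - v) ** 2 / variances)))

plain = float(np.linalg.norm(X[0] - X[1]))   # ordinary Euclidean distance
standardized = seuclidean(X[0], X[1])
print(round(plain, 3), round(standardized, 3))
```

In the plain distance the second column swamps the first; after standardizing, both columns contribute on a comparable scale.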

This ranking gives us a read of which vectors are closest to which vectors.  We can visualize it using matplotlib and the "dendrogram" command from SciPy (imported above from `scipy.cluster.hierarchy`):

In [None]:
from matplotlib import pyplot as plt

l = links

# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.ylabel('word')
plt.xlabel('distance')

dendrogram(
    l,
    leaf_rotation=0,  # rotates the x axis labels
    leaf_font_size=16,  # font size for the x axis labels
    orientation='left',
    leaf_label_func=lambda v: str(keywords[v])
)
plt.show()


With a little tweaking, you can create a list of only those vectors for the words most of interest to you, using GENSIM to visualize their similarity to each other in the corpus.

You could even -- like Connell's blog entry indicates -- create a separate dendrogram for 1985 and another for 2005, to see how these terms have changed.

### Visualizing Abstract Relatedness

Similarity scores can also be used to visualize words as points in space where each word represents a single point.

These points represent words' relationships with one-another.

The code that follows is borrowed from digital humanist Dan Sinykin.

In [None]:
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.decomposition import PCA
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors

In [None]:
#%matplotlib inline

def display_pca_scatterplot(model, words=None, sample=0):
    # `model` is a KeyedVectors object such as congress_model.wv
    if words is None:
        if sample > 0:
            # pick `sample` distinct words at random
            words = np.random.choice(list(model.key_to_index.keys()), sample, replace=False)
        else:
            words = [ word for word in model.key_to_index ]
        
    word_vectors = np.array([model[w] for w in words])

    # project the vectors onto their first two principal components
    twodim = PCA().fit_transform(word_vectors)[:,:2]
    
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)

In [None]:
display_pca_scatterplot(congress_model.wv, keywords)

Truth be told, I don't love this visualization; it's visualizing abstract relationships that show the conceptual distance between different entities in the model. I present it to you as a cute toy, not as an approved visualization that I'd like to see in your work. 

Please use PCA analysis with care; it's almost impossible to get back to what it actually *means* -- at least without pairing it with other visualizations and measures.
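
One reason for caution is that a two-dimensional plot throws away most of the information in a 100-dimensional vector. The sketch below performs PCA by hand on random, made-up vectors and reports what share of the variance the first two components keep. With a fitted scikit-learn `PCA` object, the `explained_variance_ratio_` attribute reports the same kind of figure.

```python
import numpy as np

rng = np.random.default_rng(2)
word_vectors = rng.normal(size=(20, 100))    # 20 made-up 100-dimensional vectors

# PCA by hand: centre the data, then take the singular value decomposition
centred = word_vectors - word_vectors.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centred, full_matrices=False)

# Each component's share of the total variance
explained = singular_values ** 2 / np.sum(singular_values ** 2)
kept = explained[:2].sum()                   # share shown in a 2-D scatterplot
print(f"first two components keep {kept:.1%} of the variance")
```

If that share is small for your own keywords, distances on the scatterplot say little about distances in the full model.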

## Studying Change Over Time with GENSIM

If we want to go further and use GENSIM as a tool for studying change over time, we'll want to organize the data by the features of the data that we care about.

In this case, we want to be able to investigate keywords, but also 5-year-periods.  We have a column called '5yrperiod,' and we want to make sure that this column is part of our dataset.



Here are the names of the unique periods in this dataset:

In [28]:
periodnames = sample_m['5yrperiod'].unique().tolist()
periodnames

[1970.0, 1965.0, 1975.0, 1980.0]

Now that we have a list of periods, we can tell Gensim to make a 'model' for each period. And we can use these models to compare how the usage of each word changes over time.

***The following cell might take a while, since we're creating a separate gensim model for each period (four of them here). Fortunately, we're saving all of them, so if you want to go back and run this for a different word later, you can just load the old data rather than running the whole thing again.  If you want to rerun the code, comment out the gensim command and instead use period_model = gensim.models.Word2Vec.load() to load the saved model, as in the search cell further below -- it will be much less time consuming.***

In [29]:
dataname = 'sample-m'

In [30]:
cd '/scratch/group/history/hist_3368-jguldi/congress-embeddings'

/scratch/group/history/hist_3368-jguldi/congress-embeddings


In [31]:
for period1 in periodnames:
    print('working on ', period1)

    # grab the data from period1
    period_data = sample_m[sample_m['5yrperiod'] == period1] # select one period at a time
    
    # structure the data for Gensim
    period_sentences = structure_data(period_data['speech'], lemma = False, stopwords = True, stemmed = True)
    
    # make the Gensim model
    period_model = gensim.models.Word2Vec( # make a gensim model for that data
        sentences = period_sentences,
        min_count = 2)
    
    # save it
    period_model.save(dataname + '-model-' + str(period1)) # save the model with the name of the period


working on  1970.0
working on  1965.0
working on  1975.0
working on  1980.0


As a result of this code, we have saved in the congress-embeddings folder a "period model" labeled with the name of your data and each period.  We can call up each period one at a time to get information about how any individual words were talked about.  

### Search the Period Models for a Keyword

The period models aren't very interesting in themselves, but they allow us to efficiently search for how the context of a keyword changes over time.

Let's search each 5-yr-period for a keyword and save the results as the variable *keyword_context*.

In [32]:
keyword1 = 'women'

In [33]:
filename = 'sample-congress-model-'

In [34]:
cd '/scratch/group/history/hist_3368-jguldi/congress-embeddings'

/scratch/group/history/hist_3368-jguldi/congress-embeddings


In [35]:
#########  after the first run, use this line to call the old data without generating it again
keyword_context = []
dates_found = []

# cycle through each period
for period1 in periodnames:
    print('working on ', period1)
    
    # load the model from period1
    period_model = gensim.models.Word2Vec.load(dataname + '-model-' + str(period1)) # to load a saved model

    ## is the keyword found?
    if keyword1 in period_model.wv.key_to_index:
        print('found ', keyword1)
        
        # get the context vector for keyword1
        keyword_context_period = period_model.wv.most_similar(keyword1, topn = 5000) 
        
        # save it for later
        keyword_context.append(keyword_context_period) # save the context of how women were talked about for later
        dates_found.append(period1)

working on  1970.0
found  women
working on  1965.0
found  women
working on  1975.0
found  women
working on  1980.0
found  women


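The variable *keyword_context* is a list of lists, one for each period in which the keyword was found. Each inner list holds (word, score) pairs for the words whose vectors are most similar to keyword1 in that period. The periods appear in the same order as *dates_found*, which follows *periodnames* and is not necessarily chronological (in the output above, 1970 comes before 1965).

   * keyword_context[0] holds the words most similar to keyword1 in the period dates_found[0]
   * keyword_context[1] holds the words most similar to keyword1 in the period dates_found[1]
   * keyword_context[2] holds the words most similar to keyword1 in the period dates_found[2]
   * ...and so on
   
We can use this list to study how the context of 'women' was changing from period to period.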
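
One way to read this structure is to follow a single context word across periods. The sketch below uses toy stand-ins shaped like `keyword_context` and `dates_found`; the words and scores are invented for illustration, and `score_by_period` is a hypothetical helper, not something defined in the notebook.

```python
# Toy stand-ins shaped like the notebook's variables (the words and scores are invented).
dates_found = [1970.0, 1965.0, 1975.0]
keyword_context = [
    [('men', 0.81), ('children', 0.74), ('workers', 0.70)],     # 1970
    [('men', 0.85), ('girls', 0.72), ('children', 0.69)],       # 1965
    [('workers', 0.79), ('men', 0.77), ('minorities', 0.75)],   # 1975
]

def score_by_period(word, keyword_context, dates_found):
    """Return {period: score} for `word`, skipping periods where it is not listed."""
    scores = {}
    for period, context in zip(dates_found, keyword_context):
        lookup = dict(context)           # word -> similarity score for this period
        if word in lookup:
            scores[period] = lookup[word]
    return dict(sorted(scores.items()))  # put the periods in chronological order

men_scores = score_by_period('men', keyword_context, dates_found)
print(men_scores)
```

Pairing each list with its entry in `dates_found` avoids having to remember which index belongs to which period.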

In [None]:
keyword_context[0][0:15]

I can also grab just the names from the keyword vectors this way:

In [None]:
[item[0] for item in keyword_context[1]][:5]

I can grab just the numbers for any given period (in this case, the second period in *dates_found*, index [1]) this way:

In [None]:
[item[1] for item in keyword_context[1]][:5]
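
The numbers returned by `most_similar` are cosine similarities between word vectors: values near 1 mean two words are used in very similar contexts, and values near 0 mean their contexts are unrelated. As a rough sketch of the arithmetic (the short vectors here are invented; real Word2Vec vectors have many more dimensions), the score can be computed by hand like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors, chosen to show the three landmark cases.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])                 # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])                 # -1
```

Because the score depends only on direction and not on vector length, a rare word and a common word can still come out as close neighbors.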

#### Visualize it


In [None]:
# helper function to abstract only unique values while keeping the list in the same order -- the order of first appearance
def unique2(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
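
To see what the helper does, here is a small check on an invented list of words. It also shows that, under the assumption of Python 3.7 or later (where dictionaries keep insertion order), `dict.fromkeys` gives the same result.

```python
def unique2(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

sample = ['men', 'children', 'men', 'workers', 'children']  # invented example words

kept_in_order = unique2(sample)              # duplicates dropped, first appearances kept
same_result = list(dict.fromkeys(sample))    # built-in alternative with the same ordering
```

Keeping the order of first appearance matters later on, because a word's position in this list decides which color it gets in the plot.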

Make a flattened list of all the words.

In [None]:
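numwords = 10  # how many of the top words to keep for each period

# collect the top words for each period (a list of lists)
all_words = []
for i in range(len(dates_found)):
    words = [item[0] for item in keyword_context[i]][:numwords]
    all_words.append(words)

# flatten the list of lists into one long list of words
all_words2 = []
for word_list in all_words:   # avoid naming a variable `list`, which would hide Python's built-in
    for word in word_list:
        all_words2.append(word)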


Set up the colors.

In [None]:
from numpy import linspace
from matplotlib import cm
# colormaps expect values between 0 and 1; anything above 1 is clipped to the last color
colors = [ cm.jet(x) for x in linspace(0, 1, 50) ]

In [None]:
%matplotlib inline
#from matplotlib.colors import ListedColormap, LinearSegmentedColormap

import matplotlib.pyplot as plt
from adjustText import adjust_text
from numpy import linspace
from matplotlib import cm

colors = [ cm.viridis(x) for x in linspace(0, 1, len(unique2(all_words2))+10) ]

# change the figure's size here
plt.figure(figsize=(10,10), dpi = 200)

texts = []

# plt.annotate only plots one label per iteration, so we have to use a for loop 
for i in range(len(dates_found)):    # cycle through the period names
    
    #yyy = int(keyword_per_year[keyword_per_year['5yrperiod'] == int(xx)]['count'])   # how many times was the keyword used that year?
                     
    for j in range(numwords):     # cycle through the top words (numwords is set to 10 in the cell above)
        
        xx = dates_found[i]        # on the x axis, plot the period name
        yy = [item[1] for item in keyword_context[i]][j]         # on the y axis, plot the similarity score -- how closely the word is related to the keyword
        txt = [item[0] for item in keyword_context[i]][j]        # grab the name of each collocated word
        colorindex = unique2(all_words2).index(txt)   # this command keeps all dots for the same word the same color
        
        plt.scatter(                                             # plot dots
            xx, #x axis
            yy, # y axis
            linewidth=1, 
            color = colors[colorindex],
            edgecolors = 'darkgray',
            s = 100, # dot size
            alpha=0.8)  # dot transparency

        # make a label for each word
        texts.append(plt.text(xx, yy, txt))

# Code to help with overlapping labels -- may take a minute to run
adjust_text(texts, force_points=0.2, force_text=.7, 
                    expand_points=(1, 1), expand_text=(1, 1),
                    arrowprops=dict(arrowstyle="-", color='black', lw=0.5))

plt.xticks(rotation=90)

# Add titles
plt.title("What words were used in the same context as '" + keyword1 + "' in Congress?", fontsize=20, fontweight=0, color='Red')
plt.xlabel("period")
plt.ylabel("cosine similarity to keyword")


filename2 = 'words-used-in-context-of-' + keyword1 + '-' + filename
plt.savefig(filename2 + '.png')
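
The plot shows the top words for each period; the same comparison can be made in text by asking which words enter or leave the top of the list between two periods. The sketch below again uses invented stand-in data shaped like `keyword_context` and `dates_found`, and `top_words` and `compare_periods` are hypothetical helpers rather than part of the notebook.

```python
# Toy stand-ins shaped like the notebook's variables (the words and scores are invented).
dates_found = [1970.0, 1965.0, 1975.0]
keyword_context = [
    [('men', 0.81), ('children', 0.74), ('workers', 0.70)],     # 1970
    [('men', 0.85), ('girls', 0.72), ('children', 0.69)],       # 1965
    [('workers', 0.79), ('men', 0.77), ('minorities', 0.75)],   # 1975
]

def top_words(context, n):
    """Return just the words from the first n (word, score) pairs."""
    return [word for word, score in context[:n]]

def compare_periods(keyword_context, dates_found, earlier, later, n=10):
    """Compare the top-n context words of two periods using set arithmetic."""
    early = set(top_words(keyword_context[dates_found.index(earlier)], n))
    late = set(top_words(keyword_context[dates_found.index(later)], n))
    return {
        'dropped out': sorted(early - late),    # only in the earlier period
        'newly arrived': sorted(late - early),  # only in the later period
        'stayed': sorted(early & late),         # in both periods
    }

changes = compare_periods(keyword_context, dates_found, 1965.0, 1975.0, n=3)
print(changes)
```

A list of arrivals and departures like this can be a useful starting point for the interpretive writing, since it names the specific words whose relationship to the keyword changed.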

## Assignment

This week, you may choose a coding-intensive exercise OR an interpretive research question.

You do not have to do both.

#### Coding-intensive exercise
   * Create a list of keywords that you think would be particularly relevant for Congress during this time -- something that might demonstrate historical change in ideas.
        * Using the code above, create a GENSIM model for 1985, 1995, and 2005
        * Using the code above, create an array of vectors for your words for each time period
        * Using the code above, draw a dendrogram of keyword relatedness for the three time periods.
   * The code for the final visualization above shows the most common words used in the context of a keyword.  Tweak the code so that instead of showing *the keyword's context*, the visualization shows the *words that share the same context as the keyword in question.* This tweak should require about two lines of code.
     
        * Use the final visualization that you created above and its variation as the basis for an interpretive essay of one page.




#### Interpretive Research Question

In the code above, we learned how to streamline code to make it run over the complete data set.

Now that you can process *all* the data, you're ready for a more serious engagement with interpretation of historical questions, like:

* What are some of the ideas that changed in Congress over this time period? For instance, historians of this period frequently talk about the rise of a free-market ideology, a critique of the welfare state, arguments about the nature of democracy and America's role abroad. Can you support an argument about intellectual change on the basis of the changing context in which words were discussed?
* What groups of people were talked about in this period, and did the way they were spoken about change? Consider the role of women, minorities, the gay movement, and individuals who identify as religious in your answer.
* How did America's relationship with other nations change during this time period?  A historian might consider, for starters, the fall of the Berlin Wall in 1989 and the disintegration of the former USSR; the rise of terrorism, and the identification of Iraq, Iran, and Afghanistan as a frontier for US pacification; the border with Mexico and issues of immigration. Can you find systematic evidence of when and how one or more of these conversations changed in the data?

Choose one of the above questions. Iterate through a series of keyword queries and data results that would support a robust answer. Formulate an answer with at least one visualization and a page of single-spaced writing that analyzes historical change in detail.

Turn in your work on Canvas. Do not turn in an ipynb. 