# Hist 3368 - Week 10: Word Context Vectors

#### By Jo Guldi

#### From Word Vectors to Word Context Vectors

In previous notebooks, you've used word count vectors to compare the words most distinctive of companies and time periods, and to measure the abstract "distance" between different entities.

In our reading, we've learned that many scholars have applied word vectors to intellectual history. Word vectors can help you understand the changing profile of a word -- how its "context" was different in 1920 than in 1980.  For example, the word "gay" typically meant "happy" in 1920, but by 1990 it had come to mean "homosexual."  Word context is the study of the changing words that surrounded "gay" in both periods.

Wordcount vectors can be used to track changing word context. To perform such an investigation, you need to structure the data so that there is one vector for each word in every period of time.  This is called a "word context vector".

In this week's notebook, we'll make word context vectors for some words in Congress. We'll go through the following steps:
* First, we'll organize the data so that we have a dataframe of one word per row, where another column gives the sentence in which the word appears in Congress and the date.
* Then, we'll "groupby" keyword and period, so that we have an index for each keyword and period (for instance, "woman-1985") and a "context" column with every word next to which "woman" appears in the year 1985.
* Next, we'll use the word vector tools you already know -- SKLEARN's CountVectorizer(), .fit_transform -- to make vectors from this data.
* We'll use a measurement tool you already know -- cosine distance -- to compare context vectors for "woman" from 1985 to 1995 and 2005.
* We'll use a comparison tool you already know -- vector subtraction -- to create a "gender difference vector" whose low scores show words more likely to show up in the context of "woman" and whose high scores show words more likely to show up in the context of "man".
* Later in the notebook, we'll return to a 'word embedding' software package that uses high-dimensional math and hidden layers to make whip-fast vectors.

#### Word Vectors vs. Word Embeddings 

Wordcount vectors are just what we've looked at: a simple count of words, with one integer for every word in the vocabulary.  Word embeddings serve a similar purpose, but they are built differently. Instead of storing raw counts, an embedding model such as word2vec trains a small neural network to predict each word from the words around it (or the surrounding words from the word itself, an approach called "skip-gram").  The weights of the network's "hidden layer" become the word vectors: short, dense lists of decimal numbers, typically a few hundred values long, rather than one column per vocabulary word. Because words that appear in similar contexts end up with similar vectors, embeddings give a compact model of which words are used like other words. With enough training data, the contexts around a word can help distinguish discussions of "apple" the fruit (which you might eat or make into pie) from "apple" the computer (which you might turn on or off), although a basic word2vec model still stores only one vector per word.

Functionally, you use word embeddings just the way you use wordcount vectors. You can measure the distance between them, just like we did in our notebook this week.  You can subtract them, just as we did, to get a litmus test of what’s different between two periods of time, or which words are used to signify masculinity and femininity.  

*In the first half of this notebook,* we'll stick with "word vectors," not embeddings.  We'll build word context vectors 'by hand' -- i.e., using only SKLEARN's CountVectorizer() and .fit_transform, plus cosine distance and subtraction.  Doing it this way is slower than loading other packages that have been built specifically for working with large-scale word vectors, where the code is optimized to make the comparisons run faster. We're doing it this way, however, so that you can really see for yourself how a word vector is built and what's inside it at every moment.   When we structure the data, build the vectors, subtract and measure the distance between vectors, we'll be able to inspect what's in the vector at every turn. You could do the math yourself if you wanted to check it.

*In the second half of the code,* however, we'll use the GENSIM package of word embeddings to work on a larger-scale sample of debates. We'll use GENSIM's pre-built tools to do analysis comparable to what you did with cosine distance and vector subtraction:

     wv.vocab - which allows you to inspect the words in a vector 
     wv.most_similar() - which allows you to call up vectors from your dataset that are most similar to a given word

#### Skill Building for Historical Analysis

By the end of this notebook, you'll know how to replicate most of the fancy work with vectors in the reading.  You'll be able to:
* use word context vectors to analyze the intellectual history of concept words like "freedom," "gay", or "woman," detecting how their context changed from moment to moment
* visualize changes to word concepts as a dendrogram
* use GENSIM's "most_similar()" to generate a list of the words most similar to any concept (for instance "freedom") at different moments over time
* visualize changes to the context of an individual word over time

### Loading data

In [155]:
import pandas as pd
import csv
import glob
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import scipy.spatial.distance
import matplotlib
import matplotlib.pyplot as plt
import itertools
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

The following lines load some data from Congress. Don't worry too much about the commands within this block; we're more interested in the transformations we'll apply to the data after it's loaded.  If you're curious, the lines below load two separate dataframes --  "speeches" and "descriptions" -- and then merge them so that we have one dataframe of speeches with the date on which each was spoken.

In [156]:
all_speech_files = glob.glob('/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_*.txt')
CONGRESS_MIN_THRESHOLD = 100
CONGRESS_MAX_THRESHOLD = 115

speech_files = []

for fn in all_speech_files:
    number = int(fn.rsplit('_', 1)[-1].split('.')[0])
    if CONGRESS_MIN_THRESHOLD <= number <= CONGRESS_MAX_THRESHOLD:
        speech_files.append(fn)

speech_files.sort()
        
def parse_one(fn):
    print(f'Reading {fn}...')
    return pd.read_csv(fn, sep='|', encoding="ISO-8859-1", error_bad_lines=False, warn_bad_lines=False, quoting=csv.QUOTE_NONE)

speeches_df = pd.concat((parse_one(fn) for fn in speech_files))
speeches_df.dropna(how='any', inplace=True)

all_description_files = glob.glob('/scratch/group/oit_research_data/stanford_congress/hein-bound/descr_*.txt')
                                  
description_files = []

for fn in all_description_files:
    number = int(fn.rsplit('_', 1)[-1].split('.')[0])
    if CONGRESS_MIN_THRESHOLD <= number <= CONGRESS_MAX_THRESHOLD:
        description_files.append(fn)
description_files.sort() # sort once, after the loop has finished
        
description_df = pd.concat((parse_one(fn) for fn in description_files))

all_data = pd.merge(speeches_df, description_df, on = 'speech_id')
all_data.fillna(0, inplace=True)
all_data = all_data.drop(columns=['chamber', 'speech_id', 'number_within_file', 'speaker', 'first_name'])
all_data = all_data.drop(columns=['last_name', 'state', 'gender', 'line_start', 'line_end', 'file', 'char_count', 'word_count'])
all_data['date']=pd.to_datetime(all_data['date'],format='%Y%m%d')
all_data['year'] = pd.to_datetime(all_data['date']).dt.year

Reading /scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_100.txt...
Reading /scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_101.txt...
Reading /scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_102.txt...
Reading /scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_103.txt...
Reading /scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_104.txt...
Reading /scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_105.txt...
Reading /scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_106.txt...
Reading /scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_107.txt...
Reading /scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_108.txt...
Reading /scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_109.txt...
Reading /scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_110.txt...
Reading /s

In [157]:
all_data['5yrperiod'] = np.floor(all_data['year'] / 5) * 5 # round each year down to the start of its 5-year period -- by dividing by 5, "flooring" to the lowest integer, and multiplying by 5
all_data = all_data.drop(columns=['date', 'year'])
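
To see what the flooring step does, here is a minimal sketch on a handful of made-up years (the values are hypothetical and chosen only for illustration). Each year is mapped to the first year of its five-year bin, so 1985 through 1989 all land in 1985.0.

```python
import numpy as np
import pandas as pd

# Hypothetical years, chosen only to illustrate the flooring step
years = pd.Series([1985, 1987, 1989, 1990, 1994, 2004])

# Divide by 5, round down, multiply by 5:
# every year maps to the start of its 5-year bin
periods = np.floor(years / 5) * 5

print(pd.DataFrame({'year': years, '5yrperiod': periods}))
```

Note that the result is a float (1985.0 rather than 1985), which is why the period labels later in the notebook look like "woman-1985.0".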

In [158]:
all_data.head()

Unnamed: 0,speech,5yrperiod
0,Representativeselect to the 100th Congress. th...,1985.0
1,The Chair would also like to state that Repres...,1985.0
2,The quor closes that 426 Represe have answered...,1985.0
3,The Clerk credentials regular in for received ...,1985.0
4,The next order of business is the election of ...,1985.0


#### Downsample

In this first pass, we're going to do some memory-intensive work by creating word context vectors 'by hand' -- i.e., using only SKLEARN's CountVectorizer() and .fit_transform, plus cosine distance and subtraction.  As explained at the top of the notebook, this is slower than using a dedicated word embedding package, but it lets us inspect what's in the vectors at every step.

Because we're doing old-fashioned vectors by hand, it'll go best if we "downsample" the data, taking a random sample of 5,000 speeches given in Congress between 1985 and 2010. We'll also keep two larger samples (500,000 and 50,000 speeches) for the second half of the notebook.

Let's create some downsamples so we don't break the computer.

In [None]:
sample_l = all_data.sample(500000)
sample_m = sample_l.sample(50000)
sample = sample_m.sample(5000)
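
Because `.sample()` draws rows at random, the results further down will differ from run to run. If you want repeatable results, one option is to pass `random_state`, a standard argument of pandas' `.sample()`. The sketch below uses a small made-up dataframe as a stand-in for `all_data`; the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for all_data: 1,000 numbered "speeches"
toy = pd.DataFrame({'speech': [f'speech {i}' for i in range(1000)]})

# random_state fixes the random draw, so the same rows come back on every run
sample_a = toy.sample(100, random_state=42)
sample_b = toy.sample(100, random_state=42)

print(sample_a.index.equals(sample_b.index))
```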

### Structure the data

We want to make a word context vector dataframe.

First, we'll break up the data so that we have one row per every sentence.

Then, we'll break up the data so that we have one row per every word -- and a column with the 'sentence' where the word was originally found.

This will tell us about the context of the word.

We'll retain information about the '5yrperiod' when the word was originally from.

#### Break data into sentences

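Here's a handy function for breaking up strings into sentences. It lowercases each speech, splits it into sentences, and then splits each sentence into a list of word tokens, so the result is a list of token lists. (If NLTK complains that the 'punkt' resource is missing, run nltk.download('punkt') once.)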

In [160]:
tokenizer = TreebankWordTokenizer()

def make_sentences(text):
    preprocessed_text = []
    for line in text:
        lower_case = line.lower()
        sentences = sent_tokenize(lower_case)
        tokenized_sentences = [tokenizer.tokenize(sent) for sent in sentences]
        preprocessed_text += tokenized_sentences
    return preprocessed_text

Apply the make_sentences function to the 'speech' column (this might take a while):

In [None]:
sentences = make_sentences(sample['speech']).copy()
sentences[:2]

Next, we build the dataframe we'll actually use for the context vectors: one row per sentence, along with the '5yrperiod' of the speech it came from. For this step we simply split each speech on periods.

In [257]:
word_context_sentences = pd.concat([pd.DataFrame({'sentence': speech, '5yrperiod': row['5yrperiod']}, index=[0]) 
           for _, row in sample.iterrows() 
           for speech in row['speech'].split('.') if speech != ''])
word_context_sentences

Unnamed: 0,sentence,5yrperiod
0,The Chair again informs the Senator from Calif...,2000.0
0,I object,2000.0
0,I yield to the gentleman from California,2005.0
0,Madam Speaker,2005.0
0,I rise today to honor the outgoing 2006 Board...,2005.0
...,...,...
0,that vote will occur tomorrow morning,2005.0
0,We may be able to work on an agreement to wor...,2005.0
0,We will certainly work with all colleagues to...,2005.0
0,I have asked the Republican leader to speak f...,2005.0


#### Break sentences into words

Here's some code for breaking up sentences into words.

Create an 'index' column with 
    
        np.arange(len(index_context))

In [258]:
index_context = word_context_sentences.copy()
index_context['index'] = np.arange(len(index_context)) # create an 'index' column
index_context

Unnamed: 0,sentence,5yrperiod,index
0,The Chair again informs the Senator from Calif...,2000.0,0
0,I object,2000.0,1
0,I yield to the gentleman from California,2005.0,2
0,Madam Speaker,2005.0,3
0,I rise today to honor the outgoing 2006 Board...,2005.0,4
...,...,...,...
0,that vote will occur tomorrow morning,2005.0,104098
0,We may be able to work on an agreement to wor...,2005.0,104099
0,We will certainly work with all colleagues to...,2005.0,104100
0,I have asked the Republican leader to speak f...,2005.0,104101


Use str.split() and 

    .explode()

to create a dataframe with one word per row
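
Here is a small sketch of the split-and-explode step, followed by the merge that re-attaches each word to its sentence and period. The two sentences are a made-up miniature of `index_context`, not real output.

```python
import pandas as pd

# Hypothetical miniature of index_context: one row per sentence
toy_sentences = pd.DataFrame({
    'sentence': ['I object', 'Madam Speaker I rise'],
    '5yrperiod': [2000.0, 2005.0],
    'index': [0, 1],
})

# Split each sentence into a list of words, then give every word its own row.
# The 'index' value is carried along, so we know which sentence each word came from.
words = (toy_sentences.set_index('index')['sentence']
         .str.split(' ')
         .explode()
         .reset_index()
         .rename(columns={'sentence': 'keyword'}))

# Merge back on 'index' to re-attach the full sentence and the period
annotated = pd.merge(words, toy_sentences, on='index')
print(annotated)
```

Each keyword now sits next to the whole sentence it came from, which is what lets us treat the sentence as the word's "context."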

In [259]:
word_per_row = index_context.set_index('index')
word_per_row =pd.concat([word_per_row['sentence'].str.split(' ').explode()],axis=1).reset_index() #explode the data 
word_per_row = word_per_row.rename({'sentence' : 'keyword'}, axis = 1) # rename the column "sentence" to "keyword"
word_per_row

Unnamed: 0,index,keyword
0,0,The
1,0,Chair
2,0,again
3,0,informs
4,0,the
...,...,...
1007998,104102,to
1007999,104102,do
1008000,104102,off
1008001,104102,the


Merge the two dataframes to create a well-annotated dataframe of every word and its context.

In [294]:
word_context_words = pd.merge(word_per_row, index_context, on="index") # merge the two df's
word_context_words = word_context_words.drop(columns='index') # get rid of the index column because we don't need it any more
word_context_words

Unnamed: 0,keyword,sentence,5yrperiod
0,The,The Chair again informs the Senator from Calif...,2000.0
1,Chair,The Chair again informs the Senator from Calif...,2000.0
2,again,The Chair again informs the Senator from Calif...,2000.0
3,informs,The Chair again informs the Senator from Calif...,2000.0
4,the,The Chair again informs the Senator from Calif...,2000.0
...,...,...,...
1007998,to,I have something I have to do off the floor,2005.0
1007999,do,I have something I have to do off the floor,2005.0
1008000,off,I have something I have to do off the floor,2005.0
1008001,the,I have something I have to do off the floor,2005.0


In [295]:
word_context_words['keyword'] = word_context_words['keyword'].str.strip() # strip the whitespace
word_context_words

Unnamed: 0,keyword,sentence,5yrperiod
0,The,The Chair again informs the Senator from Calif...,2000.0
1,Chair,The Chair again informs the Senator from Calif...,2000.0
2,again,The Chair again informs the Senator from Calif...,2000.0
3,informs,The Chair again informs the Senator from Calif...,2000.0
4,the,The Chair again informs the Senator from Calif...,2000.0
...,...,...,...
1007998,to,I have something I have to do off the floor,2005.0
1007999,do,I have something I have to do off the floor,2005.0
1008000,off,I have something I have to do off the floor,2005.0
1008001,the,I have something I have to do off the floor,2005.0


You can use 

    .size()
    
with .groupby() to get the word counts per period.  We can also call

    .to_frame()

to tell pandas what to name the new column
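
As a quick illustration on made-up data, grouping by two columns and calling `.size()` counts the rows in each keyword-period combination, and `.to_frame('count')` turns the result into a one-column dataframe named 'count'.

```python
import pandas as pd

# Hypothetical one-word-per-row data
toy_words = pd.DataFrame({
    'keyword': ['woman', 'woman', 'man', 'woman'],
    '5yrperiod': [1985.0, 1985.0, 1985.0, 1990.0],
})

# Count the rows for each (keyword, period) pair and name the new column 'count'
counts = toy_words.groupby(['keyword', '5yrperiod']).size().to_frame('count')
print(counts)
```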

In [296]:
words_per_period = word_context_words.groupby(['keyword', '5yrperiod']).size().to_frame('count')
words_per_period

Unnamed: 0_level_0,Unnamed: 1_level_0,count
keyword,5yrperiod,Unnamed: 2_level_1
,1985.0,10972
,1990.0,21075
,1995.0,21227
,2000.0,19087
,2005.0,20392
...,...,...
§,2000.0,4
§114,1995.0,2
§16(f)(3),2000.0,1
§441(b),1995.0,1


#### Groupby Word and Period

Let's create a new data structure where we group by keyword AND period.

If we organize our data this way, we will preserve information about the context in which each word was spoken, across all speakers, in each five-year period (1985, 1990, etc.).



Technically, we could groupby(['keyword', '5yrperiod']). However, later on, we're going to want to call vectors of data by an index that references both word and period. So it's better if we just create a new column for the data called 'word-period,' and groupby() that.

In [297]:
word_context_word_period = word_context_words.copy()
word_context_word_period['word-period'] = word_context_word_period['keyword'] + "-" + word_context_word_period['5yrperiod'].astype(str)
word_context_word_period = word_context_word_period.drop(columns=['5yrperiod', 'keyword'])
word_context_word_period

Unnamed: 0,sentence,word-period
0,The Chair again informs the Senator from Calif...,The-2000.0
1,The Chair again informs the Senator from Calif...,Chair-2000.0
2,The Chair again informs the Senator from Calif...,again-2000.0
3,The Chair again informs the Senator from Calif...,informs-2000.0
4,The Chair again informs the Senator from Calif...,the-2000.0
...,...,...
1007998,I have something I have to do off the floor,to-2005.0
1007999,I have something I have to do off the floor,do-2005.0
1008000,I have something I have to do off the floor,off-2005.0
1008001,I have something I have to do off the floor,the-2005.0


In [298]:
word_context_grouped = word_context_word_period.groupby(['word-period']).sum()

In [299]:
word_context_grouped

Unnamed: 0_level_0,sentence
word-period,Unnamed: 1_level_1
!--1995.0,It would be of greal - benefit to the future ...
!-1995.0,!
!imit-1990.0,this Member successfully pushed in the author...
!ine-1995.0,will continue to ensure a growing and efficie...
"""$12-1990.0","strike out ""$12"
...,...
§-2000.0,§ 16(e)) containing the requirement that cour...
§114-1995.0,Day casts serious doubt on §11410(c)(1)s requ...
§16(f)(3)-2000.0,§16(f)(3) intervention by interested parties ...
§441(b)-1995.0,and instead looked to the particular characte...


In this new output, the 'sentence' column holds the context: all the words from all the sentences that contain the keyword of that row in a given period.
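
To make the grouping step concrete, here is a sketch on three made-up sentences. Calling `.sum()` on a text column concatenates the strings directly; this sketch joins them with a space instead, so that the last word of one sentence cannot fuse with the first word of the next. That is a choice made for the illustration, not something the cell above does.

```python
import pandas as pd

# Hypothetical rows: each sentence is labeled with the word-period it belongs to
toy = pd.DataFrame({
    'sentence': ['the woman spoke', 'a woman voted', 'the man spoke'],
    'word-period': ['woman-1985.0', 'woman-1985.0', 'man-1985.0'],
})

# Collect every sentence for a given word-period into one long "context" string
grouped = toy.groupby('word-period')['sentence'].agg(' '.join).to_frame()
print(grouped)
```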

In [300]:
word_context_grouped.filter(like = 'woman', axis=0)

Unnamed: 0_level_0,sentence
word-period,Unnamed: 1_level_1
Chairwoman-2005.0,I wish to take a second to thank Chairwoman M...
Chairwoman-2010.0,I want to recognize the work of Chairwoman CL...
Congresswoman-1985.0,I also want to commend Congressman FRANK who ...
Congresswoman-1990.0,Congresswoman NANCY JOHNSON
Congresswoman-1995.0,Congresswoman Congresswoman Congresswoman JAN...
Congresswoman-2000.0,Congresswoman EDDIE BERNICE JOHNSON Congressw...
Congresswoman-2005.0,FY2010 Energy and Water Development and Relat...
Congresswoman-2010.0,Congresswoman CLARKE including Congressman LA...
Councilwoman-1990.0,Councilwoman Clarks strong belief In educatio...
assemblywoman-1985.0,And as she rose to the precedent setting rank...


We're now ready to see how 'woman' changed its meaning in Congress from 1985 to 2005.

## Make Word Context Vectors

In [301]:
vectorizer = CountVectorizer(max_features=10000, lowercase=True, ngram_range=(1, 1), analyzer = "word")

Note that we feed the vectorizer the column 'sentence' because we want to model the CONTEXT in which each keyword appears.
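
The sketch below shows what the vectorizer does with two made-up context strings. The row labels and text are hypothetical. It assumes scikit-learn is installed and falls back to the older `get_feature_names()` method when `get_feature_names_out()` is not available.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical context strings, indexed by word-period
contexts = pd.Series(
    ['the woman spoke a woman voted', 'the man spoke'],
    index=['woman-1985.0', 'man-1985.0'],
)

toy_vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 1), analyzer='word')
toy_matrix = toy_vectorizer.fit_transform(contexts)

# get_feature_names_out() exists in scikit-learn 1.0+;
# older versions only have get_feature_names()
if hasattr(toy_vectorizer, 'get_feature_names_out'):
    vocab = toy_vectorizer.get_feature_names_out()
else:
    vocab = toy_vectorizer.get_feature_names()

# One row per word-period, one column per context word
toy_vectors = pd.DataFrame(toy_matrix.toarray(), columns=vocab, index=contexts.index)
print(toy_vectors)
```

Notice that the one-letter word "a" does not get a column: CountVectorizer's default token pattern only keeps tokens of two or more characters.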

In [None]:
vectorized = vectorizer.fit_transform(word_context_grouped['sentence'])

Inspect the vectors as a dataframe where every column is a context word and every row is a keyword-period combination:

In [None]:
context_words = np.array(vectorizer.get_feature_names()) # in scikit-learn 1.0 and later, use vectorizer.get_feature_names_out()
context_words

In [None]:
word_period = list(word_context_grouped.axes[0].to_numpy())
word_period[:10]

In [None]:
vectors_dataframe = pd.DataFrame(vectorized.todense(), # the matrix we saw above is turned into a dataframe
                                 columns=context_words,
                                 index = word_period
                                 )
vectors_dataframe

In [None]:
matr = vectorized.todense()
matr

### Taking measurements of vectors

In our last exercise, we measured how different individual words were from each other.

Let's do it again.

In [None]:
woman_1985_vector = vectors_dataframe.filter(regex = ('^woman-1985'), axis=0) # the caret (^) means 'begins with'
woman_1985_vector 

In [None]:
woman_1990_vector = vectors_dataframe.filter(regex = '^woman-1990', axis=0)
woman_1995_vector = vectors_dataframe.filter(regex = '^woman-1995', axis=0)
woman_2000_vector = vectors_dataframe.filter(regex = '^woman-2000', axis=0)
woman_2005_vector = vectors_dataframe.filter(regex = '^woman-2005', axis=0)
woman_1990_vector

The vectors_dataframe is nice to use because its rows and columns are nicely labeled.  It's easy to call exactly the keyword-period combination you want.  

You can use the rows directly pulled from vectors_dataframe as the basis for calculating cosine distances.
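
Here is a minimal sketch of the measurement on a made-up three-column table. One assumption worth flagging: recent SciPy releases expect 1-D inputs to `cosine()`, so the sketch pulls each row out with `.loc[label]`, which returns a 1-D Series. If a one-row dataframe raises an error for you, this is a likely fix.

```python
import numpy as np
import pandas as pd
import scipy.spatial.distance

# Hypothetical context counts for 'woman' in three periods
toy_vectors = pd.DataFrame(
    [[2, 1, 0], [1, 1, 1], [2, 1, 0]],
    columns=['work', 'family', 'vote'],
    index=['woman-1985.0', 'woman-1990.0', 'woman-1995.0'],
)

# .loc with a single label returns a 1-D Series
v85 = toy_vectors.loc['woman-1985.0']
v90 = toy_vectors.loc['woman-1990.0']
v95 = toy_vectors.loc['woman-1995.0']

d_85_90 = scipy.spatial.distance.cosine(v85, v90)  # different contexts: distance above 0
d_85_95 = scipy.spatial.distance.cosine(v85, v95)  # identical contexts: distance of 0
print(d_85_90, d_85_95)
```

A distance near 0 means the word kept the same company in both periods; larger distances mean its context shifted.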

In [None]:
scipy.spatial.distance.cosine(woman_1985_vector, woman_1990_vector)

In [None]:
scipy.spatial.distance.cosine(woman_1985_vector, woman_1995_vector)

In [None]:
scipy.spatial.distance.cosine(woman_1985_vector, woman_2000_vector)

In [None]:
scipy.spatial.distance.cosine(woman_1985_vector, woman_2005_vector)

#### Subtracting vectors

Above, we called the rows of vectors_dataframe directly to calculate cosine distances.  

We can use these same vectors to execute a subtraction -- with a bit of reformatting.

First, we "transpose" them from a horizontal row of values to a vertical column of values with

    .T


Next, we call the columns with the values by name, e.g.: 

    ['woman-1985.0']


In [None]:
diff = woman_1995_vector.T['woman-1995.0'] - woman_1985_vector.T['woman-1985.0']
diff

We "sort" the values from small to big using:

    .sort_values()



In [None]:
diff.sort_values()

Hint: use 
    
    .dropna() 
    
to get rid of NaN's (not a number)

In [None]:
diff.dropna().sort_values()

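These are the words whose counts changed the most in the context of 'woman' between 1985 and 1995. Because the values are sorted from small to big, words that became less common around 'woman' appear at the top (negative scores) and words that became more common appear at the bottom (positive scores).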
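
A tiny worked example, with invented counts, shows how to read a difference vector. Subtracting the earlier period from the later one gives negative scores for words that declined and positive scores for words that grew.

```python
import pandas as pd

# Hypothetical context counts around one keyword in two periods
context_1985 = pd.Series({'home': 5, 'work': 1, 'vote': 2})
context_1995 = pd.Series({'home': 2, 'work': 6, 'vote': 2})

# Later minus earlier, sorted from most negative to most positive
toy_diff = (context_1995 - context_1985).sort_values()
print(toy_diff)
```

In this toy case 'home' falls (a score of -3), 'vote' is unchanged (0), and 'work' rises (5).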

## Compare words used for 'man' and 'woman'

Let's make a vector that contains all the references to women.

In [None]:
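import re
pattern = [r'woman', r'women', r'\bshe', r'\bher', r'\bhers', r'girl'] # raw strings (r'...') so that \b is a regex word boundary, not a backspace character; note that r'\bher' also matches labels such as 'here-...'
woman_vector = vectors_dataframe.loc[[x for x in vectors_dataframe.index if any(re.search(word, x, flags=re.IGNORECASE) for word in pattern)]] # any() keeps each matching row once, even if two patterns match it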

woman_vector

That's the data we want, all right. But it'd be more useful as a matrix where all the columns are added together. 

That's easy to do with 

    .sum()

In [None]:
woman_vector = woman_vector.sum()
woman_vector

Perfect! Now let's look for men

In [None]:
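import re
pattern2 = [r'\bhe\b', r'\bhim', r'\bhis', r'\bman\b', r'\bmen\b', r'boy\b', r'boys'] # raw strings so that \b is a regex word boundary; note that r'\bhis' also matches labels such as 'history-...'
man_vector = vectors_dataframe.loc[[x for x in vectors_dataframe.index if any(re.search(word, x, flags=re.IGNORECASE) for word in pattern2)]] # any() keeps each matching row once
man_vector = man_vector.sum()
man_vector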

In [None]:
gender_diff_vector = man_vector - woman_vector

In [None]:
gender_diff_vector.sort_values()

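The output of gender_diff_vector.sort_values() shows which context words lean toward women and which toward men: the most negative scores mark words that appear more often around the 'woman' terms, and the highest positive scores mark words that appear more often around the 'man' terms.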
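
One caveat when reading raw differences: if one group of terms is simply mentioned more often, its counts will be higher for almost every context word. The sketch below, on invented counts, compares the raw difference with a difference of shares (each vector divided by its own total). The normalization is an optional variant offered for comparison, not a step the notebook takes.

```python
import pandas as pd

# Hypothetical context counts
toy_man = pd.Series({'army': 30, 'work': 60, 'family': 10})    # 100 context words in total
toy_woman = pd.Series({'army': 2, 'work': 10, 'family': 8})    # 20 context words in total

# Raw counts: every word looks "male" because the man vector is five times larger
raw_diff = toy_man - toy_woman

# Shares: divide each vector by its own total before subtracting
share_diff = toy_man / toy_man.sum() - toy_woman / toy_woman.sum()

print(raw_diff.sort_values())
print(share_diff.sort_values())
```

With shares, 'family' comes out negative (leaning toward the woman terms) even though its raw difference is positive.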

## Word Embeddings

At this point in the code, we're shifting from word vectors made with SKLEARN to word "embeddings" made with the GENSIM package.

GENSIM's word2vec model represents each word with a short, dense vector (here, 100 numbers) instead of one column per vocabulary word, meaning that we'll be able to deal with more information than the downsized sample above. The vectors come from the "hidden layer" of a small neural network that is trained to predict words from their neighbors within a sliding window, so they reflect the way that words are used in sentences rather than raw counts.

In [None]:
import gensim 

#### Resample the data and create data structure (again)

Let's use a larger sample than we did last time. We'll need to break it into sentences and words again, and group by the features of the data that we care about -- keyword and period -- again.

Because you've seen the instructions above, we'll skip them below and just give the code.

NOTE: the lines below may take a while. Splitting sentences and words can be intensive on a dataset of this scale. If it's not working for you, try sample_m or sample where you see sample_l in the first line.

In [None]:
sample_l

In [None]:
sample_for_model = sample_l # <---- switch out sample_l for sample_m or sample here
sentences_m = make_sentences(sample_for_model['speech']).copy()
word_context_sentences_m = pd.concat([pd.DataFrame({'sentence': speech, '5yrperiod': row['5yrperiod']}, index=[0]) 
           for _, row in sample_for_model.iterrows() # use the same sample as the line above
           for speech in row['speech'].split('.') if speech != ''])

In [None]:
sentences_m[:2]

We're now ready to model a larger set of data in Congress from 1985 to 2005 with the help of GENSIM.

#### Break sentences into words (for later use)

Here's the code for breaking up sentences into words. We'll need words_per_period later in the code. You've seen the detailed code before, so here's the quick version.

In [None]:
index_context2 = word_context_sentences_m.copy()
index_context2['index'] = np.arange(len(index_context2)) # create an 'index' column
word_per_row2 = index_context2.set_index('index')
word_per_row2 =pd.concat([word_per_row2['sentence'].str.split(' ').explode()],axis=1).reset_index() #explode the data 
word_per_row2 = word_per_row2.rename({'sentence' : 'keyword'}, axis = 1) # rename the column "sentence" to "keyword"
word_context_words2 = pd.merge(word_per_row2, index_context2, on="index") # merge the two df's
word_context_words2 = word_context_words2.drop('index', 1) # get rid of the index column because we don't need it any more
word_context_words2['keyword'] = word_context_words2['keyword'].str.strip() # strip the whitespace
words_per_period2 = word_context_words2.groupby(['keyword', '5yrperiod']).size().to_frame('count')
words_per_period2

### Setting up GENSIM

The first step is to "train" the GENSIM model with the function `gensim.models.Word2Vec()`. This function has a couple dozen parameters, some of which are more important than others.

Here are a few major ones. In practice, the only one you must supply to train a model in a single call is `sentences`; the rest have defaults. (The parameter names below are those of GENSIM 3.x, which this notebook uses. GENSIM 4.0 renamed `size` to `vector_size` and `iter` to `epochs`.)

1. `sentences`: This is where you provide your data. It must be an iterable of iterables: for instance, a list of sentences, where each sentence is a list of word tokens.
2. `sg`: Your choice of training algorithm. There are two standard ways of training W2V vectors -- 'skipgram' and 'CBOW'. If you enter 1 here, skip-gram is applied; otherwise, the default is CBOW.
3. `size`: This is the length of your resulting word vectors (the default is 100). With a large corpus you can go up to 100-300 dimensions. More dimensions can give better results, provided there is enough text to train them.
4. `window`: This is the window of context words you are training on. In other words, how many words come before and after your given word. The default is 5; 4 is also a good number, but this can vary depending on what you are interested in. As a rule of thumb, smaller windows tend to pick up words that play a similar role in a sentence, while larger windows tend to pick up words on the same topic.
5. `alpha`: The learning rate of your model. If you are interested in machine learning experimentation with your vectors you may experiment with this parameter.
6. `seed` (int): This is the random seed for your random initialization. Neural network models initialize their weights with random numbers before training. Setting a seed makes that 'random' initialization repeatable, which is useful if you want to replicate your experiments. (GENSIM's documentation notes that fully reproducible runs also require a single worker thread.)
7. `min_count`: This is the minimum frequency threshold. If a given word appears fewer times than this, it will be ignored. This is here because words with very low frequency are hard to train.
8. `iter`: This is the number of iterations (entire runs) over the corpus, also known as epochs. Usually anything between 1-10 is ok. The trade-off is that more iterations take longer to train and the model may overfit on your dataset. However, longer training will allow your vectors to perform better on tasks relevant to your dataset.

Most of these settings will not concern us. As you'll see below, we are only going to use three arguments.

In [None]:
congress_model = gensim.models.Word2Vec(
    sentences = sentences_m,
    min_count = 2, # remove words stated only once
    size = 100) # size of neuralnet layers; default is 100; higher for larger corpora

### Save the model

Let's also save our model in case we want to use it again in a later session.

In [None]:
congress_model.save('congress_model')
# congress_model = gensim.models.Word2Vec.load('congress_model') # to load a saved model

And you can load a model in the same way (remember this from our topic model)

In [None]:
congress_model = gensim.models.Word2Vec.load('congress_model') 

## What's in the model?

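The attribute `wv.index2word` allows us to see the words in our model (but careful! congress_model.wv.vocab will print out every word in the model's vocabulary -- a very long list!). In GENSIM 4.0 and later, these attributes are named `wv.index_to_key` and `wv.key_to_index`.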

In [None]:
congress_model.wv.index2word[:25]

The model itself is -- like the SKLEARN CountVectorizer output -- a matrix of vectors. Every row corresponds to one word, but the values are learned weights (decimal numbers) rather than counts. We can call the entire matrix or call up one row at a time.

In [None]:
congress_model.wv.vectors

Here's the fourth row of the model, represented as a word and as a vector:

In [None]:
word = congress_model.wv.index2word[3]
word

In [None]:
congress_model.wv[word]

In [None]:
congress_model.wv.vectors[3]

#### Inspecting Word Context with the GENSIM model, one word at a time

The GENSIM model has all sorts of tools built in for navigating and inspecting the vectors.

We can look at the word context vector for any individual word by using:

    model.wv['word']

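Here are the words whose vectors are most similar to the vector for 'man'. In other words, these are words that tend to appear in the same kinds of contexts as 'man' in our sample (the word 'man' itself will typically top the list, since its vector is identical to the one we passed in):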

In [None]:
man_vector = congress_model.wv['man']
congress_model.wv.similar_by_vector(man_vector)

In [None]:
woman_vector = congress_model.wv['woman']
congress_model.wv.similar_by_vector(woman_vector)

In [None]:
individual_vector = congress_model.wv['individual']
congress_model.wv.similar_by_vector(individual_vector)

In [None]:
soldier_vector = congress_model.wv['soldier']
congress_model.wv.similar_by_vector(soldier_vector)

### Distance and Similarity with Vectors in GENSIM

Similarity is cosine similarity -- it's 1 minus cosine distance.  You've used cosine distance before -- you're a whiz with cosine distance already. 

With similarity, the higher the number, the more alike two terms are in the context in which they are used. 
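
The relationship between the two measures can be checked directly on a pair of made-up vectors: cosine similarity computed by hand equals 1 minus SciPy's cosine distance.

```python
import numpy as np
import scipy.spatial.distance

# Two hypothetical vectors
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

# Cosine similarity: the dot product divided by the product of the vector lengths
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Cosine distance, as used earlier in the notebook
cosine_distance = scipy.spatial.distance.cosine(a, b)

print(cosine_similarity, 1 - cosine_distance)
```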

When we used cosine distance before, we were doing it one vector at a time.  

In [None]:
congress_model.wv.similarity('women', 'men')

#### What other words have similar context vectors?

Part of the beauty of the GENSIM package is that it has pre-run all the word vectors for you. So it can call up the most similar word context vectors to the word context vector of any word, using the command, 'most_similar()'

From the GENSIM documentation: "This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model."
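
For a single query word, that description boils down to "rank every other word by cosine similarity." The sketch below imitates the idea with plain NumPy on invented three-dimensional vectors. The function `toy_most_similar` is a hypothetical helper written for this illustration; it is not part of GENSIM and skips GENSIM's optimizations.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", invented for illustration only
toy_embeddings = {
    'women':    np.array([0.9, 0.1, 0.0]),
    'men':      np.array([0.8, 0.2, 0.1]),
    'soldiers': np.array([0.5, 0.5, 0.2]),
    'budget':   np.array([0.0, 0.1, 0.9]),
}

def toy_most_similar(word, embeddings, topn=2):
    """Rank every other word by cosine similarity to `word` (hypothetical helper)."""
    # Scale the query vector to length 1 so the dot product equals cosine similarity
    target = embeddings[word] / np.linalg.norm(embeddings[word])
    scores = {
        other: float(np.dot(target, vec / np.linalg.norm(vec)))
        for other, vec in embeddings.items()
        if other != word  # leave out the query word itself
    }
    # Highest similarity first
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:topn]

result = toy_most_similar('women', toy_embeddings)
print(result)
```

In this toy vocabulary 'men' ranks first and 'soldiers' second, because their vectors point in nearly the same direction as the vector for 'women'.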

In [None]:
congress_model.wv.most_similar("women", topn = 20)

Interesting. So, according to our model, women are like men and individuals and soldiers; they're also like students and parents.

#### Interpreting vector similarity

But before we get carried away, remember that these results come from a *different* mode of analysis than the CONTEXT VECTOR above.  The results here don't indicate that the words "individuals" or "soldiers" regularly occur in sentences with the word "women."  

Instead, the model indicates that "individuals" and "soldiers" are often talked about with the same words that men and women are talked about.  They have employers, wages, etc.

Let's look at the word context vectors that are most similar to 'men'.

In [None]:
congress_model.wv.most_similar("men", topn = 20)

We find that men are spoken about in almost entirely the same contexts as women. But if women are spoken about in the same context as children, men are spoken about slightly more often in the same context as their homes. (what you see may vary with a different sample)

**Remember**: everything the model knows it knows from our corpus. What we're learning are assumptions *immanent* to the corpus.  These aren't FACTS about women or men -- these are data about how women and men were spoken about in Congress, 1985-2005.

Both `word2vec` and our model have limitations.

Additionally, our training set is selective and small (just a subset of some debates about the environment). Therefore, our analogies can return some wild cards. 


In [None]:
congress_model.wv.most_similar("america", topn = 10)

Wow. America is spoken about like freedom, like Iraq, and like the world. It's in a downturn, and when we speak of America, we speak of the same contexts in which we invoke democracy, drugs, and the interests of different peoples, especially workers. (what you see may vary with a different sample)

Try your own hand at interpreting these outputs. 

In [None]:
congress_model.wv.most_similar("iraq", topn = 10)

How do you interpret these similarities?

In [None]:
congress_model.wv.most_similar("britain", topn = 10)

#### Visualize the similarities

You'll recall that in Sarah Connell's blog entry, researchers produced a "dendrogram" of words related to other words, which we learned was created on the basis of cosine distance scores between word vectors.

This dendrogram was used to compare the meaning of "freedom" in the seventeenth century (when the word was nearest in meaning to "friendship") to the meaning of "freedom" in the eighteenth century (when the word became associated with nations and patriotism).

Let's see if we can make a dendrogram of words for our model.

The 

    linkage()
    
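command performs hierarchical clustering -- in other words, it takes the distance between every pair of vectors (below, the standardized Euclidean distance, 'seuclidean') and repeatedly merges the two closest clusters until everything is joined into a single tree.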

In [None]:
keywords = ['dream', 'bombing', 'warfare', 'racism', 'prosperity', 'wealth', 'happiness', 'today', 'tomorrow', 'past', 'present', 'future', 'america', 'france', 'britain', 'iraq', 'china', 'democratic', 'dictator', 'totalitarian', 'democracy', 'welfare', 'socialism', 'communism', 'russia', 'congress', 'debate', 'hearing', 'protest']

NOTE: if you get an error because any of the words above aren't in your sample corpus, edit the list and try again.
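
One way to avoid editing the list by hand is to filter it automatically. The sketch below uses a hypothetical helper, `filter_known_words` (not part of GENSIM), and a small dictionary standing in for the model's vocabulary, so it runs on its own.

```python
def filter_known_words(words, vectors):
    """Split `words` into those present in `vectors` and those missing from it."""
    known = [w for w in words if w in vectors]
    missing = [w for w in words if w not in vectors]
    return known, missing

# toy stand-in for congress_model.wv; the numbers are invented
toy_vectors = {'america': [0.1, 0.2], 'iraq': [0.3, 0.4]}

known, missing = filter_known_words(['america', 'iraq', 'totalitarian'], toy_vectors)
print(known)    # words we can safely look up
print(missing)  # words to drop from the list
```

With the real model, something like `keywords, missing = filter_known_words(keywords, congress_model.wv)` should work, since GENSIM's word vectors support the `in` test.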

In [None]:
keyword_vectors = congress_model.wv[keywords]
keyword_vectors

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
links = linkage(keyword_vectors, method='complete', metric='seuclidean')
links
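
The array that `linkage()` returns can be hard to read at first. Here is a minimal sketch on five made-up vectors (random numbers, not real word vectors) that prints each merge step. Each row holds the two clusters being merged, the distance between them, and the number of original items in the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(5, 10))  # 5 invented "words", 10 dimensions each

toy_links = linkage(toy_vectors, method='complete', metric='seuclidean')

# With n items there are n - 1 merges. Indices 0..n-1 refer to the original items;
# an index of n or more refers to a cluster created in an earlier row (n + row number).
for left, right, distance, size in toy_links:
    print(int(left), int(right), round(float(distance), 3), int(size))
```

The last row always describes the final merge, so its size equals the total number of items.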

This ranking gives us a read of which vectors are closest to which vectors.  We can visualize it using matplotlib and the "dendrogram" command from SciPy:

In [None]:
from matplotlib import pyplot as plt

l = links

# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.ylabel('word')
plt.xlabel('distance')

dendrogram(
    l,
    leaf_rotation=0,  # rotation of the leaf (word) labels
    leaf_font_size=16,  # font size for the leaf (word) labels
    orientation='left',
    leaf_label_func=lambda v: str(keywords[v])
)
plt.show()


With a little tweaking, you can create a list of only those vectors for the words most of interest to you, using GENSIM to visualize their similarity to each other in the corpus.

You could even -- like Connell's blog entry indicates -- create a separate dendrogram for 1985 and another for 2005, to see how these terms have changed.

## Subtracting Vectors

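You'll recall that we've used vector subtraction before.  Subtracting the vector for "woman" from the vector for "man" produces a difference vector that points toward the contexts that distinguish "man" from "woman." The words most similar to that difference vector are the ones used in contexts typical of "man" but not of "woman."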

In [None]:
diff = congress_model.wv['man'] - congress_model.wv['woman']
congress_model.wv.similar_by_vector(diff)

In [None]:
diff = congress_model.wv['woman'] - congress_model.wv['man']
congress_model.wv.similar_by_vector(diff)
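
To see what the subtraction is doing, here is a minimal sketch with invented two-dimensional vectors (real Word2Vec vectors have many more dimensions, and these numbers are made up). It subtracts "woman" from "man" and ranks the remaining toy words by their cosine similarity to the difference.

```python
import numpy as np

# toy vectors, invented for illustration: the second dimension separates the two contexts
toy = {
    'man':    np.array([1.0,  1.0]),
    'woman':  np.array([1.0, -1.0]),
    'beard':  np.array([0.2,  1.0]),
    'dress':  np.array([0.2, -1.0]),
    'person': np.array([1.0,  0.0]),
}

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

diff = toy['man'] - toy['woman']  # what is left is the direction in which the two words differ

# rank every other word by how closely it points in the same direction as the difference
others = [w for w in toy if w not in ('man', 'woman')]
ranked = sorted(others, key=lambda w: cosine_similarity(diff, toy[w]), reverse=True)
print(ranked)
```

The word that shares the "man" side of the difference ranks first, the neutral word lands in the middle, and the word on the "woman" side ranks last. GENSIM's `most_similar()` also accepts `positive` and `negative` word lists, which is another route to a similar comparison.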

### Visualizing Abstract Relatedness

The four words making up the analogy can be understood as points in space where each word represents a single point. These points represent words' relationships with one-another.

Let's borrow more of Sinykin's code to visualize the results.

In [None]:
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.decomposition import PCA
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors

In [None]:
#%matplotlib inline

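def display_pca_scatterplot(model, words=None, sample=0):
    # 'model' should be the word vectors themselves (for instance, congress_model.wv)
    if words is None:
        if sample > 0:
            # .vocab is the gensim 3 name; gensim 4.0 and later use .key_to_index instead
            words = np.random.choice(list(model.vocab.keys()), sample)
        else:
            words = [ word for word in model.vocab ]
        
    word_vectors = np.array([model[w] for w in words])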

    twodim = PCA().fit_transform(word_vectors)[:,:2]
    
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)

In [None]:
display_pca_scatterplot(congress_model.wv, keywords)

Truth be told, I don't love this visualization; it's visualizing abstract relationships that show the conceptual distance between different entities in the model. I present it to you as a cute toy, not as an approved visualization that I'd like to see in your work. 

Please use PCA analysis with care; it's almost impossible to get back to what it actually *means* -- at least without pairing it with other visualizations and measures.
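
One rough check on how much a two-dimensional PCA plot leaves out is the share of variance that the first two components retain. The sketch below uses random stand-in vectors rather than the Congress model, so the numbers themselves mean nothing; the point is only where to look.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(30, 100))  # 30 invented "words", 100 dimensions each

pca = PCA(n_components=2)
twodim = pca.fit_transform(toy_vectors)  # the x and y coordinates a scatterplot would use

# fraction of the total variance that survives in the two plotted dimensions
kept = float(pca.explained_variance_ratio_.sum())
print(twodim.shape)
print(round(kept, 3))
```

If the retained share is small, most of the structure in the original vectors is invisible in the plot, which is one reason to pair PCA with other measures.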

## Comparing Time with GENSIM

You might recall that there's a lot of data that we're not using, for instance, the 5yrperiod field:

In [None]:
sample_m

How can we make use of it?  How about a for loop?

In [None]:
periodnames = sample_m['5yrperiod'].unique().tolist()
periodnames

This might take a while, since we're creating 6 different gensim models. Fortunately, we're saving all of them, so if you want to go back and run this for a different word later, you can just load the old data rather than running the whole thing again.

In [None]:
women_context = [] # create an empty list to hold the results for each period

for period1 in periodnames:
    period_data = sample_m[sample_m['5yrperiod'] == period1] # select one period at a time
    print('mining ', period1)
    sentences = make_sentences(period_data['speech']).copy() # break data into sentences for that period only 
    ####### tweak here after the first run to use the old data without generating it again
    period_model = gensim.models.Word2Vec( # make a gensim model for that data
        sentences = sentences,
        min_count = 2, 
        size = 100)  # note: gensim 4.0 and later call this parameter vector_size
    period_model.save('model-' + str(period1)) # save the model with the name of the period
    #period_model = gensim.models.Word2Vec.load('model-' + str(period1)) # to load a saved model
    ###########
    women_context_period = period_model.wv.most_similar("woman", topn = 1000) # extract the context of how women were talked about in that period
    women_context.append(women_context_period) # save the context of how women were talked about for later

The output should be a list with one entry per period. Each entry is a list of (word, similarity score) pairs, which we can use to show how the context of 'woman' was changing from period to period.

In [None]:
women_context[0][0:15]

In [None]:
women_context[5][0:15]

I can grab just the names this way:

In [None]:
[item[0] for item in women_context[1]][:5]

I can grab just the numbers for any given year (in this case, the second period -- 1990 -- [1]) this way:

In [None]:
[item[1] for item in women_context[1]][:5]
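
Once you can pull out the names, one simple way to put a number on change between two periods is to ask how much their top-word lists overlap. The sketch below uses invented data in the same shape as `women_context`; the helper names `top_words` and `jaccard` are hypothetical, not part of GENSIM.

```python
def top_words(context, n=10):
    """Return the set of the first n words from a list of (word, score) pairs."""
    return {word for word, score in context[:n]}

def jaccard(set_a, set_b):
    """Shared words divided by all words: 1.0 means identical sets, 0.0 means no overlap."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# toy data shaped like women_context: one list of (word, similarity) pairs per period
toy_context = [
    [('girl', 0.91), ('mother', 0.88), ('child', 0.80)],
    [('girl', 0.90), ('worker', 0.85), ('child', 0.79)],
]

overlap = jaccard(top_words(toy_context[0]), top_words(toy_context[1]))
print(overlap)
```

A low overlap between two periods suggests that the word's nearest neighbors shifted. Keep in mind that with small samples some of that shift may be noise.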

#### Let's annotate the data with how many times women were referred to over time


Recall that we made a nice dataframe of how many times each word appears over a period.

In [None]:
words_per_period2

In [None]:
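# the comparison goes outside the square brackets: first select the column, then test it
keyword_per_year = words_per_period2[words_per_period2['keyword'] == 'woman']
# if the periods are stored as numbers rather than text, compare with 1985 instead of '1985'
keyword_per_year[keyword_per_year['5yrperiod'] == '1985']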

#### Visualize it

Make a flattened list of all the words.

In [None]:
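all_words = []
for i in range(len(women_context)):  # one list of top words per period
    words = [item[0] for item in women_context[i]][:10]
    all_words.append(words)

all_words2 = []
for word_list in all_words:  # named 'word_list' so we don't overwrite Python's built-in list()
    for word in word_list:
        all_words2.append(word)

Set up the colors.

In [None]:
from matplotlib import cm
from numpy import linspace
colors = [ cm.jet(x) for x in linspace(0, 1, len(all_words2)) ] # one color per word; cm.jet expects values between 0 and 1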

Dots and annotations.

In [None]:
%matplotlib inline

from matplotlib import cm
from matplotlib.colors import ListedColormap, LinearSegmentedColormap


from adjustText import adjust_text

# change the figure's size here
plt.figure(figsize=(5,5), dpi = 200)

# plt.annotate only plots one label per iteration, so we have to use a for loop 
for i in range(len(periodnames)):    # cycle through the period names
    
    xx = periodnames[i]        # on the x axis, plot the period name
    yyy = keyword_per_year[i]  # how many times was the keyword used that year?

    # for each period, one big black dot
    
    plt.scatter(                                           # plot dots
            xx, #x axis
            yyy, # y axis
            linewidth=1, 
            color = 'black',
            s = 10, # dot size
            alpha=0.2)  # dot transparency

                     
                     
    for j in range(10):     # cycle through the first ten words (you can change this variable)
        
        yy = [item[1] for item in women_context[i]][j]         # on the y axis, plot the distance -- how closely the word is related to the keyword
        txt = [item[0] for item in women_context[i]][j]        # grab the name of each collocated word
        colorindex = all_words2.index(txt)                     # this command keeps all dots for the same word the same color
        
        plt.scatter(                                           # plot dots
            xx, #x axis
            yy, # y axis
            linewidth=1, 
            color = colors[colorindex],
            s = 3, # dot size
            alpha=0.8)  # dot transparency

                                                                # make a label for each word
        plt.annotate(
                txt,
                (xx, yy),   
                size = 5,
                color = 'black', 
                alpha=0.8 # i've made the fonts transparent as well.  you could play with color and size if you wanted to. 
            )

# Code to help with overlapping labels -- may take a minute to run
texts = [t for t in plt.gca().texts]  # collect the labels that plt.annotate added to the plot
adjust_text(texts, force_points=0.2, force_text=0.2,
            expand_points=(1, 1), expand_text=(1, 1),
            arrowprops=dict(arrowstyle="-", color='black', lw=0.5))

plt.xticks(rotation=90)

# Add titles
plt.title("Word Context Change for 'WOMAN' Over Time in Congress", fontsize=20, fontweight=0, color='Red')
plt.xlabel("period")
plt.ylabel("similarity of word")



## Assignment

Above, we make a dendrogram of a long list of interesting words and their contextual similarity. 

We also use a list of periods to create separate GENSIM models for each period.

You will put these together to create three dendrograms: one each for 1985, 1995, and 2005. 

#### Coding exercise
   * Create a list of keywords that you think would be particularly relevant for Congress during this time -- something that might demonstrate historical change in ideas.
        * Using the code above, create a GENSIM model for 1985, 1995, and 2005
        * Using the code above, create an array of vectors for your words for each time period
        * Using the code above, draw a dendrogram of keyword relatedness for the three time periods.
    
#### Interpretation exercise    
   * Write an interpretive paragraph of at least half a page examining what this dendrogram suggests about change over time.  If there's not enough material in your first experiment, tweak the keyword list and try again -- until you have something to say about history.


Turn in your work on Canvas. Do not turn in an ipynb. 