# Lede Algorithms -- Assignment 2

In this assignment you will use all your text analysis skills to analyze the U.S. State of the Union speeches in the 20th century. 

First, load `state-of-the-union.csv`. This is a standard CSV file with one speech per row. There are two columns: the year of the speech, and the text of the speech.

In [3]:
# Some stuff you'll need
import pandas as pd
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer


In [4]:
# load 'state-of-the-union.csv'
df = pd.read_csv('week-2-1/state-of-the-union.csv')
df.head()

Unnamed: 0,year,text
0,1790,"George Washington\nJanuary 8, 1790\n\nFellow-C..."
1,1790,\nState of the Union Address\nGeorge Washingto...
2,1791,\nState of the Union Address\nGeorge Washingto...
3,1792,\nState of the Union Address\nGeorge Washingto...
4,1793,\nState of the Union Address\nGeorge Washingto...


We will work with only those speeches in the 20th century, so start by keeping only the rows with a year between 1900 and 1999.

In [5]:
twentieth = df[(df['year'] >= 1900) & (df['year'] < 2000)]
twentieth.head()

Unnamed: 0,year,text
111,1900,\nState of the Union Address\nWilliam McKinley...
112,1901,\nState of the Union Address\nTheodore Rooseve...
113,1902,\nState of the Union Address\nTheodore Rooseve...
114,1903,\nState of the Union Address\nTheodore Rooseve...
115,1905,\nState of the Union Address\nTheodore Rooseve...


The first step in your analysis task will be to tokenize each document in this set and create a dataframe of tf-idf vectors. We're going to need to tokenize first, so write (or cut and paste!) a tokenizer function that takes a string and returns a list of standardized tokens.

In [7]:
def tokenize(s):
  blob = TextBlob(s.lower())
  words = [token for token in blob.words if len(token) > 2]
  return words


vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize)

matrix = vectorizer.fit_transform(twentieth['text'])

Good stuff. Now use this to create a matrix of tf-idf vectors for the document set.

In [8]:
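# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed
tfidf = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())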
tfidf.head()

Unnamed: 0,'70,'76,'82,'86,'89,'90,'follow,'forties,'ll,'re,...,zest,zigzag,zimbabwe,zimbabwean,zinc,zion,zone,zones,zoological,zooming
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005191,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010902,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.018962,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.014207,0.0,0.0,0.0
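
To get a feel for where these numbers come from, here is a minimal sketch of tf-idf in plain Python. It assumes the weighting I understand scikit-learn to use by default (raw term counts, a smoothed idf, and each row scaled to unit length), so treat it as an illustration rather than a replacement for `TfidfVectorizer`. The names `tfidf_vectors` and `toy_docs` are made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # raw count times a smoothed idf (assumed to match scikit-learn's default)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # scale the vector to unit length so long speeches don't dominate
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

toy_docs = [
    "peace and freedom for free nations".split(),
    "banks relief and banks".split(),
    "peace and war".split(),
]
toy_vectors = tfidf_vectors(toy_docs)
```

In the second toy document, `banks` ends up weighted well above `and`, because `and` shows up in every document while `banks` is frequent in just one.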


You're probably going to want a way to print out the most highly weighted terms in a vector as well, so we'll use print_sorted_vector from the lesson notebook:

In [9]:
def print_sorted_vector(v):
    # this "lambda" thing is an anonymous function, google it to unlock bonus coding knowledge
    sorted_list = sorted(v.items(), key=lambda x: (x[1],x[0]), reverse=True) 
    sorted_list = sorted_list[:20]
    print('\n'.join([str(x) for x in sorted_list]))
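
If you would rather get the top terms back as a list than print them (handy for comparing decades later), a small variant could look like the sketch below. `top_terms` is a hypothetical helper, not something from the lesson notebook. It uses the same sort key as `print_sorted_vector`, so ties in weight are broken by the term itself.

```python
def top_terms(v, n=20):
    """Return the n highest-weighted (term, weight) pairs from a dict-like vector."""
    # sort by weight first, then by term, both descending
    return sorted(v.items(), key=lambda x: (x[1], x[0]), reverse=True)[:n]

weights = {'world': 0.2, 'economic': 0.07, 'american': 0.07, 'zinc': 0.0}
top = top_terms(weights, n=3)
```

Because it only relies on `.items()`, the same function should work on a plain dict or on a pandas Series such as a row of the tf-idf dataframe.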

Print out a few of the State of The Union vectors for individual speeches to get a sense of what's happening here.

In [16]:
print_sorted_vector(tfidf.iloc[60])

('world', 0.19525421971047371)
('nations', 0.16452492828929235)
('free', 0.1357434922758075)
('freedom', 0.10774332349482432)
('nation', 0.10546314652398149)
('peace', 0.10058550712357736)
('steel', 0.09929551162946561)
('space', 0.09879316829658123)
('years', 0.09560168954817315)
('progress', 0.0887297958192671)
('make', 0.0828351235135343)
('help', 0.08150224929736809)
('emerging', 0.07954983245864958)
('year', 0.07844211254641721)
('today', 0.07358342904889673)
('exploration', 0.07287342010581332)
('economic', 0.0724081038890005)
('american', 0.0724081038890005)
('congress', 0.06962281969053154)
('intellectual', 0.06929595177688465)


Now sum the vectors for each decade, and print out the results. Do you see any themes? Can you connect the terms to major historical events (wars, the Great Depression, assassinations, the civil rights movement, Watergate…)?

In [17]:
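# NOTE: this slices by row position, not by year. i:i+9 selects nine rows per
# block and assumes one speech per year, which the data does not guarantee.
for i in range(0, 100, 10):
  docs = tfidf.iloc[i:i+9, :]
  total = docs.sum(axis=0)  # add up the tf-idf weight of each term across the block
  print('==================')
  print('THE {}S!'.format(1900+i))
  print_sorted_vector(total)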

THE 1900S!
('government', 1.2359306950991655)
('law', 0.9916217036066706)
('states', 0.9435170349458204)
('great', 0.8578640147784602)
('people', 0.808429883201772)
('congress', 0.7850429195576802)
('islands', 0.7731323976591647)
('public', 0.7354192775850008)
('men', 0.7302993784449956)
('united', 0.7093969067327885)
('work', 0.6994447709701992)
('country', 0.6964696644889974)
('man', 0.6904666965767627)
('corporations', 0.6700926675281301)
('business', 0.6408954502929638)
('navy', 0.6314778642437825)
('officers', 0.6003360619231584)
('good', 0.5659015722105067)
('national', 0.5624271019967686)
('service', 0.5594502765632127)
THE 1910S!
('government', 1.0020449464940833)
('shall', 0.8446881688471051)
('great', 0.7584805285733002)
('states', 0.7333388111387222)
('congress', 0.7024132259041955)
('country', 0.6892539229191801)
('war', 0.6204964885082951)
('united', 0.6165462562938432)
('people', 0.6099444409310377)
('men', 0.604927682035116)
('necessary', 0.5585005194702183)
('present', ...
THE 1920S!
('government', 1.5564492246892034)
('congress', 0.9577863052750989)
('public', 0.8572806861304343)
('country', 0.7811484908183989)
('ought', 0.7554515789788483)
('people', 0.6700670375168616)
('law', 0.6685694304749461)
('national', 0.6590685958993429)
('war', 0.6179100965996406)
('present', 0.6105202551454593)
('world', 0.5647376004827329)
('agriculture', 0.564144309634744)
('great', 0.552753870414571)
('american', 0.5114564772401927)
('service', 0.5079199371831159)
('federal', 0.5077605251346552)
('states', 0.5044590362798012)
('necessary', 0.5028698194307357)
('commerce', 0.49946093387419654)
('court', 0.47994217819997353)
THE 1930S!
('government', 1.1866985360727254)
('people', 0.7677054649553443)
('national', 0.7385615200112146)
('congress', 0.7167123938113138)
('relief', 0.6216066830710284)
('world', 0.6088354867849038)
('nation', 0.5741005720354149)
('banks', 0.5602636977587498)
('new', 0.5483293602814658)
('1933', 0.5365259061513299)
('year', 0.5355279229669557)
('public', 0.53...
THE 1950S!
('world', 1.4223405614375622)
('free', 1.0953944449019157)
('people', 0.8964559094760031)
('government', 0.8920158400602367)
('military', 0.8917382652845594)
('nations', 0.8861895804553822)
('defense', 0.820751349770115)
('congress', 0.8064467316305177)
('economic', 0.7908832544103401)
('program', 0.7857966246705288)
('security', 0.7538216639133882)
('communist', 0.7411763629429565)
('strength', 0.7198303790827091)
('shall', 0.6984309689972759)
('year', 0.6899033909855268)
('peace', 0.6851003169848593)
('federal', 0.671033181559459)
('new', 0.6564321733049444)
('soviet', 0.6328189861591784)
('united', 0.5633984037268309)
THE 1960S!
('new', 1.1059842048365573)
('world', 1.0207754522189276)
('nation', 0.7731449255134488)
('vietnam', 0.7640720316866008)
('nations', 0.7614975163422817)
('year', 0.7156258274267949)
('help', 0.711263026228991)
('people', 0.7076680731511208)
('years', 0.6697557150721344)
('free', 0.6406703748991328)
('congress', 0.6324749345538291)
('american', 0.616878651213...
THE 1970S!
('congress', 1.271218078619974)
('new', 1.2157453127710927)
('america', 1.1547518195319122)
('people', 1.0955925573854335)
('years', 1.0638733245726701)
('world', 1.0182380745616073)
('government', 0.9440291891950003)
('year', 0.9349235382776521)
('federal', 0.8167870399241473)
('american', 0.7656561772918058)
('nation', 0.7371673339459355)
('programs', 0.7038189774688304)
('americans', 0.6472727795571446)
('energy', 0.6367476551857673)
('great', 0.6260858315836922)
('peace', 0.62086369183682)
('president', 0.5833267116492951)
('time', 0.5708150782612422)
('today', 0.5392909815407213)
('states', 0.5369913633707246)
THE 1980S!
('america', 1.1836947080777152)
("'ve", 1.170533540891054)
('world', 0.9372717401863497)
('people', 0.9248091325711995)
("'re", 0.9073994455843919)
('soviet', 0.8792968992601153)
('government', 0.8232411572057319)
('tonight', 0.725619512730229)
('new', 0.7222890393836175)
('american', 0.7072319850020303)
('years', 0.6988959191114495)
('peace', 0.6885330489010167)
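
One thing to watch for: the rows are not exactly one speech per year (the sample output earlier skips 1904 and lists 1790 twice), so slicing by row position can drift away from the real decades. A safer approach is to derive the decade from the year itself. Here is a minimal sketch in plain Python, with made-up toy data; `sum_by_decade` is a hypothetical helper name.

```python
from collections import defaultdict

def sum_by_decade(years, vectors):
    """Sum {term: weight} vectors per decade, keyed by the decade's first year."""
    totals = defaultdict(lambda: defaultdict(float))
    for year, vec in zip(years, vectors):
        decade = (year // 10) * 10  # 1905 -> 1900, 1911 -> 1910
        for term, weight in vec.items():
            totals[decade][term] += weight
    return {decade: dict(terms) for decade, terms in totals.items()}

# toy data: note the missing 1904, which positional slicing would not notice
years = [1900, 1901, 1903, 1905, 1911]
vectors = [
    {'navy': 0.5},
    {'navy': 0.25, 'law': 0.5},
    {'law': 0.25},
    {'islands': 1.0},
    {'war': 1.0},
]
decade_totals = sum_by_decade(years, vectors)
```

With the dataframes in this notebook, I would expect something like `tfidf.groupby(twentieth['year'].values // 10 * 10).sum()` to do the same job, but that line is an untested assumption, so check it against a decade you have counted by hand.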


Which two decades are most similar, according to the cosine similarity of their average vectors? You will need to use a double loop that compares every pair of decades and finds the pair with the smallest distance.

In [18]:
def doc_distance(a_vec, b_vec):
  # First we have to compute similarity. The idea is the same as doc_similarity, but
  # because we are using arrays and not dictionaries, we can just multiply the elements
  # pairwise and sum the products. This is what the dot method does
  similarity = a_vec.dot(b_vec)

  # For unit-length vectors, similarity will be 1 if equal, 0 if disjoint, and we want
  # things the other way around. Individual tf-idf rows are unit length (TfidfVectorizer
  # L2-normalizes by default), but averaged vectors are not, so for those this returns
  # 1 minus a raw dot product rather than a true cosine distance
  return 1 - similarity

# helpful little function for distance between documents i and j
def dij(i, j):
  return doc_distance(tfidf.iloc[i], tfidf.iloc[j])
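
The question asks about cosine similarity of *average* vectors, and an average of unit-length vectors is generally shorter than unit length. A plain dot product then understates the similarity. Below is a minimal sketch of a distance that divides by both lengths, written with plain lists so it stands alone; `cosine_distance` is a hypothetical name, and treating an all-zero vector as maximally distant is my own assumption.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity for two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: an empty vector is treated as maximally distant
    return 1 - dot / (norm_a * norm_b)

u1 = [1.0, 0.0]
u2 = [0.0, 1.0]
mean = [(x + y) / 2 for x, y in zip(u1, u2)]  # [0.5, 0.5], length is below 1

raw = 1 - sum(x * y for x, y in zip(mean, mean))  # dot-product-only distance
fixed = cosine_distance(mean, mean)               # length-corrected distance
```

Here `raw` comes out as 0.5 even though the vector is being compared with itself, while `fixed` is zero up to rounding. That gap is why distances computed from unnormalized decade averages can look larger than they should.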



In [21]:
comparisons = []

# NOTE: i:i+9 and j:j+9 slice nine rows by position, not by year
for i in range(0, 100, 10):
  first = tfidf.iloc[i:i + 9, :]
  first_mean_vec = first.mean(axis=0)
  first_decade = 1900 + i

  # start at i + 10 so each pair of decades is compared exactly once
  for j in range(i + 10, 100, 10):
    second = tfidf.iloc[j:j + 9, :]
    second_mean_vec = second.mean(axis=0)
    second_decade = 1900 + j

    print('==================')
    print('THE {}s VS THE {}s!'.format(first_decade, second_decade))
    distance = doc_distance(first_mean_vec, second_mean_vec)
    print('Distance is', distance)
    comparisons.append({
      'first_decade': first_decade,
      'second_decade': second_decade,
      'distance': distance,
    })

# smallest distance first, so the most similar pair of decades ends up at the top
comparisons.sort(key=lambda x: x['distance'])
comparisons

THE 1900s VS THE 1910s!
Distance is 0.6461135586267194
THE 1900s VS THE 1920s!
Distance is 0.6126413180572683
THE 1900s VS THE 1930s!
Distance is 0.7376617407708097
THE 1900s VS THE 1940s!
Distance is 0.7279359927965614
THE 1900s VS THE 1950s!
Distance is 0.7066774282050449
THE 1900s VS THE 1960s!
Distance is 0.7464041793729915
THE 1900s VS THE 1970s!
Distance is 0.7568603093022099
THE 1900s VS THE 1980s!
Distance is 0.7826697811637804
THE 1900s VS THE 1990s!
Distance is 0.7818639491879753
THE 1910s VS THE 1920s!
Distance is 0.693291299615625
THE 1910s VS THE 1930s!
Distance is 0.7885886540006005
THE 1910s VS THE 1940s!
Distance is 0.7709164471749794
THE 1910s VS THE 1950s!
Distance is 0.7482111369414501
THE 1910s VS THE 1960s!
Distance is 0.7889199189790577
THE 1910s VS THE 1970s!
Distance is 0.7916041750826145
THE 1910s VS THE 1980s!
Distance is 0.815785558594597
THE 1910s VS THE 1990s!
Distance is 0.8220172244415658
THE 1920s VS THE 1930s!
Distance is 0.7275721095156007
THE 1920s VS

THE 1940s VS THE 1950s!
Distance is 0.673426383587381
THE 1940s VS THE 1960s!
Distance is 0.7213508070906358
THE 1940s VS THE 1970s!
Distance is 0.7398775796455352
THE 1940s VS THE 1980s!
Distance is 0.7614654556793173
THE 1940s VS THE 1990s!
Distance is 0.7740782129110376
THE 1950s VS THE 1960s!
Distance is 0.646724961739356
THE 1950s VS THE 1970s!
Distance is 0.6806205723793772
THE 1950s VS THE 1980s!
Distance is 0.6914260612600416
THE 1950s VS THE 1990s!
Distance is 0.7268477923182721
THE 1960s VS THE 1970s!
Distance is 0.6720290052372226
THE 1960s VS THE 1980s!
Distance is 0.6922769766881971
THE 1960s VS THE 1990s!
Distance is 0.7043427764971251
THE 1970s VS THE 1980s!
Distance is 0.6690733042664241
THE 1970s VS THE 1990s!
Distance is 0.6747943014557702
THE 1980s VS THE 1990s!
Distance is 0.6318950417399167


[{'distance': 0.6126413180572683, 'first_decade': 1900, 'second_decade': 1920},
 {'distance': 0.6318950417399167, 'first_decade': 1980, 'second_decade': 1990},
 {'distance': 0.6461135586267194, 'first_decade': 1900, 'second_decade': 1910},
 {'distance': 0.646724961739356, 'first_decade': 1950, 'second_decade': 1960},
 {'distance': 0.6690733042664241, 'first_decade': 1970, 'second_decade': 1980},
 {'distance': 0.6720290052372226, 'first_decade': 1960, 'second_decade': 1970},
 {'distance': 0.673426383587381, 'first_decade': 1940, 'second_decade': 1950},
 {'distance': 0.6747943014557702, 'first_decade': 1970, 'second_decade': 1990},
 {'distance': 0.6806205723793772, 'first_decade': 1950, 'second_decade': 1970},
 {'distance': 0.6914260612600416, 'first_decade': 1950, 'second_decade': 1980},
 {'distance': 0.6922769766881971, 'first_decade': 1960, 'second_decade': 1980},
 {'distance': 0.693291299615625, 'first_decade': 1910, 'second_decade': 1920},
 {'distance': 0.7035373328185519, 'first_de
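
The double loop can also be written with `itertools.combinations`, which yields every unordered pair exactly once. The sketch below is self-contained and uses toy vectors rather than the real decade averages; `closest_pair` and `cosine_distance` are hypothetical names introduced here.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity for two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norms if norms else 1.0

def closest_pair(vectors_by_label):
    """Return (distance, label_a, label_b) for the two most similar vectors."""
    pairs = combinations(sorted(vectors_by_label), 2)
    scored = [
        (cosine_distance(vectors_by_label[a], vectors_by_label[b]), a, b)
        for a, b in pairs
    ]
    return min(scored)  # tuples compare by distance first

toy_decades = {
    1900: [1.0, 0.0, 0.0],
    1910: [0.9, 0.1, 0.0],
    1920: [0.0, 0.0, 1.0],
}
best = closest_pair(toy_decades)
```

On the toy data the closest pair is 1900 and 1910, since their vectors point in nearly the same direction while 1920 shares no terms with either.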

Write an article of at most 500 words on what U.S. presidents discussed in their SOTU speeches in the 20th century. You should obviously use your tf-idf analysis as a primary source *but* you will not be able to complete this without actually reading some of the speeches and comparing them to other historical references.

Turn in this notebook, with your article below.
    

(your SOTU article here)