# BernieBot
Here's a simple program for generating Bernie Sanders speeches (to a first approximation). Using a large compilation of Bernie speeches (~40k words), we use a simple frequency analysis (count the total word distribution $P(w)$ and the one-word conditional distribution $P(w_{t+1} | w_t)$. Yup, it's a Markov chain.

We can random walk on this Markov chain and generate speeches. We can also get the entropy rate $h_\mu = \sum_w P(w) \sum_{w'} P(w'|w) \log (P(w'|w)) $ (assuming Bernie is in his stationary distribution).

In [1]:
import random as rd

speechfile = open("berniespeechCompilation.txt",'r')
speechstring = speechfile.readline()
words = speechstring.split(' ')
num_words = len(words)
set_words = set(words)
print num_words, " words"
print len(set(words)), " unique words though"

46809  words
6156  unique words though


Build the word dictionary. 

In [22]:
n = 2
n_grams = [ words[i:i+n] for i in range(len(words)-n) ]

In [23]:
n_grams[0]

['i', 'was']

In [28]:
words[0:2]

['i', 'was']

In [37]:
[ sum([ 1 if n_grams[j] == words[i:i+n] else 0 for i in range(len(words)-n) ]) for j in range(len(n_grams))]

KeyboardInterrupt: 

In [None]:
n_grams

In [None]:
d = {key: value for (key, value) in iterable}

In [35]:
n_gram_dict = { 

16

In [33]:
1 if 2+2==5 else 0

0

In [2]:
word_dictionary = {}

for i in range(len(words)):
    current_word = words[i]
    if current_word not in word_dictionary.keys():
        word_dictionary[current_word] = {"frequency": 0, "nextwords": {}}

    word_dictionary[current_word]["frequency"] += 1
    
    if i < len(words) - 1:
        nextwords_dictionary = word_dictionary[current_word]["nextwords"]
        next_word = words[i+1]

        if next_word in nextwords_dictionary.keys():
            nextwords_dictionary[next_word]["frequency"] += 1
        else:
            nextwords_dictionary[next_word] = {"frequency": 1}

        word_dictionary[current_word]["nextwords"] = nextwords_dictionary

In [15]:
choose_word_from_dictionary(word_dictionary)

'aided'

In [6]:
def choose_word_from_dictionary(dictionary):
    rn = rd.random()
    s = 0
    for entry in dictionary:
        s += word_dictionary[entry]["frequency"]
        if s >= rn:
            break
            
    return entry

In [5]:
def generate_text(n=500,eps=0.05):
    """
    n is the number of words you want to generate.
    eps is the randomness parameter, a chance to deviate from the conditional distribution.
    """
    generated_text = []
    word = choose_word_from_dictionary(word_dictionary)
    for i in range(n):
        generated_text.append(word)
        rn = rd.random()
        if rn < eps:
            word = choose_word_from_dictionary(word_dictionary)
        else:
            word = choose_word_from_dictionary(word_dictionary[word]["nextwords"])
    
    return generated_text

In [6]:
text = generate_text()
print " ".join(text)

aided mexican drug and all program would authorize the government, it�s a partial ceasefire shows the government, it�s aided aided mexican drug and all program would authorize the aided mexican drug and all program would authorize the government, it�s a aided mexican drug and all program would authorize the government, it�s a partial ceasefire shows the government, it�s aided mexican drug and all program would authorize the government, it�s a partial ceasefire shows the government, it�s a partial ceasefire shows the government, it�s a partial ceasefire shows the government, it�s a partial ceasefire shows the government, it�s a partial ceasefire shows the government, it�s a partial ceasefire shows the government, it�s a partial ceasefire shows the government, it�s a partial ceasefire shows the government, it�s a partial ceasefire shows the government, it�s a partial ceasefire shows the government, it�s a partial ceasefire shows the government, it�s a partial ceasefire shows the governme

And now for the entropy rate.

In [7]:
import numpy as np

H = 0
for word in word_dictionary:
    nextwords = word_dictionary[word]["nextwords"]
    H_word = sum([ nextwords[nextword]["wordfrequency"] * np.log(nextwords[nextword]["wordfrequency"]) for nextword in nextwords ])
    H += word_dictionary[word]["frequency"] * H_word
H = -H / np.log(2)
print "Bernie Sanders's entropy rate: ", H

KeyError: 'wordfrequency'