<a href="https://www.kaggle.com/code/vladtasca/using-markov-chains-to-generate-fomc-speech?scriptVersionId=165408967" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Using Markov Chains to Generate FOMC Speech
The goal of this notebook is to explore how well Markov chains can generate random but plausible-sounding text from a corpus of FOMC meeting statements and minutes.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import collections
import random
import re
from dateutil import parser

comms = pd.read_csv('../input/fomc-meeting-statements-and-minutes/communications.csv')

# Split into statements and minutes
statements = comms.loc[comms['Type'] == 'Statement'].copy()
minutes = comms.loc[comms['Type'] == 'Minute'].copy()

statements = (statements
 .assign(Date = lambda df: pd.to_datetime(df['Date']))
 .reset_index(drop=True)
)

minutes = (minutes
 .assign(Date = lambda df: pd.to_datetime(df['Date']))
 .reset_index(drop=True)
)

all_statements = '\n'.join(statements['Text'])
all_minutes = '\n'.join(minutes['Text'])

## Text cleaning
Let's start by removing the non-essential artifacts left in the source data by the scraping process, as these would degrade the quality of the text we're aiming to generate.

We need some heuristics to get rid of the extraneous parts of the statements. The functions below do a good enough job of filtering the unnecessary data:

In [2]:
def is_mostly_names(line):
    """
    Returns True if more than half of the words in the string argument are title-cased
    (first letter uppercase, the rest lowercase), as is typical for lists of names.
    """
    words = line.split()
    capitalized_words = [word for word in words if word.istitle()]
    return len(capitalized_words) > len(words) / 2

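def contains_date(line):
    """
    Returns True if dateutil's fuzzy parser can find a date in the string argument.
    Fuzzy matching is permissive, so a stray number may also be read as a date.
    """
    try:
        parser.parse(line, fuzzy=True, tzinfos={'EST': 'UTC-5'})
        return True
    except (ValueError, OverflowError):
        # ValueError: no date could be parsed
        # OverflowError: a parsed number exceeds the largest valid C integer
        return False
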
def contains_phone_number(line):
    """
    Returns True if the string argument contains a phone number of the format XXX-XXX-XXXX.
    """
    pattern = r'\d{3}-\d{3}-\d{4}'
    match = re.search(pattern, line)
    return match is not None
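
As a quick sanity check, the two dependency-free heuristics can be tried on a few made-up lines. The sample strings below are invented for illustration and are not taken from the dataset; the two functions are repeated in condensed form so that the snippet runs on its own.

```python
import re

def is_mostly_names(line):
    # True if more than half of the words are title-cased
    words = line.split()
    return sum(word.istitle() for word in words) > len(words) / 2

def contains_phone_number(line):
    # True if the line contains a number of the form XXX-XXX-XXXX
    return re.search(r'\d{3}-\d{3}-\d{4}', line) is not None

# Hypothetical sample lines (not from the dataset)
samples = [
    "Voting for the action were Jane Doe, John Smith, Mary Major, and Richard Roe.",
    "The Committee decided to maintain the target range for the federal funds rate.",
    "For media inquiries, call 202-555-0100.",
]

flags = [(is_mostly_names(s), contains_phone_number(s)) for s in samples]
for sample, (names, phone) in zip(samples, flags):
    print(f"names={names!s:5} phone={phone!s:5} {sample[:40]}")
```

Only the first line is flagged as mostly names and only the last as containing a phone number, so both would be dropped by the filter while the policy sentence survives.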

In [3]:
# Re-split the combined text so that each row holds a single line
# (this replaces the earlier per-statement `statements` DataFrame)
statements = pd.DataFrame(all_statements.split('\n'), columns=['Text'])

std = (
    statements
    .loc[statements['Text'].str.replace(' ', '') != '']
    .assign(Text = lambda df: df['Text'].str.replace('\t', ''))
    .assign(MostlyNames = lambda df: df['Text'].apply(is_mostly_names))
    .assign(HasDate = lambda df: df['Text'].apply(contains_date))
    .assign(HasPhone = lambda df: df['Text'].apply(contains_phone_number))
    .assign(TextLen = lambda df: df['Text'].apply(len))
    .reset_index(drop=True)
)

std

| | Text | MostlyNames | HasDate | HasPhone | TextLen |
|---|---|---|---|---|---|
| 0 | Recent indicators suggest that growth of econo... | False | False | False | 287 |
| 1 | The U.S. banking system is sound and resilient... | False | False | False | 288 |
| 2 | The Committee seeks to achieve maximum employm... | False | False | False | 948 |
| 3 | In assessing the appropriate stance of monetar... | False | False | False | 544 |
| 4 | Voting for the monetary policy action were Jer... | True | False | False | 282 |
| ... | ... | ... | ... | ... | ... |
| 1496 | In taking the discount rate action, the Federa... | False | False | False | 425 |
| 1497 | The Federal Open Market Committee voted today ... | False | False | False | 248 |
| 1498 | The Committee remains concerned that over time... | False | False | False | 307 |
| 1499 | Against the background of its long-run goals o... | False | True | False | 286 |


Let's filter out lines that contain mostly names, have dates or phone numbers, or are otherwise too short to carry any useful information:

In [4]:
source_text = std.loc[
    ~std['MostlyNames'] &
    ~std['HasDate'] &
    ~std['HasPhone'] &
    (std['TextLen'] > 30),
    'Text'
].tolist()

source_text = ' '.join(source_text)

The `generate_text` function below is adapted from a [great article by Ben Hoyt](https://benhoyt.com/writings/markov-chain/), which was itself inspired by the book [The Practice of Programming](https://www.cs.princeton.edu/~bwk/tpop.webpage/).

In [5]:
def end_sentence_in_period(output, possibles, prefix_length=2):
    """
    Trim the generated words so that the text ends with a period.

    Walks backwards through the output. A word that already ends a sentence is kept;
    otherwise it is swapped for a sentence-ending word that followed the same prefix
    in the source text, if one exists.
    """
    words = output
    for i in range(len(output) - 1, prefix_length - 1, -1):
        if output[i].endswith('.'):
            words = output[:i + 1]
            break
        prefix = tuple(output[i - prefix_length:i])
        # Use .get() so that the lookup does not add new keys to the defaultdict
        enders = [w for w in possibles.get(prefix, []) if w.endswith('.')]
        if enders:
            words = output[:i] + [enders[0]]
            break

    # Skip the empty end-of-input markers when joining
    return ' '.join(word for word in words if word)

def generate_text(text, prefix_length=2, text_length=100):
    if prefix_length < 2:
        raise ValueError("Prefix length must be at least 2")

    # Initialize prefix as a tuple of empty strings
    prefix = ('',) * prefix_length
    possibles = collections.defaultdict(list)

    # Build possibles table indexed by a prefix of prefix_length words
    for line in text.split('\n'):
        for word in line.split():
            possibles[prefix].append(word)
            prefix = prefix[1:] + (word,)  # Slide the window

    # Avoid empty possibles lists at end of input: pad with empty strings
    # until the prefix wraps around to the initial all-empty prefix
    for _ in range(prefix_length):
        possibles[prefix].append('')
        prefix = prefix[1:] + ('',)

    # Generate randomized output (start with a random capitalized prefix)
    capitalized_prefixes = [k for k in possibles if k[0][:1].isupper()]
    if not capitalized_prefixes:
        raise ValueError("No suitable starting prefix found. Ensure the text contains capitalized words.")

    prefix = random.choice(capitalized_prefixes)
    output = list(prefix)
    for _ in range(text_length - prefix_length):
        word = random.choice(possibles[prefix])
        output.append(word)
        prefix = prefix[1:] + (word,)

    # For a complete picture, let's end the text on a full sentence
    output = end_sentence_in_period(output, possibles, prefix_length)

    return output, possibles

In [6]:
output, possibles = generate_text(source_text, prefix_length=2, text_length=100)

The function parses the source text, then builds a dictionary that maps every n-gram present in the text (where n is set by `prefix_length`) to the list of words that follow it. The list records each following word once per occurrence, so a word that follows a given n-gram more often than others appears in its list that many more times.
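
To make the structure concrete, here is a minimal sketch that builds the same kind of table for a tiny made-up sentence. The helper name `build_table` and the sample text are assumptions for illustration only; the notebook itself builds the table inside `generate_text`.

```python
import collections

def build_table(text, prefix_length=2):
    """Map each prefix (a tuple of words) to the list of words that follow it."""
    table = collections.defaultdict(list)
    prefix = ('',) * prefix_length
    for word in text.split():
        table[prefix].append(word)
        prefix = prefix[1:] + (word,)  # Slide the window by one word
    return table

# Hypothetical toy corpus (not from the dataset)
toy_text = "the rate is low and the rate is stable and the rate will rise"
table = build_table(toy_text)

print(table[('the', 'rate')])
print(table[('rate', 'is')])
```

Here `'is'` appears twice in the list for `('the', 'rate')` and `'will'` once, so a uniform draw from that list picks `'is'` twice as often as `'will'`.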

The effect is that when we generate new text one word at a time, the next word is drawn from the words that followed the current n-gram in the source text. To illustrate, let's check the entry for the bigram `"monetary policy"`:

In [7]:
collections.Counter(possibles[('monetary', 'policy')])

Counter({'remains': 37,
         'action': 35,
         'as': 27,
         'will': 13,
         'affects': 10,
         'until': 10,
         'to': 6,
         'action,': 6,
         'is': 5,
         'actions': 5,
         'that': 4,
         'accommodation,': 2,
         'in': 1,
         'would': 1,
         'and': 1,
         'accommodative,': 1})

So the most common combination of words would be `"monetary policy remains"`, followed by `"monetary policy action"` and so on.

When generating new text, we pick one of these following words at random. Because each word is listed once per occurrence in the original text, words with more occurrences are proportionally more likely to be chosen.
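
Drawing uniformly from a list with repeated entries is equivalent to drawing each distinct word with probability proportional to its count. As a rough sketch, the counts shown above for `('monetary', 'policy')` can be turned into empirical transition probabilities; the variable names below are illustrative and not part of the notebook.

```python
import collections

# Counts copied from the Counter output for ('monetary', 'policy')
counts = collections.Counter({
    'remains': 37, 'action': 35, 'as': 27, 'will': 13, 'affects': 10,
    'until': 10, 'to': 6, 'action,': 6, 'is': 5, 'actions': 5, 'that': 4,
    'accommodation,': 2, 'in': 1, 'would': 1, 'and': 1, 'accommodative,': 1,
})

# Each word's probability is its share of all observed followers
total = sum(counts.values())
probabilities = {word: n / total for word, n in counts.items()}

for word, n in counts.most_common(3):
    print(f"{word:8} {n:3d} / {total} = {probabilities[word]:.1%}")
```

With these counts, `remains` follows `monetary policy` in 37 of 164 cases, so it is chosen a little under a quarter of the time.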

The result is that the generated text sounds almost plausible, even though it is produced by a random process. If we squint, it might even be coherent:

In [8]:
print(output)

Committee also decided to keep the target range for the federal funds rate remains appropriate. In determining how long to maintain price stability. The Committee believes that policy accommodation by purchasing up to $300 billion of Treasury and agency mortgage-backed securities in agency mortgage-backed securities at least through mid-2015. Voting for the economic outlook. The Federal Reserve's holdings of agency debt and agency mortgage-backed securities at auction. This policy, by keeping the Committee's expectation of exceptionally low range for the federal funds rate at 1-1/4 to 1-1/2 to 1-3/4 percent.


![Jerome Powell saying "I never said that"](https://i.imgur.com/kb0hy0x.png)