<a href="https://colab.research.google.com/github/vheastman/FOMC/blob/main/models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Models
Here are the final models and results. Many other approaches were tried; most are saved in models_failed.py, though some were deleted from there as well. The data was processed in process_data.py, and details of the statements were explored in EDA.py.

The purpose of this analysis was to find word clusters (topics) in the statements issued by the FOMC; the dataset loaded below has 170 rows. Relatedly, I attempt to train a model to predict statement 'sentiment', measured by the daily change in the 10-year Treasury yield after the statement is released.
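
As a rough illustration of how a binary sentiment label could be derived from a yield change, here is a minimal sketch. The column names, the yield values, and the convention that 1 means the yield rose are all assumptions made for this example; the actual labels come from process_data.py, which is not shown here.

```python
import pandas as pd

# Hypothetical 10-year yields (percent) around three statement dates; values are invented
demo = pd.DataFrame({
    "date": ["2001-01-03", "2001-01-31", "2001-03-20"],
    "yield_before": [5.10, 5.25, 4.80],
    "yield_after": [5.05, 5.30, 4.78],
})

# Daily change in the yield across the statement release
demo["yield_change"] = demo["yield_after"] - demo["yield_before"]

# Assumed convention: 1 if the yield rose, 0 otherwise
demo["labels"] = (demo["yield_change"] > 0).astype(int)
print(demo[["date", "yield_change", "labels"]])
```

A threshold other than zero (for example, ignoring very small moves) would be an equally plausible design choice.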

For the clustering analysis, I found that plain CountVectorizer counts were the best way to extract features. For the sentiment analysis, the best results came from an SVM tuned with GridSearchCV.

In principle, a model like this could inform trades around FOMC releases, although the test accuracies reported below (roughly 0.53 to 0.59) are close to chance. Just a thought.

In [33]:
# Import standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import sklearn libraries
from sklearn import svm
import sklearn.model_selection as ms
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [34]:
# Read in the statements
statements = pd.read_csv("https://raw.githubusercontent.com/vheastman/FOMC/main/statements_with_labels.csv")
print(statements.head())
print(statements.labels.value_counts())

         date                                               text  labels
0  1994-02-04  Chairman Alan Greenspan announced today that t...       1
1  1994-03-22  Chairman Alan Greenspan announced today that t...       1
2  1994-04-18  Chairman Alan Greenspan announced today that t...       0
3  1994-05-17  The Federal Reserve today announced two action...       1
4  1994-08-16  The Federal Reserve announced today the follow...       0
0    89
1    81
Name: labels, dtype: int64


## I. Clustering

In [35]:
# Extract features with CountVectorizer
# max_df=0.95 ignores words that appear in more than 95% of documents
# min_df=1 keeps every remaining word (min_df=2 would drop words found in only one document)
vec = CountVectorizer(max_df=0.95, min_df=1, stop_words='english')

cv = vec.fit_transform(statements['text'])
print(cv.shape)

(170, 1567)
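
To make the effect of `max_df` and `min_df` concrete, here is a small sketch on a toy corpus. The three documents are invented for illustration and are not taken from the statements.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (invented): "inflation" is in every document, "rate" is in two
docs = [
    "inflation policy rate",
    "inflation growth rate",
    "inflation labor market",
]

# max_df=0.95 drops "inflation" because it appears in 100% of documents
loose = CountVectorizer(max_df=0.95, min_df=1).fit(docs)

# min_df=2 additionally drops words that appear in only one document
strict = CountVectorizer(max_df=0.95, min_df=2).fit(docs)

print(sorted(loose.vocabulary_))   # ['growth', 'labor', 'market', 'policy', 'rate']
print(sorted(strict.vocabulary_))  # ['rate']
```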


In [36]:
def print_top_words(model, feature_names, n_top_words):
    """Print the top words of each topic and return the set of all top words."""
    unique_top_words = set()
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the n_top_words largest weights, in descending order
        top_idx = topic.argsort()[:-n_top_words - 1:-1]
        top_words = [feature_names[i] for i in top_idx]
        print("Topic #%d: %s" % (topic_idx, " ".join(top_words)))
        unique_top_words.update(top_words)
    print()
    return unique_top_words

In [37]:
lda_model = LatentDirichletAllocation(n_components=5, max_iter=5,
                                      learning_method='online',
                                      learning_offset=50.,
                                      random_state=0)
lda_model.fit(cv)

print("\nTopics in LDA model:")
feature_names = vec.get_feature_names_out()
print()
print_top_words(lda_model, feature_names, 10)



Topics in LDA model:

Topic #0: growth policy price inflation securities action pace monetary stability financial
Topic #1: growth long action goals jr currently productivity foreseeable weakness policy
Topic #2: inflation reserve policy action growth monetary board target chairman conditions
Topic #3: inflation policy securities labor longer conditions agency term mortgage range
Topic #4: policy action reserve monetary stability chairman growth board approved price



{'action',
 'agency',
 'approved',
 'board',
 'chairman',
 'conditions',
 'currently',
 'financial',
 'foreseeable',
 'goals',
 'growth',
 'inflation',
 'jr',
 'labor',
 'long',
 'longer',
 'monetary',
 'mortgage',
 'pace',
 'policy',
 'price',
 'productivity',
 'range',
 'reserve',
 'securities',
 'stability',
 'target',
 'term',
 'weakness'}
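
The topic word lists describe the topics, not the documents. To group statements by topic, one option is to take each document's most probable topic from `transform`. The sketch below uses an invented toy corpus and two topics purely for illustration; it is not the analysis run above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (invented) standing in for statements['text']
toy_docs = [
    "inflation price stability inflation",
    "price stability inflation target",
    "labor market employment growth",
    "employment growth labor conditions",
    "mortgage securities agency purchases",
    "agency securities mortgage holdings",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_counts)

# Each row is a document's topic distribution and sums to 1
doc_topic = toy_lda.transform(toy_counts)

# Assign each document to its most probable topic
dominant_topic = doc_topic.argmax(axis=1)
print(np.round(doc_topic, 2))
print(dominant_topic)
```

With the real data, the same two calls on `lda_model` and `cv` would give one topic index per statement.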

## II. Prediction

### II (a). Naive Bayes

In [49]:
# Identify labels
y = statements['labels']

# Vectorize the input text with a TF-IDF vectorizer
tf = text.TfidfVectorizer()
X = tf.fit_transform(statements['text'])
print(X.shape)

(170, 1737)


In [50]:
# Compute the density of X: the percentage of matrix entries that are non-zero
p = 100 * X.nnz / float(X.shape[0] * X.shape[1])
print(f"Each sample has ~{p:.2f}% non-zero features.")

Each sample has ~11.38% non-zero features.
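
The figure above is an average over the whole matrix. As a small sketch of the same calculation, plus a per-document count, here is a toy sparse matrix with invented values:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 2 x 4 matrix (invented): 3 of its 8 entries are non-zero
toy = csr_matrix(np.array([[1, 0, 0, 2],
                           [0, 0, 3, 0]]))

# Overall density, computed the same way as for X above
density = 100 * toy.nnz / (toy.shape[0] * toy.shape[1])

# Number of non-zero features in each row (document)
per_row = toy.getnnz(axis=1)

print(f"{density:.2f}% of entries are non-zero")  # 37.50% of entries are non-zero
print(per_row)                                    # [2 1]
```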


In [51]:
# Separate train/test data (80/20 split; no random_state is set, so scores vary between runs)
(X_train, X_test, y_train, y_test) = ms.train_test_split(X, y, test_size=.2)

In [52]:
# Use GridSearchCV to find optimal alpha
nb_model = ms.GridSearchCV(nb.BernoulliNB(), param_grid={'alpha': np.logspace(-2., 2., 50)})
nb_model.fit(X_train, y_train)

In [53]:
# Determine NB model score on test set
print('Naive Bayes model score:')
round(nb_model.score(X_test, y_test),4)

Naive Bayes model score:


0.5294
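
For context on this score, it helps to know what always predicting the majority class would achieve. The sketch below rebuilds a label vector from the class counts printed earlier (89 zeros, 81 ones) and scores a `DummyClassifier` on it. Note that this is the baseline over the full dataset, not over the particular 34-statement test split, which is an approximation made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the value counts shown above: 89 zeros and 81 ones
y_all = np.array([0] * 89 + [1] * 81)

# DummyClassifier ignores the features, so a placeholder column is enough
X_placeholder = np.zeros((len(y_all), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_all)
baseline_acc = baseline.score(X_placeholder, y_all)
print(round(baseline_acc, 4))  # 0.5235
```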

### II (b). SVM

In [54]:
# Use GridSearchCV (5-fold) to tune C and gamma for an RBF-kernel SVM
svm_model = ms.GridSearchCV(svm.SVC(kernel='rbf'),
                            {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma' : [0.001, 0.01, 0.1, 1]},
                            cv=5)

# Train SVM
svm_model.fit(X_train, y_train)

In [55]:
# Determine SVM model score on test set
print('SVM model score:')
round(svm_model.score(X_test, y_test),4)

SVM model score:


0.5588
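
In the cells above, the TF-IDF vectorizer is fit on all statements before the train/test split, so the test documents influence the vocabulary and IDF weights. One possible alternative, sketched here on an invented toy corpus, is to put the vectorizer and the SVM in a `Pipeline` so that the vectorizer is refit on the training folds only. The texts, labels, and parameter grid below are assumptions for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy corpus (invented): two easily separable kinds of text
texts = (["inflation pressures rising rates increase"] * 10
         + ["growth weak rates lowered easing"] * 10)
labels = [1] * 10 + [0] * 10

# Split the raw text first, so the vectorizer never sees the test documents
texts_train, texts_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", SVC(kernel="rbf")),
])

# Pipeline parameters are addressed as <step name>__<parameter>
grid = GridSearchCV(pipe,
                    {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]},
                    cv=4)
grid.fit(texts_train, y_tr)

print(grid.best_params_)
print(grid.score(texts_test, y_te))
```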

## III. Additional Tests

### Test: run SVM and NB on CountVectorized features

In [57]:
# Divide data into training and testing sets
(X_train_cv, X_test_cv, y_train_cv, y_test_cv) = ms.train_test_split(cv, y, test_size=.2)

In [59]:
# Use GridSearchCV (5-fold) to tune C and gamma for an RBF-kernel SVM on the count features
svm_model_cv = ms.GridSearchCV(svm.SVC(kernel='rbf'),
                            {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma' : [0.001, 0.01, 0.1, 1]},
                            cv=5)
svm_model_cv.fit(X_train_cv, y_train_cv)
print('GridSearchCV SVM model score:')
round(svm_model_cv.score(X_test_cv, y_test_cv),4)

GridSearchCV SVM model score:


0.5882

In [60]:
# Use GridSearchCV to find optimal alpha for NB model
nb_model_cv = ms.GridSearchCV(nb.BernoulliNB(), param_grid={'alpha': np.logspace(-2., 2., 50)})
nb_model_cv.fit(X_train_cv, y_train_cv)
print('GridSearchCV NB model score:')
round(nb_model_cv.score(X_test_cv, y_test_cv),4)

GridSearchCV NB model score:


0.5588
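
With a test set of about 34 statements, a single split moves the score by roughly three percentage points per statement, so the differences between the models above are small. One way to get a steadier estimate is repeated stratified cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the real features; the data and the choice of 5 folds with 10 repeats are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in with the same number of samples as the statements data
X_syn, y_syn = make_classification(n_samples=170, n_features=20, random_state=0)

# 5 folds repeated 10 times gives 50 accuracy estimates
cv_scheme = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(BernoulliNB(), X_syn, y_syn, cv=cv_scheme)

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```

Reporting the mean and spread makes it easier to judge whether one model is actually better than another.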

### Test: Run LDA on tf-idf features

In [48]:
# Let's test LDA on the tf-idf vectorized features
lda_model = LatentDirichletAllocation(n_components=5, max_iter=5,
                                      learning_method='online',
                                      learning_offset=50.,
                                      random_state=0)
lda_model.fit(X)

print("\nTopics in LDA model:")
feature_names = tf.get_feature_names_out()
print_top_words(lda_model, feature_names, 10)


Topics in LDA model:
Topic #0: the its willingness premature changes moderated in lending decelerated indicates
Topic #1: the 10 of hesitancy unwelcome accompanying level 35 how so
Topic #2: exerting the faster january backed obtaining consequence works 28 reflect
Topic #3: the in of to and increase committee remains productive suggests
Topic #4: the of to and in committee inflation federal its that



{'10',
 '28',
 '35',
 'accompanying',
 'and',
 'backed',
 'changes',
 'committee',
 'consequence',
 'decelerated',
 'exerting',
 'faster',
 'federal',
 'hesitancy',
 'how',
 'in',
 'increase',
 'indicates',
 'inflation',
 'its',
 'january',
 'lending',
 'level',
 'moderated',
 'obtaining',
 'of',
 'premature',
 'productive',
 'reflect',
 'remains',
 'so',
 'suggests',
 'that',
 'the',
 'to',
 'unwelcome',
 'willingness',
 'works'}
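
The topics above are dominated by words such as "the", "of", and "to" because the `TfidfVectorizer` was created without a stop-word list, unlike the `CountVectorizer` in section I. A minimal sketch of the difference, on two invented sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences (invented) containing common function words
docs = [
    "the committee decided to raise the target rate",
    "the committee sees inflation in the outlook",
]

plain = TfidfVectorizer().fit(docs)
filtered = TfidfVectorizer(stop_words="english").fit(docs)

# Words that the English stop-word list removes from the vocabulary
removed = sorted(set(plain.vocabulary_) - set(filtered.vocabulary_))
print(removed)
```

Passing `stop_words='english'` when building `tf` would make this test more comparable to the CountVectorizer run, though LDA is usually applied to raw counts rather than TF-IDF weights.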