# WeThePeople - Predicting How US Legislation Affects the US Economy

The American economic machine has long fascinated me. I want to understand, in depth, the chain of cause-and-effect relationships that makes the American economy tick. Perhaps more importantly, that understanding provides a framework and springboard for understanding the global economy at large.

The American economy is a multi-dimensional system: it ingests multiple types of signals from multiple sources and can likewise be measured along multiple axes. In this project, we pare this immense problem down to something bite-sized. It's no secret that the actions of those in government affect the lives of their constituents. The question this project sets out to answer is: how? Can we understand exactly how different pieces of legislation affect the American economy as a whole? And armed with this knowledge, can we predict how proposed legislation might affect the American economy in the future?

We'll use machine learning models to help answer these questions. Without further ado, let's begin.

## Data Pipeline

I considered getting data for this project from the following sources:

| Source | Description |
|---|---|
| Congress.GOV | Library of the US Congress: Collects records on many (if not all) congressional activities. Perhaps most important for us, it contains an archive of all Public Laws in US History. |
| FRED | Federal Reserve Economic Data: Aggregates multiple economic indicators from multiple government organizations and publishes them in a single repository. This is probably the best one-stop shop for anything US Economic Data related. |
| BEA | Bureau of Economic Analysis: Produces some of the world's most closely watched statistics, including U.S. gross domestic product (GDP). It also publishes state and local figures, foreign trade and investment statistics, and industry data. |

In fact, I built Python API wrappers from scratch for both Congress.GOV and BEA. On further investigation, however, it turns out that the BEA data is readily available via FRED, so we'll press forward with just Congress.GOV and FRED as our data sources.

### Sourcing and Preparing Data

In [5]:
from src import all_datasets

**BEWARE:** The cell below downloads *a lot* of data from the FRED and Congress.GOV APIs. Because of server-side rate limits, it could run for hours, perhaps even days. You can shorten the run by setting the `search_limit` key in the `congress_args` dictionary and/or the `min_popularity` key in the `fred_args` dictionary. See the documentation for details.

In [None]:
kwargs = {
    'compile_congress_dataset': False,
    'compile_fred_dataset': True,
    'retry_congress_errors': False,
    'congress_args': {},
    'fred_args': {},
    'retry_args': {}
}

all_datasets.get(**kwargs)
# TODO: this call should return series_seq_length and label_seq_length.
# TODO: Bert_Vocabulary also needs to be placed where the pipeline can read it.
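
As a rough illustration of the mitigation described in the warning above, the two limits could be passed like this. The key names `search_limit` and `min_popularity` come from that warning; the values, and my reading of what each limit does, are assumptions, so check the project documentation before relying on them.

```python
# Illustrative values only: the key names come from the warning above,
# but the numbers (and the comments on what they do) are assumptions.
kwargs = {
    'compile_congress_dataset': True,
    'compile_fred_dataset': True,
    'retry_congress_errors': False,
    'congress_args': {'search_limit': 100},  # assumed: caps how much is fetched from Congress.GOV
    'fred_args': {'min_popularity': 50},     # assumed: skips FRED series below this popularity
    'retry_args': {}
}

# Inside the project, where `src` is importable, this would be followed by:
# all_datasets.get(**kwargs)
print(kwargs['congress_args'], kwargs['fred_args'])
```

Starting with tight limits and loosening them once the pipeline runs end to end is probably the quickest way to find problems before committing to a multi-hour download.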

### TensorFlow Pipeline

In [None]:
from src.modeling.preprocess import tf_pipeline

In [None]:
# Come back to this
pipeline_kwargs = {
    'training_data_folder': "datasets/training data",
    'fred_series_id': "GDPC1",
    'series_seq_length': None,
    'label_seq_length': None,
    'n_vocab': None,
    'num_threads': None,
    'local_batch_size': None,
    'distributed': None
}

train_data, val_data, test_data, steps_per_epoch = tf_pipeline.build(**pipeline_kwargs)

## Machine Learning


### Setup (AWS)

For reasons that will become apparent later, we'll need to run the machine learning part of this project on AWS (Amazon Web Services). Specifically, we'll use Amazon SageMaker to train our ML model and Amazon S3 to store our training data and model artifacts.

#### Imports

In [7]:
import os
import tarfile

import sagemaker
import boto3
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    WarmStartConfig,
    WarmStartTypes
)
from sagemaker.tensorflow import TensorFlow

from src.aws_utils import upload

client = boto3.client('sagemaker')
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/tomi/Library/Application Support/sagemaker/config.yaml


#### Helper functions

Let's define some convenience functions that we'll use later in this notebook.

In [11]:
def get_small_estimator(hparams_dict, n_instances=1):
    estimator = TensorFlow(
        entry_point="ML.py",
        role=role,
        source_dir=source_uri,
        model_dir=model_uri,
        framework_version="2.13",
        py_version="py310",
        instance_type="ml.g5.xlarge",
        instance_count=n_instances,
        volume_size=20,
        output_path=output_uri,
        hyperparameters=hparams_dict
    )
    return estimator


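def get_distributed_estimator(hparams_dict, n_instances=1):
    # Pick the SageMaker distribution config that matches the requested strategy
    strategy = hparams_dict.get("distributed")
    if strategy == "tf":
        dist_config = {"multi_worker_mirrored_strategy": {"enabled": True}}
        env_vars = None
    elif strategy == "horovod":
        dist_config = {"mpi": {"enabled": True}}
        env_vars = {"HOROVOD_GPU_OPERATIONS": "NCCL"}
    else:
        # Fail early instead of hitting a NameError on dist_config below
        raise ValueError(f"'distributed' must be 'tf' or 'horovod', got {strategy!r}")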

    estimator = TensorFlow(
        entry_point="ML.py",
        role=role,
        source_dir=source_uri,
        model_dir=model_uri,
        framework_version="2.13",
        py_version="py310",
        instance_type="ml.g5.12xlarge",
        instance_count=n_instances,
        volume_size=20,
        output_path=output_uri,
        hyperparameters=hparams_dict,
        distribution=dist_config,
        environment=env_vars,
        checkpoint_s3_uri=checkpoint_uri
    )
    return estimator

# TODO: Add another function with file_mode as "pipe" or "fast_file"

#### Upload data to S3

In [12]:
model_prefix = "my_first_model"
source_endpt = f"{model_prefix}/inputs/source"
inputs_endpt = f"{model_prefix}/inputs/datasets"

##### Upload source code

In [None]:
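# The context manager closes the archive even if adding a file fails
with tarfile.open("WeThePeople.tgz", 'w:gz') as tar:
    for item in os.listdir("src"):
        # Store each entry at the archive root rather than under "src/"
        tar.add(os.path.join("src", item), arcname=item)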
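
The packaging cell above stores every entry of `src` at the archive root, which is presumably what lets the `ML.py` entry point be found once the archive is unpacked. The sketch below checks that layout on a throwaway directory. The folder contents are placeholders created only for this check; they are not the project's real files.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway stand-in for the project's `src` folder.
    src_dir = os.path.join(tmp, "src")
    os.makedirs(os.path.join(src_dir, "modeling"))
    for rel_path in ("ML.py", os.path.join("modeling", "preprocess.py")):
        with open(os.path.join(src_dir, rel_path), "w") as fh:
            fh.write("# placeholder\n")

    # Same packaging logic as the cell above.
    archive_path = os.path.join(tmp, "WeThePeople.tgz")
    with tarfile.open(archive_path, "w:gz") as tar:
        for item in os.listdir(src_dir):
            tar.add(os.path.join(src_dir, item), arcname=item)

    # Read the archive back and list what it contains.
    with tarfile.open(archive_path, "r:gz") as tar:
        member_names = sorted(tar.getnames())

print(member_names)
```

None of the listed names should start with `src/`; if one does, the entry point would sit one directory deeper than expected.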

In [None]:
local_dir = "WeThePeople.tgz"
inputs = sagemaker_session.upload_data(path=local_dir, bucket=bucket, key_prefix=source_endpt)
print("input spec (in this case, just an S3 path): {}".format(inputs))

##### Upload training data

In [15]:
# Script parameters
local_dir = 'datasets/training data'
key_prefix = "my_first_model/inputs/"
n_threads = 20
bucket = os.environ['MY_DEFAULT_S3_BUCKET']

upload(local_dir, bucket, key_prefix, n_threads)

#### Estimator arguments

In [None]:
train_data_uri = f"s3://{bucket}/{inputs_endpt}/training data"
model_uri = f"s3://{bucket}/{model_prefix}/outputs/model"
output_uri = f"s3://{bucket}/{model_prefix}/outputs/output"
source_uri = f"s3://{bucket}/{source_endpt}/WeThePeople.tgz"
checkpoint_uri = f"s3://{bucket}/{model_prefix}/outputs/checkpoints"
channels = {
    "train": train_data_uri
}

### Model v1

#### Architecture

This model follows the standard encoder-decoder transformer architecture from the paper *Attention Is All You Need*. The encoder is DistilBERT, Hugging Face's distilled version of Google's BERT, and the decoder is one I built myself, designed to process time series and perform regression to predict the next token (here, the next value in the series).

In a nutshell, the encoder ingests the legislative text and exposes its encoded representation to the decoder. The decoder ingests econometric time-series data from a chosen start date up to the date the bill in question was signed into law, uses cross-attention to attend to the encoder's output, and finally uses a regression head to output the next value in the time series. As in the original paper, the cross-attention layer can be repeated N times. This entire process is repeated auto-regressively until we've generated a time series long enough to suit our needs.
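
The auto-regressive part of that description can be sketched without any model at all. In the snippet below, `generate_forecast` and `naive_predictor` are hypothetical names invented for illustration; `predict_next` stands in for one pass through the real decoder, and whether the real model keeps a growing or a fixed-length window is not something this sketch claims to know.

```python
from typing import Callable, List, Sequence


def generate_forecast(
    history: Sequence[float],
    predict_next: Callable[[Sequence[float]], float],
    n_steps: int,
) -> List[float]:
    """Roll a one-step predictor forward n_steps times."""
    window = list(history)
    forecast = []
    for _ in range(n_steps):
        next_value = predict_next(window)  # one decoder pass in the real model
        forecast.append(next_value)
        window.append(next_value)          # feed the prediction back in as input
    return forecast


def naive_predictor(window: Sequence[float]) -> float:
    """Toy stand-in for the decoder: repeat the last observed change."""
    return window[-1] + (window[-1] - window[-2])


print(generate_forecast([100.0, 101.0, 102.0], naive_predictor, n_steps=3))
# → [103.0, 104.0, 105.0]
```

Because each prediction becomes an input to the next step, errors can compound over the forecast horizon, which is worth keeping in mind when reading multi-year error figures.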

Below, I train this model to predict a 5-year econometric outlook given a legislative bill and the preceding 10 years' worth of data.
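
For a quarterly series such as GDPC1, 10 years of history and a 5-year outlook correspond to 40 and 20 observations, which matches the `series_seq_length` and `label_seq_length` values used in the training cells. A minimal sketch of how one training example might be cut from a series is shown below. `split_window` and `signing_index` are hypothetical names, and placing the signing-date observation in the label window is my assumption, not necessarily what the project's pipeline does.

```python
def split_window(series, signing_index, series_seq_length=40, label_seq_length=20):
    """Return (inputs, labels) around signing_index, or None if the series is too short."""
    start = signing_index - series_seq_length
    end = signing_index + label_seq_length
    if start < 0 or end > len(series):
        return None  # not enough history before, or data after, the signing date
    return series[start:signing_index], series[signing_index:end]


quarterly_values = list(range(100))  # placeholder data, one value per quarter
example = split_window(quarterly_values, signing_index=50)
inputs, labels = example
print(len(inputs), len(labels))
```

Laws signed too close to either end of the available data return `None` here; dropping such examples is one simple way to handle them.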

#### Training on AWS

In [None]:
# Customize as needed
hparams = {
    "num_threads": 2,
    "batch_size_per_worker": 1,
    "n_vocab": 100000,
    "label_seq_length": 20,
    "series_seq_length": 40,
    "dropout_rate": 0.1,
    "num_heads": 2,
    "stack_height": 1,
    "d_values": 12,
    "d_keys": 12,
    "encoder_max_seq_len": 512,
    "epochs": 3,
    "learning_rate": 1e-3
}

##### Distributed

In [None]:
hparams["distributed"] = "horovod"
estimator = get_distributed_estimator(hparams)
estimator.fit(inputs=channels)

##### Non-distributed

In [None]:
estimator = get_small_estimator(hparams)
estimator.fit(inputs=channels)

#### Results

I ran a hyperparameter tuning job to find the best-performing model, whose results are reported below. Its hyperparameters are as follows:

|Hyperparameter Name|Description|Value|
|---|---|---|
|batch_size_per_worker|Number of samples per gradient update; also number of samples processed in parallel|3|
|dropout_rate|Dropout probability for dropout layers in decoder|0.0167|
|num_heads|Number of heads in each Multi-head attention block in decoder|12|
|stack_height|Number of "decoder layers" in the decoder. Each decoder layer consists of a Multi Head Attention, Add & Norm, and Feed Forward NN block|5|
|d_values|Dimensionality of transformer values in decoder|481|
|d_keys|Dimensionality of transformer keys in decoder|245|
|encoder_max_seq_len|Maximum sequence length that the encoder accepts as input|888|
|learning_rate|Learning rate for the model compiler|0.0058|

***

This model achieved a 10.48% MAPE (Mean Absolute Percentage Error) on the GDPC1 (Real Gross Domestic Product) time series with a 10-year look-back and 5-year look-ahead window.
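
For readers unfamiliar with the metric, MAPE averages the absolute error of each prediction as a fraction of the true value and expresses the result as a percentage. The helper below is a plain-Python sketch of that definition, not the function used during training, and the numbers are made up for illustration.

```python
def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    # Each term is the absolute error relative to the true value.
    relative_errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


actual = [100.0, 200.0, 400.0]     # made-up observations
predicted = [110.0, 180.0, 400.0]  # made-up forecasts: off by 10%, 10% and 0%
print(round(mape(actual, predicted), 2))
```

This simple version divides by the true value, so it is undefined when an observation is zero. That is not a concern for a series like real GDP, but it would matter for indicators that can cross zero, such as growth rates.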

### Hyperparameter tuning (Optional)

Optionally, run the cells below to perform hyperparameter tuning on any of the models above.

#### Setup Tuner

In [None]:
# Customize as needed
static_hparams = {
    "epochs": 4,
    "n_vocab": 100000,
    "label_seq_length": 20,
    "series_seq_length": 40
}

hparam_ranges = {"encoder_max_seq_len": IntegerParameter(627, 900),
                 "batch_size_per_worker": IntegerParameter(3, 7),
                 "num_threads": IntegerParameter(17, 27),
                 "d_values": IntegerParameter(447, 675),
                 "d_keys": IntegerParameter(191, 300),
                 "num_heads": IntegerParameter(11, 18),
                 "stack_height": IntegerParameter(3, 6),
                 "dropout_rate": ContinuousParameter(0, 0.2),
                 "learning_rate": ContinuousParameter(1e-3, 6e-3)
                }

obj_metric_name = "MAPE"
obj_type = "Minimize"
metric_defs = [
    {
        "Name": "MSE",
        "Regex": "val_mse: ([0-9\\.]+)"
    },
    {
        "Name": "MAPE",
        "Regex": "val_mape: ([0-9\\.]+)"
    },
    {
        "Name": "MAE",
        "Regex": "val_mae: ([0-9\\.]+)"
    },
    {
        "Name": "TEST_MSE",
        "Regex": "test_mse: ([0-9\\.]+)"
    },
    {
        "Name": "TEST_MAPE",
        "Regex": "test_mape: ([0-9\\.]+)"
    },
    {
        "Name": "TEST_MAE",
        "Regex": "test_mae: ([0-9\\.]+)"
    },
]
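
Each `Regex` above is matched against the training job's log output, and the first capture group becomes the metric value. The snippet below applies the `MAPE` pattern to a made-up log line to show what gets captured; the exact format of the real log lines is an assumption.

```python
import re

metric_regex = "val_mape: ([0-9\\.]+)"  # same pattern as the MAPE entry above

# Hypothetical log line in the style of a Keras epoch summary.
log_line = "Epoch 3/3 - loss: 0.0123 - val_mse: 0.0150 - val_mape: 10.4812 - val_mae: 0.0931"

match = re.search(metric_regex, log_line)
value = float(match.group(1))
print(value)

# The character class has no 'e' or '-', so scientific notation is cut short.
truncated = re.search("val_mse: ([0-9\\.]+)", "val_mse: 1.2e-05").group(1)
print(truncated)
```

If the training script ever logs very small values in scientific notation, a pattern such as `([0-9\\.eE+-]+)` might be safer, since the current one would capture only the mantissa.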


If you have a warm start config, declare it in the cell below and run it. Otherwise, set `warm_start_config = None` so that the tuner cell further down still runs.

In [None]:
# Setup warm start config
parent_names = {"TwelveXlargeTune-240404-2029"}
warm_start_config = WarmStartConfig(
    WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM, parents=parent_names
)

##### Distributed

In [None]:
static_hparams["distributed"] = "horovod"
estimator = get_distributed_estimator(static_hparams)

##### Non-distributed

In [None]:
estimator = get_small_estimator(static_hparams)

#### Initialize and Run Tuner

In [None]:
tuner = HyperparameterTuner(
    base_tuning_job_name="TwelveXlargeTune",
    estimator=estimator,
    objective_metric_name=obj_metric_name,
    hyperparameter_ranges=hparam_ranges,
    metric_definitions=metric_defs,
    objective_type=obj_type,
    max_jobs=10,
    max_parallel_jobs=1,
    warm_start_config=warm_start_config
)

tuner.fit(inputs=channels, include_cls_metadata=False)