## Yield Spread Model hyperparameter tuning


This model implements the combined yield spread model and the uses the Keras tuner to tune the hyper-parameters

In [1]:
import pandas as pd
import numpy as np
from google.cloud import bigquery
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Embedding
from tensorflow.keras import activations
from tensorflow.keras import backend as K
from tensorflow.keras import initializers
from tensorflow.keras.layers.experimental.preprocessing import Normalization
from sklearn import preprocessing
from datetime import datetime
import matplotlib.pyplot as plt
import pickle
from lightgbm import LGBMRegressor
import lightgbm
from keras_tuner import HyperModel
from keras_tuner.tuners import RandomSearch, Hyperband, BayesianOptimization
from IPython.display import display, HTML

from data_preparation import process_data  

Setting the seed for layer initializer. We want the layers to be initialized with the same values in all the experiments to remove randomness from the results

In [2]:
layer_initializer = initializers.RandomNormal(mean=0.0, stddev=0.1, seed=10)

Setting up the credentials for GCP

In [3]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="eng-reactor-287421-112eb767e1b3.json"

In [4]:
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'

Initializing the big query client

In [5]:
bq_client = bigquery.Client()

Checking if GPU is available

In [6]:
tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [7]:
tf.test.is_gpu_available()

Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.


True

#### Hyper-parameters for the model

The batch size and learning rate have an impact on the smoothness of convergence of the model.\
Larger the batch size the smoother the convergence. For a larger batch size we need a higher learning rate and vice-versa

In [8]:
TRAIN_TEST_SPLIT = 0.85
LEARNING_RATE = 0.001
BATCH_SIZE = 1000
NUM_EPOCHS = 100

SEQUENCE_LENGTH = 5
NUM_FEATURES = 5

### Query to fetch data from BigQuery

The SQL query uses the trade history for the training data view. All three trade directions, namely dealer-dealer (D), dealer-sells (S), and dealer-purchases (P) are included. We are limiting the training to bonds whose yield is a positive number less than three.  


In [9]:
DATA_QUERY = """ SELECT
  *
FROM
  `eng-reactor-287421.primary_views.trade_history_for_training_no_neg_yields`
WHERE
  yield IS NOT NULL
  AND yield > 0 
  AND yield <= 3 
  AND par_traded IS NOT NULL
  AND sp_long IS NOT NULL
  AND trade_date >= '2021-08-01' 
  AND trade_date <= '2021-10-01'
  AND msrb_valid_to_date > current_date -- condition to remove cancelled trades
ORDER BY
  trade_date DESC
            """

### Data Preparation

We grab the data from BigQuery and convert it into a format suitable for input to the model. The process data function converts the dictionary of trade history into a list of lists. 

In [10]:
processed_file = 'training_data.pkl'

In [11]:
%%time
if not os.path.isfile(processed_file):
    reference_data = process_data(DATA_QUERY, 
                              bq_client,
                              SEQUENCE_LENGTH,
                              NUM_FEATURES,
                              'data.pkl')
    reference_data.to_pickle(processed_file)
else:
    print('Reading from processed file')
    reference_data = pd.read_pickle('training_data.pkl')

Reading from processed file
CPU times: user 2.06 s, sys: 549 ms, total: 2.61 s
Wall time: 2.61 s


We use the dictionary to map the interest payment frequency code.

In [12]:
COUPON_FREQUENCY_DICT = {0:"Unknown",
                        1:"Semiannually",
                        2:"Monthly",
                        3:"Annually",
                        4:"Weekly",
                        5:"Quarterly",
                        6:"Every 2 years",
                        7:"Every 3 years",
                        8:"Every 4 years",
                        9:"Every 5 years",
                        10:"Every 7 years",
                        11:"Every 8 years",
                        12:"Biweekly",
                        13:"Changeable",
                        14:"Daily",
                        15:"Term mode",
                        16:"Interest at maturity",
                        17:"Bimonthly",
                        18:"Every 13 weeks",
                        19:"Irregular",
                        20:"Every 28 days",
                        21:"Every 35 days",
                        22:"Every 26 weeks",
                        23:"Not Applicable",
                        24:"Tied to prime",
                        25:"One time",
                        26:"Every 10 years",
                        27:"Frequency to be determined",
                        28:"Mandatory put",
                        29:"Every 52 weeks",
                        30:"When interest adjusts-commercial paper",
                        31:"Zero coupon",
                        32:"Certain years only",
                        33:"Under certain circumstances",
                        34:"Every 15 years",
                        35:"Custom",
                        36:"Single Interest Payment"
                        }

In [13]:
df = reference_data.copy()
df.interest_payment_frequency.fillna(0, inplace=True)
df.interest_payment_frequency = df.interest_payment_frequency.apply(lambda x: COUPON_FREQUENCY_DICT[x])
df.interest_payment_frequency.head()

0    Semiannually
1    Semiannually
2    Semiannually
3    Semiannually
4    Semiannually
Name: interest_payment_frequency, dtype: object

Dropping a few columns that we do not use as features. These features were dropped after analyzing their importance using LightGBM

In [14]:
%%time
df.drop(columns=[
                 'sp_stand_alone',
                 'sp_icr_school',
                 'sp_watch_long',
                 'sp_outlook_long',
                 'sp_prelim_long',
                 'MSRB_maturity_date',
                 'MSRB_INST_ORDR_DESC',
                 'MSRB_valid_from_date',
                 'MSRB_valid_to_date',
                 'upload_date',
                 'sequence_number',
                 'security_description',
                 'ref_valid_from_date',
                 'ref_valid_to_date',
                 'additional_next_sink_date',
                 'first_coupon_date',
                 'last_period_accrues_from_date',
                 'primary_market_settlement_date',
                 'assumed_settlement_date',
                 'sale_date','q','d'],
                  inplace=True)

CPU times: user 429 ms, sys: 128 ms, total: 557 ms
Wall time: 554 ms


Converting the columns to correct datatypes. We also restrict the universe of trades to only investment grade bonds 

In [15]:
%%time
df = df.copy()
df['quantity'] = np.log10(df.par_traded.astype(float))
df.coupon = df.coupon.astype(float)
df.issue_amount = np.log10(df.issue_amount)

date_cols = [col for col in list(df.columns) if 'DATE' in col.upper()]
for col in date_cols:
    df[col] = pd.to_datetime(df[col])

prices = ['coupon', 'par_traded', 'dollar_price', 'next_call_price', 'par_call_price', 'refund_price']
for col in prices:
    df[col] = df[col].astype(float)
    
# Just including investment grade bonds
df = df[df.sp_long.isin(['A-','A','A+','AA-','AA','AA+','AAA'])] 
df['rating'] = df.sp_long
df['yield_spread'] = df['yield_spread'] * 100


CPU times: user 3.01 s, sys: 573 ms, total: 3.58 s
Wall time: 3.58 s


Creating Binary features

In [16]:
df['callable'] = df.is_callable  
df['called'] = df.is_called 
df['zerocoupon'] = df.coupon == 0
df['whenissued'] = df.delivery_date >= df.trade_date
df['sinking'] = ~df.next_sink_date.isnull()
df['deferred'] = (df.interest_payment_frequency == 'Unknown') | df.zerocoupon

Converting the dates to a number of days from the settlement date. We only consider trades to be reportedly correctly if the trades are settled within one month of the trade date. 

In [17]:
# Dropping trades settled one month after the trade
print(len(df))
df['days_to_settle'] = (df.settlement_date - df.trade_date).dt.days
df = df[df.days_to_settle <= 31]
print(len(df))

384989
384298


In [18]:
df['days_to_maturity'] =  np.log10(1 + (df.maturity_date - df.settlement_date).dt.days)
df['days_to_call'] = np.log10(1 + (df.next_call_date - df.settlement_date).dt.days.fillna(0))
df['days_to_par'] = np.log10(1 + (df.par_call_date - df.settlement_date).dt.days)
df['call_to_maturity'] = np.log10(1 + (df.maturity_date - df.next_call_date).dt.days)


# Removing bonds from Puerto Rico
df = df[df.incorporated_state_code != 'PR']

We drop the trades which have already been called

In [19]:
print(len(df))
# df = df[~df.called]
# print(len(df))

384021


Adding seconds ago and yield spreads features of the last trade to the reference data model

In [20]:
def get_latest_trade_feature(x, feature):
    recent_trade = x[0]
    if feature == 'yield_spread':
        return recent_trade[0]
    elif feature == 'seconds_ago':
        return recent_trade[-1]
    elif feature == 'par_traded':
        return recent_trade[1]

In [21]:
df['last_seconds_ago'] = df.trade_history.apply(get_latest_trade_feature, args=["seconds_ago"])
df['last_yield_spread'] = df.trade_history.apply(get_latest_trade_feature, args=["yield_spread"])
df['last_size'] = df.trade_history.apply(get_latest_trade_feature, args=["par_traded"])
df.head()

Unnamed: 0,rtrs_control_number,trade_datetime,cusip,my_price,price_delta,msrb_cusip,yield_spread,num_prev_messages,publish_datetime,trade_type,...,sinking,deferred,days_to_settle,days_to_maturity,days_to_call,days_to_par,call_to_maturity,last_seconds_ago,last_yield_spread,last_size
0,2021100101708500,2021-10-01 11:04:26,788073DK7,112.084,0.0,788073DK7,-49.288566,0,2021-10-01 11:05:24,D,...,False,False,4,3.01368,0.0,,,17.304124,-78.929506,55.0
1,2021100100379500,2021-10-01 09:09:10,68277DES9,99.514,0.0,68277DES9,86.171957,0,2021-10-01 09:09:49,D,...,False,False,4,3.709609,3.409595,3.409595,3.407731,4.430817,83.711434,20.0
2,2021100102681500,2021-10-01 12:01:13,446186PP7,118.287,0.0,446186PP7,3.711434,0,2021-10-01 12:01:38,S,...,False,False,4,3.670617,3.396722,3.396722,3.340841,12.719926,15.838303,100.0
3,2021100102502900,2021-10-01 11:49:13,649791HG8,105.185,0.0,649791HG8,-87.828043,0,2021-10-01 11:49:44,S,...,False,False,4,2.710117,0.0,,,8.891787,-85.028043,50.0
4,2021100102121800,2021-10-01 11:25:26,64971W5Y2,100.954,0.0,64971W5Y2,-45.688566,0,2021-10-01 11:26:10,S,...,False,False,4,2.320146,0.0,,,15.150143,-68.931304,25.0


Filling missing values for non-categorical features. The missing values are filled by ther logical counterparts from the [47 Data Enumerations by XPath google sheet](https://docs.google.com/spreadsheets/d/1ke5Ga0OMLAY7T47I6AsS54tYTreMGvjo/edit#gid=1305746325).

In [22]:
df.dropna(subset=['instrument_primary_name'], inplace=True)
df.purpose_sub_class.fillna(1,inplace=True)
df.call_timing.fillna(0, inplace=True) #Unknown
df.call_timing_in_part.fillna(0, inplace=True) #Unknown
df.sink_frequency.fillna(10, inplace=True) #Under special circumstances
df.sink_amount_type.fillna(0, inplace=True)
df.issue_text.fillna('No issue text', inplace=True)
df.state_tax_status.fillna(0, inplace=True)
df.series_name.fillna('No series name', inplace=True)

Filling missing values for categorical features. The missing values are filled by ther logical counterparts from the [47 Data Enumerations by XPath google sheet](https://docs.google.com/spreadsheets/d/1ke5Ga0OMLAY7T47I6AsS54tYTreMGvjo/edit#gid=1305746325).

In [23]:
df.next_call_price.fillna(100, inplace=True)
df.par_call_price.fillna(100, inplace=True)
df.min_amount_outstanding.fillna(0, inplace=True)
df.max_amount_outstanding.fillna(0, inplace=True)
df.call_to_maturity.fillna(0, inplace=True)
df.days_to_par.fillna(0, inplace=True)
df.maturity_amount.fillna(0, inplace=True)
df.issue_price.fillna(df.issue_price.mean(), inplace=True)
df.orig_principal_amount.fillna(df.orig_principal_amount.mean(), inplace=True)
df.original_yield.fillna(0, inplace=True)
df.par_price.fillna(100, inplace=True)
df.called_redemption_type.fillna(0, inplace=True)

Filing missing values for binary features. The missing values are filled by ther logical counterparts from the [47 Data Enumerations by XPath google sheet](https://docs.google.com/spreadsheets/d/1ke5Ga0OMLAY7T47I6AsS54tYTreMGvjo/edit#gid=1305746325).

In [24]:
df.extraordinary_make_whole_call.fillna(False, inplace=True)
df.make_whole_call.fillna(False, inplace=True)
df.default_indicator.fillna(False, inplace=True)

In [25]:
print(len(df))

384003


We train the model on a subset of features. These features are defined below

In [26]:
IDENTIFIERS = ['rtrs_control_number', 'cusip']


BINARY = ['callable',
          'sinking',
          'zerocoupon',
          'is_non_transaction_based_compensation',
          'is_general_obligation',
          'callable_at_cav',           
          'extraordinary_make_whole_call', 
           'make_whole_call',
           'has_unexpired_lines_of_credit',
           'escrow_exists',
          ]



CATEGORICAL_FEATURES = ['rating',
                        'incorporated_state_code',
                        'trade_type',
                        'transaction_type',
                        'maturity_description_code',
                        'purpose_class']

NON_CAT_FEATURES = ['quantity',
                    'days_to_maturity',
                    'days_to_call',
                    'coupon',
                    'issue_amount',
                    'last_seconds_ago',
                    'last_yield_spread',
                    'days_to_settle',
                     'days_to_par',
                     'maturity_amount',
                     'issue_price', 
                     'orig_principal_amount',
                     'max_amount_outstanding']

TRADE_HISTORY = ['trade_history']
TARGET = ['yield_spread']

PREDICTORS = BINARY + CATEGORICAL_FEATURES + NON_CAT_FEATURES + TARGET + TRADE_HISTORY

In [27]:
processed_data = df[IDENTIFIERS + PREDICTORS]

In [28]:
processed_data.maturity_amount = np.log10(1 + processed_data.maturity_amount)
processed_data.orig_principal_amount = np.log10(1 + processed_data.orig_principal_amount)
processed_data.max_amount_outstanding = np.log10(1 + processed_data.max_amount_outstanding)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [29]:
for col in NON_CAT_FEATURES:
    processed_data[col] = processed_data[col].astype(float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


A few features such as the initial issue amount cannot be filled with their logical counterparts as their values are not known and hence are dropped. 

In [30]:
print(len(processed_data))
processed_data = processed_data.dropna()
print(len(processed_data))

384003
383981


Splitting the date into train and test set

In [31]:
train_index = int(len(processed_data) * (1-TRAIN_TEST_SPLIT))
train_dataframe = processed_data[train_index:]
test_dataframe = processed_data[:train_index]
print(len(train_dataframe))
print(len(test_dataframe))


326384
57597


## Combining models

Fitting encoders to the categorical features. These encoders are then used to encode the categorical features of the train and test set

In [32]:
encoders = {}
fmax = {}
for f in CATEGORICAL_FEATURES:
    fprep = preprocessing.LabelEncoder().fit(processed_data[f].drop_duplicates())
    fmax[f] = np.max(fprep.transform(fprep.classes_))
    encoders[f] = fprep

The build model function creates and returns a Keras model. The function uses the hp argument to define the hyperparameters during model creation.

In [33]:
# os.environ["KERASTUNER_TUNER_ID"]="tuner0"
# os.environ["KERASTUNER_ORACLE_IP"]="0.0.0.0"
# os.environ["KERASTUNER_ORACLE_PORT"]="8000"

In [47]:
def build_model(hp):
    inputs = []
    layer = []

    ############## INPUT BLOCK ###################
    trade_history_input = layers.Input(name="trade_history_input", 
                                       shape=(SEQUENCE_LENGTH,NUM_FEATURES), 
                                       dtype = tf.float32) 

    inputs.append(trade_history_input)

    for i in NON_CAT_FEATURES + BINARY:
        inputs.append(layers.Input(shape=(1,), name = f"{i}"))

    for i in inputs[1:]:
        layer.append(Normalization()(i))
    ####################################################


    ############## TRADE HISTORY MODEL #################

    # Adding the time2vec encoding to the input to transformer
    lstm_layer = layers.LSTM(hp.Int("lstm_layer_1_units", min_value=10, max_value=500, step=50), 
                             activation='tanh',
                             input_shape=(SEQUENCE_LENGTH,NUM_FEATURES),
                             kernel_initializer = layer_initializer,
                             return_sequences = True,
                             name='LSTM')

    lstm_layer_2 = layers.LSTM(hp.Int("lstm_layer_2_units", min_value=10, max_value=500, step=50), 
                               activation='tanh',
                               input_shape=(SEQUENCE_LENGTH,50),
                               kernel_initializer = layer_initializer,
                               return_sequences = False,
                               name='LSTM_2')

    features = lstm_layer(inputs[0])
    features = lstm_layer_2(features)

    trade_history_output = layers.Dense(hp.Int("trade_history_output_layer", min_value=10, max_value=500, step=50), 
                                        activation='relu',
                                        kernel_initializer=layer_initializer)(features)

    ####################################################

    ############## REFERENCE DATA MODEL ################
    global encoders
    global fmax
    for f in CATEGORICAL_FEATURES:
        fin = layers.Input(shape=(1,), name = f)
        inputs.append(fin)
        embedded = layers.Flatten(name = f + "_flat")( layers.Embedding(input_dim = fmax[f]+1,
                                                                        output_dim = hp.Int("embedding_dim", min_value=10, max_value=500, step=50),
                                                                        input_length= 1,
                                                                        name = f + "_embed",
                                                                        embeddings_initializer=layer_initializer)(fin))
        layer.append(embedded)

    reference_hidden = layers.Dense(hp.Int("reference_hidden_1_units", min_value=10, max_value=500, step=50), 
                                    activation='relu',
                                    kernel_initializer=layer_initializer,
                                    name='reference_hidden_1')(layers.concatenate(layer))

    reference_hidden2 = layers.Dense(hp.Int("reference_hidden_2_units", min_value=10, max_value=500, step=50), 
                                     activation='relu',
                                     kernel_initializer=layer_initializer,
                                     name='reference_hidden_2')(reference_hidden)

    referenece_output = layers.Dense(hp.Int("reference_hidden_3_units", min_value=10, max_value=500, step=50), 
                                     activation='tanh',
                                     kernel_initializer=layer_initializer,
                                     name='reference_hidden_3')(reference_hidden2)

    ####################################################


    feed_forward_input = layers.concatenate([referenece_output, trade_history_output])

    hidden = layers.Dense(hp.Int("output_block_1_units", min_value=250, max_value=600, step=50), 
                          activation='relu',
                          kernel_initializer=layer_initializer)(feed_forward_input)

    hidden2 = layers.Dense(hp.Int("output_block_2_units", min_value=100, max_value=600, step=50), 
                           activation='tanh',
                           kernel_initializer=layer_initializer)(hidden)

    final = layers.Dense(1,
                         kernel_initializer=layer_initializer)(hidden2)


    model = keras.Model(inputs=inputs, outputs=final)
    
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp.Choice("learning_rate", values=[1e-2, 1e-3, 1e-4])),
              loss=keras.losses.MeanSquaredError(),
              metrics=[keras.metrics.MeanAbsoluteError()])
    
    return model

The create input function encodes the categorical features. It then combines the trade history, categorical, non-categorical, and binary features to return a NumPy array containing the data to be fed into the model.

In [48]:
def create_input(df):
    global encoders
    datalist = []
    datalist.append(np.stack(df['trade_history'].to_numpy()))
    for f in NON_CAT_FEATURES + BINARY:
        datalist.append(df[f].to_numpy().astype('float32'))
        
    for f in CATEGORICAL_FEATURES:
        encoded = encoders[f].transform(df[f])
        datalist.append(encoded.astype('float32'))
    return datalist

Defining the tuner for hyper-parameter tuning

In [49]:
tuner = RandomSearch(
    build_model,
    objective="val_mean_absolute_error",
    max_trials=25,
    overwrite=True,
    directory="model_tuning",
    project_name="yield_spread_model",
    distribution_strategy=tf.distribute.MirroredStrategy()
)





INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)


INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)


In [50]:
tuner.search_space_summary()

Search space summary
Default search space size: 10
lstm_layer_1_units (Int)
{'default': None, 'conditions': [], 'min_value': 10, 'max_value': 500, 'step': 50, 'sampling': None}
lstm_layer_2_units (Int)
{'default': None, 'conditions': [], 'min_value': 10, 'max_value': 500, 'step': 50, 'sampling': None}
trade_history_output_layer (Int)
{'default': None, 'conditions': [], 'min_value': 10, 'max_value': 500, 'step': 50, 'sampling': None}
embedding_dim (Int)
{'default': None, 'conditions': [], 'min_value': 10, 'max_value': 500, 'step': 50, 'sampling': None}
reference_hidden_1_units (Int)
{'default': None, 'conditions': [], 'min_value': 10, 'max_value': 500, 'step': 50, 'sampling': None}
reference_hidden_2_units (Int)
{'default': None, 'conditions': [], 'min_value': 10, 'max_value': 500, 'step': 50, 'sampling': None}
reference_hidden_3_units (Int)
{'default': None, 'conditions': [], 'min_value': 10, 'max_value': 500, 'step': 50, 'sampling': None}
output_block_1_units (Int)
{'default': None, '

In [51]:
val_index = int(len(train_dataframe) * (1-0.9))
train_dataframe = train_dataframe[val_index:]
val_dataframe = train_dataframe[:val_index]

In [52]:
%%time
x_train = create_input(train_dataframe)
y_train = train_dataframe.yield_spread

CPU times: user 455 ms, sys: 0 ns, total: 455 ms
Wall time: 453 ms


In [53]:
%%time
x_val = create_input(val_dataframe)
y_val = val_dataframe.yield_spread

CPU times: user 52.3 ms, sys: 0 ns, total: 52.3 ms
Wall time: 51 ms


In [None]:
tuner.search(x_train, y_train, epochs=30, validation_data=(x_val, y_val))

Trial 5 Complete [00h 33m 16s]
val_mean_absolute_error: 8.348470687866211

Best val_mean_absolute_error So Far: 6.5781097412109375
Total elapsed time: 02h 48m 51s

Search: Running Trial #6

Hyperparameter    |Value             |Best Value So Far 
lstm_layer_1_units|60                |360               
lstm_layer_2_units|460               |460               
trade_history_o...|110               |360               
embedding_dim     |410               |310               
reference_hidde...|60                |160               
reference_hidde...|360               |360               
reference_hidde...|160               |260               
output_block_1_...|500               |250               
output_block_2_...|500               |350               
learning_rate     |0.001             |0.0001            

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30