## Yield spread model with reference data


We use gradient boosted trees to predict the yield spread. The input to the model is the reference data for the CUSIP.

In [1]:
import pandas as pd
import numpy as np

from google.cloud import bigquery
import os
from sklearn import preprocessing
from datetime import datetime
import matplotlib.pyplot as plt

from catboost import CatBoostRegressor, Pool


import plotly.graph_objects as go
from IPython.display import display, HTML



Initializing the BigQuery client

In [3]:
bq_client = bigquery.Client()

#### Hyper-parameters for the model
The batch size and learning rate affect how smoothly the model converges.\
The larger the batch size, the smoother the convergence. A larger batch size calls for a higher learning rate, and vice versa.
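
The link between batch size and smoothness can be illustrated with a small NumPy sketch. It is not part of the pipeline: the toy per-example gradients and the helper `minibatch_gradient_std` are assumptions made for illustration. The sketch draws many random mini-batches and measures how much the mini-batch mean gradient fluctuates.

```python
import numpy as np

def minibatch_gradient_std(per_example_grads, batch_size, n_batches=200, seed=0):
    """Standard deviation of the mini-batch mean gradient over random batches."""
    rng = np.random.default_rng(seed)
    means = [
        rng.choice(per_example_grads, size=batch_size, replace=False).mean()
        for _ in range(n_batches)
    ]
    return float(np.std(means))

# Hypothetical per-example gradients: a true gradient of 1.0 plus unit noise.
data_rng = np.random.default_rng(42)
grads = 1.0 + data_rng.normal(size=10_000)

small = minibatch_gradient_std(grads, batch_size=10)
large = minibatch_gradient_std(grads, batch_size=1000)
print(f"std with batch size 10:   {small:.3f}")
print(f"std with batch size 1000: {large:.3f}")
```

In this toy setting the fluctuation shrinks roughly with the square root of the batch size, which is one way to read the rule of thumb that a larger batch can tolerate a larger learning rate.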

In [5]:
TRAIN_TEST_SPLIT = 0.85
LEARNING_RATE = 0.001
BATCH_SIZE = 1000
NUM_EPOCHS = 50

### Query to fetch data from BigQuery

The SQL query uses the trade history with reference data view. We only use trades that occurred on or after 01/01/2021. All three trade directions, namely dealer-dealer (D), dealer-sells (S), and dealer-purchases (P), are included. We limit training to tax-exempt bonds that have a rating available (excluding "NR"), a positive yield of at most three, and a par amount traded above 5,000. Cancelled trades are excluded.
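
As a rough illustration of these filters, the same conditions can be written against a small pandas DataFrame. The rows below are made up, and the cancelled-trade condition on `msrb_valid_to_date` is left out to keep the sketch short; the column names follow those used in the query.

```python
import pandas as pd

# Made-up trades; only the columns used by the filters are included.
trades = pd.DataFrame({
    "yield": [1.2, 3.5, 0.8, None, 2.0],
    "par_traded": [10000, 25000, 5000, 50000, 100000],
    "sp_long": ["AA", "A+", "AAA", "AA-", "NR"],
    "federal_tax_status": [2, 2, 2, 2, 2],
    "trade_date": pd.to_datetime(
        ["2021-03-01", "2021-03-02", "2021-03-03", "2021-03-04", "2020-12-31"]
    ),
})

mask = (
    trades["yield"].notna()
    & (trades["yield"] > 0)
    & (trades["yield"] <= 3)
    & (trades["par_traded"] > 5000)
    & trades["sp_long"].notna()
    & (trades["sp_long"] != "NR")
    & (trades["federal_tax_status"] == 2)              # tax-exempt bonds
    & (trades["trade_date"] >= pd.Timestamp("2021-01-01"))
)
filtered = trades[mask]
print(filtered)
```

Only the first row survives: the others fail on yield, par amount, missing yield, and rating or trade date respectively.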


In [6]:
DATA_QUERY = """ SELECT
  rtrs_control_number,
  cusip,
  trade_datetime,
  trade_date,
  time_of_trade,
  ifnull(settlement_date, assumed_settlement_date) AS settle_date,
  maturity_date,
  coupon,
  trade_type,
  is_non_transaction_based_compensation,
  yield_spread,
  yield AS ytw,
  dollar_price AS price,
  par_traded AS quantity,
  dated_date,
  issue_price,
  interest_payment_frequency,
  par_call_date,
  next_call_price,
  par_call_price,
  previous_coupon_payment_date,
  next_coupon_payment_date,
  first_coupon_date,
  coupon_type,
  muni_security_type,
  called_redemption_type,
  refund_price,
  call_timing,
  next_call_date AS call_date,
  next_sink_date AS sink_date,
  delivery_date AS deliv_date,
  refund_date AS refund_date,
  organization_primary_name AS issuer,
  is_callable,
  is_called,
  sp_long AS sp_lt_rating,
  assumed_redemption_date,
  is_lop_or_takedown,
  incorporated_state_code,
  security_description,
  instrument_primary_name,
  is_general_obligation,
  daycount_basis_type,
  capital_type,
  use_of_proceeds
FROM
  `eng-reactor-287421.primary_views.trade_history_with_reference_data`
WHERE
  yield IS NOT NULL
  AND yield > 0 
  AND yield <= 3 
  AND par_traded IS NOT NULL
  AND par_traded > 5000
  AND sp_long IS NOT NULL
  AND sp_long != "NR"
  AND federal_tax_status = 2 --tax exempt bonds
  AND trade_date >= '2021-01-01' 
  AND msrb_valid_to_date > current_date -- condition to remove cancelled trades
ORDER BY
  trade_date DESC
            """

### Data Preparation

We fetch the data from BigQuery and convert it into a format suitable for input to the model. The fetch_data function runs the SQL query through the BigQuery client and returns the result as a DataFrame.

In [7]:
def fetch_data(query):
    df = bq_client.query(query).result().to_dataframe()
    return df

In [8]:
%%time
if os.path.isfile('data.pkl'):
    reference_data = pd.read_pickle('data.pkl')
else:
    reference_data = fetch_data(DATA_QUERY)
    reference_data.to_pickle('data.pkl')

CPU times: user 9.57 s, sys: 2.85 s, total: 12.4 s
Wall time: 3min 3s
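
The caching pattern in the cell above can be wrapped in a small helper. This is a sketch under assumptions: `load_or_fetch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the BigQuery call so the example runs offline.

```python
import os
import tempfile

import pandas as pd

def load_or_fetch(cache_path, fetch_fn):
    """Return the cached DataFrame if present, otherwise fetch and cache it."""
    if os.path.isfile(cache_path):
        return pd.read_pickle(cache_path)
    df = fetch_fn()
    df.to_pickle(cache_path)
    return df

# Hypothetical stand-in for fetch_data(DATA_QUERY); it counts its own calls.
calls = []

def fake_fetch():
    calls.append(1)
    return pd.DataFrame({"cusip": ["A", "B"], "yield_spread": [0.1, 0.2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl")
    first = load_or_fetch(path, fake_fetch)    # fetches and writes the cache
    second = load_or_fetch(path, fake_fetch)   # reads the cache, no fetch

print(len(calls))
```

Note that a cache keyed only on the file name goes stale when the query changes, so the pickle has to be deleted by hand after editing `DATA_QUERY`.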


Converting the columns to the correct datatypes. The cell also takes log10 of the quantity and multiplies the yield spread by 100. We restrict the universe of trades to investment-grade bonds rated A- or better.

In [10]:
%%time
df = reference_data.copy()

df.quantity = np.log10(df.quantity.astype(float))
df.coupon = df.coupon.astype(float)

df['timestamp'] = pd.to_datetime(df['trade_datetime'], format='%Y-%m-%d %H:%M:%S')
df['trade_date'] = pd.to_datetime(df['trade_date'], format='%Y-%m-%d')
df['settle_date'] = pd.to_datetime(df['settle_date'], format='%Y-%m-%d')
df['maturity_date'] = pd.to_datetime(df['maturity_date'], format='%Y-%m-%d')
df['call_date'] = pd.to_datetime(df['call_date'], format='%Y-%m-%d')
df['sink_date'] = pd.to_datetime(df['sink_date'], format='%Y-%m-%d')
df['deliv_date'] = pd.to_datetime(df['deliv_date'], format='%Y-%m-%d')
df['assumed_redemption_date'] = pd.to_datetime(df['assumed_redemption_date'], format='%Y-%m-%d')
df['yield_spread'] = df['yield_spread'] * 100

# add dates from the ytw calculator for test (maturity_date is already converted above):
df['dated_date'] = pd.to_datetime(df['dated_date'], format='%Y-%m-%d')
df['par_call_date'] = pd.to_datetime(df['par_call_date'], format='%Y-%m-%d')
df['previous_coupon_payment_date'] = pd.to_datetime(df['previous_coupon_payment_date'], format='%Y-%m-%d')
df['next_coupon_payment_date'] = pd.to_datetime(df['next_coupon_payment_date'], format='%Y-%m-%d')
df['first_coupon_date'] = pd.to_datetime(df['first_coupon_date'], format='%Y-%m-%d')

rd = pd.to_datetime(df['refund_date'], format='%Y-%m-%d')
rd[rd < df.trade_date] = pd.NaT
df['refund_date'] = rd
df['rating'] = df.sp_lt_rating

# Just including investment grade bonds
df = df[df.rating.isin(['A-','A','A+','AA-','AA','AA+','AAA'])] 

print(len(df))

2622490
CPU times: user 8.19 s, sys: 1.68 s, total: 9.87 s
Wall time: 9.86 s
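
The repeated `pd.to_datetime` calls could also be driven from a list of column names. The following is a minimal sketch on a made-up frame; `DATE_COLUMNS` and `RATINGS_KEPT` are hypothetical names, and only three of the date columns are shown.

```python
import pandas as pd

DATE_COLUMNS = ["trade_date", "settle_date", "maturity_date"]  # subset for illustration
RATINGS_KEPT = ["A-", "A", "A+", "AA-", "AA", "AA+", "AAA"]

toy = pd.DataFrame({
    "trade_date": ["2021-03-01", "2021-03-02"],
    "settle_date": ["2021-03-03", None],
    "maturity_date": ["2030-01-01", "2035-06-15"],
    "sp_lt_rating": ["AA", "BBB+"],
})

# Convert every date column in one loop; missing values become NaT.
for col in DATE_COLUMNS:
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d")

# Keep only the ratings used in the notebook.
kept = toy[toy["sp_lt_rating"].isin(RATINGS_KEPT)]
print(kept)
```

Keeping the column names in one list makes it harder to convert the same column twice or to forget one.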


Creating binary features

In [11]:
df['callable'] = df.is_callable  
df['called'] = df.is_called 
df['zerocoupon'] = df.coupon == 0
df['whenissued'] = df.deliv_date >= df.trade_date
df['sinking'] = ~df.sink_date.isnull()

# If the redemption type is NA, fill it with zero, i.e. unknown (ref. ICE)
df['called_redemption_type'] = df['called_redemption_type'].fillna(0)

Converting the dates to the number of days from the settlement date. We only consider a trade to be reported correctly if it settles within one month (31 days) of the trade date.

In [12]:
# Dropping trades settled more than one month (31 days) after the trade date
df['days_to_settle'] = (df.settle_date - df.trade_date).dt.days
df = df[df.days_to_settle <= 31]

In [13]:
df['days_to_maturity'] =  np.log10(1 + (df.assumed_redemption_date - df.settle_date).dt.days)
df['days_to_call'] = np.log10(1 + (df.call_date - df.settle_date).dt.days.fillna(0))
df['days_to_coupon'] = (df.next_coupon_payment_date - df.settle_date).dt.days
df.days_to_coupon = df.days_to_coupon.apply(lambda x: x if x > 0 else 0)
df.days_to_coupon = np.log(1 + df.days_to_coupon)

# Removing bonds from Puerto Rico
df = df[df.incorporated_state_code != 'PR']
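
The log-scaled day counts can be sketched as a single helper. `log_days_between` is a hypothetical name, and unlike the cell above it treats both missing and negative gaps as zero days, which is an assumption made for this sketch.

```python
import numpy as np
import pandas as pd

def log_days_between(end, start):
    """log10(1 + days from start to end); missing or negative gaps count as 0."""
    days = (end - start).dt.days.fillna(0).clip(lower=0)
    return np.log10(1 + days)

settle = pd.to_datetime(pd.Series(["2021-01-01", "2021-01-01", "2021-01-01"]))
call = pd.to_datetime(pd.Series(["2021-01-10", None, "2020-12-25"]))

days_to_call = log_days_between(call, settle)
print(days_to_call.tolist())
```

The first bond is 9 days from its call date, so its feature is log10(10) = 1; the bond with no call date and the one whose call date has passed both map to 0.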

Creating flags for the different settlement paces

In [14]:
def settlement_pace(x):
    if x <= 3:
        return 'Fast'
    elif x <= 15:
        return 'Medium'
    else:
        return 'Slow'

In [15]:
df['settle_pace'] = df.days_to_settle.apply(settlement_pace)
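
A vectorized alternative to the row-wise `apply` is `pd.cut`. This sketch uses a made-up series of settlement lags; the bin edges mirror the thresholds in `settlement_pace`.

```python
import numpy as np
import pandas as pd

days_to_settle = pd.Series([0, 2, 3, 4, 15, 16, 31])

settle_pace = pd.cut(
    days_to_settle,
    bins=[-np.inf, 3, 15, np.inf],   # right-inclusive: (-inf, 3], (3, 15], (15, inf)
    labels=["Fast", "Medium", "Slow"],
).astype(str)
print(settle_pace.tolist())
```

Casting to `str` keeps the column in the same form as the output of the `apply` version, rather than a pandas categorical.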

Features used to train the model

In [16]:
# features: 
BINARY = ['callable','called','sinking','whenissued','zerocoupon','is_non_transaction_based_compensation','is_general_obligation']
CATEGORICAL_FEATURES = ['trade_type','rating','coupon_type', 'muni_security_type','called_redemption_type','issuer','settle_pace']
NON_CAT_FEATURES = ['quantity','days_to_maturity','days_to_call','coupon',]
TARGET = ['yield_spread']

In [17]:
processed_data = df[BINARY + NON_CAT_FEATURES + CATEGORICAL_FEATURES + TARGET]

In [18]:
processed_data = processed_data.dropna()
len(processed_data)

2604340

In [19]:
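# processed_data is ordered newest first (ORDER BY trade_date DESC), so the
# first (1 - TRAIN_TEST_SPLIT) share of rows, the most recent trades, is the test set.
train_index = int(len(processed_data) * (1 - TRAIN_TEST_SPLIT))
print(train_index)
train_dataframe = processed_data[train_index:]
test_dataframe = processed_data[:train_index]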

print(len(train_dataframe))
print(len(test_dataframe))

390651
2213689
390651
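
The split above is positional, and because the query orders trades newest first, the test set ends up holding the most recent trades. A minimal sketch on a made-up frame (the names `toy`, `test_part`, and `train_part` are hypothetical):

```python
import pandas as pd

TRAIN_TEST_SPLIT = 0.85

# Toy frame sorted newest first, like the query's ORDER BY trade_date DESC.
toy = pd.DataFrame({
    "trade_date": pd.date_range("2021-01-01", periods=20, freq="D")[::-1],
    "yield_spread": range(20),
})

n_test = int(len(toy) * (1 - TRAIN_TEST_SPLIT))
test_part = toy.iloc[:n_test]     # most recent trades
train_part = toy.iloc[n_test:]    # older trades

print(len(train_part), len(test_part))
```

Evaluating on trades that are strictly newer than the training trades avoids training on information from after the test period, provided the row order is preserved by every filtering step in between.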


### Reference data model
CatBoost is a machine learning algorithm that uses gradient boosting on decision trees. We use this model to predict yield spreads using only the reference data.



In [20]:
model = CatBoostRegressor(iterations=300,
                          depth=14,
                          loss_function='RMSE',
                          verbose=True)

In [27]:
%%time
train_dataframe = train_dataframe.copy()
train_dataframe['callable'] = train_dataframe.callable.astype('str')
train_dataframe['called'] = train_dataframe.called.astype('str')
train_dataframe['sinking'] = train_dataframe.sinking.astype('str')
train_dataframe['whenissued'] = train_dataframe.whenissued.astype('str')
train_dataframe['zerocoupon'] = train_dataframe.zerocoupon.astype('str')
train_dataframe['is_non_transaction_based_compensation'] = train_dataframe.is_non_transaction_based_compensation.astype('str')
train_dataframe['is_general_obligation'] = train_dataframe.is_general_obligation.astype('str')

# some categorical features are stored as numbers, so we need to change them to strings
train_dataframe['coupon_type'] = train_dataframe.coupon_type.astype('str')
train_dataframe['muni_security_type'] = train_dataframe.muni_security_type.astype('str')
train_dataframe['called_redemption_type'] = train_dataframe.called_redemption_type.astype('str')

CPU times: user 2.63 s, sys: 180 ms, total: 2.81 s
Wall time: 2.81 s
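
Since the same casts are applied to the training and test frames, they could be factored into one helper. This is a sketch on a made-up frame; `cast_to_str` is a hypothetical name.

```python
import pandas as pd

def cast_to_str(frame, columns):
    """Return a copy of `frame` with the given columns cast to str."""
    return frame.astype({col: str for col in columns})

toy = pd.DataFrame({
    "callable": [True, False],
    "coupon_type": [8, 17],
    "coupon": [5.0, 4.0],
})

converted = cast_to_str(toy, ["callable", "coupon_type"])
print(converted["callable"].tolist(), converted["coupon_type"].tolist())
```

In the notebook this would be called once per frame with the binary columns plus the numerically coded categorical columns, so both frames are guaranteed to receive identical treatment.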


In [28]:
target = train_dataframe[TARGET]
train_data = train_dataframe[NON_CAT_FEATURES + BINARY + CATEGORICAL_FEATURES]

In [36]:
CATF = CATEGORICAL_FEATURES + BINARY
trainset = Pool(train_data, target, cat_features=CATF)

In [37]:
model.fit(trainset)

Learning rate set to 0.383901
0:	learn: 43.2034391	total: 5.17s	remaining: 25m 44s
1:	learn: 34.1072060	total: 10.8s	remaining: 26m 46s
2:	learn: 29.5194161	total: 16s	remaining: 26m 26s
3:	learn: 27.2178095	total: 21.7s	remaining: 26m 42s
4:	learn: 25.7226684	total: 26.6s	remaining: 26m 12s
5:	learn: 24.9472322	total: 30.8s	remaining: 25m 9s
6:	learn: 24.3362727	total: 35.4s	remaining: 24m 40s
7:	learn: 23.9573118	total: 39.9s	remaining: 24m 15s
8:	learn: 23.5788527	total: 44.3s	remaining: 23m 52s
9:	learn: 23.4076342	total: 48.8s	remaining: 23m 34s
10:	learn: 23.2159794	total: 54.3s	remaining: 23m 46s
11:	learn: 23.0735083	total: 59.4s	remaining: 23m 46s
12:	learn: 22.9844691	total: 1m 3s	remaining: 23m 25s
13:	learn: 22.9077379	total: 1m 8s	remaining: 23m 15s
14:	learn: 22.7965219	total: 1m 12s	remaining: 23m 4s
15:	learn: 22.6834726	total: 1m 18s	remaining: 23m 12s
16:	learn: 22.5450863	total: 1m 23s	remaining: 23m 16s
17:	learn: 22.4639053	total: 1m 29s	remaining: 23m 23s
18:	lear

<catboost.core.CatBoostRegressor at 0x7f0e606ed590>

Processing the test set

In [71]:
%%time
test_dataframe = test_dataframe.copy()
test_dataframe['callable'] = test_dataframe.callable.astype('str')
test_dataframe['called'] = test_dataframe.called.astype('str')
test_dataframe['sinking'] = test_dataframe.sinking.astype('str')
test_dataframe['whenissued'] = test_dataframe.whenissued.astype('str')
test_dataframe['zerocoupon'] = test_dataframe.zerocoupon.astype('str')
test_dataframe['is_non_transaction_based_compensation'] = test_dataframe.is_non_transaction_based_compensation.astype('str')
test_dataframe['is_general_obligation'] = test_dataframe.is_general_obligation.astype('str')

# some categorical features are stored as numbers, so we need to change them to strings
test_dataframe['coupon_type'] = test_dataframe.coupon_type.astype('str')
test_dataframe['muni_security_type'] = test_dataframe.muni_security_type.astype('str')
test_dataframe['called_redemption_type'] = test_dataframe.called_redemption_type.astype('str')

CPU times: user 145 ms, sys: 52 ms, total: 197 ms
Wall time: 195 ms


In [72]:
test_target = test_dataframe[TARGET]
test_data = test_dataframe[NON_CAT_FEATURES + CATEGORICAL_FEATURES + BINARY]

In [73]:
CATF = CATEGORICAL_FEATURES + BINARY

In [74]:
test_set = Pool(test_data, test_target, cat_features=CATF)

In [75]:
print(model.get_params())
model.eval_metrics(test_set,['RMSE','MAE'],eval_period=50)

{'iterations': 300, 'depth': 14, 'loss_function': 'RMSE', 'verbose': True}


{'RMSE': [42.424831515991116,
  22.72216295152427,
  22.525442832853138,
  22.49961569281253,
  22.494903842810697,
  22.50526629079192,
  22.514857050816218],
 'MAE': [31.746718645503293,
  16.30957875945107,
  16.12666597602852,
  16.08625526251984,
  16.074700939425835,
  16.071119742405664,
  16.072864987958393]}
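
For reference, the two metrics reported above can be computed by hand. The sketch below uses made-up values, not model output; `rmse` and `mae` are hypothetical helper names.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))

y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 36.0]

print(rmse(y_true, y_pred))
print(mae(y_true, y_pred))
```

RMSE squares the errors before averaging, so it penalizes large misses more heavily than MAE. That is consistent with the RMSE values above sitting well above the MAE values for the same iterations.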

In [76]:
predicted_yield_spreads = model.predict(test_set)

In [81]:
evaluation_dataframe = test_dataframe.copy()
evaluation_dataframe['predicted_yield_spreads'] = predicted_yield_spreads
evaluation_dataframe['delta_yield_spreads'] = evaluation_dataframe.yield_spread - evaluation_dataframe.predicted_yield_spreads

In [90]:
evaluation_dataframe.sort_values('delta_yield_spreads', ascending=False)

| | callable | called | sinking | whenissued | zerocoupon | is_non_transaction_based_compensation | is_general_obligation | quantity | days_to_maturity | days_to_call | ... | trade_type | rating | coupon_type | muni_security_type | called_redemption_type | issuer | settle_pace | yield_spread | predicted_yield_spreads | delta_yield_spreads |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 396982 | True | False | False | False | False | False | False | 4.301030 | 2.029384 | 2.029384 | ... | S | A+ | 8 | 8 | 0.0 | DE KALB CNTY GA WTR & SEW REV | Fast | 162.211440 | -109.242426 | 271.453866 |
| 52475 | False | True | False | False | False | False | False | 4.397940 | 2.752816 | 2.752816 | ... | S | A+ | 8 | 3 | 13.0 | MIAMI-DADE CNTY FLA SCH BRD CTFS PARTN | Fast | 197.752258 | -71.982415 | 269.734673 |
| 377118 | False | False | False | False | False | False | False | 4.000000 | 1.653213 | 0.000000 | ... | D | AA+ | 8 | 1 | 0.0 | SOUTHERN CALIF WTR REPLENISHMENT DIST FING AUT... | Fast | 196.553637 | -69.768999 | 266.322636 |
| 381947 | True | False | False | False | False | False | False | 5.000000 | 1.477121 | 1.477121 | ... | P | AA | 8 | 8 | 0.0 | NEW YORK N Y CITY TRANSITIONAL FIN AUTH BLDG A... | Fast | 200.353637 | -65.568553 | 265.922191 |
| 161457 | False | False | False | False | False | False | True | 5.161368 | 1.397940 | 0.000000 | ... | P | AA | 8 | 5 | 0.0 | WILLIAMSON CNTY TEX MUN UTIL DIST NO 11 | Fast | 163.009603 | -97.892641 | 260.902244 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 87186 | True | False | True | False | False | False | False | 4.000000 | 2.235528 | 2.235528 | ... | D | AA | 17 | 8 | 0.0 | KANSAS CITY MO SAN SWR SYS REV | Fast | -91.988858 | 77.923862 | -169.912720 |
| 310422 | True | False | False | False | False | False | True | 4.000000 | 1.591065 | 1.591065 | ... | P | AA | 8 | 6 | 0.0 | ROSE TREE MEDIA PA SCH DIST | Fast | -98.618239 | 87.670027 | -186.288266 |
| 194115 | False | True | False | False | False | False | True | 4.000000 | 1.875061 | 1.792392 | ... | D | AA- | 8 | 5 | 6.0 | CALIFORNIA ST | Fast | -80.631237 | 111.867634 | -192.498872 |
| 79012 | False | True | False | False | True | False | True | 4.301030 | 1.278754 | 0.000000 | ... | D | AA | 4 | 5 | 17.0 | ROSS VALLEY CALIF SCH DIST | Fast | -61.488858 | 160.835883 | -222.324740 |
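
Beyond sorting by the signed error, it can help to rank trades by absolute error and to average that error per category. The sketch below uses a made-up evaluation frame with the same column names as `evaluation_dataframe`; `abs_delta`, `worst`, and `mae_by_rating` are hypothetical names.

```python
import pandas as pd

# Made-up evaluation rows, not model output.
toy_eval = pd.DataFrame({
    "rating": ["AA", "AA", "A+", "A+"],
    "yield_spread": [50.0, 10.0, 80.0, 20.0],
    "predicted_yield_spreads": [40.0, 30.0, 75.0, 35.0],
})
toy_eval["delta_yield_spreads"] = (
    toy_eval.yield_spread - toy_eval.predicted_yield_spreads
)
toy_eval["abs_delta"] = toy_eval.delta_yield_spreads.abs()

# Largest misses first, regardless of sign.
worst = toy_eval.sort_values("abs_delta", ascending=False)

# Mean absolute error per rating bucket.
mae_by_rating = toy_eval.groupby("rating")["abs_delta"].mean()
print(worst)
print(mae_by_rating)
```

The same grouping could be applied to `trade_type` or `settle_pace` to see whether the large errors concentrate in particular segments.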
