## Comparing diff ys and new ys

This notebook compares a model trained on new ys with a model trained on diff ys. Both models are trained on trades from May 2023 to June 2023. The models are then tested on trades after July 2023 (the test set below covers August 2023).

In the run recorded below, the diff_ys model has lower error than the new_ys model on the full test set (MAE 11.092 vs 14.215; median absolute error 5.891 vs 7.438). The advantage also holds for large dealer-dealer trades with par traded above 500,000 (MAE 6.365 vs 8.056; median absolute error 2.516 vs 3.566).

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import time

import numpy as np
from google.cloud import bigquery
from google.cloud import storage
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import seaborn as sns

from tensorflow.keras.layers import Embedding
from tensorflow.keras import activations
from tensorflow.keras import backend as K
from tensorflow.keras import initializers
from tensorflow.keras.layers.experimental.preprocessing import Normalization
from sklearn import preprocessing
from datetime import datetime
import matplotlib.pyplot as plt
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from lightgbm import LGBMRegressor
import lightgbm

from IPython.display import display, HTML
import os


from ficc.data.process_data import process_data
from ficc.utils.auxiliary_variables import PREDICTORS, NON_CAT_FEATURES, BINARY, CATEGORICAL_FEATURES, IDENTIFIERS, PURPOSE_CLASS_DICT
from ficc.utils.gcp_storage_functions import upload_data, download_data
from ficc.utils.auxiliary_variables import RELATED_TRADE_BINARY_FEATURES, RELATED_TRADE_NON_CAT_FEATURES, RELATED_TRADE_CATEGORICAL_FEATURES

Initializing pandarallel with 16.0 cores
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


In [2]:
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

2023-09-18 22:35:47.945068: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-09-18 22:35:47.955267: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-09-18 22:35:47.955800: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero


Setting the environment variables

In [3]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="ahmad_creds.json"
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
pd.options.mode.chained_assignment = None

Initializing BigQuery client

In [4]:
bq_client = bigquery.Client()

Initializing GCP storage client

In [5]:
storage_client = storage.Client()

Declaring hyper-parameters

In [6]:
TRAIN_TEST_SPLIT = 0.85
LEARNING_RATE = 0.0001
BATCH_SIZE = 1000
NUM_EPOCHS = 100

DROPOUT = 0.10
SEQUENCE_LENGTH = 5
NUM_FEATURES = 6

Checking if the treasury spreads and target attention features are present in PREDICTORS 

In [7]:
if 'ficc_treasury_spread' not in PREDICTORS:
    PREDICTORS.append('ficc_treasury_spread')
    NON_CAT_FEATURES.append('ficc_treasury_spread')
if 'target_attention_features' not in PREDICTORS:
    PREDICTORS.append('target_attention_features')

#### Data Preparation
We grab the data from a GCP bucket. The data is prepared using the ficc python package. More insight on how the data is prepared can be found [here](https://github.com/Ficc-ai/ficc/blob/ahmad_ml/ml_models/sequence_predictors/data_prep/data_preparation.ipynb)

In [9]:
%%time
import gcsfs
fs = gcsfs.GCSFileSystem(project='eng-reactor-287421')
with fs.open('ahmad_data/processed_data_2023-09-18-17:34.pkl') as f:
    data = pd.read_pickle(f)

CPU times: user 8.91 s, sys: 2.73 s, total: 11.6 s
Wall time: 25.3 s


#### Date range for data

In [10]:
data.trade_date.max()

Timestamp('2023-09-15 00:00:00')

In [11]:
data.trade_date.min()

Timestamp('2023-08-01 00:00:00')

In [12]:
print(f'Restricting history to {SEQUENCE_LENGTH} trades')
data.trade_history = data.trade_history.apply(lambda x: x[:SEQUENCE_LENGTH])
data.target_attention_features = data.target_attention_features.apply(lambda x:x[:SEQUENCE_LENGTH])

Restricting history to 5 trades


In [13]:
data.trade_history.iloc[0].shape

(5, 6)

In [14]:
data.target_attention_features.iloc[0].shape

(1, 3)

In [15]:
data.sort_values('trade_datetime', inplace=True)

#### Creating features from trade history

In [16]:
ttype_dict = { (0,0):'D', (0,1):'S', (1,0):'P' }

ys_variants = ["max_ys", "min_ys", "max_qty", "min_ago", "D_min_ago", "P_min_ago", "S_min_ago"]
ys_feats = ["_ys", "_ttypes", "_ago", "_qdiff"]
D_prev = dict()
P_prev = dict()
S_prev = dict()

def get_trade_history_columns():
    '''
    Builds the list of trade-history feature column names by combining
    every prefix in ys_variants with every suffix in ys_feats.
    '''
    YS_COLS = []
    for prefix in ys_variants:
        for suffix in ys_feats:
            YS_COLS.append(prefix + suffix)
    return YS_COLS

def extract_feature_from_trade(row, name, trade):
    yield_spread = trade[0]
    ttypes = ttype_dict[(trade[3],trade[4])] + row.trade_type
    seconds_ago = trade[5]
    quantity_diff = np.log10(1 + np.abs(10**trade[2] - 10**row.quantity))
    return [yield_spread, ttypes,  seconds_ago, quantity_diff]

def trade_history_derived_features(row):
    trade_history = row.trade_history
    trade = trade_history[0]
    
    D_min_ago_t = D_prev.get(row.cusip,trade)
    D_min_ago = 9        

    P_min_ago_t = P_prev.get(row.cusip,trade)
    P_min_ago = 9
    
    S_min_ago_t = S_prev.get(row.cusip,trade)
    S_min_ago = 9
    
    max_ys_t = trade; max_ys = trade[0]
    min_ys_t = trade; min_ys = trade[0]
    max_qty_t = trade; max_qty = trade[2]
    min_ago_t = trade; min_ago = trade[5]
    
    for trade in trade_history:
        # Skip a history trade whose time-ago value (index 5) is 0, which marks a trade from the same block
        if trade[5] == 0:
            continue

        if trade[0] > max_ys:
            max_ys_t = trade
            max_ys = trade[0]
        elif trade[0] < min_ys:
            min_ys_t = trade
            min_ys = trade[0]

        if trade[2] > max_qty:
            max_qty_t = trade
            max_qty = trade[2]
        if trade[5] < min_ago:
            min_ago_t = trade
            min_ago = trade[5]
            
        side = ttype_dict[(trade[3],trade[4])]
        if side == "D":
            if trade[5] < D_min_ago: 
                D_min_ago_t = trade; D_min_ago = trade[5]
                D_prev[row.cusip] = trade
        elif side == "P":
            if trade[5] < P_min_ago: 
                P_min_ago_t = trade; P_min_ago = trade[5]
                P_prev[row.cusip] = trade
        elif side == "S":
            if trade[5] < S_min_ago: 
                S_min_ago_t = trade; S_min_ago = trade[5]
                S_prev[row.cusip] = trade
        else: 
            print("invalid side", trade)
    
    trade_history_dict = {"max_ys":max_ys_t,
                          "min_ys":min_ys_t,
                          "max_qty":max_qty_t,
                          "min_ago":min_ago_t,
                          "D_min_ago":D_min_ago_t,
                          "P_min_ago":P_min_ago_t,
                          "S_min_ago":S_min_ago_t}

    return_list = []
    for variant in ys_variants:
        feature_list = extract_feature_from_trade(row,variant,trade_history_dict[variant])
        return_list += feature_list
    
    return return_list
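
The selection logic above can be illustrated on a toy history. This is a minimal sketch, not the notebook's implementation: it assumes each history row follows the index layout used in the code (0 = yield spread, 2 = log10 quantity, 3 and 4 = trade-type flags, 5 = time-ago value), treats index 1 as unused, and uses `argmax`/`argmin` instead of the explicit loop. The values are made up.

```python
import numpy as np

# Hypothetical 3-trade history; column meanings follow the indices used above.
history = np.array([
    [50.0, 0.0, 5.0, 0, 0, 3.0],   # flags (0, 0) -> 'D'
    [65.0, 0.0, 4.0, 0, 1, 2.0],   # flags (0, 1) -> 'S'
    [40.0, 0.0, 6.0, 1, 0, 4.5],   # flags (1, 0) -> 'P'
])

ttype_dict = {(0, 0): 'D', (0, 1): 'S', (1, 0): 'P'}

# Pick the trades that the "max_ys", "min_ys" and "min_ago" variants refer to.
max_ys_trade = history[np.argmax(history[:, 0])]
min_ys_trade = history[np.argmin(history[:, 0])]
most_recent_trade = history[np.argmin(history[:, 5])]

# Decode the side of every trade from its two flag columns.
sides = [ttype_dict[(int(t[3]), int(t[4]))] for t in history]
```

Unlike `trade_history_derived_features`, this sketch does not skip same-block trades or track the per-CUSIP `D_prev`, `P_prev` and `S_prev` fallbacks.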

In [17]:
%%time
temp = data[['cusip',
             'trade_history',
             'quantity',
             'trade_type']].parallel_apply(trade_history_derived_features, axis=1)

CPU times: user 19.2 s, sys: 3.7 s, total: 22.9 s
Wall time: 48 s


In [18]:
YS_COLS = get_trade_history_columns()

In [19]:
data[YS_COLS] = pd.DataFrame(temp.tolist(), index=data.index)

Adding trade history features to PREDICTORS list

In [20]:
for col in YS_COLS:
    if 'ttypes' in col and col not in PREDICTORS:
        PREDICTORS.append(col)
        CATEGORICAL_FEATURES.append(col)
    elif col not in PREDICTORS:
        NON_CAT_FEATURES.append(col)
        PREDICTORS.append(col)

In [21]:
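def duration(coupon, ytw, years, dollar_price, peryear=2):
    # Returns an approximate DV01: modified duration times dollar price, divided by 10000.
    ytw = ytw.clip(0.001,np.inf)  # keep the yield positive so that m*y below is never zero
    c = (coupon/100) / peryear    # coupon rate per period, as a decimal
    y = (ytw/10000) / peryear     # yield per period, as a decimal
    n = years * peryear           # number of coupon periods
    m = peryear
    # Closed-form Macaulay duration of a level-coupon bond, in years
    macaulay_duration = ((1+y) / (m*y)) - ( (1 + y + n*(c-y)) / ((m*c* ((1+y)**n - 1)) + m*y))
    modified_duration = macaulay_duration / (1 + y)
    dv01 = modified_duration * dollar_price / 10000
    return dv01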

def add_additional_feature(data):
    data['diff_ficc_ycl'] = data.new_ficc_ycl - data.last_ficc_ycl
    data['diff_ficc_treasury_spread'] = data.last_ficc_ycl - (data.treasury_rate * 100)
    data['dv01'] = duration(data.coupon, data.last_yield, data.last_duration, data.last_dollar_price)
    data['approx_dpd'] =  data.dv01 * data.diff_ficc_ycl
    data['overage'] =  (data.last_dollar_price + data.approx_dpd - data.next_call_price)
    #data['de_minimis_gap'] = data.last_dollar_price - data.de_minimis_threshold
    return data

data = add_additional_feature(data)
additional_features = ['diff_ficc_ycl','diff_ficc_treasury_spread','dv01','approx_dpd','overage']#,'de_minimis_gap']
for i in additional_features:
    if i not in NON_CAT_FEATURES:
        NON_CAT_FEATURES.append(i)
        PREDICTORS.append(i)
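
As a sanity check on the closed-form expression used in `duration`, the sketch below compares it with a brute-force Macaulay duration computed from the individual cash flows. The helper names `macaulay_closed_form` and `macaulay_brute_force` are hypothetical, the bond has a face value of 1, and `c` and `y` are assumed to be per-period decimal rates, as in the function above.

```python
import numpy as np

def macaulay_closed_form(c, y, n, m):
    """Closed-form Macaulay duration in years (same expression as in `duration`)."""
    return (1 + y) / (m * y) - (1 + y + n * (c - y)) / (m * c * ((1 + y) ** n - 1) + m * y)

def macaulay_brute_force(c, y, n, m):
    """Present-value-weighted average time of the cash flows, in years."""
    k = np.arange(1, n + 1)            # period index of each cash flow
    cash_flows = np.full(n, c)         # coupon paid every period
    cash_flows[-1] += 1.0              # principal repaid with the last coupon
    pv = cash_flows / (1 + y) ** k     # discount each cash flow
    return float(np.sum((k / m) * pv) / np.sum(pv))

# Hypothetical bond: 5% annual coupon, 4% annual yield, 10 years, semiannual payments.
m = 2
c = 0.05 / m
y = 0.04 / m
n = 10 * m

closed = macaulay_closed_form(c, y, n, m)
brute = macaulay_brute_force(c, y, n, m)
print(closed, brute)
```

The two values should agree up to floating-point error, at roughly 8.08 years for this hypothetical bond.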

The `trade_history_sum` feature is used to check whether the trade history contains any NaN values. **It is not used to train the model**. 

In [22]:
len(data)

1318395

In [23]:
%%time
data['trade_history_sum'] = data.trade_history.parallel_apply(lambda x: np.sum(x))

CPU times: user 6.8 s, sys: 1.68 s, total: 8.48 s
Wall time: 9.42 s


In [24]:
data = data.dropna(subset=['trade_history_sum'])

In [25]:
len(data)

1318395
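
The check above works because `np.sum` returns NaN whenever any element of the array is NaN, so `dropna` on the sum removes exactly the rows with a NaN somewhere in their history. A minimal sketch on made-up data (the toy frame and its values are assumptions, not notebook data):

```python
import numpy as np
import pandas as pd

# Build an object column holding one 2-D array per row, as in `trade_history`.
histories = np.empty(2, dtype=object)
histories[0] = np.array([[1.0, 2.0], [3.0, 4.0]])
histories[1] = np.array([[1.0, np.nan], [3.0, 4.0]])   # contains a NaN
toy = pd.DataFrame({'trade_history': histories})

# A single NaN anywhere in the history makes the whole sum NaN.
toy['trade_history_sum'] = toy.trade_history.apply(lambda x: np.sum(x))
clean = toy.dropna(subset=['trade_history_sum'])
```

In the notebook the row count is unchanged (1318395 before and after), which indicates that no trade history contained a NaN.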

`purpose_sub_class` is filled for plotting purposes only; it is not used in training.

In [26]:
data.purpose_sub_class.fillna(0, inplace=True)

Creating new ys label

In [27]:
data['new_ys'] = data['yield'] - data['new_ficc_ycl']
data['diff_ys'] = data['new_ys'] - data['last_yield_spread']
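
The two labels differ only by the previous yield spread, so a prediction of `diff_ys` can be converted back into a yield spread by adding `last_yield_spread`. A small sketch with made-up numbers (the values are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    'yield': [350.0, 410.0],
    'new_ficc_ycl': [300.0, 380.0],
    'last_yield_spread': [45.0, 35.0],
})

# Same label definitions as in the cell above.
toy['new_ys'] = toy['yield'] - toy['new_ficc_ycl']
toy['diff_ys'] = toy['new_ys'] - toy['last_yield_spread']

# Adding the previous spread back recovers the new_ys label.
recovered = toy['last_yield_spread'] + toy['diff_ys']
```

This is why both models can be scored against the same `new_ys` column later in the notebook.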

Selecting a subset of features for training. PREDICTORS are the features that we are going to use to train the model. More information about the feature set can be found [here](https://github.com/Ficc-ai/ficc_python/blob/d455bd30eca18f26a2535523530facad516dd90f/ficc/utils/auxiliary_variables.py#L120). We also select a set of additional features, which are not used in training. These features are used to understand the results from the model.

In [28]:
auxiliary_features = ['dollar_price',
                     'calc_date', 
                     'trade_date',
                     'trade_datetime', 
                     'purpose_sub_class', 
                     'called_redemption_type', 
                     'calc_day_cat',
                     'yield',
                     'ficc_ycl',
                     'new_ys',
                     'trade_history_sum',
                     'new_ficc_ycl',
                     'days_to_refund',
                     'last_dollar_price',
                     'last_rtrs_control_number',
                     'is_called',
                     'federal_tax_status','par_traded']
                      #,'maturity_description_code']

In [29]:
processed_data = data[IDENTIFIERS + PREDICTORS + auxiliary_features]

Checking for missing data and NaN values

In [30]:
len(processed_data)

1318395

In [31]:
processed_data.issue_amount = processed_data.issue_amount.replace([np.inf, -np.inf], np.nan)

In [32]:
processed_data.dropna(inplace=True, subset=PREDICTORS)

In [33]:
len(processed_data)

1272886

#### Loading the encoders fitted to the categorical features. These encoders are used to encode the categorical features of the train and test sets

In [62]:
with open('encoders_test.pkl','rb') as f:
    encoders = pickle.load(f)
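
The encoders here are loaded from a pickle, and their type is not shown in this notebook. As a hedged illustration only, the sketch below assumes they are scikit-learn `LabelEncoder` objects stored in a dictionary keyed by column name; the `rating` column and its values are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.DataFrame({'rating': ['AA', 'A', 'AAA', 'AA']})
test = pd.DataFrame({'rating': ['A', 'AAA']})

# Fit one encoder per categorical column on the training data.
encoders = {}
for col in ['rating']:
    enc = LabelEncoder()
    enc.fit(train[col])
    encoders[col] = enc

# Reuse the fitted encoder on the test data so both sets share one mapping.
encoded = encoders['rating'].transform(test['rating'])
```

With `LabelEncoder`, a category that appears only in the test set raises an error on `transform`, so the encoders need to be fitted on data that covers every category expected at test time.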

In [63]:
processed_data.sort_values('trade_datetime',ascending=False,inplace=True)

#### Selecting the test set

In [64]:
test_dataframe = processed_data[(processed_data.trade_date >= '2023-08-01') & (processed_data.trade_date <= '2023-08-31')]

In [65]:
len(test_dataframe)

864056

In [66]:
test_dataframe.trade_date.min()

Timestamp('2023-08-01 00:00:00')
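
The date filter can be checked on a toy frame. This sketch uses `Series.between`, which includes both endpoints by default, as an alternative to the two explicit comparisons; the data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'trade_date': pd.to_datetime(['2023-07-31', '2023-08-01', '2023-08-31', '2023-09-01']),
    'new_ys': [10.0, 20.0, 30.0, 40.0],
})

start = pd.Timestamp('2023-08-01')
end = pd.Timestamp('2023-08-31')

# Keep only trades dated within August 2023, endpoints included.
test_toy = toy[toy.trade_date.between(start, end)]
```

Only the two August rows survive the filter; the July and September rows are dropped.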

##### Converting data into format suitable for the model

In [67]:
def create_input(df):
    global encoders
    datalist = []
    datalist.append(np.stack(df['trade_history'].to_numpy()))
    datalist.append(np.stack(df['target_attention_features'].to_numpy()))

    noncat_and_binary = []
    for f in NON_CAT_FEATURES + BINARY:
        noncat_and_binary.append(np.expand_dims(df[f].to_numpy().astype('float32'), axis=1))
    datalist.append(np.concatenate(noncat_and_binary, axis=-1))
    
    for f in CATEGORICAL_FEATURES:
        encoded = encoders[f].transform(df[f])
        datalist.append(encoded.astype('float32'))
    
    return datalist
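
The array shapes produced by `create_input` can be seen on a small example. This sketch reproduces only the stacking and concatenation steps; the columns `quantity` and `coupon` stand in for `NON_CAT_FEATURES + BINARY` and are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Object column holding one (5, 6) array per row, as in `trade_history`.
histories = np.empty(2, dtype=object)
histories[0] = np.zeros((5, 6))
histories[1] = np.ones((5, 6))

toy = pd.DataFrame({'quantity': [5.0, 6.0], 'coupon': [4.0, 5.0]})
toy['trade_history'] = histories

# Stack per-row arrays into a single (rows, sequence, features) tensor.
history = np.stack(toy['trade_history'].to_numpy())

# Turn each scalar feature into a column vector, then join them side by side.
dense = np.concatenate(
    [np.expand_dims(toy[f].to_numpy().astype('float32'), axis=1)
     for f in ['quantity', 'coupon']],
    axis=-1,
)
```

Here `history` has shape `(2, 5, 6)` and `dense` has shape `(2, 2)`, mirroring the `(864056, 53)` shape reported for `x_test[2]` below.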

In [68]:
%%time
x_test = create_input(test_dataframe)

CPU times: user 4.72 s, sys: 63.3 ms, total: 4.79 s
Wall time: 4.78 s


In [69]:
x_test[2].shape

(864056, 53)

### Loading new ys model

In [71]:
yield_spread_model = keras.models.load_model('saved_model_new_ys_2023-09-18-20-32')

In [72]:
test_dataframe['predicted_ys'] = yield_spread_model.predict(x_test, batch_size=BATCH_SIZE)

2023-09-18 22:43:48.198212: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla T4" frequency: 1590 num_cores: 40 environment { key: "architecture" value: "7.5" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 4194304 shared_memory_size_per_multiprocessor: 65536 memory_size: 9431678976 bandwidth: 320064000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }


In [73]:
print(f"MAE yield spread model: {round(np.mean(np.abs(test_dataframe.new_ys - test_dataframe.predicted_ys)), 3)}")

MAE yield spread model: 14.215


In [74]:
print(f"MAD yield spread model: {round(np.median(np.abs(test_dataframe.new_ys - test_dataframe.predicted_ys)), 3)}")

MAD yield spread model: 7.438


Measuring accuracy for large dealer-dealer trades

In [75]:
true_mid = test_dataframe[(test_dataframe.trade_type == 'D') & (test_dataframe.par_traded > 500000)]

In [76]:
print(f"MAE yield spread model: {round(np.mean(np.abs(true_mid.new_ys - true_mid.predicted_ys)), 3)}")

MAE yield spread model: 8.056


In [77]:
print(f"MAD yield spread model: {round(np.median(np.abs(true_mid.new_ys - true_mid.predicted_ys)), 3)}")

MAD yield spread model: 3.566


### Loading diff ys model

In [78]:
diff_ys_model = keras.models.load_model('saved_model_diff_ys_2023-09-18-22-32')

In [79]:
test_dataframe['predicted_diff_ys'] = diff_ys_model.predict(x_test, batch_size=BATCH_SIZE)

2023-09-18 22:44:12.991578: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla T4" frequency: 1590 num_cores: 40 environment { key: "architecture" value: "7.5" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 4194304 shared_memory_size_per_multiprocessor: 65536 memory_size: 9431678976 bandwidth: 320064000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }


In [80]:
test_dataframe['predicted_spread_diff_ys'] = test_dataframe['last_yield_spread'] + test_dataframe['predicted_diff_ys']

In [81]:
print(f"MAE diff ys loss: {round(np.mean(np.abs(test_dataframe.new_ys - test_dataframe.predicted_spread_diff_ys)), 3)}")

MAE diff ys loss: 11.092


In [82]:
print(f"MAD diff ys model: {round(np.median(np.abs(test_dataframe.new_ys - test_dataframe.predicted_spread_diff_ys)), 3)}")

MAD diff ys model: 5.891


Measuring accuracy on large dealer-dealer trades

In [83]:
true_mid = test_dataframe[(test_dataframe.trade_type == 'D') & (test_dataframe.par_traded > 500000)]

In [84]:
print(f"MAE diff ys loss: {round(np.mean(np.abs(true_mid.new_ys - true_mid.predicted_spread_diff_ys)), 3)}")

MAE diff ys loss: 6.365


In [85]:
print(f"MAD diff ys model: {round(np.median(np.abs(true_mid.new_ys - true_mid.predicted_spread_diff_ys)), 3)}")

MAD diff ys model: 2.516
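
The same two metrics are computed repeatedly above: the mean absolute error, and the median absolute error (which the notebook labels "MAD"). A small helper could reduce the repetition. The sketch below is one possible shape for it; the function name `abs_error_summary` and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def abs_error_summary(df, truth_col, pred_col):
    """Return the mean and median absolute error between two columns."""
    err = np.abs(df[truth_col] - df[pred_col])
    return {'mae': round(float(err.mean()), 3),
            'median_ae': round(float(err.median()), 3)}

toy = pd.DataFrame({
    'trade_type': ['D', 'D', 'S', 'P'],
    'par_traded': [1_000_000, 100_000, 2_000_000, 750_000],
    'new_ys': [50.0, 40.0, 60.0, 30.0],
    'predicted_ys': [48.0, 45.0, 70.0, 31.0],
})

# Same subset definition as above: dealer-dealer trades above 500,000 par.
large_dd = toy[(toy.trade_type == 'D') & (toy.par_traded > 500000)]

overall = abs_error_summary(toy, 'new_ys', 'predicted_ys')
subset = abs_error_summary(large_dd, 'new_ys', 'predicted_ys')
```

Calling the helper with `predicted_ys` and then with `predicted_spread_diff_ys` on the same frame would give the side-by-side comparison of the two models.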
