# Compare One Day of Data

Created by [Author name removed] on 2024-12-09.

Last edited by [Name removed] on 2024-12-09.

This notebook investigates three anomalies using a single day of data from two datasets: (1) the old data used for automated training, and (2) the new data to be used for automated training. The anomalies are: (1) the two datasets differ slightly, (2) the same model produces vastly different predictions on the two datasets, and (3) a model trained for just two epochs has very similar accuracy to a model trained for 100 epochs. The procedure is to:
1. download both datasets and isolate the single day
2. download the archived models
3. compare the accuracy results of the archived models on these datasets

To run the notebook, use Python 3.10 (Python 3.12 does not work), and install the requirements file for your platform:
- on Linux: use `ficc_python/requirements_py310_linux_jupyter.txt`
- on macOS: use `ficc_python/requirements_py310_mac_jupyter.txt`
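
A quick guard at the top of the notebook can catch an unsupported interpreter before any large download starts. The sketch below is only an illustration; the helper name `is_supported_python` is hypothetical and not part of `ficc_python`.

```python
import sys

SUPPORTED_PYTHON = (3, 10)    # the version this notebook is documented to run on


def is_supported_python(version_info=sys.version_info) -> bool:
    '''Return True if the major and minor version match the supported interpreter.'''
    return tuple(version_info[:2]) == SUPPORTED_PYTHON


if not is_supported_python():
    print(f'Warning: running Python {sys.version_info.major}.{sys.version_info.minor}, but this notebook expects Python 3.10')
```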

Note: This notebook **requires at least 50 GB of RAM** on a VM. On a MacBook Pro M1 Max with 32 GB of RAM, it can still run because the system supplements physical memory with swap space on disk.

Change the following files and/or variables to enable credentials and the correct directories:
- `automated_training/auxiliary_functions.py::get_creds(...)` to be the location of the credentials file
- `automated_training/auxiliary_variables.py::WORKING_DIRECTORY` to be the location of the old working directory

In [None]:
# loads the autoreload extension
%load_ext autoreload
# automatically reloads all imported modules when their source code changes
%autoreload 2

In [50]:
import os
import warnings

import pandas as pd
from tensorflow import keras


# importing from parent directory: https://stackoverflow.com/questions/714063/importing-modules-from-parent-folder
import sys
sys.path.insert(0, '../../')


from automated_training.auxiliary_variables import WORKING_DIRECTORY, CATEGORICAL_FEATURES
from automated_training.auxiliary_functions import STORAGE_CLIENT, fit_encoders, create_input, load_model, create_summary_of_results
from ficc.utils.gcp_storage_functions import download_data

In [3]:
AUTOMATED_TRAINING_BUCKET = 'automated_training'

In [4]:
MODEL = 'yield_spread_with_similar_trades'

In [5]:
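def get_data_for_automated_training_and_isolate_to_single_dates(old_or_new: str, dates: list) -> pd.DataFrame | list:
    '''Returns one dataframe per date in `dates`; if `dates` has a single entry, returns that dataframe directly instead of a list.'''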
    assert old_or_new in ('old', 'new')
    df_list = []
    df_downloaded_from_google_cloud_storage = {}
    for date in dates:
        pickle_file_path = f'{WORKING_DIRECTORY}/files/{old_or_new}_data_{date}.pkl'
        if os.path.exists(pickle_file_path):
            print(f'Loading pickle file from {pickle_file_path}')
            df_list.append(pd.read_pickle(pickle_file_path))
        else:
            print(f'Could not find pickle file in {pickle_file_path}, so creating it now')
            suffix = '' if old_or_new == 'old' else '_v2'
            google_cloud_storage_file_name = f'processed_data_yield_spread_with_similar_trades{suffix}.pkl'
            if google_cloud_storage_file_name not in df_downloaded_from_google_cloud_storage:
                df = download_data(STORAGE_CLIENT, AUTOMATED_TRAINING_BUCKET, google_cloud_storage_file_name)
                df_downloaded_from_google_cloud_storage[google_cloud_storage_file_name] = df
            else:
                df = df_downloaded_from_google_cloud_storage[google_cloud_storage_file_name]
            
            df = df[df['trade_date'] == date]
            df.to_pickle(pickle_file_path)
            df_list.append(df)
    return df_list if len(df_list) > 1 else df_list[0]

In [6]:
old_df_on_2024_12_06 = get_data_for_automated_training_and_isolate_to_single_dates('old', ['2024-12-06'])
new_df_on_2024_12_06 = get_data_for_automated_training_and_isolate_to_single_dates('new', ['2024-12-06'])

Loading pickle file from /Users/mitas/ficc/ficc_python/notebooks/compare_datasets/files/old_data_2024-12-06.pkl
Loading pickle file from /Users/mitas/ficc/ficc_python/notebooks/compare_datasets/files/new_data_2024-12-06.pkl


### Anomaly 1: different data in the two datasets
Check which RTRS control numbers appear in one dataset but not the other.

Conclusion: some RTRS control numbers will always be present in one dataset but not in the other because we exclude trades based on certain conditions in the reference data (see `automated_training_auxiliary_functions.py::get_data_query(...)` and `automated_training_auxiliary_variables.py::QUERY_CONDITIONS`). The two data providers define features such as `coupon_type` and `capital_type` differently and may report default events differently. Hence, a given trade may meet an exclusion criterion under one set of reference data but not under the other.
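
To make the mechanism concrete, here is a toy sketch with made-up values. The field name `default_indicator` is borrowed from the columns shown later in this notebook, but the control numbers, the reference values, and the exclusion rule itself are invented for illustration and are not the actual `QUERY_CONDITIONS`.

```python
# Hypothetical reference data from two providers for the same three trades
old_reference = {
    101: {'default_indicator': False},
    102: {'default_indicator': True},     # the old provider reports a default
    103: {'default_indicator': False},
}
new_reference = {
    101: {'default_indicator': False},
    102: {'default_indicator': False},    # the new provider does not report it
    103: {'default_indicator': True},     # the new provider reports a default here instead
}


def keep_trades(reference: dict) -> set:
    '''Apply a toy exclusion rule: drop every trade flagged as being in default.'''
    return {control_number for control_number, features in reference.items() if not features['default_indicator']}


old_kept = keep_trades(old_reference)
new_kept = keep_trades(new_reference)
print('Only in old:', old_kept - new_kept)
print('Only in new:', new_kept - old_kept)
```

The same trades go in, yet each dataset ends up with one trade that the other lacks, purely because the two sets of reference data disagree on the excluded field.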

In [7]:
print(f'Number of items in the old df: {len(old_df_on_2024_12_06)}')
print(f'Number of items in the new df: {len(new_df_on_2024_12_06)}')

Number of items in the old df: 58248
Number of items in the new df: 59014


In [8]:
old_df_on_2024_12_06_rtrs_control_numbers = set(old_df_on_2024_12_06['rtrs_control_number'].tolist())
new_df_on_2024_12_06_rtrs_control_numbers = set(new_df_on_2024_12_06['rtrs_control_number'].tolist())

In [9]:
print(f'RTRS control numbers in the old df but not in the new df: {old_df_on_2024_12_06_rtrs_control_numbers - new_df_on_2024_12_06_rtrs_control_numbers}')
print(f'RTRS control numbers in the new df but not in the old df: {new_df_on_2024_12_06_rtrs_control_numbers - old_df_on_2024_12_06_rtrs_control_numbers}')

RTRS control numbers in the old df but not in the new df: {2024120609830400, 2024120601898500, 2024120608315400, 2024120609774600, 2024120611621900, 2024120602891300, 2024120614790700, 2024120609832500, 2024120602015800, 2024120614448700, 2024120606218300, 2024120613168700, 2024120607156800, 2024120606508100, 2024120614567500, 2024120604229200, 2024120600453200, 2024120612549200, 2024120614575700, 2024120601694300, 2024120613969500, 2024120600520800, 2024120602995300, 2024120608106600, 2024120606216300, 2024120610047600, 2024120600519800, 2024120607892600, 2024120604155000, 2024120612222600, 2024120607157900, 2024120605024400, 2024120613169300, 2024120602032800, 2024120609204900, 2024120602164900, 2024120603248300, 2024120608321200, 2024120611354800, 2024120608107700, 2024120614443700, 2024120614575800, 2024120607804600, 2024120615463100, 2024120601694400, 2024120600520900, 2024120604156100, 2024120604497100, 2024120607189200, 2024120612450000, 2024120607687400, 2024120602016500, 20241
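
Printing the full sets is hard to read, so the sizes of the two differences are more informative. The sketch below derives them from the row counts reported in this notebook (58,248 and 59,014 rows above, and 58,146 trades in each filtered table further down), under the assumption that each RTRS control number appears at most once per dataset.

```python
old_count = 58248       # rows in the old df on 2024-12-06
new_count = 59014       # rows in the new df on 2024-12-06
shared_count = 58146    # rows left in each df after keeping only the shared RTRS control numbers

# assumes RTRS control numbers are unique within each dataset
only_in_old = old_count - shared_count
only_in_new = new_count - shared_count
print(f'Only in old: {only_in_old} ({only_in_old / old_count:.2%} of the old df)')
print(f'Only in new: {only_in_new} ({only_in_new / new_count:.2%} of the new df)')
```

Under that assumption, 102 trades appear only in the old data and 868 only in the new data.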

### Anomaly 2: same model has vastly different predictions
First compare the model's errors on each full dataset, then restrict both datasets to the RTRS control numbers they share. With identical sets of trades, we can be confident that the discrepancy is not caused by a few outlier CUSIPs that appear in only one dataset.

In [10]:
similar_trades_model_2024_12_06, _ = load_model('2024-12-06', 'yield_spread_with_similar_trades')

BEGIN load_model
Attempting to load model from gs://automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-06
Model failed to load from gs://automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-06 with exception: Error executing an HTTP request: HTTP response code 404 with body '<?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Details>No such object: automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-06</Details></Error>'
	 when reading gs://automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-06
Attempting to load model from gs://automated_training/similar-trades-model-2024-12-06




Model loaded from gs://automated_training/similar-trades-model-2024-12-06
END load_model. Execution time: 0:00:43.779


In [22]:
def create_summary_of_results_for_model(df: pd.DataFrame, model, return_predictions_and_delta: bool = False):    # returns the summary table, plus the predictions and deltas if `return_predictions_and_delta` is True
    encoders, _ = fit_encoders(df, CATEGORICAL_FEATURES, MODEL)
    return create_summary_of_results(model, df, *create_input(df, encoders, MODEL), print_results=False, return_predictions_and_delta=return_predictions_and_delta)

In [None]:
create_summary_of_results_for_model(old_df_on_2024_12_06, similar_trades_model_2024_12_06)

BEGIN create_input
END create_input. Execution time: 0:00:00.167




| Subset | Mean Absolute Error | Trade Count |
|---|---|---|
| Entire set | 10.895 | 58248 |
| Dealer-Dealer | 10.961 | 21730 |
| Bid Side / Dealer-Purchase | 10.782 | 16752 |
| Offered Side / Dealer-Sell | 10.918 | 19766 |
| AAA | 9.897 | 8953 |
| Investment Grade | 10.445 | 47445 |
| Trade size >= 100k | 9.654 | 13024 |
| Last trade <= 7 days | 9.553 | 40836 |
| 7 days < Last trade <= 14 days | 11.627 | 4164 |
| 14 days < Last trade <= 28 days | 13.343 | 5090 |


In [13]:
create_summary_of_results_for_model(new_df_on_2024_12_06, similar_trades_model_2024_12_06)

BEGIN create_input
END create_input. Execution time: 0:00:00.168


| Subset | Mean Absolute Error | Trade Count |
|---|---|---|
| Entire set | 14.761 | 59014 |
| Dealer-Dealer | 14.974 | 21781 |
| Bid Side / Dealer-Purchase | 14.038 | 16769 |
| Offered Side / Dealer-Sell | 15.126 | 20464 |
| AAA | 13.12 | 8915 |
| Investment Grade | 13.929 | 47749 |
| Trade size >= 100k | 15.171 | 13723 |
| Last trade <= 7 days | 13.273 | 41617 |
| 7 days < Last trade <= 14 days | 14.818 | 4160 |
| 14 days < Last trade <= 28 days | 16.445 | 5084 |


Select only rows that have an RTRS control number in both datasets.

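Conclusion: even with the same RTRS control numbers (and thus the same trade count of 58,146), the errors are very different: the mean absolute error on the entire set is 10.89 on the old data versus 14.514 on the new data.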
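
The comparison in this section can be sketched on toy data: restrict both datasets to the keys they share, then compute the mean absolute error on exactly the same trades. All values below are made up, and the function name `mean_absolute_error_on_shared_keys` is hypothetical.

```python
def mean_absolute_error_on_shared_keys(errors_a: dict, errors_b: dict) -> tuple:
    '''Return the MAE of each dataset restricted to the keys present in both, plus the number of shared keys.'''
    shared_keys = errors_a.keys() & errors_b.keys()
    if not shared_keys:
        raise ValueError('The two datasets have no keys in common')
    mae_a = sum(abs(errors_a[key]) for key in shared_keys) / len(shared_keys)
    mae_b = sum(abs(errors_b[key]) for key in shared_keys) / len(shared_keys)
    return mae_a, mae_b, len(shared_keys)


# signed prediction errors keyed by a toy control number
old_errors = {1: 2.0, 2: -4.0, 3: 6.0}
new_errors = {2: 10.0, 3: -8.0, 4: 1.0}
mae_old, mae_new, trade_count = mean_absolute_error_on_shared_keys(old_errors, new_errors)
print(mae_old, mae_new, trade_count)
```

Because both averages run over the same keys, any remaining gap between them cannot be attributed to trades that exist in only one dataset.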

In [17]:
rtrs_control_numbers_in_both_old_data_and_new_data = old_df_on_2024_12_06_rtrs_control_numbers & new_df_on_2024_12_06_rtrs_control_numbers
old_df_on_2024_12_06_same_rtrs_control_numbers = old_df_on_2024_12_06[old_df_on_2024_12_06['rtrs_control_number'].isin(rtrs_control_numbers_in_both_old_data_and_new_data)]
new_df_on_2024_12_06_same_rtrs_control_numbers = new_df_on_2024_12_06[new_df_on_2024_12_06['rtrs_control_number'].isin(rtrs_control_numbers_in_both_old_data_and_new_data)]

In [23]:
results_old_df, predictions_old_df, delta_old_df = create_summary_of_results_for_model(old_df_on_2024_12_06_same_rtrs_control_numbers, similar_trades_model_2024_12_06, return_predictions_and_delta=True)
results_old_df

BEGIN create_input
END create_input. Execution time: 0:00:00.167




| Subset | Mean Absolute Error | Trade Count |
|---|---|---|
| Entire set | 10.89 | 58146 |
| Dealer-Dealer | 10.96 | 21692 |
| Bid Side / Dealer-Purchase | 10.777 | 16728 |
| Offered Side / Dealer-Sell | 10.91 | 19726 |
| AAA | 9.9 | 8947 |
| Investment Grade | 10.436 | 47359 |
| Trade size >= 100k | 9.647 | 12996 |
| Last trade <= 7 days | 9.546 | 40756 |
| 7 days < Last trade <= 14 days | 11.619 | 4159 |
| 14 days < Last trade <= 28 days | 13.323 | 5083 |


In [24]:
# work on an explicit copy so that adding columns does not trigger pandas' SettingWithCopyWarning
old_df_on_2024_12_06_same_rtrs_control_numbers = old_df_on_2024_12_06_same_rtrs_control_numbers.copy()
old_df_on_2024_12_06_same_rtrs_control_numbers['ys_prediction'] = predictions_old_df
old_df_on_2024_12_06_same_rtrs_control_numbers['ys_delta'] = delta_old_df


In [26]:
results_new_df, predictions_new_df, delta_new_df = create_summary_of_results_for_model(new_df_on_2024_12_06_same_rtrs_control_numbers, similar_trades_model_2024_12_06, return_predictions_and_delta=True)
results_new_df

BEGIN create_input
END create_input. Execution time: 0:00:00.170


| Subset | Mean Absolute Error | Trade Count |
|---|---|---|
| Entire set | 14.514 | 58146 |
| Dealer-Dealer | 14.881 | 21692 |
| Bid Side / Dealer-Purchase | 14.021 | 16728 |
| Offered Side / Dealer-Sell | 14.529 | 19726 |
| AAA | 13.095 | 8908 |
| Investment Grade | 13.712 | 47229 |
| Trade size >= 100k | 14.243 | 12996 |
| Last trade <= 7 days | 12.933 | 40756 |
| 7 days < Last trade <= 14 days | 14.817 | 4159 |
| 14 days < Last trade <= 28 days | 16.446 | 5083 |


In [27]:
# work on an explicit copy so that adding columns does not trigger pandas' SettingWithCopyWarning
new_df_on_2024_12_06_same_rtrs_control_numbers = new_df_on_2024_12_06_same_rtrs_control_numbers.copy()
new_df_on_2024_12_06_same_rtrs_control_numbers['ys_prediction'] = predictions_new_df
new_df_on_2024_12_06_same_rtrs_control_numbers['ys_delta'] = delta_new_df


Investigate a single trade and its prediction. To make the investigation more informative, select a trade with a small error in one of the datasets and a large error in the other. To do that, join the two datasets (each with an error column) on the RTRS control number. To go deeper from there, we must investigate the input that is passed into the model.
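
One way to sketch the join described above, on toy data: pair up the absolute errors by control number and sort by how much worse the new dataset is. The values and the helper name `rank_by_error_gap` are illustrative assumptions; the notebook itself uses a quantile-based selection in the cells that follow.

```python
def rank_by_error_gap(old_delta: dict, new_delta: dict) -> list:
    '''Return (control_number, old_error, new_error) tuples sorted by new_error - old_error, largest gap first.'''
    shared = old_delta.keys() & new_delta.keys()
    joined = [(control_number, old_delta[control_number], new_delta[control_number]) for control_number in shared]
    return sorted(joined, key=lambda row: row[2] - row[1], reverse=True)


# toy absolute errors keyed by control number
old_delta = {'A': 0.6, 'B': 12.0, 'C': 0.5}
new_delta = {'A': 40.2, 'B': 11.0, 'C': 73.9}
for control_number, old_error, new_error in rank_by_error_gap(old_delta, new_delta):
    print(control_number, old_error, new_error)
```

Trades at the top of this ranking are the most useful ones to inspect, since the model does well on them with the old data and poorly with the new data.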

In [28]:
joined_df_2024_12_06_same_rtrs_control_numbers = pd.merge(old_df_on_2024_12_06_same_rtrs_control_numbers, new_df_on_2024_12_06_same_rtrs_control_numbers, on='rtrs_control_number', suffixes=('_old', '_new'))

In [30]:
old_df_on_2024_12_06_small_error_rtrs_control_numbers = old_df_on_2024_12_06_same_rtrs_control_numbers[old_df_on_2024_12_06_same_rtrs_control_numbers['ys_delta'] < old_df_on_2024_12_06_same_rtrs_control_numbers['ys_delta'].quantile(0.1)]['rtrs_control_number'].tolist()    # get `rtrs_control_number` where the error is in the bottom 10%
new_df_on_2024_12_06_large_error_rtrs_control_numbers = new_df_on_2024_12_06_same_rtrs_control_numbers[new_df_on_2024_12_06_same_rtrs_control_numbers['ys_delta'] > new_df_on_2024_12_06_same_rtrs_control_numbers['ys_delta'].quantile(0.9)]['rtrs_control_number'].tolist()    # get `rtrs_control_number` where the error is in the top 10%

In [None]:
small_error_in_old_df_large_error_in_new_df = list(set(old_df_on_2024_12_06_small_error_rtrs_control_numbers) & set(new_df_on_2024_12_06_large_error_rtrs_control_numbers))
assert len(small_error_in_old_df_large_error_in_new_df) != 0

In [65]:
def find_columns_with_different_values(one_item_df1: pd.DataFrame, one_item_df2: pd.DataFrame) -> list:
    # Ensure both DataFrames have the same columns
    if not one_item_df1.columns.equals(one_item_df2.columns):
        df1_columns, df2_columns = set(one_item_df1.columns), set(one_item_df2.columns)
        columns_in_both = list(df1_columns & df2_columns)    # convert to a list because recent pandas versions do not accept a set as an indexer
        one_item_df1, one_item_df2 = one_item_df1[columns_in_both], one_item_df2[columns_in_both]
        warnings.warn(f'DataFrames must have the same columns to compare. The following columns were removed since they were in one dataframe but not the other: {(df1_columns - df2_columns) | (df2_columns - df1_columns)}', RuntimeWarning)

    trade_history_columns = ['trade_history', 'similar_trade_history', 'target_attention_features']    # removing these columns because they hold arrays, which cannot be easily compared with `!=`
    one_item_df1 = one_item_df1.drop(columns=trade_history_columns)
    one_item_df2 = one_item_df2.drop(columns=trade_history_columns)

    # Identify columns with different values
    differences = [col for col in one_item_df1.columns if (not pd.isna(one_item_df1[col].iloc[0]) and not pd.isna(one_item_df2[col].iloc[0]) and one_item_df1[col].iloc[0] != one_item_df2[col].iloc[0]) 
                                                          or (pd.isna(one_item_df1[col].iloc[0]) != pd.isna(one_item_df2[col].iloc[0]))]

    # Display differing columns
    if differences:
        print('Columns with different values:', differences)
    else:
        print('No differences found')
    return differences

In [None]:
pd.set_option('display.max_columns', None)    # when displaying the dataframe, do not use ellipses to truncate the middle columns

In [71]:
def create_input_for_df(df: pd.DataFrame):    # returns the output of `create_input(...)`
    encoders, _ = fit_encoders(df, CATEGORICAL_FEATURES, MODEL)
    return create_input(df, encoders, MODEL)

##### Investigate a single RTRS control number
Conclusion: the `is_callable` feature has different values in the two datasets (`False` in the old data, `True` in the new data). This means we must change the `fast_trade_history_redis_update` cloud function to use the new reference data Redis, because the `calc_date` is derived from the reference data, whereas the pipeline currently fills in the `calc_date` from the BigQuery table `auxiliary_views.calculation_date_and_price`.

In [44]:
rtrs_control_number_of_interest = small_error_in_old_df_large_error_in_new_df[0]    # arbitrarily choose the first one
rtrs_control_number_of_interest

2024120604330500

In [46]:
old_df_rtrs_control_number_of_interest = old_df_on_2024_12_06_same_rtrs_control_numbers[old_df_on_2024_12_06_same_rtrs_control_numbers['rtrs_control_number'] == rtrs_control_number_of_interest]
old_df_rtrs_control_number_of_interest

Unnamed: 0,rtrs_control_number,cusip,yield,is_callable,refund_date,accrual_date,dated_date,next_sink_date,coupon,delivery_date,trade_date,trade_datetime,par_call_date,interest_payment_frequency,is_called,is_non_transaction_based_compensation,is_general_obligation,callable_at_cav,extraordinary_make_whole_call,make_whole_call,has_unexpired_lines_of_credit,escrow_exists,incorporated_state_code,trade_type,par_traded,maturity_date,settlement_date,next_call_date,issue_amount,maturity_amount,issue_price,orig_principal_amount,max_amount_outstanding,dollar_price,calc_date,purpose_sub_class,called_redemption_type,calc_day_cat,previous_coupon_payment_date,instrument_primary_name,purpose_class,call_timing,call_timing_in_part,sink_frequency,sink_amount_type,issue_text,state_tax_status,series_name,transaction_type,next_call_price,par_call_price,when_issued,min_amount_outstanding,original_yield,par_price,default_indicator,sp_stand_alone,sp_long,moodys_long,coupon_type,federal_tax_status,use_of_proceeds,muni_security_type,muni_issue_type,capital_type,other_enhancement_type,next_coupon_payment_date,first_coupon_date,last_period_accrues_from_date,rating,trade_history,last_yield_spread,last_ficc_ycl,last_rtrs_control_number,last_yield,last_dollar_price,last_seconds_ago,last_size,last_calc_date,last_maturity_date,last_next_call_date,last_par_call_date,last_refund_date,last_trade_datetime,last_calc_day_cat,last_settlement_date,last_trade_type,similar_trade_history,ficc_ycl,yield_spread,treasury_rate,ficc_treasury_spread,quantity,callable,called,zerocoupon,whenissued,sinking,deferred,days_to_settle,days_to_maturity,days_to_call,days_to_refund,days_to_par,call_to_maturity,accrued_days,days_in_interest_payment,scaled_accrued_days,A/E,last_trade_date,new_ficc_ycl,target_attention_features,new_ys,max_ys_ys,max_ys_ttypes,max_ys_ago,max_ys_qdiff,min_ys_ys,min_ys_ttypes,min_ys_ago,min_ys_qdiff,max_qty_ys,max_qty_ttypes,max_qty_ago,max_qty_qdiff,min_ago_ys,min_ago_ttypes,min_ago_ago,min_ago_qdiff,D_min_ago_ys,D_min_ago_ttypes,D_min_ago_ago,D_min_ago_qdiff,P_min_ago_ys,P_min_ago_ttypes,P_min_ago_ago,P_min_ago_qdiff,S_min_ago_ys,S_min_ago_ttypes,S_min_ago_ago,S_min_ago_qdiff,ys_prediction,ys_delta
42769,2024120604330500,13124MDA0,267.0,False,NaT,2024-12-09,2024-12-09,NaT,5.0,2024-12-09,2024-12-06,2024-12-06 11:15:00,NaT,Semiannually,False,False,False,False,False,False,False,False,CA,S,535000.0,2036-07-01,2024-12-09,NaT,7.782078,6.642465,119.545,6.639487,6.642465,119.545,2036-07-01,125,0,2,NaT,REF BDS 2024A,50,0,0,0,10,REF BDS,1,2024A,I,100.0,100.0,True,4390000,2.67,100.0,False,NR,AA+,Aa2,8,2,66,8,,1,,2025-07-01,2025-07-01,2036-01-01,AA+,"[[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0...",0.0,,,,0.0,0.0,,NaT,NaT,NaT,NaT,NaT,NaT,,NaT,,"[[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0...",316.530858,-49.530858,4.15,-98.469142,5.728354,False,False,False,True,False,False,3,3.625621,0.0,0.0,0.0,0.0,0,180.0,0.0,0.0,NaT,316.530858,"[[5.728353977203369, 0.0, 1.0]]",-49.530858,0.0,DS,0.0,5.728354,0.0,DS,0.0,5.728354,0.0,DS,0.0,5.728354,0.0,DS,0.0,5.728354,0.0,DS,0.0,5.728354,0.0,DS,0.0,5.728354,0.0,DS,0.0,5.728354,-50.176018,0.64516


In [47]:
new_df_rtrs_control_number_of_interest = new_df_on_2024_12_06_same_rtrs_control_numbers[new_df_on_2024_12_06_same_rtrs_control_numbers['rtrs_control_number'] == rtrs_control_number_of_interest]
new_df_rtrs_control_number_of_interest

Unnamed: 0,rtrs_control_number,cusip,yield,is_callable,refund_date,accrual_date,dated_date,next_sink_date,coupon,delivery_date,trade_date,trade_datetime,par_call_date,interest_payment_frequency,is_called,is_non_transaction_based_compensation,is_general_obligation,callable_at_cav,extraordinary_make_whole_call,make_whole_call,has_unexpired_lines_of_credit,escrow_exists,incorporated_state_code,trade_type,par_traded,maturity_date,settlement_date,next_call_date,issue_amount,maturity_amount,issue_price,orig_principal_amount,max_amount_outstanding,dollar_price,calc_date,purpose_sub_class,called_redemption_type,calc_day_cat,previous_coupon_payment_date,instrument_primary_name,purpose_class,call_timing,call_timing_in_part,sink_frequency,sink_amount_type,issue_text,state_tax_status,series_name,transaction_type,next_call_price,par_call_price,when_issued,min_amount_outstanding,original_yield,par_price,default_indicator,sp_long,coupon_type,federal_tax_status,use_of_proceeds,muni_security_type,muni_issue_type,capital_type,other_enhancement_type,next_coupon_payment_date,first_coupon_date,last_period_accrues_from_date,rating,trade_history,last_yield_spread,last_ficc_ycl,last_rtrs_control_number,last_yield,last_dollar_price,last_seconds_ago,last_size,last_calc_date,last_maturity_date,last_next_call_date,last_par_call_date,last_refund_date,last_trade_datetime,last_calc_day_cat,last_settlement_date,last_trade_type,similar_trade_history,ficc_ycl,yield_spread,treasury_rate,ficc_treasury_spread,quantity,callable,called,zerocoupon,whenissued,sinking,deferred,days_to_settle,days_to_maturity,days_to_call,days_to_refund,days_to_par,call_to_maturity,accrued_days,days_in_interest_payment,scaled_accrued_days,A/E,last_trade_date,new_ficc_ycl,target_attention_features,new_ys,max_ys_ys,max_ys_ttypes,max_ys_ago,max_ys_qdiff,min_ys_ys,min_ys_ttypes,min_ys_ago,min_ys_qdiff,max_qty_ys,max_qty_ttypes,max_qty_ago,max_qty_qdiff,min_ago_ys,min_ago_ttypes,min_ago_ago,min_ago_qdiff,D_min_ago_ys,D_min_ago_ttypes,D_min_ago_ago,D_min_ago_qdiff,P_min_ago_ys,P_min_ago_ttypes,P_min_ago_ago,P_min_ago_qdiff,S_min_ago_ys,S_min_ago_ttypes,S_min_ago_ago,S_min_ago_qdiff,ys_prediction,ys_delta
43371,2024120604330500,13124MDA0,267.0,True,NaT,2024-12-09,2024-12-09,NaT,5.0,2024-12-09,2024-12-06,2024-12-06 11:15:00,2034-07-01,Semiannually,False,False,False,False,True,False,False,False,CA,S,535000.0,2036-07-01,2024-12-09,2034-07-01,6.638988,6.638988,119.545,6.638988,0.0,119.545,2036-07-01,50,0,2,NaT,"Water Revenue Refunding Bonds, Series 2024A",50,0,0,0,10,"Water Revenue Refunding Bonds, Series 2024A",1,No series name,I,100.0,100.0,True,0,2.67,100.0,False,AA+,8,2,60,8,,4,,NaT,2025-07-01,2036-01-01,AA+,"[[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0...",0.0,,,,0.0,0.0,,NaT,NaT,NaT,NaT,NaT,NaT,,NaT,,"[[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0...",316.530858,-49.530858,4.15,-98.469142,5.728354,True,False,False,True,False,False,3,3.625621,3.543074,0.0,3.543074,2.864511,0,180.0,0.0,0.0,NaT,316.530858,"[[5.728353977203369, 0.0, 1.0]]",-49.530858,0.0,DS,0.0,5.728354,0.0,DS,0.0,5.728354,0.0,DS,0.0,5.728354,0.0,DS,0.0,5.728354,0.0,DS,0.0,5.728354,0.0,DS,0.0,5.728354,0.0,DS,0.0,5.728354,-9.312593,40.218265


In [66]:
find_columns_with_different_values(old_df_rtrs_control_number_of_interest, new_df_rtrs_control_number_of_interest)

Columns with different values: ['ys_prediction', 'min_amount_outstanding', 'next_coupon_payment_date', 'max_amount_outstanding', 'orig_principal_amount', 'maturity_amount', 'capital_type', 'is_callable', 'days_to_par', 'use_of_proceeds', 'call_to_maturity', 'next_call_date', 'par_call_date', 'issue_text', 'instrument_primary_name', 'days_to_call', 'callable', 'ys_delta', 'extraordinary_make_whole_call', 'issue_amount', 'purpose_sub_class', 'series_name']




['ys_prediction',
 'min_amount_outstanding',
 'next_coupon_payment_date',
 'max_amount_outstanding',
 'orig_principal_amount',
 'maturity_amount',
 'capital_type',
 'is_callable',
 'days_to_par',
 'use_of_proceeds',
 'call_to_maturity',
 'next_call_date',
 'par_call_date',
 'issue_text',
 'instrument_primary_name',
 'days_to_call',
 'callable',
 'ys_delta',
 'extraordinary_make_whole_call',
 'issue_amount',
 'purpose_sub_class',
 'series_name']

##### Investigate another RTRS control number
Conclusion: this trade has different values in `create_input(...)` for the following features:
- `issue_amount` (index 4)
- `max_amount_outstanding` (index 12)
- `A/E` (index 15)
- `has_unexpired_lines_of_credit` (index 46)

These are in the bottom third of previous LightGBM feature importance plots (e.g., see [this notebook](https://github.com/Ficc-ai/ficc/blob/dev/ml_models/sequence_predictors/yield_spread_models/yield_spread_model_similar_history.ipynb)), so we chalk this discrepancy up to randomness from neural network training.
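
The feature indices listed above can be found by comparing the model inputs position by position. The sketch below assumes, purely for illustration, that the input for one trade can be flattened into a sequence of numbers; the actual structure returned by `create_input(...)` is not shown in this notebook, and `indices_with_different_values` is a hypothetical helper.

```python
import math


def indices_with_different_values(features_a, features_b, tolerance: float = 1e-9) -> list:
    '''Return the positions at which two equally long numeric sequences differ.'''
    if len(features_a) != len(features_b):
        raise ValueError('Both inputs must have the same length')
    return [index for index, (a, b) in enumerate(zip(features_a, features_b))
            if not math.isclose(a, b, rel_tol=0.0, abs_tol=tolerance)]


# toy feature vectors for the same trade, built from the old and the new data
old_features = [5.0, 0.644444, 1.0, 3.419129]
new_features = [5.0, 0.627778, 0.0, 3.419129]
print(indices_with_different_values(old_features, new_features))
```

The returned positions can then be mapped back to feature names, which is how a list such as the one above would be assembled.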

In [67]:
rtrs_control_number_of_interest = small_error_in_old_df_large_error_in_new_df[1]    # arbitrarily choose the second one
rtrs_control_number_of_interest

2024120613576200

In [68]:
old_df_rtrs_control_number_of_interest = old_df_on_2024_12_06_same_rtrs_control_numbers[old_df_on_2024_12_06_same_rtrs_control_numbers['rtrs_control_number'] == rtrs_control_number_of_interest]
old_df_rtrs_control_number_of_interest

Unnamed: 0,rtrs_control_number,cusip,yield,is_callable,refund_date,accrual_date,dated_date,next_sink_date,coupon,delivery_date,trade_date,trade_datetime,par_call_date,interest_payment_frequency,is_called,is_non_transaction_based_compensation,is_general_obligation,callable_at_cav,extraordinary_make_whole_call,make_whole_call,has_unexpired_lines_of_credit,escrow_exists,incorporated_state_code,trade_type,par_traded,maturity_date,settlement_date,next_call_date,issue_amount,maturity_amount,issue_price,orig_principal_amount,max_amount_outstanding,dollar_price,calc_date,purpose_sub_class,called_redemption_type,calc_day_cat,previous_coupon_payment_date,instrument_primary_name,purpose_class,call_timing,call_timing_in_part,sink_frequency,sink_amount_type,issue_text,state_tax_status,series_name,transaction_type,next_call_price,par_call_price,when_issued,min_amount_outstanding,original_yield,par_price,default_indicator,sp_stand_alone,sp_long,moodys_long,coupon_type,federal_tax_status,use_of_proceeds,muni_security_type,muni_issue_type,capital_type,other_enhancement_type,next_coupon_payment_date,first_coupon_date,last_period_accrues_from_date,rating,trade_history,last_yield_spread,last_ficc_ycl,last_rtrs_control_number,last_yield,last_dollar_price,last_seconds_ago,last_size,last_calc_date,last_maturity_date,last_next_call_date,last_par_call_date,last_refund_date,last_trade_datetime,last_calc_day_cat,last_settlement_date,last_trade_type,similar_trade_history,ficc_ycl,yield_spread,treasury_rate,ficc_treasury_spread,quantity,callable,called,zerocoupon,whenissued,sinking,deferred,days_to_settle,days_to_maturity,days_to_call,days_to_refund,days_to_par,call_to_maturity,accrued_days,days_in_interest_payment,scaled_accrued_days,A/E,last_trade_date,new_ficc_ycl,target_attention_features,new_ys,max_ys_ys,max_ys_ttypes,max_ys_ago,max_ys_qdiff,min_ys_ys,min_ys_ttypes,min_ys_ago,min_ys_qdiff,max_qty_ys,max_qty_ttypes,max_qty_ago,max_qty_qdiff,min_ago_ys,min_ago_ttypes,min_ago_ago,min_ago_qdiff,D_min_ago_ys,D_min_ago_ttypes,D_min_ago_ago,D_min_ago_qdiff,P_min_ago_ys,P_min_ago_ttypes,P_min_ago_ago,P_min_ago_qdiff,S_min_ago_ys,S_min_ago_ttypes,S_min_ago_ago,S_min_ago_qdiff,ys_prediction,ys_delta
5683,2024120613576200,639319MW9,415.9,False,2025-02-15,2018-03-20,2018-03-20,NaT,5.0,2018-03-20,2024-12-06,2024-12-06 15:35:42,2025-02-15,Semiannually,True,False,True,False,False,False,False,True,TX,P,20000.0,2032-02-15,2024-12-09,2025-02-15,7.67619,5.963788,113.517,5.963788,5.963788,100.141,2025-02-15,51,13,3,2024-08-15,UNLTD TAX BLDG BDS 2018,37,1,1,0,10,UNLTD TAX BLDG BDS,2,2018,I,100.0,100.0,False,920000,2.83,100.0,False,NR,MR,Aaa,8,2,9,5,,6,21,2025-02-15,2019-02-15,2024-08-15,MR,"[[11.17359190132899, -92.0, 5.0, 0.0, 1.0, 7.8...",11.173592,177.826408,2022073000000000.0,189.0,107.673,74499947.0,100000.0,2025-02-15,2032-02-15,2025-02-15,2025-02-15,2025-02-15,2022-07-28 09:09:55,3.0,2022-08-01,S,"[[55.81245262092841, -75.4, 4.698969841003418,...",288.011631,127.888369,4.19,-130.988369,4.30103,False,True,False,False,False,False,3,3.419129,1.838849,1.838849,1.838849,3.407731,2419,180.0,1209.5,0.644444,2022-07-28,274.949132,"[[4.301030158996582, 1.0, 0.0]]",140.950868,27.018562,SP,7.931268,5.812914,11.173592,SP,7.872156,4.903095,27.018562,SP,7.931268,5.812914,11.173592,SP,7.872156,4.903095,23.884903,DP,7.872568,4.903095,23.884903,PP,7.872568,4.903095,11.173592,SP,7.872156,4.903095,140.397034,0.553835


In [69]:
new_df_rtrs_control_number_of_interest = new_df_on_2024_12_06_same_rtrs_control_numbers[new_df_on_2024_12_06_same_rtrs_control_numbers['rtrs_control_number'] == rtrs_control_number_of_interest]
new_df_rtrs_control_number_of_interest

Unnamed: 0,rtrs_control_number,cusip,yield,is_callable,refund_date,accrual_date,dated_date,next_sink_date,coupon,delivery_date,trade_date,trade_datetime,par_call_date,interest_payment_frequency,is_called,is_non_transaction_based_compensation,is_general_obligation,callable_at_cav,extraordinary_make_whole_call,make_whole_call,has_unexpired_lines_of_credit,escrow_exists,incorporated_state_code,trade_type,par_traded,maturity_date,settlement_date,next_call_date,issue_amount,maturity_amount,issue_price,orig_principal_amount,max_amount_outstanding,dollar_price,calc_date,purpose_sub_class,called_redemption_type,calc_day_cat,previous_coupon_payment_date,instrument_primary_name,purpose_class,call_timing,call_timing_in_part,sink_frequency,sink_amount_type,issue_text,state_tax_status,series_name,transaction_type,next_call_price,par_call_price,when_issued,min_amount_outstanding,original_yield,par_price,default_indicator,sp_long,coupon_type,federal_tax_status,use_of_proceeds,muni_security_type,muni_issue_type,capital_type,other_enhancement_type,next_coupon_payment_date,first_coupon_date,last_period_accrues_from_date,rating,trade_history,last_yield_spread,last_ficc_ycl,last_rtrs_control_number,last_yield,last_dollar_price,last_seconds_ago,last_size,last_calc_date,last_maturity_date,last_next_call_date,last_par_call_date,last_refund_date,last_trade_datetime,last_calc_day_cat,last_settlement_date,last_trade_type,similar_trade_history,ficc_ycl,yield_spread,treasury_rate,ficc_treasury_spread,quantity,callable,called,zerocoupon,whenissued,sinking,deferred,days_to_settle,days_to_maturity,days_to_call,days_to_refund,days_to_par,call_to_maturity,accrued_days,days_in_interest_payment,scaled_accrued_days,A/E,last_trade_date,new_ficc_ycl,target_attention_features,new_ys,max_ys_ys,max_ys_ttypes,max_ys_ago,max_ys_qdiff,min_ys_ys,min_ys_ttypes,min_ys_ago,min_ys_qdiff,max_qty_ys,max_qty_ttypes,max_qty_ago,max_qty_qdiff,min_ago_ys,min_ago_ttypes,min_ago_ago,min_ago_qdiff,D_min_ago_ys,D_min_ago_ttypes,D_min_ago_ago,D_min_ago_qdiff,P_min_ago_ys,P_min_ago_ttypes,P_min_ago_ago,P_min_ago_qdiff,S_min_ago_ys,S_min_ago_ttypes,S_min_ago_ago,S_min_ago_qdiff,ys_prediction,ys_delta
5663,2024120613576200,639319MW9,415.9,False,2025-02-15,2018-03-20,2018-03-20,NaT,5.0,2018-03-20,2024-12-06,2024-12-06 15:35:42,2025-02-15,Semiannually,True,False,True,False,False,False,True,True,TX,P,20000.0,2032-02-15,2024-12-09,2025-02-15,5.963789,5.963788,111.583949,5.963788,0.0,100.141,2025-02-15,51,13,3,2024-08-18,Unlimited Tax School Building Bonds - Series 2018,37,0,0,0,10,Unlimited Tax School Building Bonds - Series 2018,1,No series name,I,100.0,100.0,False,0,2.83,100.0,False,MR,8,2,9,5,,6,,2025-02-18,2019-02-15,2031-08-15,MR,"[[11.17359190132899, -92.0, 5.0, 0.0, 1.0, 7.8...",11.173592,177.826408,2022073000000000.0,189.0,107.673,74499947.0,100000.0,2025-02-15,2032-02-15,2025-02-15,2025-02-15,2025-02-15,2022-07-28 09:09:55,3.0,2022-08-01,S,"[[55.81245262092841, -75.4, 4.698969841003418,...",288.011631,127.888369,4.19,-130.988369,4.30103,False,True,False,False,False,False,3,3.419129,1.838849,1.838849,1.838849,3.407731,2419,180.0,1209.5,0.627778,2022-07-28,274.949132,"[[4.301030158996582, 1.0, 0.0]]",140.950868,27.018562,SP,7.931268,5.812914,11.173592,SP,7.872156,4.903095,27.018562,SP,7.931268,5.812914,11.173592,SP,7.872156,4.903095,23.884903,DP,7.872568,4.903095,23.884903,PP,7.872568,4.903095,11.173592,SP,7.872156,4.903095,67.009895,73.940973


In [70]:
find_columns_with_different_values(old_df_rtrs_control_number_of_interest, new_df_rtrs_control_number_of_interest)

Columns with different values: ['ys_prediction', 'min_amount_outstanding', 'next_coupon_payment_date', 'max_amount_outstanding', 'previous_coupon_payment_date', 'other_enhancement_type', 'issue_text', 'instrument_primary_name', 'last_period_accrues_from_date', 'has_unexpired_lines_of_credit', 'call_timing_in_part', 'issue_price', 'call_timing', 'ys_delta', 'A/E', 'issue_amount', 'series_name', 'state_tax_status']




['ys_prediction',
 'min_amount_outstanding',
 'next_coupon_payment_date',
 'max_amount_outstanding',
 'previous_coupon_payment_date',
 'other_enhancement_type',
 'issue_text',
 'instrument_primary_name',
 'last_period_accrues_from_date',
 'has_unexpired_lines_of_credit',
 'call_timing_in_part',
 'issue_price',
 'call_timing',
 'ys_delta',
 'A/E',
 'issue_amount',
 'series_name',
 'state_tax_status']
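
The list above names the columns that differ for this trade but not how they differ. A minimal sketch for viewing the old and new values next to each other is given here; `side_by_side` is a hypothetical helper (not part of `automated_training`), and the toy frames merely stand in for the two single-row DataFrames used in this notebook.

```python
import pandas as pd


def side_by_side(old_row: pd.DataFrame, new_row: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return one line per requested column with the old and new value next to each other.

    Assumes each input DataFrame holds exactly one trade (one row).
    """
    old_values = old_row.iloc[0][columns]
    new_values = new_row.iloc[0][columns]
    return pd.DataFrame({'old': old_values, 'new': new_values})


# toy single-row frames with made-up values, used only to show the output shape
old_toy = pd.DataFrame({'series_name': ['a'], 'issue_price': [1.0], 'coupon': [5.0]})
new_toy = pd.DataFrame({'series_name': ['b'], 'issue_price': [2.0], 'coupon': [5.0]})

comparison = side_by_side(old_toy, new_toy, ['series_name', 'issue_price'])
print(comparison)
```

In the notebook, the same helper could be called with the two `..._rtrs_control_number_of_interest` DataFrames and the list returned by `find_columns_with_different_values(...)`, assuming each DataFrame contains a single row.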

In [72]:
create_input_for_df(old_df_rtrs_control_number_of_interest)

BEGIN create_input
END create_input. Execution time: 0:00:00.002


([array([[[ 55.81245262, -75.4       ,   4.69896984,   0.        ,
             0.        ,   4.95931334],
          [ 55.81245262, -75.4       ,   4.69896984,   0.        ,
             0.        ,   4.95931334],
          [ 50.61245262, -80.6       ,   4.69896984,   0.        ,
             1.        ,   4.95931334],
          [226.61245262,  95.4       ,   4.69896984,   1.        ,
             0.        ,   5.00898746],
          [122.91245262,  -8.3       ,   4.69896984,   0.        ,
             0.        ,   5.00899171]]]),
  array([[[ 11.1735919 , -92.        ,   5.        ,   0.        ,
             1.        ,   7.87215597],
          [ 23.88490342, -90.1       ,   5.        ,   0.        ,
             0.        ,   7.87256812],
          [ 23.88490342, -90.1       ,   5.        ,   1.        ,
             0.        ,   7.87256818],
          [ 18.98490342, -95.        ,   5.        ,   0.        ,
             0.        ,   7.87256998],
          [ 27.0185621 , -39.     

In [73]:
create_input_for_df(new_df_rtrs_control_number_of_interest)

BEGIN create_input
END create_input. Execution time: 0:00:00.001


([array([[[ 55.81245262, -75.4       ,   4.69896984,   0.        ,
             0.        ,   4.95931334],
          [ 55.81245262, -75.4       ,   4.69896984,   0.        ,
             0.        ,   4.95931334],
          [ 50.61245262, -80.6       ,   4.69896984,   0.        ,
             1.        ,   4.95931334],
          [226.61245262,  95.4       ,   4.69896984,   1.        ,
             0.        ,   5.00898746],
          [122.91245262,  -8.3       ,   4.69896984,   0.        ,
             0.        ,   5.00899171]]]),
  array([[[ 11.1735919 , -92.        ,   5.        ,   0.        ,
             1.        ,   7.87215597],
          [ 23.88490342, -90.1       ,   5.        ,   0.        ,
             0.        ,   7.87256812],
          [ 23.88490342, -90.1       ,   5.        ,   1.        ,
             0.        ,   7.87256818],
          [ 18.98490342, -95.        ,   5.        ,   0.        ,
             0.        ,   7.87256998],
          [ 27.0185621 , -39.     
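
The two printed inputs above look identical in the visible portion, but the output is truncated and hard to compare by eye. A sketch of a programmatic check follows. It assumes that `create_input_for_df` returns a nested structure of lists/tuples of NumPy arrays, as the printed output suggests; `nested_allclose` is a hypothetical helper introduced here for illustration.

```python
import numpy as np


def nested_allclose(a, b, atol: float = 1e-8) -> bool:
    """Recursively compare nested lists/tuples of arrays.

    Numeric arrays are compared with a tolerance (NaNs in the same position
    count as equal); everything else is compared for exact equality.
    """
    if isinstance(a, (list, tuple)):
        return (
            isinstance(b, (list, tuple))
            and len(a) == len(b)
            and all(nested_allclose(x, y, atol) for x, y in zip(a, b))
        )
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        return False
    if np.issubdtype(a.dtype, np.number) and np.issubdtype(b.dtype, np.number):
        return bool(np.allclose(a, b, atol=atol, equal_nan=True))
    return bool(np.array_equal(a, b))


# toy inputs that mimic the assumed structure: (list of arrays, array)
first = ([np.array([[1.0, 2.0]]), np.array([[3]])], np.array([0.5]))
second = ([np.array([[1.0, 2.0]]), np.array([[3]])], np.array([0.5]))
third = ([np.array([[1.0, 2.5]]), np.array([[3]])], np.array([0.5]))

print(nested_allclose(first, second))
print(nested_allclose(first, third))
```

If the check passes for the old and new rows, the columns that differ do not feed into the model input, and the source of any difference in predictions has to be looked for elsewhere.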

### Anomaly 3: test model has similar accuracy to production model
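The v2 yield spread with similar trades model (the test model) trained on 2024-12-06 used only 2 epochs, whereas the model it is compared against was trained for 100 epochs. The cells below load both models, then evaluate the first model on the old dataset and the v2 model on the new dataset, both restricted to 2024-12-06.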

In [28]:
similar_trades_model_2024_12_09, _ = load_model('2024-12-09', 'yield_spread_with_similar_trades')
similar_trades_model_v2_2024_12_09 = keras.models.load_model(os.path.join('gs://' + AUTOMATED_TRAINING_BUCKET, 'similar-trades-v2-model-2024-12-09'))    # path of the form: gs://<bucket>/<model>

BEGIN load_model
Attempting to load model from gs://automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-09
Model failed to load from gs://automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-09 with exception: Error executing an HTTP request: HTTP response code 404 with body '<?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Details>No such object: automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-09</Details></Error>'
	 when reading gs://automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-09
Attempting to load model from gs://automated_training/similar-trades-model-2024-12-09




Model loaded from gs://automated_training/similar-trades-model-2024-12-09
END load_model. Execution time: 0:00:39.141
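
The log above shows a first location failing with a 404 and a second location succeeding. That pattern, trying candidate paths in order and keeping the first one that loads, can be sketched as follows. This is not the implementation of `load_model` in `automated_training`; `load_first_available` and `fake_loader` are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar('T')


def load_first_available(paths: Iterable[str], loader: Callable[[str], T]) -> tuple[T, str]:
    """Try each path in order; return the loaded object and the path that worked."""
    errors = []
    for path in paths:
        try:
            return loader(path), path
        except Exception as e:  # broad on purpose: any failure moves on to the next candidate
            errors.append(f'{path}: {e}')
    raise FileNotFoundError('No candidate path could be loaded:\n' + '\n'.join(errors))


def fake_loader(path: str) -> str:
    """Stand-in for a real loader; fails for paths ending in '/missing'."""
    if path.endswith('/missing'):
        raise OSError('HTTP response code 404')
    return f'model from {path}'


model, used_path = load_first_available(['bucket/folder/missing', 'bucket/found'], fake_loader)
print(used_path)
```

With a real model, `keras.models.load_model` could be passed as `loader`. Returning the path alongside the model makes it explicit which archive was actually used, which matters when two similarly named models are being compared.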




In [34]:
create_summary_of_results_for_model(old_df_on_2024_12_06, similar_trades_model_2024_12_09)

BEGIN create_input
END create_input. Execution time: 0:00:00.158
 1/59 [..............................] - ETA: 1:04

2024-12-09 18:25:21.984458: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:693] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "CPU" model: "0" frequency: 2400 num_cores: 10 environment { key: "cpu_instruction_set" value: "ARM NEON" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 16384 l2_cache_size: 524288 l3_cache_size: 524288 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }




| | Mean Absolute Error | Trade Count |
|---|---|---|
| Entire set | 10.732 | 58248 |
| Dealer-Dealer | 10.913 | 21730 |
| Bid Side / Dealer-Purchase | 10.587 | 16752 |
| Offered Side / Dealer-Sell | 10.655 | 19766 |
| AAA | 9.588 | 8953 |
| Investment Grade | 10.334 | 47445 |
| Trade size >= 100k | 9.392 | 13024 |
| Last trade <= 7 days | 9.481 | 40836 |
| 7 days < Last trade <= 14 days | 11.335 | 4164 |
| 14 days < Last trade <= 28 days | 13.279 | 5090 |


In [35]:
create_summary_of_results_for_model(new_df_on_2024_12_06, similar_trades_model_v2_2024_12_09)

BEGIN create_input
END create_input. Execution time: 0:00:00.167
 1/60 [..............................] - ETA: 1:06

2024-12-09 18:25:32.008801: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:693] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "CPU" model: "0" frequency: 2400 num_cores: 10 environment { key: "cpu_instruction_set" value: "ARM NEON" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 16384 l2_cache_size: 524288 l3_cache_size: 524288 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }




| | Mean Absolute Error | Trade Count |
|---|---|---|
| Entire set | 11.264 | 59014 |
| Dealer-Dealer | 11.3 | 21781 |
| Bid Side / Dealer-Purchase | 11.415 | 16769 |
| Offered Side / Dealer-Sell | 11.101 | 20464 |
| AAA | 10.189 | 8915 |
| Investment Grade | 10.68 | 47749 |
| Trade size >= 100k | 9.303 | 13723 |
| Last trade <= 7 days | 9.853 | 41617 |
| 7 days < Last trade <= 14 days | 11.893 | 4160 |
| 14 days < Last trade <= 28 days | 13.454 | 5084 |
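
To quantify how close the two models are, the per-segment mean absolute errors from the two summary tables above can be subtracted directly. The sketch below copies the MAE values from those tables by hand; in the notebook, the two DataFrames returned by `create_summary_of_results_for_model(...)` could be used instead, assuming they share the same index. Note that the two runs differ in both model and dataset, so the deltas do not isolate the effect of the number of epochs.

```python
import pandas as pd

segments = [
    'Entire set',
    'Dealer-Dealer',
    'Bid Side / Dealer-Purchase',
    'Offered Side / Dealer-Sell',
    'AAA',
    'Investment Grade',
    'Trade size >= 100k',
    'Last trade <= 7 days',
    '7 days < Last trade <= 14 days',
    '14 days < Last trade <= 28 days',
]

# MAE values copied from the two summary tables above
mae = pd.DataFrame(
    {
        'similar_trades_model (old data)': [10.732, 10.913, 10.587, 10.655, 9.588, 10.334, 9.392, 9.481, 11.335, 13.279],
        'similar_trades_model_v2 (new data)': [11.264, 11.3, 11.415, 11.101, 10.189, 10.68, 9.303, 9.853, 11.893, 13.454],
    },
    index=segments,
)

# positive delta: the v2 model on the new data has the larger error
mae['delta'] = mae['similar_trades_model_v2 (new data)'] - mae['similar_trades_model (old data)']
mae['relative delta (%)'] = 100 * mae['delta'] / mae['similar_trades_model (old data)']

print(mae.round(3))
```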
