# Create Table for Compliance for Edward Jones


As is, this notebook transforms an Edward Jones CSV and prepares it for compliance pricing. This capability should possibly be extracted into a separate notebook dedicated to that purpose.

The associated task is here: ```https://ficcai.atlassian.net/browse/FA-2650```

The idea for this notebook is to create the trade history and reference data for compliance pricing using the training and testing pipeline. The reason is that, for timestamps prior to the point at which we converted the reference data and trade history, creating the reference data and trade history for multiple timestamps is difficult and expensive in terms of BigQuery costs. The key observation is that all of the trades have actually occurred, so we can get the reference data and trade history directly from the materialized table that is used in automated training.

The approach here is to take an input CSV that contains the following: (1) a direction (which must be converted to the MSRB convention), (2) a timestamp, (3) a CUSIP, (4) a quantity, and (5) a price. Then, we match each row to a single trade (`rtrs_control_number`) in the MSRB data. Finally, we take those `rtrs_control_number`s and use `materialized_trade_history` to price them.
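
The matching step described above can be sketched in pandas on toy data. This is only an illustration under stated assumptions: all values are made up, the column names on the client side (`client_trade_id`, `quantity`, `price`) are hypothetical, timestamps are compared at minute resolution, and prices are compared after rounding to one decimal place to tolerate small differences between the two sources.

```python
import pandas as pd

# Toy client trades (made-up values), already in MSRB trade-type convention and Eastern time
client = pd.DataFrame({
    'client_trade_id': [1, 2],
    'cusip': ['123456AB7', '987654CD3'],
    'trade_datetime': pd.to_datetime(['2024-10-01 10:15:42', '2024-10-01 11:30:05']),
    'trade_type': ['S', 'P'],
    'quantity': [25000, 10000],
    'price': [101.234, 98.5],
})

# Toy MSRB trades (made-up values)
msrb = pd.DataFrame({
    'rtrs_control_number': ['A1', 'A2', 'A3'],
    'cusip': ['123456AB7', '123456AB7', '987654CD3'],
    'trade_datetime': pd.to_datetime(['2024-10-01 10:15:10', '2024-10-01 10:15:10', '2024-10-01 11:30:59']),
    'trade_type': ['S', 'D', 'P'],
    'par_traded': [25000, 25000, 10000],
    'dollar_price': [101.233, 101.233, 98.5],
})

# Build join keys: timestamp truncated to the minute, price in integer tenths
client['minute'] = client['trade_datetime'].dt.floor('min')
client['price_key'] = (client['price'] * 10).round().astype(int)
msrb['minute'] = msrb['trade_datetime'].dt.floor('min')
msrb['price_key'] = (msrb['dollar_price'] * 10).round().astype(int)

# Keep only client trades that have a matching MSRB trade on all five keys
matched = client.merge(
    msrb,
    left_on=['cusip', 'minute', 'trade_type', 'quantity', 'price_key'],
    right_on=['cusip', 'minute', 'trade_type', 'par_traded', 'price_key'],
    how='inner',
    suffixes=('_client', '_msrb'),
)

print(matched[['client_trade_id', 'rtrs_control_number']])
```

In this toy data, trade `A2` shares the CUSIP, minute, quantity, and price of the first client trade but has a different trade type, so it is excluded from the match.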

Initially, when we did this for Edward Jones on trades from 2024-10 and before, they gave us the price corresponding to the MSRB's `dollar_price`. In February 2025, they asked us to price a CSV again, and after some investigation, I realized that they had only given us the evaluator's price (`MKT_PRC_AMT`) and not the trade price (`FILL_UNIT_PRC_AMT`). At this point we decided that it would be better to use the compliance module, since (a) the trades were more recent and stored in our redis, and (b) not having the MSRB `dollar_price` makes it significantly more difficult to find the matching trade.

I will include the notebook up to the point at which we decided to take a different approach. If we ever need to price trades earlier than, say, 2024-12, we might use this notebook in the future. I will describe the steps that should follow below in markdown.

In [None]:
import os
import numpy as np
import pandas as pd

from google.cloud import bigquery

In [None]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/Users/user/base/ficc/creds.json'
bq_client = bigquery.Client()

In [None]:
def sqltodf(sql, limit=''):
    if limit != '': 
        limit = f' ORDER BY RAND() LIMIT {limit}'
    bqr = bq_client.query(sql + limit).result()
    return bqr.to_dataframe()

In [None]:
path = '/Users/user/downloads/edward_jones_2025-02-10.xlsx'    # excel file provided by Edward Jones

In [None]:
df = pd.read_excel(path)
len(df)

In [None]:
df.head()

In [None]:
df['msrb_trade_type_for_joining'] = np.where(
    df['CLNT_TY_DESC'] == 'Dealer', 'D',
    np.where(
        (df['CLNT_TY_DESC'] == 'Client') & (df['SIDE'] == 'Buy'), 'S',    # Client buys from Edward Jones is a dealer sell in MSRB language
        np.where(
            (df['CLNT_TY_DESC'] == 'Client') & (df['SIDE'] == 'Sell'), 'P',    # Client sells to Edward Jones is a dealer purchase in MSRB language
            np.nan  
        )
    )
)
df.head(5)

In [None]:
# Convert to correct data types 
df['TRADE_TS'] = pd.to_datetime(df['TRADE_TS'], errors='coerce')
df['TRADE_TS'] = df['TRADE_TS'] + pd.Timedelta(hours=1)    # Edward Jones times are in CT, MSRB is ET
df['TRADE_TS'] = df['TRADE_TS'].dt.tz_localize(None)

df['CUSIP_NO'] = df['CUSIP_NO'].astype(str)
df['msrb_trade_type_for_joining'] = df['msrb_trade_type_for_joining'].astype(str)

print(df.dtypes)

In [None]:
# Define schema explicitly
schema = [
    bigquery.SchemaField('GRP_ORD_NO', 'INTEGER'),
    bigquery.SchemaField('FI_TRD_TRAN_ID', 'INTEGER'),
    bigquery.SchemaField('FILL_ID', 'INTEGER'),
    bigquery.SchemaField('PMP_ID', 'FLOAT'),
    bigquery.SchemaField('SIDE', 'STRING'),
    bigquery.SchemaField('CUSIP_NO', 'STRING'),
    bigquery.SchemaField('TRADE_TS', 'DATETIME'),  
    bigquery.SchemaField('FILL_UNIT_QTY', 'INTEGER'),
    bigquery.SchemaField('MKT_PRC_AMT', 'FLOAT'),
    bigquery.SchemaField('CLNT_TY_DESC', 'STRING'),
    bigquery.SchemaField('TRD_CAP_CD', 'STRING'),
    bigquery.SchemaField('msrb_trade_type_for_joining', 'STRING'),
]

In [None]:
table_id = 'jesse_tests.ej_2025-02-10_trades'

In [None]:
job_config = bigquery.LoadJobConfig(schema=schema,
                                    write_disposition='WRITE_TRUNCATE')    # Overwrites existing table
job = bq_client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()    # Wait for job to complete

print(f'Upload successful! {len(df)} rows added to {table_id}.')

In [None]:
oldest_trade_ts = df['TRADE_TS'].min()
print('Oldest TRADE_TS:', oldest_trade_ts)

Below are the steps that would be taken to create the trade history and reference data for the Edward Jones trades. 

Here is the query to match trades with MSRB data. Note that the price used for matching should be the trade price (`FILL_UNIT_PRC_AMT`), which wasn't included in the CSV, rather than the evaluator's price (`MKT_PRC_AMT`) that is used here.

In [None]:
query = '''SELECT
  b.rtrs_control_number,
  a.CUSIP_NO,
  b.cusip,
  a.TRADE_TS,
  b.trade_datetime,
  a.msrb_trade_type_for_joining,
  b.trade_type,
  a.FILL_UNIT_QTY,
  b.par_traded,
  a.MKT_PRC_AMT,
  b.dollar_price,
  b.msrb_valid_to_date,
  ROW_NUMBER() OVER (PARTITION BY b.rtrs_control_number ORDER BY a.TRADE_TS ASC) AS row_num,
  a.FI_TRD_TRAN_ID
FROM
  `jesse_tests.ej_2025-02-10_trades` AS a
LEFT JOIN
  `auxiliary_views.msrb_final` AS b
ON
  a.CUSIP_NO = b.cusip
  AND DATETIME_TRUNC(a.TRADE_TS, MINUTE) = DATETIME_TRUNC(b.trade_datetime, MINUTE)
  AND a.msrb_trade_type_for_joining = b.trade_type
  AND a.FILL_UNIT_QTY = b.par_traded
  AND ROUND(a.MKT_PRC_AMT, 1) = ROUND(b.dollar_price, 1)    -- MSRB and Edward Jones prices don't match exactly
WHERE
  b.msrb_valid_to_date > CURRENT_DATETIME()
  AND b.rtrs_control_number IS NOT NULL'''

Once every Edward Jones trade has at least one matching trade in the MSRB data, we keep (arbitrarily) only one `rtrs_control_number` per Edward Jones trade. This is needed because, when many trades happen simultaneously, more than one MSRB trade can share the same `dollar_price`, `trade_datetime`, and `par_traded`, so a single Edward Jones trade can match several `rtrs_control_number`s.
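
A minimal sketch of this selection step on toy data follows. It assumes that `FI_TRD_TRAN_ID` uniquely identifies an Edward Jones trade, which has not been verified here, and the values are made up. Sorting before dropping duplicates is a choice made for this sketch so that the otherwise arbitrary pick is reproducible.

```python
import pandas as pd

# Toy result of the matching query (made-up values):
# Edward Jones trade 101 matched two MSRB trades, trade 102 matched one
matches = pd.DataFrame({
    'FI_TRD_TRAN_ID': [101, 101, 102],
    'rtrs_control_number': ['B2', 'B1', 'C1'],
})

# Keep one rtrs_control_number per Edward Jones trade.
# The sort makes the arbitrary choice deterministic (smallest control number wins).
one_per_trade = (
    matches.sort_values(['FI_TRD_TRAN_ID', 'rtrs_control_number'])
           .drop_duplicates(subset='FI_TRD_TRAN_ID', keep='first')
           .reset_index(drop=True)
)

print(one_per_trade)
```

Note that this sketch does not prevent the same `rtrs_control_number` from being assigned to two different Edward Jones trades; whether that matters depends on how the results are used downstream.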

Once we have one `rtrs_control_number` per Edward Jones trade, we can then query `materialized_trade_history` to get the trade history and reference data for the compliance module. A final step is to convert the Edward Jones side and direction to the appropriate ones for compliance. 

Below, we process the DataFrame for use in the compliance model without attempting to associate the Edward Jones trades with MSRB trades.

In [None]:
df['trade_type'] = df['msrb_trade_type_for_joining']
df

In [None]:
df['compliance_side'] = np.where(
        (df['SIDE'] == 'Buy'), 'S',    # Client buys from Edward Jones is a dealer sell in MSRB language
        np.where(
            (df['SIDE'] == 'Sell'), 'P',    # Client sells to Edward Jones is a dealer purchase in MSRB language
            np.nan  
        )
    )
df.head()

In [None]:
df.rename(columns={'CUSIP_NO': 'cusip', 'FILL_UNIT_QTY':'quantity','TRADE_TS': 'trade_datetime', 'MKT_PRC_AMT': 'user_price'}, inplace=True)
df

In [None]:
df = df[['cusip', 'quantity', 'trade_type', 'user_price', 'trade_datetime', 'compliance_side', 'GRP_ORD_NO', 'FI_TRD_TRAN_ID', 'FILL_ID', 'PMP_ID', 'SIDE', 'CLNT_TY_DESC', 'TRD_CAP_CD']]
df

In [None]:
df.to_pickle('ej_2025-02-10.pkl')