# Trade
- Swing trading/long-term trading
    - Exposed to overnight risk (the previous day's close may not equal the next
    day's open if major events happen between market close and the next market
    open).
- Assume I already know which days to go long and which days to go short
- Conduct post-trade analysis
- Refine risk management techniques (comparing stock moves starting from 2023-12-22)
    - Boeing: Main character in the events
        - Stock -18.61%

    - Direct competitors
        - Airbus (EPA: AIR): Boeing's primary competitor in commercial aircraft manufacturing
            - Stock +5.93%
        - Lockheed Martin (LMT): More focused on defense but also competes in aerospace
    - Suppliers
        - General Electric (GE): Supplies engines for Boeing aircraft
            - Have presence in aviation, healthcare, power, renewable energy
            - Doesn't seem to be affected
            - Can also supply engines to other aircraft manufacturers (effect on
            stock price is complicated)
    - Customers
        - Alaska Airlines (ALK): Main airline involved
            - Stock -11.73%
        - United Airlines (UAL - NasdaqGS)
            - Stock -4.91%
        - Delta Air Lines (DAL)
            - Stock -11.73%
        - Southwest Airlines
- Trading timing (NYSE) vs news timing
    - The news was updated on January 18, 2024, at 4:36 AM GMT+8, which translates to January 17, 2024, at 3:36 PM Eastern Time (since GMT+8 is 13 hours ahead of Eastern Time). Since the NYSE closes at 4:00 PM ET, this news would have come out just before the market close.
    - Different stock exchanges may also operate on different schedules
- No separate training and validation phases: go straight to backtesting
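
The news-timing conversion in the notes above can be checked in code rather than by hand. This is a minimal sketch, assuming Python 3.9+ for `zoneinfo`; `Asia/Singapore` is only an assumed stand-in for a GMT+8 zone, and `in_regular_session` is a hypothetical helper that ignores market holidays and half days.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

# News timestamp from the notes, recorded in GMT+8.
# "Asia/Singapore" is an assumed stand-in for a GMT+8 zone.
news_local = datetime(2024, 1, 18, 4, 36, tzinfo=ZoneInfo("Asia/Singapore"))
news_et = news_local.astimezone(ZoneInfo("America/New_York"))

NYSE_OPEN = time(9, 30)
NYSE_CLOSE = time(16, 0)


def in_regular_session(ts_et):
    """True on a weekday between 09:30 and 16:00 Eastern (holidays are ignored)."""
    return ts_et.weekday() < 5 and NYSE_OPEN <= ts_et.time() < NYSE_CLOSE


# Minutes left until the 16:00 close on the same Eastern-time day
close_et = news_et.replace(hour=16, minute=0, second=0, microsecond=0)
minutes_to_close = (close_et - news_et).total_seconds() / 60

print(news_et)
print(in_regular_session(news_et), minutes_to_close)
```

Using named zones instead of a fixed 13-hour offset also keeps the conversion right when US daylight saving time changes the gap to 12 hours.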


# Set Up

In [1]:
import os
import ast
import requests
import logging

import yfinance as yf
import pandas as pd
import numpy as np
import finnhub
from dotenv import load_dotenv
from pathlib import Path    
import sys
import time

import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
import typing

sys.path.append('../') # Change the python path at runtime

# Self-created modules
from src.utils import path as path_yq
from src.backtesting import Backtest, Strategy




In [3]:
load_dotenv()
POLYGON_API_KEY = os.environ.get('POLYGON_API_KEY')

BT_START_DATE = '2023-11-01'
BT_START_STR = '20231101'
BT_END_DATE = '2024-01-31'
BT_END_STR = '20240131'

cur_dir = Path.cwd()
root_dir = path_yq.get_root_dir(cur_dir)

logging.basicConfig(filename=Path.joinpath(root_dir, 'logs', 'trading_system.log'),
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
                    level=logging.DEBUG)

stm_techs = ['stc', 'blob', 'sid', 'bert', 'finbert']
contents = ['cln_hdl', 'cln_smr', 'cln_news']
lemmas = ['', 'lemma']

# Fetch Tick Data

## Polygon

Polygon docs: https://polygon.io/docs/stocks/get_v2_aggs_ticker__stocksticker__range__multiplier___timespan___from___to

- FIXME: The bars include pre-market and after-hours timestamps
- The timestamp is in milliseconds, not seconds
- TODO: Assumes other stocks share the same timezone
- Similar to the data download code
- The data is incomplete (not every minute has a bar), e.g. nothing between 21:56 and 22:00:

24265	47739.0	217.6996	217.6800	217.7000	217.7000	217.6800	1705096560000	21	2024-01-12 21:56:00
24266	171.0	217.5423	217.5000	217.5000	217.5000	217.5000	1705096800000	5	2024-01-12 22:00:00
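
One possible way to handle the FIXME about pre-market bars is to convert the millisecond timestamps to US Eastern time and keep only regular-session bars. This is a sketch under assumptions: the helper names `add_eastern_time` and `regular_hours_only` are hypothetical, the timestamps are treated as UTC epoch milliseconds, and the middle sample row is made up for illustration.

```python
import pandas as pd


def add_eastern_time(df, ts_col="Timestamp"):
    """Return a copy indexed by the bar start time in US Eastern time."""
    out = df.copy()
    # Epoch milliseconds are interpreted as UTC before converting
    utc = pd.to_datetime(out[ts_col], unit="ms", utc=True)
    out.index = pd.DatetimeIndex(utc.dt.tz_convert("America/New_York"), name="Datetime_ET")
    return out


def regular_hours_only(df_et):
    """Keep bars that start between 09:30 and 15:59 Eastern (minute bars assumed)."""
    return df_et.between_time("09:30", "15:59")


bars = pd.DataFrame({
    # First and last timestamps appear in this notebook; the middle row is illustrative
    "Timestamp": [1704291540000, 1704294000000, 1705096560000],
    "Close": [247.70, 247.50, 217.70],
})

bars_et = add_eastern_time(bars)
regular = regular_hours_only(bars_et)
print(bars_et)
print(regular)
```

On this sample, the 14:19 UTC bar falls in the pre-market (09:19 Eastern) and the 21:56 UTC bar falls after hours (16:56 Eastern), so only the illustrative 10:00 Eastern bar survives the filter.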

In [None]:
url = f"https://api.polygon.io/v2/aggs/ticker/BA/range/1/minute/{BT_START_DATE}/{BT_END_DATE}?adjusted=true&sort=asc&limit=50000&apiKey={POLYGON_API_KEY}"

# Make the GET request
resp = requests.get(url)

In [None]:
# Check if the request was successful
if resp.status_code == 200:
    # Convert the 'results' list to a DataFrame
    df = pd.DataFrame(resp.json().get('results'))

    # Rename the columns to more descriptive names
    column_mapping = {
        "v": "Volume",
        "vw": "VWAP",
        "o": "Open",
        "c": "Close",
        "h": "High",
        "l": "Low",
        "t": "Timestamp",
        "n": "Transactions"
        # Add more mappings as necessary
    }

    df.rename(columns=column_mapping, inplace=True)

    # Optionally, convert the 'Timestamp' column from Unix milliseconds to a datetime format
    df['Datetime'] = pd.to_datetime(df['Timestamp'], unit='ms')

    # Display the updated DataFrame
    print(df)
else:
    # Handle errors (e.g., logging, raising an exception)
    print(f"Error fetching data: {resp.status_code}, {resp.text}")



In [None]:
# Boeing open high low close data
raw_path = Path.joinpath(root_dir, 'data', 'raw', f'BA_OHLC_{BT_START_STR}_{BT_END_STR}.csv')
df.to_csv(raw_path, index=False)

## Yahoo (Outdated)

In [None]:
# Define the ticker list
ticker_list = ['BA']

# Fetch the data
dl_data = yf.download(ticker_list, start=BT_START_DATE, end=BT_END_DATE) # Auto adjust is false

dl_data = pd.DataFrame(dl_data)
data = dl_data.drop(columns=['Close'])
data = data.rename(columns={'Adj Close': 'Close'})
# axis=0 sums down the rows, giving the number of missing values in each column
display(data.isna().sum(axis=0))
data.index = pd.to_datetime(data.index)
data


In [None]:
dates = pd.DataFrame(data.index.strftime('%Y-%m-%d'))
# dates.to_csv("trading_dates.csv", index=False)

In [None]:
# After performing sentiment
stm_path = root_dir.joinpath('data', 'proc', 'boeing_stm_20231101_to_20240131.csv')
news = pd.read_csv(stm_path, index_col=False)
news2 = news[['datetime2', 'news_pol_blob']].copy()  # Copy so later column assignments do not write to a view
news2

news2.plot()
# # data['Sentiment'] = np.random.random(len(data)) * 2 - 1
# display(len(data))
# sentiment = np.array([0, -1, -0.8, 0, 0, 0]) # Put -1 on 01-05 (Before the whole thing Boeing case appeared after market closed on 01-05 to prepare to trade for 01-08)
# data['Sentiment'] = sentiment
# display(data.tail(20))

In [None]:
# Ensure datetime2 in news2 is in pandas datetime format
news2['datetime2'] = pd.to_datetime(news2['datetime2'])

# Assuming data.index is already a DatetimeIndex, no need to convert it again
# Just ensure it's sorted
data.sort_index(inplace=True)

# Function to find the closest previous date in data for each date in news2
def find_closest_previous_date(target_date, date_index):
    previous_dates = date_index[date_index <= target_date]
    if not previous_dates.empty:
        return previous_dates.max()
    else:
        return pd.NaT  # Return Not-A-Time (NaT) if no previous date is found

# Apply the function to each date in news2['datetime2']
closest_dates = news2['datetime2'].apply(lambda x: find_closest_previous_date(x, data.index))

# Add this closest date information to news2
news2['closest_date'] = closest_dates
news2

In [None]:
# TODO: Need to think of how to combine the data (might have many neutral etc.)
# as_index=False keeps closest_date as a regular column
news3 = news2.groupby('closest_date', as_index=False)['news_pol_blob'].mean().reset_index(drop=True) 
news3

In [None]:
merged = pd.merge(data, news3, left_on='Date', right_on='closest_date', how='left')
merged

In [None]:
# Clean for 2 lines only
merged2 = merged.dropna().reset_index(drop=True)
merged2

# Merge data

In [20]:

def convert_data(row):
    """
    Parse a CSV cell back into a Python object (a function from sentiment.ipynb).

    Cells holding a Python literal, such as a stringified list, are evaluated
    with ast.literal_eval; anything else is returned unchanged.
    """
    try:
        # Lists, ints, floats, etc. are all returned as their evaluated value
        return ast.literal_eval(row)
    except (ValueError, SyntaxError):
        # Not a valid Python literal (e.g. plain text or a non-string value)
        return row
    except Exception as e:
        print(f'Exception: {e}')
        return row

score_path = root_dir.joinpath('data', 'proc', f'BA_score_{BT_START_STR}_{BT_END_STR}.csv') 
df9 = pd.read_csv(score_path, index_col=False)

# Apply the conversion function to each specified column
for col in df9.columns:
    df9[col] = df9[col].apply(convert_data)
df9['datetime2'] = pd.to_datetime(df9['datetime2'])

# print(df8.equals(df7))
# print(type(df8['datetime2'][0]))

In [21]:
# Fetch and sort tick data
# Boeing open high low close data
raw_path = Path.joinpath(root_dir, 'data', 'raw', f'BA_OHLC_{BT_START_STR}_{BT_END_STR}.csv')
tick = pd.read_csv(raw_path, index_col=False)
tick['Datetime'] = pd.to_datetime(tick['Datetime'])
tick = tick.sort_values(by='Datetime')

# Make sure the tick data is within the backtest date range
# Note: BT_END_DATE is parsed as midnight, so bars later on the end date are excluded
tick = tick[(tick['Datetime'] >= BT_START_DATE) & (tick['Datetime'] <= BT_END_DATE)]
tick

Unnamed: 0,Volume,VWAP,Open,Close,High,Low,Timestamp,Transactions,Datetime
0,991.0,186.6991,186.6200,186.8000,186.8000,186.6200,1698829200000,31,2023-11-01 09:00:00
1,410.0,186.8187,186.8200,186.8200,186.8200,186.8200,1698829560000,5,2023-11-01 09:06:00
2,1289.0,187.6589,187.5900,187.7000,187.7000,187.5900,1698830040000,29,2023-11-01 09:14:00
3,535.0,188.1637,187.8400,187.9600,187.9600,187.8400,1698830100000,34,2023-11-01 09:15:00
4,442.0,188.8297,188.7900,188.7900,188.7900,188.7900,1698830160000,27,2023-11-01 09:16:00
...,...,...,...,...,...,...,...,...,...
31393,1009.0,199.7495,199.7500,199.7500,199.7500,199.7500,1706658600000,7,2024-01-30 23:50:00
31394,250.0,199.6644,199.6500,199.6500,199.6500,199.6500,1706658720000,4,2024-01-30 23:52:00
31395,315.0,199.7283,199.7369,199.7369,199.7369,199.7369,1706658960000,11,2024-01-30 23:56:00
31396,503.0,199.7896,199.7999,199.7999,199.7999,199.7999,1706659080000,6,2024-01-30 23:58:00


In [22]:
tick[tick['Datetime'] >= '2024-01-12 21:47:23'].head()

Unnamed: 0,Volume,VWAP,Open,Close,High,Low,Timestamp,Transactions,Datetime
24265,47739.0,217.6996,217.68,217.7,217.7,217.68,1705096560000,21,2024-01-12 21:56:00
24266,171.0,217.5423,217.5,217.5,217.5,217.5,1705096800000,5,2024-01-12 22:00:00
24267,202.0,217.6949,217.69,217.6999,217.6999,217.69,1705096920000,3,2024-01-12 22:02:00
24268,119.0,217.6883,217.6892,217.6892,217.6892,217.6892,1705097280000,5,2024-01-12 22:08:00
24269,100.0,217.69,217.69,217.69,217.69,217.69,1705097520000,1,2024-01-12 22:12:00


In [23]:
# Ensure both timestamp columns are pandas datetimes
df9['datetime2'] = pd.to_datetime(df9['datetime2'])
tick['Datetime'] = pd.to_datetime(tick['Datetime'])

# Make sure to sort first
df9 = df9.sort_values(by='datetime2')
tick = tick.sort_values(by='Datetime')

# Function to find the closest previous tick timestamp for each news timestamp in df9
def find_closest_prev_date(target_date, date_col):
    # News is matched to the latest bar at or before its timestamp;
    # the information can only be used at the next time point
    prev_dates = date_col[date_col <= target_date]
    if not prev_dates.empty:
        return prev_dates.max()
    else:
        # Can happen when the news is earlier than all the tick data
        print(f"WARNING. Previous date not found for {target_date}")
        print(date_col)
        return pd.NaT  # Return Not-A-Time (NaT) if no previous date is found

# Apply the function to each timestamp in df9['datetime2']
closest_dates = df9['datetime2'].apply(lambda x: find_closest_prev_date(x, tick['Datetime']))

# Add the closest date to df9 (already sorted by datetime2 above)
df9['closest_date'] = closest_dates
df9.reset_index(inplace=True, drop=True)
df9

0       2023-11-01 09:00:00
1       2023-11-01 09:06:00
2       2023-11-01 09:14:00
3       2023-11-01 09:15:00
4       2023-11-01 09:16:00
                ...        
31393   2024-01-30 23:50:00
31394   2024-01-30 23:52:00
31395   2024-01-30 23:56:00
31396   2024-01-30 23:58:00
31397   2024-01-31 00:00:00
Name: Datetime, Length: 31398, dtype: datetime64[ns]


Unnamed: 0,id,datetime2,cln_hdl,cln_smr,cln_news,cln_hdl_lemma,cln_smr_lemma,cln_news_lemma,cln_hdl_pol_blob,cln_smr_pol_blob,...,cln_hdl_lemma_pol_bert_score,cln_smr_lemma_pol_bert_score,cln_news_lemma_pol_bert_score,cln_hdl_pol_finbert_score,cln_smr_pol_finbert_score,cln_news_pol_finbert_score,cln_hdl_lemma_pol_finbert_score,cln_smr_lemma_pol_finbert_score,cln_news_lemma_pol_finbert_score,closest_date
0,123559928,2023-11-01 05:39:51,"[Ford, GM bumped to buy Boeing gets 2 upgrades...",[Goldman Sachs upgraded Simon Property Group (...,[Investing.com — Here is your Pro Recap of the...,"[Ford , GM bumped buy Boeing get 2 upgrade : 4...",[Goldman Sachs upgraded Simon Property Group (...,[Investing.com — Pro Recap biggest analyst pic...,[0.0],[0.0],...,0.727060,-0.689266,-0.340141,0.894530,0.549459,0.360264,0.641842,0.836147,0.400344,NaT
1,123544219,2023-11-01 11:39:06,[UPDATE 2-Spirit Aero cuts 737 fuselage delive...,[Spirit AeroSystems on Wednesday announced $10...,"[(Adjusts shares in paragraph 5, adds Airbus c...",[UPDATE 2-Spirit Aero cut 737 fuselage deliver...,[Spirit AeroSystems Wednesday announced $ 101 ...,"[( Adjusts share paragraph 5 , add Airbus comm...",[0.0],"[0.0, 0.0625, 0.0]",...,0.142053,-0.793133,-0.597804,-0.943793,-0.335737,0.295100,-0.900221,-0.361670,0.448931,2023-11-01 11:39:00
2,123566505,2023-11-01 13:30:29,"[Compared to Estimates, Spirit Aerosystems (SP...",[Although the revenue and EPS for Spirit Aeros...,"[For the quarter ended September 2023, Spirit ...","[Compared Estimates , Spirit Aerosystems ( SPR...",[Although revenue EPS Spirit Aerosystems ( SPR...,"[quarter ended September 2023 , Spirit Aerosys...",[0.0],[0.15],...,0.174489,0.000000,-0.215470,0.000000,0.000000,0.354270,0.000000,0.000000,0.530006,2023-11-01 13:30:00
3,123545059,2023-11-01 14:21:57,[Morning Brew: AMDs Q4 Guidance Weighs on Stoc...,[Advanced Micro Devices (NASDAQ:AMD) stock was...,[Advanced Micro Devices (NASDAQ:AMD) stock was...,[Morning Brew : AMDs Q4 Guidance Weighs Stock ...,[Advanced Micro Devices ( NASDAQ : AMD ) stock...,[Advanced Micro Devices ( NASDAQ : AMD ) stock...,[-0.3],"[0.1527777777777778, 0.22727272727272727, -0.06]",...,-0.744470,-0.263412,-0.343342,-0.958961,-0.322292,-0.101700,-0.852977,-0.267872,0.130774,2023-11-01 14:21:00
4,123567205,2023-11-01 22:24:31,[UPDATE 1-US Air Force blows up Minuteman III ...,[The U.S. Air Force said on Wednesday it had b...,[Nov 1 (Reuters) - The U.S. Air Force said on ...,[UPDATE 1-US Air Force blow Minuteman III test...,[U.S. Air Force said Wednesday blown Minuteman...,[Nov 1 ( Reuters ) - U.S. Air Force said Wedne...,[0.0],"[-0.4, -0.25, 0.0]",...,-0.733265,-0.253360,0.317687,-0.892588,-0.872526,-0.066805,0.000000,0.000000,-0.047870,2023-11-01 22:01:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
546,125415682,2024-01-30 21:10:56,[Boeing Seen Narrowing Q4 Loss Amid 737 Max Gr...,[Dow Jones giant Boeing reports Q4 results ear...,[Dow Jones giant Boeing reports Q4 results ear...,[Boeing Seen Narrowing Q4 Loss Amid 737 Max Gr...,[Dow Jones giant Boeing report Q4 result early...,[Dow Jones giant Boeing report Q4 result early...,[0.0],"[0.05, 0.1]",...,-0.737527,-0.239781,-0.239781,-0.965217,-0.952749,-0.952749,-0.965217,-0.944420,-0.944420,2024-01-30 21:09:00
547,125415680,2024-01-30 22:23:48,"[Hawaiian Airlines ekes out Q4 revenue beat, e...",[Hawaiian Holdings (HA) — the parent company o...,[Hawaiian Holdings (HA) — the parent company o...,"[Hawaiian Airlines ekes Q4 revenue beat , earn...",[Hawaiian Holdings ( HA ) — parent company Haw...,[Hawaiian Holdings ( HA ) — parent company Haw...,[0.0],"[-0.06666666666666667, -0.15555555555555559, 0...",...,-0.776986,-0.258785,-0.224460,-0.381478,-0.922368,-0.055458,-0.667220,-0.905073,0.281039,2024-01-30 22:21:00
548,125415679,2024-01-30 22:39:00,"[Boeings Earnings Are Coming., Investors Are W...",[The list of points to watch when the jet make...,[The number of watch items in Boeings fourth-q...,"[Boeings Earnings Coming ., Investors Watching...",[list point watch jet maker report latest resu...,[number watch item Boeings fourth-quarter repo...,"[0.0, 0.0]",[0.225],...,0.698278,0.720234,0.151111,0.856519,0.000000,0.000000,0.730035,0.000000,0.000000,2024-01-30 22:38:00
549,125417521,2024-01-30 23:03:43,[Boeing was once known for safety and engineer...,[Part of the fuselage blowing off shortly afte...,[Part of the fuselage blowing off shortly afte...,"[Boeing known safety engineering ., critic say...","[Part fuselage blowing shortly takeoff , leavi...","[Part fuselage blowing shortly takeoff , leavi...","[0.0, 0.0]","[0.0, -0.1587179487179487, 0.0, -0.125]",...,-0.262539,-0.442463,-0.278902,0.000000,-0.928406,-0.508453,0.000000,-0.912366,-0.404513,2024-01-30 23:03:00
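
The row-by-row lookup above scans the tick timestamps once per news item. A possible alternative is `pd.merge_asof` with `direction="backward"`, which does the same "latest bar at or before the news time" match in one pass. This is a sketch on a few timestamps taken from the tables above, not the notebook's own method; the frame names `news_demo`, `tick_demo`, and `matched` are made up for illustration.

```python
import pandas as pd

# A few news timestamps from the table above
news_demo = pd.DataFrame({
    "id": [123559928, 123544219, 123567205],
    "datetime2": pd.to_datetime([
        "2023-11-01 05:39:51",
        "2023-11-01 11:39:06",
        "2023-11-01 22:24:31",
    ]),
})

# A handful of bar start times (the real tick data has one row per traded minute)
tick_demo = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2023-11-01 09:00:00",
        "2023-11-01 11:39:00",
        "2023-11-01 22:01:00",
        "2023-11-01 22:46:00",
    ]),
})

# merge_asof needs both sides sorted on their keys
bars = (tick_demo[["Datetime"]]
        .rename(columns={"Datetime": "closest_date"})
        .sort_values("closest_date"))

matched = pd.merge_asof(
    news_demo.sort_values("datetime2"),
    bars,
    left_on="datetime2",
    right_on="closest_date",
    direction="backward",  # latest bar at or before the news timestamp
)
print(matched)
```

News published before the first bar gets `NaT`, matching the behaviour of the loop-based version without printing a warning.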


In [24]:
# Drop the rows whose closest previous date is NaT
def drop_na(df):
    # Drop every row that contains a missing value
    print(f"Before dropping na: {df.isna().sum().sum()}")
    df1 = df.dropna().reset_index(drop=True)
    print(f"After dropping na: {df1.isna().sum().sum()}")
    return df1

drop_na(df9).head()

Before dropping na: 1
After dropping na: 0


Unnamed: 0,id,datetime2,cln_hdl,cln_smr,cln_news,cln_hdl_lemma,cln_smr_lemma,cln_news_lemma,cln_hdl_pol_blob,cln_smr_pol_blob,...,cln_hdl_lemma_pol_bert_score,cln_smr_lemma_pol_bert_score,cln_news_lemma_pol_bert_score,cln_hdl_pol_finbert_score,cln_smr_pol_finbert_score,cln_news_pol_finbert_score,cln_hdl_lemma_pol_finbert_score,cln_smr_lemma_pol_finbert_score,cln_news_lemma_pol_finbert_score,closest_date
0,123544219,2023-11-01 11:39:06,[UPDATE 2-Spirit Aero cuts 737 fuselage delive...,[Spirit AeroSystems on Wednesday announced $10...,"[(Adjusts shares in paragraph 5, adds Airbus c...",[UPDATE 2-Spirit Aero cut 737 fuselage deliver...,[Spirit AeroSystems Wednesday announced $ 101 ...,"[( Adjusts share paragraph 5 , add Airbus comm...",[0.0],"[0.0, 0.0625, 0.0]",...,0.142053,-0.793133,-0.597804,-0.943793,-0.335737,0.2951,-0.900221,-0.36167,0.448931,2023-11-01 11:39:00
1,123566505,2023-11-01 13:30:29,"[Compared to Estimates, Spirit Aerosystems (SP...",[Although the revenue and EPS for Spirit Aeros...,"[For the quarter ended September 2023, Spirit ...","[Compared Estimates , Spirit Aerosystems ( SPR...",[Although revenue EPS Spirit Aerosystems ( SPR...,"[quarter ended September 2023 , Spirit Aerosys...",[0.0],[0.15],...,0.174489,0.0,-0.21547,0.0,0.0,0.35427,0.0,0.0,0.530006,2023-11-01 13:30:00
2,123545059,2023-11-01 14:21:57,[Morning Brew: AMDs Q4 Guidance Weighs on Stoc...,[Advanced Micro Devices (NASDAQ:AMD) stock was...,[Advanced Micro Devices (NASDAQ:AMD) stock was...,[Morning Brew : AMDs Q4 Guidance Weighs Stock ...,[Advanced Micro Devices ( NASDAQ : AMD ) stock...,[Advanced Micro Devices ( NASDAQ : AMD ) stock...,[-0.3],"[0.1527777777777778, 0.22727272727272727, -0.06]",...,-0.74447,-0.263412,-0.343342,-0.958961,-0.322292,-0.1017,-0.852977,-0.267872,0.130774,2023-11-01 14:21:00
3,123567205,2023-11-01 22:24:31,[UPDATE 1-US Air Force blows up Minuteman III ...,[The U.S. Air Force said on Wednesday it had b...,[Nov 1 (Reuters) - The U.S. Air Force said on ...,[UPDATE 1-US Air Force blow Minuteman III test...,[U.S. Air Force said Wednesday blown Minuteman...,[Nov 1 ( Reuters ) - U.S. Air Force said Wedne...,[0.0],"[-0.4, -0.25, 0.0]",...,-0.733265,-0.25336,0.317687,-0.892588,-0.872526,-0.066805,0.0,0.0,-0.04787,2023-11-01 22:01:00
4,123567203,2023-11-01 22:48:19,[Boeing says cyber incident hit parts business...,"[WASHINGTON (Reuters) -Boeing, one of the worl...","[WASHINGTON (Reuters) -Boeing, one of the worl...",[Boeing say cyber incident hit part business r...,"[WASHINGTON ( Reuters ) -Boeing , one world la...","[WASHINGTON ( Reuters ) -Boeing , one world la...",[0.0],"[0.0, 0.21666666666666667, -0.4]",...,-0.89158,-0.816141,-0.514869,-0.933888,-0.852436,-0.545566,-0.918345,-0.739639,-0.677498,2023-11-01 22:46:00


Check whether there are 30 columns of scores

In [30]:
df9.columns

Index(['id', 'datetime2', 'cln_hdl', 'cln_smr', 'cln_news', 'cln_hdl_lemma',
       'cln_smr_lemma', 'cln_news_lemma', 'cln_hdl_pol_blob',
       'cln_smr_pol_blob', 'cln_news_pol_blob', 'cln_hdl_lemma_pol_blob',
       'cln_smr_lemma_pol_blob', 'cln_news_lemma_pol_blob', 'cln_hdl_pol_sid',
       'cln_smr_pol_sid', 'cln_news_pol_sid', 'cln_hdl_lemma_pol_sid',
       'cln_smr_lemma_pol_sid', 'cln_news_lemma_pol_sid', 'cln_hdl_pol_bert',
       'cln_smr_pol_bert', 'cln_news_pol_bert', 'cln_hdl_lemma_pol_bert',
       'cln_smr_lemma_pol_bert', 'cln_news_lemma_pol_bert',
       'cln_hdl_pol_finbert', 'cln_smr_pol_finbert', 'cln_news_pol_finbert',
       'cln_hdl_lemma_pol_finbert', 'cln_smr_lemma_pol_finbert',
       'cln_news_lemma_pol_finbert', 'cln_hdl_pol_stc', 'cln_smr_pol_stc',
       'cln_news_pol_stc', 'cln_hdl_lemma_pol_stc', 'cln_smr_lemma_pol_stc',
       'cln_news_lemma_pol_stc', 'cln_hdl_pol_stc_score',
       'cln_smr_pol_stc_score', 'cln_news_pol_stc_score',
       'cln_hdl

## Merge Scores between Trading Periods

In [11]:
df_list = []

for col_name in col_list:
    tmp = df9.groupby('closest_date', as_index=False)[col_name].mean().reset_index(drop=True) 
    df_list.append(tmp)
    # display(tmp)
# print(df_list)

# # Assumes df_list has at least two elements
# merged = df_list[0]
# for i in range(1, len(df_list)):
#     merged = pd.merge(left=merged, right=df_list[i], on='closest_date', how='inner')
# merged

from functools import reduce
# A simpler implementation
merged = reduce(lambda left, right: pd.merge(left, right, on='closest_date', how='inner'), df_list)
merged

Unnamed: 0,closest_date,cln_hdl_pol_stc_score,cln_smr_pol_stc_score,cln_news_pol_stc_score,cln_hdl_lemma_pol_stc_score,cln_smr_lemma_pol_stc_score,cln_news_lemma_pol_stc_score,cln_hdl_pol_blob_score,cln_smr_pol_blob_score,cln_news_pol_blob_score,...,cln_news_pol_bert_score,cln_hdl_lemma_pol_bert_score,cln_smr_lemma_pol_bert_score,cln_news_lemma_pol_bert_score,cln_hdl_pol_finbert_score,cln_smr_pol_finbert_score,cln_news_pol_finbert_score,cln_hdl_lemma_pol_finbert_score,cln_smr_lemma_pol_finbert_score,cln_news_lemma_pol_finbert_score
0,2023-11-01 11:39:00,0.830,0.176667,0.462500,0.830,0.716667,0.563750,0.0,0.062500,0.114889,...,-0.408932,0.142053,-0.793133,-0.597804,-0.943793,-0.335737,0.295100,-0.900221,-0.361670,0.448931
1,2023-11-01 13:30:00,0.430,0.610000,0.519259,0.430,0.610000,0.520000,0.0,0.150000,0.006184,...,-0.099371,0.174489,0.000000,-0.215470,0.000000,0.000000,0.354270,0.000000,0.000000,0.530006
2,2023-11-01 14:21:00,-0.330,0.653333,0.452609,-0.330,0.610000,0.445217,-0.3,0.106684,0.110974,...,-0.148375,-0.744470,-0.263412,-0.343342,-0.958961,-0.322292,-0.101700,-0.852977,-0.267872,0.130774
3,2023-11-01 22:01:00,0.600,0.603333,0.635714,0.600,0.620000,0.642857,0.0,-0.325000,-0.325000,...,0.462276,-0.733265,-0.253360,0.317687,-0.892588,-0.872526,-0.066805,0.000000,0.000000,-0.047870
4,2023-11-01 22:46:00,-0.720,-0.263333,-0.104737,-0.720,-0.263333,0.066842,0.0,-0.091667,-0.049907,...,-0.419641,-0.891580,-0.816141,-0.514869,-0.933888,-0.852436,-0.545566,-0.918345,-0.739639,-0.677498
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
483,2024-01-30 21:09:00,-0.750,0.445000,0.445000,-0.750,0.445000,0.445000,0.0,0.075000,0.075000,...,-0.387261,-0.737527,-0.239781,-0.239781,-0.965217,-0.952749,-0.952749,-0.965217,-0.944420,-0.944420
484,2024-01-30 22:21:00,0.540,0.432500,0.434211,0.540,0.455000,0.305789,0.0,0.031684,0.144500,...,-0.182459,-0.776986,-0.258785,-0.224460,-0.381478,-0.922368,-0.055458,-0.667220,-0.905073,0.281039
485,2024-01-30 22:38:00,0.525,-0.500000,-0.400000,0.525,0.640000,-0.400000,0.0,0.225000,-0.050000,...,0.000000,0.698278,0.720234,0.151111,0.856519,0.000000,0.000000,0.730035,0.000000,0.000000
486,2024-01-30 23:03:00,0.515,-0.192500,0.225789,0.515,-0.207500,0.181504,0.0,-0.141859,0.066913,...,-0.176781,-0.262539,-0.442463,-0.278902,0.000000,-0.928406,-0.508453,0.000000,-0.912366,-0.404513
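
The cell above relies on a `col_list` that is not defined in the cells shown here. A plausible reconstruction, assuming the 30 score columns follow the same naming pattern used later in the backtest loop, is sketched below. `build_score_columns` and the toy frame are hypothetical. The sketch also shows that, on toy data, a single `groupby` over all columns gives the same table as the `reduce`-based merge.

```python
from functools import reduce

import pandas as pd

stm_techs = ['stc', 'blob', 'sid', 'bert', 'finbert']
contents = ['cln_hdl', 'cln_smr', 'cln_news']
lemmas = ['', 'lemma']


def build_score_columns(stm_techs, lemmas, contents):
    """Assumed naming: {content}[_lemma]_pol_{technique}_score."""
    cols = []
    for stm_tech in stm_techs:
        for lemma in lemmas:
            for content in contents:
                prefix = f"{content}_{lemma}" if lemma else content
                cols.append(f"{prefix}_pol_{stm_tech}_score")
    return cols


col_list = build_score_columns(stm_techs, lemmas, contents)
print(len(col_list))

# Toy data: two news items share the first bar, one maps to the second bar
toy = pd.DataFrame({
    "closest_date": pd.to_datetime(
        ["2023-11-01 11:39:00", "2023-11-01 11:39:00", "2023-11-01 13:30:00"]),
    "cln_hdl_pol_stc_score": [0.8, 0.4, -0.3],
    "cln_smr_pol_stc_score": [0.1, 0.3, 0.6],
})
toy_cols = ["cln_hdl_pol_stc_score", "cln_smr_pol_stc_score"]

# Per-column groupby merged with reduce, as in the cell above
per_col = [toy.groupby("closest_date", as_index=False)[c].mean() for c in toy_cols]
via_reduce = reduce(
    lambda left, right: pd.merge(left, right, on="closest_date", how="inner"), per_col)

# One groupby over all score columns at once
via_groupby = toy.groupby("closest_date", as_index=False)[toy_cols].mean()
print(via_groupby)
```

The single `groupby` avoids 29 merges, which may matter once more tickers or score columns are added.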


## Merge Tick Data and Scores

In [42]:
merged2 = pd.merge(left=tick, right=merged, left_on='Datetime', right_on='closest_date', how='left')
merged2.reset_index(inplace=True, drop=True)

In [13]:
merge_path = root_dir.joinpath('data', 'proc', f'BA_merged_{BT_START_STR}_{BT_END_STR}.csv') # TODO: Change dates
merged2.to_csv(merge_path, index=False)

## Simple Post-Trade Analysis

In [None]:
merged2[merged2.index == 1873]

In [None]:
# Choose col_name to describe
merged2[col_name].describe()

In [None]:
# Post-trade analysis
merged2[merged2['Datetime'] >= pd.to_datetime('2024-01-03T14:17:00')]


# Backtesting
- Pros
    - Test single strategy
    - Have optimizer, graphs
- Cons
    - Cannot trade multiple assets (FIXME: not applicable to a portfolio)
    - Does not trade fractional shares
- Reference: https://kernc.github.io/backtesting.py/#example


- Other backtesting frameworks: backtrader, zipline (both support multi-asset trading)
- Backtrader works with Pandas DataFrames, CSV, and real-time data feeds from Interactive Brokers, Oanda, and Visual Chart. 
- 2% rule: https://www.investopedia.com/terms/t/two-percent-rule.asp
- Try to have less than 10% of drawdown: https://www.quora.com/How-do-I-use-the-never-risk-more-than-2-rule-in-Forex-trading


Hypothesis
- Takes in a df from start to end, with all the tick data (including rows where the sentiment is NA)
- Enters the trade at row 549, the bar after the score arrives at row 548 (my information should backfill)
548	308.0	247.7006	247.7000	247.7000	247.7000	247.7000	1704291540000	8	2024-01-03 14:19:00	2024-01-03 14:19:00	0.156808
549	264.0	247.6105	247.6000	247.6000	247.6000	247.6000	1704291780000	9	2024-01-03 14:23:00	NaT	NaN
550	1157.0	247.5724	247.6000	247.5031	247.6001	247.5031	1704291840000	49	2024-01-03 14:24:00	NaT	NaN
- I can compare results with and without lemmatization, holding the other variables constant
- I can compare results across different content types, holding the others constant



### Strategy

In [5]:
class SimpleStmStrat(Strategy):
    """
    Use a proportional amount of cash to trade with the sentiment score indicator.
    """
    # Strategy class should define parameters as class variables before they can be optimized or run with.
    col = None

    # Add the parameters in init
    def __init__(self, broker, data, **kwargs):
        super().__init__(broker, data, **kwargs)  # Make sure the parent class can handle **kwargs appropriately
        self.col = kwargs.get('col', self.col)

    # Initialize additional indicators here if needed
    def init(self):
        # self.trade_size = 40 # This times the next open price cannot exceed equity
        self.sl_pct = 0.01
        self.tp_pct = 0.02
        self.risk_per_trade = 0.5 # Maximum of the portfolio on one trade

    def next(self):
        cur_stm = self.data[self.col][-1]
        # print(self.data['closest_date'][-1])
        cur_price = self.data['Close'][-1]

        # print(f"-----{self.data['Datetime'][-1]}-----")
        trade_size = (0.5 * (abs(cur_stm) ** 2) + 0.5) * self.risk_per_trade
        if (cur_stm > 0): # Many losses if I don't take
            self.buy(size=trade_size, sl=(1 - self.sl_pct) * cur_price, tp=(1 + self.tp_pct) * cur_price)
            # If size is a value between 0 and 1, it is interpreted as a fraction of current available liquidity (cash plus Position.pl minus used margin). A value greater than or equal to 1 indicates an absolute number of units.

        elif cur_stm < 0:
            self.sell(size=trade_size, sl=(1 + self.sl_pct) * cur_price, tp=(1 - self.tp_pct) * cur_price)
        elif (cur_stm == 0):
            pass
        # print(cur_stm)
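
The sizing and bracket logic in `next` can be checked in isolation. The sketch below copies the two formulas from the strategy into standalone functions; `trade_fraction` and `bracket_prices` are hypothetical names used only for illustration.

```python
import math


def trade_fraction(score, risk_per_trade=0.5):
    """Fraction of available liquidity to commit, as in SimpleStmStrat.next."""
    return (0.5 * abs(score) ** 2 + 0.5) * risk_per_trade


def bracket_prices(score, price, sl_pct=0.01, tp_pct=0.02):
    """Return (side, stop_loss, take_profit), or None when no order is placed."""
    if score > 0:
        return "long", (1 - sl_pct) * price, (1 + tp_pct) * price
    if score < 0:
        return "short", (1 + sl_pct) * price, (1 - tp_pct) * price
    # A zero score places no order; NaN fails both comparisons, so bars
    # without news also place no order
    return None


for s in (0.0, 0.5, -1.0):
    print(s, trade_fraction(s))

print(bracket_prices(0.3, 200.0))
print(bracket_prices(math.nan, 200.0))
```

The fraction ranges from 0.25 for a score near zero to 0.5 for a score of ±1, so even a weak signal commits a quarter of available liquidity. The 2% take-profit against a 1% stop gives a 2:1 reward-to-risk ratio per trade.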


In [6]:

merge_path = root_dir.joinpath('data', 'proc', f'BA_merged_{BT_START_STR}_{BT_END_STR}.csv') 
merged2 = pd.read_csv(merge_path, index_col=False)
# convert_data(merged2)


### Run for All

In [38]:
tar_dir = root_dir.joinpath('outputs', 'trade-plots')
tar_dir.mkdir(parents=True, exist_ok=True)
df_list = []

for stm_tech in stm_techs:
    for lemma in lemmas:
        for content in contents:
            results_dict = {
                'stm_tech': stm_tech,
                'lemma': 'No',
                'content': content
            }
            if lemma:
                col_name = f'{content}_{lemma}_pol_{stm_tech}_score'
                filename = str(tar_dir.joinpath(f"{content}_lemma_{stm_tech}.html"))
                results_dict['lemma'] = 'Yes'
            else:
                col_name = f'{content}_pol_{stm_tech}_score'
                filename = str(tar_dir.joinpath(f"{content}_no_lemma_{stm_tech}.html"))

            # Running the backtest
            bt = Backtest(
                data=merged2,
                strategy=SimpleStmStrat,
                cash=10000,
                margin=1,
                commission=.0,
                trade_on_close=False,
                hedging=True
            )
            
            results, returns = bt.run(col=col_name)

            # display(results)
            # print(type(returns))
            # display(returns)

            bt.plot(filename=filename,
                    results=results,
                    plot_return=True,
                    open_browser=False)

            results_dict['returns'] = list(returns)
            results_dict.update(results)
            df_list.append(results_dict)
            # These are the main results that we need
            # print(results.get('Return [%]'), results.get('Max. Drawdown [%]'), results.get('# Trades'), results.get('Win Rate [%]'))




In [43]:
# Append each dictionary as rows into a new df
rdf = pd.DataFrame(df_list)
rdf.head()

Unnamed: 0,stm_tech,lemma,content,returns,Start,End,Duration,Exposure Time [%],Equity Final [$],Equity Peak [$],...,Avg. Trade [%],Max. Trade Duration,Avg. Trade Duration,Profit Factor,Expectancy [%],SQN,Kelly Criterion,_strategy,_equity_curve,_trades
0,stc,No,cln_hdl,"[-0.009718352307425127, -0.010218127911901798,...",0.0,31397.0,31397.0,85.970444,10137.533592,11350.94807,...,0.061676,1824.0,309.543943,1.112167,0.072979,0.221724,0.008579,SimpleStmStrat(col=cln_hdl_pol_stc_score),Equity DrawdownPct DrawdownDura...,Size EntryBar ExitBar EntryPrice Exi...
1,stc,No,cln_smr,"[-0.00981769394680776, -0.009718352307425127, ...",0.0,31397.0,31397.0,87.212561,10701.017589,11698.663928,...,0.079227,1824.0,315.628235,1.140963,0.090535,1.215605,0.053313,SimpleStmStrat(col=cln_smr_pol_stc_score),Equity DrawdownPct DrawdownDura...,Size EntryBar ExitBar EntryPrice Exi...
2,stc,No,cln_news,"[-0.00981769394680776, -0.009718352307425127, ...",0.0,31397.0,31397.0,85.196509,11044.637426,12170.47802,...,0.158817,1824.0,319.226328,1.276865,0.170442,1.826066,0.082744,SimpleStmStrat(col=cln_news_pol_stc_score),Equity DrawdownPct DrawdownDura...,Size EntryBar ExitBar EntryPrice Exit...
3,stc,Yes,cln_hdl,"[-0.009718352307425127, -0.010218127911901798,...",0.0,31397.0,31397.0,86.040512,10327.272096,11410.574287,...,0.071754,1824.0,316.966667,1.128335,0.083115,0.5173,0.023721,SimpleStmStrat(col=cln_hdl_lemma_pol_stc_score),Equity DrawdownPct DrawdownDura...,Size EntryBar ExitBar EntryPrice Exi...
4,stc,Yes,cln_smr,"[-0.00981769394680776, -0.009718352307425127, ...",0.0,31397.0,31397.0,85.53411,10618.554027,11717.093466,...,0.065961,1824.0,309.057279,1.119167,0.077273,1.057888,0.046249,SimpleStmStrat(col=cln_smr_lemma_pol_stc_score),Equity DrawdownPct DrawdownDura...,Size EntryBar ExitBar EntryPrice Exi...


In [44]:
count = 0
for stm_tech in stm_techs:
    for lemma in lemmas:
        for content in contents:
            if lemma:
                col_name = f'{content}_{lemma}_pol_{stm_tech}_score'
            else:
                col_name = f'{content}_pol_{stm_tech}_score'
            
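            # NOTE: this selects the whole 'Return [%]' column, so every pass prints
            # the same Series for all rows of rdf, not the single return for col_name.
            overall_return = rdf['Return [%]']
            print(f"{col_name}: {overall_return}")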
            # if count > 2: break
            # count += 1
            # returns = results_dict.get(col_name).get('returns')
            # normality_test(np.log(returns))

cln_hdl_pol_stc_score: 0      1.375336
1      7.010176
2     10.446374
3      3.272721
4      6.185540
5      9.085117
6      6.237475
7      4.316190
8      5.650002
9      7.683272
10     4.052808
11     2.929241
12     8.262432
13     6.212244
14     7.643808
15     4.470092
16     3.779027
17     7.958096
18    12.478882
19     5.097958
20     6.099515
21    13.202210
22     5.014634
23     4.636732
24     5.430373
25    12.039484
26    13.547596
27     5.904786
28    12.127242
29    13.858808
Name: Return [%], dtype: float64
cln_smr_pol_stc_score: 0      1.375336
1      7.010176
2     10.446374
3      3.272721
4      6.185540
5      9.085117
6      6.237475
7      4.316190
8      5.650002
9      7.683272
10     4.052808
11     2.929241
12     8.262432
13     6.212244
14     7.643808
15     4.470092
16     3.779027
17     7.958096
18    12.478882
19     5.097958
20     6.099515
21    13.202210
22     5.014634
23     4.636732
24     5.430373
25    12.039484
26    13.547596
27     5.
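
The loop in In [44] prints the full `Return [%]` column under every label. A per-configuration lookup could be sketched as below. `rdf_demo`, `label` and `returns_by_label` are hypothetical names, and the three rows are copied from the `rdf.head()` and `Return [%]` outputs above. The label format follows the `_strategy` column (`cln_hdl_lemma_pol_stc_score` for `lemma == 'Yes'`).

```python
import pandas as pd

# Hypothetical stand-in for rdf, holding only the columns needed here.
# Values are rows 0, 1 and 3 of the outputs shown earlier in the notebook.
rdf_demo = pd.DataFrame({
    "stm_tech": ["stc", "stc", "stc"],
    "lemma": ["No", "No", "Yes"],
    "content": ["cln_hdl", "cln_smr", "cln_hdl"],
    "Return [%]": [1.375336, 7.010176, 3.272721],
})


def label(row) -> str:
    """Rebuild the score-column name for one result row."""
    middle = "_lemma" if row["lemma"] == "Yes" else ""
    return f"{row['content']}{middle}_pol_{row['stm_tech']}_score"


# One label per row, then a Series mapping label -> overall return
labels = rdf_demo.apply(label, axis=1)
returns_by_label = pd.Series(rdf_demo["Return [%]"].to_numpy(), index=labels)

for name, value in returns_by_label.items():
    print(f"{name}: {value:.6f}")
```

Indexing by label this way prints one number per configuration instead of the 30-row Series.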

## Analysis

### Cleaning

In [68]:
def remove_outliers(data: pd.Series, factor: float = 1.5) -> pd.Series:
    """
    Removes outliers from a Pandas Series based on the IQR method.

    Parameters:
    - data: Pandas Series from which to remove outliers.
    - factor: Multiplier for the IQR to define the cut-off beyond which data is considered an outlier.

    Returns:
    - Pandas Series with outliers removed.
    """
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1

    # Define outliers using the factor parameter
    lower_bound = q1 - factor * iqr
    upper_bound = q3 + factor * iqr

    # Filter out the outliers
    filtered_data = data[(data >= lower_bound) & (data <= upper_bound)]
    return filtered_data
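
As a quick check of the IQR rule, the sketch below applies the same bounds to a small made-up series (the values are illustrative and not taken from the backtest). The function body is a condensed copy of `remove_outliers` so the example runs on its own.

```python
import pandas as pd


def remove_outliers(data: pd.Series, factor: float = 1.5) -> pd.Series:
    """Condensed copy of the IQR filter defined above."""
    q1, q3 = data.quantile(0.25), data.quantile(0.75)
    iqr = q3 - q1
    return data[(data >= q1 - factor * iqr) & (data <= q3 + factor * iqr)]


# Illustrative returns: five ordinary values and one extreme value
sample = pd.Series([0.010, 0.012, 0.009, 0.011, 0.013, 0.250])
cleaned = remove_outliers(sample)

print(cleaned.tolist())
print(cleaned.index.tolist())  # the original index labels are kept, not reset
```

Only the extreme value falls outside the bounds here. Because the filtered Series keeps its original index, call `.reset_index(drop=True)` if positional alignment matters later.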


In [71]:
rdf['returns'] = rdf['returns'].apply(lambda x: remove_outliers(pd.Series(x)).tolist())


### ANOVA

In [73]:
from scipy.stats import f_oneway

def anova(*returns):
    # Compute the F-statistic for each of n_bootstraps resamples and return them as a list
    def bootstrap_f_stat(data_groups, n_bootstraps=1000):
        bs_f_stat_list = []
        # data_groups holds one sequence of returns per group being compared

        for _ in range(n_bootstraps):
            # Resample every group with replacement, keeping its original size
            resampled_groups = [np.random.choice(group, size=len(group), replace=True) for group in data_groups]

            # F-statistic for this resample; the * unpacks the groups into separate arguments
            f_stat, _ = f_oneway(*resampled_groups)
            bs_f_stat_list.append(f_stat)

        return bs_f_stat_list

    # Calculate the observed F-statistic
    obs_f_stat, obs_p_val = f_oneway(*returns)

    # Bootstrap the F-statistic
    bs_f_stat_list = bootstrap_f_stat(data_groups=returns)

    # print(bs_f_stat_list[:10])

    # Proportion of bootstrap F-statistics greater than or equal to the observed one
    obs_f_stat_upper_quantile = np.mean(np.array(bs_f_stat_list) >= obs_f_stat)
    # Proportion of bootstrap F-statistics smaller than the observed one
    obs_f_stat_lower_quantile = np.mean(np.array(bs_f_stat_list) < obs_f_stat)

    alpha = 0.05
    tail_prob = alpha / 2

    # Two-tailed check: significant if the observed F sits in either alpha/2 tail
    if obs_f_stat_lower_quantile < tail_prob or obs_f_stat_upper_quantile < tail_prob:
        print("The difference between groups is statistically significant.")
    else:
        print("No significant difference between groups was found.")
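
Because `bootstrap_f_stat` resamples each group from itself, the resampled F-statistics scatter around the observed value instead of approximating what the statistic would look like if the groups shared one distribution. One common alternative is a permutation test, sketched here as an assumption and not as the method used in this notebook. It pools the groups, reshuffles the pooled values into groups of the original sizes, and compares the observed F only against the upper tail. `permutation_f_test`, `n_resamples` and `seed` are hypothetical names, and the demo data are synthetic.

```python
import numpy as np
from scipy.stats import f_oneway


def permutation_f_test(*groups, n_resamples=1000, seed=0):
    """One-sided permutation test for the one-way ANOVA F-statistic (sketch)."""
    rng = np.random.default_rng(seed)
    groups = [np.asarray(g, dtype=float) for g in groups]
    obs_f, _ = f_oneway(*groups)

    # Under the null hypothesis the group labels are exchangeable,
    # so pool everything and re-split into groups of the original sizes.
    pooled = np.concatenate(groups)
    split_points = np.cumsum([len(g) for g in groups])[:-1]

    null_f = np.empty(n_resamples)
    for i in range(n_resamples):
        shuffled = rng.permutation(pooled)
        null_f[i], _ = f_oneway(*np.split(shuffled, split_points))

    # Upper-tail p-value with the usual add-one correction
    p_value = (np.sum(null_f >= obs_f) + 1) / (n_resamples + 1)
    return obs_f, p_value


# Synthetic demo: two groups from the same distribution, one with a shifted mean
demo_rng = np.random.default_rng(42)
same_a = demo_rng.normal(0.0, 0.01, 200)
same_b = demo_rng.normal(0.0, 0.01, 200)
shifted = demo_rng.normal(0.01, 0.01, 200)

f_same, p_same = permutation_f_test(same_a, same_b)
f_shift, p_shift = permutation_f_test(same_a, shifted)
print(f"same distribution: F={f_same:.3f}, p={p_same:.4f}")
print(f"shifted mean:      F={f_shift:.3f}, p={p_shift:.4f}")
```

Only the upper tail is used because large F values are the ones that indicate a difference between group means.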


In [74]:
import itertools

# Compare every pair of the 30 configurations in rdf (435 pairs)
all_combo = list(itertools.combinations(range(len(rdf)), 2))
for a, b in all_combo:
    anova(rdf['returns'][a], rdf['returns'][b])

[3.5101081981412405, 0.3200566428559239, 0.2783380918978707, 0.3550333885915213, 2.4847951909813335, 0.30594354051253425, 1.6090640193431656, 0.16991746816173983, 0.5991602338611111, 1.4416950245935138]
No significant difference between groups was found.
[0.030233354806728098, 0.7792092533939587, 4.807032472746196, 2.4317230884167738, 2.989895308180462, 2.4016777684149258, 0.07914834000523495, 4.760078696549887, 1.842979768152733, 0.37033788226023645]
No significant difference between groups was found.


[0.2804864054385372, 1.0695542218685878, 0.02335636300454891, 0.019476454182934155, 0.1268056425713035, 0.13978241097820562, 0.8470545328681564, 0.01867019111722105, 0.2123456007222393, 0.23261596446304603]
No significant difference between groups was found.
[1.8547426319971627, 0.19377927960903155, 1.9253481862187365, 0.09657214076506797, 8.32586971802245, 2.649914245101273, 0.004559842844958121, 1.382469427529331, 0.10769820078749688, 0.5137168623085145]
No significant difference between groups was found.
[1.723620958486968, 0.21584204091097023, 9.463466715214473, 6.973425191725217, 1.940252694601663, 2.0949074472233957, 0.007222349069450906, 0.9870291813140641, 1.8146870775731587, 9.518342695486693e-06]
No significant difference between groups was found.
[1.7475023724916263, 0.37512181262578803, 0.20361717317378294, 3.096754024799986, 2.2980538497335505, 6.868360744333028, 1.3466946437954455, 3.7775576564134647, 9.392533954967202, 7.141633534790891]
No significant difference between

In [50]:
print(np.mean(np.array([1,5]) > 2))

0.5


# Archive

In [30]:


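import typing

import matplotlib.pyplot as plt


def normality_test(data: typing.List):
    # statsmodels' qqplot expects an array, so convert list input up front
    data3 = np.asarray(data)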
    # data2 = np.log(data)
    # pl2 = pl
    # q1 = data2.quantile(0.25)
    # q3 = data2.quantile(0.75)
    # iqr = q3 - q1

    # # Define outliers
    # lower_bound = q1 - 1.5 * iqr
    # upper_bound = q3 + 1.5 * iqr

    # data3 = data2[(data2 >= lower_bound) & (data2 <= upper_bound)]

    # Normality Test
    _, p_value_normality_group1 = stats.shapiro(data3)

    print(f"Normality Test P-Values: Group1={p_value_normality_group1}")

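    # Q-Q Plots for Visual Normality Check
    # Pass an explicit Axes so qqplot draws on this figure instead of opening its own
    fig, ax = plt.subplots(figsize=(5, 3))
    sm.qqplot(data3, line='45', ax=ax)
    ax.set_title('Group 1 Q-Q Plot')
    plt.show()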

    plt.figure(figsize=(5,3))
    plt.hist(data3, bins=50, alpha=0.75, color='blue')
    plt.title('Returns Distribution')
    plt.xlabel('Returns')
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.show()

In [None]:
from scipy.stats import mannwhitneyu

# Perform the Mann-Whitney U Test
stat, p_value = mannwhitneyu(cln_hdl_pol_stc_score_pl, cln_news_pol_blob_score_pl, alternative='two-sided')

print(f"Mann-Whitney U statistic: {stat}")
print(f"P-value: {p_value}")

In [None]:
import numpy as np
from scipy.stats import kstest

def bootstrap_returns(returns, n_bootstraps=100):
    """Generate bootstrap samples for returns and calculate mean for each sample."""
    bootstrap_means = np.array([
        np.mean(np.random.choice(returns, size=len(returns), replace=True))
        for _ in range(n_bootstraps)
    ])
    return bootstrap_means

def ks_test_with_theoretical_distribution(bootstrap_means):
    """Perform KS test comparing bootstrap means with a normal distribution."""
    # Assuming the theoretical normal distribution has the same mean and std as the bootstrap_means
    mean, std = np.mean(bootstrap_means), np.std(bootstrap_means)
    return kstest(bootstrap_means, 'norm', args=(mean, std))

def nested_ks_test_for_p_values(p_values):
    """Perform KS test to check if the given p-values are uniformly distributed."""
    return kstest(p_values, 'uniform')

# Mock data: returns for different sentiment analysis techniques
returns_data = {
    'Technique0': pl
    # 'Technique1': np.random.normal(0.05, 0.02, 1000),
    # 'Technique2': np.random.normal(0.04, 0.02, 1000),
    # Add more techniques as needed
}

n_bootstraps = 100
p_values_for_ks_tests = []

for technique, returns in returns_data.items():
    # Step 1: Bootstrap
    bootstrap_means = bootstrap_returns(returns, n_bootstraps)
    
    # Step 2: KS Test with Theoretical Distribution
    ks_stat, ks_p_value = ks_test_with_theoretical_distribution(bootstrap_means)
    print(f"KS test for {technique}: Stat={ks_stat}, P-Value={ks_p_value}")
    
    p_values_for_ks_tests.append(ks_p_value)

# Step 3: Nested-KS Test
nested_ks_stat, nested_ks_p_value = nested_ks_test_for_p_values(p_values_for_ks_tests)
print(f"Nested KS test: Stat={nested_ks_stat}, P-Value={nested_ks_p_value}")


In [76]:
rdf.to_latex(index=False, header=True)

'\\begin{tabular}{llllrrrrrrrrrrrrrrrrrrrrrrrrrrrrlll}\n\\toprule\nstm_tech & lemma & content & returns & Start & End & Duration & Exposure Time [%] & Equity Final [$] & Equity Peak [$] & Return [%] & Buy & Hold Return [%] & Return (Ann.) [%] & Volatility (Ann.) [%] & Sharpe Ratio & Sortino Ratio & Calmar Ratio & Max. Drawdown [%] & Avg. Drawdown [%] & Max. Drawdown Duration & Avg. Drawdown Duration & # Trades & Win Rate [%] & Best Trade [%] & Worst Trade [%] & Avg. Trade [%] & Max. Trade Duration & Avg. Trade Duration & Profit Factor & Expectancy [%] & SQN & Kelly Criterion & _strategy & _equity_curve & _trades \\\\\n\\midrule\nstc & No & cln_hdl & [-0.009718352307425127, -0.010218127911901798, -0.012548338726225294, 0.019278639902547523, -0.01005121007788401, 0.01354330708661422, 0.018923937124169177, -0.009526315789473827, 0.01957384583246302, -0.011919094016222043, -0.009543076923076876, -0.01071076923076919, -0.009949259392137866, -0.009878793927041829, -0.009986202446704229, -0.0

### Next


In [None]:
# TODO: Draw plots for overall, isolate factors, compare which factor is the most significant
# TODO: Think how to tabulate the data (30 rows and columns? Compare which two are significant)
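
One possible layout for the second TODO is a square table with one row and one column per configuration, holding the p-value of each pairwise comparison. The sketch below is an assumption about how that could look: it uses the plain `f_oneway` p-value, not the bootstrap procedure above, and `pairwise_p_values`, `demo` and the `cfg_*` labels are hypothetical names with synthetic data.

```python
import itertools

import numpy as np
import pandas as pd
from scipy.stats import f_oneway


def pairwise_p_values(groups: dict) -> pd.DataFrame:
    """Symmetric table of one-way ANOVA p-values for every pair of groups."""
    labels = list(groups)
    table = pd.DataFrame(np.nan, index=labels, columns=labels)
    for a, b in itertools.combinations(labels, 2):
        _, p = f_oneway(groups[a], groups[b])
        # Fill both halves so the table can be read row-wise or column-wise
        table.loc[a, b] = table.loc[b, a] = p
    return table


# Synthetic returns: cfg_a and cfg_b share a distribution, cfg_c has a higher mean
rng = np.random.default_rng(0)
demo = {
    "cfg_a": rng.normal(0.000, 0.01, 300),
    "cfg_b": rng.normal(0.000, 0.01, 300),
    "cfg_c": rng.normal(0.005, 0.01, 300),
}

p_table = pairwise_p_values(demo)
print(p_table.round(4))
print(p_table < 0.05)  # boolean view of which pairs differ at the 5% level
```

With the notebook's data, the input could be a dict mapping each `_strategy` label to its `returns` list, giving a 30 by 30 table. The diagonal stays NaN because a configuration is not compared with itself.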