# SPY ETF Price Data
To establish a baseline for assessing the performance of our insider transactions, we need to compare them against a representation of the US equities market. We have chosen the SPY ETF, which is designed to closely track the returns of the S&P 500. The S&P 500 is a market-capitalization-weighted index of 500 leading publicly traded US companies, and it is often used as a benchmark for 'large-cap' US equity performance. If we are able to outperform SPY, then we potentially have a good strategy for extracting alpha from the market.

# Section 1: Connecting to Google Colab and Google Drive

In [None]:
# Start by importing drive from google.colab
from google.colab import drive
import os

# Mount the drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
# Students Google Drive Path
toms_path = "/content/drive/MyDrive/Colab Notebooks/593 - Milestone I/593 - Insider Trading Milestone I Project"
kirts_path = None
ramis_path = None

# Navigate to the right working directory and confirm our current working drive
os.chdir(toms_path)
# os.chdir(kirts_path)
# os.chdir(ramis_path)
print(os.getcwd())

/content/drive/MyDrive/Colab Notebooks/593 - Milestone I/593 - Insider Trading Milestone I Project


# Section 2: Importing Libraries and Capturing Dependencies
This section holds all of the libraries that we will use for data import, manipulation, and analysis. We then capture the versions of all libraries so that our code is reproducible. Unfortunately, Google Colab ships with an outdated yfinance, so let's explicitly install the upgraded library first and restart the kernel.

In [None]:
# Let's explicitly update yfinance just in case
%pip install yfinance --upgrade --no-cache-dir



In [None]:
# Data import
import numpy as np
import pandas as pd
from datetime import date
import yfinance as yf

In [None]:
# print the dependencies in the notebook
%pip freeze

# create a .txt file that contains all versions
#%pip freeze > colab_requirements.txt

absl-py==1.4.0
accelerate==1.6.0
aiohappyeyeballs==2.6.1
aiohttp==3.11.15
aiosignal==1.3.2
alabaster==1.0.0
albucore==0.0.24
albumentations==2.0.6
ale-py==0.11.0
altair==5.5.0
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.9.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
array_record==0.7.2
arviz==0.21.0
astropy==7.0.2
astropy-iers-data==0.2025.5.12.0.38.29
astunparse==1.6.3
atpublic==5.1
attrs==25.3.0
audioread==3.0.1
autograd==1.8.0
babel==2.17.0
backcall==0.2.0
backports.tarfile==1.2.0
beautifulsoup4==4.13.4
betterproto==2.0.0b6
bigframes==2.4.0
bigquery-magics==0.9.0
bleach==6.2.0
blinker==1.9.0
blis==1.3.0
blobfile==3.0.0
blosc2==3.3.2
bokeh==3.7.3
Bottleneck==1.4.2
bqplot==0.12.44
branca==0.8.1
build==1.2.2.post1
CacheControl==0.14.3
cachetools==5.5.2
catalogue==2.0.10
certifi==2025.4.26
cffi==1.17.1
chardet==5.2.0
charset-normalizer==3.4.2
chex==0.1.89
clarabel==0.10.0
click==8.2.0
cloudpathlib==0.21.0
cloudpickle==3.1.1
cmake==3.31.6
cmdstanpy==1.2.5
colorcet
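
The full `%pip freeze` output lists every package in the Colab image, most of which this notebook never imports. As a lighter complement, here is a minimal sketch that records only the libraries we actually use. The helper name `pinned_versions` and the package list are illustrative choices of ours, not part of the original notebook; `importlib.metadata` comes from the Python standard library.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_versions(packages):
    """Return 'name==version' lines for the given package names.

    Packages that are not installed are reported as comment lines
    instead of raising an error.
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            lines.append(f"# {name} is not installed")
    return lines


# Hypothetical short list: only the libraries imported in this notebook.
for line in pinned_versions(["numpy", "pandas", "yfinance"]):
    print(line)
```

The printed lines use the same `name==version` format as `pip freeze`, so they could be pasted into a small requirements file alongside the full dump.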

# Section 3: Importing Large .CSV file for Data Aggregation
We have a relatively clean datafile at this point, so we should not need to do much data cleaning or preparation in order to add our benchmark data. First we will pull in our .CSV file and make sure it contains the information we want.

In [None]:
# Start by taking a quick look at the files in our directory so we can pull the right one
print(os.listdir())

['insider_transactions_readme[1].pdf', 'parse_form4.py', 'colab_requirements.txt', 'all_common_stock_purchases (6).csv', 'all_common_stock_purchases (6).gsheet', 'Mounting_Notebook_Importing_Form4.ipynb', 'Insider transaction data sets', 'Meeting Summaries', 'Insiders_multi_zip.ipynb', "Insider Trading: Do Corporate Insiders Know Something We Don't?.docx", 'Insider Trading Proposal.docx', 'all_common_stock_purchases 2006 to 4Q24.csv', 'sec_insider_zips', 'Insiders_zip_data_processing.ipynb', 'stock_purchases_by_insider.csv', 'download_sec_zips.ipynb', 'common_stock_purchases_with_price_data.csv', 'yahoo_finance_price_data.ipynb', 'stock_purchases_enhanced_with_company_info.csv', 'enhanced_common_stock_purchases_with_spy_data.csv', 'Market_Cap_Sector_Industry_Classification.ipynb', 'SPY_etf_benchmark_data.ipynb']


In [None]:
# Read in the .csv file
cs_df = pd.read_csv("stock_purchases_enhanced_with_company_info.csv")
print(f"Let's take a look at the size of our dataframe: {cs_df.shape}\n")
# print(cs_df.head())

# Let's take a look at the number of unique tickers in this file
tickers = list(cs_df["Ticker"].unique())
print(f"We have {len(tickers)} unique tickers\n")

# Let's take a look at the number of missing values in the file
missing_counts = cs_df.isna().sum()
print(missing_counts)

Let's take a look at the size of our dataframe: (122067, 44)

We have 3595 unique tickers

Insider Name                   0
Insider Title                  0
Insider Role                   0
Issuer                         0
Ticker                         0
CIK Code                       0
Period of Report               0
Transaction Date               0
Security                       0
Transaction Code               0
Ownership Type                 0
ACCESSION_NUMBER               0
shares                         0
price_per_share                0
shares_after                  44
total_capital                  0
average_price_per_share      539
price_-1month                  0
trend_-1month                  0
trend_transactiondate          0
price_1month                   4
trend_1month                   4
price_2month                 208
trend_2month                 208
price_3month                 828
trend_3month                 828
price_4month                1132
trend_4month      

This is looking like a really nice dataset. We will still have to deal with some fairly significant numbers of missing values; however, some of the larger counts appear in fields that won't be directly relevant to our current project. The first thing we need to do is prepare the dataframe for the aggregated SPY data we will be adding to it.

In [None]:
new_columns = [
    "spy_price_-1month",
    "spy_trend_-1month",
    "spy_price_transactiondate",
    "spy_trend_transactiondate",
    "spy_price_1month",
    "spy_trend_1month",
    "spy_price_2month",
    "spy_trend_2month",
    "spy_price_3month",
    "spy_trend_3month",
    "spy_price_4month",
    "spy_trend_4month",
    "spy_price_5month",
    "spy_trend_5month",
    "spy_price_6month",
    "spy_trend_6month",
]
for col in new_columns:
    cs_df[col] = pd.NA
print(cs_df.shape)
cs_df.head()

(122067, 60)


Unnamed: 0,Insider Name,Insider Title,Insider Role,Issuer,Ticker,CIK Code,Period of Report,Transaction Date,Security,Transaction Code,...,spy_price_2month,spy_trend_2month,spy_price_3month,spy_trend_3month,spy_price_4month,spy_trend_4month,spy_price_5month,spy_trend_5month,spy_price_6month,spy_trend_6month
0,AARON BARTH F,Secretary,Officer,FULL HOUSE RESORTS INC,FLL,891482,12-Aug-11,12-Aug-11,Common Stock,P,...,,,,,,,,,,
1,AARON BARTH F,Secretary,Officer,FULL HOUSE RESORTS INC,FLL,891482,17-Sep-08,17-Sep-08,Common Stock,P,...,,,,,,,,,,
2,AARON BARTH F,Secretary,Officer,FULL HOUSE RESORTS INC,FLL,891482,7-Aug-07,7-Aug-07,Common Stock,P,...,,,,,,,,,,
3,AARON HENRY L,Missing,Director,MEDALLION FINANCIAL CORP,MFIN,1000209,8-Aug-19,8-Aug-19,Common Stock,P,...,,,,,,,,,,
4,AARON SUSAN D,Missing,Director,HORIZON BANCORP /IN/,HBNC,706129,13-Dec-12,13-Dec-12,Common Stock,P,...,,,,,,,,,,
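
The sixteen column names above follow a regular pattern (a price and a trend column for each horizon), so they could also be generated instead of typed out by hand. This is just a sketch of that idea; the names `horizons` and `generated_columns` are ours and do not appear in the notebook.

```python
# Horizon labels in the same order as the hand-written list.
horizons = ["-1month", "transactiondate"] + [f"{n}month" for n in range(1, 7)]

# One price column and one trend column per horizon, price first.
generated_columns = [
    f"spy_{kind}_{horizon}" for horizon in horizons for kind in ("price", "trend")
]

print(len(generated_columns))
print(generated_columns[:4])
```

Generating the names keeps the list in sync if we later add or remove a horizon.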


# Section 4: Yahoo Finance API and Data Aggregation
The dataframe is now prepared, so we can make a single query to Yahoo Finance and get all of the market data for the SPY ETF. It shouldn't take long to download this data and apply a few simple operations to obtain the trend data. In my first iterations, I looped over all rows using .iterrows() to populate the dataframe; however, this was relatively slow. Looking at the pandas documentation, I found that we can use merge_asof, which works on whole columns in compiled code rather than one row at a time in Python, so it should be faster. Let's see if it is significantly faster. First we will prepare the SPY dataframe, and then we will show both methods and their timing.
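
Before applying it to the real data, here is a small toy example of how `merge_asof` behaves. The frames `prices` and `lookups` are made up for illustration. With `direction="backward"`, each lookup date is matched to the last price row on or before it, which is the same rule that `Series.asof` applies one date at a time.

```python
import pandas as pd

# Toy price series with a weekend gap (no rows for 2024-01-06 and 2024-01-07).
prices = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2024-01-04", "2024-01-05", "2024-01-08"]),
        "Close": [100.0, 101.0, 102.0],
    }
)

# Toy lookup dates, deliberately unsorted; one of them falls on the weekend.
lookups = pd.DataFrame(
    {"lookup_date": pd.to_datetime(["2024-01-08", "2024-01-06", "2024-01-05"])}
)

# merge_asof requires both frames to be sorted on their key columns.
merged = pd.merge_asof(
    lookups.sort_values("lookup_date"),
    prices.sort_values("Date"),
    left_on="lookup_date",
    right_on="Date",
    direction="backward",  # last price row on or before the lookup date
)
print(merged)

# The same lookup for a single date with Series.asof.
close = prices.set_index("Date")["Close"]
print(close.asof(pd.Timestamp("2024-01-06")))
```

The weekend date picks up Friday's close in both cases. Note that the merged result comes back sorted by the lookup key, so the original row order has to be restored separately if it matters.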

In [None]:
# Let's get the spy dataframe and calculate the momentum of the trends
ticker = "SPY"
print(f"\nProcessing ticker {ticker}...")
# Let's be sure to stay consistent with our ticker data calls
spy_data = yf.download(
    tickers=ticker,
    period="max",
    interval="1d",
    auto_adjust=True,
    actions=False,
    threads=False,
)
# Let's calculate the 28-trading-day moving average of the close
spy_data["28MA"] = spy_data["Close"].rolling(window=28).mean()
# Normalize it as the percent change from the previous day's MA for comparisons
spy_data["MA_diff"] = spy_data["28MA"].pct_change() * 100
# Let's get rid of the leading rows because they don't have an MA_diff
spy_data = spy_data.dropna().copy()
# Finally, let's capture the monthly trend of this moving average
spy_data["MA_trend"] = spy_data["MA_diff"].rolling(window=28).mean()
# Let's drop the missing data again
spy_data = spy_data.dropna().copy()
# Let's explicitly make sure the date is in the proper format
spy_data.index = pd.to_datetime(spy_data.index)
print(f"\nThe shape of our dataframe is {spy_data.shape}\n")
print(
    f"The first date is {min(spy_data.index)} and the last day is {max(spy_data.index)}"
)
spy_data.head()


Processing ticker SPY...


[*********************100%***********************]  1 of 1 completed


The shape of our dataframe is (8078, 8)

The first date is 1993-04-20 00:00:00 and the last day is 2025-05-20 00:00:00





Price,Close,High,Low,Open,Volume,28MA,MA_diff,MA_trend
Ticker,SPY,SPY,SPY,SPY,SPY,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
1993-04-20,24.900005,25.022321,24.742742,24.987374,279500,25.076701,-0.074921,0.043273
1993-04-21,24.882538,24.952433,24.812643,24.952433,67900,25.059761,-0.067555,0.036204
1993-04-22,24.568014,24.987383,24.568014,24.777698,97700,25.040904,-0.075247,0.031102
1993-04-23,24.46316,24.585475,24.428212,24.515581,106000,25.013955,-0.10762,0.024489
1993-04-26,24.270943,24.567995,24.201048,24.480627,62600,24.980141,-0.13518,0.018232


The SPY data dates back to 1993, so it clearly covers the timeframe we are looking at. Now we can set up our two different versions of combining the data to see which one is more efficient.

# Section 5: Merging the Data
Let's make sure that we are working on copies of the data so that we don't accidentally edit the original. We will do this before timing our functions.

In [None]:
temp_full1 = cs_df.copy()
temp_full2 = cs_df.copy()

## Method 1: Simple Looping Function  
We will be iterating over 120,000+ rows. This is relatively small compared to what we are capable of doing, so it shouldn't take too much time but we want to build this with the intention of scaling up. So, let's find the most efficient way possible.

In [None]:
%%timeit

# Set up our looping function.
for index, row in temp_full1.iterrows():
    # Find the original transaction date
    trans_date = pd.to_datetime(row["Transaction Date"])
    # Let's define all of the other dates we will look for in the spy_data
    date_premonth = trans_date - pd.DateOffset(months=1)
    date_onemonth = trans_date + pd.DateOffset(months=1)
    date_twomonth = trans_date + pd.DateOffset(months=2)
    date_threemonth = trans_date + pd.DateOffset(months=3)
    date_fourmonth = trans_date + pd.DateOffset(months=4)
    date_fivemonth = trans_date + pd.DateOffset(months=5)
    date_sixmonth = trans_date + pd.DateOffset(months=6)
    # Let's grab all of the price data from the spy_data. The initial data is double indexed so use [ticker] to get access to the data
    price_premonth = np.round(spy_data["Close"][ticker].asof(date_premonth), 2)
    price_transactiondate = np.round(spy_data["Close"][ticker].asof(trans_date), 2)
    price_onemonth = np.round(spy_data["Close"][ticker].asof(date_onemonth), 2)
    price_twomonth = np.round(spy_data["Close"][ticker].asof(date_twomonth), 2)
    price_threemonth = np.round(spy_data["Close"][ticker].asof(date_threemonth), 2)
    price_fourmonth = np.round(spy_data["Close"][ticker].asof(date_fourmonth), 2)
    price_fivemonth = np.round(spy_data["Close"][ticker].asof(date_fivemonth), 2)
    price_sixmonth = np.round(spy_data["Close"][ticker].asof(date_sixmonth), 2)
    # print(price_premonth,price_transactiondate,price_sixmonth)
    # Let's get the momentum of all the trends
    trend_premonth = np.round(spy_data["MA_trend"].asof(date_premonth), 4)
    trend_transactiondate = np.round(spy_data["MA_trend"].asof(trans_date), 4)
    trend_onemonth = np.round(spy_data["MA_trend"].asof(date_onemonth), 4)
    trend_twomonth = np.round(spy_data["MA_trend"].asof(date_twomonth), 4)
    trend_threemonth = np.round(spy_data["MA_trend"].asof(date_threemonth), 4)
    trend_fourmonth = np.round(spy_data["MA_trend"].asof(date_fourmonth), 4)
    trend_fivemonth = np.round(spy_data["MA_trend"].asof(date_fivemonth), 4)
    trend_sixmonth = np.round(spy_data["MA_trend"].asof(date_sixmonth), 4)
    # print(trend_premonth,trend_transactiondate,trend_sixmonth)

    # Get todays date
    today = pd.to_datetime(date.today())

    # Let's update the original dataframe
    temp_full1.at[index, "spy_price_-1month"] = price_premonth
    temp_full1.at[index, "spy_price_transactiondate"] = price_transactiondate

    if date_onemonth < today:
        temp_full1.at[index, "spy_price_1month"] = price_onemonth
    if date_twomonth < today:
        temp_full1.at[index, "spy_price_2month"] = price_twomonth
    if date_threemonth < today:
        temp_full1.at[index, "spy_price_3month"] = price_threemonth
    if date_fourmonth < today:
        temp_full1.at[index, "spy_price_4month"] = price_fourmonth
    if date_fivemonth < today:
        temp_full1.at[index, "spy_price_5month"] = price_fivemonth
    if date_sixmonth < today:
        temp_full1.at[index, "spy_price_6month"] = price_sixmonth

    # Lets update the trend data
    temp_full1.at[index, "spy_trend_-1month"] = trend_premonth
    temp_full1.at[index, "spy_trend_transactiondate"] = trend_transactiondate
    if date_onemonth < today:
        temp_full1.at[index, "spy_trend_1month"] = trend_onemonth
    if date_twomonth < today:
        temp_full1.at[index, "spy_trend_2month"] = trend_twomonth
    if date_threemonth < today:
        temp_full1.at[index, "spy_trend_3month"] = trend_threemonth
    if date_fourmonth < today:
        temp_full1.at[index, "spy_trend_4month"] = trend_fourmonth
    if date_fivemonth < today:
        temp_full1.at[index, "spy_trend_5month"] = trend_fivemonth
    if date_sixmonth < today:
        temp_full1.at[index, "spy_trend_6month"] = trend_sixmonth

In [None]:
# Let's take a look at the number of missing values in the file
missing_counts = temp_full1.isna().sum()
print(missing_counts)

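# Let's take a look at the min and max dates in here.
# Parse the strings first so the comparison is chronological, not alphabetical
txn_dates = pd.to_datetime(temp_full1["Transaction Date"], format="%d-%b-%y")
print(f"Min date: {txn_dates.min()}; Max date: {txn_dates.max()}")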

temp_full1.tail()

Insider Name                     0
Insider Title                    0
Insider Role                     0
Issuer                           0
Ticker                           0
CIK Code                         0
Period of Report                 0
Transaction Date                 0
Security                         0
Transaction Code                 0
Ownership Type                   0
ACCESSION_NUMBER                 0
shares                           0
price_per_share                  0
shares_after                    44
total_capital                    0
average_price_per_share        539
price_-1month                    0
trend_-1month                    0
trend_transactiondate            0
price_1month                     4
trend_1month                     4
price_2month                   208
trend_2month                   208
price_3month                   828
trend_3month                   828
price_4month                  1132
trend_4month                  1132
price_5month        

Unnamed: 0,Insider Name,Insider Title,Insider Role,Issuer,Ticker,CIK Code,Period of Report,Transaction Date,Security,Transaction Code,...,spy_price_2month,spy_trend_2month,spy_price_3month,spy_trend_3month,spy_price_4month,spy_trend_4month,spy_price_5month,spy_trend_5month,spy_price_6month,spy_trend_6month
122062,ZYBALA MICHAEL G,Missing,Tenpercentowner,INTERGROUP CORP,INTG,69422,19-Mar-14,19-Mar-14,COMMON STOCK,P,...,155.75,0.0176,162.14,0.104,163.93,0.1222,164.5,0.0145,167.19,0.0587
122063,ZYDA CHRISTOPHER J,Senior Vice President & CFO,Officer,LUMINENT MORTGAGE CAPITAL INC,LUM,1236309,12-Jun-07,12-Jun-07,Common Stock,P,...,103.48,-0.0227,105.74,-0.1249,112.32,0.1526,103.25,0.0744,107.32,-0.1528
122064,ZYDA CHRISTOPHER J,Senior Vice President & CFO,Officer,LUMINENT MORTGAGE CAPITAL INC,LUM,1236309,23-May-06,23-May-06,Common Stock,P,...,87.04,-0.0725,91.12,0.0626,92.72,0.1238,96.95,0.1418,99.39,0.1419
122065,ZYLSTRA MICHAEL J,Cracker Barrel Gen. Counsel,Officer,CBRL GROUP INC,CBRL,1067294,6-Jan-06,6-Jan-06,Common Stock,P,...,89.25,0.0252,91.49,0.0635,92.64,0.0399,88.65,-0.0271,89.49,-0.1047
122066,ZYNGIER ALEXANDRE,Missing,Director,AUDIOEYE INC,AEYE,1362190,8-Jul-20,8-Jul-20,Common Stock,P,...,311.17,0.2538,322.33,0.0385,328.31,0.0344,347.07,0.1748,359.0,0.2168


Method 1 took approximately 22 minutes to complete for 120,000+ rows of data. The next method we will try is a merge based on time-adjusted dates. This requires adding offset-date columns, merging on those columns, and then dropping them. I expect this to be significantly faster. We will use the second copy of the dataset for this.
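
The add, merge, and drop steps just described can be sketched on toy data as follows. This is a hedged illustration under our own assumptions, not the notebook's final implementation: the function `add_benchmark_columns` and the helper columns `_row` and `_target` are hypothetical names, the benchmark is assumed to be a flat frame with `Date`, `Close`, and `MA_trend` columns (the real `spy_data` would first need its index reset and its column levels flattened), and target dates beyond the last benchmark date are left empty, which is similar in spirit to the `today` check in Method 1 but not identical.

```python
import pandas as pd


def add_benchmark_columns(trades, bench, offsets):
    """Attach benchmark price and trend at each month offset via merge_asof.

    trades  : DataFrame with a datetime "Transaction Date" column
    bench   : DataFrame with datetime "Date" plus "Close" and "MA_trend"
    offsets : dict mapping a column suffix to a month offset, e.g. {"1month": 1}
    """
    out = trades.copy()
    out["_row"] = range(len(out))  # remember the original row order
    bench = bench[["Date", "Close", "MA_trend"]].sort_values("Date")
    last_bench_date = bench["Date"].max()

    for suffix, months in offsets.items():
        price_col, trend_col = f"spy_price_{suffix}", f"spy_trend_{suffix}"
        # 1) add a temporary column holding the shifted lookup date
        out["_target"] = out["Transaction Date"] + pd.DateOffset(months=months)
        # 2) merge on it (both sides must be sorted on their keys)
        out = pd.merge_asof(
            out.sort_values("_target"),
            bench.rename(columns={"Close": price_col, "MA_trend": trend_col}),
            left_on="_target",
            right_on="Date",
            direction="backward",
        )
        # Leave dates past the end of the benchmark data empty instead of
        # filling them with the last known values.
        future = out["_target"] > last_bench_date
        out.loc[future, [price_col, trend_col]] = float("nan")
        # 3) drop the helper columns again
        out = out.drop(columns=["_target", "Date"])

    return out.sort_values("_row").drop(columns="_row").reset_index(drop=True)


# Made-up benchmark and trades, purely for illustration.
bench = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2023-12-01", "2024-01-02", "2024-02-01", "2024-03-01", "2024-04-01"]
        ),
        "Close": [90.0, 100.0, 110.0, 120.0, 130.0],
        "MA_trend": [0.01, 0.02, 0.03, 0.04, 0.05],
    }
)
trades = pd.DataFrame(
    {
        "Ticker": ["BBB", "AAA", "CCC"],
        "Transaction Date": pd.to_datetime(["2024-02-15", "2024-01-10", "2024-03-20"]),
    }
)

result = add_benchmark_columns(trades, bench, {"-1month": -1, "1month": 1})
print(result)
```

Each offset costs one sort and one merge over the whole frame rather than one Python-level lookup per row, which is where the expected speedup would come from.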

In [None]:
# Let's start building out the first merge, which will be the price and trend for one month previous.
temp_full2["Transaction Date"] = pd.to_datetime(
    temp_full2["Transaction Date"], format="%d-%b-%y"
)
temp_full2["Period of Report"] = pd.to_datetime(
    temp_full2["Period of Report"], format="%d-%b-%y"
)
# Let's compare the two date columns. If they are more than a year (365 days) apart, I want to investigate further. We can use a where statement here
difference = pd.Timedelta(days=365)
temp_full2["date_status"] = np.where(
    (temp_full2["Period of Report"] - temp_full2["Transaction Date"]).abs()
    <= difference,
    "normal",
    "abnormal",
)
abnormal_df = temp_full2[temp_full2["date_status"] == "abnormal"]
print(abnormal_df.shape)
abnormal_df.sample(15)


(1229, 45)


Unnamed: 0,Insider Name,Insider Title,Insider Role,Issuer,Ticker,CIK Code,Period of Report,Transaction Date,Security,Transaction Code,...,beta,pe_ratio,forward_pe,dividend_yield,fifty_two_week_high,fifty_two_week_low,company_name,market_cap_category,return_6month_pct,date_status
41231,GLANDON TIMOTHY,Vice President,Officer,METHODE ELECTRONICS INC,MEI,65270,2006-10-31,2011-07-29,Common Stock,P,...,0.889,,9.243902,7.39,17.45,5.08,"Methode Electronics, Inc.",Small Cap (< $300M),-3.082614,abnormal
115583,WANGER ERIC,Missing,"Director,Tenpercentowner",ALTIGEN COMMUNICATIONS INC,ATGN,1003607,2007-02-08,2008-06-20,Common Stock,P,...,1.13,,,,0.84,0.33,"Altigen Communications, Inc.",Small Cap (< $300M),-49.64539,abnormal
23113,CULANG HOWARD BERNARD,Missing,Director,RADIAN GROUP INC,RDN,890926,2007-12-31,2003-01-16,Common Stock,P,...,0.701,8.887755,9.441734,2.93,37.86,29.32,Radian Group Inc.,Large Cap ($2B - $10B),16.893645,abnormal
43291,GRANOFF JONATHON G,Missing,Director,SENTRY TECHNOLOGY CORP,SKVY,1030708,2006-12-31,2000-10-27,Common stock,P,...,47.435,,,,0.0005,1e-06,Sentry Technology Corp.,Small Cap (< $300M),-50.0,abnormal
527,ADDIS DENNIS J,"President, Plant Nutrient",Officer,ANDERSONS INC,ANDE,821026,2003-07-23,2004-10-22,COMMON STOCK,P,...,0.736,11.378549,8.797561,2.16,55.52,31.03,"The Andersons, Inc.",Mid Cap ($300M - $2B),41.916168,abnormal
48395,HEMPHILL ROBERT F JR.,Missing,Tenpercentowner,NORTHWEST BIOTHERAPEUTICS INC,NWBO,1072379,2011-10-28,2013-07-31,Common Stock,P,...,,,,,0.5,0.17,"Northwest Biotherapeutics, Inc.",Mid Cap ($300M - $2B),55.389222,abnormal
50090,HILTON STEVEN J,Missing,Director,WESTERN ALLIANCE BANCORPORATION,WAL,1212545,2007-07-28,2009-08-14,Common Stock,P,...,1.297,10.732142,8.690767,1.95,98.1,56.7,Western Alliance Bancorporation,Large Cap ($2B - $10B),-31.658291,abnormal
23303,CUMMING IAN M.,Missing,Director,"Crimson Wine Group, Ltd",CWGL,1562151,2013-02-25,2015-08-19,Common Stock,P,...,,281.5,,,7.0,5.39,"Crimson Wine Group, Ltd.",Small Cap (< $300M),-14.20765,abnormal
82905,OSBORN KEITH D.,Missing,Tenpercentowner,Vystar Corp,VYST,1308027,2012-01-20,2014-07-10,Common Stock,P,...,,,,,1.0,0.0005,Vystar Corporation,Small Cap (< $300M),-60.0,abnormal
106423,STEIN TODD J,Missing,Director,"Spok Holdings, Inc",SPOK,1289945,2019-03-12,2020-03-12,Common Stock,P,...,0.479,22.000002,20.166668,7.38,17.96,13.55,"Spok Holdings, Inc.",Mid Cap ($300M - $2B),-1.920236,abnormal
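
The flag alone doesn't tell us how far apart the two dates are, so a signed gap in days can make the abnormal rows easier to triage. This is a small sketch on made-up rows (the `filings` frame and the `gap_days` column are hypothetical), using the same 365-day threshold as the code above.

```python
import pandas as pd

# Made-up rows for illustration only.
filings = pd.DataFrame(
    {
        "Period of Report": pd.to_datetime(["2006-10-31", "2019-03-12", "2020-05-01"]),
        "Transaction Date": pd.to_datetime(["2011-07-29", "2020-03-12", "2020-05-01"]),
    }
)

# Signed gap: negative means the report period precedes the transaction date.
filings["gap_days"] = (
    filings["Period of Report"] - filings["Transaction Date"]
).dt.days

# Same rule as the notebook: within 365 days is "normal", otherwise "abnormal".
filings["date_status"] = (
    filings["gap_days"].abs().le(365).map({True: "normal", False: "abnormal"})
)

print(filings.sort_values("gap_days"))
```

Sorting by the gap puts the most extreme mismatches first, which is a convenient starting point for checking whether either date column was recorded incorrectly.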


# Section 6: Save an intermediate .CSV file
Now that we have obtained all of the SPY data, we can save an intermediate .CSV file.

In [None]:
# temp_full1.to_csv('enhanced_common_stock_purchases_with_spy_data.csv',index=False)