# Insider Transaction Processing
This notebook is the second in our pipeline for analyzing insider transactions. Here, our focus is on extracting the important information from all of the archived Form 4 ZIP files that we retrieved from the SEC website (see `Notebook 1 - donload_sec_zips.ipynb`).

### Outline:  
1) Load all SEC Form 4 filing ZIP archives (source: https://www.sec.gov/data-research/sec-markets-data/insider-transactions-data-sets) that have been downloaded locally to the directory `sec_zips`.
2) Extract and process three .tsv files ('NONDERIV_TRANS.tsv', 'REPORTINGOWNER.tsv', 'SUBMISSION.tsv'). We will filter down to transactions by individual insiders only (excluding entities such as LPs, funds, trusts, etc.) and focus on open-market purchases. For details on the data contained in these ZIP files, see 'insider_transactions_readme.pdf' in the GitHub repository.  
3) Clean data by removing any invalid records (missing roles, etc.)  
4) Concatenate records from all ZIP files and export to a stable CSV file for future work

Note: It is relatively easy to adjust this notebook for future analysis if we decide that we want to look at entities or do a deeper dive into why invalid records exist.

## Section 1: Libraries
Let's load all of the libraries that we expect to use in this notebook. All libraries and versions are listed in the conda environment file `environment.yml`. If you are setting up a virtual environment for the first time, see `Virtual_Environment_Setup.ipynb` in the GitHub repository.

In [40]:
# Load Libraries
import zipfile
import os
import re
import pandas as pd

## Section 2: Obtaining Data and Concatenating
In this section, we will iterate through all of the ZIP files, pull out the information that we need, and put that information in a temporary dataframe. After iterating through all ZIP files, we will concatenate all of the dataframes together. Because this happens in a loop, all of the processing sits in one fairly large cell block. I will explain the reasoning behind each step with inline comments, and I will first split out any code that does not need to be in the loop to help with readability.

Let's start by finding all of the ZIP files that we have stored locally and creating a list of them that we can iterate through. This notebook should be saved in the same directory as `Notebook 1`, so the relative path below should resolve from our current working directory.

In [41]:
# Let's find our local directory where we stored the ZIP files
local_zip_dir = "sec_zips"
# In order to concatenate all files in order, we can sort this list
all_files = sorted([f for f in os.listdir(local_zip_dir) if f.endswith(".zip")])
print(
    f"Found {len(all_files)} ZIP files in '{local_zip_dir}', starting from: {all_files[0]} and ending with: {all_files[-1]}"
)

Found 77 ZIP files in 'sec_zips', starting from: 2006q1_form345.zip and ending with: 2025q1_form345.zip
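
As an optional sanity check on the listing above, the sketch below looks for gaps in the quarterly sequence. It assumes every archive follows the `YYYYqN_form345.zip` naming seen in the output; the helpers `quarter_index` and `missing_quarters` are hypothetical names introduced only for this illustration, and the example file list is made up.

```python
import re


def quarter_index(filename):
    """Map a name like '2006q1_form345.zip' to a running quarter count."""
    match = re.match(r"(\d{4})q([1-4])_form345\.zip$", filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    year, quarter = int(match.group(1)), int(match.group(2))
    return year * 4 + (quarter - 1)


def missing_quarters(filenames):
    """Return labels such as '2010q3' for quarters absent between the first and last file."""
    present = {quarter_index(name) for name in filenames}
    expected = range(min(present), max(present) + 1)
    return [f"{i // 4}q{i % 4 + 1}" for i in expected if i not in present]


# Made-up example with one quarter deliberately left out
example_files = ["2006q1_form345.zip", "2006q2_form345.zip", "2006q4_form345.zip"]
print(missing_quarters(example_files))  # ['2006q3']
```

Calling `missing_quarters(all_files)` on the real list should return an empty list if nothing is missing: 2006q1 through 2025q1 spans 77 quarters, which matches the count printed above.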


Great, so we have identified the right directory and we can see how many ZIP files we will be iterating through. We can also see the earliest and most recent files, which lets us confirm that the download is up to date. Now, let's prepare the layout of our final dataframe. Based on the 'insider_transaction_readme.pdf' file, we have identified the columns that we think will be important to this study. These columns, as in many datasets, have names that are not very straightforward. So let's create two objects: a list of the columns that we want to pull from the files, and a mapping dictionary that we can use to rename those columns in final processing.

In [42]:
# Let's create a list to store the columns that we want for use in final_df and filtered_entities
selected_columns = [
    "RPTOWNERNAME",
    "RPTOWNER_TITLE",
    "Insider Role",
    "ISSUERNAME",
    "ISSUERTRADINGSYMBOL",
    "ISSUERCIK",
    "PERIOD_OF_REPORT",
    "TRANS_DATE",
    "SECURITY_TITLE",
    "TRANS_CODE",
    "TRANS_SHARES",
    "TRANS_PRICEPERSHARE",
    "SHRS_OWND_FOLWNG_TRANS",
    "DIRECT_INDIRECT_OWNERSHIP",
    "ACCESSION_NUMBER",
]

renaming_dict = {
    "RPTOWNERNAME": "Insider Name",
    "RPTOWNER_TITLE": "Insider Title",
    "Insider Role": "Insider Role",
    "ISSUERNAME": "Issuer",
    "ISSUERTRADINGSYMBOL": "Ticker",
    "ISSUERCIK": "CIK Code",
    "PERIOD_OF_REPORT": "Period of Report",
    "TRANS_DATE": "Transaction Date",
    "SECURITY_TITLE": "Security",
    "TRANS_CODE": "Transaction Code",
    "TRANS_SHARES": "Shares",
    "TRANS_PRICEPERSHARE": "Price per Share",
    "SHRS_OWND_FOLWNG_TRANS": "Shares After",
    "DIRECT_INDIRECT_OWNERSHIP": "Ownership Type",
}
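
To make the role of these two objects concrete, here is a minimal sketch on a toy dataframe. The names `demo_columns`, `demo_renaming`, and `demo_df`, and all of the values in them, are made up for illustration. Note that "Insider Role" in `selected_columns` is not a raw SEC field; it is a derived column that the processing loop creates later.

```python
import pandas as pd

# Shortened, illustrative stand-ins for selected_columns and renaming_dict
demo_columns = ["RPTOWNERNAME", "TRANS_SHARES", "ACCESSION_NUMBER"]
demo_renaming = {"RPTOWNERNAME": "Insider Name", "TRANS_SHARES": "Shares"}

# Toy data; the accession number and names are invented
demo_df = pd.DataFrame(
    {
        "ACCESSION_NUMBER": ["0000000000-00-000001"],
        "TRANS_SHARES": [100.0],
        "RPTOWNERNAME": ["DOE JANE"],
        "UNUSED_COLUMN": ["dropped"],
    }
)

# Selecting with the list fixes the column order and drops everything else;
# rename() only touches keys present in the dictionary
renamed = demo_df[demo_columns].rename(columns=demo_renaming)
print(list(renamed.columns))  # ['Insider Name', 'Shares', 'ACCESSION_NUMBER']
```

Because `ACCESSION_NUMBER` has no entry in the mapping, it keeps its original name, which is also what happens with the real `renaming_dict`.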

Alright, now that we have everything set up, we can start iterating through the ZIP files to process them. As stated before, I am going to use a lot of inline comments for explanation throughout this longer code block.  

Before we process these files, here is a brief description of what each one contains.

REPORTINGOWNER.tsv - Rows: Variable; Columns: 13  
This file contains information about the insider. Its key is the combination of the accession number and the central index key (CIK) of the reporting insider. It also contains the insider's name, role (officer, director, tenpercentowner, other), insider title (CEO, CFO, VP, etc.), additional details of the position, street address, city, state, ZIP code, a description of the state, and an SEC file number.

NONDERIV_TRANS.tsv - Rows: Variable; Columns: 28  
This file contains all of the non-derivative transactions (that is, transactions in securities other than options, futures, etc.). It uses the accession number together with a surrogate key as its key. It contains information about the transaction such as security title, transaction date, execution date, transaction type, shares, nature of ownership, etc. We are looking for open-market purchases, which are indicated by the transaction code 'P'.

SUBMISSION.tsv - Rows: Variable; Columns: 13  
This file identifies the originating XML submissions along with filer and issuer information, again using the accession number as the primary key. It contains information such as filing_date, period_of_report, symbol, etc.
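
Since all three files share the accession number, they can be joined on it. The sketch below uses made-up toy tables (the `_demo` dataframes and their values are hypothetical) to show one consequence of the key structure described above: when a filing lists more than one reporting owner, a left merge repeats the transaction row once per owner.

```python
import pandas as pd

# Toy stand-ins for the three TSV files; all values are invented
nonderiv_demo = pd.DataFrame(
    {
        "ACCESSION_NUMBER": ["A-1", "A-2"],
        "TRANS_CODE": ["P", "P"],
        "TRANS_SHARES": [100.0, 250.0],
    }
)
report_demo = pd.DataFrame(
    {
        # Filing A-2 has two reporting owners
        "ACCESSION_NUMBER": ["A-1", "A-2", "A-2"],
        "RPTOWNERNAME": ["DOE JANE", "ROE RICHARD", "ROE FAMILY TRUST"],
    }
)
submission_demo = pd.DataFrame(
    {
        "ACCESSION_NUMBER": ["A-1", "A-2"],
        "ISSUERTRADINGSYMBOL": ["AAA", "BBB"],
    }
)

# Same join strategy as the processing loop: left merges on the accession number
merged = nonderiv_demo.merge(report_demo, on="ACCESSION_NUMBER", how="left")
merged = merged.merge(submission_demo, on="ACCESSION_NUMBER", how="left")

print(len(nonderiv_demo), len(merged))  # 2 3
print(merged[["ACCESSION_NUMBER", "RPTOWNERNAME", "ISSUERTRADINGSYMBOL"]])
```

In this toy case the entity filter applied later would drop the "ROE FAMILY TRUST" row, but two individual co-filers on one filing would both be kept, so the same purchase can appear more than once in the output.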

Note: This raises a valid concern about the validity of the study. We have been using transaction dates and filing dates. However, there is a deemed execution date that may be more appropriate. It would be interesting to analyze whether using it would retain a significant amount of the price data that currently gets filtered out in Notebook 5.

We will start by creating a list to store our dataframes for final concatenation.

In [45]:
# Create a list to store all merged DataFrames for each iteration
merged_all = []

# Loop through each ZIP file
for zip_filename in all_files:
    print(f"Processing file: {zip_filename}")
    # Construct the full path to the ZIP file
    zip_path = os.path.join(local_zip_dir, zip_filename)
    # Create a folder name by dropping the ".zip" extension
    folder_name = zip_filename.replace(".zip", "")
    # Create a folder to extract the contents of the ZIP file
    extract_path = f"{local_zip_dir}/{folder_name}"

    # Extract the ZIP file inside a try/except block so that a bad archive is skipped instead of stopping the loop
    try:
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(extract_path)
    except Exception as e:
        print(f"Skipping {zip_filename} due to extraction error: {e}")
        continue

    # Now that we have extracted the ZIP file, let's find the TSV files we want in the extracted folder
    try:
        nonderiv = pd.read_csv(
            os.path.join(extract_path, "NONDERIV_TRANS.tsv"), sep="\t", low_memory=False
        )  # used to suppress dtype warning
        report = pd.read_csv(os.path.join(extract_path, "REPORTINGOWNER.tsv"), sep="\t")
        submission = pd.read_csv(os.path.join(extract_path, "SUBMISSION.tsv"), sep="\t")
    except Exception as e:
        print(f"Skipping {zip_filename} due to load error: {e}")
        continue

    # In the original notebook we had a function get_role() which did not work, so this replaces it with a simpler approach
    report["Insider Role"] = report["RPTOWNER_RELATIONSHIP"].str.strip().str.title()

    # Now let's filter the non-deriv file for open-market buys ("P"); we can include sales ("S") in the future
    filtered = nonderiv[
        (nonderiv["SECURITY_TITLE"].str.lower() == "common stock")
        & (nonderiv["TRANS_CODE"] == "P")
    ]

    # Let's also filter out any "penny stocks" in this case we will say any with a share price < $5
    filtered = filtered[filtered["TRANS_PRICEPERSHARE"] >= 5.0].copy()

    # Here we are going to use a merge statement to join the filtered and the report data that we want
    filtered = filtered.merge(
        report[
            [
                "ACCESSION_NUMBER",
                "RPTOWNERNAME",
                "RPTOWNER_TITLE",
                "RPTOWNER_RELATIONSHIP",
                "Insider Role",
            ]
        ],
        on="ACCESSION_NUMBER",
        how="left",
    )

    # Now, let's create a copy to work on; in case we mess anything up, it will be easy to redo
    before_entity_filter = filtered.copy()

    # Let's convert the 'RPTOWNERNAME' to all uppercase for ease
    filtered["RPTOWNERNAME"] = filtered["RPTOWNERNAME"].str.upper()

    # Let's also create a list of entity_keywords that we want to search for
    entity_keywords = [
        "LLC",
        "L L C",
        "L.L.C.",
        "LP",
        "L P",
        "L.P.",
        "LTD",
        "INC",
        "TRUST",
        "CORP",
        "FOUNDATION",
        "COMPANY",
        "CO",
        "CO.",
        "PARTNERS",
        "ADVISORS",
        "ADVISORY",
        "CAPITAL",
        "INVESTMENT",
        "INVESTMENTS",
        "HOLDINGS",
        "MGMT",
        "MANAGEMENT",
        "FUND",
        "GROUP",
        "VENTURES",
        "BIOVENTURES",
        "INVESTORS",
        "EQUITY",
        "LIFE INSURANCE",
        "GP",
        "FAMILY",
        "PBC",
        "SDN BHD",
        "GMBH",
    ]

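    # Now, let's create a regex pattern that matches each keyword only as a standalone token: not preceded by a
    # word character, and followed by punctuation, whitespace, or the end of the string (to avoid matching inside names)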
    # For a full description of what this pattern does see `explanation of regex in Notebook2.docx`
    pattern = "(?i)" + "|".join(
        r"(?<!\w)" + re.escape(k) + r"(?=\W|$)" for k in entity_keywords
    )

    # Save the rows that will be the filtered out entities (for later review)
    # filtered_out_df = before_entity_filter[before_entity_filter["RPTOWNERNAME"].str.contains(pattern, case=False, na=False, regex=True)].copy()

    # Merge the entity-filtered-out rows with submission info to align with final_df format
    # filtered_out_df = filtered_out_df.merge(
    #    submission[["ACCESSION_NUMBER", "ISSUERNAME", "ISSUERTRADINGSYMBOL", "PERIOD_OF_REPORT", "ISSUERCIK"]],
    #    on="ACCESSION_NUMBER", how="left"
    # )

    # Remove rows where the insider name matches any known entity keyword (e.g., LLC, INC, TRUST)
    # Uses word boundaries to avoid false positives
    filtered = filtered[
        ~filtered["RPTOWNERNAME"].str.contains(
            pattern, case=False, na=False, regex=True
        )
    ]

    # Keep only valid insiders: director, officer, ten-percent owner, or anyone with a job title; I may consider removing this filter in the future
    # .loc[:, ] is used to address a pandas warning (it assigns the transformation to every row in the column)
    filtered.loc[:, "RPTOWNER_RELATIONSHIP"] = filtered[
        "RPTOWNER_RELATIONSHIP"
    ].str.upper()
    filtered = filtered[
        filtered["RPTOWNER_RELATIONSHIP"].str.contains(
            "DIRECTOR|OFFICER|TENPERCENTOWNER", na=False
        )
        | filtered["RPTOWNER_TITLE"].notna()
    ]

    # Merge with submission to get equity issuer info
    filtered = filtered.merge(
        submission[
            [
                "ACCESSION_NUMBER",
                "ISSUERNAME",
                "ISSUERTRADINGSYMBOL",
                "PERIOD_OF_REPORT",
                "ISSUERCIK",  # Added "ISSUERCIK" to map this field with SIC code
            ]
        ],
        on="ACCESSION_NUMBER",
        how="left",
    )

    # Filter out equity issuers that are investment funds
    filtered = filtered[
        ~filtered["ISSUERNAME"].str.contains("FUND", case=False, na=False)
        & ~filtered["ISSUERNAME"].str.contains("trust", case=False, na=False)
    ]

    # Now we can rename output columns using our dictionary from earlier
    final = filtered[selected_columns].rename(columns=renaming_dict)

    # Append cleaned dataframe to master list
    merged_all.append(final)

Processing file: 2006q1_form345.zip
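
The entity filter inside the loop is the least obvious step, so here is a small standalone sketch of how the lookbehind and lookahead behave. It uses a shortened subset of `entity_keywords` and made-up names; `demo_keywords`, `demo_pattern`, and `demo_names` are hypothetical and exist only for this illustration.

```python
import re

# A shortened, illustrative subset of the entity_keywords list used in the loop
demo_keywords = ["LLC", "L.P.", "LP", "INC", "TRUST", "CO", "CAPITAL", "PARTNERS"]

# Same construction as in the loop: the keyword must not be preceded by a word
# character and must be followed by a non-word character or the end of the string
demo_pattern = re.compile(
    "|".join(r"(?<!\w)" + re.escape(k) + r"(?=\W|$)" for k in demo_keywords),
    flags=re.IGNORECASE,
)

# Made-up names: the first four are individuals whose names merely contain a keyword
demo_names = [
    "SMITH JOHN A",
    "COLEMAN MARY",
    "TRUSTMAN ALAN",
    "MCCORMICK JAMES",
    "ACME CAPITAL PARTNERS LP",
    "Smith Family Trust",
    "EXAMPLE HOLDINGS INC.",
]

for name in demo_names:
    label = "entity" if demo_pattern.search(name) else "individual"
    print(f"{name:<28} {label}")
```

The first four names are labeled as individuals because "CO" and "TRUST" appear only inside longer words, while the last three are labeled as entities. The filter remains a heuristic: an individual whose name contains a keyword as a standalone token would still be removed.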


Alright, so we have iterated through all of our ZIP files, combined the necessary information from three different .tsv files, and filtered out the insider transactions that are not relevant to the hypothesis we have developed for this project. We now have a list of dataframes (77 if no ZIP file was skipped) that we need to concatenate.

In [None]:
# Combine all cleaned rows into one DataFrame
if merged_all:
    final_df = pd.concat(merged_all, ignore_index=True)

    # Save merged data
    final_df.to_csv("notebook1_insider_data.csv", index=False)
    print("Saved merged data to notebook1_insider_data.csv")

    # Preview output
    print("Preview of merged data:")
    pd.set_option('display.max_columns', None)
    display(final_df.head(10))
    
else:
    print("No valid purchase data found in uploaded zip files.")
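
For readers less familiar with `pd.concat`, the toy sketch below shows what `ignore_index=True` does and why the `if merged_all:` guard is there. The frames and their values are made up for illustration.

```python
import pandas as pd

# Two made-up per-file results, each with its own 0-based index
demo_frames = [
    pd.DataFrame({"Ticker": ["AAA", "BBB"], "Shares": [100.0, 250.0]}),
    pd.DataFrame({"Ticker": ["CCC"], "Shares": [75.0]}),
]

# ignore_index=True discards the per-file indexes and renumbers the rows
combined = pd.concat(demo_frames, ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2]

# Concatenating an empty list raises a ValueError, hence the guard in the cell above
try:
    pd.concat([])
except ValueError as err:
    print(f"Empty list raises ValueError: {err}")
```

Without `ignore_index=True`, the combined index would repeat (0, 1, 0), which makes later label-based lookups ambiguous.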

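If we want to have a list of all of the filtered-out entities, we can uncomment this next block together with the corresponding `filtered_out_df` lines inside the loop. Note that, as written, `filtered_out_df` is overwritten on every iteration, so it would only hold the entities from the last ZIP file unless each one is appended to a list and concatenated as we did with `merged_all`. For now, it is not used in any analysis, so we will keep it commented out.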

In [None]:
# Create CSV of filtered-out entities using same column names for records and future use
#filtered_entities = filtered_out_df[selected_columns].rename(columns=renaming_dict)

# Save to local drive
#filtered_entities.to_csv("notebook1_filtered_out_entities.csv", index=False)

Now, we can proceed to `Notebook 3 - yahoo_finance_price_data.ipynb`, where we will use this large CSV file to query Yahoo! Finance through the `yfinance` library and get historical price data for the 7 months in and around each insider transaction date.