# SEC Data Acquisition

### Initial Notes
This notebook originated from the work of Kirtland Corregan in our Milestone I project of the Master's of Applied Data Science. I have forked the original project, which can be seen here: https://github.com/RamiHaider/Do-Insiders-Know-Something-We-Dont. The reason for doing this is twofold. First, I want to run through the code line by line in order to fully understand and implement the data acquisition phase of the project. Second, I want to expand all files that I didn't write personally so they are in a Jupyter Notebook format for easy explanation and reproduction.

In some cases I may change the code to follow stricter coding guidelines or to make processes more efficient. I plan on continuing the project in my own time for a deeper analysis and hope to use it for further generation of alpha.

## Archived Form 4 Data
This notebook is used to download the archived quarterly SEC Form 4 ZIP files from the SEC website: https://www.sec.gov/data-research/sec-markets-data/insider-transactions-data-sets. We have designed the paths so that the files are stored locally for further processing and analysis. The current working directory will be the one where this notebook is saved.

### Section 1: Import Libraries  
The first thing that we will do is import all of the libraries that we will need for this notebook. There shouldn't be many as we are just downloading files from a website.

In [None]:
# import libraries
import os
import requests
import pandas as pd

### Section 2: Explicitly Set Working Directory  
To start, we will explicitly set the working directory to the one in which this notebook is saved. This allows us to create the directory needed to store the ZIP files, and lets all future notebooks use relative paths to access any of the intermediate CSV files or original documents.

In [None]:
# Let's confirm we are in the right directory
notebook_dir = os.getcwd()
print(f"Current working directory: {notebook_dir}")
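
As a minimal sketch of how later notebooks might build relative paths from this directory, the helper below joins a base directory with path components. The helper name `data_path` is hypothetical and not part of the original project; only the `sec_zips` folder name and the `form345` filename pattern come from this notebook.

```python
from pathlib import Path


def data_path(*parts: str, base: str = ".") -> Path:
    """Build a path relative to `base` (hypothetical helper for illustration)."""
    return Path(base).joinpath(*parts)


# Example: where the 2006 Q1 archive would live relative to the notebook directory.
example = data_path("sec_zips", "2006q1_form345.zip")
print(example)
```
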

### Section 3: Download Archived Form 4 ZIP files from SEC Website
Let's create a function that will download the archived ZIP files from the SEC website. We will be downloading all files that we can get, so every quarter we can rerun this workflow and expand our novel dataset in the future for deeper analysis.

This function will:  
1) Create or open the directory for storing the ZIP files  
2) Dynamically find the current quarter  
3) Iterate from 2006 through the latest completed quarter and download the ZIP files  
4) Gracefully catch errors and show any failed downloads

In [None]:
# Create a function to download SEC Form 4 ZIP files
def download_sec_zips(save_dir: str = "sec_zips") -> None:
    ''' 
    This is a function that will download SEC Form 4 ZIP files from the SEC website.
    
    Parameters:
    save_dir (str): The directory where the downloaded ZIP files will be saved.
                     Default is "sec_zips".
    Returns:
    None. The function downloads all available SEC Form 4 ZIP files from
    2006 Q1 through the latest completed quarter.
    '''

    # Create directory if it doesn't exist otherwise use existing directory.
    os.makedirs(save_dir, exist_ok=True)
    # The base URL for the SEC Form 4 data sets.
    base_url = (
        "https://www.sec.gov/files/structureddata/data/insider-transactions-data-sets"
    )

    # SEC blocks anonymous requests; requires a valid email.
    headers = {"User-Agent": "tmacphe@umich.edu"}

    # Create a list to keep track of any failed downloads
    failed = []

    # Let's get the current year and quarter so that we can get the most recent data
    now = pd.Timestamp.now()
    current_year = now.year
    quarter = now.quarter

    # Now that we have this, we can dynamically set the range of years and quarters to download.
    for year in range(2006, current_year + 1):
        for q in range(1, 5):
            if year == current_year and q >= quarter:
                break  # Only get files through the latest completed quarter.

            # Construct the filename and URL for each quarter so we can download the zip files.
            filename = f"{year}q{q}_form345.zip"
            url = f"{base_url}/{filename}"
            local_path = os.path.join(save_dir, filename)

            try:
                # Send a GET request to the URL to download the file.
                r = requests.get(url, headers=headers, timeout=30)
                # Check whether the request was successful (200 means a successful download).
                if r.status_code == 200:
                    # Open file and then save to the local directory.
                    with open(local_path, "wb") as f:
                        f.write(r.content)
                    print(f"Downloaded: {filename}")
                else:
                    print(f"Failed: {filename} (status {r.status_code})")
                    failed.append(filename)
            # Handle any exceptions that occur during the download process.
            except Exception as e:
                print(f"Error downloading {filename}: {e}")
                failed.append(filename)

    if failed:
        print("\nSome files failed to download:")
        for f in failed:
            print(" -", f)
    else:
        print("\nAll zip files downloaded successfully.")
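
The stopping rule in the nested loop above can be checked in isolation, without touching the network. The sketch below is an illustration only: `completed_quarters` is a hypothetical helper name, and it uses the standard library's `date` with `(month - 1) // 3 + 1` in place of `pd.Timestamp.quarter`, which gives the same quarter number.

```python
from datetime import date


def completed_quarters(today: date, start_year: int = 2006) -> list:
    """List archive filenames from start_year Q1 through the last completed quarter."""
    current_quarter = (today.month - 1) // 3 + 1  # 1..4, same as pd.Timestamp.quarter
    names = []
    for year in range(start_year, today.year + 1):
        for q in range(1, 5):
            if year == today.year and q >= current_quarter:
                break  # skip the quarter still in progress and anything after it
            names.append(f"{year}q{q}_form345.zip")
    return names


# A date in Q3 2007 covers 2006 Q1 through 2007 Q2.
print(completed_quarters(date(2007, 8, 15)))
```

Keeping this logic in a small pure function makes it straightforward to confirm which filenames a given run would request.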


Now that we have completely set up the function, we can run it independently and make sure that we get all of the files. We have used print statements so that we can track the status of the function as it iterates through all of the data.

In [None]:
# Run the function to download the SEC Form 4 ZIP files
download_sec_zips()
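
A 200 status code does not by itself guarantee that the saved payload is a ZIP archive, so a quick check after the run may be worthwhile. The following is a minimal sketch, assuming the default `sec_zips` directory; `find_invalid_zips` is a hypothetical helper that is not part of the original project, and it relies on `zipfile.is_zipfile` from the standard library.

```python
import os
import zipfile


def find_invalid_zips(save_dir: str = "sec_zips") -> list:
    """Return names of .zip files in save_dir that do not look like ZIP archives."""
    invalid = []
    for name in sorted(os.listdir(save_dir)):
        if not name.endswith(".zip"):
            continue  # ignore anything that is not one of the downloaded archives
        if not zipfile.is_zipfile(os.path.join(save_dir, name)):
            invalid.append(name)
    return invalid


# Only run the check if the download directory exists.
if os.path.isdir("sec_zips"):
    print("Invalid archives:", find_invalid_zips())
```
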

That concludes the notebook used to gather all archived Form 4 insider trading data from the SEC website. Please feel free to continue on to 'Notebook 2 - insider_zip_data_processing.ipynb' to see how we merge important files and then extract pertinent information for our project.