# Introduction
Placeholder for our introduction to insider trading and the benefits of using successful insider trades.

## Libraries and Dependencies
To make our project easier to repeat, reproduce, and replicate, we load all of our libraries up front and freeze the dependencies to a file so that anyone replicating our research knows which versions we used.

In [7]:
#Colab Libraries
from google.colab import drive, files
#Data Import Libraries
import os, zipfile, time, requests
from bs4 import BeautifulSoup
#Data Manipulation Libraries
import numpy as np
import pandas as pd
#Visualization Libraries
import matplotlib.pyplot as plt

In [None]:
#print the dependencies in the notebook
!pip freeze

#create a .txt file that contains all versions
#!pip freeze > colab_requirements.txt
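
Since `pip freeze` lists every package in the Colab image, it can help to also record just the libraries this notebook imports. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper name `pinned_versions` and the package list are illustrative and not part of the original notebook.

```python
from importlib import metadata

def pinned_versions(packages):
    """Return 'name==version' strings for each installed package (hypothetical helper)."""
    pins = []
    for name in packages:
        try:
            pins.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible marker instead of silently skipping the package
            pins.append(f"# {name} is not installed")
    return pins

# Distribution names for the imports used in this notebook (assumed list)
for line in pinned_versions(["numpy", "pandas", "matplotlib", "requests", "beautifulsoup4"]):
    print(line)
```

The printed lines could be written to a small requirements file alongside the full `pip freeze` output.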

## Loading Primary Dataset
We will load multiple SEC Form 4 filing ZIP archives (Source: https://www.sec.gov/data-research/sec-markets-data/insider-transactions-data-sets). Each ZIP archive contains 10 files. We will extract and process three .tsv files from each archive and filter the insider transactions down to open-market purchases made by individual insiders (excluding investment entities such as funds, limited partnerships, and trusts). We will then identify transactions involving corporate officers and clean the data by removing all invalid records (those with missing roles). The processed results are compiled into a DataFrame and saved to a .csv file for backup and potential upload to a database or machine-learning pipeline (e.g., BigQuery).
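
The filtering described above might look roughly like the sketch below. It is not the project's final pipeline: it reads only two of the tables, does not yet exclude investment entities, and the file names (`NONDERIV_TRANS.tsv`, `REPORTINGOWNER.tsv`), column names (`ACCESSION_NUMBER`, `TRANS_CODE`, `RPTOWNER_RELATIONSHIP`), and the helper `officer_purchases` are assumptions that should be checked against the README shipped in each archive. Transaction code `P` is the Form 4 code for a purchase.

```python
import zipfile
import pandas as pd

# Assumed member names inside each SEC archive; verify against the README
TRANS_FILE = "NONDERIV_TRANS.tsv"
OWNER_FILE = "REPORTINGOWNER.tsv"

def read_tsv(archive, member):
    """Read one tab-separated member of an open ZipFile into a DataFrame."""
    with archive.open(member) as handle:
        return pd.read_csv(handle, sep="\t", dtype=str)

def officer_purchases(zip_source):
    """Return purchase transactions (code 'P') reported by officers with a recorded role."""
    with zipfile.ZipFile(zip_source) as archive:
        trans = read_tsv(archive, TRANS_FILE)
        owners = read_tsv(archive, OWNER_FILE)
    # Keep purchases only
    purchases = trans[trans["TRANS_CODE"] == "P"]
    # Attach the reporting owner's details to each transaction
    merged = purchases.merge(owners, on="ACCESSION_NUMBER", how="inner")
    # Drop records whose role is missing, then keep officers
    merged = merged.dropna(subset=["RPTOWNER_RELATIONSHIP"])
    is_officer = merged["RPTOWNER_RELATIONSHIP"].str.contains("Officer", case=False)
    return merged[is_officer].reset_index(drop=True)
```

Calling `officer_purchases` on each downloaded archive and combining the results with `pd.concat` would give a single DataFrame to save as a .csv.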

We will start by mounting our Google Drive and navigating to the project folder.

In [4]:
#Mount google drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
'''For the final project, we will have only a single path to all information'''
#Students' Google Drive paths
toms_path = '/content/drive/MyDrive/Colab Notebooks/593 - Milestone I/593 - Insider Trading Milestone I Project'
kirts_path = None
ramis_path = None

#Navigate to the right working directory and confirm our current working drive
os.chdir(toms_path)
#os.chdir(kirts_path)
#os.chdir(ramis_path)
print(os.getcwd())

/content/drive/MyDrive/Colab Notebooks/593 - Milestone I/593 - Insider Trading Milestone I Project


The first step is to download all of the ZIP archives from the SEC website and save them to our Google Drive. The SEC requires clients to identify themselves with a proper User-Agent header in order to connect to its website.

In [None]:
'''Only run this cell once to download the data'''

#URLs where we can download the files
url = 'https://www.sec.gov/data-research/sec-markets-data/insider-transactions-data-sets'
#Create a session with a real User-Agent
session = requests.Session()
session.headers.update({'User-Agent':'tmacphe (tmacpe@umich.edu)',
                       'Accept-Encoding': 'gzip,deflate',
                       'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'})
#Let's grab the page
page = session.get(url)
page.raise_for_status()

#Now, we need to find all of the .zip links
from urllib.parse import urljoin
soup = BeautifulSoup(page.content, 'html.parser')
zip_links = []
for a in soup.find_all('a', href=True):
    href = a['href']
    if href.lower().endswith('.zip'):
        #urljoin resolves relative links against the page URL and leaves absolute links unchanged
        zip_links.append(urljoin(url, href))

#Create a folder to store all of our zipped archives
os.makedirs('sec_insider_zips',exist_ok=True)

#Download each file into the directory
for url in zip_links:
    #We can pull out the file name using the os.path
    filename = os.path.basename(url)
    out_path = os.path.join('sec_insider_zips',filename)
    if os.path.exists(out_path):
        print(f"Skipping {filename} because it has already been downloaded")
        continue
    else:
        print(f"Downloading {filename}...")
        #Now let's get the new webpages with the zip files
        zip_file = session.get(url)
        zip_file.raise_for_status()
        #create a file and write the contents to it (write binary 'wb')
        with open(out_path, 'wb') as f:
            f.write(zip_file.content)
        print(f"{filename} downloaded")
        time.sleep(0.5)
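
Because an interrupted download can leave a truncated file that the loop above would later skip as "already downloaded", it may be worth checking the archives before processing them. The following is a minimal sketch using only the standard library; the helper name `find_bad_archives` is hypothetical, and the folder name is the one created above.

```python
import os
import zipfile

def find_bad_archives(folder):
    """Return the names of .zip files in folder that are unreadable or corrupt."""
    bad = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".zip"):
            continue
        path = os.path.join(folder, name)
        # is_zipfile checks for a valid ZIP signature and directory
        if not zipfile.is_zipfile(path):
            bad.append(name)
            continue
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the first member with a bad CRC, or None if all pass
            if archive.testzip() is not None:
                bad.append(name)
    return bad

# Only run the check if the download folder exists in the working directory
if os.path.isdir("sec_insider_zips"):
    print(find_bad_archives("sec_insider_zips"))
```

Any file reported here can be deleted and the download cell rerun, since that cell only skips files that still exist.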


Next I will add all of Kirt's data manipulation for merging the files.