This application downloads corporate ownership data from the SEC's Edgar web app and processes it into summaries by company. It also associates the ownership data with political donations and lobbying data from http://opensecrets.org.
- Downloads list of all forms submitted to the SEC starting at 1990 from the SEC's EDGAR:
https://www.sec.gov/edgar/searchedgar/legacy/companysearch.html
- Download all corporate ownership-related forms (SEC form SC-13D, SC-13G, 3, 4 and 5) for a given year range
- Process forms into 'major owners' (SC-13D, SC-13G) and 'direct owners' (3, 4, 5)
- Process major and direct owner tables into a unified corporate ownership summary in MySQL
Imports 527, lobbying and campaign data into MySQL:
https://www.opensecrets.org/MyOS/bulk.php
Imports company subsidiaries data into MySQL:
http://api.corpwatch.org/documentation/db_dump/
These 3 databases are then linked: corporate ownership plus campaign finance data along with lists of each company's subsidiaries.
DataGrinder must be run on either a Mac or Linux box, the author uses a custom Hackintosh tower: 3.79 GHz Intel Quad-Core i5 with 24GB of RAM.
You'll definitely need an app for browsing the mysql / mongodb data. Author uses:
http://www.sequelpro.com/download
http://app.robomongo.org/download.html
- Install mysql. Any version should work, author used 5.5: https://dev.mysql.com/downloads/mysql/
- Install mongodb, author used 3.0.8: https://www.mongodb.org/downloads#production
Make sure you have plenty of hard drive space available wherever you install mysql/mongodb, I'd budget at least 200gb if you're processing all the available data. The more RAM you have the faster it will go, 16GB is probably the bare minimum.
- Create DBs, start Rails
bundle install
rake db:create
rake db:migrate
Once setup, there are now a number of rake tasks to run.
Open ./Rakefile in your favorite text editor.
Find the 'get_indices' task. Enter the years you'd like to import as the first two parameters to Sec::get_indices. The earliest year supported by datagrinder is 1994.
The SEC keeps an 'index' of all forms that have been submitted by year, these must be downloaded first so we know what corporate ownership forms to download in the next step.
rake sec:get_indices
In Rakefile, find import_forms and change the years/quarters to match the ones downloaded in step 1.
rake sec:import_forms
This one takes forever. For each record we created in SEC.2, download the form data itself into our mongodb record in the 'txt' field. This takes maybe 1-12 hours per quarter, depending on your internet speed. A wired internet connection is highly recommended.
To speed things along, I run multiple download processes at once in separate tabs:
rake sec:download_forms year=1994 qtr=1
rake sec:download_forms year=1994 qtr=2
rake sec:download_forms year=1994 qtr=3
rake sec:download_forms year=1994 qtr=4
With the forms downloaded, we now parse them into 'major owners' of 5% or more (the SC forms) and 'direct owners' (forms 3, 4, 5).
Once again, make sure the years and quarters are configured correctly in Rakefile.
This is the meat of the original programming for this project, dealing with the absolutely horrific state of the SC forms' formatting and managing to still extract the ownership data. For an idea of how different they can be, check these out:
https://www.sec.gov/Archives/edgar/data/61986/0000734269-94-000014.txt
https://www.sec.gov/Archives/edgar/data/902320/0000950144-94-000447.txt
Now that the direct_owners and major_owners collections are populated, we can turn them into summaries, one for each unique owner.
This task creates the empty summary records in the mysql table 'summaries'.
Populates the summary fields 'owned_by_insider', 'owned_by_5percent', 'owner_of_insider', 'owner_of_5percent' based on the major-owners and direct_owners collections.
Populates summary.subsidiaries, based on CorpWatch's subsidiary data.
When this is finished, the corporate ownership summaries are complete!
Now we move on to campaign finance
This must be run three times, specifying a different set of tables to import each time:
- 527
- lobby
- campaign
OpenSecrets summarization
OpenSecrets summarization
Follow the instructions herein to export the summary data and import it on another server.
- Read in Edgar's 'index' files to create the 'forms' MongoDB collection
- For each form, download the text data into the forms.txt field
- Process each form, extracting owners
- Create ownership 'summary' entries in MySQL for each company and owner
- Populate summaries using owner tables
Q: Why is it both mongoid and mysql? A: There's two types of data: 30 gigs of unstructured weirdness (mongo) and the nicely structured summary data (mongodb, about 300mb).
Q: Why is it a Rails app? A: Mostly for the Mongoid niceties, and I'm just generally used to them.
This app processes 6 forms:
Forms 3, 4 or 5. These are required to be filed by 'insiders', or employees at a given company when they acquire or dispose of shares. These end up in the direct_owners collection in MongoDB.
Forms SC 13D, SC 13G and SC 13G/A. These are required to be filed by anyone who owns 5%+ or more of a given company when they acquire or dispose of shares. These end up in the major_owners collection.
We ignore 'direct owners' for data before 2004 Quarter 4. The reason is, before this date the data is freeform and extremely difficult to parse out, and nice clean XML afterwards. Realistically employees own very small amount of the company's stock, and employee information from 12+ years ago is likely to be out of date. If they're still employed at the company, it's very likely that their stock ownership has changed since then and thus will appear in our DB.