Is EDGAR Bulk API of any use? #257

Open
abitrolly opened this issue Jan 1, 2022 · 12 comments

Comments

@abitrolly

Is the bulk data download from https://www.sec.gov/edgar/sec-api-documentation useful? It looks more accessible than dealing with the imposed API restrictions.

I am specifically interested in executive compensation from DEF 14A filings, but I have no idea what it takes to extract it for all companies.

@jackmoody11
Member

Yes, this would be useful to add. I wrote some comments on this in #227. Would you be willing to help make this part of the package? I am happy to help walk you through the process.

@abitrolly
Author

abitrolly commented Jan 3, 2022

Well, a structured approach to getting this data would be nice. I haven't found a list of fields that sec-edgar can extract, only that it can download the filings.

Even if sec-edgar is able to extract the field, the Bulk API data may not contain what I need, so I need to check that first.

@abitrolly
Author

The download requires setting a user agent.

curl -O https://www.sec.gov/Archives/edgar/daily-index/xbrl/companyfacts.zip --user-agent "No company <private@email.com>"
curl -O https://www.sec.gov/Archives/edgar/daily-index/bulkdata/submissions.zip --user-agent "No company <private@email.com>" 
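The same download can be sketched in Python with only the standard library. The helper names build_request and download are hypothetical and not part of sec-edgar; the URLs and the user-agent string are the ones from the curl commands above.

```python
import shutil
import urllib.request

# URLs copied from the curl commands in this comment.
COMPANYFACTS_URL = "https://www.sec.gov/Archives/edgar/daily-index/xbrl/companyfacts.zip"
SUBMISSIONS_URL = "https://www.sec.gov/Archives/edgar/daily-index/bulkdata/submissions.zip"


def build_request(url, user_agent):
    """Return a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, dest_path, user_agent):
    """Stream the response body to dest_path instead of holding it in memory."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

Calling download(COMPANYFACTS_URL, "companyfacts.zip", "No company <private@email.com>") would then fetch the first archive.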

@abitrolly
Author

companyfacts.zip expands from a roughly 1 GB archive into about 13 GB of .json files with names like CIK0001859035.json

$ 7z l companyfacts.zip
...
2021-12-24 01:19:46 .....        17497         3954  CIK0001859035.json
2021-12-23 20:12:08 .....        18623         4074  CIK0001869824.json
2021-12-23 20:07:40 .....        98785        13056  CIK0001881592.json
2021-12-27 16:56:36 .....           47           45  CIK0000924186.json
2021-12-27 16:56:39 .....           47           46  CIK0001436581.json
2021-12-28 00:53:33 .....        17183         3668  CIK0001873441.json
2021-12-28 17:17:05 .....           47           46  CIK0001869467.json
2021-12-29 23:52:42 .....          797          463  CIK0001827401.json
2021-12-30 00:01:59 .....        17298         4265  CIK0001853314.json
2021-12-31 00:16:56 .....        61458         9412  CIK0001781397.json
2021-12-30 18:03:40 .....        23878         5088  CIK0001867956.json
2021-12-30 18:55:57 .....        12221         3045  CIK0001879373.json
------------------- ----- ------------ ------------  ------------------------
2021-12-31 00:56:34        13515969619   1001420451  15483 files
$ 7z x companyfacts.zip -ocompanyfacts
...
Files: 15483
Size:       13515969619
Compressed: 1003959685  
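The totals that 7z prints can also be read from the archive index without extracting anything. This is a minimal sketch using Python's zipfile module; summarize_zip is a hypothetical helper name.

```python
import zipfile


def summarize_zip(zip_file):
    """Return (file_count, uncompressed_bytes, compressed_bytes) for an archive.

    zip_file may be a path or a binary file-like object. Only the central
    directory is read, so no member is decompressed.
    """
    with zipfile.ZipFile(zip_file) as zf:
        infos = [info for info in zf.infolist() if not info.is_dir()]
        return (
            len(infos),
            sum(info.file_size for info in infos),
            sum(info.compress_size for info in infos),
        )
```

For example, summarize_zip("companyfacts.zip") returns the file count and the two byte totals as a tuple.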

@abitrolly
Author

companyfacts.zip yields no "executive compensation" strings.
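A search like this can be run against the archive directly, without the 13 GB of unpacked files. The sketch below is one possible way to do it; members_containing is a hypothetical helper, and it reads each member fully into memory, which may be slow for the largest files.

```python
import zipfile


def members_containing(zip_file, needle, encoding="utf-8"):
    """Yield names of archive members whose text contains needle (case-insensitive)."""
    needle = needle.lower()
    with zipfile.ZipFile(zip_file) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Decode leniently so one badly encoded file does not stop the scan.
            text = zf.read(info).decode(encoding, errors="replace")
            if needle in text.lower():
                yield info.filename
```

A miss only shows that the literal phrase is absent; related values could still sit under differently named fields.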

@abitrolly
Author

abitrolly commented Jan 4, 2022

submissions.zip contains 500000+ files in a single dir (!) and my system has problems even unpacking it. :D

84% 484097 - CIK0001561746.json

18 hours and still processing.
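One way to sidestep unpacking that many files into one directory is to read individual members straight from the archive. A minimal sketch, where load_member is a hypothetical helper and the member name is whatever appears in the archive listing:

```python
import json
import zipfile


def load_member(zip_file, member_name):
    """Parse a single JSON member directly from the archive, without extracting it."""
    with zipfile.ZipFile(zip_file) as zf, zf.open(member_name) as fh:
        return json.load(fh)
```

For example, load_member("submissions.zip", "CIK0001561746.json") would return the parsed contents of that one file.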

@abitrolly
Author

Using unzip instead of 7z turned out to be 18 hours faster. ) So I unpacked the files, but I still have no idea about the structure, because there is no schema or ER diagram. I need a way to look at it without being overwhelmed.

@jackmoody11
Member

Using unzip instead of 7z turned out to be 18 hours faster. ) So I unpacked the files, but I still have no idea about the structure, because there is no schema or ER diagram. I need a way to look at it without being overwhelmed.

If you are on Linux or another Unix-like system, you can use the tree command.

@abitrolly
Author

The file layout is pretty simple: it is 781211 JSON files in a single dir. The diagrams I need are about the JSON structure. I have no idea how to find which fields should contain executive compensation from DEF 14A. Maybe there are no such fields at all.
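In the absence of a published schema, a rough picture of the JSON structure can be derived from a parsed file. The sketch below lists every key path with its type; outline is a hypothetical helper, and lists are sampled by their first element only, so heterogeneous lists would be under-reported.

```python
def outline(value, max_depth=4):
    """Return sorted 'path: type' lines describing the shape of parsed JSON.

    The root is shown as '$'. A list is summarized by its first element,
    shown under 'path[]'. Nesting deeper than max_depth is not expanded.
    """
    lines = set()

    def walk(node, path, depth):
        lines.add(f"{path or '$'}: {type(node).__name__}")
        if depth >= max_depth:
            return
        if isinstance(node, dict):
            for key, child in node.items():
                walk(child, f"{path}.{key}" if path else key, depth + 1)
        elif isinstance(node, list) and node:
            walk(node[0], f"{path}[]", depth + 1)

    walk(value, "", 0)
    return sorted(lines)
```

Running it on one parsed file, for example with json.load, and printing the lines gives a compact view of the fields that can be searched for compensation-related names.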

@ethankershner

I am currently trying to use the bulk downloads to retrieve SEC Form 4s. It appears that each JSON has accession numbers and form types for each filing by CIK, but I haven't figured out how to actually get the URL to that filing. Does anyone know how to get the URL to filings? I haven't been able to figure out the URL structure yet.
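Picking out the Form 4 accession numbers from one parsed file might look like the sketch below. The key names are an assumption that this thread does not confirm: it supposes that submission["filings"]["recent"] holds parallel lists called "accessionNumber" and "form". filings_of_form is a hypothetical helper.

```python
def filings_of_form(submission, form_type):
    """Return accession numbers whose form type equals form_type.

    ASSUMPTION: submission["filings"]["recent"] contains parallel lists
    named "accessionNumber" and "form". Adjust the keys if the real
    layout differs.
    """
    recent = submission["filings"]["recent"]
    return [
        accession
        for accession, form in zip(recent["accessionNumber"], recent["form"])
        if form == form_type
    ]
```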

@abitrolly
Author

No idea. I've got a strong feeling that SEC is doing its job so poorly on purpose. If the democracy works, they should just hire the maintainers and contributors to this repo to make things right for people.

@jackmoody11
Member

I am currently trying to use the bulk downloads to retrieve SEC Form 4s. It appears that each JSON has accession numbers and form types for each filing by CIK, but I haven't figured out how to actually get the URL to that filing. Does anyone know how to get the URL to filings? I haven't been able to figure out the URL structure yet.

Here is an example URL for a Nike Form 4: https://www.sec.gov/Archives/edgar/data/320187/000112760223015552/0001127602-23-015552.txt

The path is the CIK (stripped of leading zeros), then the accession number (stripped of hyphens), then the accession number again (hyphens kept) with .txt at the end.
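That rule can be written as a small function. This is a sketch based only on the Nike example above; filing_url is a hypothetical helper, and it expects the CIK as an integer or a purely numeric string.

```python
def filing_url(cik, accession_number):
    """Build the URL of a filing's .txt file from a CIK and a hyphenated accession number."""
    cik_part = str(int(cik))  # int() drops any leading zeros
    folder = accession_number.replace("-", "")  # accession number without hyphens
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik_part}/{folder}/{accession_number}.txt"
    )
```

With the values from the example, filing_url("0000320187", "0001127602-23-015552") reproduces the Nike URL.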

3 participants