SEC TEXT: NLP

A codebase that lets analysts and developers search SEC 10-K and 10-Q filings and run further Natural Language Processing (NLP) work on them.

The product is built in four files, plus a sample workflow script.

The first two are written in R and use the edgarWebR library from https://mwaldstein.github.io/edgarWebR/, a well-maintained and popular library for pulling SEC documents and splitting the SEC's XBRL-based filings into readable sections.

sec_nlp_getter.R reads a list of tickers from a local CSV file (column named Symbol) and, for each symbol:
a) retrieves all filings from the SEC for that symbol,
b) saves the base HTML document in a file tokenized (split) by sentences,
c) parses the base document into MDNA and Risk Factor sections,
d) creates a local file, filing_index.csv, which stores the location of each document for each ticker.
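
To make the shape of the getter's output concrete, here is a minimal Python sketch of steps (b) and (d). The real implementation is in R and relies on edgarWebR; the regex sentence splitter, the function names, the placeholder URL, and the layout of the index record below are all assumptions for illustration. Only the sentence_text, file and href column names are taken from the sample output further down.

```python
import csv
import io
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    A stand-in for a real tokenizer; it will mis-split abbreviations.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.lower() for p in parts if p]


def write_sentences(text, file_name, href, out):
    """Write one CSV row per sentence, tagged with its source file and URL."""
    writer = csv.writer(out)
    writer.writerow(["sentence_text", "file", "href"])
    sentences = split_sentences(text)
    for sentence in sentences:
        writer.writerow([sentence, file_name, href])
    return len(sentences)


def index_row(ticker, form_type, file_name, href):
    """Hypothetical filing_index.csv record: where one filing's sentences live."""
    return {"ticker": ticker, "type": form_type, "file": file_name, "href": href}


# Invented text and a placeholder URL, not real filing data.
sample = "Item 2. Net sales increased during the quarter. Services grew as well."
url = "https://example.invalid/filing"
buf = io.StringIO()
n = write_sentences(sample, "aapl-20200627_sentences.csv", url, buf)
row = index_row("AAPL", "10-Q", "aapl-20200627_sentences.csv", url)
```

Keeping href on every sentence row is what allows the sentence files to be joined back to the filing index later.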

sec_R_utils.R is the utility file for sec_nlp_getter.R.

The other two files are written in Python and use the NLTK and pattern libraries to apply sentiment analysis to the extracted documents.
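
As a rough illustration of what sentence-level sentiment scoring involves, the sketch below counts hits against a tiny word list. It is a stand-in written with the standard library only, not the NLTK or pattern scoring that the project uses; the lexicon and the function names are invented for the example.

```python
import re

# Toy lexicon, invented for illustration; real lexicons are far larger.
POSITIVE = {"growth", "increase", "improved", "strong"}
NEGATIVE = {"decline", "loss", "risk", "adverse"}


def sentence_polarity(sentence):
    """Return (positive hits - negative hits) / token count, in [-1, 1]."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)


def document_polarity(sentences):
    """Average the per-sentence scores; 0.0 for an empty document."""
    scores = [sentence_polarity(s) for s in sentences]
    return sum(scores) / len(scores) if scores else 0.0
```

Scoring per sentence and then averaging fits the sentence-tokenized files the R step produces, since each row can be scored independently.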

(sample_workflow.py is a sample file showing the various ways to use the SECTextNLP class.)

sec_text_nlp.py contains the SECTextNLP class.

sec_nlp_utils.py is the utility file for sec_text_nlp.py.

The following workflow is an example of the use of the SECTextNLP class.


In [7]:

import pandas as pd  # used explicitly for pd.merge in the cells below
from sec_text_nlp import *
stn = SECTextNLP("AAPL")

stn.df_file_index[['ticker','period_date','form_name','type']].head()


Unnamed: 0,ticker,period_date,form_name,type
139,AAPL,2020-06-27T04:00:00Z,Quarterly report [Sections 13 or 15(d)],10-Q
140,AAPL,2020-03-28T04:00:00Z,Quarterly report [Sections 13 or 15(d)],10-Q
141,AAPL,2019-12-28T05:00:00Z,Quarterly report [Sections 13 or 15(d)],10-Q
142,AAPL,2019-09-28T04:00:00Z,"Annual report [Section 13 and 15(d), not S-K I...",10-K
143,AAPL,2019-06-29T04:00:00Z,Quarterly report [Sections 13 or 15(d)],10-Q


In [8]:

stn.df_text.head()


Unnamed: 0,part.name,item.name,sentence_text,file,href
0,,,united states securities and exchange commissi...,aapl-20200627_sentences.csv,https://www.sec.gov/Archives/edgar/data/320193...
1,,,20549 form 10-q (mark one) ☒ quarterly repor...,aapl-20200627_sentences.csv,https://www.sec.gov/Archives/edgar/data/320193...
2,,,commission file number: 001-36743 apple inc.,aapl-20200627_sentences.csv,https://www.sec.gov/Archives/edgar/data/320193...
3,,,(exact name of registrant as specified in its ...,aapl-20200627_sentences.csv,https://www.sec.gov/Archives/edgar/data/320193...
4,,,employer identification no.),aapl-20200627_sentences.csv,https://www.sec.gov/Archives/edgar/data/320193...


In [9]:

pd.merge(stn.df_text, stn.df_file_index, how='inner', on='href')[['ticker', 'filing_date', 'sentence_text']].head()


Unnamed: 0,ticker,filing_date,sentence_text
0,AAPL,2020-07-31T04:00:00Z,united states securities and exchange commissi...
1,AAPL,2020-07-31T04:00:00Z,20549 form 10-q (mark one) ☒ quarterly repor...
2,AAPL,2020-07-31T04:00:00Z,commission file number: 001-36743 apple inc.
3,AAPL,2020-07-31T04:00:00Z,(exact name of registrant as specified in its ...
4,AAPL,2020-07-31T04:00:00Z,employer identification no.)
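
The merge above is an inner join on href: each sentence row is paired with the filing-index row that has the same URL, which is how ticker and filing_date end up next to sentence_text. A minimal plain-Python sketch of that join follows. The rows, the short "u1"-style keys, and the inner_join helper are made up for illustration and are not part of the project.

```python
def inner_join(left, right, key):
    """Pair every left row with every right row sharing the same key value."""
    by_key = {}
    for rrow in right:
        by_key.setdefault(rrow[key], []).append(rrow)
    joined = []
    for lrow in left:
        # A left row with no matching key is dropped (inner join semantics).
        for rrow in by_key.get(lrow[key], []):
            joined.append({**rrow, **lrow})
    return joined


# Illustrative rows only; the href values are placeholders, not SEC URLs.
text_rows = [
    {"href": "u1", "sentence_text": "item 2."},
    {"href": "u1", "sentence_text": "net sales increased."},
    {"href": "u9", "sentence_text": "orphan sentence."},
]
index_rows = [
    {"href": "u1", "ticker": "AAPL", "filing_date": "2020-07-31T04:00:00Z"},
    {"href": "u2", "ticker": "AAPL", "filing_date": "2020-05-01T04:00:00Z"},
]
merged = inner_join(text_rows, index_rows, "href")
```

Here the "u9" sentence and the "u2" filing both disappear from the result, because neither has a partner on the other side.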


In [10]:

pd.merge(stn.df_mdna, stn.df_file_index, how='inner', on='href')[['ticker', 'filing_date', 'sentence_text']].head()


Unnamed: 0,ticker,filing_date,sentence_text
0,AAPL,2020-07-31T04:00:00Z,item 2.
1,AAPL,2020-07-31T04:00:00Z,management's discussion and analysis of financ...
2,AAPL,2020-07-31T04:00:00Z,forward-looking statements provide current exp...
3,AAPL,2020-07-31T04:00:00Z,"for example, statements in this form 10-q rega..."
4,AAPL,2020-07-31T04:00:00Z,forward-looking statements can also be identif...


In [12]:

stn.get_noun_phrases_around_topic(BUSINESS_SEGMENT_LIST)


['americas segment',
 'asia pacific',
 'asia pacific segment',
 'china segment',
 'distribution partners',
 'europe segment',
 'geographic segment',
 'hong kong',
 'retail stores',
 'software products']
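
The source of get_noun_phrases_around_topic is not shown here, so the following is only a guess at the general idea: scan the sentences for topic terms and collect the words next to them. It is a crude standard-library stand-in for the noun-phrase chunking that NLTK or pattern would provide. The function name, the window parameter, the sample sentences, and the one-word topic list are all hypothetical, and the contents of BUSINESS_SEGMENT_LIST are not assumed.

```python
import re


def phrases_around_topic(sentences, topic_terms, window=1):
    """Collect the `window` words preceding each topic term, plus the term."""
    topics = {t.lower() for t in topic_terms}
    found = set()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, tok in enumerate(tokens):
            if tok in topics and i >= window:
                found.add(" ".join(tokens[i - window:i + 1]))
    return sorted(found)


# Invented sentences, not taken from a filing.
sentences = [
    "the americas segment includes north and south america.",
    "net sales in the europe segment increased.",
    "the americas segment grew.",
]
result = phrases_around_topic(sentences, ["segment"])
# result == ['americas segment', 'europe segment']
```

Returning a sorted, de-duplicated list mirrors the shape of the output above, although the real method clearly finds phrases (such as 'hong kong') that this adjacency rule would miss.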