## Research Project 4
---
```text
- Source: SEC
- Goal: Extract CEO appointments from 8-K filings
- Techniques: NER, sentence tokenization
- Tools: spaCy, NLTK
- Lines of code: ~50
```

### Request and parse SEC index
---

In [6]:
# The SEC daily form index lists every filing received on a given day, including all 8-Ks
url = 'https://www.sec.gov/Archives/edgar/daily-index/2018/QTR2/form.20180402.idx'

In [7]:
import requests
res = requests.get(url)
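A bare `requests.get(url)` returns whatever the server sends back, including error pages. The sketch below is one slightly more defensive way to make the same call. It assumes that the SEC's guidance for automated access to EDGAR asks scripts to identify themselves with a `User-Agent` header and to keep their request rate low (check the current policy for the exact wording). The names `USER_AGENT`, `sec_headers`, `fetch`, and the `pause` value are hypothetical and not part of the original notebook.

```python
import time

# Hypothetical contact string: replace it with your own name and e-mail.
USER_AGENT = 'Research Project 4 (your.name@example.com)'

def sec_headers(user_agent=USER_AGENT):
    """Build request headers that identify the caller to the server."""
    return {'User-Agent': user_agent, 'Accept-Encoding': 'gzip, deflate'}

def fetch(url, pause=0.2):
    """Download one URL, raise on HTTP errors, then pause briefly."""
    import requests  # imported here so sec_headers() works without requests installed
    res = requests.get(url, headers=sec_headers(), timeout=30)
    res.raise_for_status()   # fail loudly on 403/404 instead of parsing an error page
    time.sleep(pause)        # assumed pause, meant to keep the request rate modest
    return res
```

With this in place, `res = fetch(url)` can stand in for `res = requests.get(url)` in the cells that follow.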

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1;
            background-color: #FCF3CF;
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4">
    <a href="../deep_dives/http.ipynb" style="text-decoration: none"> 
    <h3 style="font-family: monospace">Deep-dive</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Hypertext Transfer Protocol (HTTP)</p></a></font>
</div>

In [8]:
print(res.text[:1000])

Description:           Daily Index of EDGAR Dissemination Feed by Form Type
Last Data Received:    Apr  2, 2018
Comments:              webmaster@sec.gov
Anonymous FTP:         ftp://ftp.sec.gov/edgar/
 
 
 
 
Form Type   Company Name                                                  CIK         Date Filed  File Name
---------------------------------------------------------------------------------------------------------------------------------------------
1-SA        HYGEN INDUSTRIES, INC.                                        1661116     20180402    edgar/data/1661116/0001065949-18-000055.txt         
1-U         FUNDRISE REAL ESTATE INVESTMENT TRUST, LLC                    1645583     20180402    edgar/data/1645583/0001144204-18-018374.txt         
1-U         Fundrise East Coast Opportunistic REIT, LLC                   1660918     20180402    edgar/data/1660918/0001144204-18-018370.txt         
1-U         Fundrise Equity REIT, LLC                                     1648956     2018

In [9]:
line = '1-SA        HYGEN INDUSTRIES, INC.                                        1661116     20180402    edgar/data/1661116/0001065949-18-000055.txt'

In [15]:
print(line.find('HYGEN'))
print(line.find('1661116'))
print(line.find('20180402'))
print(line.find('edgar/data'))

12
74
86
98


In [18]:
record = {
    'Form Type': line[0:12].strip(),
    'Company Name': line[12:74].strip(),
    'CIK': line[74:86].strip(),
    'Date Filed': line[86:98].strip(),
    'File Name': line[98:].strip()
}
record

{'CIK': '1661116',
 'Company Name': 'HYGEN INDUSTRIES, INC.',
 'Date Filed': '20180402',
 'File Name': 'edgar/data/1661116/0001065949-18-000055.txt',
 'Form Type': '1-SA'}
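The offsets 12, 74, 86 and 98 were found by searching one data line for known values. A possible alternative, sketched below, is to read the same offsets from the header row, since each column starts where its title starts. The helpers `column_spans` and `parse_line` are hypothetical names, and the header and data line are rebuilt with `ljust` only to keep the example self-contained.

```python
COLUMNS = ['Form Type', 'Company Name', 'CIK', 'Date Filed', 'File Name']

def column_spans(header, columns=COLUMNS):
    """Map each column name to the (start, end) slice it occupies in the header."""
    starts = [header.index(name) for name in columns]
    ends = starts[1:] + [None]   # the last column runs to the end of the line
    return dict(zip(columns, zip(starts, ends)))

def parse_line(line, spans):
    """Slice one fixed-width line into a dict of stripped field values."""
    return {name: line[start:end].strip() for name, (start, end) in spans.items()}

# Rebuild the header and the sample line with the column widths found above.
header = ('Form Type'.ljust(12) + 'Company Name'.ljust(62) + 'CIK'.ljust(12)
          + 'Date Filed'.ljust(12) + 'File Name')
line = ('1-SA'.ljust(12) + 'HYGEN INDUSTRIES, INC.'.ljust(62) + '1661116'.ljust(12)
        + '20180402'.ljust(12) + 'edgar/data/1661116/0001065949-18-000055.txt')

spans = column_spans(header)
record = parse_line(line, spans)
print(record)
```

This only works if the titles are left-aligned with their columns, which appears to be the case in the index shown above.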

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="./playground.ipynb" style="text-decoration: none"> 
    <h3 style="font-family: monospace">Exercise 4.1</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Create a function that parses "Date Filed" into a python datetime object.
    </p></a></font>
</div>

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="./playground.ipynb" style="text-decoration: none"> 
    <h3 style="font-family: monospace">Exercise 4.2</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Create a function that, given an SEC index, parses it and returns a list of records.
    </p></a></font>
</div>

### Parse 8-K
---

In [22]:
for line in res.text.split('\n')[11:]:
    if line.startswith('8-K'):
        url = 'https://www.sec.gov/Archives/' + line[98:].strip()
        doc = requests.get(url)
        break

In [25]:
url

'https://www.sec.gov/Archives/edgar/data/318306/0001144204-18-018693.txt'
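The loop above skips the preamble with a fixed `[11:]`. A minimal sketch of another option is to look for the dashed separator row and keep everything after it, which does not depend on the preamble length. The function name `data_lines` is hypothetical, and the sample text is a shortened, made-up stand-in for the real index.

```python
def data_lines(index_text):
    """Return the non-empty lines that follow the dashed separator row."""
    lines = index_text.split('\n')
    for i, line in enumerate(lines):
        if line.startswith('-----'):
            return [row for row in lines[i + 1:] if row.strip()]
    return []   # no separator found: treat the text as having no records

# Shortened, made-up sample that mimics the layout of the daily index.
sample = '\n'.join([
    'Description:           Daily Index of EDGAR Dissemination Feed by Form Type',
    '',
    'Form Type   Company Name',
    '-' * 40,
    '1-SA        HYGEN INDUSTRIES, INC.',
    '8-K         ABEONA THERAPEUTICS INC.',
    '',
])

rows = data_lines(sample)
first_8k = next((row for row in rows if row.startswith('8-K')), None)
print(first_8k)
```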

In [24]:
doc.text[:1000]

'<SEC-DOCUMENT>0001144204-18-018693.txt : 20180402\n<SEC-HEADER>0001144204-18-018693.hdr.sgml : 20180402\n<ACCEPTANCE-DATETIME>20180402170558\nACCESSION NUMBER:\t\t0001144204-18-018693\nCONFORMED SUBMISSION TYPE:\t8-K\nPUBLIC DOCUMENT COUNT:\t\t3\nCONFORMED PERIOD OF REPORT:\t20180329\nITEM INFORMATION:\t\tDeparture of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers: Compensatory Arrangements of Certain Officers\nITEM INFORMATION:\t\tRegulation FD Disclosure\nITEM INFORMATION:\t\tFinancial Statements and Exhibits\nFILED AS OF DATE:\t\t20180402\nDATE AS OF CHANGE:\t\t20180402\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tABEONA THERAPEUTICS INC.\n\t\tCENTRAL INDEX KEY:\t\t\t0000318306\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tPHARMACEUTICAL PREPARATIONS [2834]\n\t\tIRS NUMBER:\t\t\t\t830221517\n\t\tSTATE OF INCORPORATION:\t\t\tDE\n\t\tFISCAL YEAR END:\t\t\t1231\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t8-K\n\t\tSEC ACT:\t\t1934 Act\n

In [28]:
from lxml import html
tree = html.fromstring(doc.content)

In [30]:
' '.join(tree.itertext())[:1000]

'0001144204-18-018693.txt : 20180402\n 0001144204-18-018693.hdr.sgml : 20180402\n 20180402170558\nACCESSION NUMBER:\t\t0001144204-18-018693\nCONFORMED SUBMISSION TYPE:\t8-K\nPUBLIC DOCUMENT COUNT:\t\t3\nCONFORMED PERIOD OF REPORT:\t20180329\nITEM INFORMATION:\t\tDeparture of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers: Compensatory Arrangements of Certain Officers\nITEM INFORMATION:\t\tRegulation FD Disclosure\nITEM INFORMATION:\t\tFinancial Statements and Exhibits\nFILED AS OF DATE:\t\t20180402\nDATE AS OF CHANGE:\t\t20180402\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tABEONA THERAPEUTICS INC.\n\t\tCENTRAL INDEX KEY:\t\t\t0000318306\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tPHARMACEUTICAL PREPARATIONS [2834]\n\t\tIRS NUMBER:\t\t\t\t830221517\n\t\tSTATE OF INCORPORATION:\t\t\tDE\n\t\tFISCAL YEAR END:\t\t\t1231\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t8-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t001-15771\n\t\tFILM NUM

In [37]:
import re
text = ' '.join(tree.itertext())
text = re.sub(r'\n|\t|\xa0', ' ', text)
text = re.sub(' +', ' ', text)
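The two substitutions above can also be written as a single pass. In Python 3, `\s` in a `str` pattern matches Unicode whitespace, which includes newlines, tabs and the non-breaking space `\xa0`. The function name `normalize_whitespace` is hypothetical, and unlike the cell above it also trims the ends of the string.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_whitespace('Item\n5.02.\t\xa0 Departure of Directors '))
```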

In [39]:
from nltk.tokenize import sent_tokenize
char = text.find('Item 5.02')
if char >= 0:
    sentences = sent_tokenize(text[char:])
    joined = ' '.join(sentences[0:5])
    print(joined)

Item 5.02. Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers. Effective March 29, 2018, Frank Carsten Thiel, Ph.D., age 55, has been named by the Board of Directors of Abeona Therapeutics Inc. (the “Company”) as the Company's Chief Executive Officer. Dr. Thiel brings 25 years of proven global biopharmaceutical industry experience, including rare and orphan diseases, to Abeona. His most recent position at Alexion, he served as its Senior Vice President, Europe/Middle East/Africa and Asia Pacific where he was responsible for driving Alexion’s global commercial operations in these regions, including maximizing the current rare disease portfolio as well as guiding the launch of anticipated new products and indications.
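Taking the first five sentences is a simple cutoff, but the section may be shorter or longer than that. As a rough alternative, the sketch below keeps the text from `Item 5.02` up to the next item heading. The function `item_section`, the pattern `ITEM_HEADING`, and the sample text are all made up for illustration and are not part of the original notebook.

```python
import re

# Assumed heading shape: the word "Item" followed by a number such as 5.02.
ITEM_HEADING = re.compile(r'Item\s+\d+\.\d+')

def item_section(text, item='Item 5.02'):
    """Return the text from `item` up to the next item heading (or the end)."""
    start = text.find(item)
    if start < 0:
        return None
    nxt = ITEM_HEADING.search(text, start + len(item))
    end = nxt.start() if nxt else len(text)
    return text[start:end].strip()

# Made-up sample, not taken from a real filing.
sample = ('Item 5.02. Appointment of Certain Officers. Jane Doe has been named '
          'Chief Executive Officer. Item 7.01. Regulation FD Disclosure. '
          'A press release is attached.')

section = item_section(sample)
print(section)
```

One caveat: a section that mentions another item inline (for example a cross-reference to an exhibit item) would be cut short at that mention, so the five-sentence cutoff may still be the safer choice for some filings.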


In [None]:
# Going further than extracting this passage requires building a training set
# for text classification, likely with a deep neural network. We'll stop here.

In [1]:
# Standard library
import datetime
import re

# Third-party
import spacy
import gensim
import requests
from lxml import html
from nltk.tokenize import sent_tokenize

url = 'https://www.sec.gov/Archives/edgar/daily-index/2018/QTR2/form.20180402.idx'
res = requests.get(url)
form_index = res.text.split('\n')
records = []
for line in form_index[11:]:
    record = {}
    record['filing'] = line[0:12].strip()
    record['name'] = line[12:74].strip()
    try:
        record['cik'] = int(line[74:86].strip())
    except ValueError:
        continue
    try:
        record['date'] = datetime.datetime(int(line[86:90]),
                                           int(line[90:92]),
                                           int(line[92:94]))
    except ValueError:
        continue
    record['path'] = 'https://www.sec.gov/Archives/' + line[98:].strip()
    records.append(record)

eight_ks = [i for i in records if i['filing'].startswith('8-K')]
for num, doc in enumerate(eight_ks):
    res = requests.get(doc['path'])
    tree = html.fromstring(res.content)
    clean = ' '.join(tree.itertext()).replace('\n', ' ').replace('\t', ' ').replace('\xa0', ' ')
    clean = re.sub(' +', ' ', clean)
    char = clean.find('Item 5.02')
    if char >= 0:
        sentences = sent_tokenize(clean[char:])
        joined = ' '.join(sentences[0:5])
        if 'CEO' in joined or 'Chief Executive Officer' in joined: 
            print('%d) %s: "%s"\n' % (num, doc['name'], joined))

0) ABEONA THERAPEUTICS INC.: "Item 5.02. Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers. Effective March 29, 2018, Frank Carsten Thiel, Ph.D., age 55, has been named by the Board of Directors of Abeona Therapeutics Inc. (the “Company”) as the Company's Chief Executive Officer. Dr. Thiel brings 25 years of proven global biopharmaceutical industry experience, including rare and orphan diseases, to Abeona. His most recent position at Alexion, he served as its Senior Vice President, Europe/Middle East/Africa and Asia Pacific where he was responsible for driving Alexion’s global commercial operations in these regions, including maximizing the current rare disease portfolio as well as guiding the launch of anticipated new products and indications."

7) AMARILLO BIOSCIENCES INC: "Item 5.02. Compensatory Arrangements of Certain Officers. On March 28, 2018, the Company entered into employment cont