# Project 5: NLP on Financial Statements
## Instructions
Each problem consists of a function to implement and instructions on how to implement the function.  The parts of the function that need to be implemented are marked with a `# TODO` comment. After implementing the function, run the cell to test it against the unit tests we've provided. For each problem, we provide one or more unit tests from our `project_tests` package. These unit tests won't tell you if your answer is correct, but will warn you of any major errors. Your code will be checked for the correct solution when you submit it to Udacity.

## Packages
When you implement the functions, you'll only need to you use the packages you've used in the classroom, like [Pandas](https://pandas.pydata.org/) and [Numpy](http://www.numpy.org/). These packages will be imported for you. We recommend you don't add any import statements, otherwise the grader might not be able to run your code.

The other packages that we're importing are `project_helper` and `project_tests`. These are custom packages built to help you solve the problems.  The `project_helper` module contains utility functions and graph functions. The `project_tests` contains the unit tests for all the problems.

### Install Packages

In [1]:
import sys
!{sys.executable} -m pip install -r requirements.txt

Collecting alphalens==0.3.2 (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/a5/dc/2f9cd107d0d4cf6223d37d81ddfbbdbf0d703d03669b83810fa6b97f32e5/alphalens-0.3.2.tar.gz (18.9MB)
[K    100% |████████████████████████████████| 18.9MB 24kB/s  eta 0:00:01   24% |████████                        | 4.7MB 32.6MB/s eta 0:00:01    32% |██████████▌                     | 6.2MB 25.7MB/s eta 0:00:01    63% |████████████████████▎           | 12.0MB 31.5MB/s eta 0:00:01    70% |██████████████████████▊         | 13.4MB 31.1MB/s eta 0:00:01    86% |███████████████████████████▌    | 16.3MB 27.9MB/s eta 0:00:01    93% |██████████████████████████████  | 17.7MB 32.1MB/s eta 0:00:01
[?25hCollecting nltk==3.3.0 (from -r requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/50/09/3b1755d528ad9156ee7243d52aa5cd2b809ef053a0f31b53d92853dd653a/nltk-3.3.0.zip (1.4MB)
[K    100% |████████████████████████████████| 1.4MB 336kB/s eta 0:00:01
[?25hColl

### Load Packages

In [2]:
import nltk
import numpy as np
import pandas as pd
import pickle
import pprint
import project_helper
import project_tests

from tqdm import tqdm

### Download NLP Corpora
You'll need two corpora to run this project: the stopwords corpus for removing stopwords and wordnet for lemmatizing.

In [3]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

## Get 10ks
We'll be running NLP analysis on 10-k documents. To do that, we first need to download the documents. For this project, we'll download 10-ks for a few companies. To lookup documents for these companies, we'll use their CIK. If you would like to run this against other stocks, we've provided the dict `additional_cik` for more stocks. However, the more stocks you try, the long it will take to run.

In [4]:
cik_lookup = {
    'AMZN': '0001018724',
    'BMY': '0000014272',   
    'CNP': '0001130310',
    'CVX': '0000093410',
    'FL': '0000850209',
    'FRT': '0000034903',
    'HON': '0000773840'}

additional_cik = {
    'AEP': '0000004904',
    'AXP': '0000004962',
    'BA': '0000012927', 
    'BK': '0001390777',
    'CAT': '0000018230',
    'DE': '0000315189',
    'DIS': '0001001039',
    'DTE': '0000936340',
    'ED': '0001047862',
    'EMR': '0000032604',
    'ETN': '0001551182',
    'GE': '0000040545',
    'IBM': '0000051143',
    'IP': '0000051434',
    'JNJ': '0000200406',
    'KO': '0000021344',
    'LLY': '0000059478',
    'MCD': '0000063908',
    'MO': '0000764180',
    'MRK': '0000310158',
    'MRO': '0000101778',
    'PCG': '0001004980',
    'PEP': '0000077476',
    'PFE': '0000078003',
    'PG': '0000080424',
    'PNR': '0000077360',
    'SYY': '0000096021',
    'TXN': '0000097476',
    'UTX': '0000101829',
    'WFC': '0000072971',
    'WMT': '0000104169',
    'WY': '0000106535',
    'XOM': '0000034088'}

### Get list of 10-ks
The SEC has a limit on the number of calls you can make to the website per second. In order to avoid hiding that limit, we've created the `SecAPI` class. This will cache data from the SEC and prevent you from going over the limit.

In [5]:
sec_api = project_helper.SecAPI()

With the class constructed, let's pull a list of filled 10-ks from the SEC for each company.

In [6]:
from bs4 import BeautifulSoup

def get_sec_data(cik, doc_type, start=0, count=60):
    rss_url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany' \
        '&CIK={}&type={}&start={}&count={}&owner=exclude&output=atom' \
        .format(cik, doc_type, start, count)
    sec_data = sec_api.get(rss_url)
    feed = BeautifulSoup(sec_data.encode('ascii'), 'xml').feed
    entries = [
        (
            entry.content.find('filing-href').getText(),
            entry.content.find('filing-type').getText(),
            entry.content.find('filing-date').getText())
        for entry in feed.find_all('entry', recursive=False)]

    return entries

Let's pull the list using the `get_sec_data` function, then display some of the results. For displaying some of the data, we'll use Amazon as an example. 

In [7]:
example_ticker = 'AMZN'
sec_data = {}

for ticker, cik in cik_lookup.items():
    sec_data[ticker] = get_sec_data(cik, '10-K')

pprint.pprint(sec_data[example_ticker][:5])

[('http://www.sec.gov/Archives/edgar/data/1018724/000101872418000005/0001018724-18-000005-index.htm',
  '10-K',
  '2018-02-02'),
 ('http://www.sec.gov/Archives/edgar/data/1018724/000101872417000011/0001018724-17-000011-index.htm',
  '10-K',
  '2017-02-10'),
 ('http://www.sec.gov/Archives/edgar/data/1018724/000101872416000172/0001018724-16-000172-index.htm',
  '10-K',
  '2016-01-29'),
 ('http://www.sec.gov/Archives/edgar/data/1018724/000101872415000006/0001018724-15-000006-index.htm',
  '10-K',
  '2015-01-30'),
 ('http://www.sec.gov/Archives/edgar/data/1018724/000101872414000006/0001018724-14-000006-index.htm',
  '10-K',
  '2014-01-31')]


### Download 10-ks
As you see, this is a list of urls. These urls point to a file that contains metadata related to each filling. Since we don't care about the metadata, we'll pull the filling by replacing the url with the filling url.

In [8]:
raw_fillings_by_ticker = {}

for ticker, data in sec_data.items():
    raw_fillings_by_ticker[ticker] = {}
    for index_url, file_type, file_date in tqdm(data, desc='Downloading {} Fillings'.format(ticker), unit='filling'):
        if (file_type == '10-K'):
            file_url = index_url.replace('-index.htm', '.txt').replace('.txtl', '.txt')            
            
            raw_fillings_by_ticker[ticker][file_date] = sec_api.get(file_url)


print('Example Document:\n\n{}...'.format(next(iter(raw_fillings_by_ticker[example_ticker].values()))[:1000]))

Downloading AMZN Fillings: 100%|██████████| 23/23 [00:04<00:00,  4.72filling/s]
Downloading BMY Fillings: 100%|██████████| 28/28 [00:08<00:00,  3.47filling/s]
Downloading CNP Fillings: 100%|██████████| 20/20 [00:14<00:00,  1.40filling/s]
Downloading CVX Fillings: 100%|██████████| 26/26 [00:10<00:00,  2.51filling/s]
Downloading FL Fillings: 100%|██████████| 23/23 [00:04<00:00,  5.03filling/s]
Downloading FRT Fillings: 100%|██████████| 30/30 [00:05<00:00,  5.47filling/s]
Downloading HON Fillings: 100%|██████████| 26/26 [00:05<00:00,  4.53filling/s]

Example Document:

<SEC-DOCUMENT>0001018724-18-000005.txt : 20180202
<SEC-HEADER>0001018724-18-000005.hdr.sgml : 20180202
<ACCEPTANCE-DATETIME>20180201204115
ACCESSION NUMBER:		0001018724-18-000005
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		110
CONFORMED PERIOD OF REPORT:	20171231
FILED AS OF DATE:		20180202
DATE AS OF CHANGE:		20180201

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			AMAZON COM INC
		CENTRAL INDEX KEY:			0001018724
		STANDARD INDUSTRIAL CLASSIFICATION:	RETAIL-CATALOG & MAIL-ORDER HOUSES [5961]
		IRS NUMBER:				911646860
		STATE OF INCORPORATION:			DE
		FISCAL YEAR END:			1231

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	000-22513
		FILM NUMBER:		18568399

	BUSINESS ADDRESS:	
		STREET 1:		410 TERRY AVENUE NORTH
		CITY:			SEATTLE
		STATE:			WA
		ZIP:			98109
		BUSINESS PHONE:		2062661000

	MAIL ADDRESS:	
		STREET 1:		410 TERRY AVENUE NORTH
		CITY:			SEATTLE
		STATE:			WA
		ZIP:			98109
</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<




### Get Documents
With theses fillings downloaded, we want to break them into their associated documents. These documents are sectioned off in the fillings with the tags `<DOCUMENT>` for the start of each document and `</DOCUMENT>` for the end of each document. There's no overlap with these documents, so each `</DOCUMENT>` tag should come after the `<DOCUMENT>` with no `<DOCUMENT>` tag in between.

Implement `get_documents` to return a list of these documents from a filling. Make sure not to include the tag in the returned document text.

In [19]:
import re


def get_documents(text):
    """
    Extract the documents from the text

    Parameters
    ----------
    text : str
        The text with the document strings inside

    Returns
    -------
    extracted_docs : list of str
        The document strings found in `text`
    """
    #text="<DOCUMENT> for the start of each document and </DOCUMENT>"
    # TODO: Implement
    pattern = r'<DOCUMENT>(.*?)</DOCUMENT>'
    
    #result=re.compile('<DOCUMENT>(.*?)</DOCUMENT>', re.DOTALL | re.IGNORECASE).findall(text)
    x=re.compile(pattern,re.DOTALL | re.IGNORECASE)
    result=x.findall(text)
    
    print(text)
    return result


project_tests.test_get_documents(get_documents)

This is before the test document<DOCUMENT>
This is inside the document
This is the text that should be copied</DOCUMENT>
This is after the document
This shouldn	 be included.
<SEC-DOCUMENT>0002014754-18-050402.txt : 20180202
<SEC-HEADER>00002014754-18-050402.hdr.sgml : 20180202
<ACCEPTANCE-DATETIME>20180201204115
ACCESSION NUMBER:		0002014754-18-050402
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		110
CONFORMED PERIOD OF REPORT:	20171231
FILED AS OF DATE:		20180202
DATE AS OF CHANGE:		20180201

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			TEST
		CENTRAL INDEX KEY:			0001018724
		STANDARD INDUSTRIAL CLASSIFICATION:	RANDOM [2357234]
		IRS NUMBER:				91236464620
		STATE OF INCORPORATION:			DE
		FISCAL YEAR END:			1231

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	000-2225413
		FILM NUMBER:		13822526583969

	BUSINESS ADDRESS:	
		STREET 1:		422320 PLACE AVENUE
		CITY:			SEATTLE
		STATE:			WA
		ZIP:			234234
		BUSINESS PHONE:		306234534246600

	

With the `get_documents` function implemented, let's extract all the documents.

In [20]:
filling_documents_by_ticker = {}

for ticker, raw_fillings in raw_fillings_by_ticker.items():
    filling_documents_by_ticker[ticker] = {}
    for file_date, filling in tqdm(raw_fillings.items(), desc='Getting Documents from {} Fillings'.format(ticker), unit='filling'):
        filling_documents_by_ticker[ticker][file_date] = get_documents(filling)


print('\n\n'.join([
    'Document {} Filed on {}:\n{}...'.format(doc_i, file_date, doc[:200])
    for file_date, docs in filling_documents_by_ticker[example_ticker].items()
    for doc_i, doc in enumerate(docs)][:3]))

Getting Documents from AMZN Fillings:   6%|▌         | 1/18 [00:00<00:05,  3.04filling/s]IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Getting Documents from AMZN Fillings:  17%|█▋        | 3/18 [00:01<00:05,  2.68filling/s]IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Getting Documents from AMZN Fillings:  22%|██▏       | 4/18 [00:02<00:07,  1.91filling/s]IOPub data rate exceeded.
The notebook server will temporarily stop sending o

-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: webmaster@www.sec.gov
Originator-Key-Asymmetric:
 MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
 TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
 K4VtJIWrZd06/M/RtgM6gBE7FAJiDORO1RiQNL6Gl3H+hNKwoHsPm7i45dhhiN3g
 RlSeWaZjUHbUSHxadZnSKg==

<SEC-DOCUMENT>0000891020-00-000622.txt : 20000411
<SEC-HEADER>0000891020-00-000622.hdr.sgml : 20000411
ACCESSION NUMBER:		0000891020-00-000622
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		9
CONFORMED PERIOD OF REPORT:	19991231
FILED AS OF DATE:		20000329

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			AMAZON COM INC
		CENTRAL INDEX KEY:			0001018724
		STANDARD INDUSTRIAL CLASSIFICATION:	RETAIL-CATALOG & MAIL-ORDER HOUSES [5961]
		IRS NUMBER:				911646860
		STATE OF INCORPORATION:			DE
		FISCAL YEAR END:			1231

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		
		SEC FILE NUMBER:	000-22513
		FILM NUMBER:		5

Getting Documents from BMY Fillings:   4%|▍         | 1/24 [00:01<00:33,  1.48s/filling]IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Getting Documents from BMY Fillings:   8%|▊         | 2/24 [00:02<00:27,  1.24s/filling]IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Getting Documents from BMY Fillings:  12%|█▎        | 3/24 [00:03<00:24,  1.18s/filling]IOPub data rate exceeded.
The notebook server will temporarily stop sending outp

-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: webmaster@www.sec.gov
Originator-Key-Asymmetric:
 MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
 TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
 PDvRpc3vpC6w3Z8WaJYImfNYaVmvn7z0fHntesArL9WBgz21pqtiUrOtMt/BpuHh
 b04lGrd1xPW70s0ld6StUQ==

<SEC-DOCUMENT>0000014272-97-000008.txt : 19970401
<SEC-HEADER>0000014272-97-000008.hdr.sgml : 19970401
ACCESSION NUMBER:		0000014272-97-000008
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		14
CONFORMED PERIOD OF REPORT:	19961231
FILED AS OF DATE:		19970331
SROS:			NYSE
SROS:			PSE

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			BRISTOL MYERS SQUIBB CO
		CENTRAL INDEX KEY:			0000014272
		STANDARD INDUSTRIAL CLASSIFICATION:	PHARMACEUTICAL PREPARATIONS [2834]
		IRS NUMBER:				220790350
		STATE OF INCORPORATION:			DE
		FISCAL YEAR END:			1231

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE

Getting Documents from CNP Fillings:   6%|▋         | 1/16 [00:01<00:17,  1.18s/filling]IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Getting Documents from CNP Fillings:  12%|█▎        | 2/16 [00:02<00:16,  1.15s/filling]IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Getting Documents from CNP Fillings:  19%|█▉        | 3/16 [00:03<00:16,  1.29s/filling]IOPub data rate exceeded.
The notebook server will temporarily stop sending outp

Getting Documents from CVX Fillings:  41%|████      | 9/22 [00:16<00:24,  1.88s/filling]IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Getting Documents from CVX Fillings:  50%|█████     | 11/22 [00:18<00:18,  1.65s/filling]IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Getting Documents from CVX Fillings:  59%|█████▉    | 13/22 [00:18<00:12,  1.43s/filling]IOPub data rate exceeded.
The notebook server will temporarily stop sending ou

-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: webmaster@www.sec.gov
Originator-Key-Asymmetric:
 MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
 TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
 IO51KTMKivAd2l9nP66Nuuz9sw4Qriq0BdkQmQAid7fiCLC93YqSchzx1yjwuUMJ
 nCeddZIag+1eCiQGbvYYYg==

<SEC-DOCUMENT>0001206774-04-000280.txt : 20040405
<SEC-HEADER>0001206774-04-000280.hdr.sgml : 20040405
<ACCEPTANCE-DATETIME>20040405171246
ACCESSION NUMBER:		0001206774-04-000280
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		10
CONFORMED PERIOD OF REPORT:	20040131
FILED AS OF DATE:		20040405

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			FOOT LOCKER INC
		CENTRAL INDEX KEY:			0000850209
		STANDARD INDUSTRIAL CLASSIFICATION:	RETAIL-SHOE STORES [5661]
		IRS NUMBER:				133513936
		STATE OF INCORPORATION:			NY
		FISCAL YEAR END:			0127

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBE

Getting Documents from FRT Fillings:   5%|▌         | 1/20 [00:01<00:20,  1.08s/filling]IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Getting Documents from FRT Fillings:  20%|██        | 4/20 [00:04<00:16,  1.05s/filling]IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Getting Documents from FRT Fillings:  25%|██▌       | 5/20 [00:05<00:16,  1.11s/filling]IOPub data rate exceeded.
The notebook server will temporarily stop sending outp

Getting Documents from HON Fillings: 100%|██████████| 21/21 [00:14<00:00,  1.41filling/s]IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)




### Get Document Types
Now that we have all the documents, we want to find the 10-k form in this 10-k filing. Implement the `get_document_type` function to return the type of document given. The document type is located on a line with the `<TYPE>` tag. For example, a form of type "TEST" would have the line `<TYPE>TEST`. Make sure to return the type as lowercase, so this example would be returned as "test".

In [23]:
def get_document_type(doc):
    """
    Return the document type lowercased

    Parameters
    ----------
    doc : str
        The document string

    Returns
    -------
    doc_type : str
        The document type lowercased
    """
    
    # TODO: Implement
    type_results=re.compile('<TYPE>(.*?)\n', re.DOTALL | re.IGNORECASE).findall(doc)
    print(type_results)
    return  type_results[0].lower()


project_tests.test_get_document_type(get_document_type)

['10-K']
Tests Passed


With the `get_document_type` function, we'll filter out all non 10-k documents.

In [27]:
ten_ks_by_ticker = {}

for ticker, filling_documents in filling_documents_by_ticker.items():
    ten_ks_by_ticker[ticker] = []
    for file_date, documents in filling_documents.items():
        for document in documents:
            if get_document_type(document) == '10-k':
                ten_ks_by_ticker[ticker].append({
                    'cik': cik_lookup[ticker],
                    'file': document,
                    'file_date': file_date})


project_helper.print_ten_k_data(ten_ks_by_ticker[example_ticker][:5], ['cik', 'file', 'file_date'])

['10-K']
['EX-12.1']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['EX-101.INS']
['EX-101.SCH']
['EX-101.CAL']
['EX-101.DEF']
['EX-101.LAB']
['EX-101.PRE']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['EXCEL']
['XML']
['XML']
['XML']
['ZIP']
['10-K']
['EX-12.1']
['EX-21.1']
['EX-23.1']
['

['EX-101.INS']
['EX-101.SCH']
['EX-101.CAL']
['EX-101.DEF']
['EX-101.LAB']
['EX-101.PRE']
['GRAPHIC']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['EXCEL']
['XML']
['XML']
['10-K']
['EX-12.1']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['10-K']
['EX-10.4']
['EX-10.5']
['EX-10.6']
['EX-12.1']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['GRAPHIC']
['10-K']
['EX-12.1']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['10-K']
['EX-12.1']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['10-K']
['EX-10.8']
['EX-12.1']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['10-K']
['EX-12.1']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-

['EX-101.DEF']
['EX-101.LAB']
['EX-101.PRE']
['GRAPHIC']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['EXCEL']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['ZIP']
['XML']
['XML']
['XML']
['10-K']
['EX-3.(B)']
['EX-10.(AA)']
['EX-10.(CC)']
['EX-10.(DD)']
['EX-10.(EE)']
['EX-10.(MM)']
['EX

['XML']
['XML', 'explicitMember</type></DimensionInfo></anyType><anyType xsi:type="Segment"><IsDefaultForEntity>false</IsDefaultForEntity><Name /><IsFromEntityGroup>false</IsFromEntityGroup><ValueName>Technology [Member]</ValueName>', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>Unit1</UnitID><UnitType>Standard</UnitType><StandardMeasure><MeasureSchema>http://www.xbrl.org/2003/iso4217</MeasureSchema><MeasureValue>USD</MeasureValue><MeasureNamespace>iso4217</MeasureNamespace></StandardMeasure><Scale>0</Scale></UnitProperty></UPS><CurrencyCode>USD</CurrencyCode><OriginalCurrencyCode>USD</OriginalCurrencyCode></MCU><CurrencySymbol>$</CurrencySymbol><Labels><Label Id="1" Label="Licenses [Member]" /><Label Id="2" Label="Technology [Member]" /><Label Id="3" Label="1/1/2010 - 12/31/2010" /></Labels></Columns><Columns><Id>5</Id><IsAbstractGroupTitle>false</IsAbstractGroupTitle><LabelColumn>false</LabelColumn><Curren', 'explicit

['EXCEL']
['XML']
['XML']
['XML', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>Unit1</UnitID><UnitType>Standard</UnitType><StandardMeasure><MeasureSchema>http://www.xbrl.org/2003/iso4217</MeasureSchema', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>Unit1</UnitID><UnitType>Standard</UnitType><StandardMeasure><MeasureSchema>http://www.xbrl.org/2003/iso4217</MeasureSchema><M', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>Unit1</UnitID><UnitType>Standard</UnitType><StandardMeasure><MeasureSchema>http://www.xbrl.org/2003/iso4217</MeasureSchema><M', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>Unit1</UnitID><UnitType>Standard</UnitType><StandardMeasure><MeasureSchema>http://www.xbrl.org/2003/iso4217</MeasureSchema><M', 'explicit

['10-K']
['EX-10.M']
['EX-10.BB']
['EX-12']
['EX-21']
['EX-23']
['EX-31.A']
['EX-31.B']
['EX-32.A']
['EX-32.B']
['GRAPHIC']
['GRAPHIC']
['10-K']
['EX-10.(FF)']
['EX-12']
['EX-21']
['EX-23.(A)']
['EX-23.(B)']
['EX-31.(A)']
['EX-31.(B)']
['EX-32.(A)']
['EX-32.(B)']
['GRAPHIC']
['10-K']
['EX-10.(B)']
['EX-10.(Q)']
['EX-10.(U)']
['EX-10.(V)']
['EX-12']
['EX-21']
['EX-23.(A)']
['EX-23.(B)']
['EX-31.(A)']
['EX-31.(B)']
['EX-32.(A)']
['EX-32.(B)']
['GRAPHIC']
['GRAPHIC']
['10-K']
['EX-3.(B)']
['EX-10.(L)']
['EX-10.(Q)']
['EX-10.(S)']
['EX-10.(T)']
['EX-10.(U)']
['EX-12']
['EX-21']
['EX-23']
['EX-31.(A)']
['EX-31.(B)']
['EX-32.(A)']
['EX-32.(B)']
['GRAPHIC']
['10-K']
['EX-3.(B)']
['EX-4.(F)']
['EX-10.(F)']
['EX-10.(L)']
['EX-10.(Q)']
['EX-12']
['EX-21']
['EX-23']
['EX-31.(A)']
['EX-31.(B)']
['EX-32.(A)']
['EX-32.(B)']
['10-K']
['EX-3.B']
['EX-4.G']
['EX-4.S']
['EX-4.T']
['EX-12']
['EX-21']
['EX-23']
['EX-31.A']
['EX-31.B']
['EX-32.A']
['EX-32.B']
['10-K']
['EX-3.(B)']
['EX-10.(C)']
['EX-21']
[

['10-K']
['EX-10.(KK)(2)']
['EX-10.(KK)(3)']
['EX-10.(LL)']
['EX-10.(MM)']
['EX-12']
['EX-21']
['EX-23']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['EX-101.INS']
['EX-101.SCH']
['EX-101.CAL']
['EX-101.DEF']
['EX-101.LAB']
['EX-101.PRE']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['EXCEL']
['XML']
['XML']
['10-K']
['EX-4.(E)(31)']
['EX-4.(E)(32)']
['EX-10.(H)(1)']
['EX-10.(H)(2)']
['EX-10.(N)(3)']
['EX-10.(HH)(1)']
['EX-10.(HH)(2)']
['EX-10.(II)(1)']
['EX-10.(II)(2)']
['EX-10.(JJ)(1)']
['EX-10.(JJ)(2)']
['EX-10.(KK)']
['EX-10.(LL)']
['EX-10.(MM)']
['EX-10.(NN)']
['EX-12']
['EX-21']
['EX-23']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['10-K']
['EX-10.(G)']
['EX-10.(DD)']
['EX-10.(EE)']
['EX-10.(FF)']
['EX-10.(GG)']
['EX-12']
['EX-2

['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['EXCEL']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['10-K']
['EX-10.9']
['EX-10.13']
['EX-12.1']
['EX-21.1']
['EX-23.1']
['EX-24.1']
['EX-24.2']
['EX-24.3']
['EX-24.4']
['EX-24.5']
['EX-24.6']
['EX-24.7']
['EX-24.8']
['EX-24.9']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['EX-95']
['EX-99.1'

['XML', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitPro', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitPro', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>USD</UnitID>', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>USD</UnitID>', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitPro', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitPro', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitPro', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitPr', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>USD</UnitID><UnitType

['EX-101.INS']
['EX-101.SCH']
['EX-101.CAL']
['EX-101.LAB']
['EX-101.PRE']
['EX-101.DEF']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['EXCEL']
['XML']
['XML', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>', 'explicitMember</type>

['EX-101.PRE']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['EXCEL']

['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['EX-12']
['EX-21']
['EX-23']
['GRAPHIC']
['EX-31.1']
['GRAPHIC']
['EX-31.2']
['GRAPHIC']
['EX-32']
['GRAPHIC']
['GRAPHIC']
['10-K']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['EX-10.10']
['EX-10.40']
['EX-10.41']
['EX-12']
['EX-21']
['EX-23']
['EX-31.1']
['EX-31.2']
['EX-32']
['10-K']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['EX-10.1']
['EX-10.2']
['EX-10.3']
['EX-10.4']
['EX-10.5']
['EX-10.6']
['EX-12']
['EX-21']
['EX-23']
['EX-31.1']
['EX-31.2']
['EX-32']
['10-K']
['GRAPHIC']
['GRAPHIC']
['GRAPHIC']
['EX-12']
['EX-21']
['EX-23']
['EX-31.1']
['EX-32.2']
['EX-32']
['10-K']
['EX-10.20']
['EX-10.22']
['EX-10.32']
['EX-12']
['EX-13']
['EX-21']
['EX-23']
['EX-99.1']
['EX-99.2']
['10-K']
['EX-4.5']
['EX-10.21']
['EX-12']
['EX-13']
['EX-21']
['EX-23']
['10-K']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['EX-101.INS']
['EX-101.SCH']
['EX-101.CAL']
['EX-101.DEF']
['EX-101.LAB']
['EX-101.PRE']
['GRAPHIC']
['XML']
['XML']
['XML'

['10-K']
['EX-10.26']
['EX-10.27']
['EX-10.28']
['EX-10.29']
['EX-10.30']
['EX-10.31']
['EX-10.32']
['EX-10.33']
['EX-10.34']
['EX-10.35']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['10-K']
['EX-10.23']
['EX-10.24']
['EX-10.25']
['EX-10.26']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['10-K']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['10-K']
['EX-3.2']
['EX-10.32']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['GRAPHIC']
['10-K']
['EX-10.12']
['EX-10.17']
['EX-10.19']
['EX-10.26']
['EX-10.27']
['EX-10.28']
['EX-10.29']
['EX-10.30']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['10-K']
['EX-3.2']
['EX-4.5']
['EX-21.1']
['EX-23.1']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['10-K']
['EX-3.2']
['EX-23.1']
['EX-99.1']
['EX-99.2']
['10-K']
['EX-10.26']
['EX-10.27']
['EX-23']
['EX-27']
['10-K']
['EX-27']
['10-K']
['EX-23']
['EX-27']
['10-K']
['

['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['ZIP']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML']
['10-K']
['EX-10.5']
['EX-10.34']
['EX-10.35']
['EX-10.37']
['EX-10.38']
['EX-10.39']
['EX-12']
['EX-18']
['EX-21']
['EX-23']
['EX-24']
['EX-31.1']
['EX-31.2']
['EX-32.1']
['EX-32.2']
['GRAPHIC']
['GRAPHIC']
['EX-101.INS']
['EX-101.SCH']
['EX-101.CAL']
['EX-101.DEF']
['EX-101.LAB']
['EX-101.PRE']
['XML']
['XML']
['XML', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>USD</UnitID><UnitType>Standard</UnitType><StandardMeasure><MeasureSchema>http://www.x'

['XML', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>Shares</UnitID><UnitType>Standard</UnitType><StandardMeasure><MeasureSchema>http://www.xbrl.org/2003/', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>USD</UnitID><UnitType>Standard</UnitType><StandardMeasure><MeasureSchema>http://www.x', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>USD</UnitID><UnitType>Standard</Unit', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>Shares</UnitID><UnitType>Standard</UnitType><StandardMeasure><Meas', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>USD</UnitID><UnitType>Standard</UnitType><StandardMeasure', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios />

['EXCEL']
['XML']
['XML']
['XML']
['XML']
['XML']
['XML', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>USD</UnitID><UnitType>Standard</UnitType><StandardMeasure><MeasureSchema>http://www.xbrl', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>USD</UnitID><UnitType>Standard</UnitType><StandardMeasure><MeasureSchema>http://www.xbrl', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>USD</UnitID><UnitType>Standard</UnitType><StandardMeasure>', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>USD</UnitID><UnitType>Standard</UnitType><StandardMeasure><MeasureSchema>ht', 'explicitMember</type></DimensionInfo></anyType></Segments><Scenarios /></contextRef><UPS><UnitProperty><UnitID>USD</UnitID><UnitType>Standard</UnitType><StandardMeasure><Me

## Preprocess the Data
### Clean Up
As you can see, the text for the documents are very messy. To clean this up, we'll remove the html and lowercase all the text.

In [28]:
def remove_html_tags(text):
    text = BeautifulSoup(text, 'html.parser').get_text()
    
    return text


def clean_text(text):
    text = text.lower()
    text = remove_html_tags(text)
    
    return text

Using the `clean_text` function, we'll clean up all the documents.

In [26]:
for ticker, ten_ks in ten_ks_by_ticker.items():
    for ten_k in tqdm(ten_ks, desc='Cleaning {} 10-Ks'.format(ticker), unit='10-K'):
        ten_k['file_clean'] = clean_text(ten_k['file'])


project_helper.print_ten_k_data(ten_ks_by_ticker[example_ticker][:5], ['file_clean'])

Cleaning AMZN 10-Ks: 100%|██████████| 18/18 [00:37<00:00,  2.11s/10-K]
Cleaning BMY 10-Ks: 100%|██████████| 24/24 [01:22<00:00,  3.42s/10-K]
Cleaning CNP 10-Ks: 100%|██████████| 16/16 [01:00<00:00,  3.81s/10-K]
Cleaning CVX 10-Ks: 100%|██████████| 22/22 [02:03<00:00,  5.64s/10-K]
Cleaning FL 10-Ks: 100%|██████████| 17/17 [00:28<00:00,  1.69s/10-K]
Cleaning FRT 10-Ks: 100%|██████████| 20/20 [01:01<00:00,  3.06s/10-K]
Cleaning HON 10-Ks: 100%|██████████| 21/21 [01:03<00:00,  3.05s/10-K]

[
  {
    file_clean: '\n10-k\n1\namzn-20171231x10k.htm\n10-k\n\n\n\n\n\...},
  {
    file_clean: '\n10-k\n1\namzn-20161231x10k.htm\nform 10-k\n\n\n...},
  {
    file_clean: '\n10-k\n1\namzn-20151231x10k.htm\nform 10-k\n\n\n...},
  {
    file_clean: '\n10-k\n1\namzn-20141231x10k.htm\nform 10-k\n\n\n...},
  {
    file_clean: '\n10-k\n1\namzn-20131231x10k.htm\nform 10-k\n\n\n...},
]





### Lemmatize
With the text cleaned up, it's time to distill the verbs down. Implement the `lemmatize_words` function to lemmatize verbs in the list of words provided.

In [32]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet


def lemmatize_words(words):
    """
    Lemmatize words 

    Parameters
    ----------
    words : list of str
        List of words

    Returns
    -------
    lemmatized_words : list of str
        List of lemmatized words
    """
    
    # TODO: Implement
    results=[WordNetLemmatizer().lemmatize(x , pos="v") for x in words]
    return results


project_tests.test_lemmatize_words(lemmatize_words)

Tests Passed


With the `lemmatize_words` function implemented, let's lemmatize all the data.

In [34]:
word_pattern = re.compile('\w+')

for ticker, ten_ks in ten_ks_by_ticker.items():
    for ten_k in tqdm(ten_ks, desc='Lemmatize {} 10-Ks'.format(ticker), unit='10-K'):
        ten_k['file_lemma'] = lemmatize_words(word_pattern.findall(ten_k['file_clean']))


project_helper.print_ten_k_data(ten_ks_by_ticker[example_ticker][:5], ['file_lemma'])

Lemmatize AMZN 10-Ks:   0%|          | 0/18 [00:00<?, ?10-K/s]


KeyError: 'file_clean'

### Remove Stopwords

In [None]:
from nltk.corpus import stopwords


lemma_english_stopwords = lemmatize_words(stopwords.words('english'))

for ticker, ten_ks in ten_ks_by_ticker.items():
    for ten_k in tqdm(ten_ks, desc='Remove Stop Words for {} 10-Ks'.format(ticker), unit='10-K'):
        ten_k['file_lemma'] = [word for word in ten_k['file_lemma'] if word not in lemma_english_stopwords]


print('Stop Words Removed')

## Analysis on 10ks
### Loughran McDonald Sentiment Word Lists
We'll be using the Loughran and McDonald sentiment word lists. These word lists cover the following sentiment:
- Negative 
- Positive
- Uncertainty
- Litigious
- Constraining
- Superfluous
- Modal

This will allow us to do the sentiment analysis on the 10-ks. Let's first load these word lists. We'll be looking into a few of these sentiments.

In [None]:
sentiments = ['negative', 'positive', 'uncertainty', 'litigious', 'constraining', 'interesting']

sentiment_df = pd.read_csv('loughran_mcdonald_master_dic_2016.csv')
sentiment_df.columns = [column.lower() for column in sentiment_df.columns] # Lowercase the columns for ease of use

# Remove unused information
sentiment_df = sentiment_df[sentiments + ['word']]
sentiment_df[sentiments] = sentiment_df[sentiments].astype(bool)
sentiment_df = sentiment_df[(sentiment_df[sentiments]).any(1)]

# Apply the same preprocessing to these words as the 10-k words
sentiment_df['word'] = lemmatize_words(sentiment_df['word'].str.lower())
sentiment_df = sentiment_df.drop_duplicates('word')


sentiment_df.head()

### Bag of Words
using the sentiment word lists, let's generate sentiment bag of words from the 10-k documents. Implement `get_bag_of_words` to generate a bag of words that counts the number of sentiment words in each doc. You can ignore words that are not in `sentiment_words`.

In [None]:
from collections import defaultdict, Counter
from sklearn.feature_extraction.text import CountVectorizer


def get_bag_of_words(sentiment_words, docs):
    """
    Generate a bag of words from documents for a certain sentiment

    Parameters
    ----------
    sentiment_words: Pandas Series
        Words that signify a certain sentiment
    docs : list of str
        List of documents used to generate bag of words

    Returns
    -------
    bag_of_words : 2-d Numpy Ndarray of int
        Bag of words sentiment for each document
        The first dimension is the document.
        The second dimension is the word.
    """
    
    # TODO: Implement
    
    return None


project_tests.test_get_bag_of_words(get_bag_of_words)

Using the `get_bag_of_words` function, we'll generate a bag of words for all the documents.

In [None]:
sentiment_bow_ten_ks = {}

for ticker, ten_ks in ten_ks_by_ticker.items():
    lemma_docs = [' '.join(ten_k['file_lemma']) for ten_k in ten_ks]
    
    sentiment_bow_ten_ks[ticker] = {
        sentiment: get_bag_of_words(sentiment_df[sentiment_df[sentiment]]['word'], lemma_docs)
        for sentiment in sentiments}


project_helper.print_ten_k_data([sentiment_bow_ten_ks[example_ticker]], sentiments)

### Jaccard Similarity
Using the bag of words, let's calculate the jaccard similarity on the bag of words and plot it over time. Implement `get_jaccard_similarity` to return the jaccard similarities between each tick in time. Since the input, `bag_of_words_matrix`, is a bag of words for each time period in order, you just need to compute the jaccard similarities for each neighboring bag of words. Make sure to turn the bag of words into a boolean array when calculating the jaccard similarity.

In [None]:
from sklearn.metrics import jaccard_similarity_score


def get_jaccard_similarity(bag_of_words_matrix):
    """
    Get jaccard similarities for neighboring documents

    Parameters
    ----------
    bag_of_words : 2-d Numpy Ndarray of int
        Bag of words sentiment for each document
        The first dimension is the document.
        The second dimension is the word.

    Returns
    -------
    jaccard_similarities : list of float
        Jaccard similarities for neighboring documents
    """
    
    # TODO: Implement
    
    return None


project_tests.test_get_jaccard_similarity(get_jaccard_similarity)

Using the `get_jaccard_similarity` function, let's plot the similarities over time.

In [None]:
# Get dates for the universe
file_dates = {
    ticker: [ten_k['file_date'] for ten_k in ten_ks]
    for ticker, ten_ks in ten_ks_by_ticker.items()}  

jaccard_similarities = {
    ticker: {
        sentiment_name: get_jaccard_similarity(sentiment_values)
        for sentiment_name, sentiment_values in ten_k_sentiments.items()}
    for ticker, ten_k_sentiments in sentiment_bow_ten_ks.items()}


project_helper.plot_similarities(
    [jaccard_similarities[example_ticker][sentiment] for sentiment in sentiments],
    file_dates[example_ticker][1:],
    'Jaccard Similarities for {} Sentiment'.format(example_ticker),
    sentiments)

### TFIDF
using the sentiment word lists, let's generate sentiment TFIDF from the 10-k documents. Implement `get_tfidf` to generate TFIDF from each document, using sentiment words as the terms. You can ignore words that are not in `sentiment_words`.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


def get_tfidf(sentiment_words, docs):
    """
    Generate TFIDF values from documents for a certain sentiment

    Parameters
    ----------
    sentiment_words: Pandas Series
        Words that signify a certain sentiment
    docs : list of str
        List of documents used to generate bag of words

    Returns
    -------
    tfidf : 2-d Numpy Ndarray of float
        TFIDF sentiment for each document
        The first dimension is the document.
        The second dimension is the word.
    """
    
    # TODO: Implement
    
    return None


project_tests.test_get_tfidf(get_tfidf)

Using the `get_tfidf` function, let's generate the TFIDF values for all the documents.

In [None]:
sentiment_tfidf_ten_ks = {}

for ticker, ten_ks in ten_ks_by_ticker.items():
    lemma_docs = [' '.join(ten_k['file_lemma']) for ten_k in ten_ks]
    
    sentiment_tfidf_ten_ks[ticker] = {
        sentiment: get_tfidf(sentiment_df[sentiment_df[sentiment]]['word'], lemma_docs)
        for sentiment in sentiments}

    
project_helper.print_ten_k_data([sentiment_tfidf_ten_ks[example_ticker]], sentiments)

### Cosine Similarity
Using the TFIDF values, we'll calculate the cosine similarity and plot it over time. Implement `get_cosine_similarity` to return the cosine similarities between each tick in time. Since the input, `tfidf_matrix`, is a TFIDF vector for each time period in order, you just need to computer the cosine similarities for each neighboring vector.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity


def get_cosine_similarity(tfidf_matrix):
    """
    Get cosine similarities for each neighboring TFIDF vector/document

    Parameters
    ----------
    tfidf : 2-d Numpy Ndarray of float
        TFIDF sentiment for each document
        The first dimension is the document.
        The second dimension is the word.

    Returns
    -------
    cosine_similarities : list of float
        Cosine similarities for neighboring documents
    """
    
    # TODO: Implement
    
    return None


project_tests.test_get_cosine_similarity(get_cosine_similarity)

Let's plot the cosine similarities over time.

In [None]:
cosine_similarities = {
    ticker: {
        sentiment_name: get_cosine_similarity(sentiment_values)
        for sentiment_name, sentiment_values in ten_k_sentiments.items()}
    for ticker, ten_k_sentiments in sentiment_tfidf_ten_ks.items()}


project_helper.plot_similarities(
    [cosine_similarities[example_ticker][sentiment] for sentiment in sentiments],
    file_dates[example_ticker][1:],
    'Cosine Similarities for {} Sentiment'.format(example_ticker),
    sentiments)

## Evaluate Alpha Factors
Just like we did in project 4, let's evaluate the alpha factors. For this section, we'll just be looking at the cosine similarities, but it can be applied to the jaccard similarities as well.
### Price Data
Let's get yearly pricing to run the factor against, since 10-Ks are produced annually.

In [None]:
pricing = pd.read_csv('../../data/project_5_yr/yr-quotemedia.csv', parse_dates=['date'])
pricing = pricing.pivot(index='date', columns='ticker', values='adj_close')


pricing

### Dict to DataFrame
The alphalens library uses dataframes, so we we'll need to turn our dictionary into a dataframe. 

In [None]:
cosine_similarities_df_dict = {'date': [], 'ticker': [], 'sentiment': [], 'value': []}


for ticker, ten_k_sentiments in cosine_similarities.items():
    for sentiment_name, sentiment_values in ten_k_sentiments.items():
        for sentiment_values, sentiment_value in enumerate(sentiment_values):
            cosine_similarities_df_dict['ticker'].append(ticker)
            cosine_similarities_df_dict['sentiment'].append(sentiment_name)
            cosine_similarities_df_dict['value'].append(sentiment_value)
            cosine_similarities_df_dict['date'].append(file_dates[ticker][1:][sentiment_values])

cosine_similarities_df = pd.DataFrame(cosine_similarities_df_dict)
cosine_similarities_df['date'] = pd.DatetimeIndex(cosine_similarities_df['date']).year
cosine_similarities_df['date'] = pd.to_datetime(cosine_similarities_df['date'], format='%Y')


cosine_similarities_df.head()

### Alphalens Format
In order to use a lot of the alphalens functions, we need to aligned the indices and convert the time to unix timestamp. In this next cell, we'll do just that.

In [None]:
import alphalens as al


factor_data = {}
skipped_sentiments = []

for sentiment in sentiments:
    cs_df = cosine_similarities_df[(cosine_similarities_df['sentiment'] == sentiment)]
    cs_df = cs_df.pivot(index='date', columns='ticker', values='value')

    try:
        data = al.utils.get_clean_factor_and_forward_returns(cs_df.stack(), pricing, quantiles=5, bins=None, periods=[1])
        factor_data[sentiment] = data
    except:
        skipped_sentiments.append(sentiment)

if skipped_sentiments:
    print('\nSkipped the following sentiments:\n{}'.format('\n'.join(skipped_sentiments)))
factor_data[sentiments[0]].head()

### Alphalens Format with Unix Time
Alphalen's `factor_rank_autocorrelation` and `mean_return_by_quantile` functions require unix timestamps to work, so we'll also create factor dataframes with unix time.

In [None]:
unixt_factor_data = {
    factor: data.set_index(pd.MultiIndex.from_tuples(
        [(x.timestamp(), y) for x, y in data.index.values],
        names=['date', 'asset']))
    for factor, data in factor_data.items()}

### Factor Returns
Let's view the factor returns over time. We should be seeing it generally move up and to the right.

In [None]:
ls_factor_returns = pd.DataFrame()

for factor_name, data in factor_data.items():
    ls_factor_returns[factor_name] = al.performance.factor_returns(data).iloc[:, 0]

(1 + ls_factor_returns).cumprod().plot()

### Basis Points Per Day per Quantile
It is not enough to look just at the factor weighted return. A good alpha is also monotonic in quantiles. Let's looks the basis points for the factor returns.

In [None]:
qr_factor_returns = pd.DataFrame()

for factor_name, data in unixt_factor_data.items():
    qr_factor_returns[factor_name] = al.performance.mean_return_by_quantile(data)[0].iloc[:, 0]

(10000*qr_factor_returns).plot.bar(
    subplots=True,
    sharey=True,
    layout=(5,3),
    figsize=(14, 14),
    legend=False)

### Turnover Analysis
Without doing a full and formal backtest, we can analyze how stable the alphas are over time. Stability in this sense means that from period to period, the alpha ranks do not change much. Since trading is costly, we always prefer, all other things being equal, that the ranks do not change significantly per period. We can measure this with the **Factor Rank Autocorrelation (FRA)**.

In [None]:
ls_FRA = pd.DataFrame()

for factor, data in unixt_factor_data.items():
    ls_FRA[factor] = al.performance.factor_rank_autocorrelation(data)

ls_FRA.plot(title="Factor Rank Autocorrelation")

### Sharpe Ratio of the Alphas
The last analysis we'll do on the factors will be sharpe ratio. Let's see what the sharpe ratio for the factors are. Generally, a Sharpe Ratio of near 1.0 or higher is an acceptable single alpha for this universe.

In [None]:
daily_annualization_factor = np.sqrt(252)

(daily_annualization_factor * ls_factor_returns.mean() / ls_factor_returns.std()).round(2)

That's it! You've successfully done sentiment analysis on 10-ks!
## Submission
Now that you're done with the project, it's time to submit it. Click the submit button in the bottom right. One of our reviewers will give you feedback on your project with a pass or not passed grade. You can continue to the next section while you wait for feedback.