# Public Float Data from 10K statements

This notebook provides a basic description and some sample code for extracting public float data from the annual reports.

First, we load the Python libraries required for the analysis.

In [1]:
import re
import bs4 as bs
import html2text as trans
import textwrap
import datefinder
import datetime
import csv
from string import punctuation

The 10K text files saved on KLC are the HTML pages of the annual reports filed on the SEC website. These HTML documents were saved as text files with their original markup intact. To create plain text documents, the function below removes the HTML formatting.

In [2]:
# function to change html to text
def clean_html(html):
    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Finally, we deal with whitespace: convert non-breaking spaces,
    # then collapse any run of spaces into a single space
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r" {2,}", " ", cleaned)
    return cleaned.strip()
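
`clean_html` converts only `&nbsp;`; other character references such as `&amp;` or `&#160;` would pass through unchanged. The following is a minimal sketch of one way to decode them, assuming the standard-library `html.unescape` is acceptable for these filings. `decode_entities` is a hypothetical helper and not part of the original notebook.

```python
import html
import re


def decode_entities(text):
    """Decode HTML character references, then collapse runs of whitespace.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    decoded = html.unescape(text)            # "&amp;" -> "&", "&nbsp;" -> "\xa0"
    decoded = decoded.replace("\xa0", " ")   # treat non-breaking spaces as plain spaces
    return re.sub(r"\s+", " ", decoded).strip()


print(decode_entities("Air Products &amp; Chemicals,&nbsp;&nbsp;Inc.&#160;"))
```

If used, a step like this would probably run after the tags have been removed, so that a decoded `<` or `>` is not mistaken for markup.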

Here we read in an example 10K for Air Products and Chemicals.

In [3]:
# Path to one 10K, selected at random
path = "/kellogg/data/EDGAR/10-K/2018/1961_2_000126493-18-000031.txt"
with open(path, "r") as f: # the with block closes the file automatically
    sec = f.read()
sec = clean_html(sec)
sec = trans.html2text(sec) # this is another python library that converts html to text
sec = sec.replace('\n', ' ') # removes any remaining new lines
sec = sec.replace('\t', ' ') # removes any remaining tabs
sec = sec.replace('\r', ' ') # removes any remaining carriage returns
sec_long = sec # this contains the text from the entire document
sec = sec[:50000] # this only contains the first 50,000 characters
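
The read-and-normalize steps can be wrapped in a small function so that they can be applied to any filing. This is only a sketch: `load_filing` and its `head_chars` parameter are hypothetical names, the HTML cleaning step is left out to keep the block self-contained, and the demonstration uses a throwaway temporary file rather than a real 10K.

```python
import os
import tempfile


def load_filing(path, head_chars=50000):
    """Return (full_text, head) for a filing saved as text.

    Hypothetical helper; `head_chars` mirrors the 50,000-character slice above.
    """
    # errors="replace" keeps reading even if the file has stray non-UTF-8 bytes
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        text = f.read()
    # Replace new lines, tabs, and carriage returns with plain spaces
    for ch in ("\n", "\t", "\r"):
        text = text.replace(ch, " ")
    return text, text[:head_chars]


# Demonstration on a throwaway file rather than a real filing
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("COMPANY CONFORMED NAME:\tEXAMPLE CO\nCENTRAL INDEX KEY:\t0000000000\n")
full_text, head = load_filing(tmp.name, head_chars=23)
os.remove(tmp.name)
print(repr(head))
```

Returning both the full text and the leading slice matches how the later cells use `sec_long` for the cover page searches and `sec` for the header fields.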

## Company Name
Here we use a regular expression to retrieve the company name. The pattern captures all of the text between the "COMPANY CONFORMED NAME:" label and the next label in the header, which begins with "CENTRAL".

In [None]:
name = re.findall(r'COMPANY CONFORMED NAME:(.*?)CENTRAL', sec, re.S)
print(name[0].strip() if name else None) # prints the first match without surrounding whitespace
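
The "text between two labels" idea can be sketched as a small reusable function. The header text below is made up in the style of an EDGAR filing header, and `between` is a hypothetical helper name. It uses `re.search` and returns `None` when a label is missing, instead of slicing the string form of a list.

```python
import re

# Made-up header text in the style of an EDGAR filing header (not from a real filing)
sample_header = (
    "COMPANY CONFORMED NAME: EXAMPLE HOLDINGS INC "
    "CENTRAL INDEX KEY: 0000000000 "
    "STANDARD INDUSTRIAL CLASSIFICATION: SERVICES [0000]"
)


def between(label, next_label, text):
    """Return the stripped text between two labels, or None if not found."""
    # (.*?) is non-greedy, so the capture stops at the first next_label
    pattern = re.escape(label) + r"(.*?)" + re.escape(next_label)
    match = re.search(pattern, text, re.S)
    return match.group(1).strip() if match else None


company = between("COMPANY CONFORMED NAME:", "CENTRAL", sample_header)
missing = between("FORMER COMPANY NAME:", "CENTRAL", sample_header)
print(company)   # EXAMPLE HOLDINGS INC
print(missing)   # None
```

The same function would cover the other header fields by changing the two labels, for example "CENTRAL INDEX KEY:" and "STANDARD".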

## Fiscal Year End
Here we use regular expressions to extract the Fiscal Year End (month, day, and year).

In [None]:
fye = re.findall(r'CONFORMED PERIOD OF REPORT:(.*?)FILED', sec, re.S)
fye = fye[0].strip() # the first match, in YYYYMMDD form, without surrounding whitespace
fye_year = fye[0:4]
fye_month = fye[4:6]
fye_day = fye[6:8]
fye = fye_month + '/' + fye_day + '/' + fye_year
print(fye)
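
As an alternative to slicing the string by position, the period can be parsed with `datetime.strptime`, which also rejects values that are not real dates. This is a sketch under the assumption that the field is always in YYYYMMDD form; the value of `raw_period` is illustrative and not taken from a filing.

```python
from datetime import datetime

raw_period = "20180930"  # illustrative value in YYYYMMDD form, not taken from a filing

try:
    period_end = datetime.strptime(raw_period, "%Y%m%d")
    fye_formatted = period_end.strftime("%m/%d/%Y")
except ValueError:
    # Raised when the string is not a valid YYYYMMDD date (e.g. an empty match)
    fye_formatted = None

print(fye_formatted)  # 09/30/2018
```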

## Central Index Key

In [None]:
cik = re.findall(r'CENTRAL INDEX KEY:(.*?)STANDARD', sec, re.S)
print(cik[0].strip() if cik else None)

## Public Float Text String
Here we extract the Public Float text string. Please note that the regular expressions reflect patterns within the text. Something like '\s{1,2}' means that there could be one or two whitespace characters between two words.

In [None]:
regex_pfloat = re.compile(r'(.{0,10})held\s{0,2}by\s{0,2}non(.*?)affiliates(.*?)[.]\s{1,2}', re.I | re.S) # raw string so that \s is passed to the regex engine unchanged
pfloat = regex_pfloat.search(sec_long)
pfloat = pfloat.group()
pfloat = pfloat.replace('  ', ' ') # replaces two blank spaces with one blank space
print(pfloat)
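
To see how the whitespace quantifiers behave, the pattern can be tried on a short made-up sentence. The sample text below is invented for illustration, with a deliberate double space after "held". The sketch also checks for `None`, since `search` returns `None` when a filing does not contain the phrase and calling `.group()` on it would raise an error.

```python
import re

# Same pattern as the cell above, written as a raw string
regex_pfloat = re.compile(
    r"(.{0,10})held\s{0,2}by\s{0,2}non(.*?)affiliates(.*?)[.]\s{1,2}",
    re.I | re.S,
)

# Made-up cover-page text; note the double space after "held"
sample_text = (
    "The aggregate market value of the voting stock held  by non-affiliates "
    "of the registrant on March 31, 2018 was approximately $34.1 billion. "
    "The number of shares outstanding was 100."
)

match = regex_pfloat.search(sample_text)
if match is None:
    sentence = None  # the phrase was not found in this text
else:
    # Collapse any run of whitespace and trim the ends
    sentence = re.sub(r"\s+", " ", match.group()).strip()
print(sentence)
```

Two details are worth noting. The match begins at most 10 characters before "held", so the start of the sentence is cut off. The match ends at the first period that is followed by whitespace, which is why the decimal point in "$34.1" does not end it early.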

## Public Float Dollar Amount

In [None]:
regex_pfloatamt = re.compile(r'\$[\s]?([\d\.\,]+)[\s]*(thousand|million|billion|trillion)?', re.S)
pfloatamt = regex_pfloatamt.search(pfloat)
pfloatamt = pfloatamt.group() # group() already returns a string
print(pfloatamt)
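
The extracted amount is still a string such as "$34.1 billion". For analysis it usually needs to be a number, so the following sketch converts it. `parse_dollars` and `SCALE` are hypothetical names, the pattern is a slightly tightened variant of the one above (it requires the number to end in a digit so that a sentence-ending period is not captured), and `Decimal` is used to avoid floating-point rounding.

```python
import re
from decimal import Decimal, InvalidOperation

# Multipliers for the optional unit word that follows the number
SCALE = {"thousand": 10**3, "million": 10**6, "billion": 10**9, "trillion": 10**12}

regex_amount = re.compile(
    r"\$\s?([\d.,]*\d)\s*(thousand|million|billion|trillion)?", re.I
)


def parse_dollars(text):
    """Return the first dollar amount in `text` as a Decimal, or None."""
    match = regex_amount.search(text)
    if match is None:
        return None
    try:
        number = Decimal(match.group(1).replace(",", ""))
    except InvalidOperation:
        return None  # digits and punctuation that do not form a number
    unit = (match.group(2) or "").lower()
    return number * SCALE.get(unit, 1)


print(parse_dollars("was approximately $34.1 billion."))
print(parse_dollars("was $1,234,567,890."))
```

Amounts stated in a table header, such as "(in millions)", would not be scaled by this sketch, since it only looks at a unit word directly after the number.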

## Public Float Date

In [None]:
pfloat_dates = list(datefinder.find_dates(pfloat))
tempdateList = []
for p_date in pfloat_dates:
    # keep only dates that fall in 2017 or 2018
    if p_date.year in (2017, 2018):
        tempdateList.append(p_date.strftime('%m/%d/%Y'))
pfloat_dates = str(tempdateList)
print(pfloat_dates)
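
The year check above suggests that some candidate dates need to be filtered out. A stricter alternative, shown here only as a sketch, swaps `datefinder` for a standard-library regular expression that matches dates written as "Month D, YYYY". `find_long_dates`, `MONTHS`, and `allowed_years` are hypothetical names, and the approach assumes English month names and the default locale for `%B`.

```python
import re
from datetime import datetime

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
regex_date = re.compile(r"\b(" + MONTHS + r")\s+(\d{1,2}),\s+(\d{4})\b")


def find_long_dates(text, allowed_years=(2017, 2018)):
    """Return MM/DD/YYYY strings for 'Month D, YYYY' dates in the allowed years."""
    found = []
    for month, day, year in regex_date.findall(text):
        try:
            parsed = datetime.strptime(f"{month} {day} {year}", "%B %d %Y")
        except ValueError:
            continue  # not a real calendar date, e.g. "February 30, 2018"
        if parsed.year in allowed_years:
            found.append(parsed.strftime("%m/%d/%Y"))
    return found


print(find_long_dates("held by non-affiliates on March 31, 2018 was $34.1 billion."))
```

The trade-off is coverage: this sketch misses dates in other formats, such as "3/31/2018" or "31 March 2018", which a general-purpose date finder may catch.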

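## Filer Type Table
Here we find the filer type table. The pattern captures the text between the reference to Rule 12b-2 of the Exchange Act and the phrase "If an emerging growth company".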

In [None]:
regex_filers = re.compile(r'rule\s{0,2}12b-2\s{0,2}of\s{0,2}the\s{0,2}exchange\s{0,2}act(.*?)if\s{0,2}an\s{0,2}emerging\s{0,2}growth\s{0,2}company\s{0,2}', re.S | re.I)
filers = regex_filers.search(sec_long)
filers = filers.group()
filers = filers.replace('  ', ' ')
print(filers)

## Large Accelerated Filer

In [None]:
largefiler = re.findall(r'large\s{0,2}accelerated\s{0,2}filer(.*?)\s{1,2}[a-z]{2}', filers, re.S | re.I)
largefiler = largefiler[0].strip() if largefiler else '' # text found between the label and the next word
print(largefiler)
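
The captured text is whatever sits between the "Large accelerated filer" label and the next word, which in practice would be the checkbox mark. The sketch below turns that into a yes/no value. It rests on two loud assumptions: a checked box leaves some marker (such as "x" or "☒") after the HTML is stripped, and an unchecked box leaves either nothing or "☐". Real filings vary, so the markers should be checked against actual documents. `is_large_accelerated` and both sample strings are hypothetical.

```python
import re

# Same pattern as the cell above
regex_large = re.compile(
    r"large\s{0,2}accelerated\s{0,2}filer(.*?)\s{1,2}[a-z]{2}", re.S | re.I
)


def is_large_accelerated(filers_text):
    """Return True/False for a checked/unchecked box, or None if the label is absent.

    Assumes an unchecked box leaves no marker or a "☐" after the label.
    """
    match = regex_large.search(filers_text)
    if match is None:
        return None
    marker = match.group(1).strip()
    return marker not in ("", "☐")


# Made-up table text for illustration
checked = "Large accelerated filer x Accelerated filer Non-accelerated filer"
unchecked = "Large accelerated filer Accelerated filer x Non-accelerated filer"
print(is_large_accelerated(checked), is_large_accelerated(unchecked))
```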