# SEC Finance Data ETL

This notebook **[E]xtracts** data from the SEC website and **[T]ransforms** it for loading into a MongoDB database. It also lists the commands used to **[L]oad** the data into MongoDB.

## Extract

The data is extracted from the SEC website, https://data.sec.gov/api/xbrl/companyfacts/, with one request per CIK.

CIK (Central Index Key): the unique key that identifies a company in the SEC database.
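As a minimal sketch of how a request for one company can be put together: the URLs printed later in this notebook show the CIK zero-padded to 10 digits, and the request headers carry an email address as the `User-Agent`. The helper names `company_facts_url` and `sec_headers` are illustrative and do not appear in the notebook.

```python
BASE_URL = "https://data.sec.gov/api/xbrl/companyfacts/"


def company_facts_url(cik):
    """Build the company-facts URL for a CIK, zero-padded to 10 digits."""
    return f"{BASE_URL}CIK{int(cik):010d}.json"


def sec_headers(email_id):
    """Headers identifying the caller; the email is sent as the User-Agent."""
    return {"User-Agent": email_id, "Host": "data.sec.gov"}


# Example: the CIK for Alphabet Inc. used in this notebook
print(company_facts_url(1652044))
```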

Here is a sample filing for data element 'Revenues':

"Revenues":{"label":"Revenues","description":"Amount of revenue recognized from goods sold, services rendered, insurance premiums, or other activities that constitute an earning process. Includes, but is not limited to, investment and interest income before deduction of interest expense when recognized as a component of revenue, and sales and trading gain (loss).","units":{"USD":[{"start":"2013-01-01","end":"2013-12-31","val":55519000000,"accn":"0001652044-16-000012","fy":2015,"fp":"FY","form":"10-K","filed":"2016-02-11"}]}}
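To make the nesting in the sample easier to follow, here is a small sketch that walks one data element down to its individual filings. It uses a trimmed copy of the sample (the description is shortened), and `iter_filings` is a hypothetical helper, not a function from this notebook.

```python
import json

# Trimmed copy of the sample filing above (description shortened)
sample = json.loads("""
{
  "Revenues": {
    "label": "Revenues",
    "description": "Amount of revenue recognized ...",
    "units": {
      "USD": [
        {"start": "2013-01-01", "end": "2013-12-31", "val": 55519000000,
         "accn": "0001652044-16-000012", "fy": 2015, "fp": "FY",
         "form": "10-K", "filed": "2016-02-11"}
      ]
    }
  }
}
""")


def iter_filings(facts, element):
    """Yield (unit, filing) pairs for one data element."""
    for unit, filings in facts[element]["units"].items():
        for filing in filings:
            yield unit, filing


rows = list(iter_filings(sample, "Revenues"))
for unit, filing in rows:
    print(unit, filing["form"], filing["start"], filing["end"], filing["val"])
```

Each data element maps unit names (here `USD`) to a list of filings, so a consumer has to loop over both levels.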

## Transform

Only filings submitted on the following forms are considered for transformation:
* 8-K - Current Report
* 10-Q - Quarterly Report
* 10-K - Annual Report
* 10-Q/A - Quarterly Report Amendment
* 10-K/A - Annual Report Amendment

The data often contains multiple filings for the same reporting period. Only the most recently filed one is kept, so that the output reflects the latest reported figures.
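The idea can be sketched as follows, assuming each filing carries `end` and `filed` dates (and usually `start`) in `YYYY-MM-DD` form, as in the sample above. `keep_latest_per_period` is an illustrative name and the three filings are made-up values; the notebook's own implementation appears later in `transformData`.

```python
def keep_latest_per_period(filings):
    """Return one filing per (start, end) period: the most recently filed one."""
    latest = {}
    for filing in filings:
        period = (filing.get("start", "-"), filing["end"])
        current = latest.get(period)
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings
        if current is None or filing["filed"] > current["filed"]:
            latest[period] = filing
    # Return the survivors in ascending order of period end date
    return sorted(latest.values(), key=lambda f: f["end"])


# Hypothetical filings: the first period was reported twice
filings = [
    {"start": "2017-01-01", "end": "2017-03-31", "val": 100, "filed": "2017-04-28"},
    {"start": "2017-01-01", "end": "2017-03-31", "val": 105, "filed": "2018-04-27"},
    {"start": "2017-04-01", "end": "2017-06-30", "val": 120, "filed": "2017-07-28"},
]

result = keep_latest_per_period(filings)
print([f["val"] for f in result])
```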

The reporting period is derived from the 'start' and 'end' dates provided in each filing.


## Load

* The JSON records are structured to match the database design
* Commands are provided to load the data into MongoDB

In [50]:
import requests
import pandas as pd
from datetime import datetime
import pprint
import json

pp = pprint.PrettyPrinter(indent=4)

In [51]:
# SEC API requires an email ID to be provided

email_id = input("what is your email id? [SEC requires you to provide your email ID to pull data from SEC website]")


what is your email id? [SEC requires you to provide your email ID to pull data from SEC website]ramkumarpj@gmail.com


In [52]:
# Base URL to retrieve company facts from the SEC Data API
base_url = "https://data.sec.gov/api/xbrl/companyfacts/"

# Headers required to receive an appropriate response from the SEC Data API
headers = {
    'User-Agent' : email_id,
    'Host' : 'data.sec.gov'
}


In [53]:
# CIK - Central Index Key - Unique key that identifies a company in SEC Database

# List of CIKs under analysis

cik_list = [808362, 1652044, 1637459 ]

# Data elements to be explored for each CIK

data_elements = [ 'Revenues',
                      'SalesRevenueGoodsNet',
                      'SalesRevenueServicesNet',
                      'RevenueFromContractWithCustomerIncludingAssessedTax',
                      'RevenueFromContractWithCustomerExcludingAssessedTax',
                      'GrossProfit',
                      'OperatingIncomeLoss',
                      'NetIncomeLoss',
                      'ResearchAndDevelopmentExpense',
                      'SellingAndMarketingExpense',
                      'ShareBasedCompensation',
                      'Depreciation',
                      'AllocatedShareBasedCompensationExpense',
                      'CostsAndExpenses',
                      'GeneralAndAdministrativeExpense',
                      'InterestExpense',
                      'LeaseAndRentalExpense',
                      'MarketingAndAdvertisingExpense',
                      'OtherAccruedLiabilitiesCurrent',
                      'EntityCommonStockSharesOutstanding',
                      'EntityPublicFloat']


# Keep only records whose reporting year is later than this year
filter_by_year = 2016


In [54]:
# Check whether this filing is a quarterly or annual filing

def isQuarterlyOrAnnualFiling(start, end):

    start_date = datetime.strptime(start, '%Y-%m-%d')
    end_date = datetime.strptime(end, '%Y-%m-%d')
    
    return end_date.month - start_date.month <3 or end_date.month - start_date.month == 11

# Check whether this filing is a quarterly filing

def isQuarterlyFiling(start, end):

    start_date = datetime.strptime(start, '%Y-%m-%d')
    end_date = datetime.strptime(end, '%Y-%m-%d')
    
    return end_date.month - start_date.month <3 

# Check whether this filing is an annual filing

def isAnnualFiling(start, end):

    start_date = datetime.strptime(start, '%Y-%m-%d')
    end_date = datetime.strptime(end, '%Y-%m-%d')
    
    return end_date.month - start_date.month == 11
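The checks above compare calendar months only, so a period that crosses a year boundary can be misclassified: a fiscal year running from 2016-10-01 to 2017-09-30 gives 9 - 10 = -1, which passes the quarterly test. One possible alternative, sketched here and not used by this notebook, classifies a period by its length in days. The function name `classify_period` and the day thresholds are assumptions chosen to tolerate 13/14-week quarters and 52/53-week years.

```python
from datetime import datetime


def classify_period(start, end):
    """Classify a reporting period as 'quarter', 'annual' or 'other' by its length."""
    start_date = datetime.strptime(start, "%Y-%m-%d")
    end_date = datetime.strptime(end, "%Y-%m-%d")
    days = (end_date - start_date).days + 1  # inclusive of both end points

    # Thresholds are illustrative assumptions, not SEC definitions
    if 80 <= days <= 100:
        return "quarter"
    if 350 <= days <= 380:
        return "annual"
    return "other"


print(classify_period("2017-01-01", "2017-03-31"))
print(classify_period("2016-10-01", "2017-09-30"))
print(classify_period("2017-01-01", "2017-06-30"))
```

Swapping this in would change which filings survive the quarterly/annual filter, so the record counts shown later in the notebook would likely differ.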
    

In [55]:
# Extract Quarter in the filings. 
# Returns 0 if annual filing
# Return 1 if Quarter 1, Return 2 if Quarter 2, etc

def getQuarterInFiling(filing):
    
    if 'start' in filing:
        if isQuarterlyFiling(filing['start'], filing['end']):
            if filing['fp'] and filing['fp'][1] != 'Y' :
                return int(filing['fp'][1])
            else :
                # Derive the calendar quarter (1-4) from the end month as an integer
                return (datetime.strptime(filing['end'], '%Y-%m-%d').month - 1) // 3 + 1
        elif isAnnualFiling(filing['start'], filing['end']):
            return 0
    else:
        if not filing['fp'] or (filing['fp'] and filing['fp'][1] == 'Y'):
            return 0
        else:
            return int(filing['fp'][1])
    

In [56]:
# Return year in the filing

def getYearInFiling(filing):
    
    end_date = datetime.strptime(filing['end'], '%Y-%m-%d') 
    
    return end_date.year

In [57]:
# Extract data from the filings, keeping only the forms of interest - 10-Q, 10-K, 8-K, 10-Q/A, 10-K/A

def extractData(tenQ_tenK_filings_list, items, key):

    # Forms of interest
    forms = {'10-Q', '10-K', '8-K', '10-Q/A', '10-K/A'}

    if key in items:
        # A data element can be reported in more than one unit (e.g. USD)
        for unit, fin_list in items[key]['units'].items():

            # Keep only the filings submitted on one of the forms of interest
            tenQ_tenK_filings = [filing for filing in fin_list if filing['form'] in forms]

            tenQ_tenK_filings_list.append({
                'key' : key,
                'units' : unit,
                'filings' : tenQ_tenK_filings
            })
            

In [58]:

# Function to transform data 

def transformData(tenQ_tenK_filings_list, key):
    
    if len(tenQ_tenK_filings_list) > 0 :
        filings = tenQ_tenK_filings_list[0]['filings']
    
    
        tenQ_tenK_filings_indexed = {}

        # Build a dictionary with a key using start and end fields 
        for filing in filings:

            if 'start' not in filing:
                start = '-'
            else:
                start = filing['start']
            
            index = start + ':' + filing['end']
            
    
            if index in tenQ_tenK_filings_indexed :
                tenQ_tenK_filings_indexed[index].append(filing)
            else :
                tenQ_tenK_filings_indexed[index] = [filing]
                
        # Identify multiple filings for same period
        tenQ_tenK_multiple_filings = [filing for filing in tenQ_tenK_filings_indexed.values() 
                                      if len(filing) > 1]
        
        # Identify periods that have only a single filing
        tenQ_tenK_single_filings = [filing[0] for filing in tenQ_tenK_filings_indexed.values() 
                                    if len(filing) < 2]
        
        # Sort multiple filings in descending order of filed date 
        # Append the latest filing to single filings list
        for filing in tenQ_tenK_multiple_filings:
            filing.sort(key = lambda x: 
                        datetime.strptime(x['filed'], '%Y-%m-%d'), reverse = True)
            tenQ_tenK_single_filings.append(filing[0])
 
        # Sort single filings in ascending order of end date
        tenQ_tenK_single_filings.sort(key = lambda x: datetime.strptime(x['end'], '%Y-%m-%d'))
                
        # Filter single filings to keep only quarterly and annual filings (eliminate 6-month and 9-month filings)
        tenQ_tenK_single_filings_qtr_annnual_filtered = [filing for filing 
                                                          in tenQ_tenK_single_filings 
                                                          if 'start' in filing and isQuarterlyOrAnnualFiling(
                                                              filing['start'], filing['end'])]
        
    
        if len(tenQ_tenK_single_filings_qtr_annnual_filtered) == 0:
            tenQ_tenK_single_filings_qtr_annnual_filtered = tenQ_tenK_single_filings
                
        return tenQ_tenK_single_filings_qtr_annnual_filtered
        

In [59]:
# Create List of dictionaries to be imported into MongoDB Finance Collection

# { 
# 'cik' : 4949494,
# 'dataType' : 'Revenues',
# 'value' : 4949494949,
# 'qtr' : 1,
# 'year' : 2018
# }
    
def getFinanceData(tenQ_tenK_filings_transformed, cik, key):
    
    finance_records = [ {'cik' : cik,
                        'dataType' : key,
                        'value' : filing['val'],
                        'qtr' : getQuarterInFiling(filing),
                        'year' : getYearInFiling(filing)
                        } for filing in tenQ_tenK_filings_transformed]

    #print(f' cik={cik}, key={key},  Length of the finance records {len(finance_records)}')

    # Filter records by year
    finance_records = [record for record in finance_records if record['year'] > filter_by_year]
    
    #print(f' cik={cik}, key={key},  After filtering by year {filter_by_year}: Length of the finance records {len(finance_records)}')
    
    return finance_records
    
    

In [60]:
# Save data to file
def saveDataToFile(data, file_name):
    
    with open(file_name,'w') as fi:
        fi.write(json.dumps(data, indent=4))
    
    print(f"Completed writing to file {file_name}")

## Extract and Transform

In [61]:
# List that keeps the transformed company data
company_data = []

# List that keeps the transformed finance data 
finance_data = []

# Iterate through each CIK in the list

for cik in cik_list:
    
    # Create the URL to retrieve data for specific CIK
    url = base_url + f'CIK{str(cik).zfill(10)}.json'

    print(url)
    
    # Fetch the data from SEC Data API
    response = requests.get(url, headers=headers).json()

    print(f"received data for company- {response['entityName']}, cik = {response['cik']}")
    company_data.append({'cik' : response['cik'],
                        'companyName' : response['entityName']})
    # Get DEI Items from response
    dei = response['facts']['dei']

    # Get US-GAAP Items from response
    us_gaap = response['facts']['us-gaap']
    
    # Extract and transform the data for each data element
    for key in data_elements: 
        
        # List to keep filings extracted for the current data element
        tenQ_tenK_filings_list = []
        
        # Extract filings if the element is available in US-GAAP
        extractData(tenQ_tenK_filings_list, us_gaap, key)
        
        # Extract filings if the element is available in DEI
        extractData(tenQ_tenK_filings_list, dei, key)
        
        # List to keep transformed data for the current data element
        tenQ_tenK_filings_transformed = []
        
        # Transform the data - Phase 1 (eliminate duplicates)
        tenQ_tenK_filings_transformed = transformData(tenQ_tenK_filings_list, key)
        
        # Transform the data - Phase 2 (create the data structure and also identify the reporting period)
        if tenQ_tenK_filings_transformed is not None:
            finance_data.extend(getFinanceData(tenQ_tenK_filings_transformed, cik, key))
        

print(f"Length of Finance Data = {len(finance_data)}")

# Save the transformed dictionaries as JSON objects to a file
saveDataToFile(finance_data, "../../data/output/finance_data.json") 
saveDataToFile(company_data, "../../data/output/company_data.json") 


https://data.sec.gov/api/xbrl/companyfacts/CIK0000808362.json
received data for company- Baker Hughes Holdings LLC, cik = 808362
https://data.sec.gov/api/xbrl/companyfacts/CIK0001652044.json
received data for company- Alphabet Inc., cik = 1652044
https://data.sec.gov/api/xbrl/companyfacts/CIK0001637459.json
received data for company- Kraft Heinz Co, cik = 1637459
Length of Finance Data = 749
Completed writing to file ../../data/output/finance_data.json
Completed writing to file ../../data/output/company_data.json


## Load


* The data files to be loaded into the MongoDB database are in the `data/output` directory

* Use the commands below to import the collections into the MongoDB database

* `mongoimport --type json -d FinanceDB -c companies --drop --jsonArray company_data.json`
* `mongoimport --type json -d FinanceDB -c finance --drop --jsonArray finance_data.json`
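Because the commands above use `--jsonArray`, each file is expected to hold a single top-level JSON array of documents. Before importing, it can be useful to confirm that shape. The following is a rough sketch only: `check_json_array` and `REQUIRED_KEYS` are hypothetical names, and the demo writes one sample record (the example record from the comment earlier in this notebook) to a temporary file rather than touching the real output files.

```python
import json
import os
import tempfile

# Keys of a finance record as built by getFinanceData in this notebook
REQUIRED_KEYS = {"cik", "dataType", "value", "qtr", "year"}


def check_json_array(path, required_keys):
    """Return the record count if the file is a JSON array of objects with the required keys."""
    with open(path) as fi:
        data = json.load(fi)
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a top-level JSON array")
    for position, record in enumerate(data):
        if not isinstance(record, dict):
            raise ValueError(f"{path}: record {position} is not a JSON object")
        missing = required_keys - set(record)
        if missing:
            raise ValueError(f"{path}: record {position} is missing {sorted(missing)}")
    return len(data)


# Demo on a temporary file holding one sample record
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "finance_data.json")
    with open(path, "w") as fi:
        json.dump([{"cik": 4949494, "dataType": "Revenues",
                    "value": 4949494949, "qtr": 1, "year": 2018}], fi)
    record_count = check_json_array(path, REQUIRED_KEYS)
    print(record_count)
```

For the company file, the same function could be called with `{"cik", "companyName"}` as the required keys.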
