# Web Scraping Form S-1

From **Statement of Operations:**
- Revenue / Total Revenue
- Net Income / Net Loss

From **Balance Sheet:**
- Cash and cash equivalents
- Goodwill
- Intangible assets
- Total assets
- Long Term Debt
- Commercial Paper
- Other Current Borrowings
- Long Term Debt, current portion
- Short-Term Debt

From **Statement of Cash Flow:**
- Net cash used in operating activities
- Purchases of property and equipment / Proceeds from property and equipment

In [1]:
import pandas as pd
import time
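# Import only the two helpers this notebook uses from the local form_parser module
from form_parser import get_html, iterate_fields_dict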

# pd.set_option('display.max_colwidth', None)

In [2]:
df = pd.read_csv('s1-list.csv')
print(f'There are {len(df)} companies in the list')
print(df.columns)
# Compact view, if needed: df[['ticker', 'companyName', 'linkToFilingDetails']].head()
df.head(2)

There are 10000 companies in the list
Index(['Unnamed: 0', 'id', 'accessionNo', 'cik', 'ticker', 'companyName',
       'companyNameLong', 'formType', 'description', 'filedAt', 'linkToTxt',
       'linkToHtml', 'linkToXbrl', 'linkToFilingDetails', 'entities',
       'documentFormatFiles', 'dataFiles',
       'seriesAndClassesContractsInformation'],
      dtype='object')


Unnamed: 0.1,Unnamed: 0,id,accessionNo,cik,ticker,companyName,companyNameLong,formType,description,filedAt,linkToTxt,linkToHtml,linkToXbrl,linkToFilingDetails,entities,documentFormatFiles,dataFiles,seriesAndClassesContractsInformation
0,0,86a54bcc128dae366ff72b596ae1b9c3,0001829126-20-000154,1828957,,DD3 Acquisition Corp. II,DD3 Acquisition Corp. II (Filer),S-1,Form S-1 - General form for registration of se...,2020-11-19T17:28:19-05:00,https://www.sec.gov/Archives/edgar/data/182895...,https://www.sec.gov/Archives/edgar/data/182895...,,https://www.sec.gov/Archives/edgar/data/182895...,[{'companyName': 'DD3 Acquisition Corp. II (Fi...,"[{'sequence': '1', 'description': 'S-1', 'docu...",[],[]
1,1,66894d57b0692c458313da9c59778dbf,0000950103-20-022555,1826991,,Trepont Acquistion Corp I,Trepont Acquistion Corp I (Filer),S-1/A,Form S-1/A - General form for registration of ...,2020-11-19T17:25:40-05:00,https://www.sec.gov/Archives/edgar/data/182699...,https://www.sec.gov/Archives/edgar/data/182699...,,https://www.sec.gov/Archives/edgar/data/182699...,[{'companyName': 'Trepont Acquistion Corp I (F...,"[{'sequence': '1', 'description': 'FORM S-1/A'...",[],[]


## Extract data
S-1 forms are poorly structured, so the financial statements are parsed as follows:
- find all `<div>` tags that contain the name of a financial statement ('Statement of Operations', 'Balance Sheet', ...)
- for each statement, parse the list of tables marked with the `<table>` tag
- search the statement's tables for the required field names and, on a match, take the values from the same row

If a field can appear under different names, those names are given as a list.
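The last two points (matching a field name and reading the value from the same row) can be sketched as below. This is only an illustration under assumptions, not the code in `form_parser`: the helpers `parse_number` and `find_field_value` are hypothetical, and each table is assumed to be already reduced to a list of rows of cell strings, with the label in the first cell.

```python
import re

def parse_number(cell):
    """Convert a cell such as '59,841' or '(761)' to float; return None if not numeric."""
    text = cell.strip().replace('$', '').replace(',', '')
    # Accounting notation: a value in parentheses is negative
    negative = text.startswith('(') and text.endswith(')')
    if negative:
        text = text[1:-1]
    if not re.fullmatch(r'-?\d+(\.\d+)?', text):
        return None
    value = float(text)
    return -value if negative else value

def find_field_value(rows, names):
    """Return the first numeric value in the first row whose label matches any name."""
    if isinstance(names, str):
        names = [names]
    wanted = [name.lower() for name in names]
    for row in rows:
        if not row:
            continue
        label = row[0].strip().lower()
        if any(name in label for name in wanted):
            for cell in row[1:]:
                value = parse_number(cell)
                if value is not None:
                    return value
    return None

# Hypothetical rows, shaped like a table after its cells were extracted
rows = [
    ['Total assets', '$', '59,841'],
    ['Net loss', '(761)'],
]
print(find_field_value(rows, 'Total assets'))              # 59841.0
print(find_field_value(rows, ['Net Income', 'Net Loss']))  # -761.0
print(find_field_value(rows, 'Goodwill'))                  # None
```

Matching is case-insensitive and by substring, so 'Revenue' would also match a row labelled 'Total revenue'. Whether the real parser behaves the same way is not shown in this notebook.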

In [22]:
# Statement title -> fields to look for in that statement's tables.
# A nested list holds alternative names for one field; the first name appears as the output key.
fields_dict = {
    'Statement of Operations': [
        'Revenue',
        ['Net Income', 'Net Loss', 'net income (loss)'],
    ],
    'Balance Sheet': [
        'Cash and cash equivalents',
        'Goodwill',
        'Intangible assets',
        'Total assets',
        'Commercial Paper',
        'Other Current Borrowings',
        ['Long Term Debt', 'current portion'],
        'Short-Term Debt',
    ],
    'Statement of Cash Flow': [
        'Net cash used in operating activities',
        ['Purchases of property and equipment', 'Proceeds from property and equipment'],
    ],
}
result = []  # filled by the loop in the next section

# Try the parser on a single filing first
url = df['linkToFilingDetails'][1]
print(url)
iterate_fields_dict(html=get_html(url), fields_dict=fields_dict)


https://www.sec.gov/Archives/edgar/data/1826991/000095010320022555/dp141189_s1a.htm




{'Revenue': None,
 'Net Income': None,
 'Cash and cash equivalents': None,
 'Goodwill': None,
 'Intangible assets': None,
 'Total assets': 59841.0,
 'Commercial Paper': None,
 'Other Current Borrowings': None,
 'Long Term Debt': None,
 'Short-Term Debt': None,
 'Net cash used in operating activities': None,
 'Purchases of property and equipment': None}
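
Judging from the output above, each result key appears to be the field name itself or, for a list of alternative names, the first name in that list. A small sketch of that mapping follows. `output_keys` is a hypothetical helper written for illustration and is not part of `form_parser`.

```python
def output_keys(fields_dict):
    """List the expected result keys: a plain name maps to itself,
    a list of alternative names maps to its first element."""
    keys = []
    for fields in fields_dict.values():
        for field in fields:
            keys.append(field if isinstance(field, str) else field[0])
    return keys

# Shortened example in the same shape as fields_dict
example = {
    'Statement of Operations': ['Revenue', ['Net Income', 'Net Loss']],
    'Balance Sheet': ['Total assets', ['Long Term Debt', 'current portion']],
}
print(output_keys(example))
# ['Revenue', 'Net Income', 'Total assets', 'Long Term Debt']
```

One consequence is that 'Long Term Debt' and its current portion share a single output column here, although the field list at the top names them as separate items.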

## Parse all URLs and save to a DataFrame

In [18]:
for n, url in enumerate(df['linkToFilingDetails'][:20]):
    print(f'({n}) {url}')
    try:
        # Named `fields` so the built-in `dict` is not shadowed
        fields = iterate_fields_dict(html=get_html(url), fields_dict=fields_dict)
        result.append(fields)
        print(fields)
    except Exception as e:
        print(e)

(0) https://www.sec.gov/Archives/edgar/data/1828957/000182912620000154/dd3acqcorpii_s1.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': None, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(1) https://www.sec.gov/Archives/edgar/data/1826991/000095010320022555/dp141189_s1a.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': 59841.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(2) https://www.sec.gov/Archives/edgar/data/1826889/000121390020038278/fs12020a1_forestroadacq.htm




{'Revenue': None, 'Net Income': -761.0, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': 41739.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(3) https://www.sec.gov/Archives/edgar/data/1315098/000119312520298230/d87104ds1.htm




{'Revenue': 312773.0, 'Net Income': None, 'Cash and cash equivalents': 801646.0, 'Goodwill': None, 'Intangible assets': None, 'Total assets': 1489541.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(4) https://www.sec.gov/Archives/edgar/data/1776661/000119312520298162/d25439ds1.htm




{'Revenue': None, 'Net Income': 2469141.0, 'Cash and cash equivalents': 486396.0, 'Goodwill': 2153855.0, 'Intangible assets': None, 'Total assets': 454689.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(5) https://www.sec.gov/Archives/edgar/data/1558569/000110465920127325/tm2035427-1_s1.htm




{'Revenue': 4298350.0, 'Net Income': -4727050.0, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': 4214588.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(6) https://www.sec.gov/Archives/edgar/data/1722438/000121390020038221/fs12020a1_capitolinvest5.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': 297405.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(7) https://www.sec.gov/Archives/edgar/data/1015383/000149315220022015/forms-1a.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': None, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(8) https://www.sec.gov/Archives/edgar/data/1831992/000110465920127110/tm2036073-1_s1.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': None, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(9) https://www.sec.gov/Archives/edgar/data/1822966/000110465920127045/tm2029458-6_s1a.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': 84559.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(10) https://www.sec.gov/Archives/edgar/data/355379/000110465920127037/tm2034654-4_s1a.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': 43716205.0, 'Goodwill': None, 'Intangible assets': 700000.0, 'Total assets': 536649.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(11) https://www.sec.gov/Archives/edgar/data/1015383/000149315220021984/forms-1a.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': None, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(12) https://www.sec.gov/Archives/edgar/data/1716166/000149315220021982/forms1a.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': None, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(13) https://www.sec.gov/Archives/edgar/data/1646188/000121390020038066/ea130146-s1a2_ondas.htm




{'Revenue': 614026.0, 'Net Income': -9353706.0, 'Cash and cash equivalents': 2129013.0, 'Goodwill': None, 'Intangible assets': None, 'Total assets': 4815408.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': -4875137.0, 'Purchases of property and equipment': None}
(14) https://www.sec.gov/Archives/edgar/data/1583771/000110465920126937/tm2034164d4_s1a.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': None, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(15) https://www.sec.gov/Archives/edgar/data/1820953/000110465920126927/tm2026663-4_s1.htm




{'Revenue': None, 'Net Income': -120455.0, 'Cash and cash equivalents': 684423.0, 'Goodwill': None, 'Intangible assets': None, 'Total assets': 2250549.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(16) https://www.sec.gov/Archives/edgar/data/1827090/000110465920126924/tm2030105-7_s1.htm




{'Revenue': None, 'Net Income': 5050.0, 'Cash and cash equivalents': 29937.0, 'Goodwill': 515587.0, 'Intangible assets': None, 'Total assets': 1020380.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(17) https://www.sec.gov/Archives/edgar/data/1822479/000119312520297076/d93452ds1a.htm




{'Revenue': None, 'Net Income': -20425.0, 'Cash and cash equivalents': 62863.0, 'Goodwill': 1035865.0, 'Intangible assets': None, 'Total assets': 2580674.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(18) https://www.sec.gov/Archives/edgar/data/1821769/000121390020038006/fs12020a1_liveoakacq2.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': 129955.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(19) https://www.sec.gov/Archives/edgar/data/1799858/000091957420007244/d8647509_s1a-3.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': None, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
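
One caveat with the loop above: when a URL raises an exception, nothing is appended to `result`, so later rows would no longer line up with the rows of `df` if the two are joined by position. A possible variant is sketched below, with assumptions stated: `fetch_all` is a hypothetical helper, `fetch` stands in for whatever parses one URL, and the `delay` value is arbitrary (a pause between requests may be worth having because EDGAR restricts automated request rates).

```python
import time

def fetch_all(urls, fetch, delay=0.2, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls.

    Returns (results, errors). `results` has exactly one entry per URL
    (None on failure), so positions stay aligned with the input list.
    `errors` maps the position of each failed URL to the error message.
    """
    results, errors = [], {}
    for n, url in enumerate(urls):
        if n > 0:
            sleep(delay)  # no pause before the first request
        try:
            results.append(fetch(url))
        except Exception as exc:
            results.append(None)
            errors[n] = str(exc)
    return results, errors

# Demo with a fake fetch function instead of real network access
def fake_fetch(url):
    if url == 'bad':
        raise ValueError('boom')
    return {'url': url}

results, errors = fetch_all(['a', 'bad', 'c'], fake_fetch, delay=0)
print(results)  # [{'url': 'a'}, None, {'url': 'c'}]
print(errors)   # {1: 'boom'}
```

In this notebook, `fetch` could be a small function wrapping `iterate_fields_dict(html=get_html(url), fields_dict=fields_dict)`.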


In [20]:
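# DataFrame.append was removed in pandas 2.0; build the frame from the list of dicts directly.
# sort_index(axis=1) keeps the alphabetical column order shown below.
output = pd.DataFrame(result).sort_index(axis=1)
output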


Unnamed: 0,Cash and cash equivalents,Commercial Paper,Goodwill,Intangible assets,Long Term Debt,Net Income,Net cash used in operating activities,Other Current Borrowings,Purchases of property and equipment,Revenue,Short-Term Debt,Total assets
0,,,,,,,,,,,,
1,,,,,,,,,,,,59841.0
2,,,,,,-761.0,,,,,,41739.0
3,801646.0,,,,,,,,,312773.0,,1489541.0
4,486396.0,,2153855.0,,,2469141.0,,,,,,454689.0
5,,,,,,-4727050.0,,,,4298350.0,,4214588.0
6,,,,,,,,,,,,297405.0
7,,,,,,,,,,,,
8,,,,,,,,,,,,
9,,,,,,,,,,,,84559.0


In [16]:
# result = pd.concat([df, output], axis=1, sort=False)
# result.to_csv('s1-result_.csv')

In [19]:
# result.info()