# Web Scraping using BeautifulSoup


The fields we are interested in are outlined below. They may appear under several similar names. I will extend the list of variants over time as I come across new ones while analyzing S-1 forms.

From **Statement of Operations:**
- Revenue / Total Revenue
- Net Income / Net Loss

From **Balance Sheet:**
- Cash and cash equivalents
- Goodwill
- Intangible assets
- Total assets
- Long Term Debt
- Commercial Paper
- Other Current Borrowings
- Long Term Debt, current portion
- Short-Term Debt

From **Statement of Cash Flow:**
- Net cash used in operating activities
- Purchases of property and equipment / Proceeds from property and equipment
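
Because the same line item can appear under slightly different labels from one filing to the next, it may help to normalize labels before matching them. The snippet below is a minimal sketch of that idea, not the notebook's actual matching logic: `FIELD_VARIANTS`, `normalize`, and `canonical_field` are hypothetical names introduced here for illustration, and the mapping covers only a few of the variants listed above.

```python
import re

# Hypothetical mapping: canonical field name -> label variants (lowercase).
FIELD_VARIANTS = {
    "Revenue": ["revenue", "total revenue"],
    "Net Income": ["net income", "net loss", "net income (loss)"],
    "Purchases of property and equipment": [
        "purchases of property and equipment",
        "proceeds from property and equipment",
    ],
}


def normalize(label):
    """Lowercase a label and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", label).strip().lower()


def canonical_field(label):
    """Return the canonical field name for a row label, or None if nothing matches."""
    cleaned = normalize(label)
    for field, variants in FIELD_VARIANTS.items():
        if cleaned in variants:
            return field
    return None
```

With this mapping, `canonical_field("Net Loss")` resolves to `"Net Income"`, so both labels end up in the same output column. New variants only require another entry in the mapping.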

In [1]:
import pandas as pd
import time

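# Import only the helpers used in this notebook instead of a wildcard import.
from form_parser import get_html, iterate_fields_dict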

# pd.set_option('display.max_colwidth', None)

In [2]:
df = pd.read_csv('s1-list.csv')
print(f'There are {len(df)} companies in the list')
print(df.columns)
# Only the last expression in a cell is displayed, so preview the rows once.
df.head(3)

There are 10000 companies in the list
Index(['Unnamed: 0', 'id', 'accessionNo', 'cik', 'ticker', 'companyName',
       'companyNameLong', 'formType', 'description', 'filedAt', 'linkToTxt',
       'linkToHtml', 'linkToXbrl', 'linkToFilingDetails', 'entities',
       'documentFormatFiles', 'dataFiles',
       'seriesAndClassesContractsInformation'],
      dtype='object')


Unnamed: 0.1,Unnamed: 0,id,accessionNo,cik,ticker,companyName,companyNameLong,formType,description,filedAt,linkToTxt,linkToHtml,linkToXbrl,linkToFilingDetails,entities,documentFormatFiles,dataFiles,seriesAndClassesContractsInformation
0,0,86a54bcc128dae366ff72b596ae1b9c3,0001829126-20-000154,1828957,,DD3 Acquisition Corp. II,DD3 Acquisition Corp. II (Filer),S-1,Form S-1 - General form for registration of se...,2020-11-19T17:28:19-05:00,https://www.sec.gov/Archives/edgar/data/182895...,https://www.sec.gov/Archives/edgar/data/182895...,,https://www.sec.gov/Archives/edgar/data/182895...,[{'companyName': 'DD3 Acquisition Corp. II (Fi...,"[{'sequence': '1', 'description': 'S-1', 'docu...",[],[]
1,1,66894d57b0692c458313da9c59778dbf,0000950103-20-022555,1826991,,Trepont Acquistion Corp I,Trepont Acquistion Corp I (Filer),S-1/A,Form S-1/A - General form for registration of ...,2020-11-19T17:25:40-05:00,https://www.sec.gov/Archives/edgar/data/182699...,https://www.sec.gov/Archives/edgar/data/182699...,,https://www.sec.gov/Archives/edgar/data/182699...,[{'companyName': 'Trepont Acquistion Corp I (F...,"[{'sequence': '1', 'description': 'FORM S-1/A'...",[],[]
2,2,b8dfd567decca580e602fe0a2650c271,0001213900-20-038278,1826889,FRX,Forest Road Acquisition Corp.,Forest Road Acquisition Corp. (Filer),S-1/A,Form S-1/A - General form for registration of ...,2020-11-19T17:24:33-05:00,https://www.sec.gov/Archives/edgar/data/182688...,https://www.sec.gov/Archives/edgar/data/182688...,,https://www.sec.gov/Archives/edgar/data/182688...,[{'companyName': 'Forest Road Acquisition Corp...,"[{'sequence': '1', 'description': 'REGISTRATIO...",[],[]


## Extract data

In [None]:
# Each entry is either a single label or a list of label variants for one field.
fields_dict = {'Statement of Operations': ['Revenue',
                                          ['Net Income', 'Net Loss', 'net income (loss)']],
               'Balance Sheet': ['Cash and cash equivalents',
                                 'Goodwill',
                                 'Intangible assets',
                                 'Total assets',
                                 'Commercial Paper',
                                 'Other Current Borrowings',
                                 ['Long Term Debt', 'current portion'],
                                 'Short-Term Debt'],
               'Statement of Cash Flow': ['Net cash used in operating activities',
                                          ['Purchases of property and equipment', 'Proceeds from property and equipment']]}
result = []
for n, url in enumerate(df['linkToFilingDetails']):
    print(f'({n}) {url}')
    # Use a descriptive name rather than shadowing the built-in `dict`.
    parsed = iterate_fields_dict(html=get_html(url), fields_dict=fields_dict)
    result.append(parsed)
    print(parsed)


(0) https://www.sec.gov/Archives/edgar/data/1828957/000182912620000154/dd3acqcorpii_s1.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': None, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(1) https://www.sec.gov/Archives/edgar/data/1826991/000095010320022555/dp141189_s1a.htm




{'Revenue': None, 'Net Income': None, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': 59841.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(2) https://www.sec.gov/Archives/edgar/data/1826889/000121390020038278/fs12020a1_forestroadacq.htm




{'Revenue': None, 'Net Income': -761.0, 'Cash and cash equivalents': None, 'Goodwill': None, 'Intangible assets': None, 'Total assets': 41739.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(3) https://www.sec.gov/Archives/edgar/data/1315098/000119312520298230/d87104ds1.htm




{'Revenue': 312773.0, 'Net Income': None, 'Cash and cash equivalents': 801646.0, 'Goodwill': None, 'Intangible assets': None, 'Total assets': 1489541.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(4) https://www.sec.gov/Archives/edgar/data/1776661/000119312520298162/d25439ds1.htm




{'Revenue': None, 'Net Income': 2469141.0, 'Cash and cash equivalents': 486396.0, 'Goodwill': 2153855.0, 'Intangible assets': None, 'Total assets': 454689.0, 'Commercial Paper': None, 'Other Current Borrowings': None, 'Long Term Debt': None, 'Short-Term Debt': None, 'Net cash used in operating activities': None, 'Purchases of property and equipment': None}
(5) https://www.sec.gov/Archives/edgar/data/1558569/000110465920127325/tm2035427-1_s1.htm
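
The source of `form_parser` is not shown in this notebook, so how `iterate_fields_dict` locates each value is unknown here. As a rough illustration of the general approach with BeautifulSoup, the sketch below scans table rows for a label variant and returns the first numeric cell in that row. `parse_amount`, `find_value`, and the HTML fragment are hypothetical and invented for this example. They are not taken from `form_parser` or from a real filing.

```python
from bs4 import BeautifulSoup


def parse_amount(text):
    """Convert a cell such as '41,739' or '(761' to a float, or return None."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    # Filings often show negatives in parentheses, sometimes split across cells.
    negative = cleaned.startswith("(")
    cleaned = cleaned.strip("()").strip()
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return -value if negative else value


def find_value(html, variants):
    """Return the first numeric cell of the first row whose label matches a variant."""
    soup = BeautifulSoup(html, "html.parser")
    wanted = [v.lower() for v in variants]
    for row in soup.find_all("tr"):
        cells = [c.get_text(" ", strip=True) for c in row.find_all(["td", "th"])]
        if not cells:
            continue
        label = cells[0].lower()
        if not any(w in label for w in wanted):
            continue
        for cell in cells[1:]:
            amount = parse_amount(cell)
            if amount is not None:
                return amount
    return None


# Invented fragment that mimics how amounts are often split across cells.
SAMPLE_HTML = """
<table>
  <tr><td>Total assets</td><td>$</td><td>41,739</td></tr>
  <tr><td>Net loss</td><td>$</td><td>(761</td><td>)</td></tr>
</table>
"""

print(find_value(SAMPLE_HTML, ["Total assets"]))
print(find_value(SAMPLE_HTML, ["Net Income", "Net Loss"]))
```

A substring match on the first cell is deliberately loose. It tolerates footnote markers and trailing colons, but it can also pick up the wrong row (for example "Total assets held in trust"), which may be one reason some extracted values look implausible.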




In [None]:
# Build the frame from the list of dicts in one step;
# DataFrame.append was removed in pandas 2.0.
output = pd.DataFrame(result)
output


In [11]:
result = pd.concat([df, output], axis=1, sort=False)
# result.to_csv('s1-result.csv')

In [19]:
result.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 30 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Unnamed: 0                             200 non-null    int64  
 1   id                                     200 non-null    object 
 2   accessionNo                            200 non-null    object 
 3   cik                                    200 non-null    int64  
 4   ticker                                 143 non-null    object 
 5   companyName                            200 non-null    object 
 6   companyNameLong                        200 non-null    object 
 7   formType                               200 non-null    object 
 8   description                            200 non-null    object 
 9   filedAt                                200 non-null    object 
 10  linkToTxt                              200 non-null    object 
 11  linkTo