# Webscraping Tutorial


This notebook scrapes corporate information for a list of selected companies from the SGX website. It was done as a favor for a friend.


# Importing Libraries and Text File


In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import requests
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated


First, I will read the URLs from the text file into a Python list.


In [2]:
# Import URL text file into list
url_list = []
# Use a context manager so the file is closed once it has been read
with open("sgx_links.txt", 'r', newline=None) as url_file:
    for line in url_file:
        url_list.append(line)


In [3]:
# Remove the trailing \n from each URL
url_list = [i.replace('\n', '') for i in url_list]
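The two cells above can also be folded into a single step. The following is only a sketch: `read_urls` is a hypothetical helper, and the temporary `sample_links.txt` file stands in for `sgx_links.txt` so the example runs on its own. It strips surrounding whitespace and skips blank lines, which the original cells do not do.

```python
import os
import tempfile


def read_urls(path):
    """Return the non-empty lines of a text file, stripped of surrounding whitespace."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstrate on a throwaway file (the name and URLs are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_links.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("https://example.com/a\n\nhttps://example.com/b\n")
    sample_urls = read_urls(sample_path)

print(sample_urls)
```

Skipping blank lines matters here because an empty string in the list would later be passed to `requests.get` and raise an error.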


In [5]:
len(url_list)


707

Based on the text file, there are 707 URLs in the list.

# Webscraping 

First, I will create a dataframe to hold the information scraped.

In [4]:
info = pd.DataFrame(columns=['company_name', 'incorporated', 'incorporated_on', 'ISIN_code', 'registered_office', 'telephone', 'fax', 'email',
                    'secretary', 'website', 'listing', 'listing_board', 'other_stock_exchange', 'registrars', 'registrars_address', 'auditors', 'background'])


As I will be scraping directly from the SGX website, I will use Beautiful Soup with Python's built-in HTML parser to extract the information that is needed.

In [7]:
#Webscrape with BeautifulSoup

info_list = []

for url in tqdm_notebook(url_list):

    URL = url
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")

    company_name = soup.find(id="ctl07_compFullNameLabel").text
    incorporated = soup.find(id="ctl07_incorporatedLabel").text
    incorporated_on = soup.find(id='ctl07_incorpOnLabel').text
    isin_code = soup.find(id='ctl07_isinCodeLabel').text
    registered_office = soup.find(
        'dt', text='Registered Office :').findNext("dd").text
    telephone = soup.find('dt', text='Telephone :').findNext("dd").text
    fax = soup.find('dt', text='Fax :').findNext("dd").text
    email = soup.find('dt', text='Email :').findNext("dd").text
    secretary = soup.find('dt', text='Secretary :').findNext("dd").text
    website = soup.find(
        'dt', text='Link to Internet Website :').findNext("dd").text
    listing = soup.find(id='ctl07_listingDateLabel').text
    listing_board = soup.find(id='ctl07_lbllistingBoard').text

    # Not every page has this row, so try to scrape it and fall back to NaN if it is missing
    try:
        other_stock_exchange = soup.find(
            text='OTHER STOCK EXCHANGE LISTINGS').findNext('dd').text
    except AttributeError:
        other_stock_exchange = np.nan

    registrars = soup.find(
        text='REGISTRARS / TRANSFER AGENTS & ADDRESS').findNext('dd').text
    registrars_address = soup.find(
        text='REGISTRARS / TRANSFER AGENTS & ADDRESS').findNext('dd').findNext('dd').text
    auditors = soup.find(text='AUDITORS').findNext('dd').text
    background = soup.find(id='litIPOCompany').text

    info_list.append([company_name, incorporated, incorporated_on, isin_code, registered_office, telephone, fax, email,
                     secretary, website, listing, listing_board, other_stock_exchange, registrars, registrars_address, auditors, background])
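The loop above guards only one field with `try`/`except`, so a page missing any other label would stop the whole run. One possible way to apply the same fallback to every label is a small helper. This is a sketch, not the approach used in this notebook: `dd_after` is a hypothetical name, and the inline HTML is a made-up stand-in for an SGX page. It uses `string=` and `find_next`, the current Beautiful Soup spellings of `text=` and `findNext`.

```python
import numpy as np
from bs4 import BeautifulSoup


def dd_after(soup, label, default=np.nan):
    """Return the text of the first <dd> after the node whose text equals `label`."""
    node = soup.find(string=label)
    if node is None:
        # Label not present on this page
        return default
    dd = node.find_next("dd")
    return dd.get_text(strip=True) if dd is not None else default


# Made-up sample markup, only to exercise the helper
sample_html = """
<dl>
  <dt>Telephone :</dt><dd>65 0000 0000</dd>
  <dt>Fax :</dt><dd></dd>
</dl>
"""
soup = BeautifulSoup(sample_html, "html.parser")

telephone = dd_after(soup, "Telephone :")
fax = dd_after(soup, "Fax :")
missing = dd_after(soup, "OTHER STOCK EXCHANGE LISTINGS")
```

A label that exists with an empty `<dd>` yields an empty string, while a label that is absent yields the default, so the two cases stay distinguishable in the final table.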



In [12]:
# Combine the scraped rows with the dataframe.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
info = pd.concat([info, pd.DataFrame(info_list, columns=info.columns)])


Next, I will clean the data by removing all \r and \n characters.

In [18]:
info = info.replace('\n', '', regex=True)
info = info.replace('\r', '', regex=True)
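Replacing line breaks with an empty string joins adjacent lines together, which is why addresses in the output run into each other (for example `CentreSi...`). A variant that substitutes a single space keeps the parts separated. This is a sketch under assumptions: `clean_whitespace` is a hypothetical helper and the sample row is invented.

```python
import numpy as np
import pandas as pd


def clean_whitespace(df):
    """Collapse line breaks into single spaces and trim each string cell."""
    # Each run of \r and/or \n becomes one space so neighbouring lines stay apart
    cleaned = df.replace(r"[\r\n]+", " ", regex=True)
    # Remove whitespace left over at the start or end of a cell
    return cleaned.replace(r"^\s+|\s+$", "", regex=True)


# Invented sample row, only to show the effect
sample = pd.DataFrame({
    "registered_office": ["1 Example Road\r\n#01-01 Example Tower\r\nSingapore 000000\n"],
    "fax": [np.nan],
})
cleaned = clean_whitespace(sample)
print(cleaned.loc[0, "registered_office"])
```

Missing values are left untouched, since the regex replacement only applies to string cells.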


In [24]:
info


Unnamed: 0,company_name,incorporated,incorporated_on,ISIN_code,registered_office,telephone,fax,email,secretary,website,listing,listing_board,other_stock_exchange,registrars,registrars_address,auditors,background
0,3CNERGY LIMITED,SINGAPORE,24 Feb 1973,SG0502000029,82 Ubi Avenue 4#05-04 Edward Boustead CentreSi...,65 69708117,,enquiries@3cnergy.com.sg,Cheok Hui Yee,http://www.3cnergy.com.sg/,Listed on 6 July 1987 on SGX Sesdaq,CATALIST,*SINGAPORE EXCHANGE (CATALIST),TRICOR BARBINDER SHARE REGISTRATIO...,80 Robinson Road #02-00 Singapore ...,MAZARS LLP,"On 30 July 1981, the Company was incor..."
1,5E RESOURCES LIMITED,SINGAPORE,18 Oct 2021,SGXE78399073,30 Cecil Street#19-08Prudential TowerSingapore...,,,finance@5e-resources.com,Tan Sey Liy Shirley,http://www.5e-resources.com,Listed on 12 May 2022 on CATALIST,CATALIST,*SINGAPORE EXCHANGE (CATALIST),IN.CORP CORPORATE SERVICES PTE. LT...,"30 Cecil Street, #19-08 Prudential...",PRICEWATERHOUSECOOPERS LLP,"5E Resources Limited (""5E Resources"" o..."
2,8TELECOM INTL HOLDINGS CO LTD,BERMUDA,05 Jan 2004,BMG3087Y2074,Clarendon House2 Church StreetHamilton HM 11Be...,(86) 57188225288,(86) 57188225291,,,http://www.8telecom.cn,Listed on 23 July 2004 on SGX Mainboard,MAINBOARD,*SINGAPORE EXCHANGE LTD.,CONYERS CORPORATE SERVICES (BERMUD...,Clarendon House 2 Church Street ...,FOO KON TAN LLP,The Company was incorporated in Bermud...
3,9R LIMITED,SINGAPORE,04 Nov 1993,SGXE45420721,"105 Cecil Street,#12-02 The Octagon,Singapore ...",65 6601 9500,65 6601 9600,,Lai Kuan Loong Victor,http://www.vikingom.com,Listed on 10 August 2021 on CATALIST,CATALIST,*SINGAPORE EXCHANGE (CATALIST),M & C SERVICES PRIVATE LIMITED ...,"112 Robinson Road #05-01, Singapor...",ERNST & YOUNG LLP,Viking Offshore and Marine Limited was...
4,ABACUS CAPITAL (S) PTE LTD,SINGAPORE,,SG1BE4000002,,,,,,,Listed on 22 July 2015 on SGX Mainboard,MAINBOARD,*SINGAPORE EXCHANGE LTD.,,,,The Funds will NOT be traded on SGX-ST...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
702,YUUZOO NETWORKS GROUP CORPORATION,"VIRGIN ISLANDS, BRITISH",22 Mar 2005,VGG9889R1001,Jayla PlaceWickhams Cay 1Road TownTortola Brit...,65 62713468,65 62758469,,,http://www.yuuzoo.com,Listed on 27 December 2005 on SGX Mainboard,MAINBOARD,*SINGAPORE EXCHANGE LTD.,ESTERA MANAGEMENT (BERMUDA) LTD \B...,"Victoria Place, 5th Floor, 31 Vict...",,Headquartered and listed on the SGX ma...
703,ZHENENG JINJIANG ENVIRONMENT HOLDING COMPANY L...,CAYMAN ISLANDS,08 Sep 2010,KYG9898S1075,"Grand Pavillion, Hibiscus Way802 West Bay Road...",86 571 87699700,86 571 88388848.,,Toh Li Ping AngelaHoon Chi Tern,,Listed on 3 August 2016 on SGX Mainboard,MAINBOARD,*SINGAPORE EXCHANGE LTD.,BOARDROOM CORPORATE & ADVISORY SER...,"1 Harbourfront Avenue, Keppel Bay ...",PRICEWATERHOUSECOOPERS LLP,ZHENENG JINJIANG ENVIRONMENT HOLDING C...
704,ZHONGMIN BAIHUI RETAIL GROUP LTD.,SINGAPORE,17 Sep 2004,SG2C76966531,160 Robinson RoadSBF Centre#15-06Singapore 068...,86 592 5863888 (Xiamen) 65 644,86 592 5182791 (Xiamen) 65 644,,Chia Foon Yeow,http://www.zhongminbaihui.com.sg,Listed on 3 September 2013 on SGX Mainboard (2...,MAINBOARD,*SINGAPORE EXCHANGE LTD.,BOARDROOM CORPORATE & ADVISORY SER...,"1 Harbourfront Avenue, Keppel Bay ...",ERNST & YOUNG LLP,The Company was incorporated in Singap...
705,ZHONGXIN FRUIT AND JUICE LIMITED,SINGAPORE,27 Sep 2002,SG1P25916898,No 25 International Business Park#02-53 German...,65 65572308,,,Lee Wei Hsiung,,Listed on 24 March 2004 on SGX Sesdaq,CATALIST,*SINGAPORE EXCHANGE (CATALIST),BOARDROOM CORPORATE & ADVISORY SER...,"1 Harbourfront Avenue, Keppel Bay ...",MOORE STEPHENS LLP,The Company was incorporated in Singap...


Finally, I will export the information to a CSV file.

In [22]:
info.to_csv('data_cleaned.csv')
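By default `to_csv` also writes the dataframe index as an unnamed first column, and pandas labels that column `Unnamed: 0` when the file is read back, which is probably where the `Unnamed: 0` header in the table above comes from. The sketch below shows the difference that `index=False` makes, using an invented one-row dataframe and in-memory buffers instead of `data_cleaned.csv`.

```python
import io

import pandas as pd

# Invented sample data, only to compare the two export styles
sample = pd.DataFrame({
    "company_name": ["EXAMPLE LIMITED"],
    "listing_board": ["MAINBOARD"],
})

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Whether to keep the index is a judgment call: here it is only a row counter, so dropping it loses nothing.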
