### Muhammad Satrio Pinoto Negoro

**LinkedIn:** https://www.linkedin.com/in/satriopino/

# Web Scraping the World's Largest Companies from Forbes.com (2023) Using BeautifulSoup

Web scraping, also known as web harvesting or web data extraction, is the process of automatically extracting data from websites. It involves fetching a web page and extracting data from it. The data can then be parsed, searched, reformatted, and copied into a spreadsheet or loaded into a database. Web scraping can be done manually, but automated tools are usually preferred because they are cheaper and faster. It is used for purposes such as lead generation, price monitoring, market research, and content aggregation. However, some websites try to prevent scraping, for example by detecting bots and blocking them from crawling their pages. In response, some scraping systems use DOM parsing, computer vision, and natural language processing to simulate human browsing and gather page content for offline parsing.


## Dependencies

To follow this module you need Beautiful Soup, which you can install with `pip install beautifulsoup4`. The libraries used in this module, which need to be installed first, are:

- beautifulsoup4
- requests
- pandas

## Background

In this project I scrape the Company Name, Country, Sales, Profit, and Assets fields from Forbes.com's list of the world's largest companies (2023). Forbes, founded in 1917, is a renowned American business magazine and online platform. Known for its influential annual lists like the Forbes 400 and Forbes Global 2000, Forbes provides comprehensive coverage of business, technology, and lifestyle. Undergoing a digital transformation, it engages a global audience through insightful articles and expert contributors. The Forbes brand extends beyond media, including exclusive communities like Forbes Councils and impactful events. With a legacy of family ownership, Forbes remains a trusted source of authoritative business journalism, shaping conversations in the ever-evolving landscape of finance and entrepreneurship.

Many of you might ask why I need to scrape this data from the Forbes site. The data cannot be downloaded, and I want to build a report and gain some insight from it that may also be useful to others. To do that I need the data itself, and scraping is a good way to collect publicly visible data that I do not already have.

I will scrape five fields from this site: Company Name, Country, Sales, Profit, and Assets.

## What is BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also works on Python 3.x. Beautiful Soup 4 is faster, has more features, and works with third-party parsers like lxml and html5lib.
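
As a minimal sketch of what "pulling data out of HTML" looks like, the snippet below parses a small hand-written fragment. The fragment is an assumption modeled on one row of the Forbes table shown later in this module; it is not fetched from the live page.

```python
from bs4 import BeautifulSoup

# Hand-written sample modeled on one table row; not fetched from Forbes.com
sample_html = """
<div class="table-row-group">
  <a class="table-row" href="https://www.forbes.com/companies/jpmorgan-chase/?list=global2000">
    <div class="organizationName second table-cell name"> JPMorgan Chase </div>
    <div class="country table-cell country"> United States </div>
  </a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag; .text gives the text inside it
name = sample_soup.find(
    "div", attrs={"class": "organizationName second table-cell name"}
).text.strip()

# class_ matches a tag that has this class among its classes
country = sample_soup.find("div", class_="country").text.strip()

# Attributes are read with dictionary-style access
link = sample_soup.find("a")["href"]

print(name, country, link)
```

The same `find` call with the full class string is what the rest of this module uses on the real page.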

Since Beautiful Soup pulls data out of HTML, we first need to get the HTML itself. To do that we will use the third-party `requests` library, which can be installed with `pip install requests`.

The code below sends a GET request to the specific address we give it. This is the same type of request your browser sends to view the page; the difference is that Requests does not render the HTML, so you get the raw HTML and the other response information instead.

I'm using the `.get()` function here, but Requests also provides functions such as `.post()` and `.put()` for sending other types of request. In this case we go to the Forbes.com Global 2000 page, which ranks the world's largest companies; you can click [here](https://www.forbes.com/lists/global2000/?sh=710d4f0c5ac0) to see exactly where that link goes.

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
}

In [3]:
url = 'https://www.forbes.com/lists/global2000/?sh=710d4f0c5ac0'
response = requests.get(url, headers=headers)
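
The call above has no timeout and does not check whether the request succeeded. A minimal sketch of a more defensive variant follows; the helper name `fetch_html` and the 30-second default timeout are assumptions for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, headers=None, timeout=30):
    """Return the body of `url` as bytes, raising an exception on failure."""
    # timeout keeps the call from hanging forever on an unresponsive server
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.content
```

With this helper, a call such as `fetch_html(url, headers=headers)` would either return the page bytes or fail loudly, instead of silently handing an error page to Beautiful Soup.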

In [4]:
response.content[:500]

b'<!DOCTYPE html><html lang="en"><head><link rel="preload" as="font" href="https://i.forbesimg.com/assets/fonts/merriweather/merriweather-bold-webfont.woff2" type="font/woff2" crossorigin><link rel="preload" as="font" href="https://i.forbesimg.com/assets/fonts/work-sans/worksans-regular-webfont.woff2" type="font/woff2" crossorigin><link rel="preload" as="font" href="https://i.forbesimg.com/assets/fonts/merriweather/merriweather-regular-webfont.woff2" type="font/woff2" crossorigin><link rel="preloa'

In [5]:
soup = BeautifulSoup(response.content, 'html.parser')
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [6]:
print(soup.prettify()[:500])

<!DOCTYPE html>
<html lang="en">
 <head>
  <link as="font" crossorigin="" href="https://i.forbesimg.com/assets/fonts/merriweather/merriweather-bold-webfont.woff2" rel="preload" type="font/woff2"/>
  <link as="font" crossorigin="" href="https://i.forbesimg.com/assets/fonts/work-sans/worksans-regular-webfont.woff2" rel="preload" type="font/woff2"/>
  <link as="font" crossorigin="" href="https://i.forbesimg.com/assets/fonts/merriweather/merriweather-regular-webfont.woff2" rel="preload" type="font/w


In [27]:
soup.find("div", attrs={'class':'table-row-group'}).prettify()

'<div class="table-row-group">\n <a aria-label="JPMorgan Chase" canexpand="" class="table-row active premiumProfile-jpmorgan-chase" href="https://www.forbes.com/companies/jpmorgan-chase/?list=global2000" rel="noopener noreferrer" style="background-color:" target="#_blank" uri="jpmorgan-chase">\n  <div class="rank first table-cell rank">\n   1.\n  </div>\n  <div class="organizationName second table-cell name">\n   JPMorgan Chase\n  </div>\n  <div class="country table-cell country">\n   United States\n  </div>\n  <div class="revenue table-cell sales ($)">\n   $179.93 B\n  </div>\n  <div class="profits table-cell profit ($)">\n   $41.8 B\n  </div>\n  <div class="assets table-cell assets ($)">\n   $3,744.3 B\n  </div>\n  <div class="marketValue table-cell market value ($)">\n   $399.59 B\n  </div>\n </a>\n <a aria-label="Saudi Arabian Oil Company (Saudi Aramco)" canexpand="" class="table-row active premiumProfile-saudi-arabian-oil-company-saudi-aramco" href="https://www.forbes.com/companie

In [8]:
soup.find_all("div", attrs={'class':'organizationName second table-cell name'})[1950].text.strip()

'United Therapeutics'

In [9]:
soup.find_all("div", attrs={'class':'country table-cell country'})[1999].text.strip()

'France'

In [10]:
soup.find_all("div", attrs={'class':'revenue table-cell sales ($)'})[1999].text.strip()

'$3.18 B'

In [11]:
soup.find_all("div", attrs={'class':'profits table-cell profit ($)'})[1999].text.strip()

'$681.7 M'

In [12]:
soup.find_all("div", attrs={'class':'assets table-cell assets ($)'})[1999].text.strip()

'$5.99 B'

In [13]:
soup.find_all("div", attrs={'class':'marketValue table-cell market value ($)'})[1950].text

'$551 M'

In [14]:
rowlength = len(soup.find_all("div", attrs={'class':'organizationName second table-cell name'}))
rowlength

2000

In [15]:
data = []

In [16]:
# Run each find_all once, outside the loop, instead of once per row
names = soup.find_all("div", attrs={'class':'organizationName second table-cell name'})
countries = soup.find_all("div", attrs={'class':'country table-cell country'})
sales = soup.find_all("div", attrs={'class':'revenue table-cell sales ($)'})
profits = soup.find_all("div", attrs={'class':'profits table-cell profit ($)'})
assets = soup.find_all("div", attrs={'class':'assets table-cell assets ($)'})

for i in range(rowlength):
    # Take the i-th cell of every column and keep only its stripped text
    row = (names[i], countries[i], sales[i], profits[i], assets[i])
    data.append(tuple(cell.text.strip() for cell in row))
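
Indexing separate column lists assumes that every row contains every cell. A sketch of an alternative that walks the table row by row, so the fields of one company always stay together, is shown here on a hand-written sample. The CSS selectors and the `extract_rows` helper are assumptions based on the class names printed earlier, not something the original notebook uses.

```python
from bs4 import BeautifulSoup

# Hand-written sample with the same structure as the printed table rows
sample_html = """
<div class="table-row-group">
  <a class="table-row">
    <div class="organizationName second table-cell name"> JPMorgan Chase </div>
    <div class="country table-cell country"> United States </div>
    <div class="revenue table-cell sales ($)"> $179.93 B </div>
    <div class="profits table-cell profit ($)"> $41.8 B </div>
    <div class="assets table-cell assets ($)"> $3,744.3 B </div>
  </a>
  <a class="table-row">
    <div class="organizationName second table-cell name"> ICBC </div>
    <div class="country table-cell country"> China </div>
    <div class="revenue table-cell sales ($)"> $216.77 B </div>
    <div class="profits table-cell profit ($)"> $52.47 B </div>
    <div class="assets table-cell assets ($)"> $6,116.82 B </div>
  </a>
</div>
"""

# First class name of each cell we want, in output order
FIELDS = ("organizationName", "country", "revenue", "profits", "assets")


def extract_rows(parsed):
    """Return one tuple of cell texts per table row."""
    rows = []
    for row in parsed.select("div.table-row-group a.table-row"):
        cells = [row.select_one(f"div.{field}") for field in FIELDS]
        # A missing cell becomes None instead of shifting the other columns
        rows.append(tuple(c.get_text(strip=True) if c else None for c in cells))
    return rows


sample_rows = extract_rows(BeautifulSoup(sample_html, "html.parser"))
print(sample_rows)
```

Because each lookup happens inside a single row, a row with a missing cell yields `None` in that position rather than misaligning every later row.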

In [18]:
data[:10]

[('JPMorgan Chase', 'United States', '$179.93 B', '$41.8 B', '$3,744.3 B'),
 ('Saudi Arabian Oil Company (Saudi Aramco)',
  'Saudi Arabia',
  '$589.47 B',
  '$156.36 B',
  '$660.99 B'),
 ('ICBC', 'China', '$216.77 B', '$52.47 B', '$6,116.82 B'),
 ('China Construction Bank', 'China', '$203.08 B', '$48.25 B', '$4,977.48 B'),
 ('Agricultural Bank of China',
  'China',
  '$186.14 B',
  '$37.92 B',
  '$5,356.86 B'),
 ('Bank of America', 'United States', '$133.84 B', '$28.62 B', '$3,194.66 B'),
 ('Alphabet', 'United States', '$282.85 B', '$58.59 B', '$369.49 B'),
 ('ExxonMobil', 'United States', '$393.16 B', '$61.69 B', '$369.37 B'),
 ('Microsoft', 'United States', '$207.59 B', '$69.02 B', '$380.09 B'),
 ('Apple', 'United States', '$385.1 B', '$94.32 B', '$332.16 B')]

In [22]:
import pandas as pd

In [63]:
data_company = pd.DataFrame(data, columns=('Company Name','Country', 'Sales', 'Profit', 'Assets'))
data_company.head(10)

Unnamed: 0,Company Name,Country,Sales,Profit,Assets
0,JPMorgan Chase,United States,$179.93 B,$41.8 B,"$3,744.3 B"
1,Saudi Arabian Oil Company (Saudi Aramco),Saudi Arabia,$589.47 B,$156.36 B,$660.99 B
2,ICBC,China,$216.77 B,$52.47 B,"$6,116.82 B"
3,China Construction Bank,China,$203.08 B,$48.25 B,"$4,977.48 B"
4,Agricultural Bank of China,China,$186.14 B,$37.92 B,"$5,356.86 B"
5,Bank of America,United States,$133.84 B,$28.62 B,"$3,194.66 B"
6,Alphabet,United States,$282.85 B,$58.59 B,$369.49 B
7,ExxonMobil,United States,$393.16 B,$61.69 B,$369.37 B
8,Microsoft,United States,$207.59 B,$69.02 B,$380.09 B
9,Apple,United States,$385.1 B,$94.32 B,$332.16 B


In [64]:
data_company.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Company Name  2000 non-null   object
 1   Country       2000 non-null   object
 2   Sales         2000 non-null   object
 3   Profit        2000 non-null   object
 4   Assets        2000 non-null   object
dtypes: object(5)
memory usage: 78.2+ KB


In [76]:
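# '$' is a regex metacharacter, so regex=False makes every replacement a literal one
data_company['Sales'] = data_company['Sales'].str.replace('$', '', regex=False).str.replace(' ', '', regex=False).str.replace(',', '', regex=False)
data_company['Profit'] = data_company['Profit'].str.replace('$', '', regex=False).str.replace(' ', '', regex=False).str.replace(',', '', regex=False)
data_company['Assets'] = data_company['Assets'].str.replace('$', '', regex=False).str.replace(' ', '', regex=False).str.replace(',', '', regex=False)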


In [78]:
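def convert_amount(amount_str):
    # Convert a cleaned string such as '179.93B' or '681.7M' to a whole number of dollars
    multipliers = {'M': 1e6, 'B': 1e9}
    suffix = amount_str[-1]
    if suffix in multipliers:
        # round() instead of int() so a tiny float error cannot truncate the value by one
        return round(float(amount_str[:-1]) * multipliers[suffix])
    # No suffix: the whole string is the number, so keep every digit
    return round(float(amount_str))

data_company['Sales'] = data_company['Sales'].apply(convert_amount)
data_company['Profit'] = data_company['Profit'].apply(convert_amount)
data_company['Assets'] = data_company['Assets'].apply(convert_amount)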
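
The cleaning step and the suffix conversion can also be folded into a single helper. The sketch below is an alternative under stated assumptions, not the notebook's method: `parse_money` is a hypothetical name, and it uses `decimal.Decimal` rather than `float` so that the scaling by a million or a billion is done in exact decimal arithmetic.

```python
from decimal import Decimal

# Scale factors for the suffixes that appear in the scraped strings
SUFFIXES = {"M": Decimal(10) ** 6, "B": Decimal(10) ** 9}


def parse_money(text):
    """Convert a string such as '$3,744.3 B' to a whole number of dollars."""
    # Drop the currency sign, thousands separators, and spaces
    cleaned = text.replace("$", "").replace(",", "").replace(" ", "")
    if cleaned and cleaned[-1] in SUFFIXES:
        return int(Decimal(cleaned[:-1]) * SUFFIXES[cleaned[-1]])
    # No recognized suffix: treat the whole string as the amount
    return int(Decimal(cleaned))


print(parse_money("$3,744.3 B"))
print(parse_money("$681.7 M"))
```

Applied to the raw scraped column, for example with `data_company['Sales'].apply(parse_money)`, this would replace both the `str.replace` chain and the separate conversion function.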

In [81]:
data_company[['Sales', 'Profit', 'Assets']] = data_company[['Sales', 'Profit', 'Assets']].astype('float')
data_company['Country'] = data_company['Country'].astype('category')

In [82]:
data_company.head(10)

Unnamed: 0,Company Name,Country,Sales,Profit,Assets
0,JPMorgan Chase,United States,179930000000.0,41800000000.0,3744300000000.0
1,Saudi Arabian Oil Company (Saudi Aramco),Saudi Arabia,589470000000.0,156360000000.0,660990000000.0
2,ICBC,China,216770000000.0,52470000000.0,6116820000000.0
3,China Construction Bank,China,203080000000.0,48250000000.0,4977480000000.0
4,Agricultural Bank of China,China,186140000000.0,37920000000.0,5356860000000.0
5,Bank of America,United States,133840000000.0,28620000000.0,3194660000000.0
6,Alphabet,United States,282850000000.0,58590000000.0,369490000000.0
7,ExxonMobil,United States,393160000000.0,61690000000.0,369370000000.0
8,Microsoft,United States,207590000000.0,69020000000.0,380090000000.0
9,Apple,United States,385100000000.0,94320000000.0,332160000000.0


In [83]:
data_company.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   Company Name  2000 non-null   object  
 1   Country       2000 non-null   category
 2   Sales         2000 non-null   float64 
 3   Profit        2000 non-null   float64 
 4   Assets        2000 non-null   float64 
dtypes: category(1), float64(3), object(1)
memory usage: 67.1+ KB


In [84]:
data_company.to_csv("The World’s Largest Companies.csv", index=False)

In [85]:
pd.read_csv("The World’s Largest Companies.csv").head()

Unnamed: 0,Company Name,Country,Sales,Profit,Assets
0,JPMorgan Chase,United States,179930000000.0,41800000000.0,3744300000000.0
1,Saudi Arabian Oil Company (Saudi Aramco),Saudi Arabia,589470000000.0,156360000000.0,660990000000.0
2,ICBC,China,216770000000.0,52470000000.0,6116820000000.0
3,China Construction Bank,China,203080000000.0,48250000000.0,4977480000000.0
4,Agricultural Bank of China,China,186140000000.0,37920000000.0,5356860000000.0
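
One thing to keep in mind when reading the file back: CSV stores plain text, so the `category` dtype set on `Country` earlier is not preserved. A minimal sketch of restoring it with the `dtype` argument of `pd.read_csv` follows. The two-row stand-in DataFrame and the in-memory buffer used in place of the real file are assumptions for illustration.

```python
import io

import pandas as pd

# Two-row stand-in for data_company, with Country stored as a category
df = pd.DataFrame({
    "Company Name": ["JPMorgan Chase", "ICBC"],
    "Country": pd.Categorical(["United States", "China"]),
    "Sales": [179930000000.0, 216770000000.0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read: Country comes back as ordinary text, not a category
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing dtype restores the category on load
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"Country": "category"})

print(plain.dtypes)
print(typed.dtypes)
```

The numeric columns round-trip as `float64` without extra arguments; only the categorical column needs the explicit `dtype`.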
