<h1>Scraping Data from Websites</h1>
<h3>Scraping the list of the largest companies in the world by revenue</h3>

First, let's import the libraries needed to scrape the data. In this tutorial I will scrape the list of companies with the largest revenue from Wikipedia, using BeautifulSoup to parse the page.

In [4]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

Let's put the URL of the site we are going to extract data from in the 'url' variable, then use the requests library to fetch the webpage's HTML content and store the response in the 'page' variable. Finally, the 'soup' variable will hold a BeautifulSoup object that can be used to extract specific information from the HTML content of the webpage.

In [5]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'

page = requests.get(url)

soup = BeautifulSoup(page.text, 'html')
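
The request above assumes the download succeeds. A slightly more defensive version might look like the sketch below. The `fetch_html` helper, the timeout value, and the User-Agent string are illustrative assumptions, not part of the original notebook. The parsing step is shown on a tiny inline HTML sample so it can be tried without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    # Hypothetical User-Agent; replace it with something describing your script.
    headers = {'User-Agent': 'scraping-tutorial-example/0.1'}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Parsing works the same on any HTML string, so a small inline sample
# stands in for the downloaded page here.
sample_html = '<html><body><table><tr><th>Rank</th></tr></table></body></html>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')
print(sample_soup.find('th').text)
```

Calling `fetch_html(url)` and passing the result to `BeautifulSoup` would then replace the two lines in the cell above.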

Here, we find the table we want to extract. It is the first table on the page, so we take index 0 of the result and then store it in the 'table' variable.

In [6]:
soup.find_all('table')[0]

<table class="wikitable sortable" style="text-align:left;">
<tbody><tr>
<th rowspan="2" scope="col">Rank
</th>
<th rowspan="2" scope="col">Name
</th>
<th rowspan="2" scope="col">Industry
</th>
<th scope="col">Revenue
</th>
<th scope="col">Profit
</th>
<th rowspan="2" scope="col">Employees
</th>
<th rowspan="2" scope="col">Headquarters<sup class="reference" id="cite_ref-4"><a href="#cite_note-4">[note 1]</a></sup>
</th>
<th rowspan="2" scope="col"><a href="/wiki/State-owned_enterprise" title="State-owned enterprise">State-owned</a>
</th>
<th class="unsortable" rowspan="2" scope="col"><abbr title="Reference(s)">Ref.</abbr>
</th></tr>
<tr>
<th colspan="2" scope="col"><small>USD millions</small>
</th></tr>
<tr>
<th scope="column">1
</th>
<td><a href="/wiki/Walmart" title="Walmart">Walmart</a></td>
<td><a href="/wiki/Retail" title="Retail">Retail</a></td>
<td style="text-align:center;"><span typeof="mw:File"><span title="Increase"><img alt="Increase" class="mw-file-element" data-file-height

In [7]:
table = soup.find_all('table')[0]
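
Indexing with `[0]` relies on the target being the first table on the page, which can change when the article is edited. As a sketch of an alternative, the table can be selected by its CSS class. The inline HTML below is a made-up stand-in for the real page.

```python
from bs4 import BeautifulSoup

sample_html = """
<table class="infobox"><tr><th>Not this one</th></tr></table>
<table class="wikitable sortable"><tr><th scope="col">Rank</th></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ matches any one of the element's classes, so 'wikitable'
# also matches class="wikitable sortable".
wikitable = sample_soup.find('table', class_='wikitable')
if wikitable is None:
    raise ValueError('No table with class "wikitable" found')

print(wikitable.find('th').text)
```

The explicit `None` check gives a clear error message if the page layout changes, instead of a failure further down in the notebook.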

Now we extract the column headers that we are interested in. These are the 'th' elements with scope="col".

In [8]:
titles = table.find_all('th', scope = 'col')

In [9]:
print(titles)

[<th rowspan="2" scope="col">Rank
</th>, <th rowspan="2" scope="col">Name
</th>, <th rowspan="2" scope="col">Industry
</th>, <th scope="col">Revenue
</th>, <th scope="col">Profit
</th>, <th rowspan="2" scope="col">Employees
</th>, <th rowspan="2" scope="col">Headquarters<sup class="reference" id="cite_ref-4"><a href="#cite_note-4">[note 1]</a></sup>
</th>, <th rowspan="2" scope="col"><a href="/wiki/State-owned_enterprise" title="State-owned enterprise">State-owned</a>
</th>, <th class="unsortable" rowspan="2" scope="col"><abbr title="Reference(s)">Ref.</abbr>
</th>, <th colspan="2" scope="col"><small>USD millions</small>
</th>]


We will now store the column header names in a list called 'table_titles', using title.text.strip() to extract each name without surrounding whitespace.

In [10]:
table_titles = [title.text.strip() for title in titles]

In [11]:
print(table_titles)

['Rank', 'Name', 'Industry', 'Revenue', 'Profit', 'Employees', 'Headquarters[note 1]', 'State-owned', 'Ref.', 'USD millions']
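
The 'Headquarters[note 1]' entry shows that footnote markers end up in the header text. One way to drop them, sketched here on a made-up pair of header cells, is to remove the `<sup>` elements before reading the text.

```python
from bs4 import BeautifulSoup

sample_html = (
    '<table><tr>'
    '<th scope="col">Rank\n</th>'
    '<th scope="col">Headquarters<sup class="reference">'
    '<a href="#cite_note-4">[note 1]</a></sup>\n</th>'
    '</tr></table>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

clean_titles = []
for th in sample_soup.find_all('th', scope='col'):
    # decompose() removes the footnote element from the parsed tree
    for sup in th.find_all('sup'):
        sup.decompose()
    clean_titles.append(th.get_text(strip=True))

print(clean_titles)
```

Note that `decompose()` modifies the parsed tree in place, so the footnotes are gone from `sample_soup` afterwards as well.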


To check each company's revenue I only need the first five headers, so I will keep just those.

In [12]:
print(table_titles[0:5])  

['Rank', 'Name', 'Industry', 'Revenue', 'Profit']


Now we create an empty pandas DataFrame with these titles as its columns.

In [13]:
df = pd.DataFrame(columns = table_titles[0:5])

In [14]:
df

Unnamed: 0,Rank,Name,Industry,Revenue,Profit

The second header row, 'USD millions', spans the Revenue and Profit columns, so we rename those two columns to include the unit.


In [15]:
df.rename(columns={'Revenue': 'Revenue (USD millions)', 'Profit': 'Profit (USD millions)'}, inplace=True)

In [16]:
df

Unnamed: 0,Rank,Name,Industry,Revenue (USD millions),Profit (USD millions)


Let's now work on getting the row data. Inspecting the page shows that each row of the table is a 'tr' element, so we collect all of them.

In [17]:
column_data = table.find_all('tr')
print(column_data)

[<tr>
<th rowspan="2" scope="col">Rank
</th>
<th rowspan="2" scope="col">Name
</th>
<th rowspan="2" scope="col">Industry
</th>
<th scope="col">Revenue
</th>
<th scope="col">Profit
</th>
<th rowspan="2" scope="col">Employees
</th>
<th rowspan="2" scope="col">Headquarters<sup class="reference" id="cite_ref-4"><a href="#cite_note-4">[note 1]</a></sup>
</th>
<th rowspan="2" scope="col"><a href="/wiki/State-owned_enterprise" title="State-owned enterprise">State-owned</a>
</th>
<th class="unsortable" rowspan="2" scope="col"><abbr title="Reference(s)">Ref.</abbr>
</th></tr>, <tr>
<th colspan="2" scope="col"><small>USD millions</small>
</th></tr>, <tr>
<th scope="column">1
</th>
<td><a href="/wiki/Walmart" title="Walmart">Walmart</a></td>
<td><a href="/wiki/Retail" title="Retail">Retail</a></td>
<td style="text-align:center;"><span typeof="mw:File"><span title="Increase"><img alt="Increase" class="mw-file-element" data-file-height="300" data-file-width="300" decoding="async" height="11" src="/

In [44]:
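for row in column_data[2:]:  # skip the two header rows
    row_header = row.find_all('th')  # the rank is stored in a <th> cell
    row_data = row.find_all('td')
    # Keep the rank plus the first four <td> cells: name, industry, revenue, profit
    full_row_data = [header.text.strip() for header in row_header]+[data.text.strip() for data in row_data[0:4]]
    print(full_row_data)

    # Append the row at the next free index of the DataFrame
    length = len(df)
    df.loc[length] = full_row_data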

['1', 'Walmart', 'Retail', '$572,754', '$13,673']
['2', 'Amazon.com, Inc.', 'Retail', '$469,822', '$33,364']
['3', 'State Grid Corporation of China', 'Electricity', '$460,616.9', '$7,137.8']
['4', 'China National Petroleum Corporation', 'Oil and gas', '$411,692.9', '$9,637.5']
['5', 'China Petrochemical Corporation', 'Oil and gas', '$401,313.5', '$8,316.1']
['6', 'Saudi Aramco', 'Oil and gas', '$400,399.1', '$105,369.1']
['7', 'Apple Inc.', 'Electronics', '$365,817', '$94,680']
['8', 'Volkswagen Group', 'Automotive', '$295,819.8', '$18,186.6']
['9', 'China State Construction Engineering', 'Construction', '$293,712.4', '$4,443.8']
['10', 'CVS Health', 'Healthcare', '$292,111', '$7,910']
['11', 'UnitedHealth Group', 'Healthcare', '$287,597', '$17,285']
['12', 'ExxonMobil', 'Oil and gas', '$285,640', '$23,050']
['13', 'Toyota', 'Automotive', '$279,337.7', '$25,371.4']
['14', 'Berkshire Hathaway', 'Financials', '$276,094', '$89,795']
['15', 'Shell plc', 'Oil and gas', '$272,657', '$20,101'
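
Appending with `df.loc[len(df)]` grows the DataFrame one row at a time. A common alternative, sketched here on a made-up two-row table with a similar layout, is to collect the rows in a list and build the DataFrame once. Skipping rows that have no `<td>` cells also avoids hard-coding the number of header rows. The names `rows` and `sample_df` are illustrative.

```python
import pandas as pd
from bs4 import BeautifulSoup

sample_html = """
<table>
<tr><th scope="col">Rank</th><th scope="col">Name</th><th scope="col">Industry</th>
<th scope="col">Revenue</th><th scope="col">Profit</th><th scope="col">Other</th></tr>
<tr><th>1</th><td>Walmart</td><td>Retail</td><td>$572,754</td><td>$13,673</td><td>ignored</td></tr>
<tr><th>2</th><td>Amazon.com, Inc.</td><td>Retail</td><td>$469,822</td><td>$33,364</td><td>ignored</td></tr>
</table>
"""

columns = ['Rank', 'Name', 'Industry', 'Revenue', 'Profit']
rows = []
for tr in BeautifulSoup(sample_html, 'html.parser').find_all('tr'):
    cells = tr.find_all('td')
    if not cells:
        continue  # header rows contain only <th> cells
    rank = [th.get_text(strip=True) for th in tr.find_all('th')]
    values = [td.get_text(strip=True) for td in cells[:4]]
    rows.append(rank + values)

# Build the DataFrame in a single step from the collected rows
sample_df = pd.DataFrame(rows, columns=columns)
print(sample_df)
```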

In [46]:
df

Unnamed: 0,Rank,Name,Industry,Revenue (USD millions),Profit (USD millions)
0,1,Walmart,Retail,"$572,754","$13,673"
1,2,"Amazon.com, Inc.",Retail,"$469,822","$33,364"
2,3,State Grid Corporation of China,Electricity,"$460,616.9","$7,137.8"
3,4,China National Petroleum Corporation,Oil and gas,"$411,692.9","$9,637.5"
4,5,China Petrochemical Corporation,Oil and gas,"$401,313.5","$8,316.1"
5,6,Saudi Aramco,Oil and gas,"$400,399.1","$105,369.1"
6,7,Apple Inc.,Electronics,"$365,817","$94,680"
7,8,Volkswagen Group,Automotive,"$295,819.8","$18,186.6"
8,9,China State Construction Engineering,Construction,"$293,712.4","$4,443.8"
9,10,CVS Health,Healthcare,"$292,111","$7,910"
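
The revenue and profit columns are still strings such as '$572,754', so they cannot be sorted or summed numerically. A minimal sketch of the conversion, using the first three rows of the output above:

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Name': ['Walmart', 'Amazon.com, Inc.', 'State Grid Corporation of China'],
    'Revenue (USD millions)': ['$572,754', '$469,822', '$460,616.9'],
    'Profit (USD millions)': ['$13,673', '$33,364', '$7,137.8'],
})

for col in ['Revenue (USD millions)', 'Profit (USD millions)']:
    # Strip the dollar sign and thousands separators, then parse as a number.
    # errors='coerce' turns anything unparseable into NaN instead of raising.
    cleaned = sample_df[col].str.replace(r'[$,]', '', regex=True)
    sample_df[col] = pd.to_numeric(cleaned, errors='coerce')

print(sample_df.dtypes)
```

With `errors='coerce'`, it is worth checking the result for NaN values afterwards, since a cell in an unexpected format would be silently dropped rather than reported.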


Save the DataFrame as a CSV file on your local machine as follows. Passing index = False leaves out the row numbers that would otherwise be written as the first column.

In [50]:
df.to_csv(r'C:\Users\91944\datascraping_project.csv', index = False)

df.to_csv(r'your desired location', index = False)
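
The path in the cell above is specific to one Windows machine. As a sketch of a more portable variant, `pathlib` can build the path, and reading the file back confirms what was written. The temporary directory and the two-row sample DataFrame here are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

sample_df = pd.DataFrame({'Rank': ['1', '2'], 'Name': ['Walmart', 'Amazon.com, Inc.']})

# Placeholder location: a temporary directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / 'datascraping_project.csv'
    sample_df.to_csv(out_path, index=False)

    # Read the file back to check its contents.
    round_trip = pd.read_csv(out_path)

print(round_trip.columns.tolist())
```

Names that contain commas, such as 'Amazon.com, Inc.', are quoted by `to_csv`, so they survive the round trip as a single field.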