# Scraping World's Top Automobile Companies by Market Value using Python

![](https://i.imgur.com/uCfkM0b.jpg)

**Data** is the collection of facts!

_**Web Scraping**_ is a technique used to automatically extract large amounts of data from websites and save it to a file or database. The scraped data is usually stored in a tabular or spreadsheet format (e.g. a CSV file).


In this project, we will scrape data from [value.today](https://www.value.today/world-top-companies/automobile).

We'll use the Python libraries `requests` and `beautifulsoup4` to scrape the webpage.



Here's an outline of the steps we'll follow:

1. Download the webpage using `requests`
2. Parse the HTML source code using `beautifulsoup4`
3. Extract company name, headquarters country, CEO, market cap (in billion USD), annual revenue (in million USD), number of employees, and company website
4. Compile the extracted information into Python lists and dictionaries
5. Extract and combine data from multiple pages
6. Save the extracted information to a CSV file.


By the end of the project, we'll create a CSV file in the following format:
![](https://i.imgur.com/y8VVs41.png)


## Download the webpage using `requests`


We'll use the `requests` library to download the web page.

The library can be installed using `pip`:

In [48]:
!pip install requests --upgrade --quiet

In [49]:
import requests

The library is now installed and imported.

To download a page, we can use the `get` function from requests, which returns a response object.

In [50]:
webpage = 'https://www.value.today/world-top-companies/automobile'

In [51]:
response = requests.get(webpage)

`requests.get` returns a response object containing the data from the web page and some other information.

The `.status_code` property can be used to check if the response was successful. A successful response will have an [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) between 200 and 299.
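
As a small illustration of the 200 to 299 rule, here is a minimal sketch of a helper that classifies a status code. The name `is_success` is hypothetical and is not part of `requests`; with a real response you would pass `response.status_code` to it.

```python
from http import HTTPStatus


def is_success(status_code):
    """Return True when the HTTP status code is in the 2xx (success) range."""
    return 200 <= status_code <= 299


# HTTPStatus members are integers, so they can be passed directly.
print(is_success(HTTPStatus.OK))         # status 200
print(is_success(HTTPStatus.NOT_FOUND))  # status 404
```

`requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses, so an explicit range check like this is mostly useful when you want to handle failures yourself.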

In [52]:
response.status_code

200

The request was successful. We can get the contents of the page using `response.text`.

In [53]:
page_contents = response.text

Let's check the number of characters of the page. 

In [54]:
len(response.text)

184047

The page contains over 184,000 characters!

Here are the first 1000 characters of the page:

In [55]:
page_contents[:1000]

'<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">\n  <head>\n    <meta charset="utf-8"/>\n<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-2407955258669770" crossorigin="anonymous"></script>\n<script>(adsbygoogle=window.adsbygoogle||[]).push({google_ad_client:"ca-pub-2407955258669770",enable_page_level_ads:true});</script><script>window.google_analytics_uacct="UA-121331115-1";(function(i,s,o,g,r,a,m){i["GoogleAnalyticsObject"]=r;i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.p

In the above cell, `page_contents[:1000]` shows the beginning of the [HTML](https://en.wikipedia.org/wiki/HTML) of the webpage [value.today](https://www.value.today/world-top-companies/automobile).

We can also save it to a file and view the page locally within Jupyter using "File > Open".

In [56]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

This page looks similar to the original page.

![](https://i.imgur.com/rbWMVHS.png)

In this section, we used the `requests` library to download a web page as HTML.

## Parse the HTML source code using `beautifulsoup4`


In [57]:
!pip install beautifulsoup4 --upgrade --quiet

In [58]:
from bs4 import BeautifulSoup

In [59]:
doc = BeautifulSoup(page_contents, 'html.parser')

With this `doc` object, we can navigate and search through the `HTML` for data that we want.

In [60]:
type(doc)

bs4.BeautifulSoup

The `doc` object contains several properties and methods for extracting information from the HTML document. See [the documentation of BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for details.

In [61]:
title_tag = doc.find('title')

Here, `doc.find('title')` returns the first `title` tag of the web page.

In [62]:
title_tag

<title>World Top Automobile Companies by Market Value as on 2022</title>

To get only the text out of the tag, we use `.text`.

In [63]:
title_tag.text

'World Top Automobile Companies by Market Value as on 2022'
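
The same `find` and `.text` pattern can be tried on a tiny hand-written document, which makes it easier to see what each call returns. This is only a sketch: the HTML and the names `sample_html`, `sample_doc`, `sample_title` and `missing` are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document standing in for the downloaded page.
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body><h1>Hello</h1></body>
</html>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

sample_title = sample_doc.find('title')  # the first <title> tag
print(sample_title)       # the whole tag, including markup
print(sample_title.text)  # only the text inside the tag

missing = sample_doc.find('table')  # the sample has no <table>, so this is None
print(missing)
```

Because `find` returns `None` when nothing matches, calling `.text` on the result of a failed search raises an `AttributeError`.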

## Extract company names, Headquarters country, CEOs, Market capitalization, Annual revenue, number of employees, company website

Upon inspecting the HTML code, we can see that all the information we need to scrape is inside `li` tags whose `class` attribute is set to `row well clearfix`.

![](https://i.imgur.com/6OJ6rFN.png)

Let's find all the `li` tags matching this class.

We store the value of the `class` attribute in a variable named `company_selector`.

In [64]:
company_selector = "row well clearfix"
company_tags = doc.find_all('li' , {'class' : company_selector})

In [65]:
len(company_tags)

0

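Each results page is expected to contain 10 company boxes, one per `li` tag. A count of `0`, as in the run shown above, means that no matching tags were found in the downloaded HTML. This can happen if the site's markup has changed or if the server returned a different page to the script than to the browser.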
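
One detail of the class lookup is worth illustrating on hand-written HTML: a multi-class string passed to `find_all` is compared with the class attribute as a whole, while a CSS selector passed to `select` only requires that all the classes are present. The markup and variable names below are made up for this sketch.

```python
from bs4 import BeautifulSoup

sample_html = """
<ul>
  <li class="row well clearfix">Company A</li>
  <li class="well row clearfix">Company B</li>
  <li class="row">Not a company box</li>
</ul>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# A multi-class string is compared with the exact value of the class attribute.
exact = sample_doc.find_all('li', {'class': 'row well clearfix'})

# A CSS selector requires all three classes but ignores their order.
by_selector = sample_doc.select('li.row.well.clearfix')

print(len(exact), len(by_selector))
```

If the site ever reorders or adds classes, the selector form is the more forgiving of the two.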

### Company Name

![](https://i.imgur.com/jBswKpc.png)

Now let's create a function that uses a `for` loop to extract the company names from the first page.

In [66]:
def name_of_companies(company_tags):
    company_names = []
    
    for tag in company_tags:
        c_name = tag.find('div' , {'class' : "field field--name-node-title field--type-ds field--label-hidden field--item"})
        try:
            h2_tag = c_name.find('h2' , {'class' : "text-primary"})
            company_names.append(h2_tag.find('a').text)
        except AttributeError:
            # A missing title becomes None so that all lists stay the same length
            company_names.append(None)
    
    return company_names

We can call the function `name_of_companies` to get the company names.

In [67]:
#Let's check the function
name_of_companies(company_tags)

[]

### Headquarters Country

![](https://i.imgur.com/MHJtL8L.png)

Now let's create a function that uses a `for` loop to extract the headquarters country of each company on the first page.

In [68]:
def headquarters_country(company_tags):
    headquarter_country = []
    
    for tag in company_tags:
        headquarters = tag.find('div' , {'class' : "clearfix col-sm-12 field field--name-field-headquarters-of-company field--type-entity-reference field--label-above"})
        try:
            hq = headquarters.find('div' , {'class' : "field--item"})
            headquarter_country.append(hq.find('a').text)
        except AttributeError:
            headquarter_country.append(None)
    return headquarter_country

We can call the function `headquarters_country` to get the headquarters countries.

In [69]:
#Let's check the function
headquarters_country(company_tags)

[]

### CEOs

![](https://i.imgur.com/hJViahw.png)

Now let's create a function that uses a `for` loop to extract the CEO names from the first page.

In [70]:
def name_of_ceos(company_tags):
    CEO_name = []
    
    for tag in company_tags:
        ceo = tag.find('div' , {'class' : "clearfix col-sm-12 field field--name-field-ceo field--type-entity-reference field--label-above"})
        try:
            name = ceo.find('div' , {'class' : "field--item"})
            CEO_name.append(name.find('a').text)
        
        except AttributeError:
            CEO_name.append(None)
        
    return CEO_name

We can call the function `name_of_ceos` to get the CEO names.

In [71]:
#let's call the function
name_of_ceos(company_tags)

[]

### Market Capitalization

![](https://i.imgur.com/4SmLn51.png)

Now let's create a function that uses a `for` loop to extract the market capitalization of each company on the first page.

In [72]:
def market_capitalisation_in_billion_USdollars(company_tags):
    market_capitalisation = []
    
    for tag in company_tags:
        market_cap = tag.find('div' , {'class' : "clearfix col-sm-6 field field--name-field-market-value-jan072022 field--type-float field--label-above"})
        try:
            cap = market_cap.find('div' , {'class' : "field--item"}).text
           
            replace_USD = cap.replace(' Billion USD' , "").replace(',' , "")
            
            market_capitalisation.append(float(replace_USD))
        
        except AttributeError:
            market_capitalisation.append(None)
        
    return market_capitalisation

We can call the function `market_capitalisation_in_billion_USdollars` to get the market capitalization.

In [73]:
#let's call the function
market_capitalisation_in_billion_USdollars(company_tags)

[]
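
The string clean-up inside the function above can be sketched as a small standalone helper. The function above only catches `AttributeError`, so a value that `float` cannot parse would raise a `ValueError`; this sketch returns `None` in that case too. The name `parse_amount` and the sample strings are made up for illustration and are not taken from the site.

```python
def parse_amount(text, unit):
    """Convert text such as '1,061.29 Billion USD' to a float, or None on failure."""
    cleaned = text.replace(unit, '').replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        # Text that is not a plain number (e.g. 'N/A') ends up here.
        return None


print(parse_amount('1,061.29 Billion USD', ' Billion USD'))
print(parse_amount('N/A', ' Billion USD'))
```

The same helper would work for the revenue column by passing `' Million USD'` as the unit.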

### Annual Revenue

![](https://i.imgur.com/OnpKeBT.png)

Now let's create a function that uses a `for` loop to extract the annual revenue of each company on the first page.

In [74]:
def annual_revenue_in_million_USdollars(company_tags):
    annual_revenue = []
    
    for tag in company_tags:
        revenue = tag.find('div' , {'class' : "clearfix col-sm-12 field field--name-field-revenue-in-usd field--type-float field--label-inline"})
        try:
            annual = revenue.find('div' , {'class' : "field--item"}).text
           
            replace_USD = annual.replace(' Million USD' , "").replace(',' , "")
            annual_revenue.append(float(replace_USD))
        
        except AttributeError:
            annual_revenue.append(None)
        
    return annual_revenue

We can call the function `annual_revenue_in_million_USdollars` to get the annual revenue.

In [75]:
#Let's check the function
annual_revenue_in_million_USdollars(company_tags)

[]

### Number of Employees

![](https://i.imgur.com/xbbn1XI.png)

Now let's create a function that uses a `for` loop to extract the number of employees of each company on the first page.

In [76]:
def number_of_employees(company_tags):
    employees_count = []
    
    for tag in company_tags:
        employees = tag.find('div' , {'class' : "clearfix col-sm-12 field field--name-field-employee-count field--type-integer field--label-inline"})
        try:
            count = employees.find('div' , {'class' : "field--item"}).text
            n_replace = count.replace(',' , "")
            employees_count.append(int(n_replace))
        
        except AttributeError:
            employees_count.append(None)
        
    return employees_count

We can call the function `number_of_employees` to get the number of employees.

In [77]:
#Let's check the function
number_of_employees(company_tags)

[]

### Company Website

![](https://i.imgur.com/tL3wze8.png)

Now let's create a function that uses a `for` loop to extract the website of each company on the first page.

In [78]:
def company_website(company_tags):
    website = []
    
    for tag in company_tags:
        c_url = tag.find('div' , {'class' : "clearfix col-sm-12 field field--name-field-company-website field--type-link field--label-above"})
        try:
            url = c_url.find('div' , {'class' : "field--item"})
            website.append(url.find('a')['href'])
        
        # Indexing a missing <a> tag (None) raises TypeError, not AttributeError
        except (AttributeError, TypeError):
            website.append(None)
        
    return website

We can call the function `company_website` to get the company websites.

In [79]:
#Let's check the function
company_website(company_tags)

[]

So far, we have created `7` functions: `name_of_companies`, `headquarters_country`, `name_of_ceos`, `market_capitalisation_in_billion_USdollars`, `annual_revenue_in_million_USdollars`, `number_of_employees` and `company_website`. Together they give us an approach to extract the data from each company block.
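
Most of these functions follow the same pattern: find the `div` for a field, then read its `field--item`. As a hedged sketch, that pattern could be factored into one helper that matches on a single field-specific class through a CSS selector. The helper name `extract_field_text` and the sample markup are hypothetical; the real pages may be structured differently.

```python
from bs4 import BeautifulSoup


def extract_field_text(tag, field_class):
    """Return the stripped text of the first .field--item inside the field, or None."""
    item = tag.select_one(f'div.{field_class} div.field--item')
    return item.get_text(strip=True) if item is not None else None


# Hand-written markup imitating one company box; real pages may differ.
sample_html = """
<li class="row well clearfix">
  <div class="clearfix col-sm-12 field field--name-field-ceo field--label-above">
    <div class="field--label">CEO</div>
    <div class="field--item"><a href="/ceo/jane-doe">Jane Doe</a></div>
  </div>
</li>
"""
box = BeautifulSoup(sample_html, 'html.parser').find('li')

print(extract_field_text(box, 'field--name-field-ceo'))
print(extract_field_text(box, 'field--name-field-employee-count'))
```

Matching on one distinctive class instead of the full class string means the lookup keeps working if layout classes such as `col-sm-12` change.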

## Compile the extracted information into Python lists and dictionaries
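
Each extraction function returns one list per column, with `None` standing in for missing values, so position `i` in every list refers to the same company. A minimal sketch of how such lists fit together in a dictionary, and how they can be turned into one record per company, is shown here. The values are made-up placeholders, not scraped data.

```python
# Made-up placeholder values, only to show the shape of the data.
columns = {
    'companies_name': ['Company A', 'Company B'],
    'headquarters_country': ['Country X', None],
    'number_of_employees': [1000, 2500],
}

# All lists must have the same length so that position i always means company i.
lengths = {len(values) for values in columns.values()}
assert len(lengths) == 1, 'columns have different lengths'

# zip(*columns.values()) walks the lists in parallel, one company at a time.
rows = [dict(zip(columns.keys(), values)) for values in zip(*columns.values())]

print(rows[0])
```

The dictionary-of-lists shape is also what `pandas.DataFrame` accepts directly, which is why the scraper collects its results that way.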

## Extract and combine data from multiple pages

Let's define a function to download any webpage and parse it using Beautiful Soup.

In [80]:
def get_page(url):
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception(f'Unable to download page {url}')
    page_contents = response.text
    doc = BeautifulSoup(page_contents, 'html.parser')
    
    return doc
    

We can use the function `get_page` to download any web page and parse it using Beautiful Soup.

Let's create a `dictionary` of lists using all the functions.

![](https://i.imgur.com/5cMet0o.png)

As there are `90` pages on the website, we need to `loop` through all of them to extract the complete data.

In [81]:
def scrape_page():
    # One list per column; every page appends its values to these lists
    all_info_dict = {
        'companies_name':[],
        'headquarters_country':[],
        'CEOs_name':[],
        'market_capitalisation_in_billion_USdollars':[],
        'annual_revenue_in_million_USdollars':[],
        'number_of_employees':[],
        'company_website':[]
            }
    for page in range(90):

        url = f"https://www.value.today/world-top-companies/automobile?title=&field_headquarters_of_company_target_id&field_company_category_primary_target_id&field_company_website_uri=&field_market_value_jan072022_value=&page={page}"
        company_tags = get_page(url).find_all('li',class_='row well clearfix')

        all_info_dict['companies_name'] += name_of_companies(company_tags)
        all_info_dict['headquarters_country'] +=  headquarters_country(company_tags)
        all_info_dict['CEOs_name'] += name_of_ceos(company_tags)
        all_info_dict['market_capitalisation_in_billion_USdollars'] += market_capitalisation_in_billion_USdollars(company_tags)
        all_info_dict['annual_revenue_in_million_USdollars'] += annual_revenue_in_million_USdollars(company_tags)
        all_info_dict['number_of_employees'] += number_of_employees(company_tags)
        all_info_dict['company_website'] += company_website(company_tags)
    return all_info_dict
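
The loop above hard-codes 90 pages. As a hedged alternative, the paging logic can be sketched so that it stops at the first empty page and optionally pauses between requests. The names `collect_pages`, `fetch_page_items` and `fake_site` are hypothetical; in the real scraper, `fetch_page_items` could wrap a call to `get_page(url).find_all(...)`.

```python
import time


def collect_pages(fetch_page_items, max_pages, delay_seconds=0.0):
    """Call fetch_page_items(page) for page 0, 1, ... and stop at the first empty page."""
    collected = []
    for page in range(max_pages):
        items = fetch_page_items(page)
        if not items:
            # An empty page suggests we ran past the last page of results.
            break
        collected.extend(items)
        if delay_seconds:
            time.sleep(delay_seconds)  # pause between requests to go easy on the server
    return collected


# Stand-in for a real download: three pages of fake items, then nothing.
fake_site = {0: ['a', 'b'], 1: ['c'], 2: ['d', 'e']}
result = collect_pages(lambda page: fake_site.get(page, []), max_pages=90)
print(result)
```

Passing the fetch step in as a function keeps the paging logic testable without touching the network.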

In [82]:
# Import pandas to create a DataFrame from the dictionary
import pandas as pd

In [83]:
scrape_page_dataframe = pd.DataFrame(scrape_page())

In [84]:
# Let's view the first 5 and last 5 rows
scrape_page_dataframe

Unnamed: 0,companies_name,headquarters_country,CEOs_name,market_capitalisation_in_billion_USdollars,annual_revenue_in_million_USdollars,number_of_employees,company_website


## Save the extracted information to a CSV file

In [85]:
scrape_page_dataframe.to_csv('scrape_page_dataframe.csv', index=False)
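
To see what the CSV text itself looks like, here is a minimal sketch that writes a couple of records with Python's standard `csv` module instead of pandas. The rows are made-up placeholders and only two of the seven columns are shown. An in-memory buffer stands in for the output file.

```python
import csv
import io

# Made-up placeholder rows in the same column layout as the scraped data.
rows = [
    {'companies_name': 'Company A', 'number_of_employees': 1000},
    {'companies_name': 'Company B', 'number_of_employees': None},
]

buffer = io.StringIO()  # in-memory stand-in for a file opened for writing
writer = csv.DictWriter(buffer, fieldnames=['companies_name', 'number_of_employees'])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Missing values (`None`) are written as empty fields, which is how gaps in the scraped data appear in the file.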

## Summary

Here's what we've covered in this notebook:

1. Downloaded the webpage using `requests`
2. Parsed the HTML source code using `beautifulsoup4`
3. Extracted company name, headquarters country, CEO, market cap (in billion USD), annual revenue (in million USD), number of employees, and company website
4. Compiled the extracted information into Python lists and dictionaries
5. Extracted and combined data from multiple pages
6. Saved the extracted information to a CSV file.


The CSV file we created has this format:

![](https://i.imgur.com/Tvu88n0.png)

Here's the complete code for this project:

In [86]:
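import requests
from bs4 import BeautifulSoup


def name_of_companies(company_tags):
    company_names = []
    
    for tag in company_tags:
        c_name = tag.find('div' , {'class' : "field field--name-node-title field--type-ds field--label-hidden field--item"})
        try:
            h2_tag = c_name.find('h2' , {'class' : "text-primary"})
            company_names.append(h2_tag.find('a').text)
        except AttributeError:
            # A missing title becomes None so that all lists stay the same length
            company_names.append(None)
    
    return company_names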


def headquarters_country(company_tags):
    headquarter_country = []
    
    for tag in company_tags:
        headquarters = tag.find('div' , {'class' : "clearfix col-sm-12 field field--name-field-headquarters-of-company field--type-entity-reference field--label-above"})
        try:
            hq = headquarters.find('div' , {'class' : "field--item"})
            headquarter_country.append(hq.find('a').text)
        except AttributeError:
            headquarter_country.append(None)
    return headquarter_country


def name_of_ceos(company_tags):
    CEO_name = []
    
    for tag in company_tags:
        ceo = tag.find('div' , {'class' : "clearfix col-sm-12 field field--name-field-ceo field--type-entity-reference field--label-above"})
        try:
            name = ceo.find('div' , {'class' : "field--item"})
            CEO_name.append(name.find('a').text)
        
        except AttributeError:
            CEO_name.append(None)
        
    return CEO_name


def market_capitalisation_in_billion_USdollars(company_tags):
    market_capitalisation = []
    
    for tag in company_tags:
        market_cap = tag.find('div' , {'class' : "clearfix col-sm-6 field field--name-field-market-value-jan072022 field--type-float field--label-above"})
        try:
            cap = market_cap.find('div' , {'class' : "field--item"}).text
           
            replace_USD = cap.replace(' Billion USD' , "").replace(',' , "")
            
            market_capitalisation.append(float(replace_USD))
        
        except AttributeError:
            market_capitalisation.append(None)
        
    return market_capitalisation


def annual_revenue_in_million_USdollars(company_tags):
    annual_revenue = []
    
    for tag in company_tags:
        revenue = tag.find('div' , {'class' : "clearfix col-sm-12 field field--name-field-revenue-in-usd field--type-float field--label-inline"})
        try:
            annual = revenue.find('div' , {'class' : "field--item"}).text
           
            replace_USD = annual.replace(' Million USD' , "").replace(',' , "")
            annual_revenue.append(float(replace_USD))
        
        except AttributeError:
            annual_revenue.append(None)
        
    return annual_revenue


def number_of_employees(company_tags):
    employees_count = []
    
    for tag in company_tags:
        employees = tag.find('div' , {'class' : "clearfix col-sm-12 field field--name-field-employee-count field--type-integer field--label-inline"})
        try:
            count = employees.find('div' , {'class' : "field--item"}).text
            n_replace = count.replace(',' , "")
            employees_count.append(int(n_replace))
        
        except AttributeError:
            employees_count.append(None)
        
    return employees_count


def company_website(company_tags):
    website = []
    
    for tag in company_tags:
        c_url = tag.find('div' , {'class' : "clearfix col-sm-12 field field--name-field-company-website field--type-link field--label-above"})
        try:
            url = c_url.find('div' , {'class' : "field--item"})
            website.append(url.find('a')['href'])
        
        # Indexing a missing <a> tag (None) raises TypeError, not AttributeError
        except (AttributeError, TypeError):
            website.append(None)
        
    return website


def get_page(url):
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception(f'Unable to download page {url}')
    page_contents = response.text
    doc = BeautifulSoup(page_contents, 'html.parser')
    
    return doc


def scrape_page():
    # One list per column; every page appends its values to these lists
    all_info_dict = {
        'companies_name':[],
        'headquarters_country':[],
        'CEOs_name':[],
        'market_capitalisation_in_billion_USdollars':[],
        'annual_revenue_in_million_USdollars':[],
        'number_of_employees':[],
        'company_website':[]
            }
    for page in range(90):

        url = f"https://www.value.today/world-top-companies/automobile?title=&field_headquarters_of_company_target_id&field_company_category_primary_target_id&field_company_website_uri=&field_market_value_jan072022_value=&page={page}"
        company_tags = get_page(url).find_all('li',class_='row well clearfix')

        all_info_dict['companies_name'] += name_of_companies(company_tags)
        all_info_dict['headquarters_country'] +=  headquarters_country(company_tags)
        all_info_dict['CEOs_name'] += name_of_ceos(company_tags)
        all_info_dict['market_capitalisation_in_billion_USdollars'] += market_capitalisation_in_billion_USdollars(company_tags)
        all_info_dict['annual_revenue_in_million_USdollars'] += annual_revenue_in_million_USdollars(company_tags)
        all_info_dict['number_of_employees'] += number_of_employees(company_tags)
        all_info_dict['company_website'] += company_website(company_tags)
    return all_info_dict

## Future Work

* We can now fetch individual topic pages and get the list of top automobile manufacturers
* We can scrape the page to get additional information
* We can use this data for further analysis
* We can extract the data for two or more different audit months and compare them

## References

* [Jovian](https://jovian.ai/): a platform to learn data science

* This project was made under the guidance of [Aakash N S](https://aakashns.medium.com/)

* A YouTube video by `Aakash N S`: [Let's Build a Python Web Scraping Project from Scratch | Hands-On Tutorial](https://www.youtube.com/watch?v=RKsLLG-bzEY&t=6677s)

* [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
                                 
* [Pandas Documentation](https://pandas.pydata.org/docs/)