# **Web scraping using Beautiful Soup**

This notebook demonstrates web scraping: it takes a website URL as input and extracts the following information from that webpage.

1. Specific HTML tags, along with the title and meta description
2. All heading tags (h1-h6), along with the title and meta description
3. Image alt attributes (ALT tags)
4. Word counts for the page
5. Broken links on the page
6. The source code of the webpage (in Google Colab)
7. All URLs from a website, without duplicates
8. The frontend and backend performance of the website






In [15]:
!pip install beautifulsoup4



**1. For scraping specific HTML tags along with titles and meta description**

In [25]:
# Importing libraries
from bs4 import BeautifulSoup
import urllib.request as ur

In [26]:
# Getting the website URL from the user
urlinput = input("Enter url :")
print(" This is the website link that you entered", urlinput)
s = ur.urlopen(urlinput)
# Naming the parser explicitly avoids Beautiful Soup's "no parser specified" warning
soup = BeautifulSoup(s.read(), "html.parser")

# For extracting specific tags from the webpage
def getTags(tag):
    return soup.find_all(tag)

# For extracting the title and meta description from the webpage
def titleandmetaTags():
    #----- Extracting the title from the website ------#
    title = soup.title.string if soup.title else None
    print ('Website Title is :', title)
    #----- Extracting the meta description and keywords from the website ------#
    for tag in soup.find_all('meta'):
        if tag.attrs.get('name', '').strip().lower() in ['description', 'keywords']:
            # .get() avoids a KeyError when a meta tag has no content attribute
            print ('CONTENT :', tag.attrs.get('content'))

#------------- Main ---------------#
if __name__ == '__main__':
    titleandmetaTags()
    tags = getTags('h1')
    for tag in tags:
        print(tag) # display the tag
        print(tag.contents) # display the contents of the tag
        

Enter url :https://en.wikipedia.org/wiki/Mahatma_Gandhi
 This is the website link that you entered https://en.wikipedia.org/wiki/Mahatma_Gandhi
Website Title is : Mahatma Gandhi - Wikipedia
<h1 class="firstHeading mw-first-heading" id="firstHeading">Mahatma Gandhi</h1>
['Mahatma Gandhi']


In [27]:
# Get all the text from the entered URL

In [28]:
print(soup.get_text())




Mahatma Gandhi - Wikipedia










































Mahatma Gandhi

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search
Indian nationalist leader and nonviolence advocate (1869–1948)
"Gandhi" redirects here. For other uses, see Gandhi (disambiguation).


MahātmāGandhiGandhi in London, 1931BornMohandas Karamchand Gandhi(1869-10-02)2 October 1869Porbandar, Kathiawar Agency, British RajDied30 January 1948(1948-01-30) (aged 78)New Delhi,  Dominion of IndiaCause of deathAssassination (gunshot wounds)MonumentsRaj GhatGandhi SmritiCitizenshipBritish Raj (1869–1947)Dominion of India (1947–1948)Alma materAlfred High School, Rajkot (1880 – November 1887)Samaldas Arts College, Bhavnagar (January 1888 – July 1888)Inner Temple, London (September 1888–1891)(Informal auditing student at University College London between 1888 and 1891)OccupationLawyeranti-colonialistpolitical ethicistYears active1893–1948EraBritish RajKnown forLeadership of the campaign for I
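
The raw `get_text()` output above is dominated by blank lines left over from the page's markup. Below is a minimal sketch of one way to tidy it, assuming the text is already available as a string. `collapse_blank_lines` is a hypothetical helper name, not part of Beautiful Soup, and the sample string only stands in for real output.

```python
def collapse_blank_lines(text):
    """Strip each line and drop the ones that are empty afterwards."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)


# Hypothetical sample standing in for the output of soup.get_text()
sample = (
    "\n\n\nMahatma Gandhi - Wikipedia\n\n\n\n"
    "  Mahatma Gandhi\n\nFrom Wikipedia, the free encyclopedia\n"
)
print(collapse_blank_lines(sample))
```

In the notebook this could be called as `collapse_blank_lines(soup.get_text())`. Beautiful Soup's `get_text()` also accepts `separator` and `strip` arguments, which can give a similar result at extraction time.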

**2. For extracting specific tags, all heading tags from h1-h6 along with titles and meta description**

In [29]:
# Importing libraries
from bs4 import BeautifulSoup
import urllib.request as ur

In [30]:
# Getting the website URL from the user
url_input = input("Enter url :")
print(" This is the website link that you entered", url_input)


# For extracting specific tags from the webpage
def getTags(tag):
    s = ur.urlopen(url_input)
    soup = BeautifulSoup(s.read(), "html.parser")
    return soup.find_all(tag)

# For extracting all h1-h6 heading tags from the webpage
def headingTags():
    h = ur.urlopen(url_input)
    soup = BeautifulSoup(h.read(), "html.parser")
    print("List of headings from headingtags function h1, h2, h3, h4, h5, h6 :")
    for heading in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
        print(heading.name + ' ' + heading.text.strip())

# For extracting the title and meta description from the webpage
def titleandmetaTags():
    s = ur.urlopen(url_input)
    soup = BeautifulSoup(s.read(), "html.parser")
    #----- Extracting the title from the website ------#
    title = soup.title.string if soup.title else None
    print ('Website Title is :', title)
    #----- Extracting the meta description and keywords from the website ------#
    for tag in soup.find_all('meta'):
        if tag.attrs.get('name', '').strip().lower() in ['description', 'keywords']:
            # .get() avoids a KeyError when a meta tag has no content attribute
            print ('CONTENT :', tag.attrs.get('content'))


#------------- Main ---------------#
if __name__ == '__main__':
    titleandmetaTags()
    tags = getTags('p')
    headingTags()
    for tag in tags:
        print(" Here are the tags from getTags function:", tag.contents)
        



Enter url :https://en.wikipedia.org/wiki/Mahatma_Gandhi
 This is the website link that you entered https://en.wikipedia.org/wiki/Mahatma_Gandhi
Website Title is : Mahatma Gandhi - Wikipedia
List of headings from headingtags function h1, h2, h3, h4, h5, h6 :
h1 Mahatma Gandhi
h2 Contents
h2 Biography
h3 Early life and background
h3 Three years in London
h4 Student of law
h4 Vegetarianism and committee work
h4 Called to the bar
h3 Civil rights activist in South Africa (1893–1914)
h4 Europeans, Indians and Africans
h3 Struggle for Indian independence (1915–1947)
h4 Role in World War I
h4 Champaran agitations
h4 Kheda agitations
h4 Khilafat movement
h4 Non-co-operation
h4 Salt Satyagraha (Salt March)
h4 Gandhi as folk hero
h4 Negotiations
h4 Round Table Conferences
h4 Congress politics
h4 World War II and Quit India movement
h4 Partition and independence
h3 Death
h4 Funeral and memorials
h2 Principles, practices, and beliefs
h3 Influences
h4 Leo Tolstoy
h4 Shrimad Rajchandra
h4 Religious t
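
The flat heading list above can be easier to scan as an indented outline. The sketch below assumes the headings have already been collected as `(tag name, text)` pairs, for example from `heading.name` and `heading.text.strip()`. `format_outline` is a hypothetical helper, not something the notebook defines.

```python
def format_outline(headings, indent="  "):
    """Indent each heading by its level: h1 flush left, h2 one step in, and so on."""
    lines = []
    for name, text in headings:
        level = int(name[1])  # 'h3' -> 3
        lines.append(indent * (level - 1) + text)
    return "\n".join(lines)


# A few (tag name, text) pairs taken from the output above
headings = [
    ("h1", "Mahatma Gandhi"),
    ("h2", "Biography"),
    ("h3", "Early life and background"),
    ("h3", "Three years in London"),
    ("h4", "Student of law"),
]
print(format_outline(headings))
```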

**3. For extracting ALT tags (image alt attributes)**

In [31]:
from bs4 import BeautifulSoup
import urllib.request as ur

url_input = input("Enter url :")
print("The website link that you entered is:", url_input)

def alt_tag():
    url = ur.urlopen(url_input)
    htmlSource = url.read()
    url.close()
    soup = BeautifulSoup(htmlSource, "html.parser")
    print('\n The alt tag along with the text in the web page')
    # alt=True keeps only the <img> tags that carry an alt attribute
    print(soup.find_all('img', alt=True))


#------------- Main ---------------#
if __name__ == '__main__':
    alt_tag()


Enter url :https://en.wikipedia.org/wiki/Mahatma_Gandhi
The website link that you entered is: https://en.wikipedia.org/wiki/Mahatma_Gandhi

 The alt tag along with the text in the web page
[<img alt="This is a good article. Click here for more information." data-file-height="185" data-file-width="180" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/19px-Symbol_support_vote.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/29px-Symbol_support_vote.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/39px-Symbol_support_vote.svg.png 2x" width="19"/>, <img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-pr
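
The output above prints whole `<img>` tags, which makes the alt text itself hard to pick out. The sketch below collects only `(src, alt)` pairs. To stay dependency-free it uses the standard library's `html.parser.HTMLParser` rather than Beautiful Soup, so it is an alternative technique and not the notebook's own method. `AltCollector`, `collect_alt_texts`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collect (src, alt) pairs for every <img> tag that has an alt attribute."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        if "alt" in attributes:
            # A bare `alt` with no value is reported as None; treat it as empty text
            self.images.append((attributes.get("src"), attributes["alt"] or ""))


def collect_alt_texts(html):
    parser = AltCollector()
    parser.feed(html)
    return parser.images


# Hypothetical markup standing in for a downloaded page
sample_html = '<img src="a.png" alt="Logo"><img src="b.png"><img src="c.png" alt=""/>'
for src, alt in collect_alt_texts(sample_html):
    print(f"{src}: {alt!r}")
```

An empty alt text, as on `c.png` here, is worth reporting separately, since it usually marks an image as decorative.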

**4. For counting words inside a web page**

In [32]:
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# Getting content from the web page
r = requests.get("https://techoid.co/contact-us")
soup = BeautifulSoup(r.content, "html.parser")

# For getting words within paragraphs
text_paragraph = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))
count_paragraph = Counter((x.rstrip(punctuation).lower() for y in text_paragraph for x in y.split()))

# For getting words inside div tags
# (text in nested divs is counted once per enclosing div, so these counts are inflated)
text_div = (''.join(s.find_all(string=True)) for s in soup.find_all('div'))
count_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))

# Adding the two counters to get the word counts, ordered from most to least common
total = count_div + count_paragraph
list_most_common_words = total.most_common()

In [33]:
# For reviewing alt tags on separate lines
soup.find_all('img',alt= True)

[<img alt="techoid.co" data-lazy-src="https://techoid.co/wp-content/uploads/elementor/thumbs/techoid.co_-p1qnbeammunnhcxquhzsb110je32a8etpsdnmlsxjs.png" src="data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%200%200'%3E%3C/svg%3E" title="techoid.co"/>,
 <img alt="techoid.co" src="https://techoid.co/wp-content/uploads/elementor/thumbs/techoid.co_-p1qnbeammunnhcxquhzsb110je32a8etpsdnmlsxjs.png" title="techoid.co"/>,
 <img alt="techoid.co" data-lazy-src="https://techoid.co/wp-content/uploads/elementor/thumbs/techoid.co_-p1qnbeammunnhcxquhzsb110je32a8etpsdnmlsxjs.png" src="data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%200%200'%3E%3C/svg%3E" title="techoid.co"/>,
 <img alt="techoid.co" src="https://techoid.co/wp-content/uploads/elementor/thumbs/techoid.co_-p1qnbeammunnhcxquhzsb110je32a8etpsdnmlsxjs.png" title="techoid.co"/>,
 <img alt="" class="attachment-medium size-medium" data-lazy-sizes="(max-width: 273px) 100vw, 273px" dat

In [34]:
# Number of distinct words found on the webpage
len(total)

247

In [35]:
# List of common words
list_most_common_words

[('', 160),
 ('management', 136),
 ('to', 93),
 ('development', 85),
 ('the', 84),
 ('message', 72),
 ('us', 70),
 ('name', 67),
 ('our', 62),
 ('email', 56),
 ('your', 52),
 ('app', 51),
 ('design', 51),
 ('based', 51),
 ('get', 50),
 ('company', 48),
 ('a', 45),
 ('in', 45),
 ('you', 42),
 ('send', 37),
 ('project', 36),
 ('subscribe', 36),
 ('services', 35),
 ('usservicesweb', 34),
 ('developmentmobile', 34),
 ('developmentcms', 34),
 ('developmentdigital', 34),
 ('marketinggraphics', 34),
 ('designingui/ux', 34),
 ('servicescontent', 34),
 ('writingartificial', 34),
 ('intelligenceiot', 34),
 ('solutionsdatabase', 34),
 ('developmentit', 34),
 ('outsourcingsoftwaresasset', 34),
 ('tracking', 34),
 ('softwareinventory', 34),
 ('systemproject', 34),
 ('systementerprise', 34),
 ('resource', 34),
 ('planning', 34),
 ('(erp)employee', 34),
 ('system', 34),
 ('(ems)hospital', 34),
 ('systemcareerscontact', 34),
 ('7718', 34),
 ('307359', 34),
 ('+92', 34),
 ('no', 34),
 ('303', 34),
 ('b
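
Two artifacts stand out in the counts above. The empty string `''` appears because a token made only of punctuation becomes empty after `rstrip(punctuation)`. Glued words such as `'usservicesweb'` appear because adjacent text nodes are joined with no separator. The sketch below shows one possible alternative that tokenizes with a regular expression. `count_words` and the sample sentence are hypothetical, and the pattern keeps only ASCII letters and digits.

```python
import re
from collections import Counter


def count_words(text):
    """Count lowercase alphanumeric tokens; punctuation never becomes a token."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical text; in the notebook it could come from soup.get_text(separator=" ")
sample = "Contact us - send us a message. Message sent!"
counts = count_words(sample)

print(counts.most_common(3))
print("distinct words:", len(counts))
print("total words:", sum(counts.values()))
```

`len(counts)` gives the number of distinct words, while `sum(counts.values())` gives the total number of words, so the two figures answer different questions.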

**5. For inspecting broken links inside a webpage**

A page that is available answers with the HTTP status code 200. A page that cannot be found answers with the status code 404.

In [36]:
# Importing libraries
from bs4 import BeautifulSoup, SoupStrainer
import requests

# Getting URL from user
url = input("Enter your url: ")

def broken_page():
    # Request the URL entered by the user
    user_req_page = requests.get(url)

    # Status code of the entered URL itself
    response_code = str(user_req_page.status_code)

    # Body of the response as a string
    data = user_req_page.text

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(data, "html.parser")

    # List every link on the page. The status code shown is that of the entered URL
    # (200 if the page is available, 404 for PAGE NOT FOUND), not of each individual link.
    for link in soup.find_all('a'):
        print(f"Url: {link.get('href')} " + f"| Status Code: {response_code}")


#----- NOTE ------#
# To see a 404 (page not found) status code, enter the link below as the input URL:
#https://roine.github.com/p1

#------------- Main ---------------#
if __name__ == '__main__':
  broken_page()

Enter your url: https://en.wikipedia.org/wiki/Mahatma_Gandhi
Url: None | Status Code: 200
Url: /wiki/Wikipedia:Good_articles | Status Code: 200
Url: /wiki/Wikipedia:Protection_policy#semi | Status Code: 200
Url: #mw-head | Status Code: 200
Url: #searchInput | Status Code: 200
Url: /wiki/Gandhi_(disambiguation) | Status Code: 200
Url: /wiki/Mah%C4%81tm%C4%81 | Status Code: 200
Url: /wiki/File:Mahatma-Gandhi,_studio,_1931.jpg | Status Code: 200
Url: /wiki/Porbandar_State | Status Code: 200
Url: /wiki/Kathiawar_Agency | Status Code: 200
Url: /wiki/British_Raj | Status Code: 200
Url: /wiki/Dominion_of_India | Status Code: 200
Url: /wiki/Assassination_of_Mahatma_Gandhi | Status Code: 200
Url: /wiki/Raj_Ghat_and_associated_memorials | Status Code: 200
Url: /wiki/Gandhi_Smriti | Status Code: 200
Url: /wiki/British_Raj | Status Code: 200
Url: /wiki/Dominion_of_India | Status Code: 200
Url: /wiki/Alfred_High_School_(Rajkot) | Status Code: 200
Url: /wiki/Rajkot | Status Code: 200
Url: /wiki/Sama

Url: https://archive.org/details/greatpartitionma00khan | Status Code: 200
Url: /wiki/Yale_University_Press | Status Code: 200
Url: https://archive.org/details/greatpartitionma00khan/page/n30 | Status Code: 200
Url: /wiki/ISBN_(identifier) | Status Code: 200
Url: /wiki/Special:BookSources/978-0-300-12078-3 | Status Code: 200
Url: #cite_ref-Brown1991-p380_13-0 | Status Code: 200
Url: #cite_ref-Brown1991-p380_13-1 | Status Code: 200
Url: #cite_ref-Brown1991-p380_13-2 | Status Code: 200
Url: #Brown1991 | Status Code: 200
Url: #cite_ref-talbot-singh-delhi_14-0 | Status Code: 200
Url: https://books.google.com/books?id=utKmPQAACAAJ&pg=PA118 | Status Code: 200
Url: /wiki/ISBN_(identifier) | Status Code: 200
Url: /wiki/Special:BookSources/978-0-521-85661-4 | Status Code: 200
Url: #cite_ref-CushRobinson2008_15-0 | Status Code: 200
Url: https://books.google.com/books?id=i_T0HeWE-EAC&pg=PA544 | Status Code: 200
Url: /wiki/ISBN_(identifier) | Status Code: 200
Url: /wiki/Special:BookSources/978-0-7

Url: /wiki/Special:BookSources/978-1-134-92796-8 | Status Code: 200
Url: #cite_ref-Watson_257-0 | Status Code: 200
Url: #cite_ref-richards1_258-0 | Status Code: 200
Url: #cite_ref-richards1_258-1 | Status Code: 200
Url: https://www.jstor.org/stable/20006253 | Status Code: 200
Url: #cite_ref-Parel2006_259-0 | Status Code: 200
Url: https://books.google.com/books?id=MQhz0fW0HZUC&pg=PA195 | Status Code: 200
Url: /wiki/Cambridge_University_Press | Status Code: 200
Url: /wiki/ISBN_(identifier) | Status Code: 200
Url: /wiki/Special:BookSources/978-0-521-86715-3 | Status Code: 200
Url: #cite_ref-gier40_260-0 | Status Code: 200
Url: https://books.google.com/books?id=tVLt99uleLwC&pg=PA40 | Status Code: 200
Url: /wiki/State_University_of_New_York_Press | Status Code: 200
Url: /wiki/ISBN_(identifier) | Status Code: 200
Url: /wiki/Special:BookSources/978-0-7914-5949-2 | Status Code: 200
Url: #cite_ref-261 | Status Code: 200
Url: https://www.britannica.com/event/Salt-March | Status Code: 200
Url: ht

Url: /wiki/Special:BookSources/978-81-8475-317-2 | Status Code: 200
Url: /wiki/Ramachandra_Guha | Status Code: 200
Url: /wiki/Gandhi_Before_India | Status Code: 200
Url: /wiki/Penguin_Books_Limited | Status Code: 200
Url: /wiki/ISBN_(identifier) | Status Code: 200
Url: /wiki/Special:BookSources/978-93-5118-322-8 | Status Code: 200
Url: https://archive.org/details/gandhireadersou00gand | Status Code: 200
Url: /wiki/ISBN_(identifier) | Status Code: 200
Url: /wiki/Special:BookSources/978-0-8021-3161-4 | Status Code: 200
Url: https://books.google.com/books?id=dRQcKsx-YgQC | Status Code: 200
Url: /wiki/ISBN_(identifier) | Status Code: 200
Url: /wiki/Special:BookSources/978-0-7391-1143-7 | Status Code: 200
Url: https://books.google.com/books?id=svxDMQZ7fakC&pg=PA7 | Status Code: 200
Url: /wiki/ISBN_(identifier) | Status Code: 200
Url: /wiki/Special:BookSources/978-1-4381-0662-5 | Status Code: 200
Url: https://books.google.com/books?id=oc47gUOPZfcC | Status Code: 200
Url: /wiki/Cambridge_Univ

Url: /wiki/Conservatism | Status Code: 200
Url: /wiki/Contractualism | Status Code: 200
Url: /wiki/Cosmopolitanism | Status Code: 200
Url: /wiki/Culturalism | Status Code: 200
Url: /wiki/Elitism | Status Code: 200
Url: /wiki/Fascism | Status Code: 200
Url: /wiki/Feminist_political_theory | Status Code: 200
Url: /wiki/Gandhism | Status Code: 200
Url: /wiki/Hindu_nationalism | Status Code: 200
Url: /wiki/Hindutva | Status Code: 200
Url: /wiki/Individualism | Status Code: 200
Url: /wiki/Political_aspects_of_Islam | Status Code: 200
Url: /wiki/Islamism | Status Code: 200
Url: /wiki/Legalism_(Chinese_philosophy) | Status Code: 200
Url: /wiki/Liberalism | Status Code: 200
Url: /wiki/Libertarianism | Status Code: 200
Url: /wiki/Mohism | Status Code: 200
Url: /wiki/National_liberalism | Status Code: 200
Url: /wiki/Republicanism | Status Code: 200
Url: /wiki/Social_constructionism | Status Code: 200
Url: /wiki/Social_constructivism | Status Code: 200
Url: /wiki/Social_Darwinism | Status Code: 2

Url: https://id.ndl.go.jp/auth/ndlna/00440485 | Status Code: 200
Url: https://aleph.nkp.cz/F/?func=find-c&local_base=aut&ccl_term=ica=jn20000601721&CON_LNG=ENG | Status Code: 200
Url: https://nla.gov.au/anbd.aut-an35111345 | Status Code: 200
Url: https://data.nlg.gr/resource/authority/record152363 | Status Code: 200
Url: https://data.nlg.gr/resource/authority/record75266 | Status Code: 200
Url: https://librarian.nl.go.kr/LI/contents/L20101000000.do?id=KAC199609638 | Status Code: 200
Url: http://katalog.nsk.hr/F/?func=direct&doc_number=000033749&local_base=nsk10 | Status Code: 200
Url: http://data.bibliotheken.nl/id/thes/p068712030 | Status Code: 200
Url: http://data.bibliotheken.nl/id/thes/p352073632 | Status Code: 200
Url: http://mak.bn.org.pl/cgi-bin/KHW/makwww.exe?BM=1&NU=1&IM=4&WI=9810675952005606 | Status Code: 200
Url: https://libris.kb.se/auth/275000 | Status Code: 200
Url: https://opac.vatlib.it/auth/detail/495_79916 | Status Code: 200
Url: https://collections.tepapa.govt.nz/ag
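
Every line of the output above shows the same status code because the code is read once, from the page that was entered. To find links that are actually broken, each link needs its own request. The sketch below is one way this might look. The helper names (`resolve_links`, `fetch_status`, `find_broken_links`) and the stand-in status codes are hypothetical. `fetch_status` assumes the `requests` library is installed and is not called in the offline demonstration.

```python
from urllib.parse import urljoin, urldefrag


def resolve_links(base_url, hrefs):
    """Turn raw href values into distinct absolute URLs, keeping their order."""
    seen = set()
    resolved = []
    for href in hrefs:
        # Skip <a> tags without href, in-page anchors, and non-HTTP schemes
        if not href or href.startswith(("#", "mailto:", "javascript:")):
            continue
        absolute = urldefrag(urljoin(base_url, href)).url
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails."""
    import requests  # imported here so the other helpers work without it

    try:
        return requests.head(url, allow_redirects=True, timeout=timeout).status_code
    except requests.RequestException:
        return None


def find_broken_links(base_url, hrefs, get_status=fetch_status):
    """Return (url, status) pairs for links that fail or answer with 4xx/5xx."""
    broken = []
    for url in resolve_links(base_url, hrefs):
        status = get_status(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken


# Offline demonstration: made-up status codes replace real network requests
base = "https://en.wikipedia.org/wiki/Mahatma_Gandhi"
hrefs = [None, "#mw-head", "/wiki/Rajkot", "/wiki/Rajkot#History", "https://example.org/missing"]
fake_statuses = {"https://en.wikipedia.org/wiki/Rajkot": 200, "https://example.org/missing": 404}
print(find_broken_links(base, hrefs, get_status=fake_statuses.get))
```

In the notebook, `hrefs` could be built as `[link.get('href') for link in soup.find_all('a')]`. Some servers do not answer `HEAD` requests properly, so falling back to `requests.get` may be needed in practice.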

**6. For getting the source code of the webpage**

Here, Selenium's 'page_source' attribute is used to retrieve the source of the webpage that the browser currently has loaded.

*NOTE: (Page source: the page source, or source code, is the HTML markup behind a webpage)*

In [37]:
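# install chromium, its driver, and selenium
# (the apt commands assume a Debian-based environment such as Google Colab)
!apt update
!apt install chromium-chromedriver
!pip install selenium

# set options to run Chrome headless
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# open a page in headless Chrome and print its source
# (the URL is the example page used elsewhere in this notebook)
driver = webdriver.Chrome(options=options)
driver.get("https://en.wikipedia.org/wiki/Mahatma_Gandhi")
print(driver.page_source)
driver.quit()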

'apt' is not recognized as an internal or external command,
operable program or batch file.
'apt' is not recognized as an internal or external command,
operable program or batch file.




**7. Extraction of all URLs from a website without duplication**

In [41]:
#---- Importing libraries ----#
import re
import requests
from bs4 import BeautifulSoup

all_links = set() #------ Creating a unique set of links ------#

# The "={}" suffix produces titles such as "Mahatma_Gandhi=0"; replace the pattern
# with the real page URLs of the site being scraped.
for i in range(7):
    r = requests.get("https://en.wikipedia.org/wiki/Mahatma_Gandhi={}".format(i))
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.find_all("a", href=re.compile('/')):
        link = link.get('href')
        #----- A set holds each value only once, so a link is printed only the first time it is seen ------#
        if link not in all_links:
            print(link)
        all_links.add(link)

/wiki/Special:SiteMatrix
/wiki/File:Wiktionary-logo-v2.svg
https://en.wiktionary.org/wiki/Special:Search/Mahatma_Gandhi%3D0
/wiki/File:Wikibooks-logo.svg
https://en.wikibooks.org/wiki/Special:Search/Mahatma_Gandhi%3D0
/wiki/File:Wikiquote-logo.svg
https://en.wikiquote.org/wiki/Special:Search/Mahatma_Gandhi%3D0
/wiki/File:Wikisource-logo.svg
https://en.wikisource.org/wiki/Special:Search/Mahatma_Gandhi%3D0
/wiki/File:Wikiversity_logo_2017.svg
https://en.wikiversity.org/wiki/Special:Search/Mahatma_Gandhi%3D0
/wiki/File:Commons-logo.svg
https://commons.wikimedia.org/wiki/Special:Search/Mahatma_Gandhi%3D0
/wiki/File:Wikivoyage-Logo-v3-icon.svg
https://en.wikivoyage.org/wiki/Special:Search/Mahatma_Gandhi%3D0
/wiki/File:Wikinews-logo.svg
https://en.wikinews.org/wiki/Special:Search/Mahatma_Gandhi%3D0
/wiki/File:Wikidata-logo.svg
https://www.wikidata.org/wiki/Special:Search/Mahatma_Gandhi%3D0
/wiki/File:Wikispecies-logo.svg
https://species.wikimedia.org/wiki/Special:Search/Mahatma_Gandhi%3D0
ht

https://en.wiktionary.org/wiki/Special:Search/Mahatma_Gandhi%3D5
https://en.wikibooks.org/wiki/Special:Search/Mahatma_Gandhi%3D5
https://en.wikiquote.org/wiki/Special:Search/Mahatma_Gandhi%3D5
https://en.wikisource.org/wiki/Special:Search/Mahatma_Gandhi%3D5
https://en.wikiversity.org/wiki/Special:Search/Mahatma_Gandhi%3D5
https://commons.wikimedia.org/wiki/Special:Search/Mahatma_Gandhi%3D5
https://en.wikivoyage.org/wiki/Special:Search/Mahatma_Gandhi%3D5
https://en.wikinews.org/wiki/Special:Search/Mahatma_Gandhi%3D5
https://www.wikidata.org/wiki/Special:Search/Mahatma_Gandhi%3D5
https://species.wikimedia.org/wiki/Special:Search/Mahatma_Gandhi%3D5
https://en.wikipedia.org/w/index.php?search=Mahatma+Gandhi%3D5&title=Special%3ASearch&fulltext=1
https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Mahatma+Gandhi%26%2361%3B5
https://en.wikipedia.org/w/index.php?search=Mahatma+Gandhi%3D5&title=Special%3ASearch&fulltext=1&ns0=1
/wiki/Special:WhatLinksHere/Mahatma_Gandhi%3D5
/w
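
Collecting all URLs of a website usually means following links from page to page and remembering which URLs have already been seen. The sketch below shows that idea as a breadth-first crawl restricted to the starting host. The `crawl` function, its `get_links` parameter, and the in-memory `site` dictionary are hypothetical. In the notebook, `get_links` could download a page and return the `href` values of its `<a>` tags.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse


def crawl(start_url, get_links, max_pages=50):
    """Breadth-first crawl of one host; returns every distinct URL discovered."""
    host = urlparse(start_url).netloc
    seen = {start_url}          # the set guarantees that no URL is recorded twice
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        page = queue.popleft()
        visited += 1
        for href in get_links(page):
            # Resolve relative links and drop #fragments before comparing
            absolute = urldefrag(urljoin(page, href)).url
            if urlparse(absolute).netloc != host or absolute in seen:
                continue
            seen.add(absolute)
            queue.append(absolute)
    return sorted(seen)


# Hypothetical in-memory "website" so the sketch runs without network access
site = {
    "https://example.org/": ["/a", "/b", "https://other.example/x"],
    "https://example.org/a": ["/b", "/a#top", "/c"],
    "https://example.org/b": ["/"],
    "https://example.org/c": [],
}
for url in crawl("https://example.org/", lambda page: site.get(page, [])):
    print(url)
```

The `max_pages` limit caps how many pages are fetched, so the result may include URLs that were discovered but never visited.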