# **Web Scraping using BeautifulSoup**

BeautifulSoup is a Python library used for extracting data from HTML and XML files. It creates a parse tree from the page's source code, which makes it easy to navigate and search the data.



# **Installing BeautifulSoup and Requests**


Before using BeautifulSoup, install it along with the requests library, which fetches web pages over HTTP.

In [1]:
!pip install beautifulsoup4
!pip install requests



In [2]:
import requests
from bs4 import BeautifulSoup as bs

# **Basic Workflow of Web Scraping**

1.   Fetch the web page using requests.
2.   Parse the page using BeautifulSoup.
3.   Extract specific data by navigating the parsed HTML.
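
The three steps above can be sketched as two small functions. This is a minimal illustration, not the notebook's own code: the names `fetch_html` and `extract_title` are hypothetical, and the inline HTML string stands in for a downloaded page so the parsing step can be tried without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Step 1: fetch the page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_title(html):
    """Steps 2 and 3: parse the HTML and pull out the <title> text."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


# Stand-in for a downloaded page, so no request is made here.
sample_html = "<html><head><title>About</title></head><body></body></html>"
print(extract_title(sample_html))
```

With a real page, the two helpers would be chained as `extract_title(fetch_html(url))`.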




In [22]:
link = 'https://jamil226.github.io/CUI-Web/Navbar/about.html'

In [29]:
# Create headers with a user-agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

In [30]:
# Send the request with headers; the timeout keeps the call from waiting indefinitely
response = requests.get(link, headers=headers, timeout=10)

In [31]:

if response.status_code == 200:
    soup = bs(response.text, 'html.parser')
    print(soup.prettify())
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")


<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   About | Comsats University Sahiwal Campus
  </title>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="assets/images/favicon/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
  <link href="assets/images/favicon/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
  <link href="assets/images/favicon/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
  <link href="assets/images/favicon/site.webmanifest" rel="manifest"/>
  <link href="assets/css/style2.css" rel="stylesheet"/>
 </head>
 <body>
  <header>
   <div class="nav-container">
    <nav>
     <ul class="navbar">
      <li class="navbar-item">
       <a class="navbar-item-link" href="index.html">
        Home
       </a>
      </li>
      <li class="navbar-item">
       <a class="navbar-item-link" href="about.html">
        Abo

In [None]:
list(soup.children)

['html',
 '\n',
 <html lang="en">
 <head>
 <title>About | Comsats University Sahiwal Campus</title>
 <meta charset="utf-8"/>
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
 <link href="assets/images/favicon/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
 <link href="assets/images/favicon/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
 <link href="assets/images/favicon/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
 <link href="assets/images/favicon/site.webmanifest" rel="manifest"/>
 <link href="assets/css/style2.css" rel="stylesheet"/>
 </head>
 <body>
 <header>
 <div class="nav-container">
 <nav>
 <ul class="navbar">
 <li class="navbar-item">
 <a class="navbar-item-link" href="index.html">Home</a>
 </li>
 <li class="navbar-item">
 <a class="navbar-item-link" href="about.html">About</a>
 </li>
 <li class="navbar-item">
 <a class="navbar-item-link" href="cam

In [33]:
list(soup.children)[0]

'html'

In [37]:
list(soup.children)[1]

'\n'
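
The first two children are not tags: the string `'html'` is the document's doctype declaration, and `'\n'` is the newline that follows it. A small sketch on an inline stand-in document (not the real page) shows the node types involved and one way to keep only the tags, so that positions such as `[2]` need not be hard-coded.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Stand-in document with the same top-level layout as the page above.
sample_html = "<!DOCTYPE html>\n<html><head><title>About</title></head></html>"
soup = BeautifulSoup(sample_html, "html.parser")

# Each top-level child is a Doctype, a NavigableString, or a Tag.
child_types = [type(child).__name__ for child in soup.children]
print(child_types)

# Keep only real tags, skipping the doctype and whitespace strings.
tags = [child for child in soup.children if isinstance(child, Tag)]
print([tag.name for tag in tags])
```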

In [39]:
html = list(soup.children)[2]

In [40]:
list(html.children)

['\n',
 <head>
 <title>About | Comsats University Sahiwal Campus</title>
 <meta charset="utf-8"/>
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
 <link href="assets/images/favicon/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
 <link href="assets/images/favicon/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
 <link href="assets/images/favicon/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
 <link href="assets/images/favicon/site.webmanifest" rel="manifest"/>
 <link href="assets/css/style2.css" rel="stylesheet"/>
 </head>,
 '\n',
 <body>
 <header>
 <div class="nav-container">
 <nav>
 <ul class="navbar">
 <li class="navbar-item">
 <a class="navbar-item-link" href="index.html">Home</a>
 </li>
 <li class="navbar-item">
 <a class="navbar-item-link" href="about.html">About</a>
 </li>
 <li class="navbar-item">
 <a class="navbar-item-link" href="campus-life.html">Camp

In [42]:
head = list(html.children)[1]

In [44]:
list(head.children)[1]

<title>About | Comsats University Sahiwal Campus</title>

In [45]:
title = list(head.children)[1]

In [46]:
data = title.get_text()

In [47]:
print(data)

About | Comsats University Sahiwal Campus
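
Walking `children` by index works, but it depends on the exact whitespace and tag order of the page. As a hedged alternative, BeautifulSoup also exposes the first matching tag as an attribute (`soup.title`) and through `find()`. The sketch uses an inline stand-in document rather than the live page.

```python
from bs4 import BeautifulSoup

sample_html = """<!DOCTYPE html>
<html>
<head>
<title>About | Comsats University Sahiwal Campus</title>
</head>
</html>"""
soup = BeautifulSoup(sample_html, "html.parser")

# Attribute access returns the first <title> tag anywhere in the tree.
title_via_attribute = soup.title.get_text()

# find() does the same lookup by tag name and returns None if nothing matches.
title_via_find = soup.find("title").get_text()

print(title_via_attribute)
```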


In [48]:
soup.find_all('p')

[<p>
                 COMSATS University Islamabad, Sahiwal campus is situated half-way between Lahore and Multan on COMSATS Road off G.T Road Sahiwal, was formally inaugurated on September 23, 2006. The campus is purpose built and is spread over area of 36 acres on land.
             </p>,
 <p>
                 CUI Sahiwal is committed to provide state of the art training and education to our students and prepare them for successful career in their respective fields. Our mission is to encourage learning and promote research activities in order to facilitate our students to fulfill their aims and aspirations objectively.
             </p>,
 <p>
                 We are offering a broad range of programs, especially in the areas of Management Sciences, Computer Science, Biosciences, Engineering and Humanities. Our programs are focused on creating crucially needed transitions toward sustainable practice. CUI Sahiwal strives to prepare a knowledge workforce which is comparable to the best 

In [51]:
paragraph_two_data = soup.find_all('p')[1].get_text()

In [52]:
print(paragraph_two_data)


                CUI Sahiwal is committed to provide state of the art training and education to our students and prepare them for successful career in their respective fields. Our mission is to encourage learning and promote research activities in order to facilitate our students to fulfill their aims and aspirations objectively.
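
The blank lines and indentation around the printed paragraph come from the whitespace inside the `<p>` tag. One way to drop them, sketched here on a shortened stand-in paragraph, is to pass `strip=True` to `get_text()`.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page's second paragraph, whitespace included.
sample_html = "<p>\n        CUI Sahiwal is committed to its students.\n    </p>"
paragraph = BeautifulSoup(sample_html, "html.parser").find("p")

raw_text = paragraph.get_text()              # keeps the newlines and indentation
clean_text = paragraph.get_text(strip=True)  # strips whitespace around the text

print(repr(raw_text))
print(repr(clean_text))
```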
            


In [53]:
navbar_item_content = soup.find_all('li', class_='navbar-item')
print(navbar_item_content)

[<li class="navbar-item">
<a class="navbar-item-link" href="index.html">Home</a>
</li>, <li class="navbar-item">
<a class="navbar-item-link" href="about.html">About</a>
</li>, <li class="navbar-item">
<a class="navbar-item-link" href="campus-life.html">Campus Life</a>
</li>, <li class="navbar-item">
<a class="navbar-item-link" href="faqs.html">FAQs</a>
</li>, <li class="navbar-item">
<a class="navbar-item-link" href="contact.html">Contact</a>
</li>]
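
Each list item wraps a single link, so the label and target of every navigation entry can be read from the nested `<a>` tag. A minimal sketch, using a two-item stand-in for the navbar above (`nav_entries` is an illustrative name):

```python
from bs4 import BeautifulSoup

sample_html = """
<ul class="navbar">
  <li class="navbar-item"><a class="navbar-item-link" href="index.html">Home</a></li>
  <li class="navbar-item"><a class="navbar-item-link" href="about.html">About</a></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# class_ has a trailing underscore because "class" is a reserved word in Python.
nav_entries = []
for item in soup.find_all("li", class_="navbar-item"):
    anchor = item.find("a")  # first <a> inside this <li>
    nav_entries.append((anchor.get_text(strip=True), anchor["href"]))

print(nav_entries)
```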


# **Returning Elements by ID (Which May or May Not Be Unique)**

In [54]:
# The page uses class="navbar-item", not id="navbar-item", so this search matches nothing
navbar_item_content = soup.find_all('li', id='navbar-item')
print(navbar_item_content)

[]
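
No element on the page carries `navbar-item` as an `id`, which is why the list is empty. The sketch below uses a stand-in snippet with a hypothetical `id="main-nav"` (not present on the real page) to show what a successful ID lookup returns, and that `find()` gives `None` rather than an empty list when nothing matches.

```python
from bs4 import BeautifulSoup

# "main-nav" is a made-up id for illustration.
sample_html = '<ul id="main-nav"><li class="navbar-item">Home</li></ul>'
soup = BeautifulSoup(sample_html, "html.parser")

by_id = soup.find(id="main-nav")               # a single Tag; ids are meant to be unique
missing = soup.find(id="navbar-item")          # None: no element has this id
missing_all = soup.find_all(id="navbar-item")  # empty result for the same reason

print(by_id.name, missing, len(missing_all))
```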


# **Fetching Data with a CSS Selector**

In [60]:
selecting_h5_from_selector = soup.select("div h5")
print(selecting_h5_from_selector)

[<h5>All Rights Reserved © 2023 | Designed by CUI Sahiwal </h5>]


In [61]:
for h5 in selecting_h5_from_selector:
    print(h5.get_text(strip=True))

All Rights Reserved © 2023 | Designed by CUI Sahiwal
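
`select()` takes the same selector syntax as CSS, so `"div h5"` matches every `<h5>` nested anywhere inside a `<div>`. A short sketch on a stand-in snippet shows a few other common selector forms, plus `select_one()`, which returns only the first match (or `None`). The `footer` class here is an assumption for illustration, not taken from the page.

```python
from bs4 import BeautifulSoup

sample_html = """
<div class="footer">
  <h5>All Rights Reserved</h5>
  <a class="navbar-item-link" href="index.html">Home</a>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

descendant = soup.select("div h5")                  # <h5> anywhere inside a <div>
by_class = soup.select("a.navbar-item-link")        # <a> tags with this class
by_attribute = soup.select('a[href="index.html"]')  # match on an attribute value
first_h5 = soup.select_one("div.footer h5")         # first match only, or None

print(first_h5.get_text(strip=True))
```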


In [68]:
soup.find_all('a')

[<a class="navbar-item-link" href="index.html">Home</a>,
 <a class="navbar-item-link" href="about.html">About</a>,
 <a class="navbar-item-link" href="campus-life.html">Campus Life</a>,
 <a class="navbar-item-link" href="faqs.html">FAQs</a>,
 <a class="navbar-item-link" href="contact.html">Contact</a>]

In [65]:
hyper_references = soup.find_all('a')

In [67]:
links = []
# Loop with a new name so the page URL stored in `link` is not overwritten
for anchor in hyper_references:
    links.append(anchor.get('href'))
links

['index.html', 'about.html', 'campus-life.html', 'faqs.html', 'contact.html']
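
The `href` values are relative to the page they came from. If absolute URLs are needed, for example to fetch each linked page, the standard library's `urllib.parse.urljoin` can resolve them against the page URL. A minimal sketch, reusing the URL and the list printed above (`page_url`, `relative_links`, and `absolute_links` are illustrative names):

```python
from urllib.parse import urljoin

page_url = "https://jamil226.github.io/CUI-Web/Navbar/about.html"
relative_links = ["index.html", "about.html", "campus-life.html", "faqs.html", "contact.html"]

# urljoin resolves each relative href against the directory of page_url.
absolute_links = [urljoin(page_url, href) for href in relative_links]

for url in absolute_links:
    print(url)
```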