**Author**: Sushrina Dhakal <br>
**Date**: 18<sup>th</sup> September, 2024 <br>
**Symbiosis Solutions**
<br>
# Web Scraping with Python: A Comprehensive Guide

## 1. Introduction to Web Scraping

Web scraping is a technique used to extract data from websites. In this guide, we’ll use Python libraries such as requests and BeautifulSoup to fetch and parse web content.

### Why Use Web Scraping?

* To collect large datasets from websites
* To automate data extraction
* To analyze website information for research or business intelligence

## 2. Required Libraries

To start, you’ll need to install the following Python libraries:

* <strong> requests </strong>: Used to send HTTP requests to fetch web content.
* <strong> BeautifulSoup </strong>: Parses and extracts data from HTML or XML files.

In [23]:
!pip install requests beautifulsoup4 --quiet

## 3. Fetching Web Content Using requests <br>
We use the requests library to fetch the HTML content of a webpage.

In [24]:
import requests

url = "https://edusanjal.com/college/british-college/"

response = requests.get(url)

print("Response Status:", response.status_code)

Response Status: 200


* requests.get(url) fetches the web page content.
* response.status_code holds the HTTP status code of the response, which tells us whether the request was successful (200 indicates success).
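
To make the meaning of status codes more concrete, here is a minimal sketch that labels a code by its range. The helper name `describe_status` is hypothetical (it is not part of requests); it relies only on Python's standard `http.HTTPStatus` enum.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short label such as '200 OK (success)' for an HTTP status code."""
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational or unknown"

    # HTTPStatus raises ValueError for codes it does not know about.
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"

    return f"{code} {phrase} ({category})"


print(describe_status(200))
print(describe_status(404))
```

In the notebook, this could be called as `describe_status(response.status_code)` right after the request.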

## 4. Parsing HTML Content with BeautifulSoup <br>
Once the content is fetched, we can parse it using BeautifulSoup.

In [25]:
from bs4 import BeautifulSoup

html_content = BeautifulSoup(response.content, "html.parser")

# (optional, for checking structure)
# print(html_content.prettify())

* We use html.parser, Python's built-in parser, to parse the content. You can also use lxml or html5lib if needed, but both must be installed separately.
* prettify() formats the HTML in a readable way.
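
Parsing does not require a live request, which makes it easy to experiment. The sketch below parses a small hand-written HTML string (an assumption for illustration, not content from the college page) and shows what `prettify()` produces.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used here instead of a live request.
sample_html = "<html><body><h1>Sample College</h1><p>Welcome!</p></body></html>"

soup = BeautifulSoup(sample_html, "html.parser")

title_text = soup.h1.text      # text inside the first <h1>
paragraph_text = soup.p.text   # text inside the first <p>

print(title_text)
print(paragraph_text)

# prettify() returns the document as an indented string, one tag per line.
print(soup.prettify())
```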


## 5. Extracting Specific Data from HTML <br>
### 5.1 Finding a Specific Tag <br>
Suppose we want to extract the salient features of the college, which are inside a <div> tag with id='salient_features'.
<br>
**Tip: use your browser's Inspect tool (right-click an element, then choose Inspect) to find the tags that hold the data you want.**

* find() returns the first matching tag. In this case, it finds the <div> with a specific id.

In [26]:
# find() returns a single tag, so a singular name is clearer
salient_div = html_content.find('div', id='salient_features')

print(salient_div.text.strip())

Salient FeaturesFeatures of The British College
The British college encourages its students to challenge ideas where students get exposed into stirring environment of wide perspectives to help them think differently, have reasoning sensibility and learn in an effective manner. The students in this manner get toned with a extended version in the best possible skills in extracurricular activities as well,
TBC aims to inspire and guide its students through outstanding learning experience, as well as through partnership with national and international universities and organizations in presence of exceptional student facilities that ensures a wide range of opportunities, both academically and professionally.

Get value for money with a quality UK Degree at a local cost.
Gain the same prestigious UK degree at one-fifth of the price.
 Our overall quality (student experience, academic standards, and programme delivery) is underpinned by adherence to the UK Quality Assurance Agency’s Quality Co
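
Notice that the output begins with "Salient FeaturesFeatures of The British College": the `.text` attribute joins the text of nested tags with no separator. One possible remedy is `get_text()` with a `separator`. The sketch below uses a small hand-written HTML string as a stand-in for the real page, and also shows that `find()` returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page; the tags sit back to back,
# with no whitespace between them.
sample_html = (
    '<div id="salient_features">'
    "<h2>Salient Features</h2>"
    "<h2>Features of The Sample College</h2>"
    "<p>Quality degree at a local cost.</p>"
    "</div>"
)
soup = BeautifulSoup(sample_html, "html.parser")

section = soup.find("div", id="salient_features")
missing = soup.find("div", id="no_such_id")  # no match, so this is None

glued = section.text  # nested strings joined with no separator

# separator="\n" puts each piece of text on its own line;
# strip=True trims whitespace and skips empty pieces.
lines = section.get_text(separator="\n", strip=True).split("\n")

print(glued)
for line in lines:
    print(line)
print("Missing div found?", missing is not None)
```

Checking for `None` before reading `.text` avoids an `AttributeError` when the page layout changes and the id disappears.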

### 5.2 Extracting Headers (H2 Tags)
To extract all <h2> tags from the HTML content, we can use the find_all() method: <br>
* find_all() returns a list of all matching tags. We loop through them to print their contents.

In [27]:
h2_tags = html_content.find_all('h2')

for tag in h2_tags:
    print(tag.text.strip())

OFFERED PROGRAMS
Salient Features
Features of The British College
Admission Guidelines
Applying to The British College (TBC)
Postgraduate Programmes
Undergraduate Programmes:
Cambridge GCE A Level
ACCA
SCHOLARSHIP INFORMATION
More Information
News
Posts
Recognitions
Documents
Get Directions
NETWORK INSTITUTIONS
Gallery
ABOUT US
Message from the Executive Principal
Videos
INFORMATION
CLIENT PORTAL
LINKS
Contact
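
`find_all()` has a few variations that are handy when exploring a page. The sketch below, run on a hand-written HTML string rather than the live page, shows matching several tag names at once, limiting the number of results, and the empty result returned when nothing matches.

```python
from bs4 import BeautifulSoup

# Hand-written sample with an empty heading and some stray whitespace.
sample_html = """
<h2>Programs</h2>
<h3>Undergraduate</h3>
<h2>  Admission  </h2>
<h2></h2>
<h2>Contact</h2>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Text of every <h2>, with surrounding whitespace removed.
h2_texts = [tag.get_text(strip=True) for tag in soup.find_all("h2")]

# Drop headings that turned out to be empty.
non_empty = [text for text in h2_texts if text]

# A list of names matches any of them, in document order.
headings = [tag.get_text(strip=True) for tag in soup.find_all(["h2", "h3"])]

# limit stops the search after the given number of matches.
first_two = soup.find_all("h2", limit=2)

# No <table> in the sample, so the result is empty (not None).
no_tables = soup.find_all("table")

print(non_empty)
print(headings)
print(len(first_two), len(no_tables))
```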


## 6. Creating Functions for Reusability <br>
### 6.1 Function to Fetch HTML Content <br>
It is good practice to define functions for repeated tasks. Let’s create a function that fetches a page and parses its HTML content.

In [28]:
def fetch_html(url):
    """
    Fetches the HTML content of a given URL.

    Parameters:
    url (str): The URL of the webpage to scrape.

    Returns:
    BeautifulSoup: Parsed HTML content of the webpage.
    """
    response = requests.get(url, timeout=10)  # avoid hanging forever on an unresponsive server
    if response.status_code == 200:
        return BeautifulSoup(response.content, "html.parser")
    else:
        print(f"Failed to retrieve content. Status code: {response.status_code}")
        return None

* This function accepts a URL and returns the parsed HTML content. If the status code is not 200, it prints a message and returns None.
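
`fetch_html` does not handle the exceptions that `requests.get` can raise, such as connection errors or malformed URLs. A possible variant is sketched below; the name `fetch_html_safe` and the 10-second default timeout are assumptions for illustration, not part of the original notebook.

```python
from typing import Optional

import requests
from bs4 import BeautifulSoup


def fetch_html_safe(url: str, timeout: float = 10.0) -> Optional[BeautifulSoup]:
    """Return the parsed HTML for url, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx and 5xx responses into a requests.HTTPError.
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class for connection errors,
        # timeouts, invalid URLs, and HTTP errors raised above.
        print(f"Failed to retrieve {url!r}: {exc}")
        return None
    return BeautifulSoup(response.content, "html.parser")
```

Because every failure path returns None, the `if html_content:` check used later in this guide works unchanged with this variant.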

### 6.2 Function to Extract Specific div Content

In [29]:
def extract_div_content(html_content, div_id):
    """
    Extracts the text content from a specified div.

    Parameters:
    html_content (BeautifulSoup): Parsed HTML content.
    div_id (str): The id of the div to extract content from.

    Returns:
    str: The text content of the div.
    """
    div = html_content.find('div', id=div_id)
    return div.text.strip() if div else "Div not found"


### 6.3 Function to Extract h2 Tags


In [30]:
def extract_h2_tags(html_content):
    """
    Extracts the text of all h2 tags from the HTML.

    Parameters:
    html_content (BeautifulSoup): Parsed HTML content.

    Returns:
    list of str: The text of each h2 tag.
    """
    h2_tags = html_content.find_all('h2')
    return [tag.text.strip() for tag in h2_tags]
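
The same pattern works for any tag name. As a sketch, the hypothetical helper `extract_tag_texts` below generalizes `extract_h2_tags` by taking the tag name as a parameter and skipping tags with no text; it is demonstrated on a hand-written HTML string.

```python
from typing import List

from bs4 import BeautifulSoup


def extract_tag_texts(html_content: BeautifulSoup, tag_name: str) -> List[str]:
    """Return the non-empty text of every tag_name tag, in document order."""
    texts = []
    for tag in html_content.find_all(tag_name):
        text = tag.get_text(strip=True)
        if text:  # skip tags that contain no visible text
            texts.append(text)
    return texts


# Hand-written sample page used only for demonstration.
sample = BeautifulSoup(
    "<h2>News</h2><h3>Events</h3><h2> </h2><h2>Gallery</h2>",
    "html.parser",
)

h2_list = extract_tag_texts(sample, "h2")
h3_list = extract_tag_texts(sample, "h3")

print(h2_list)
print(h3_list)
```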

## 7. Combining Everything <br>
Now, let’s combine all the functions to scrape the webpage and extract both the salient features and headers.

In [31]:
url = "https://edusanjal.com/college/british-college/"

html_content = fetch_html(url)

if html_content:
    salient_features = extract_div_content(html_content, 'salient_features')
    print("Salient Features:")
    print(salient_features)

    h2_texts = extract_h2_tags(html_content)
    print("\nH2 Tags:")
    for text in h2_texts:
        print(text)


Salient Features:
Salient FeaturesFeatures of The British College
The British college encourages its students to challenge ideas where students get exposed into stirring environment of wide perspectives to help them think differently, have reasoning sensibility and learn in an effective manner. The students in this manner get toned with a extended version in the best possible skills in extracurricular activities as well,
TBC aims to inspire and guide its students through outstanding learning experience, as well as through partnership with national and international universities and organizations in presence of exceptional student facilities that ensures a wide range of opportunities, both academically and professionally.

Get value for money with a quality UK Degree at a local cost.
Gain the same prestigious UK degree at one-fifth of the price.
 Our overall quality (student experience, academic standards, and programme delivery) is underpinned by adherence to the UK Quality Assurance A

## Further Learning <br>
For more detail, see the official BeautifulSoup documentation: <br>
<a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">BeautifulSoup Documentation</a>