1. Install and import the necessary libraries:
   - `requests`: Sends HTTP requests to the webpage.
   - `BeautifulSoup`: Parses HTML content and navigates the resulting document tree. It is imported from the `bs4` package.

In [1]:
!pip install bs4

Collecting bs4
  Obtaining dependency information for bs4 from https://files.pythonhosted.org/packages/51/bb/bf7aab772a159614954d84aa832c129624ba6c32faa559dfb200a534e50b/bs4-0.0.2-py2.py3-none-any.whl.metadata
  Using cached bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Using cached bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
pip install requests


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


2. Define the target URL: The script specifies the URL of the Wikipedia Main Page.

3. Send an HTTP GET request to the URL using the `requests.get(url)` method and store the response in the `response` variable.

4. Check the response status using `print(response)`, which displays the status code (for example, `<Response [200]>` for a successful request). This is just to verify that the request was successful.

In [4]:
import requests

url = "https://en.wikipedia.org/wiki/Main_Page"

response = requests.get(url)

print(response)  # a successful request shows <Response [200]>

# print(response.text)  # uncomment to view the raw HTML
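The status check in step 4 can also be made explicit. The sketch below is an illustration rather than part of the original notebook: `describe_status` and `is_success` are hypothetical helpers that use the standard library's `http.HTTPStatus` to turn a numeric code, such as the value of `response.status_code`, into a readable label.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short label such as '200 OK' for an HTTP status code."""
    # Hypothetical helper for illustration; it is not part of requests.
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know.
        phrase = "Unknown status"
    return f"{code} {phrase}"


def is_success(code: int) -> bool:
    """Treat any 2xx code as a successful request."""
    return 200 <= code < 300


print(describe_status(200))
print(describe_status(404))
```

In the notebook, these could be called as `describe_status(response.status_code)` and `is_success(response.status_code)`.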

5. Create a BeautifulSoup object to parse the HTML content of the page. `html_content` stores the raw HTML text, and `soup` is the BeautifulSoup object.
6. Use `soup.prettify()` to make the HTML content more readable, although this step is optional.

In [7]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Main_Page"

response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, "html.parser")

soup.prettify()

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-appearance-disabled vector-feature-appearance-pinned-clientpref-0 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-not-available" dir="ltr" lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   Wikipedia, the free encyclopedia\n  </title>\n  <script>\n   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-mai
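Because the live page is large, the effect of steps 5 and 6 is easier to see on a small document. The following is a minimal sketch, assuming a made-up `sample_html` string in place of `response.text`; it is not the notebook's own code.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, standing in for response.text
sample_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello</p></body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a new string with each tag on its own line,
# indented according to its depth in the tree
pretty = soup.prettify()
print(pretty)

# The parsed tree can be navigated without prettifying it first
page_title = soup.title.string
print(page_title)
```

`prettify()` only changes how the markup is displayed; searching and extraction work on the `soup` object itself.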

7. Extract the section headings (stored as `featured_articles` in the code):
   - Use `soup.find_all("h2", class_="mp-h2")` to locate all `h2` elements with the specified class.
   - Iterate through the found elements, read each heading's text, and print it.

8. Extract the featured article text and the "Did You Know" items:
   - Use `soup.find("div", id="mp-tfa")` to locate today's featured article, then iterate through its paragraphs using `.find_all("p")` and print each one.
   - Use `soup.find("div", id="mp-dyk")` to locate the section containing "Did You Know" items, then iterate through the list items within this section using `.find_all("li")` and print each item's text.


In [17]:
# Example: extracting the headings with class "mp-h2"
# featured_articles = soup.find_all("h2", class_="mp-h2")

# for article in featured_articles:
#     title = article.find("span", class_="mw-headline").text
#     print(title)

# Example: extracting today's featured article (the div with id "mp-tfa")
featured_article = soup.find("div", id="mp-tfa")

for paragraph in featured_article.find_all("p"):
    print(paragraph.text)

The horned sungem (Heliactin bilophus) is a species of hummingbird native to Brazil, Bolivia and Suriname. It prefers open habitats such as savanna, grassland and garden, and expanded its range into southern Amazonas and Espírito Santo, probably due to deforestation. It is a small hummingbird with a long tail and a short, black bill. The sexes differ in appearance, with males having two shiny red, golden, and green feather "horns" above the eyes, a shiny blue head crest and a black throat with a pointed "beard". The female is plainer, with a brown or yellow–buff throat. It is a nomadic species, responding to the seasonal flowering of its food plants. If a flower's shape is unsuited to the bird's short bill, it may rob nectar through a hole at its base. It also eats small insects. Only the female builds the small cup nest, incubates the two white eggs, and rears the chicks. The species is currently classified as least concern, and its population is thought to be increasing. (Full articl
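The `find` and `find_all` pattern from steps 7 and 8 can be tried offline. This is a sketch under stated assumptions: `sample_html` is invented markup that only reuses the `mp-tfa` and `mp-dyk` ids, and the real page structure may differ or change over time.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; only the ids mirror the ones used in the notebook
sample_html = """
<div id="mp-tfa"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div id="mp-dyk"><ul><li>... that item one?</li><li>... that item two?</li></ul></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching element
tfa = soup.find("div", id="mp-tfa")
paragraphs = [p.get_text(strip=True) for p in tfa.find_all("p")]

# find_all() returns every match inside the element it is called on
dyk = soup.find("div", id="mp-dyk")
items = [li.get_text(strip=True) for li in dyk.find_all("li")]

# When nothing matches, find() returns None instead of raising
missing = soup.find("div", id="no-such-id")

print(paragraphs)
print(items)
print(missing)
```

Calling `.find_all()` or `.text` on that `None` result is what produces an `AttributeError`.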

9. Error Handling:
   - The request is wrapped in a try-except block. `response.raise_for_status()` raises `requests.exceptions.HTTPError` for 4xx and 5xx responses, and the except clause prints the error and exits.

10. Exception Handling:
   - A second try-except block, placed after the first, catches any `AttributeError` raised while parsing the HTML, for example when `soup.find()` returns `None` because an element is missing.


In [1]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Main_Page"

try:
    response = requests.get(url)
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print(e)
    exit()

html_content = response.text

try:
    soup = BeautifulSoup(html_content, "html.parser")
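    featured_articles = soup.find_all("h2", class_="mp-h2")
    for article in featured_articles:
        # Each match is already an <h2> element, so read its text directly;
        # article.find("h2") would return None and raise AttributeError.
        title = article.get_text(strip=True)
        print(title)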

    did_you_know = soup.find("div", id="mp-dyk")
    for item in did_you_know.find_all("li"):
        print(item.text)
except AttributeError as error:
    print(error)
    exit()

... that Prince Philip (pictured) was the first member of the British royal family to fly in a helicopter?
... that the 1910–1916 publication Raḥamim was the first newspaper in the Judeo-Tajik language?
... that football player Dick Harris was selected in professional drafts four times, including twice as a first-round pick, but never played professionally?
... that Pulp singer Jarvis Cocker helped fundraise to save a Merseyside flat that has been called "the first example of outsider art to be nationally listed"?
... that in 1911 the Butterfly Theater featured a pipe organ worth $10,000 (equivalent to $327,000 in 2023)?
... that environmental economist V. Kerry Smith has been described as a "Renaissance Man of Economics"?
... that a year after objecting to the unauthorised use of his own AI-generated vocals, Drake used vocals of other rappers generated that way to respond to a diss against him?
... that in 1919 nurse Hilda Hope McMaugh became the first Australian woman to qualify as a
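Catching only `HTTPError` leaves connection failures, timeouts, and malformed URLs unhandled, because those raise other subclasses of `requests.exceptions.RequestException`. The sketch below shows one possible variation on steps 9 and 10; `fetch_html` and `list_items` are hypothetical helper names, and the 10-second timeout is an assumed value.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Return the page HTML, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        # RequestException is the base class of HTTPError,
        # ConnectionError, Timeout, and the other requests errors.
        print(f"Request failed: {error}")
        return None
    return response.text


def list_items(html, section_id):
    """Return the text of every <li> in a section, or [] if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("div", id=section_id)
    if section is None:
        # Checking for None avoids the AttributeError altogether.
        return []
    return [li.get_text(strip=True) for li in section.find_all("li")]


# Offline check: the section is missing, so an empty list is printed
print(list_items("<p>no sections here</p>", "mp-dyk"))
```

Returning `None` or an empty list lets the caller decide what to do next, instead of stopping the whole notebook with `exit()`.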

11. Saving Data:
   - The extracted data, including featured article titles and "Did You Know" items, is stored in a Python dictionary called `data`.
   - This data is then saved as a JSON file named "data.json" using the `json.dump(data, outfile)` method.
   
12. The script includes proper error handling and exit statements to gracefully handle HTTP errors and attribute errors during the web scraping process.

In [2]:
import requests
from bs4 import BeautifulSoup
import json

url = "https://en.wikipedia.org/wiki/Main_Page"

try:
    response = requests.get(url)
    response.raise_for_status()
except requests.exceptions.HTTPError as error:
    print(error)
    exit()

html_content = response.text

try:
    soup = BeautifulSoup(html_content, "html.parser")
    featured_articles = soup.find_all("h2", class_="mp-h2")
    article_titles = [
        article.find("span", class_="mw-headline").text
        for article in featured_articles
    ]

    did_you_know = soup.find("div", id="mp-dyk")
    items = [item.text for item in did_you_know.find_all("li")]

    data = {"featured_articles": article_titles, "did_you_know": items}

    with open("data.json", "w") as outfile:
        json.dump(data, outfile)

except AttributeError as error:
    print(error)
    exit()
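The saving step can be checked in isolation by writing a small dictionary and reading it back. This is a minimal sketch, assuming sample data and a temporary directory; `ensure_ascii=False` and `indent=2` are optional `json.dump` arguments that the notebook itself does not use.

```python
import json
import os
import tempfile

# Assumed sample data with the same shape as the `data` dictionary
data = {
    "featured_articles": ["Example heading"],
    "did_you_know": [
        "... that the 1910–1916 publication Raḥamim was the first "
        "newspaper in the Judeo-Tajik language?"
    ],
}

# Write into a temporary directory so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "data.json")

# ensure_ascii=False keeps characters such as "ḥ" readable in the file;
# indent=2 pretty-prints the JSON
with open(path, "w", encoding="utf-8") as outfile:
    json.dump(data, outfile, ensure_ascii=False, indent=2)

# Read the file back to confirm the round trip
with open(path, encoding="utf-8") as infile:
    loaded = json.load(infile)

print(loaded == data)
```

With the default `ensure_ascii=True`, the same data still round-trips correctly, but non-ASCII characters appear as `\u` escapes in the file.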

Ethical web scraping involves responsible and respectful behavior towards websites, their data, and their users. It's crucial to strike a balance between extracting valuable information and respecting the rights and wishes of website owners and data subjects.
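One concrete way to respect a site owner's wishes is to consult the site's `robots.txt` rules before requesting a page. The sketch below uses the standard library's `urllib.robotparser`; the rules, the `example.org` URLs, and the `demo-scraper` user agent are all made-up assumptions, and a real scraper would load the target site's own file instead.

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules; a real site publishes its own at /robots.txt
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = "demo-scraper"  # hypothetical user agent name

# can_fetch() reports whether the rules permit this agent to request a URL
allowed = parser.can_fetch(user_agent, "https://example.org/wiki/Main_Page")
blocked = parser.can_fetch(user_agent, "https://example.org/private/report")

# crawl_delay() returns the requested pause between requests, in seconds
delay = parser.crawl_delay(user_agent)

print(allowed, blocked, delay)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download the rules instead of parsing an inline list.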