<a href="https://colab.research.google.com/github/wambui-nduta/nduts/blob/main/Web_Scrapping_Checkpoint.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

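def get_soup(url):
    # Placeholder User-Agent (an assumption, replace with your own); Wikimedia asks bots to identify themselves.
    headers = {"User-Agent": "web-scraping-checkpoint/0.1 (educational notebook)"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return BeautifulSoup(response.text, 'html.parser')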

def extract_title(soup):
    title = soup.find('h1', {'id': 'firstHeading'})
    return title.text.strip() if title else "Title Not Found"

def extract_text_with_headings(soup):
    """Group paragraph text under the nearest preceding <h2> heading."""
    content = soup.find('div', {'id': 'mw-content-text'})
    data = {}
    if content is None:  # page has no MediaWiki content container
        return data
    current_heading = "Introduction"  # paragraphs that appear before the first <h2>

    for tag in content.find_all(['h2', 'p']):
        text = tag.get_text().strip()
        if tag.name == 'h2':
            # Some MediaWiki markup embeds an "[edit]" link inside the heading text.
            current_heading = text.replace("[edit]", "").strip()
        elif text:
            data.setdefault(current_heading, []).append(text)

    return data

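def extract_internal_links(soup, base_url):
    links = set()  # a set removes duplicate links
    for a in soup.find_all('a', href=True):
        href = a['href']
        # Keep article links only; a colon marks namespaces such as File: or Help:.
        if href.startswith('/wiki/') and ':' not in href:
            links.add(urljoin(base_url, href))
    return sorted(links)  # sorted so the output order is reproducible between runs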

def scrape_wikipedia(url):
    soup = get_soup(url)
    title = extract_title(soup)
    text_data = extract_text_with_headings(soup)
    internal_links = extract_internal_links(soup, url)

    return {
        "title": title,
        "content": text_data,
        "internal_links": internal_links
    }

test_url = "https://en.wikipedia.org/wiki/Web_scraping"
result = scrape_wikipedia(test_url)

print("Title:", result["title"])
print("\nArticle Content:", result["content"])
print("\nInternal Wikipedia Links:", result["internal_links"])


Title: Web scraping

Article Content: {'Introduction': ['Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.', 'Scraping a web page involves fetching it and then extracting data from it. Fetching is the downloading of a page (which a browser does when a user views a page). Therefore, web crawling is a main component of web scraping, to fetch pages for later processing. Having fetched, extraction can take place. The content of a page may be parsed, searched and reformatted, and its data copied in