# BeautifulSoup Overview
BeautifulSoup is a Python library used for web scraping and parsing HTML or XML documents. It allows easy navigation, searching, and modification of the document structure to extract the desired data.

Key Features:

- Parses HTML/XML documents quickly using parsers like html.parser, lxml, or html5lib.
- Allows easy selection of elements using tag names, attributes, or CSS selectors.
- Automatically fixes poorly formatted HTML for better parsing.
- Flexible and easy to integrate with other libraries like requests for fetching web pages.

Workflow:

1. Fetch the webpage using requests or similar libraries.
2. Parse the HTML using BeautifulSoup.
3. Navigate and extract the desired elements using tags, classes, or attributes.
4. Process and store the extracted data.
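
As a minimal offline sketch of the parse and extract steps, the snippet below runs BeautifulSoup on a small hand-written HTML string instead of a fetched page. The markup, the `post` class, and the variable names are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup

# Hand-written markup standing in for a fetched page (illustrative only)
html = """
<html><body>
  <h1 id="title">Sample page</h1>
  <div class="post"><a href="/a">First</a></div>
  <div class="post"><a href="/b">Second</a></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Select by tag name
title = soup.find("h1").get_text(strip=True)

# Select by class attribute
posts = [div.get_text(strip=True) for div in soup.find_all("div", class_="post")]

# Select with a CSS selector and read an attribute
links = [a["href"] for a in soup.select("div.post > a")]

print(title, posts, links)
```

The three lookups return the same kind of `Tag` objects, so the choice between `find`, `find_all`, and `select` is mostly a matter of which query is easiest to read.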

In [7]:
!pip install beautifulsoup4 requests




In [8]:
from bs4 import BeautifulSoup
import requests

def scrape_with_beautifulsoup(url, fields):
    # A timeout keeps the call from hanging on an unresponsive server
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        print(f"Failed to fetch the URL: {url} (Status Code: {response.status_code})")
        return

    soup = BeautifulSoup(response.content, 'html.parser')
    data = []

    # Example: Extract paragraphs from the Wikipedia article
    paragraphs = soup.select(".mw-parser-output > p")
    for para in paragraphs[:5]:  # Limit to first 5 paragraphs
        item = {
            "name": fields[0],  # Placeholder field
            "date": "N/A",      # Placeholder field
            "reference": para.get_text(strip=True)  # Extracted text
        }
        data.append(item)

    # Display scraped data
    print("Scraped Data:")
    for entry in data:
        print(entry)

# Example Input
url = "https://en.wikipedia.org/wiki/Adolf_Hitler"  # Target URL
fields = ["name", "date", "reference"]  # Fields to extract
scrape_with_beautifulsoup(url, fields)


Scraped Data:
{'name': 'name', 'date': 'N/A', 'reference': ''}
{'name': 'name', 'date': 'N/A', 'reference': 'Adolf Hitler[a](20 April 1889 – 30 April 1945) was an Austrian-born German politician who was the dictator ofNazi Germanyfrom 1933 untilhis suicidein 1945.He rose to poweras the leader of theNazi Party,[c]becomingthe chancellorin 1933 and then taking the title ofFührer und Reichskanzlerin 1934.[d]Hisinvasion of Polandon 1\xa0September 1939 marked the start of theSecond World War. He was closely involved in military operations throughout the war and was central to the perpetration ofthe Holocaust: thegenocideofabout six million Jews and millions of other victims.'}
{'name': 'name', 'date': 'N/A', 'reference': "Hitler was born inBraunau am InninAustria-Hungaryand moved toGermanyin 1913. He was decorated during his service in the German Army inWorld War I, receiving theIron Cross. In 1919, he joined theGerman Workers' Party(DAP), the precursor of the Nazi Party, and in 1921 was app
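
The output above shows two side effects of the extraction call: the first entry is empty, and words around inline links are fused together (for example `ofNazi Germanyfrom`). That happens because `get_text(strip=True)` strips each text node and then joins them with no separator. The sketch below reproduces the effect on a hand-written fragment (an assumption, not the real page) and shows one possible adjustment: pass a separator and drop empty paragraphs.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics an empty paragraph followed by a paragraph with a link
html = '<div><p></p><p>Ada Lovelace was born in <a href="#">London</a> in 1815.</p></div>'
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.select("div > p")

# Same call as in the cell above: text nodes are stripped, then joined with ""
fused = [p.get_text(strip=True) for p in paragraphs]

# Joining with a single space keeps words apart; empty paragraphs are filtered out
spaced = [p.get_text(" ", strip=True) for p in paragraphs]
spaced = [text for text in spaced if text]

print(fused)
print(spaced)
```

A space separator has its own cost: when a link is directly followed by punctuation, the result contains a stray space before the punctuation mark, so some post-processing may still be needed.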

# Scrapy Overview
Scrapy is a Python framework for web scraping and crawling. It is used to extract data from websites efficiently and can handle large-scale scraping tasks. Scrapy works on an event-driven architecture and processes data using spiders, which are Python classes that define how to navigate and scrape websites.

Key Features:

- Extracts data using CSS selectors or XPath.
- Handles retries, redirects, and cookies automatically.
- Exports data in formats like JSON or CSV, or stores it in a database.
- Asynchronous and high-performance, using Twisted for networking.

Workflow:

1. Create a project with scrapy startproject.
2. Write a spider to define scraping logic.
3. Run the spider to fetch and parse data.
4. Store the data in files or databases.
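
When items are exported as JSON Lines (one JSON object per line), they can be read back with the standard library alone. The sketch below assumes a hypothetical helper named `load_json_lines` and uses an in-memory sample in place of a real export file; the field names are illustrative.

```python
import io
import json

def load_json_lines(stream):
    """Return one dict per non-empty line of a JSON Lines stream."""
    items = []
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines between records
            items.append(json.loads(line))
    return items

# In-memory stand-in for an exported file (illustrative data)
sample = io.StringIO(
    '{"paragraph_number": 1, "text": "First paragraph"}\n'
    '{"paragraph_number": 2, "text": "Second paragraph"}\n'
)
items = load_json_lines(sample)
print([item["paragraph_number"] for item in items])
```

Because each line is an independent JSON document, a partly written or appended file can still be read line by line, which is not the case for a single JSON array.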


In [9]:
!pip install scrapy




In [16]:
import os
import json
import re

# Step 1: Write the Scrapy spider to a file
scrapy_script = """
import scrapy

class WikipediaSpider(scrapy.Spider):
    name = "wikipedia"
    start_urls = ["https://en.wikipedia.org/wiki/Adolf_Hitler"]

    def parse(self, response):
        paragraphs = response.css(".mw-parser-output > p")
        for i, paragraph in enumerate(paragraphs[:5]):
            text = " ".join(paragraph.css("::text").getall()).strip()
            if text:
                yield {
                    "paragraph_number": i + 1,
                    "text": text
                }
"""

with open("wikipedia_spider.py", "w") as f:
    f.write(scrapy_script)

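# Step 2: Run the spider through the scrapy CLI and save the items as JSON Lines (.jl).
# Note: -o appends to an existing file, so delete output.jl (or use -O) before re-running.
os.system("scrapy runspider wikipedia_spider.py -o output.jl")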

# Step 3: Helper Functions for Classification
def extract_names(text):
    """
    Extract proper nouns (names) using capitalization patterns and exclude months or generic terms.
    """
    all_names = re.findall(r'\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b', text)
    excluded_keywords = {"April", "September", "Nazi", "Second", "He", "His", "Holocaust"}  # Add any non-name terms
    return [name for name in all_names if name not in excluded_keywords]


def extract_dates(text):
    """Extract dates in common formats (e.g., '20 April 1889')."""
    return re.findall(r'\b\d{1,2}\s(?:January|February|March|April|May|June|July|August|September|October|November|December)\s\d{4}\b', text)

def extract_references(text):
    """Extract references or footnotes (e.g., '[3]', '[a]')."""
    return re.findall(r'\[\d+\]|\[[a-z]+\]', text)

# Step 4: Process and display the output
if os.path.exists("output.jl"):
    print("Classified and Segregated Output:")
    with open("output.jl") as f:
        for line in f:
            try:
                item = json.loads(line)
                paragraph_number = item.get("paragraph_number", "N/A")
                text = item.get("text", "")

                # Extract classified information
                names = extract_names(text)
                dates = extract_dates(text)
                references = extract_references(text)

                # Display structured output
                print(f"Paragraph {paragraph_number}:")
                print(f"  Text: {text}")
                print(f"  Names: {', '.join(names) if names else 'None'}")
                print(f"  Dates: {', '.join(dates) if dates else 'None'}")
                print(f"  References: {', '.join(references) if references else 'None'}\n")
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON line: {line}\n{e}")


Classified and Segregated Output:
Paragraph 2:
  Text: Adolf Hitler [ a ]  (20 April 1889 – 30 April 1945) was an Austrian-born German politician who was the dictator of  Nazi Germany  from 1933 until  his suicide  in 1945.  He rose to power  as the leader of the  Nazi Party , [ c ]  becoming  the chancellor  in 1933 and then taking the title of  Führer und Reichskanzler  in 1934. [ d ]  His  invasion of Poland  on 1 September 1939 marked the start of the  Second World War . He was closely involved in military operations throughout the war and was central to the perpetration of  the Holocaust : the  genocide  of  about six million Jews and millions of other victims .
  Names: Adolf Hitler, Austrian, German, Nazi Germany, Nazi Party, Reichskanzler, Poland, Second World War, Jews
  Dates: 20 April 1889, 30 April 1945, 1 September 1939
  References: None

Paragraph 3:
  Text: Hitler was born in  Braunau am Inn  in  Austria-Hungary  and moved to  Germany  in 1913. He was decorated during h
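
The `References: None` line above appears to be a side effect of how the spider joins text nodes with spaces: footnote markers arrive as `[ a ]`, which the pattern in `extract_references` does not match. A minimal sketch of a more tolerant matcher follows; `extract_references_tolerant` is a hypothetical name, and the sample sentence is made up for illustration.

```python
import re

def extract_references_strict(text):
    """Same pattern as the notebook cell: no whitespace allowed inside the brackets."""
    return re.findall(r'\[\d+\]|\[[a-z]+\]', text)

def extract_references_tolerant(text):
    """Allow optional whitespace inside the brackets and normalize markers to '[x]'."""
    markers = re.findall(r'\[\s*(\d+|[a-z]+)\s*\]', text)
    return [f"[{marker}]" for marker in markers]

# Made-up sentence with spaced and unspaced markers
sample = "The treaty [ a ] was signed in 1919. [ 12 ] It was later revised.[b]"

strict = extract_references_strict(sample)
tolerant = extract_references_tolerant(sample)
print(strict)
print(tolerant)
```

Another option would be to join the text nodes without extra spaces in the spider itself, at the cost of fusing words around links.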

## Multi-Agent Scraper Code
This code implements a multi-agent system designed to automate web scraping tasks dynamically based on user input. The system guides users through extracting specific data fields (e.g., names, dates) from a webpage by generating, validating, and debugging Python scraping scripts. It consists of five interconnected agents:

**Page Analyzer**: Analyzes the given URL to determine if the webpage is static or dynamic and extracts its structure.

**Strategy Planner**: Selects the appropriate scraping approach (e.g., BeautifulSoup for static pages or Selenium for dynamic pages).

**Code Generator**: Creates a Python script tailored to extract the requested field (e.g., proper nouns for "names" or patterns for "dates").

**Validator**: Runs the generated script and checks whether it successfully extracts the desired data.

**Debugger**: Provides feedback to improve the code if validation fails.

The agents run in sequence, each passing its result to the next. When validation succeeds, the system prints the generated script and the extracted data; when it fails, the Debugger prints the error together with a hint. In this version the Page Analyzer treats every page it can fetch as static, the Selenium branch is a placeholder, and a failed script is not regenerated automatically. The same pipeline can be pointed at other pages by entering a different URL and field type.
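
One way validation feedback could drive a retry is sketched below. This is not the notebook's implementation: `generate_script`, `check_script`, and `run_with_feedback` are hypothetical names, the page is replaced by a fixed string so the example runs offline, and the "refinement" is simply moving to a looser pattern on the next attempt.

```python
PAGE_TEXT = "Released on 3 March 2021 and updated on 14 July 2022."

def generate_script(attempt):
    """Return a script as a string; later attempts use a looser date pattern."""
    patterns = [
        r"\d{4}-\d{2}-\d{2}",        # attempt 0: ISO dates only
        r"\d{1,2} [A-Za-z]+ \d{4}",  # attempt 1: '3 March 2021' style
    ]
    pattern = patterns[min(attempt, len(patterns) - 1)]
    return f"import re\ndata = re.findall({pattern!r}, text)\n"

def check_script(code):
    """Run the script; an exception or an empty result counts as a failure."""
    scope = {"text": PAGE_TEXT}
    try:
        exec(code, scope)
    except Exception as exc:
        return None, f"script raised {exc!r}"
    data = scope.get("data")
    if not data:
        return None, "script ran but extracted nothing"
    return data, None

def run_with_feedback(max_attempts=3):
    """Generate, validate, and retry until a script extracts something."""
    for attempt in range(max_attempts):
        data, error = check_script(generate_script(attempt))
        if error is None:
            return attempt, data
        print(f"Attempt {attempt} failed: {error}")
    return None, []

attempt, data = run_with_feedback()
print(attempt, data)
```

Treating an empty result as a failure matters here: a script that runs without raising is not necessarily a script that found the requested field.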

In [17]:
!pip install beautifulsoup4 selenium requests




In [25]:
import requests
from bs4 import BeautifulSoup
import re

# Agent 1: Page Analyzer
def analyze_page(url):
    print("Analyzing the page...")
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        return soup, "static"
    else:
        raise Exception("Failed to fetch the page. Ensure the URL is correct.")

# Agent 2: Strategy Planner
def plan_strategy(field_type, page_type):
    print("Planning scraping strategy...")
    if page_type == "static":
        strategy = "BeautifulSoup"
    else:
        strategy = "Selenium"
    return strategy

# Agent 3: Code Generator
def generate_code(url, field_type, strategy):
    print("Generating code...")
    if strategy == "BeautifulSoup":
        if field_type.lower() == "names":
            code = f"""
import requests
from bs4 import BeautifulSoup
import re

url = "{url}"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Extracting names (proper nouns)
data = {{
    "names": [match.group() for match in re.finditer(r'\\b[A-Z][a-z]+(?:\\s[A-Z][a-z]+)*\\b', soup.get_text())]
}}
print(data)
"""
        elif field_type.lower() == "dates":
            code = f"""
import requests
from bs4 import BeautifulSoup
import re

url = "{url}"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Extracting dates
data = {{
    "dates": re.findall(r'\\b\\d{{1,2}} [A-Za-z]+ \\d{{4}}\\b', soup.get_text())
}}
print(data)
"""
        else:
            code = f"""
import requests
from bs4 import BeautifulSoup

url = "{url}"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Extracting {field_type}
data = {{
    "{field_type}": [tag.text.strip() for tag in soup.find_all()]
}}
print(data)
"""
    else:
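        # Emit valid Python so the validator reports a clear error instead of a SyntaxError
        code = 'raise NotImplementedError("Dynamic page strategies (using Selenium) are not implemented yet.")'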
    return code

# Agent 4: Validator
def validate_code(code):
    print("Validating the code...")
    try:
        exec_globals = {}
        exec(code, exec_globals)
        return exec_globals["data"], True
    except Exception as e:
        return str(e), False

# Agent 5: Debugger
def debug_code(error_message):
    print("Debugging Feedback:", error_message)
    print("Ensure the field type exists on the page and try again.")
    return None

# Multi-Agent System
def multi_agent_scraper():
    url = input("Enter the URL to scrape: ").strip()
    field_type = input("Specify the data field to extract (e.g., names, dates): ").strip()

    # Step 1: Analyze Page
    try:
        soup, page_type = analyze_page(url)
        print(f"Page Type: {page_type.capitalize()}")
    except Exception as e:
        print("Error during page analysis:", e)
        return

    # Step 2: Plan Strategy
    strategy = plan_strategy(field_type, page_type)
    print(f"Strategy Selected: {strategy}")

    # Step 3: Generate Code
    code = generate_code(url, field_type, strategy)
    print("Generated Code:\n", code)

    # Step 4: Validate Code
    extracted_data, is_valid = validate_code(code)
    if is_valid:
        print("Successfully validated the code!")
        print("Extracted Data:", extracted_data)
    else:
        print("Validation Failed:", extracted_data)
        # Step 5: Debugging
        debug_code(extracted_data)

# Run the scraper
multi_agent_scraper()




Enter the URL to scrape: https://www.astronomer.io/airflow/
Specify the data field to extract (e.g., names, dates): names




Analyzing the page...
Page Type: Static
Planning scraping strategy...
Strategy Selected: BeautifulSoup
Generating code...
Generated Code:
 
import requests
from bs4 import BeautifulSoup
import re

url = "https://www.astronomer.io/airflow/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Extracting names (proper nouns)
data = {
    "names": [match.group() for match in re.finditer(r'\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b', soup.get_text())]
}
print(data)

Validating the code...
{'names': ['Apache Airflow', 'Express Data Flows', 'Code', 'De Facto Standard', 'Astro Observe', 'Learn More', 'Support', 'Log In', 'Apache', 'Airflow', 'Why Airflow', 'Astro', 'Astro', 'See Our', 'Apache', 'Blocks', 'Explore', 'Astro', 'Read', 'Case', 'All', 'Download', 'Started Free', 'Apache Airflow', 'The De Facto Standard', 'Data Workflow Automation', 'Apache Airflow', 'Learning Center', 'What', 'Airflow', 'Airflow Use Cases', 'Scalability', 'Security', 'Multi', 'Tenancy', 'Re
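
The list above repeats entries such as 'Astro' and 'Apache Airflow'. A minimal sketch of order-preserving de-duplication is shown below; `unique_in_order` is a hypothetical helper, and the input is a short hand-picked subset of the output rather than the full result.

```python
def unique_in_order(values):
    """Drop repeated entries while keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(values))

names = ["Apache Airflow", "Astro", "Apache", "Astro", "Apache Airflow", "Airflow"]
unique_names = unique_in_order(names)
print(unique_names)
```

De-duplication does not remove navigation labels such as 'Learn More' or 'Log In'; the capitalization pattern matches any title-cased phrase, so separating real names from interface text would need an additional filter.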

## Multi-Agent Web Scraper
This multi-agent web scraper is designed to summarize the key content of any website based on a given URL. The system uses multiple agents to analyze the structure of the website, plan a scraping strategy, extract relevant text data, and summarize it into a coherent, human-readable paragraph.

Here’s how it works:

1. **URL Input**: The user provides the URL of the website to summarize.
2. **Agent Workflow**:
   - The Page Analyzer Agent examines the structure of the website (static or dynamic).
   - The Strategy Planner Agent determines the best scraping approach (e.g., BeautifulSoup or Selenium).
   - The Content Extractor Agent scrapes the text content.
   - The Summarizer Agent processes the extracted content to produce a concise summary.
   - The Validator Agent checks the accuracy and coherence of the generated summary.
3. **Output**: The final output is a readable summary of the website's main content.

This design can summarize many kinds of web pages, from news articles to informational sites. In the code that follows, only fetching, content extraction, cleaning, and summarization are implemented; the Page Analyzer, Strategy Planner, and Validator steps are described here but have no corresponding code.

In [1]:
!pip install transformers beautifulsoup4 requests




In [4]:
from transformers import pipeline
from bs4 import BeautifulSoup
import requests
import re
import time

def extract_meaningful_content(soup):
    """Extracts relevant content from the website."""
    paragraphs = soup.find_all(['p', 'h1', 'h2', 'h3'])  # Focus on main content tags
    content = []
    for para in paragraphs:
        text = para.get_text(strip=True)
        if len(text) > 50:  # Skip very short or irrelevant text
            content.append(text)
    return " ".join(content)

def clean_text(text):
    """Cleans extracted text to remove unwanted parts."""
    text = re.sub(r'\[\d+\]', '', text)  # Remove citation markers like [1], [2]
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
    return text.strip()

def summarize_content(content):
    """Summarizes the content using a pre-trained model."""
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn", min_length=50, max_length=100)
    chunk_size = 1024  # Characters per chunk (a rough proxy: the model's limit is counted in tokens, not characters)
    chunks = [content[i:i+chunk_size] for i in range(0, len(content), chunk_size)]
    summaries = []
    for i, chunk in enumerate(chunks):
        print(f"Summarizing chunk {i+1}/{len(chunks)}...")
        summary = summarizer(chunk)[0]['summary_text']
        summaries.append(summary)
    return " ".join(summaries)

def multi_agent_web_summarizer():
    print("Welcome to the Improved Website Summarizer!")
    url = input("Enter the URL to summarize: ").strip()

    start_time = time.time()

    # Step 1: Fetch Website Content
    print("Fetching website content...")
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Treat HTTP error codes as fetch failures
        soup = BeautifulSoup(response.content, 'html.parser')
    except Exception as e:
        print("Error fetching the website content:", str(e))
        return

    # Step 2: Extract Meaningful Content
    print("Extracting meaningful content...")
    raw_content = extract_meaningful_content(soup)
    cleaned_content = clean_text(raw_content)

    if not cleaned_content:
        print("No meaningful content found.")
        return

    print(f"Content extracted. Length: {len(cleaned_content)} characters.")

    # Step 3: Summarize Content
    print("Summarizing content...")
    try:
        final_summary = summarize_content(cleaned_content)
    except Exception as e:
        print("Error during summarization:", str(e))
        return

    print("Summarization completed in:", round(time.time() - start_time, 2), "seconds")

    # Step 4: Output Summary
    print("\n--- Summary of the Website ---")
    formatted_summary = final_summary.replace('. ', '.\n')  # Add line breaks for better readability
    print(formatted_summary)


# Run the scraper
multi_agent_web_summarizer()



Welcome to the Improved Website Summarizer!
Enter the URL to summarize: https://en.wikipedia.org/wiki/Adolf_Hitler
Fetching website content...
Extracting meaningful content...
Content extracted. Length: 78771 characters.
Summarizing content...


Device set to use cuda:0


Summarizing chunk 1/77...
Summarizing chunk 2/77...
Summarizing chunk 3/77...
Summarizing chunk 4/77...
Summarizing chunk 5/77...
Summarizing chunk 6/77...
Summarizing chunk 7/77...
Summarizing chunk 8/77...
Summarizing chunk 9/77...
Summarizing chunk 10/77...
Summarizing chunk 11/77...
Summarizing chunk 12/77...
Summarizing chunk 13/77...
Summarizing chunk 14/77...
Summarizing chunk 15/77...
Summarizing chunk 16/77...
Summarizing chunk 17/77...
Summarizing chunk 18/77...
Summarizing chunk 19/77...
Summarizing chunk 20/77...
Summarizing chunk 21/77...
Summarizing chunk 22/77...
Summarizing chunk 23/77...
Summarizing chunk 24/77...
Summarizing chunk 25/77...
Summarizing chunk 26/77...
Summarizing chunk 27/77...
Summarizing chunk 28/77...
Summarizing chunk 29/77...
Summarizing chunk 30/77...
Summarizing chunk 31/77...
Summarizing chunk 32/77...
Summarizing chunk 33/77...
Summarizing chunk 34/77...
Summarizing chunk 35/77...
Summarizing chunk 36/77...
Summarizing chunk 37/77...
Summarizin
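
The run above cuts 78,771 characters into 77 fixed-width slices, and a slice taken with `content[i:i+chunk_size]` can end in the middle of a word or sentence. A minimal sketch of sentence-aware chunking follows; `chunk_by_sentence` is a hypothetical helper, the sentence splitter is a simple punctuation-based heuristic, and the limit is still measured in characters rather than tokens.

```python
import re

def chunk_by_sentence(text, max_chars=1024):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own oversized chunk.
    """
    # Split after '.', '!' or '?' when followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # Adding this sentence would overflow: close the chunk and start a new one
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

sample = "One short sentence. Another short sentence! A third one? And a final sentence."
chunks = chunk_by_sentence(sample, max_chars=40)
for chunk in chunks:
    print(len(chunk), chunk)
```

Keeping sentences intact gives the summarizer complete statements to work with, though the number of chunks, and therefore the run time, stays roughly the same for a page of this size.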