User Input
Function: get_user_input

Prompt the user for a search topic using a GUI dialog.

In [1]:
import tkinter as tk
from tkinter import simpledialog

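def get_user_input():
    root = tk.Tk()
    root.withdraw()  # Hide the root window
    query = simpledialog.askstring("Input", "Enter the topic to search for:", parent=root)
    root.destroy()  # Release the hidden root window once the dialog closes
    return query  # None if the user cancelled the dialog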
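
The dialog returns None when it is cancelled and may return an empty or whitespace-only string when the user confirms without typing. A minimal sketch of a cleanup step is shown here; normalize_query is a hypothetical helper that is not part of the original notebook.

```python
def normalize_query(raw):
    """Return a trimmed query string, or None if nothing usable was entered."""
    if raw is None:  # The dialog was cancelled
        return None
    cleaned = " ".join(raw.split())  # Trim and collapse runs of whitespace
    return cleaned or None  # Treat an empty result as "no query"
```

One possible use is to wrap the dialog result, for example normalize_query(get_user_input()), so that the rest of the workflow only has to check for None.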


In [None]:
help('modules')

Automated Website Identification
Function: search_google

Perform a Google search using the provided topic.
Fetch the search results page.

In [2]:
import requests

def search_google(query):
    url = "https://www.google.com/search"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    }
    # Pass the query via params so requests URL-encodes spaces and special characters
    response = requests.get(url, params={"q": query}, headers=headers, timeout=10)
    if response.status_code == 200:
        return response.text
    else:
        raise Exception(f"Failed to retrieve search results. Status code: {response.status_code}")
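
Search topics often contain spaces or characters such as & and +, which have to be percent-encoded before they are placed in a query string. The sketch below shows one way to build the URL with the standard library; build_search_url and its base_url parameter are hypothetical names introduced only for illustration.

```python
from urllib.parse import urlencode

def build_search_url(query, base_url="https://www.google.com/search"):
    """Return the search URL with the query safely percent-encoded."""
    return f"{base_url}?{urlencode({'q': query})}"

print(build_search_url("C++ & Rust tutorials"))
# https://www.google.com/search?q=C%2B%2B+%26+Rust+tutorials
```

Without encoding, the & in this topic would be read as a parameter separator and everything after it would be dropped from the search.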


Automated Link Compilation
Function: extract_links

Parse the HTML content of the search results page.
Extract and compile a list of up to 100 result links (a single results page may contain far fewer).

In [3]:
from bs4 import BeautifulSoup

def extract_links(html):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # '.tF2Cxc' is a generated class name on result containers; it may change without notice
    for result in soup.select('.tF2Cxc')[:100]:
        link_tag = result.select_one('a')
        if link_tag and 'href' in link_tag.attrs:
            links.append(link_tag['href'])
    return links
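
Depending on the markup returned, extracted hrefs can be relative, duplicated, or wrapped in a redirect. The sketch below post-processes a list of hrefs under the assumption that redirects take the form /url?q=<target>; clean_links and its limit parameter are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs

def clean_links(hrefs, limit=100):
    """Unwrap redirect links, keep absolute http(s) URLs, drop duplicates."""
    cleaned = []
    seen = set()
    for href in hrefs:
        if len(cleaned) >= limit:
            break
        # Assumed redirect form: /url?q=<target>&...
        if href.startswith("/url?"):
            targets = parse_qs(urlparse(href).query).get("q", [])
            href = targets[0] if targets else ""
        if urlparse(href).scheme not in ("http", "https"):
            continue  # Skip relative links, anchors, javascript: and similar
        if href not in seen:
            seen.add(href)
            cleaned.append(href)
    return cleaned
```

It could be applied to the output of extract_links before the scraping tests, so that each site is only tested once.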


Test Web Scraping Capability
Function: test_scraping

Attempt to scrape each identified link.
Check for the presence of relevant content.
Handle different content structures.

In [4]:
def test_scraping(link):
    try:
        response = requests.get(link, timeout=10)  # Avoid hanging on unresponsive sites
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")
        # Check for relevant content (this will depend on the specific use case)
        if soup.title:
            return True, None  # Successfully scraped
        else:
            return False, "No relevant content found"
    except requests.RequestException as e:
        return False, str(e)
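
What counts as relevant content depends on the use case. As one possible interpretation, the sketch below treats a page as relevant when every word of the search topic appears in its visible text. It uses the standard library's HTMLParser instead of BeautifulSoup so that it runs on its own; has_relevant_content and the all-words rule are assumptions, not the notebook's method.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)

def has_relevant_content(html, query):
    """Return True if every word of the query occurs in the page text."""
    parser = _TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()
    terms = query.lower().split()
    return bool(terms) and all(term in text for term in terms)
```

Inside test_scraping, a check like this could replace the soup.title test, with response.text as the html argument.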


Error Handling
Function: handle_errors

Log errors encountered during the scraping process.
Categorize errors with appropriate error codes.
(No separate handle_errors function is defined; error handling is integrated into test_scraping, which returns the exception message as the error detail.)
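
The logging and categorization steps described above can be sketched as follows. The E_* codes, the categorize_error helper, and this handle_errors body are all hypothetical. The sketch matches on exception class names (Timeout, SSLError, ConnectionError, TooManyRedirects, and HTTPError are classes in requests.exceptions) so that it runs without importing requests.

```python
import logging

logger = logging.getLogger("scraper")

# Checked in order: more specific classes come before their parents
ERROR_CODES = [
    ("Timeout", "E_TIMEOUT"),
    ("SSLError", "E_SSL"),
    ("ConnectionError", "E_CONNECTION"),
    ("TooManyRedirects", "E_REDIRECT"),
    ("HTTPError", "E_HTTP"),
]

def categorize_error(exc):
    """Map an exception to an error code by its class names."""
    names = {cls.__name__ for cls in type(exc).__mro__}
    for class_name, code in ERROR_CODES:
        if class_name in names:
            return code
    return "E_UNKNOWN"

def handle_errors(url, exc):
    """Log the failure and return its error code."""
    code = categorize_error(exc)
    logger.error("%s failed with %s: %s", url, code, exc)
    return code
```

In the except branch of test_scraping, the returned code could be reported alongside str(e) so that the final report can be grouped by failure type.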

Generate Comprehensive Report
Function: generate_report

Compile the results into a detailed report.
List websites that were successfully scraped and those that failed, including error details.
Output the report in a user-friendly format (e.g., CSV).

In [5]:
import csv

def generate_report(results):
    with open('scraping_report.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['URL', 'Status', 'Error']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        
        writer.writeheader()
        for result in results:
            writer.writerow(result)
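
Alongside the CSV file, a short tally of outcomes can make the report easier to read at a glance. The sketch below assumes each result is a dict with a 'Status' key set to 'Success' or 'Failed', as built in the main function; summarize_results is a hypothetical helper.

```python
from collections import Counter

def summarize_results(results):
    """Count how many URLs ended in each status."""
    counts = Counter(r["Status"] for r in results)
    return {
        "total": sum(counts.values()),
        "success": counts.get("Success", 0),
        "failed": counts.get("Failed", 0),
    }

sample = [
    {"URL": "https://example.com", "Status": "Success", "Error": "None"},
    {"URL": "https://example.org", "Status": "Failed", "Error": "timeout"},
]
print(summarize_results(sample))
# {'total': 2, 'success': 1, 'failed': 1}
```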


Main Function to Tie Everything Together
Function: main

Orchestrates the entire workflow.
Handles user input, search, link extraction, scraping tests, and report generation.

In [6]:
def main():
    query = get_user_input()
    if query:
        try:
            html = search_google(query)
            links = extract_links(html)
            results = []
            for link in links:
                success, error = test_scraping(link)
                result = {
                    'URL': link,
                    'Status': 'Success' if success else 'Failed',
                    'Error': error if error else 'None',
                }
                results.append(result)
            generate_report(results)
            print("Report generated successfully.")
        except Exception as e:
            print(f"An error occurred: {e}")
    else:
        print("No query provided.")

if __name__ == "__main__":
    main()
