# 📓 Web Scraper & Automation - Enhanced Version (Jupyter Notebook Format)

### 📝 Notebook Structure:
- Introduction & Dependencies
- Configuration & Setup
- Static Scraper with Requests & BeautifulSoup
- Dynamic Scraper with Selenium
- Saving Data to CSV
- Email Notification
- Main Scraping Function
- (Optional) Scheduler Automation
- Execution


#### 🕸️ Enhanced Web Scraper & Automation Project

##### In this project, we'll scrape news headlines from a news website, save the data to a CSV file, and send email notifications upon completion.

##### We'll demonstrate:
- Static scraping using **Requests + BeautifulSoup**
- Dynamic scraping using **Selenium**
- Error handling & retry logic
- Pagination scraping
- CSV storage
- Email alerts

---


In [1]:
# 📦 Dependencies Installation (if not already installed)
!pip install requests beautifulsoup4 pandas selenium

Collecting selenium
  Downloading selenium-4.29.0-py3-none-any.whl.metadata (7.1 kB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.29.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Collecting websocket-client~=1.8 (from selenium)
  Downloading websocket_client-1.8.0-py3-none-any.whl.metadata (8.0 kB)
Collecting attrs>=23.2.0 (from trio~=0.17->selenium)
  Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB)
Collecting sortedcontainers (from trio~=0.17->selenium)
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Downloading selenium-4.29.0-py3-none-any.whl (9.5 MB)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
notebook 7.0.8 requires jupyterlab<4.1,>=4.0.2, but you have jupyterlab 4.2.4 which is incompatible.




#### 🔧 Step 1: Import Libraries
##### We will import all necessary libraries for HTTP requests, parsing HTML, browser automation, data handling, and email alerts.

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import time
import smtplib
from email.message import EmailMessage
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

#### ⚙️ Step 2: Configuration & Setup
##### Define URLs, email credentials, headers, and file names.

In [4]:
# --- Configuration ---
BASE_URL = "https://www.bbc.com/news"
CSV_FILE = "enhanced_news_headlines.csv"

# Email Settings
EMAIL_SENDER = "your_email@example.com"
EMAIL_RECEIVER = "receiver_email@example.com"
EMAIL_PASSWORD = "your_email_password"

# User-Agent header to avoid getting blocked
HEADERS = {
    "User-Agent": "Mozilla/5.0"
}
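
A password hard-coded in a notebook is easy to leak when the file is shared. One alternative, sketched here under the assumption that the credentials are exported as environment variables, reads them at runtime. The variable names (`SCRAPER_EMAIL_SENDER` and so on) and the helper `load_email_settings` are hypothetical and not part of the original notebook.

```python
import os

def load_email_settings(env=None):
    """Read email settings from environment variables (names are assumptions)."""
    env = os.environ if env is None else env
    settings = {
        "sender": env.get("SCRAPER_EMAIL_SENDER", ""),
        "receiver": env.get("SCRAPER_EMAIL_RECEIVER", ""),
        "password": env.get("SCRAPER_EMAIL_PASSWORD", ""),
    }
    # Fail early with a clear message instead of at SMTP login time
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError(f"Missing email settings: {', '.join(missing)}")
    return settings
```

The returned values could then be assigned to `EMAIL_SENDER`, `EMAIL_RECEIVER`, and `EMAIL_PASSWORD` in place of the string literals.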

#### 🌐 Step 3: Static Scraper (Requests + BeautifulSoup)

##### Scrape headlines from static HTML content across multiple pages.

In [5]:
def scrape_static_pages():
    all_headlines = []
    print(f"[{datetime.now()}] Scraping static pages...")
    for page in range(1, 4):  # Scrape first 3 pages (modify as needed)
        page_url = f"{BASE_URL}?page={page}"
        try:
            response = requests.get(page_url, headers=HEADERS, timeout=10)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            # Skip this page but keep headlines collected from earlier pages
            print(f"[Error] Request failed for {page_url}: {e}")
            continue

        soup = BeautifulSoup(response.content, "html.parser")
        headlines = soup.find_all('h3')

        for h in headlines:
            title = h.get_text(strip=True)
            if title:
                all_headlines.append({
                    "Headline": title,
                    "Page": page,
                    "Scraped At": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                })
        time.sleep(2)  # Politeness delay

    print(f"[{datetime.now()}] Scraped {len(all_headlines)} headlines from static pages.")
    return all_headlines
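
The introduction lists retry logic, but `scrape_static_pages()` makes only one attempt per request. A minimal sketch of a retry wrapper with exponential backoff follows. The helper name `with_retries` and its parameters are hypothetical, and the injectable `sleep` argument exists only so the delays can be checked without waiting.

```python
import time

def with_retries(func, attempts=3, base_delay=1.0,
                 exceptions=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == attempts:
                raise  # Out of attempts: let the caller see the error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"[Retry] Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

Inside the scraper, the request could be wrapped as `with_retries(lambda: requests.get(page_url, headers=HEADERS, timeout=10), exceptions=(requests.exceptions.RequestException,))`.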

#### 🔄 Step 4: Dynamic Scraper (Selenium)

##### For dynamic, JavaScript-loaded content, we'll use Selenium with headless Chrome.

In [6]:
def scrape_dynamic_content():
    headlines = []
    driver = None
    try:
        print(f"[{datetime.now()}] Scraping dynamic content...")
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--disable-gpu')
        driver = webdriver.Chrome(options=options)

        driver.get(BASE_URL)
        time.sleep(5)  # Wait for JavaScript to load

        elements = driver.find_elements(By.TAG_NAME, 'h3')
        for elem in elements:
            title = elem.text.strip()
            if title:
                headlines.append({
                    "Headline": title,
                    "Page": "Dynamic",
                    "Scraped At": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                })
        print(f"[{datetime.now()}] Scraped {len(headlines)} dynamic headlines.")
        return headlines

    except Exception as e:
        print(f"[Error] Selenium scraping failed: {e}")
        return []

    finally:
        # Always close the browser, even if scraping raised an exception
        if driver is not None:
            driver.quit()
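
The fixed `time.sleep(5)` in the function above may wait longer than needed, or not long enough on a slow connection. Selenium provides explicit waits for this purpose (`WebDriverWait` in `selenium.webdriver.support.ui`). The underlying idea is a polling loop, sketched here in plain Python so that it runs without a browser. The helper name `wait_until` and its parameters are hypothetical.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

With a live driver, the condition could be `lambda: driver.find_elements(By.TAG_NAME, 'h3')`, which returns an empty (falsy) list until at least one headline element is present.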

#### 💾 Step 5: Save Data to CSV

##### Append the scraped data to a CSV file.

In [7]:
import os

def save_to_csv(data):
    df = pd.DataFrame(data)
    # Write the header row only when the file does not exist yet
    write_header = not os.path.exists(CSV_FILE)
    df.to_csv(CSV_FILE, index=False, mode='a', header=write_header)
    print(f"[{datetime.now()}] Saved {len(data)} headlines to CSV.\n")

#### 📧 Step 6: Send Email Notification

##### Notify via email when scraping is complete or fails.

In [8]:
def send_email(subject, body):
    msg = EmailMessage()
    msg['Subject'] = subject
    msg['From'] = EMAIL_SENDER
    msg['To'] = EMAIL_RECEIVER
    msg.set_content(body)

    try:
        with smtplib.SMTP_SSL('smtp.gmail.com', 465) as smtp:
            smtp.login(EMAIL_SENDER, EMAIL_PASSWORD)
            smtp.send_message(msg)
        print("[Email] Notification sent successfully!")
    except Exception as e:
        print(f"[Email Error] Failed to send email: {e}")
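
Separating message construction from delivery makes the notification easier to check without contacting an SMTP server. A small sketch under that assumption follows. `build_notification` is a hypothetical helper, and the subject, body, and addresses are placeholders.

```python
from email.message import EmailMessage

def build_notification(subject, body, sender, receiver):
    """Return an EmailMessage ready to pass to smtp.send_message()."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = receiver
    msg.set_content(body)
    return msg

# Build a message and inspect it locally; nothing is sent here
msg = build_notification(
    "Web Scraper: Completed Successfully",
    "Example body",
    "your_email@example.com",
    "receiver_email@example.com",
)
print(msg["Subject"])
```

`send_email()` could call such a helper and keep only the `smtplib` login and `send_message()` steps inside its `try` block.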

#### 🚀 Step 7: Main Scraper Function

##### Combines both static and dynamic scrapers, saves data, and sends notification.

In [9]:
def main_scraper():
    static_data = scrape_static_pages()
    dynamic_data = scrape_dynamic_content()
    total_data = static_data + dynamic_data
    
    if total_data:
        save_to_csv(total_data)
        send_email(
            subject="Web Scraper: Completed Successfully ✅",
            body=f"Scraped {len(total_data)} headlines and saved to {CSV_FILE}."
        )
    else:
        send_email(
            subject="Web Scraper: Failed ❌",
            body="No data scraped. Check logs for errors."
        )
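
Because both scrapers read `h3` elements from the same site, `static_data + dynamic_data` may contain the same headline more than once. A possible clean-up step is sketched below, assuming duplicates are identified by the `Headline` text alone, ignoring letter case. `dedupe_headlines` is a hypothetical helper, not part of the original notebook.

```python
def dedupe_headlines(rows):
    """Keep the first row for each headline text, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = row["Headline"].casefold()  # Case-insensitive comparison
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

It could be applied as `total_data = dedupe_headlines(static_data + dynamic_data)` before saving.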

#### 🕒 Step 8 (Optional): Scheduler Automation

##### Automatically run the scraper at regular intervals.

In [10]:
def schedule_scraper(interval_minutes=120):
    while True:
        main_scraper()
        print(f"Waiting {interval_minutes} minutes before next scrape...\n")
        time.sleep(interval_minutes * 60)
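
The loop above runs until the kernel is interrupted and stops entirely if `main_scraper()` raises. The variant sketched below catches failures per run and can be capped. The names `run_every` and `max_runs` and the injectable `sleep` argument are assumptions added for illustration.

```python
import time

def run_every(job, interval_seconds, max_runs=None, sleep=time.sleep):
    """Run job() repeatedly; return the number of runs that succeeded."""
    runs = 0
    succeeded = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
            succeeded += 1
        except Exception as exc:
            # Log and continue so one failure does not stop the schedule
            print(f"[Scheduler] Run failed: {exc}")
        runs += 1
        # Sleep only if another run will follow
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return succeeded
```

For example, `run_every(main_scraper, 120 * 60, max_runs=12)` would attempt twelve scrapes two hours apart. `KeyboardInterrupt` is not a subclass of `Exception`, so interrupting the kernel still stops the loop.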

#### ▶️ Step 9: Execute the Scraper

##### Run the scraper once (or enable continuous automation).

In [11]:
# Run once
main_scraper()

# Uncomment to enable automation
# schedule_scraper(interval_minutes=120)

[2025-03-19 14:59:50.133949] Scraping static pages...
[2025-03-19 15:00:16.828661] Scraped 0 headlines from static pages.
[2025-03-19 15:00:16.839397] Scraping dynamic content...


Error sending stats to Plausible: error sending request for url (https://plausible.io/api/event)


[2025-03-19 15:01:02.004027] Scraped 0 dynamic headlines.
[Email Error] Failed to send email: (535, b'5.7.8 Username and Password not accepted. For more information, go to\n5.7.8  https://support.google.com/mail/?p=BadCredentials 5b1f17b1804b1-43d43f47c60sm16713395e9.13 - gsmtp')


## 🎯 Done!