
# Exercise 1: Exploring JavaScript Variables and Data Types
**Instructions**

Create a JavaScript script that defines variables of different data types and logs them to the console.

- Create a new HTML file with a `<script>` tag.
- Inside the `<script>` tag, declare variables of different data types (String, Number, Boolean, Undefined, Null).
- Use `console.log()` to print each variable and its type to the browser console.
- Open the HTML file in a web browser and inspect the console output.

```
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>JavaScript Variables and Data Types</title>
</head>
<body>
  <script>
    // Declare variables of different data types
    let myString = "Hello, JavaScript!"; // String
    let myNumber = 42;                    // Number
    let myBoolean = true;                 // Boolean
    let myUndefined;                      // Undefined (no value assigned)
    let myNull = null;                    // Null

    // Log each variable and its type to the console
    console.log(myString, typeof myString);
    console.log(myNumber, typeof myNumber);
    console.log(myBoolean, typeof myBoolean);
    console.log(myUndefined, typeof myUndefined);
    console.log(myNull, typeof myNull); // Note: typeof null returns "object" due to JS quirk
  </script>
</body>
</html>
```




# Exercise 2: JavaScript Page vs. HTML Page
**Instructions**

Compare the behavior of a static HTML page with a JavaScript-enhanced HTML page.

Create two HTML files – one with only HTML content and another with HTML and JavaScript.

In the first file, create a static page with headings, paragraphs, and a list.

In the second file, add JavaScript to dynamically modify one of the elements on page load (e.g., change the text of a heading).

Open both files in a web browser and observe the differences in behavior and content rendering.

**Expected Outcome**

The static HTML page should display content as is, whereas the JavaScript-enhanced page should show dynamically altered content, illustrating the interactivity added by JavaScript.

## A. Static HTML Page (static.html)
This page contains only HTML elements: headings, paragraphs, and a list. The content is fixed and does not change after loading.

```
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Static HTML Page</title>
</head>
<body>
  <h1>Welcome to My Static Page</h1>
  <p>This page contains only static HTML content.</p>
  <ul>
    <li>HTML is static</li>
    <li>No interactivity</li>
    <li>Content does not change</li>
  </ul>
</body>
</html>

```



## B. JavaScript-Enhanced HTML Page (dynamic.html)

This page has the same initial HTML content but includes JavaScript that dynamically changes the heading text when the page loads, illustrating interactivity.



```
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>JavaScript-Enhanced Page</title>
</head>
<body>
  <h1 id="main-heading">Welcome to My Static Page</h1>
  <p>This page contains HTML content enhanced with JavaScript.</p>
  <ul>
    <li>HTML is static</li>
    <li>JavaScript adds interactivity</li>
    <li>Content can change dynamically</li>
  </ul>

  <script>
    // Change the heading text dynamically on page load
    document.getElementById('main-heading').innerText = "Welcome to My Dynamic Page!";
  </script>
</body>
</html>
```
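The difference between the two pages also matters for scraping. The following Python sketch illustrates one consequence: a client that only reads the raw HTML never runs the `<script>`, so it still sees the original heading. It uses only the standard library's `html.parser`; the `HeadingParser` class and the inlined `DYNAMIC_HTML` string are assumptions made for this illustration, not part of the exercise files.

```python
from html.parser import HTMLParser

# Abridged copy of dynamic.html, inlined so the sketch is self-contained.
DYNAMIC_HTML = """
<!DOCTYPE html>
<html lang="en">
<body>
  <h1 id="main-heading">Welcome to My Static Page</h1>
  <script>
    document.getElementById('main-heading').innerText = "Welcome to My Dynamic Page!";
  </script>
</body>
</html>
"""


class HeadingParser(HTMLParser):
    """Collects the text of the first <h1> without executing any JavaScript."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.heading = None

    def handle_starttag(self, tag, attrs):
        # Start collecting text only for the first <h1> encountered
        if tag == "h1" and self.heading is None:
            self._in_h1 = True
            self.heading = ""

    def handle_data(self, data):
        if self._in_h1:
            self.heading += data

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False


parser = HeadingParser()
parser.feed(DYNAMIC_HTML)
print(parser.heading)  # Welcome to My Static Page
```

The parser reports the heading as written in the source, which is why the later exercises load pages in a real browser through Selenium before parsing them.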



# Exercise 3: Scrape Dynamic Content from Rotten Tomatoes

**Task:**

- Use Selenium to navigate to the Rotten Tomatoes Certified Fresh Movies page.
- Extract the HTML content after it’s fully loaded.
- Use BeautifulSoup to parse and extract the movie titles, scores, and release dates.

**Instructions**
- Set up Selenium WebDriver and navigate to the Rotten Tomatoes page.
- Extract the HTML content using `driver.page_source`.
- Parse the HTML with BeautifulSoup.
- Find and extract the desired movie information.
- Print the extracted data.


The first step is to install Selenium to ensure the necessary libraries are available for WebDriver setup.



In [18]:
pip install selenium



Install `webdriver-manager`, which downloads a ChromeDriver build that matches the installed Chrome version and supplies its path to Selenium.



In [17]:
pip install webdriver-manager



## Install Google Chrome

### Subtask:
Install Google Chrome directly using `wget` and `dpkg -i` to bypass APT repository and GPG key issues, and then resolve any dependencies.


In [22]:
import subprocess

# Download the latest Google Chrome .deb package
subprocess.run(['wget', 'https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb'])

print('Google Chrome .deb package downloaded.')

Google Chrome .deb package downloaded.


**Reasoning**:
Now that the Google Chrome .deb package is downloaded, the next step is to install it using `dpkg -i` as per the instructions.



In [23]:
import subprocess

# Install the downloaded .deb package
subprocess.run(['sudo', 'dpkg', '-i', 'google-chrome-stable_current_amd64.deb'])

print('Google Chrome .deb package installation initiated.')

Google Chrome .deb package installation initiated.


**Reasoning**:
The previous step initiated the installation of Google Chrome, and now it's necessary to resolve any potential broken dependencies as instructed by running `apt-get install -f`.



In [24]:
import subprocess

# Resolve any broken dependencies
subprocess.run(['sudo', 'apt-get', 'install', '-f', '-y'])

print('Attempting to resolve broken dependencies.')

Attempting to resolve broken dependencies.
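The three install cells above print their status message whether or not the command actually succeeded. A minimal sketch of one way to surface failures, assuming a hypothetical helper named `run_step` (not part of the original notebook), is to check the exit code that `subprocess.run` returns. The demonstration commands only invoke the current Python interpreter, so the sketch can be run anywhere.

```python
import subprocess
import sys


def run_step(command, description):
    """Run a command and report success or failure based on its exit code."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        print(f"{description}: succeeded.")
        return True
    print(f"{description}: failed with exit code {result.returncode}.")
    print(result.stderr.strip())
    return False


# Harmless demonstration commands that use the current Python interpreter.
ok = run_step([sys.executable, "-c", "print('hello')"], "Demo command")
failed = run_step(
    [sys.executable, "-c", "import sys; sys.exit(3)"], "Failing demo command"
)
```

The same helper could wrap the `wget`, `dpkg -i`, and `apt-get install -f -y` commands so that a failed download or install is reported instead of silently passed over.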


In [72]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
from webdriver_manager.chrome import ChromeDriverManager

# Set up Selenium WebDriver with ChromeDriverManager
service = Service(ChromeDriverManager().install())
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # Run in headless mode (optional)
options.add_argument('--no-sandbox') # Required for running as root/headless
options.add_argument('--disable-dev-shm-usage') # Required for running in environments with limited /dev/shm
options.add_argument('--window-size=1920,1080') # Set a specific window size to simulate a larger display
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36') # Add user-agent
driver = webdriver.Chrome(service=service, options=options)

try:
    # Navigate to Rotten Tomatoes Certified Fresh Movies page
    url = 'https://www.rottentomatoes.com/browse/movies_in_theaters/critics:certified_fresh~sort:popular'
    driver.get(url)

    # Maximize window for better rendering and element visibility
    driver.maximize_window()
    print("Window maximized.")

    # Give the page some initial time to load its basic structure
    time.sleep(10) # Increased initial sleep

    # Try to dismiss a potential cookie consent banner or other overlay
    try:
        # Look for a button with text 'Accept' or 'Agree' (common for cookie banners)
        # Also try more generic "Got it" or "Close" selectors
        accept_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "//button[contains(., 'Accept')] | //button[contains(., 'Agree')] | //button[contains(., 'Got it')] | //button[contains(@aria-label, 'Close')] | //button[text()='I Accept'] | //button[contains(text(), 'I Accept')] | //button[contains(text(), 'Accept All')] "))
        )
        accept_button.click()
        print("Clicked 'Accept/Agree/Got it/Close' button on cookie banner/modal.")
        time.sleep(3) # Give time for the banner to disappear
    except Exception as e:
        # No banner appeared (or it was not clickable); continue without dismissing it
        print(f"No common 'Accept/Agree/Got it/Close' button found or not clickable within 10 seconds. Proceeding... Error: {e}")

    # Robust scrolling to load all dynamic content
    last_height = driver.execute_script("return document.body.scrollHeight")
    scroll_attempts = 0
    max_scroll_attempts = 30 # Further increased max attempts
    previous_movie_count = 0

    print("Starting continuous scroll...")
    while scroll_attempts < max_scroll_attempts:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5) # Increased sleep after scroll for more content to load

        new_height = driver.execute_script("return document.body.scrollHeight")
        current_movie_count = len(driver.find_elements(By.CSS_SELECTOR, 'a[data-qa="discovery-media-list-item"]'))

        print(f"Scrolled {scroll_attempts+1} times. Current height: {new_height}, Movies found: {current_movie_count}")

        if new_height == last_height and current_movie_count == previous_movie_count:
            # If height and movie count haven't changed, try to scroll again after a short pause
            time.sleep(2)
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            new_height = driver.execute_script("return document.body.scrollHeight")
            current_movie_count = len(driver.find_elements(By.CSS_SELECTOR, 'a[data-qa="discovery-media-list-item"]'))
            if new_height == last_height and current_movie_count == previous_movie_count:
                print(f"Scroll height and movie count did not change after {scroll_attempts+1} attempts. Breaking scroll loop.")
                break # Truly at the end of the scroll or no more content to load

        last_height = new_height
        previous_movie_count = current_movie_count
        scroll_attempts += 1

    print(f"Finished scrolling. Final page height: {last_height}")

    # Wait until at least one movie list item is present after scrolling
    # This also acts as a final check that the page is ready
    wait = WebDriverWait(driver, 120) # Increased wait time significantly to 2 minutes
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a[data-qa="discovery-media-list-item"]')))
    print("Movie list elements are present.")

    # Give a final generous sleep to allow all loaded elements to render completely
    time.sleep(15)

    # Extract the fully loaded page source
    page_source = driver.page_source

    # Parse with BeautifulSoup
    soup = BeautifulSoup(page_source, 'html.parser')

    # Find all movie tiles using the movie item container selector
    movies = soup.select('a[data-qa="discovery-media-list-item"]')
    print(f"Found {len(movies)} movies.")

    # Extract and print movie details
    if not movies:
        print("No movies found. Check selectors or page loading process.")

    for movie in movies:
        # Extract the title from its data-qa-tagged span
        title_tag = movie.select_one('span.p--small[data-qa="discovery-media-list-item-title"]')
        title = title_tag.text.strip() if title_tag else 'N/A'

        # Extract the Tomatometer score and drop the '%' sign
        # rt-text is a custom element; BeautifulSoup treats it like any other tag
        score_tag = movie.select_one('rt-text[slot="criticsScore"]')
        score = score_tag.text.strip().replace('%', '') if score_tag else 'N/A'

        # Extract the release date and drop the 'Opened ' prefix
        release_tag = movie.select_one('span.smaller[data-qa="discovery-media-list-item-start-date"]')
        release_date = release_tag.text.strip().replace('Opened ', '') if release_tag else 'N/A'

        print(f"Title: {title}, Score: {score}, Release Date: {release_date}")

finally:
    driver.quit()

Window maximized.
No common 'Accept/Agree/Got it/Close' button found or not clickable within 10 seconds. Proceeding... Error: Message: 

Starting continuous scroll...
Scrolled 1 times. Current height: 3108, Movies found: 3
Scrolled 2 times. Current height: 3108, Movies found: 3
Scroll height and movie count did not change after 2 attempts. Breaking scroll loop.
Finished scrolling. Final page height: 3108
Movie list elements are present.
Found 3 movies.
Title: Rental Family, Score: 87, Release Date: Nov 21, 2025
Title: The Voice of Hind Rajab, Score: 95, Release Date: Dec 17, 2025
Title: La Grazia, Score: 80, Release Date: Dec 05, 2025


## Summary:

### Data Analysis Key Findings
*   A total of 3 movies were successfully extracted from the Rotten Tomatoes 'Certified Fresh Movies' page.
*   The extracted movies are: "Rental Family" (Score: 87%, Release Date: Nov 21, 2025), "The Voice of Hind Rajab" (Score: 95%, Release Date: Dec 17, 2025), and "La Grazia" (Score: 80%, Release Date: Dec 05, 2025).
*   The Tomatometer scores for the extracted movies range from 80% to 95%, consistent with the 'Certified Fresh' filter applied in the page URL.
*   All extracted movies have release dates in late 2025 (November and December).

### Insights or Next Steps
*   The current run captured only three titles: a small, highly rated subset of Certified Fresh movies with release dates in November and December 2025.
*   To get a broader view, consider expanding the scraping to include more movies, different categories, or a wider range of release dates.
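
Before any broader analysis, the scraped strings need to become typed values. The sketch below is one possible approach, not part of the original notebook: it converts the score to an integer and the release date to a `date`, treating `'N/A'` as missing. The `to_record` helper and the dictionary keys are assumptions for illustration; the three rows are copied from the cell output above, and the `%b` date format assumes an English locale.

```python
from datetime import datetime

# Values copied from the cell output above.
scraped = [
    {"title": "Rental Family", "score": "87", "release_date": "Nov 21, 2025"},
    {"title": "The Voice of Hind Rajab", "score": "95", "release_date": "Dec 17, 2025"},
    {"title": "La Grazia", "score": "80", "release_date": "Dec 05, 2025"},
]


def to_record(raw):
    """Convert scraped strings to typed values; unparseable fields become None."""
    score = int(raw["score"]) if raw["score"].isdigit() else None
    try:
        released = datetime.strptime(raw["release_date"], "%b %d, %Y").date()
    except ValueError:
        released = None
    return {"title": raw["title"], "score": score, "released": released}


records = [to_record(r) for r in scraped]

# Score range over the rows that have a numeric score
scores = [r["score"] for r in records if r["score"] is not None]
print(min(scores), max(scores))  # 80 95

# Titles ordered by release date, skipping rows without a parseable date
by_date = sorted((r for r in records if r["released"]), key=lambda r: r["released"])
print([r["title"] for r in by_date])
```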


# Exercise 4: Scrape and Categorize News Articles from a JavaScript-Enabled News Site
**Task:**

Visit this website https://www.bbc.com/innovation/technology

Scrape news article titles and their publication dates.

Categorize articles based on their publication month.

**Instructions:**
Use Selenium to navigate to a specific news section on the website.

Extract and parse the HTML content that is dynamically loaded via JavaScript.

Using BeautifulSoup, extract news article titles and publication dates.

Categorize articles by their publication month (e.g., ‘January’, ‘February’, etc.).

Print the categorized lists of articles.

**Reasoning**:
The first instruction is to verify the Google Chrome installation and its version using the `google-chrome --version` command.



In [59]:
import subprocess

# Verify Google Chrome installation and version
result = subprocess.run(['google-chrome', '--version'], capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    print('Google Chrome is not installed or not found in PATH.')
    print(result.stderr)
else:
    print('Google Chrome version verified.')

Google Chrome 143.0.7499.169 

Google Chrome version verified.


**Reasoning**:
The previous step verified the Google Chrome version. The next instruction is to test if Chrome can run in headless mode by attempting to launch it from the command line and dumping the DOM of a simple page.



In [61]:
import subprocess

# Attempt to launch Chrome in headless mode and dump the DOM with --no-sandbox
result = subprocess.run(
    ['google-chrome', '--headless', '--disable-gpu', '--no-sandbox', '--dump-dom', 'https://www.google.com'],
    capture_output=True,
    text=True
)

print('Headless Chrome test initiated with --no-sandbox.')
if result.returncode != 0:
    print('Error launching headless Chrome or dumping DOM.')
    print(result.stderr)
else:
    print('Headless Chrome successfully launched and dumped DOM (partial output for brevity).')
    # Print only a portion of the output to avoid flooding the console with full HTML
    print(result.stdout[:1000] + ('...' if len(result.stdout) > 1000 else ''))

Headless Chrome test initiated with --no-sandbox.
Headless Chrome successfully launched and dumped DOM (partial output for brevity).
<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta charset="UTF-8"><link href="//www.gstatic.com/images/branding/searchlogo/ico/favicon.ico" rel="icon"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="">window._hst=Date.now();</script><script nonce="">(function(){var _g={kEI:'jfNSaYHwHq2EwbkP2v7v-QI',kEXPI:'31',kBL:'lvLq',kOPI:89978449};(function(){var a;((a=window.google)==null?0:a.stvsc)?google.kEI=_g.kEI:window.google=_g;}).call(this);})();(function(){google.sn='webhp';google.kHL='en';google.rdn=false;})();(function(){
var g=this||self;function k(){return window.google&&window.google.kOPI||null};var l,m=[];function n(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||l}function p(a){for(var b=n

## Refine ChromeOptions for stability

### Subtask:
Add or modify ChromeOptions arguments to enhance stability in headless mode, such as `--disable-gpu`, `--no-sandbox`, and `--disable-dev-shm-usage`, which can mitigate common issues in containerized or cloud environments.


**Reasoning**:
The previous attempt to run Selenium in headless mode failed with a `SessionNotCreatedException`. To address this and follow the instructions, I will refine the `ChromeOptions` by ensuring all specified arguments (`--headless`, `--disable-gpu`, `--no-sandbox`, `--disable-dev-shm-usage`, `--window-size`, and `user-agent`) are correctly added, and integrate `webdriver_manager` for robust ChromeDriver setup, as was done in Exercise 3.



In [73]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
from datetime import datetime
import time
import json
import re # Import re for enhanced date parsing
from webdriver_manager.chrome import ChromeDriverManager

# Setup headless Chrome with refined options
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu') # Added for stability in headless environments
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--window-size=1920,1080') # Simulate a larger display
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36') # Mimic a real browser

# Use ChromeDriverManager to automatically handle ChromeDriver installation
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

try:
    driver.get('https://www.bbc.com/innovation/technology')

    # Wait for dynamic content to load. Using more general selectors for article containers.
    wait = WebDriverWait(driver, 30) # Increased wait time to ensure page loads fully
    wait.until(EC.presence_of_element_located((
        By.CSS_SELECTOR,
        'main article, div[data-testid*="-card"], div[data-component*="promo"], div[class*="promo"], section[data-component="promo-group"]'
    )))

    # Give a bit more time for all elements to settle after initial load
    time.sleep(10) # Increased sleep after initial wait

    # Get page source after JS loads
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    articles = []

    # Broadened selectors for article containers.
    # Using attribute contains selector `*=` for more flexibility.
    article_elements = soup.select(
        'div[data-testid*="-card"]' # Catches sheffield-card, dundee-card, windsor-card, liverpool-card, etc.
    )

    print(f"Found {len(article_elements)} potential article containers.")

    if not article_elements:
        print("No article containers found with current selectors. Check page source for new selectors.")

    for i, article_el in enumerate(article_elements):
        # Try to find the title using the provided selector within the current article container
        title_el_container = article_el.select_one('div.sc-fa814188-0.hPhBqv')
        title_el = title_el_container.select_one('h2[data-testid="card-headline"], h3[data-testid="card-headline"]') if title_el_container else None

        # Try to find the date using the provided selector within the current article container
        date_el_container = article_el.select_one('div.sc-1907e52a-0.fZLsBL')
        date_el = date_el_container.select_one('span[data-testid="card-metadata-lastupdated"]') if date_el_container else None

        title = title_el.get_text(strip=True) if title_el else 'N/A'

        date_str = 'N/A'
        if date_el:
            # The date can be in 'datetime' attribute of a 'time' tag or directly in text of a 'span'
            if date_el.name == 'time' and 'datetime' in date_el.attrs:
                date_str = date_el['datetime']
            else:
                date_str = date_el.get_text(strip=True)

        # Debugging prints
        # print(f"  Extracted Title: {title}")
        # print(f"  Extracted Date String: {date_str}")

        if title != 'N/A' and date_str != 'N/A' and not date_str.lower().endswith(('ago', 'hours', 'minutes', 'yesterday', 'today', 'now')):
            try:
                pub_date = None
                # ISO format: '2023-12-01T12:00:00Z' or '2023-12-01T12:00:00+00:00'
                if 'T' in date_str and ('Z' in date_str or '+' in date_str):
                    pub_date = datetime.fromisoformat(date_str.replace('Z', '+00:00'))
                # Date like '12 Dec 2023' or 'Dec 12, 2023' or '12 December 2023'
                elif re.match(r"\d{1,2}\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{4}", date_str): # DD Mon YYYY
                    pub_date = datetime.strptime(date_str, '%d %b %Y')
                elif re.match(r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}", date_str): # Mon DD, YYYY
                    pub_date = datetime.strptime(date_str, '%b %d, %Y')
                elif re.match(r"\d{1,2}\s(?:January|February|March|April|May|June|July|August|September|October|November|December)\s\d{4}", date_str): # DD MonthName YYYY
                    pub_date = datetime.strptime(date_str, '%d %B %Y')
                elif re.match(r"(?:January|February|March|April|May|June|July|August|September|October|November|December)\s\d{1,2},\s\d{4}", date_str): # MonthName DD, YYYY
                    pub_date = datetime.strptime(date_str, '%B %d, %Y')
                # Fallback for simpler YYYY-MM-DD or similar if it appears
                elif len(date_str) == 10 and date_str.count('-') == 2: # YYYY-MM-DD
                    pub_date = datetime.strptime(date_str, '%Y-%m-%d')

                if pub_date:
                    month = pub_date.strftime('%B')  # e.g., 'December'
                    articles.append({'title': title, 'date': date_str, 'month': month})
                else:
                    print(f"Could not parse date format for '{date_str}' for title '{title}'. Skipping for month categorization.")

            except ValueError as ve:
                print(f"Failed to parse date '{date_str}' for title '{title}'. Error: {ve}. Skipping.")
        else:
            if date_str != 'N/A' and date_str.lower().endswith(('ago', 'hours', 'minutes', 'yesterday', 'today', 'now')):
                print(f"Skipping article '{title}' due to relative date format: '{date_str}'.")
            else:
                print(f"Skipping article due to missing title or date: Title='{title}', Date='{date_str}'.")

    print(f"Extracted {len(articles)} articles after parsing.")

    # Categorize by month
    categorized = {}
    for art in articles:
        month = art['month']
        if month not in categorized:
            categorized[month] = []
        categorized[month].append(art['title'])

    # Print categorized lists
    for month, titles in categorized.items():
        if titles:
            print(f"\n{month}:")
            for title in titles:
                print(f"- {title}")

    print(json.dumps(categorized, indent=2))  # Structured output

finally:
    driver.quit()

Found 41 potential article containers.
Skipping article due to missing title or date: Title='James Bond game 007 First Light delayed to May 2026', Date='N/A'.
Skipping article due to missing title or date: Title='CEO of Microsoft AI: 'If you're not a little bit afraid... you're not paying attention'', Date='N/A'.
Skipping article due to missing title or date: Title='Rainbow Six servers back online after apparent hack', Date='N/A'.
Skipping article due to missing title or date: Title='Lights, camera, algorithm: Why Indian cinema is awash with AI', Date='N/A'.
Skipping article due to missing title or date: Title='Many new UK drone users must take theory test before flying outside', Date='N/A'.
Skipping article due to missing title or date: Title='Ghosts house 'protected' by digital mapping', Date='N/A'.
Skipping article due to missing title or date: Title='Both of these influencers are successful - but only one is human', Date='N/A'.
Skipping article due to missing title or date: Title='

## Summary:

### Data Analysis Key Findings
*   The scraping process successfully extracted 6 news articles from the BBC Innovation Technology page.
*   All extracted articles were published in **December**.
*   The titles of the extracted articles are:
    *   "Why the world is running out of frankincense"
    *   "'LeBron James of spreadsheets' wins world Microsoft Excel title"
    *   "Are AI prompts damaging your thinking skills?"
    *   "Will the US TikTok deal make it safer but less relevant?"
    *   "TikTok owner signs deal to avoid US ban"
    *   "AI likely to displace jobs, says Bank of England governor"
*   Relative date formats (e.g., '7 days ago') were skipped for categorization, ensuring only articles with explicit dates were processed.

### Insights or Next Steps
*   The refined CSS selectors and date parsing logic yielded 6 dated articles from 41 candidate containers; the log shows that other articles were skipped because no date was found for them (`Date='N/A'`).
*   The categorization by month successfully grouped the articles. Further analysis could involve categorizing by other criteria or performing sentiment analysis on the article content if needed.
*   The explicit exclusion of relative date formats in the extraction process ensures that only precise publication dates are used for categorization, making the data more consistent.
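
If relative dates should be categorized rather than skipped, one option is to resolve them against the time of the scrape. The following is a minimal sketch under stated assumptions: the helper `resolve_relative_date`, the unit table, and the reference timestamp are all hypothetical, and only strings of the form "N units ago" (such as the '7 days ago' example above) are handled.

```python
import re
from datetime import datetime, timedelta

# Hypothetical mapping from singular unit words to timedelta keyword arguments.
_UNITS = {
    "min": "minutes", "minute": "minutes",
    "hr": "hours", "hour": "hours",
    "day": "days", "week": "weeks",
}
_RELATIVE = re.compile(
    r"(\d+)\s*(mins?|minutes?|hrs?|hours?|days?|weeks?)\s+ago", re.IGNORECASE
)


def resolve_relative_date(text, reference):
    """Return an absolute datetime for strings like '7 days ago', else None."""
    match = _RELATIVE.fullmatch(text.strip())
    if not match:
        return None
    amount = int(match.group(1))
    unit = match.group(2).lower().rstrip("s")  # 'days' -> 'day', 'hrs' -> 'hr'
    return reference - timedelta(**{_UNITS[unit]: amount})


# Assumed scrape time, chosen only for the demonstration.
reference = datetime(2025, 12, 29, 12, 0)
resolved = resolve_relative_date("7 days ago", reference)
print(resolved)  # 2025-12-22 12:00:00
print(resolve_relative_date("12 Dec 2025", reference))  # None
```

A resolved value could then go through the same `strftime('%B')` step as the explicit dates. Note that the result depends on the reference time, so a run close to a month boundary may place an article in a different month than its true publication date.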

# Exercise 5: Scrape and Analyze Weather Data from a JavaScript-Enabled Weather Website
**Task:**

Visit https://www.accuweather.com/en/us/attica/30607/weather-forecast/2139413

Scrape weather forecast data including temperature, condition, and humidity.

Analyze the data to find the average temperature and most common weather condition.

**Instructions:**

Use Selenium to navigate to the weather forecast page of a specific city.

Extract and parse the HTML content, focusing on dynamically loaded weather data.

Using BeautifulSoup, extract relevant weather information like temperature, condition (sunny, cloudy, etc.), and humidity.

Calculate the average temperature and identify the most common weather condition.

Print the analysis results.
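
The analysis step itself does not depend on Selenium. As a rough sketch, assuming the forecast has already been scraped into a list of dictionaries, the average and the most common condition can be computed with the standard library. The sample rows, the dictionary keys, and the `first_int` helper below are hypothetical and are not values taken from AccuWeather.

```python
import re
from collections import Counter
from statistics import mean

# Hypothetical sample rows; real values would come from the parsed forecast cards.
forecasts = [
    {"temp": "61°", "condition": "Sunny", "humidity": "45%"},
    {"temp": "58°", "condition": "Cloudy", "humidity": "60%"},
    {"temp": "64°", "condition": "Sunny", "humidity": "50%"},
    {"temp": "N/A", "condition": "Rain", "humidity": "80%"},
]


def first_int(text):
    """Return the first (possibly negative) integer in text, or None."""
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None


# Keep only temperatures that could be parsed, so 'N/A' rows do not skew the mean
temps = [t for t in (first_int(f["temp"]) for f in forecasts) if t is not None]
average_temp = mean(temps)

# most_common(1) returns a list with one (value, count) pair
most_common_condition, count = Counter(
    f["condition"] for f in forecasts
).most_common(1)[0]

print(f"Average temperature: {average_temp:.1f}")
print(f"Most common condition: {most_common_condition} ({count} of {len(forecasts)})")
```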


## Initialize Selenium WebDriver

### Subtask:
Set up the Selenium WebDriver (Chrome) in headless mode with necessary options for stable execution, using `webdriver_manager` for automatic ChromeDriver management.


In [68]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

# Create an instance of Options to configure Chrome browser settings
options = Options()

# Add arguments for headless and stable execution
options.add_argument('--headless')
options.add_argument('--disable-gpu') # Added for stability in headless environments
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--window-size=1920,1080') # Simulate a larger display
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36') # Mimic a real browser

# Instantiate ChromeDriverManager and call its install() method, then pass to Service
service = Service(ChromeDriverManager().install())

# Initialize the Chrome WebDriver
driver = webdriver.Chrome(service=service, options=options)

print("Selenium WebDriver initialized successfully in headless mode.")

Selenium WebDriver initialized successfully in headless mode.


## Navigate to AccuWeather Page
Use Selenium to navigate to the specified AccuWeather URL: https://www.accuweather.com/en/us/attica/30607/weather-forecast/2139413.


In [69]:
accuweather_url = 'https://www.accuweather.com/en/us/attica/30607/weather-forecast/2139413'
driver.get(accuweather_url)
print(f"Navigated to AccuWeather page: {accuweather_url}")

Navigated to AccuWeather page: https://www.accuweather.com/en/us/attica/30607/weather-forecast/2139413


**Reasoning**:
After navigating to the AccuWeather page, the next step is to wait for the dynamic content to load to ensure all weather data is present before parsing the HTML.
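
The idea behind waiting for dynamic content is polling: check a condition repeatedly until it holds or a deadline passes, instead of sleeping for a fixed time. The sketch below is a standard-library analogue of that idea, written for illustration only; it is not Selenium's `WebDriverWait` implementation, and the `wait_until` helper and the stand-in condition are assumptions.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for "the page now contains forecast cards": becomes true after 0.2 s.
ready_at = time.monotonic() + 0.2
value = wait_until(lambda: time.monotonic() >= ready_at, timeout=2.0, interval=0.05)
print(value)  # True
```

Compared with a fixed `time.sleep(10)`, a polling wait returns as soon as the content is available and fails loudly when it never appears.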



## Re-evaluate AccuWeather container selectors

### Subtask:
Inspect the live AccuWeather page to identify the most current and robust CSS selectors for the main daily weather forecast containers. Adjust the `WebDriverWait` condition and the `daily_forecasts` variable in the script accordingly.

#### Instructions
1. Open the URL `https://www.accuweather.com/en/us/attica/30607/weather-forecast/2139413` in a web browser.
2. Right-click on a daily weather forecast card (e.g., for 'Today' or 'Tomorrow') and select 'Inspect' or 'Inspect Element' to open the browser's developer tools.
3. In the Elements panel of the developer tools, identify the main HTML element that encapsulates a single daily weather forecast. Look for attributes like `class`, `id`, `data-qa`, or other unique identifiers.
4. Specifically, identify the most robust and unique CSS selector for these individual forecast cards. Look for a selector that is likely to remain consistent across page updates (e.g., `data-qa` attributes are often stable).
5. Note down this robust CSS selector.
6. Also, identify the CSS selectors for the temperature (high and low), weather condition (day and night), and precipitation within each forecast card.
7. Update the `daily_forecasts_selector` and `wait_for_element_selector` variables in the next Python code block with the most robust selectors you found for the daily forecast cards.
8. Similarly, update the `date_elem_selector`, `temp_hi_elem_selector`, `temp_lo_elem_selector`, `condition_day_elem_selector`, `condition_night_elem_selector`, and `precip_elem_selector` within the BeautifulSoup parsing loop based on your inspection.

## Enhance error handling and logging for element presence

### Subtask:
Add more detailed logging to indicate exactly which selectors are being attempted and whether they are found or not, to better pinpoint where the content loading issue lies.
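
One way to get this kind of logging, sketched here with the standard `logging` module, is to try candidate selectors in order and record how many elements each one matched. The helper `first_matching_selector` is hypothetical, the selectors are placeholders rather than real AccuWeather selectors, and a dictionary lookup stands in for a call such as `soup.select`.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("scraper")


def first_matching_selector(selectors, find_all):
    """Try each selector in order and log the number of matches for each."""
    for selector in selectors:
        matches = find_all(selector)
        logger.info("Selector %r matched %d element(s)", selector, len(matches))
        if matches:
            return selector, matches
    logger.warning("None of the %d selectors matched", len(selectors))
    return None, []


# Stand-in for soup.select: a dict lookup keyed by placeholder selectors.
fake_page = {"div.card": [], "a.daily-card": ["card-1", "card-2"]}
selector, cards = first_matching_selector(
    ["div.card", "a.daily-card", "section.forecast"],
    lambda s: fake_page.get(s, []),
)
```

Because the finder is passed in as a callable, the same helper would work with `soup.select` for parsed HTML or with a Selenium lookup for the live page.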


In [75]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import re
from collections import Counter
from webdriver_manager.chrome import ChromeDriverManager

# --- WebDriver Initialization ---
# Create an instance of Options to configure Chrome browser settings
options = Options()
# Add arguments for headless and stable execution
options.add_argument('--headless')
options.add_argument('--disable-gpu') # Added for stability in headless environments
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--window-size=1920,1080') # Simulate a larger display
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36') # Mimic a real browser

# Instantiate ChromeDriverManager and call its install() method, then pass to Service
service = Service(ChromeDriverManager().install())

# Initialize the Chrome WebDriver
driver = webdriver.Chrome(service=service, options=options)
print("Selenium WebDriver initialized successfully in headless mode.")

# --- Robust Action Helper ---
MAX_RETRIES = 5
RETRY_DELAY = 5 # seconds

def robust_action(action, description):
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            action()
            print(f"{description} successful on attempt {attempt}.")
            return True
        except Exception as e:
            print(f"{description} failed on attempt {attempt}: {e}")
            if attempt < MAX_RETRIES:
                print(f"Retrying {description} in {RETRY_DELAY} seconds...")
                time.sleep(RETRY_DELAY)
            else:
                print(f"{description} failed after {MAX_RETRIES} attempts.")
                return False
    return False # Only reached if MAX_RETRIES is less than 1

# --- Main Scraping Logic ---
try:
    accuweather_url = 'https://www.accuweather.com/en/us/attica/30607/weather-forecast/2139413'

    # Navigate to URL
    def navigate_to_url():
        driver.get(accuweather_url)
    robust_action(navigate_to_url, f"Navigating to {accuweather_url}")

    # Maximize window
    def maximize_window():
        driver.maximize_window()
    robust_action(maximize_window, "Maximizing browser window")
    print("Window maximization attempted.")

    # Give initial time for the page to load
    print("Giving initial time for the page to load...")
    time.sleep(10)
    print("Initial page load time complete.")

    # Try to dismiss any cookie consent banner
    # XPath matching the buttons that typically accept or close the banner
    cookie_button_xpath = (
        "//button[contains(., 'Accept')] | //button[contains(., 'Agree')] | "
        "//button[contains(., 'Got it')] | //button[contains(@aria-label, 'Close')] | "
        "//div[contains(@class, 'fc-dialog-container')]//button[text()='I Accept']"
    )
    # XPath used to confirm the banner is gone after the click
    cookie_banner_xpath = (
        "//div[contains(@class, 'fc-dialog-container')] | "
        "//button[contains(., 'Accept')] | //button[contains(., 'Agree')]"
    )

    def dismiss_cookie_banner():
        print(f"Attempting to find and click cookie consent button using XPATH: {cookie_button_xpath}")
        accept_button = WebDriverWait(driver, 15).until(
            EC.element_to_be_clickable((By.XPATH, cookie_button_xpath))
        )
        # Click via JavaScript so an overlay cannot intercept the click
        driver.execute_script("arguments[0].click();", accept_button)
        print("Cookie consent button clicked. Waiting for it to disappear...")
        WebDriverWait(driver, 10).until(
            EC.invisibility_of_element_located((By.XPATH, cookie_banner_xpath))
        )
        print("Cookie consent banner disappeared.")

    if robust_action(dismiss_cookie_banner, "Dismissing cookie banner"):
        print("Cookie banner dismissed.")
    else:
        print("No cookie banner found or could not be dismissed. Proceeding...")

    # Wait for the main weather content to load using the robust action
    def wait_for_weather_content():
        selector = 'div.day-panel-container'
        print(f"Attempting to find main weather content using CSS selector: {selector}")
        WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, selector)))
        print(f"Main weather content element found using selector: {selector}")

    if not robust_action(wait_for_weather_content, "Waiting for main weather content to load"):
        print("Could not load main weather content. Proceeding with available source, but data extraction might be affected.")

    print("Giving additional time for all JS elements to render completely...")
    time.sleep(5) # Additional sleep to ensure all JS renders
    print("Additional rendering time complete.")

    # Extract the fully loaded page source
    page_source = driver.page_source

    # Parse with BeautifulSoup
    soup = BeautifulSoup(page_source, 'html.parser')

    weather_data = []
    # Find all daily forecast containers
    daily_forecast_selector = 'div.daily-wrapper, div.day-panel-container'
    print(f"Attempting to find daily forecast containers using CSS selector: {daily_forecast_selector}")
    daily_forecasts = soup.select(daily_forecast_selector) # Broaden selection for safety

    if not daily_forecasts:
        print(f"No daily forecast containers found using selector: {daily_forecast_selector}. Check selectors or page loading.")
    else:
        print(f"Found {len(daily_forecasts)} daily forecast containers using selector: {daily_forecast_selector}.")

    for forecast in daily_forecasts:
        # Extract Date
        date_selector = 'div.date p.day'
        print(f"  Attempting to find date element using CSS selector: {date_selector}")
        date_elem = forecast.select_one(date_selector)
        date = date_elem.text.strip() if date_elem else 'N/A'
        print(f"  Date element found: {date}" if date_elem else "  Date element not found.")

        day_date_selector = 'div.date p:last-child'
        print(f"  Attempting to find day date element using CSS selector: {day_date_selector}")
        day_date_elem = forecast.select_one(day_date_selector) # e.g., 12/29
        if day_date_elem:
            print(f"  Day date element found: {day_date_elem.text.strip()}")
        else:
            print("  Day date element not found.")

        if date != 'N/A' and day_date_elem: # Append actual date if available
            date += f" ({day_date_elem.text.strip()})"

        # Extract Temperature
        temp_hi_selector = 'div.temp span.temp-hi'
        print(f"  Attempting to find high temperature element using CSS selector: {temp_hi_selector}")
        temp_hi_elem = forecast.select_one(temp_hi_selector)
        temperature_high = temp_hi_elem.text.strip() if temp_hi_elem else 'N/A'
        print(f"  High temperature element {'found' if temp_hi_elem else 'not found'}: {temperature_high}")

        temp_lo_selector = 'div.temp span.temp-lo'
        print(f"  Attempting to find low temperature element using CSS selector: {temp_lo_selector}")
        temp_lo_elem = forecast.select_one(temp_lo_selector)
        temperature_low = temp_lo_elem.text.strip() if temp_lo_elem else 'N/A'
        print(f"  Low temperature element {'found' if temp_lo_elem else 'not found'}: {temperature_low}")

        # Extract Condition (Day and Night if available)
        condition_day_selector = 'div.phrase p.no-wrap'
        print(f"  Attempting to find day condition element using CSS selector: {condition_day_selector}")
        condition_day_elem = forecast.select_one(condition_day_selector) # Main condition
        condition_day = condition_day_elem.text.strip() if condition_day_elem else 'N/A'
        print(f"  Day condition element {'found' if condition_day_elem else 'not found'}: {condition_day}")

        condition_night_selector = 'div.phrase span.night p.no-wrap'
        print(f"  Attempting to find night condition element using CSS selector: {condition_night_selector}")
        condition_night_elem = forecast.select_one(condition_night_selector)
        condition_night = condition_night_elem.text.strip().replace('Night: ', '') if condition_night_elem else 'N/A'
        print(f"  Night condition element {'found' if condition_night_elem else 'not found'}: {condition_night}")

        # Extract Precipitation
        precip_selector = 'div.precip'
        print(f"  Attempting to find precipitation element using CSS selector: {precip_selector}")
        precip_elem = forecast.select_one(precip_selector)
        precipitation = precip_elem.text.strip() if precip_elem else 'N/A'
        print(f"  Precipitation element {'found' if precip_elem else 'not found'}: {precipitation}")

        weather_data.append({
            'date': date,
            'temp_high': temperature_high,
            'temp_low': temperature_low,
            'condition_day': condition_day,
            'condition_night': condition_night,
            'precipitation': precipitation
        })

    print(f"Extracted {len(weather_data)} weather forecasts.")

    # --- Data Analysis ---
    all_temps = []       # every high and low that could be parsed as an integer
    all_conditions = []

    for data in weather_data:
        print(f"Date: {data['date']}, High: {data['temp_high']}, Low: {data['temp_low']}, Condition (Day): {data['condition_day']}, Condition (Night): {data['condition_night']}, Precip: {data['precipitation']}")

        # Collect temperatures. Highs and lows are counted individually so that
        # a missing value on one side does not skew the average.
        for key in ('temp_high', 'temp_low'):
            match = re.search(r'-?\d+', data[key]) # 'N/A' yields no match; '-?' keeps a leading minus sign
            if match:
                all_temps.append(int(match.group()))

        # Collect conditions for most common
        if data['condition_day'] != 'N/A':
            all_conditions.append(data['condition_day'])
        if data['condition_night'] != 'N/A' and data['condition_night'] != data['condition_day']: # Avoid duplicates if day and night are same
            all_conditions.append(data['condition_night'])

    # Average temperature over all parsed highs and lows
    avg_temp = 'N/A'
    if all_temps:
        avg_temp = f"{sum(all_temps) / len(all_temps):.1f}°" # Format to one decimal place

    # Most common weather condition
    most_common_condition = 'N/A'
    if all_conditions:
        condition_counts = Counter(all_conditions)
        most_common_condition = condition_counts.most_common(1)[0][0]

    print("\n--- Weather Analysis Results ---")
    print(f"Average Temperature: {avg_temp}")
    print(f"Most Common Weather Condition: {most_common_condition}")

finally:
    driver.quit()
    print("WebDriver closed.")

Selenium WebDriver initialized successfully in headless mode.
Navigating to https://www.accuweather.com/en/us/attica/30607/weather-forecast/2139413 successful on attempt 1.
Maximizing browser window successful on attempt 1.
Window maximization attempted.
Giving initial time for the page to load...
Initial page load time complete.
Attempting to find and click cookie consent button using XPATH: //button[contains(., 'Accept')] | //button[contains(., 'Agree')] | //button[contains(., 'Got it')] | //button[contains(@aria-label, 'Close')] | //div[contains(@class, 'fc-dialog-container')]//button[text()='I Accept']
Dismissing cookie banner failed on attempt 1: HTTPConnectionPool(host='localhost', port=57063): Read timed out. (read timeout=120)
Retrying Dismissing cookie banner in 5 seconds...
Attempting to find and click cookie consent button using XPATH: //button[contains(., 'Accept')] | //button[contains(., 'Agree')] | //button[contains(., 'Got it')] | //button[contains(@aria-label, 'Close')]



WebDriver closed.


KeyboardInterrupt: 