# Introduction to Web Scraping with Selenium
This notebook may accompany the presentation. It introduces the basic steps of automating a browser to collect data.

In [73]:
# 1. Setup
# Ensure you have selenium installed: pip install selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Initialize the browser (Firefox)
# This will open a new browser window controlled by Python
# If you don't have Firefox setup, you can try webdriver.Chrome() if you have ChromeDriver
browser = webdriver.Firefox() 

In [74]:
# 2. Navigation
# Tell the browser to go to a specific URL
url = 'https://www.bbc.com/news/world/europe'
browser.get(url)
print(f"Successfully visited: {browser.title}")

Successfully visited: Europe | Latest News & Updates | BBC News


In [75]:
#Cookie clicker

# 1. Switch to the frame first (crucial for cookie banners!)
# Based on common BBC structures:
browser.switch_to.frame(browser.find_element(By.CSS_SELECTOR, "iframe[id^='sp_message_iframe']"))

button1 = browser.find_element(By.CSS_SELECTOR, "button[title='I agree']")
button1.click()
browser.switch_to.default_content()


## 3. Locating Elements
We need to find "hooks" in the HTML to specific content.
Common methods:
* `By.ID`
* `By.CLASS_NAME`
* `By.TAG_NAME`
* `By.CSS_SELECTOR`

In [76]:
# Example: Find all article containers
# We'll use TAG_NAME 'h2' as an example since headlines are usually h2.
# In the assignment, you'll need to inspect the specific page to find the right class or container.

headlines = browser.find_elements(By.TAG_NAME,'h2')

print(f"Found {len(headlines)} headlines.")

# Print the first 5 headlines found
for i, headline in enumerate(headlines[:5]):
    print(f"{i+1}: {headline.text}")

Found 35 headlines.
1: 'Difficult' Russia-Ukraine peace talks end without breakthrough
2: Wave of arrests over killing of French nationalist piles pressure on far left
3: Third Briton dies in French Alps avalanches in one week
4: Benfica claim 'defamation campaign' against Prestianni
5: Spain luxury hotel scammer booked rooms for one cent, police say


In [51]:
# 4. Extracting Data
# Once we have an element, we want its properties like .get_attribute('href')...

# Let's find links (anchor tags <a>) from selenium.webdriver.support.ui import WebDriverWait
#
links = browser.find_elements(By.TAG_NAME, 'a')

print("Sample of extracted data:")
count = 0
for link in links:
    # ... more data ...
    hd= link.text
    url = link.get_attribute('href')
    # Filter for interesting links (e.g., containing 'news') to avoid menu links
    if  hd and len(hd) > 20 and 'news'  in href:
        print(f"URL: {href}",  f"\n Headline :{hd}")
        print("-" * 20)
        count += 1
        if count >= 3: break

# link.text grabs all the text under the tag <a> which we specified in links, this is useful, but will be a nightmare when cleaning the data.

Sample of extracted data:
URL: https://www.bbc.com/news/articles/c0k1xj0d708o 
 Headline :As Trump retreats from climate goals, China is becoming a green superpower
--------------------
URL: https://www.bbc.com/news/articles/c0k1xj0d708o 
 Headline :Do not give away Diego Garcia, Trump tells UK in fresh attack on Chagos deal
The president's comments come just a day after the US gave its official backing to the UK's Chagos deal.
3 hrs ago
UK
--------------------
URL: https://www.bbc.com/news/articles/c0k1xj0d708o 
 Headline :Eight skiers found dead after California avalanche
Fifteen skiers went missing on Tuesday following a massive avalanche in California's Lake Tahoe region. One person remains missing but is presumed dead.
2 hrs ago
US & Canada
--------------------


In [79]:
links = browser.find_elements(By.TAG_NAME, 'a')

for link in links:
    try:
        # 1. Try to find a headline INSIDE this link
        # (This will fail for menu buttons, but work for article cards)
        hd = link.find_element(By.CSS_SELECTOR, "[data-testid='card-headline']").text
        summ= link.find_element(By.CSS_SELECTOR,"[data-testid='card-description']").text

        # 2. Get the URL
        url = link.get_attribute('href')

        print(f"Headline: {hd}")
        print(f"Summary: {summ}")
        print(f"URL:    {url}")
        print("-" * 20)
        
    except:
        # If the link doesn't have a headline inside it, just skip it!
        continue

Headline: Wave of arrests over killing of French nationalist piles pressure on far left
Summary: Eleven people are in detention in connection with the killing of far-right student activist Quentin Deranque.
URL:    https://www.bbc.com/news/articles/c62dzgy0q37o
--------------------
Headline: Third Briton dies in French Alps avalanches in one week
Summary: A total of 28 people have died in avalanches since the start of the winter season in the French Alps.
URL:    https://www.bbc.com/news/articles/cy57nxrr6dqo
--------------------
Headline: Benfica claim 'defamation campaign' against Prestianni
Summary: Benfica claim there is a "defamation campaign" against Gianluca Prestianni as Uefa launch an investigation into claims he racially abused Real Madrid's Vinicius Jr.
URL:    https://www.bbc.com/sport/football/articles/cx24vnm4dp5o
--------------------
Headline: Spain luxury hotel scammer booked rooms for one cent, police say
Summary: Police arrest a 20-year-old suspected of defrauding a l

In [80]:
# 5. Interaction (Clicking Buttons)
# Important for "Pagination" in the assignment (Clicking 'Next' or region buttons).

button = browser.find_element(By.LINK_TEXT , 'Europe')
button.click()

print("To click a button:")
print("1. Inspect the button to find its ID or Class")
print("2. Use browser.find_element(...) to select it")
print("3. Call .click() on that element")

To click a button:
1. Inspect the button to find its ID or Class
2. Use browser.find_element(...) to select it
3. Call .click() on that element


In [None]:
## Lets try next page button 
## let try to divide it 
# a function that findt page button and clicks recrusivly 
# Maybe a function that scrolls down-only if nec...
		## There is a scroll to element function --Could be solution 
		## Action chain 


In [None]:
## Iterating through pages 
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_via_next_button(driver, num_pages):
    wait = WebDriverWait(driver, 10)

    for i in range(1, num_pages + 1):
        print(f"--- Processing Page {i} ---")
        # 1. SCRAPE THE CURRENT PAGE
        # (Your scraping logic goes here)
        links = browser.find_elements(By.TAG_NAME, 'a')
        for link in links:
            try:
                # 1. Try to find a headline INSIDE this link
                # (This will fail for menu buttons(headers,footers...), but work for article cards (trageted news "cards")
                hd = link.find_element(By.CSS_SELECTOR, "[data-testid='card-headline']").text
                summ= link.find_element(By.CSS_SELECTOR,"[data-testid='card-description']").text
        
                url = link.get_attribute('href')
                print(f"Headline: {hd}")
                print(f"Summary: {summ}")
                print(f"URL:    {url}")
                print("-" * 20)
                
            except:
                # If the link doesn't have a headline inside it, just skip it!
                continue    
        # 2. CLICK NEXT (Don't click on the very last page!)
        if i < num_pages:
            try:
                
                # Find the 'Next' button (replace with actual ID/Class/Text)
                next_btn = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR,f"button[aria-label='Go to page {i+1}']")))
                
                # Scroll to it if necessary
                driver.execute_script("arguments[0].scrollIntoView();", next_btn)
                next_btn.click()
                
                # Wait for old content to disappear (optional but recommended)
                time.sleep(2)
                
            except Exception as e:
                print(f"Could not find next button: {e}")
                break

In [82]:
scrape_via_next_button(browser, 7)

--- Processing Page 1 ---
Headline: Wave of arrests over killing of French nationalist piles pressure on far left
Summary: Eleven people are in detention in connection with the killing of far-right student activist Quentin Deranque.
URL:    https://www.bbc.com/news/articles/c62dzgy0q37o
--------------------
Headline: Third Briton dies in French Alps avalanches in one week
Summary: A total of 28 people have died in avalanches since the start of the winter season in the French Alps.
URL:    https://www.bbc.com/news/articles/cy57nxrr6dqo
--------------------
Headline: Benfica claim 'defamation campaign' against Prestianni
Summary: Benfica claim there is a "defamation campaign" against Gianluca Prestianni as Uefa launch an investigation into claims he racially abused Real Madrid's Vinicius Jr.
URL:    https://www.bbc.com/sport/football/articles/cx24vnm4dp5o
--------------------
Headline: Spain luxury hotel scammer booked rooms for one cent, police say
Summary: Police arrest a 20-year-old s

In [None]:
# 6. Handling Delays
# When you click, the page needs time to load before you can scrape again.
# If you scrape too fast, you might get old data or errors.

print("Simulating a wait for page load...")
time.sleep(3) # Pause for 3 seconds
print("Starting to scrape...")

Simulating a wait for page load...
Starting to scrape...


In [83]:
# 7. Cleanup
# Close the browser when done to free up resources.
browser.quit()

In [None]:
### We have different anchor tags, <a> button and more 