## Setup

### Subtask:
Install the selenium and webdriver-manager libraries.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Reasoning**:
The subtask requires installing the `selenium` and `webdriver-manager` libraries. I will use pip to install both libraries in one code block.



In [None]:
%pip install selenium webdriver-manager

## Setup chromedriver

### Subtask:
Set up chromedriver using webdriver-manager.


**Reasoning**:
The subtask requires setting up chromedriver using webdriver-manager, which involves importing `ChromeDriverManager`, instantiating it, and calling the `install()` method.



In [None]:
from webdriver_manager.chrome import ChromeDriverManager

driver_path = ChromeDriverManager().install()

## Scrape website

### Subtask:
Write a selenium script to navigate to the specified URL using chromedriver.


**Reasoning**:
I will write a selenium script that imports `webdriver`, sets up `ChromeOptions` for headless mode, instantiates the driver with the chromedriver path from the previous step, navigates to the URL, prints the page title, and closes the driver.



In [None]:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')

driver = webdriver.Chrome(executable_path=driver_path, options=options)

driver.get("https://yourblog.com/category/technology/")

print(driver.title)

driver.quit()

**Reasoning**:
The traceback indicates that the `webdriver.Chrome` constructor no longer accepts the `executable_path` argument. The correct way to specify the driver path is by using the `service` argument with a `Service` object.



In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument('--headless')

service = Service(executable_path=driver_path)
driver = webdriver.Chrome(service=service, options=options)

driver.get("https://yourblog.com/category/technology/")

print(driver.title)

driver.quit()

**Reasoning**:
The error message "unknown error: cannot find Chrome binary" indicates that the system where the code is running does not have a Chrome browser installed or accessible at a standard location. Since we are running in a headless environment (like a cloud notebook), we need to install Chrome.



In [None]:
# Install headless Chrome
!apt-get update
!apt-get install -y google-chrome-stable

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

service = Service(executable_path=driver_path)
driver = webdriver.Chrome(service=service, options=options)

driver.get("https://yourblog.com/category/technology/")

print(driver.title)

driver.quit()

**Reasoning**:
The previous attempt to install `google-chrome-stable` failed with "E: Unable to locate package google-chrome-stable". This indicates that the package name or repository might be incorrect or not available in this environment. I will try installing `chromium-browser` instead, which is often available in similar environments and can also be used for headless browsing.



In [None]:
# Install chromium-browser as an alternative
!apt-get update
!apt-get install -y chromium-browser

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Specify the binary location for chromium-browser
options.binary_location = '/usr/bin/chromium-browser'

service = Service(executable_path=driver_path)
driver = webdriver.Chrome(service=service, options=options)

driver.get("https://yourblog.com/category/technology/")

print(driver.title)

driver.quit()

**Reasoning**:
The error message "Chrome failed to start: exited abnormally. (unknown error: DevToolsActivePort file doesn't exist)" means ChromeDriver launched the browser binary, but the browser exited before ChromeDriver could connect to it. The message says "Chrome" because it comes from ChromeDriver, so the wording alone does not show that the wrong binary was used. I will keep `binary_location` set to the path of the installed `chromium-browser` and add the `--remote-debugging-port=9222` argument, which some online resources suggest for headless Chrome issues; the specific port number is unlikely to matter.



In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--remote-debugging-port=9222')


# Explicitly set the binary location for chromium-browser
options.binary_location = '/usr/bin/chromium-browser'


service = Service(executable_path=driver_path)
driver = webdriver.Chrome(service=service, options=options)

driver.get("https://yourblog.com/category/technology/")

print(driver.title)

driver.quit()

## Scrape website

### Subtask:
Write a selenium script to navigate to the specified URL using chromedriver.


**Reasoning**:
Based on the previous attempts, the likely cause is that the environment has no compatible Chrome or Chromium binary that chromedriver can launch. I will retry with `binary_location` pointing at the `chromium-browser` path and keep the headless options, since the notebook runs in a headless environment.



In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--remote-debugging-port=9222')

# Set the binary location for chromium-browser
options.binary_location = '/usr/bin/chromium-browser'

service = Service(executable_path=driver_path)
driver = webdriver.Chrome(service=service, options=options)

driver.get("https://yourblog.com/category/technology/")

print(driver.title)

driver.quit()

## Summary:

### Data Analysis Key Findings

* The initial attempt to pass `executable_path` directly to `webdriver.Chrome` failed because the constructor no longer accepts this argument.
* Using the `Service` object with `executable_path` resolved that error but led to a `WebDriverException` indicating that the system could not find a Chrome binary.
* Attempts to install `google-chrome-stable` via `apt-get` failed due to the package not being found.
* Installing `chromium-browser` and setting `binary_location` to its path resolved the "cannot find Chrome binary" error but resulted in a new `WebDriverException`: Chrome failed to start and exited abnormally ("DevToolsActivePort file doesn't exist").
* Adding more Chrome options like `--remote-debugging-port=9222` did not resolve the issue of Chrome failing to start and connect.
* The persistent failure to launch and connect to a browser instance was the primary reason the task could not be completed successfully.

### Insights or Next Steps

* The environment lacks a readily available and compatible browser binary that Selenium can reliably launch and connect to.
* Further steps would require resolving the environmental issue by ensuring a compatible Chrome or Chromium browser is installed and accessible in a way that Chromedriver can successfully interact with it.
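
Before launching Selenium, it can help to confirm that some browser binary is actually on the `PATH`. The following is a minimal sketch using only the standard library; the helper name `find_browser_binary` and the candidate names are assumptions chosen for illustration, not something the notebook defines.

```python
import shutil


def find_browser_binary(candidates):
    """Return the path of the first candidate that resolves to an executable, else None."""
    for name in candidates:
        path = shutil.which(name)  # accepts a bare command name or a full path
        if path:
            return path
    return None


# Candidate names are assumptions; adjust them to the environment.
CANDIDATES = ["google-chrome", "google-chrome-stable", "chromium-browser", "chromium"]

binary = find_browser_binary(CANDIDATES)
print(binary or "No browser binary found")
```

Finding a path only shows that an executable exists there. It does not guarantee that the browser can start in this environment, so a failed launch still needs to be checked separately.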


In [None]:
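# Print the version reported by the legacy ChromeDriver endpoint.
# Note: a shell variable assigned on a "!" line is lost when that line finishes, so the value is printed instead.
# This legacy endpoint only covers ChromeDriver up to version 114; newer drivers come from Chrome for Testing.
!curl -sSL https://chromedriver.storage.googleapis.com/LATEST_RELEASE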

In [None]:
!wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | apt-key add -
!sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google-chrome.list'
!apt-get update
!apt-get install -y google-chrome-stable


In [None]:
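# Download the ChromeDriver build for Chrome 138; its major version must match the installed Chrome
# (check with: google-chrome --version). -O keeps the file name stable if the cell is re-run.
!wget -q -O chromedriver-linux64.zip https://storage.googleapis.com/chrome-for-testing-public/138.0.7204.49/linux64/chromedriver-linux64.zip

# Unzip it to the current directory (-o overwrites files from a previous run without prompting)
!unzip -o -q chromedriver-linux64.zip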

# Move chromedriver binary to /usr/local/bin
!mv chromedriver-linux64/chromedriver /usr/local/bin/

# Make it executable
!chmod +x /usr/local/bin/chromedriver
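
Because the driver version above is hard-coded, a quick comparison of major versions can catch a mismatch before Selenium does. This is a sketch under assumptions: the helper names `major_version` and `versions_match` are hypothetical, and the sample strings only imitate the general shape of `google-chrome --version` and `chromedriver --version` output (the 137 value and the `abc123` suffix are made up for illustration).

```python
import re


def major_version(version_output):
    """Extract the major version from text such as 'Google Chrome 138.0.7204.49'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"No version number found in: {version_output!r}")
    return int(match.group(1))


def versions_match(browser_output, driver_output):
    """Return True when browser and driver share the same major version."""
    return major_version(browser_output) == major_version(driver_output)


# Illustrative strings, not captured output.
print(versions_match("Google Chrome 138.0.7204.49", "ChromeDriver 138.0.7204.49 (abc123)"))  # True
print(versions_match("Google Chrome 137.0.7151.55", "ChromeDriver 138.0.7204.49 (abc123)"))  # False
```

In the notebook, the two input strings could come from running the `--version` commands with `subprocess.run(..., capture_output=True, text=True)`.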


In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'), options=options)

base_url = "https://yourblog.com/category/technology/"
all_links = []

try:
    for page in range(1, 51):  # Pages 1 to 50
        url = base_url if page == 1 else f"{base_url}page/{page}/"
        print(f"Fetching page {page}: {url}")
        driver.get(url)
        time.sleep(2)  # Give time for page to load

        try:
            section = driver.find_element(By.XPATH, "/html/body/div[1]/div[6]/div/div/section/div/div")
            articles = section.find_elements(By.TAG_NAME, "article")
            for article in articles:
                try:
                    a_tag = article.find_element(By.XPATH, ".//div[1]/a")
                    href = a_tag.get_attribute("href")
                    if href:  # get_attribute returns None when the anchor has no href
                        print(f"Found link: {href}")
                        all_links.append(href)
                except NoSuchElementException:
                    continue
        except NoSuchElementException:
            print(f"Main content section not found on page {page}, skipping.")

except Exception as e:
    print("Error occurred:", e)

finally:
    driver.quit()

# Save to file
with open("technology_article_links.txt", "w", encoding="utf-8") as f:
    for link in all_links:
        f.write(link + "\n")

print(f"Total links collected: {len(all_links)}")
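
The pagination pattern and the link list can be handled by two small pure functions, which makes them easy to check without a browser. This is a sketch, not part of the original cell: `build_page_urls` and `unique_in_order` are hypothetical helper names, and deduplication is an added assumption in case the same article appears on more than one listing page.

```python
def build_page_urls(base_url, last_page):
    """Return listing URLs for pages 1..last_page, using the same pattern as the loop above."""
    return [
        base_url if page == 1 else f"{base_url}page/{page}/"
        for page in range(1, last_page + 1)
    ]


def unique_in_order(links):
    """Drop empty entries and repeated links while keeping first-seen order."""
    return list(dict.fromkeys(link for link in links if link))


print(build_page_urls("https://yourblog.com/category/technology/", 3))
print(unique_in_order(["a", "b", "a", None, "c", "b"]))  # ['a', 'b', 'c']
```

Applying `unique_in_order(all_links)` before writing the file would keep the second scraping pass from fetching the same article twice.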


In [None]:
import csv
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

# Set up Chrome options
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Set up WebDriver
driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'), options=options)

# Input and output files
input_file = "technology_article_links.txt"
output_file = "blogdata.csv"

# XPaths
xpaths = {
    "title": "/html/body/div[1]/div[6]/div/div/div[1]/div[1]/h1",
    "featured_image": "/html/body/div[1]/div[6]/div/div/div[2]/div/a",
    "content": "/html/body/div[1]/div[6]/div/div/div[3]/article/div[1]"
}

# Read URLs
with open(input_file, "r", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

# Open CSV file
with open(output_file, "w", newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Title", "Featured Image", "Content"])

    for url in urls:
        print(f"Processing: {url}")
        try:
            driver.get(url)
            time.sleep(2)  # wait for content to load

            # Extract elements
            try:
                title = driver.find_element("xpath", xpaths["title"]).text
            except NoSuchElementException:
                title = "N/A"

            try:
                image_href = driver.find_element("xpath", xpaths["featured_image"]).get_attribute("href")
            except NoSuchElementException:
                image_href = "N/A"

            try:
                content = driver.find_element("xpath", xpaths["content"]).text
            except NoSuchElementException:
                content = "N/A"

            # Write to CSV
            writer.writerow([title, image_href, content])

        except Exception as e:
            print(f"Failed on {url}: {e}")
            writer.writerow(["Error", "Error", "Error"])

driver.quit()
print(f"Done. Results saved to {output_file}")
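
The three `try`/`except` blocks in the cell above follow the same pattern: call something, and fall back to `"N/A"` if it raises. A small helper can express that once. This is a sketch under assumptions: `value_or_default` is a hypothetical name, and the example uses a plain dictionary with `KeyError` so it runs without a browser.

```python
def value_or_default(getter, default="N/A", exceptions=(Exception,)):
    """Call getter() and return its result, or default if one of the listed exceptions is raised."""
    try:
        return getter()
    except exceptions:
        return default


# Stand-in data; in the notebook the getter would wrap a find_element call instead.
sample = {"title": "Hello"}
print(value_or_default(lambda: sample["title"], exceptions=(KeyError,)))    # Hello
print(value_or_default(lambda: sample["content"], exceptions=(KeyError,)))  # N/A
```

In the scraping loop, a call might look like `value_or_default(lambda: driver.find_element("xpath", xpaths["title"]).text, exceptions=(NoSuchElementException,))`, which keeps the fallback behavior while removing the repetition.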
