# Documentation

## Overview
This Python script uses Selenium to scrape the research insights listing on the PwC India website (`https://www.pwc.in/research-insights.html`). For each article it collects the date, title, description, and link, then saves the results to a CSV file.

## Key Features
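- Dismisses the cookie banner by clicking its accept button, if one is present.
- Extracts each article's date, title, description, and link.
- Handles pagination by clicking the "Load More" button until no more results are available or an article older than the date window appears.
- Keeps only articles published within the last 30 days.
- Skips duplicates by tracking the links already seen.
- Saves the extracted data to a CSV file.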

## Dependencies
- selenium
- webdriver-manager

Install them using:
```
pip install selenium webdriver-manager
```
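
Before launching the browser, it can help to confirm that both packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of the script. Note that the `webdriver-manager` distribution is imported as `webdriver_manager`.

```python
from importlib.util import find_spec

# Import names, not pip names: "webdriver-manager" installs the module "webdriver_manager".
REQUIRED_MODULES = ("selenium", "webdriver_manager")


def missing_packages(modules=REQUIRED_MODULES):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```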

## Sections Breakdown
### 1. Setup WebDriver
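Initializes a Chrome WebDriver. WebDriver Manager downloads a ChromeDriver binary compatible with the installed Chrome, so the driver does not have to be set up by hand.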

### 2. accept_cookies()
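Looks for a button whose text contains "Accept" or "Agree" and clicks it. If no such button is found, the function does nothing and the script continues.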

### 3. extract_details()
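Finds every `article.collection__item` element on the page and returns a list of dictionaries with the keys `Date`, `Title`, `Description`, and `Link`. The date is parsed from the `datetime` attribute of the article's `time` element using the format `%d/%m/%y`; if it is missing or cannot be parsed, `Date` is `None`. Any other missing field becomes an empty string.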
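
The date handling described above can be isolated into a small helper, which makes it easier to test without a browser. This is a sketch, not the script's own code: the name `parse_article_date` is hypothetical, and the ISO 8601 fallback is an assumption about what the site might emit, not something the script relies on.

```python
from datetime import datetime
from typing import Optional


def parse_article_date(raw: Optional[str]) -> Optional[datetime]:
    """Parse a date string, returning None when it is missing or unrecognized."""
    if not raw:
        return None
    raw = raw.strip()
    try:
        # Format used by the script: day/month/two-digit year.
        return datetime.strptime(raw, "%d/%m/%y")
    except ValueError:
        pass
    try:
        # Assumed fallback: ISO 8601. Drop any timezone so the result stays
        # comparable with the naive datetimes used for the 30-day window.
        return datetime.fromisoformat(raw).replace(tzinfo=None)
    except ValueError:
        return None
```

Returning `None` instead of raising keeps the behaviour of the script, where an unparseable date marks the article as undated.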

### 4. click_load_more()
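Clicks the `button.collection__load-more` button unless its class list marks it as disabled, then waits two seconds for new articles to load. Returns `True` if the button was clicked and `False` otherwise.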

### 5. Main Script Execution
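- Loads the PwC research insights page and dismisses the cookie banner.
- Repeatedly extracts articles, skipping links already seen and keeping those dated within the last 30 days.
- Stops when it reaches an article older than 30 days or when "Load More" is no longer available.
- Saves data into `pwc_insights_details.csv`.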
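
The filtering logic in the steps above can be sketched without a browser. In this illustration the function name `collect_recent` and the sample data are hypothetical; each "snapshot" stands for one call to `extract_details()`, which re-reads the whole page, so earlier articles repeat. The early stop assumes the listing is sorted newest first.

```python
from datetime import datetime, timedelta


def collect_recent(snapshots, start_date, end_date):
    """Keep unique articles dated within [start_date, end_date].

    Stops at the first unseen article older than start_date.
    """
    kept, seen_links = [], set()
    for snapshot in snapshots:
        for article in snapshot:
            link = article["Link"]
            if link in seen_links:
                continue  # already handled in an earlier snapshot
            seen_links.add(link)
            date = article["Date"]
            if date is None:
                continue  # undated articles are skipped
            if date < start_date:
                return kept  # older than the window: stop loading
            if date <= end_date:
                kept.append(article)
    return kept


if __name__ == "__main__":
    end = datetime(2024, 3, 31)
    start = end - timedelta(days=30)
    first = [
        {"Link": "a", "Date": datetime(2024, 3, 30)},
        {"Link": "b", "Date": datetime(2024, 3, 10)},
    ]
    second = first + [
        {"Link": "c", "Date": datetime(2024, 1, 5)},
        {"Link": "d", "Date": datetime(2024, 3, 20)},
    ]
    result = collect_recent([first, second], start, end)
    print([a["Link"] for a in result])  # prints ['a', 'b']
```

Article `d` is never reached because `c` ends the scan, which shows why an unsorted listing would lose in-window articles.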

### 6. CSV Export
Writes the collected details to a UTF-8 encoded CSV file with the header row `Date, Title, Description, Link`. Dates are written in `DD/MM/YY` format.
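
To see what the export produces without touching the file system, the same `csv.DictWriter` calls can write to an in-memory buffer. This is a sketch; the function name `details_to_csv` and the sample row are hypothetical.

```python
import csv
import io
from datetime import datetime

FIELDNAMES = ["Date", "Title", "Description", "Link"]


def details_to_csv(details):
    """Serialize article dicts to CSV text, formatting dates as DD/MM/YY."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    for detail in details:
        writer.writerow({
            "Date": detail["Date"].strftime("%d/%m/%y"),
            "Title": detail["Title"],
            "Description": detail["Description"],
            "Link": detail["Link"],
        })
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{
        "Date": datetime(2024, 3, 5),
        "Title": "Hello, world",
        "Description": "Desc",
        "Link": "https://example.com/a",
    }]
    print(details_to_csv(sample))
```

Fields that contain commas, such as the sample title, are quoted automatically by the `csv` module.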

## Limitations
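- Depends on the website's current HTML structure; the CSS selectors break if class names such as `collection__item` change.
- Requires a Chrome installation for which WebDriver Manager can fetch a compatible ChromeDriver.
- Assumes the cookie button text is in English ("Accept" or "Agree").
- Assumes the listing is sorted newest first, because loading stops at the first article older than 30 days.
- Silently drops articles whose date is missing or cannot be parsed.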
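
The English-only cookie match could be loosened by generating the XPath from a list of button labels. This is a minimal sketch; the helper `build_accept_xpath` is hypothetical, and any non-English labels passed to it are guesses that would need checking against the real banner.

```python
def build_accept_xpath(labels):
    """Build an XPath matching any <button> whose text contains one of the labels."""
    if not labels:
        raise ValueError("at least one label is required")
    for label in labels:
        if "'" in label:
            # Single quotes would break the quoted XPath string literal.
            raise ValueError(f"label must not contain a single quote: {label!r}")
    conditions = " or ".join(f"contains(text(), '{label}')" for label in labels)
    return f"//button[{conditions}]"


if __name__ == "__main__":
    # The two labels the script already uses.
    print(build_accept_xpath(["Accept", "Agree"]))
```

With the labels `["Accept", "Agree"]` this reproduces the expression used in `accept_cookies()`.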

## Output
- CSV file: `pwc_insights_details.csv` containing Date, Title, Description, Link.

---

This document provides a detailed explanation of each component of the script, making it easier to understand and modify if needed.

# Code

```python
import time
import csv
from datetime import datetime, timedelta
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

# WebDriver Manager fetches a ChromeDriver binary for the installed Chrome.
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

def accept_cookies(timeout=5):
    """Click the cookie banner's accept button if it appears within `timeout` seconds."""
    xpath = "//button[contains(text(), 'Accept') or contains(text(), 'Agree')]"
    try:
        accept_button = WebDriverWait(driver, timeout).until(
            EC.element_to_be_clickable((By.XPATH, xpath))
        )
        accept_button.click()
    except WebDriverException:
        # No banner appeared (timeout) or the click failed; continue without it.
        pass

def extract_details():
    """Return a list of dicts (Date, Title, Description, Link) for every article on the page."""
    articles = driver.find_elements(By.CSS_SELECTOR, "article.collection__item")
    details = []

    for article in articles:
        try:
            date_text = article.find_element(By.CSS_SELECTOR, "time").get_attribute("datetime")
            date_object = datetime.strptime(date_text, "%d/%m/%y")
        except (WebDriverException, TypeError, ValueError):
            # Missing <time> element, missing attribute, or unexpected date format.
            date_object = None

        try:
            title = article.find_element(By.CSS_SELECTOR, "h4.collection__item-heading span").text
        except WebDriverException:
            title = ""

        try:
            description = article.find_element(By.CSS_SELECTOR, "p.paragraph").text
        except WebDriverException:
            description = ""

        try:
            link = article.find_element(By.CSS_SELECTOR, "a.collection__item-link").get_attribute("href")
        except WebDriverException:
            link = ""

        details.append({
            "Date": date_object,
            "Title": title,
            "Description": description,
            "Link": link
        })

    return details

def click_load_more():
    """Click "Load More" if it is present and enabled; return True if it was clicked."""
    try:
        load_more_button = driver.find_element(By.CSS_SELECTOR, "button.collection__load-more")
        # get_attribute may return None when the element has no class attribute.
        if "disabled" not in (load_more_button.get_attribute("class") or ""):
            load_more_button.click()
            time.sleep(2)  # give the new articles time to render
            return True
    except WebDriverException:
        pass
    return False

end_date = datetime.now()
start_date = end_date - timedelta(days=30)

details = []
seen_links = set()
stop_loading = False

try:
    driver.get("https://www.pwc.in/research-insights.html")
    accept_cookies()
    time.sleep(2)

    while not stop_loading:
        # Each pass re-reads the whole page, so seen_links filters out repeats.
        new_details = extract_details()

        for detail in new_details:
            if detail["Link"] not in seen_links:
                seen_links.add(detail["Link"])
                if detail["Date"] is None:
                    continue  # undated articles are skipped

                # Assumes newest-first ordering: the first older article ends the scan.
                if detail["Date"] < start_date:
                    stop_loading = True
                    break

                if detail["Date"] <= end_date:
                    details.append(detail)

        if not stop_loading:
            more_results = click_load_more()
            if not more_results:
                break
finally:
    # Close the browser even if scraping fails part-way through.
    driver.quit()

with open("pwc_insights_details.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=["Date", "Title", "Description", "Link"])
    writer.writeheader()
    for detail in details:
        writer.writerow({
            "Date": detail["Date"].strftime("%d/%m/%y"),
            "Title": detail["Title"],
            "Description": detail["Description"],
            "Link": detail["Link"]
        })

print(f"Extracted details from {len(details)} articles and saved to pwc_insights_details.csv")
```

Output from one run:

```
Extracted details from 10 articles and saved to pwc_insights_details.csv
```
