# Documentation

## Overview
This document describes a two-part Python script that uses Selenium to scrape and extract information from the KPMG insights page.

### Part 1
- Collects hyperlinks from the KPMG insights page and saves them to `insight-links-kpmg.csv`.

### Part 2
- Reads the collected links from `insight-links-kpmg.csv`.
- Visits each link to extract additional details like title, description, date, content, and PDF links.
- Filters results based on the publication date (within the last 30 days).
- Saves the extracted details to `insights-details-kpmg.csv`.
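
The hand-off between the two parts is a one-column CSV file with a `Link` header. The round trip can be sketched as follows, using an in-memory buffer instead of a file on disk; the example URLs are placeholders, not links from the real site.

```python
import csv
import io

# Placeholder links standing in for what Part 1 collects
links = [
    "https://example.com/insights/a.html",
    "https://example.com/insights/b.html",
]

# Write: one header row, then one link per row (as Part 1 does)
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Link"])
writer.writerows([link] for link in links)

# Read: skip the header, then take the first column of each row (as Part 2 does)
buffer.seek(0)
reader = csv.reader(buffer)
next(reader)
loaded = [row[0] for row in reader if row]

print(loaded == links)
```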

Part 1 performs the following tasks:
1. Launches a Chrome browser using WebDriver.
2. Accepts cookies if prompted.
3. Scrolls through the page to load content.
4. Extracts links from the page.
5. Handles pagination to gather at least 100 links.
6. Saves the collected links to a CSV file.
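
The "gather at least 100 links" step is a loop that keeps requesting pages until the target is met or no further page is available. A minimal sketch of that control flow, with the browser replaced by a hypothetical `fetch_next_page` callable and made-up page sizes, might look like this:

```python
from typing import Callable, List, Optional


def collect_links(fetch_next_page: Callable[[], Optional[List[str]]],
                  minimum: int = 100) -> List[str]:
    """Request pages until `minimum` links are collected or pages run out."""
    links: List[str] = []
    while len(links) < minimum:
        page = fetch_next_page()
        if not page:  # None or an empty list means there is no further page
            break
        links.extend(page)
    return links


# Hypothetical stand-in for the site: 15 pages with 9 links each
pages = iter([[f"https://example.com/p{p}/a{i}" for i in range(9)]
              for p in range(15)])
collected = collect_links(lambda: next(pages, None))
print(len(collected))
```

Because whole pages are added at a time, the loop can overshoot the target; it stops as soon as the count reaches the minimum.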

## Dependencies
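- Python 3
- Google Chrome
- selenium
- webdriver_manager (installed from PyPI as `webdriver-manager`)
- Standard library modules: `csv`, `time`, `datetime`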

## Code Explanation

### Function Descriptions

1. **accept_cookies()**
   - Handles the acceptance of cookies if a cookie consent button is present on the webpage.
   - Uses XPath to locate the button and attempts to click it.
   - If the button is not found, the function exits silently.

2. **extract_links()**
   - Extracts hyperlinks from specific tiles on the page.
   - Uses CSS selectors to locate the anchor tags within tile elements.
   - Appends each valid link to the `links` list.

3. **extract_details(url)**
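   - Navigates to a given URL and extracts the title, description, date, main content, and the first PDF link if available.
   - Falls back to an empty value for any element that is missing.
   - Returns a dictionary with the extracted information, or `None` if the breadcrumb trail does not identify the page as an Insights article.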

4. **Main Script Logic**
   - Part 1 initializes the WebDriver, opens the insights page, and scrolls to load content.
   - Part 1 then follows the pagination until at least 100 links are collected; any exception during pagination is printed and ends the loop rather than crashing the script.
   - Part 1 closes the browser and saves the collected links to `insight-links-kpmg.csv`.
   - Part 2 reads the links back from `insight-links-kpmg.csv`.
   - Part 2 visits each link and keeps the details of articles published in the last 30 days, stopping at the first article older than that.
   - Part 2 saves the extracted data to `insights-details-kpmg.csv` and closes the WebDriver.
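
The "fall back when an element is missing" pattern in `extract_details` repeats once per field. One possible way to factor it out is a small helper that takes a lookup as a callable; `text_or_default` and `missing_element` are hypothetical names used only for this sketch and do not appear in the script.

```python
from typing import Callable


def text_or_default(getter: Callable[[], str], default: str = "") -> str:
    """Return getter(), or `default` if the lookup raises an exception."""
    try:
        return getter()
    except Exception:
        return default


def missing_element() -> str:
    # Stand-in for a Selenium lookup that cannot find its element
    raise LookupError("element not found")


title = text_or_default(lambda: "  Sample title ".strip())
description = text_or_default(missing_element)
print(repr(title), repr(description))
```

In the script, the callable would wrap a lookup such as `driver.find_element(By.CSS_SELECTOR, "h1.cmp-hero-csi__title").text`.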


### Brief Explanation
This script is designed to automate the process of extracting and enriching information from the KPMG insights page using Selenium.

- **Part 1:** Collects links from the main insights page and saves them to a CSV file.
- **Part 2:** Reads these links, navigates to each, extracts detailed information (title, description, date, content, and PDF links), filters based on the publication date (last 30 days), and saves the results to a new CSV file.
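
The date filter can be illustrated without a browser. The sketch below reuses the `"%d %b, %Y"` format string that the script passes to `strptime`; the helper names `parse_date` and `is_recent` and the sample dates are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional

DATE_FORMAT = "%d %b, %Y"  # same format string the script uses


def parse_date(text: str) -> Optional[datetime]:
    """Parse a date such as '15 Jun, 2024'; return None if it does not match."""
    try:
        return datetime.strptime(text.strip(), DATE_FORMAT)
    except ValueError:
        return None


def is_recent(published: Optional[datetime], now: datetime, days: int = 30) -> bool:
    """True if `published` lies within the last `days` days before `now`."""
    if published is None:
        return False
    return now - timedelta(days=days) <= published <= now


now = datetime(2024, 6, 30)  # fixed reference date so the example is repeatable
for sample in ["15 Jun, 2024", "01 May, 2024", "not a date"]:
    print(sample, is_recent(parse_date(sample), now))
```

Note that `%b` matches abbreviated month names in the current locale, so this assumes an English locale.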

# Code

## Part 1

```python
import time
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

# Launch Chrome with a driver binary managed by webdriver_manager
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

driver.get("https://kpmg.com/in/en/insights.html")

def accept_cookies():
    """Click the cookie consent button if one is present."""
    try:
        accept_button = driver.find_element(By.XPATH, "//button[contains(text(), 'Accept') or contains(text(), 'Agree')]")
        accept_button.click()
    except Exception:
        # No consent button found (or it could not be clicked): carry on
        pass

accept_cookies()

# Scroll to the bottom so that lazily loaded tiles are rendered
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)

links = []

def extract_links():
    """Append the href of every tile link on the current page to `links`."""
    tiles = driver.find_elements(By.CSS_SELECTOR, "div.cmp-filterlist__tile a.cmp-filterlist__tile--action-link")
    for tile in tiles:
        link = tile.get_attribute("href")
        if link:
            links.append(link)

extract_links()

# Follow the "next" button until at least 100 links are collected
while len(links) < 100:
    try:
        pagination = driver.find_element(By.CSS_SELECTOR, "div.cmp-filterlist__pagination[role='navigation']")
        driver.execute_script("arguments[0].scrollIntoView();", pagination)
        time.sleep(1)

        next_button = driver.find_element(By.CSS_SELECTOR, "button.cmp-filterlist__pagination--next[aria-label='Next set of results']")
        next_button.click()
        time.sleep(2)

        extract_links()
    except Exception as e:
        # Missing or unclickable "next" button: report it and stop paginating
        print(f"Error: {e}")
        break

driver.quit()

# Save one link per row under a "Link" header
with open("insight-links-kpmg.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Link"])
    for link in links:
        writer.writerow([link])

print(f"Collected {len(links)} links and saved to insight-links-kpmg.csv")
```

Collected 104 links and saved to insight-links-kpmg.csv
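
The Part 1 loop appends every link it finds, so a link that appears on more than one page would be stored more than once. Whether that actually happens on this site is not established here; if it does, one possible clean-up before writing the CSV is an order-preserving de-duplication. `unique_links` is a hypothetical helper and the sample URLs are placeholders.

```python
from typing import Iterable, List


def unique_links(links: Iterable[str]) -> List[str]:
    """Drop repeated links while keeping the first occurrence of each in order."""
    # dict preserves insertion order, so its keys are the unique links in order
    return list(dict.fromkeys(links))


sample = [
    "https://example.com/a.html",
    "https://example.com/b.html",
    "https://example.com/a.html",
]
print(unique_links(sample))
```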


## Part 2

```python
import time
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from datetime import datetime, timedelta

# Launch Chrome with a driver binary managed by webdriver_manager
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

def accept_cookies():
    """Click the cookie consent button if one is present."""
    try:
        accept_button = driver.find_element(By.XPATH, "//button[contains(text(), 'Accept') or contains(text(), 'Agree')]")
        accept_button.click()
    except Exception:
        # No consent button found (or it could not be clicked): carry on
        pass

def extract_details(url):
    """Return a dict of article details, or None if the page is not an Insights article."""
    driver.get(url)
    accept_cookies()
    time.sleep(2)

    # Skip pages whose breadcrumb trail does not pass through "Insights"
    breadcrumbs = driver.find_elements(By.CSS_SELECTOR, "ol.cmp-breadcrumb__list li.cmp-breadcrumb__item")
    if len(breadcrumbs) < 3 or "Insights" not in breadcrumbs[1].text:
        return None

    # Each field falls back to an empty value when its element is missing
    try:
        title = driver.find_element(By.CSS_SELECTOR, "h1.cmp-hero-csi__title").text
    except Exception:
        title = ""

    try:
        description = driver.find_element(By.CSS_SELECTOR, "div.cmp-hero-csi__description").text
    except Exception:
        description = ""

    try:
        date = driver.find_element(By.CSS_SELECTOR, "div.cmp-hero-csi__article-date span#heroCsiMonth").text
        date_object = datetime.strptime(date, "%d %b, %Y")
    except Exception:
        # Date element missing or not in the expected format
        date_object = None

    try:
        content_sections = driver.find_elements(By.CSS_SELECTOR, "div.section.container.responsivegrid div.cmp-text p, div.section.container.responsivegrid div.cmp-text h3")
        content = "\n".join([section.text for section in content_sections])
    except Exception:
        content = ""

    try:
        # Keep only the first PDF link on the page, if any
        pdf_links = driver.find_elements(By.XPATH, "//a[contains(@href, '.pdf')]")
        pdf_link = pdf_links[0].get_attribute("href") if pdf_links else ""
    except Exception:
        pdf_link = ""

    return {
        "url_link": url,
        "Title": title,
        "Description": description,
        "Date": date_object,
        "Content": content,
        "Pdf_link": pdf_link
    }

# Load the links saved by Part 1, skipping the header row
links = []
with open("insight-links-kpmg.csv", "r", newline="") as csvfile:
    reader = csv.reader(csvfile)
    next(reader)
    for row in reader:
        links.append(row[0])

# Keep only articles published in the last 30 days
end_date = datetime.now()
start_date = end_date - timedelta(days=30)

details = []
for link in links:
    detail = extract_details(link)
    if detail and isinstance(detail["Date"], datetime):
        # Stops at the first older article; this assumes links are listed newest first
        if detail["Date"] < start_date:
            break
        if start_date <= detail["Date"] <= end_date:
            details.append(detail)

with open("insights-details-kpmg.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=["url_link", "Title", "Description", "Date", "Content", "Pdf_link"])
    writer.writeheader()
    for detail in details:
        writer.writerow(detail)

driver.quit()

print(f"Extracted details from {len(details)} links and saved to insights-details-kpmg.csv")
```


Extracted details from 7 links and saved to insights-details-kpmg.csv
