# Documentation

## Overview
This document describes a Python script that downloads PDFs from URLs listed in a CSV file.

### Script Functionality
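- Reads PDF links, dates, and titles from the CSV file `insights-details-kpmg-india.csv`.
- Saves each PDF as `KPMG_DATE_TITLE.pdf`, where `DATE` is the date part of the row's `Date` value and `TITLE` is the row's `Title` with illegal filename characters removed.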
- Saves the PDFs into a specified folder (`pdf` by default).
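
The naming convention can be illustrated with a small, self-contained sketch. The helper `build_filename` is a hypothetical name used only for illustration (the script builds the name inline inside `download_pdf`), and the sample inputs, including the `|` in the second title, are assumptions rather than rows taken from the CSV.

```python
import re


def clean_filename(s):
    # Strip surrounding whitespace, then drop characters that are
    # illegal in Windows file names.
    return re.sub(r'[\\/*?:"<>|]', "", s.strip())


def build_filename(date, title):
    """Hypothetical helper mirroring the KPMG_DATE_TITLE.pdf convention."""
    # Keep only the date part when the value also carries a time.
    date_only = date.split()[0] if " " in date else date
    return f"KPMG_{clean_filename(date_only)}_{clean_filename(title)}.pdf"


# Assumed sample inputs, not taken from the real CSV.
print(build_filename("2025-02-06 10:30:00", "KPMG global tech report 2024"))
print(build_filename("2025-02-28", "Issue no. 103 | February 2025"))
```

Removing a character does not collapse the whitespace around it, so the second name in this sketch keeps two consecutive spaces where the `|` used to be.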

## Dependencies
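- Python 3.6 or later (the code uses f-strings)
- `os` and `re` (standard library)
- `pandas` (third-party; reads the CSV file)
- `requests` (third-party; downloads the PDFs)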
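
Before running the script, it may help to confirm that the third-party packages are installed. The following is a minimal sketch, not part of the original script; `missing_packages` is a hypothetical helper name.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# pandas and requests are the two third-party packages the script imports.
missing = missing_packages(["pandas", "requests"])
if missing:
    print("Install before running:", ", ".join(missing))
else:
    print("All third-party dependencies are available.")
```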

## Code Explanation

### Function Descriptions

1. **clean_filename(s)**
   - Makes a string safe to use in a file name.
   - Strips leading and trailing whitespace, then uses a regular expression to remove the characters `\ / : * ? " < > |`.

2. **download_pdf(url, folder, date, title)**
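   - Downloads a PDF from the provided URL with `requests.get()`.
   - Creates the destination folder if it does not already exist.
   - Drops any time component from the date, then builds the file name `KPMG_DATE_TITLE.pdf`.
   - Writes the file only when the server responds with HTTP 200; otherwise prints the URL and the status code.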

3. **download_all_pdfs(csv_file, folder='pdf')**
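   - Reads the CSV file with pandas and takes the `Pdf_link`, `Date`, and `Title` values from each row.
   - Calls `download_pdf()` for each row in which all three values are non-empty strings.
   - Otherwise prints a message with the row index and skips the row.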

4. **Main Script Logic**
   - Invokes `download_all_pdfs()` with the CSV file path.
   - Saves PDFs in the `pdf` folder.
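
The row check in step 3 tests the type of each value because pandas loads an empty CSV cell as `NaN`, which is a float, so `isinstance(value, str)` rejects it. The sketch below reproduces that check without pandas. `is_valid_row` is a hypothetical helper name, and the rows are assumed sample data in which `float("nan")` stands in for an empty cell.

```python
def is_valid_row(row):
    """Hypothetical helper: True when all three fields are non-empty strings."""
    return all(
        isinstance(row.get(key), str) and row.get(key).strip()
        for key in ("Pdf_link", "Date", "Title")
    )


# Assumed sample rows; float("nan") mimics an empty CSV cell as pandas loads it.
rows = [
    {"Pdf_link": "https://example.com/a.pdf", "Date": "2025-02-20", "Title": "Report A"},
    {"Pdf_link": float("nan"), "Date": "2025-02-10", "Title": "Report B"},
    {"Pdf_link": "https://example.com/c.pdf", "Date": "2025-02-07", "Title": "   "},
]

for index, row in enumerate(rows):
    if is_valid_row(row):
        print(f"Row {index} would be downloaded.")
    else:
        print(f"Row {index} is missing a valid PDF link, Date, or title.")
```

In this sketch only the first row passes: the second has a missing link and the third has a title made up entirely of whitespace.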

### Brief Explanation
This script automates the download of PDFs from the URLs listed in `insights-details-kpmg-india.csv`, using a consistent naming convention and saving them into a specified folder.


# Code

import os
import re
import pandas as pd
import requests

def clean_filename(s):
    """
    Removes characters that are illegal in file names.
    
    Parameters:
    - s (str): The input string.
    
    Returns:
    - str: The sanitized string.
    """
    s = s.strip()
    # Remove characters: \ / : * ? " < > |
    return re.sub(r'[\\/*?:"<>|]', "", s)

def download_pdf(url, folder, date, title):
    """
    Downloads a PDF from the given URL and saves it using the naming convention:
    "KPMG_DATE_TITLE.pdf" in the specified folder.
    
    Parameters:
    - url (str): The URL to the PDF file.
    - folder (str): The destination folder to save the PDF.
    - date (str): The date string from the CSV (potentially with a time part).
    - title (str): The title string from the CSV.
    """
    os.makedirs(folder, exist_ok=True)
    
    # If the date string contains time, extract only the date part.
    date_only = date.split()[0] if " " in date else date
    clean_date = clean_filename(date_only)
    clean_title = clean_filename(title)
    
    # Create the filename using the naming convention
    filename = f"KPMG_{clean_date}_{clean_title}.pdf"
    file_path = os.path.join(folder, filename)
    
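    # A timeout keeps one unresponsive server from stalling the whole run.
    try:
        response = requests.get(url, timeout=30)
    except requests.RequestException as exc:
        print(f"Failed to download from {url}. Error: {exc}")
        return
    if response.status_code == 200:
        with open(file_path, 'wb') as f:
            f.write(response.content)
        print(f"Downloaded: {file_path}")
    else:
        print(f"Failed to download from {url}. Status code: {response.status_code}")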

def download_all_pdfs(csv_file, folder='pdf'):
    """
    Reads a CSV file, extracts the PDF links from the 'Pdf_link' column,
    along with the 'Date' and 'Title' columns, and downloads each PDF into the
    specified folder using the naming convention "KPMG_DATE_TITLE.pdf".
    
    Parameters:
    - csv_file (str): The path to the CSV file.
    - folder (str): The folder where the PDFs will be saved.
    """
    df = pd.read_csv(csv_file)
    
    for index, row in df.iterrows():
        pdf_link = row.get('Pdf_link')
        date = row.get('Date')
        title = row.get('Title')
        
        if (
            isinstance(pdf_link, str) and pdf_link.strip() and 
            isinstance(date, str) and date.strip() and 
            isinstance(title, str) and title.strip()
        ):
            download_pdf(pdf_link, folder, date, title)
        else:
            print(f"Row {index} is missing a valid PDF link, Date, or title.")

# Example usage:
download_all_pdfs("insights-details-kpmg-india.csv")

# Output


Downloaded: pdf\KPMG_2025-02-28_Issue no. 103  February 2025.pdf
Downloaded: pdf\KPMG_2025-02-20_Food and Nutritional Security in India.pdf
Row 2 is missing a valid PDF link, Date, or title.
Downloaded: pdf\KPMG_2025-02-07_KPMG global tech report – industrial manufacturing insights.pdf
Downloaded: pdf\KPMG_2025-02-07_KPMG global tech report Technology insights.pdf
Downloaded: pdf\KPMG_2025-02-07_KPMG global tech report energy insights.pdf
Downloaded: pdf\KPMG_2025-02-06_KPMG global tech report 2024.pdf
