# Documentation

## Overview
This script extracts text from PDF files located in a specified folder and saves the extracted text into `.txt` files in an output folder. It uses the **PyMuPDF** library (imported as `fitz`) for PDF text extraction.

### Main Functions
1. **extract_text_from_pdf(pdf_path)**
   - Opens the PDF with PyMuPDF and concatenates the text of every page.
   - Returns the extracted text as a string (empty if nothing could be extracted).

2. **save_text_to_file(text, txt_path)**
   - Writes the extracted text to a `.txt` file using UTF-8 encoding, overwriting any existing file at that path.

3. **convert_all_pdfs_to_txt(pdf_folder, output_folder)**
   - Iterates through all files in the input folder whose names end in `.pdf` (case-insensitive).
   - Extracts text and saves it as a `.txt` file with the same base name; PDFs that yield no text are skipped.
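
The file selection and naming rule described above can be sketched on its own, without opening any PDF. This is a minimal illustration that assumes a hypothetical helper named `plan_conversions` (it is not part of the script) and uses `pathlib` instead of `os.path`. Unlike the script, it also skips directories whose names happen to end in `.pdf`.

```python
from pathlib import Path


def plan_conversions(pdf_folder, output_folder):
    """Return sorted (pdf_path, txt_path) pairs for every PDF in pdf_folder.

    Hypothetical helper for illustration; the original script does this
    inline inside convert_all_pdfs_to_txt.
    """
    pairs = []
    for entry in sorted(Path(pdf_folder).iterdir()):
        # Match the extension case-insensitively, as the script does.
        if entry.is_file() and entry.name.lower().endswith(".pdf"):
            # Same base name, new extension: sample.pdf -> sample.txt
            txt_path = Path(output_folder) / (entry.stem + ".txt")
            pairs.append((entry, txt_path))
    return pairs
```

Calling `plan_conversions("pdf", "txt")` would list the planned input and output paths without writing anything, which can be handy as a dry run before a large batch.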

### Execution
- The script is executed directly by running:
  ```bash
  python script_name.py
  ```
- It reads PDFs from the `pdf` folder and writes `.txt` files to the `txt` folder; both paths are relative to the current working directory.

### Requirements
- The `PyMuPDF` library (for example, `pip install PyMuPDF`).
- The `pdf` input folder must already exist; the `txt` output folder is created automatically if it is missing.
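
These requirements could be checked up front before any file is processed. The sketch below is one possible approach, not something the script does: `pymupdf_available` and `prepare_folders` are hypothetical names introduced here for illustration.

```python
import importlib.util
import os


def pymupdf_available():
    """Return True if the `fitz` module (PyMuPDF) can be imported."""
    return importlib.util.find_spec("fitz") is not None


def prepare_folders(pdf_folder, output_folder):
    """Fail early if the input folder is missing; create the output folder."""
    if not os.path.isdir(pdf_folder):
        raise FileNotFoundError(f"Input folder not found: {pdf_folder}")
    # Mirrors the script: the output folder is created when absent.
    os.makedirs(output_folder, exist_ok=True)
```

Raising `FileNotFoundError` with a clear message is a design choice here; the script as written would instead fail inside `os.listdir` when the `pdf` folder does not exist.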

### Error Handling
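- If text extraction raises an exception for a PDF, the script prints the error and moves on to the next file. Text read before the failure is still returned, so a partially extracted file may be saved; a PDF that yields no text produces no `.txt` file.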
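
The catch-and-continue behaviour can be illustrated in isolation. The sketch below is a variation rather than the script's own code: it assumes a hypothetical `extract_all` helper that takes the extraction function as a parameter and records failures in a dictionary instead of only printing them.

```python
def extract_all(pdf_paths, extract):
    """Run `extract` on each path, collecting results and failures separately.

    `extract` is any callable that takes a path and returns a string; in the
    script this role is played by extract_text_from_pdf.
    """
    texts, failures = {}, {}
    for path in pdf_paths:
        try:
            texts[path] = extract(path)
        except Exception as exc:  # broad on purpose, mirroring the script
            # One bad file does not stop the rest of the batch.
            failures[path] = str(exc)
    return texts, failures
```

Returning the failures makes it possible to print a summary at the end of a run or to retry only the files that failed.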

### Example Output
- For `sample.pdf`, the output will be `sample.txt` in the `txt` folder.

---


# Code

import os
import fitz

def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file using PyMuPDF and returns it as a string.
    
    Parameters:
    - pdf_path (str): The path to the PDF file.
    
    Returns:
    - str: The extracted text.
    """
    text = ""
    try:
        # The context manager closes the document even if extraction fails.
        with fitz.open(pdf_path) as doc:
            for page in doc:
                text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from {pdf_path}: {e}")
    return text

def save_text_to_file(text, txt_path):
    """
    Saves the provided text into a .txt file.
    
    Parameters:
    - text (str): The text to save.
    - txt_path (str): The file path where the text will be saved.
    """
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)

def convert_all_pdfs_to_txt(pdf_folder, output_folder):
    """
    Iterates over all PDF files in the specified folder, extracts text from each,
    and saves the text into separate .txt files in the output folder.
    
    The text file will have the same base name as the original PDF.
    
    Parameters:
    - pdf_folder (str): The folder containing PDF files.
    - output_folder (str): The folder where the .txt files will be saved.
    """
    os.makedirs(output_folder, exist_ok=True)
    for filename in os.listdir(pdf_folder):
        if filename.lower().endswith(".pdf"):
            pdf_path = os.path.join(pdf_folder, filename)
            print(f"Processing {pdf_path} ...")
            text = extract_text_from_pdf(pdf_path)
            if text:
                base_name = os.path.splitext(filename)[0]
                txt_filename = base_name + ".txt"
                txt_path = os.path.join(output_folder, txt_filename)
                save_text_to_file(text, txt_path)
                print(f"Extracted text saved to {txt_path}")
            else:
                print(f"No text extracted from {pdf_path}")

if __name__ == "__main__":
    convert_all_pdfs_to_txt("pdf", "txt")

Processing pdf\PWC_04-03-25_Financial_health_Transcending_from_access_to_impact.pdf ...
Extracted text saved to txt\PWC_04-03-25_Financial_health_Transcending_from_access_to_impact.txt
Processing pdf\PWC_04-03-25_Towards_a_climate-resilient_future_Strategies_for_the_Andaman_and_Nicobar_Islands.pdf ...
Extracted text saved to txt\PWC_04-03-25_Towards_a_climate-resilient_future_Strategies_for_the_Andaman_and_Nicobar_Islands.txt
Processing pdf\PWC_05-03-25_The_mutual_funds_route_to_Viksit_Bharat_@2047.pdf ...
Extracted text saved to txt\PWC_05-03-25_The_mutual_funds_route_to_Viksit_Bharat_@2047.txt
Processing pdf\PWC_07-03-25_Quality_measures_and_standards_for_transitioning_to_value-based_healthcare_in_India.pdf ...
Extracted text saved to txt\PWC_07-03-25_Quality_measures_and_standards_for_transitioning_to_value-based_healthcare_in_India.txt
Processing pdf\PWC_14-02-25_Deals_at_a_glance_Annual_review_2024.pdf ...
Extracted text saved to txt\PWC_14-02-25_Deals_at_a_glance_Annual_review_20