# PDF to Markdown Converter

This notebook converts PDF files under `data/raw` into Markdown files in `data/processed/pdf_markdown`.

## Optional: Install Dependencies

Uncomment and run this cell if `pdfplumber` is not installed in your environment.

In [10]:
# Uncomment if you need to install the dependency
# %pip install pdfplumber

## Configure Paths

Adjust the input and output directories if needed.

In [11]:
from pathlib import Path

INPUT_DIR = Path("..") / "data" / "raw"
OUTPUT_DIR = Path("..") / "data" / "processed" / "pdf_markdown"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Input:  {INPUT_DIR.resolve()}")
print(f"Output: {OUTPUT_DIR.resolve()}")

Input:  D:\LLM\Projects\LLM-RAG-private-knowldge-worker\data\raw
Output: D:\LLM\Projects\LLM-RAG-private-knowldge-worker\data\processed\pdf_markdown


## Convert PDFs to Markdown

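This cell recursively scans `INPUT_DIR` for PDFs and writes each file's extracted text to a `.md` file at the same relative path under `OUTPUT_DIR`. PDFs that yield no text are skipped and counted with the failures.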
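
The mapping from an input PDF to its output file can be illustrated in isolation. This is a minimal sketch, not part of the notebook: `mirror_output_path` is a hypothetical helper name, the example paths are made up, and `PurePosixPath` is used only so the example runs without touching the filesystem.

```python
from pathlib import PurePosixPath

def mirror_output_path(pdf_path, input_dir, output_dir):
    """Map a PDF under input_dir to a .md path at the same relative location under output_dir."""
    rel_path = pdf_path.relative_to(input_dir)       # path of the PDF relative to the input root
    return output_dir / rel_path.with_suffix(".md")  # same folders, .pdf swapped for .md

# Hypothetical example paths (not taken from the notebook's data).
src = PurePosixPath("data/raw/Resume/cv.pdf")
out = mirror_output_path(
    src,
    PurePosixPath("data/raw"),
    PurePosixPath("data/processed/pdf_markdown"),
)
print(out)  # data/processed/pdf_markdown/Resume/cv.md
```

Because the subfolder structure is preserved, two PDFs with the same file name in different folders do not overwrite each other.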

In [12]:
import pdfplumber

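def pdf_to_text(pdf_path: Path) -> str:
    """Return the text of every page in the PDF, with pages separated by blank lines."""
    text_parts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Guard against pages where extract_text() returns nothing
            page_text = page.extract_text() or ""
            text_parts.append(page_text)
    return "\n\n".join(text_parts).strip()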

def convert_all_pdfs(input_dir: Path, output_dir: Path) -> None:
    """Convert every PDF under input_dir to a .md file at the same relative path under output_dir."""
    pdf_files = sorted(input_dir.rglob("*.pdf"))
    if not pdf_files:
        print("No PDF files found.")
        return

    success_count = 0
    error_count = 0
    
    for pdf_path in pdf_files:
        rel_path = pdf_path.relative_to(input_dir)
        out_path = output_dir / rel_path.with_suffix(".md")
        out_path.parent.mkdir(parents=True, exist_ok=True)

        try:
            text = pdf_to_text(pdf_path)
            if not text:
                print(f"[SKIP] No text extracted: {rel_path}")
                error_count += 1
                continue

            out_path.write_text(text, encoding="utf-8")
            print(f"[OK] {rel_path} -> {out_path.relative_to(output_dir)}")
            success_count += 1
        except Exception as e:
            # Report the exception type and move on to the next file
            print(f"[ERROR] {rel_path}: {type(e).__name__}")
            error_count += 1

    print(f"\nSummary: {success_count} converted, {error_count} failed/skipped")

convert_all_pdfs(INPUT_DIR, OUTPUT_DIR)

[SKIP] No text extracted: academic achievements\BTech Degree.pdf
[OK] academic achievements\CertificateOfCompletion_Business Intelligence for Consultants.pdf -> academic achievements\CertificateOfCompletion_Business Intelligence for Consultants.md
[OK] academic achievements\CertificateOfCompletion_Deep Learning Getting Started.pdf -> academic achievements\CertificateOfCompletion_Deep Learning Getting Started.md
[OK] academic achievements\CertificateOfCompletion_Power BI Data Visualization and Dashboard Tips Tricks  Techniques.pdf -> academic achievements\CertificateOfCompletion_Power BI Data Visualization and Dashboard Tips Tricks  Techniques.md
[OK] academic achievements\CertificateOfCompletion_Power BI Essential Training 2020.pdf -> academic achievements\CertificateOfCompletion_Power BI Essential Training 2020.md
[OK] academic achievements\CoAuthor Maternal health paper presentation.pdf -> academic achievements\CoAuthor Maternal health paper presentation.md
[OK] academic achievements

Could not get FontBBox from font descriptor because None cannot be parsed as 4 floats


[OK] research_papers\Fall_Detection_Methods_for_Elderly_People-_A_Comprehensive_Survey.pdf -> research_papers\Fall_Detection_Methods_for_Elderly_People-_A_Comprehensive_Survey.md


Could not get FontBBox from font descriptor because None cannot be parsed as 4 floats


[OK] research_papers\Role of big data analytics in healthcare systems _ Medical Imaging Informatics.pdf -> research_papers\Role of big data analytics in healthcare systems _ Medical Imaging Informatics.md
[OK] Resume\5 Resume- Uravane Prathamesh Suhas.pdf -> Resume\5 Resume- Uravane Prathamesh Suhas.md
[OK] Resume\AI_ML Resume 5 Prathamesh Uravane.pdf -> Resume\AI_ML Resume 5 Prathamesh Uravane.md
[OK] Resume\Resume 6 Prathamesh Uravane.pdf -> Resume\Resume 6 Prathamesh Uravane.md
[SKIP] No text extracted: Transcripts\10th Certificate.pdf
[SKIP] No text extracted: Transcripts\12th Mark Sheet.pdf
[OK] Transcripts\Testudo - Unofficial Transcript.pdf -> Transcripts\Testudo - Unofficial Transcript.md
[SKIP] No text extracted: Transcripts\Transcripts.pdf
[ERROR] Transcripts\WES view-report.pdf: PdfminerException

Summary: 21 converted, 16 failed/skipped
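
The repeated `Could not get FontBBox` lines in the output above are warnings, not conversion failures. A minimal sketch for quieting them, assuming they are emitted through Python's standard `logging` module by loggers under the `pdfminer` namespace (the PDF parsing library that `pdfplumber` builds on):

```python
import logging

# Assumption: the FontBBox messages come from loggers named "pdfminer.*".
# Raising the level on the parent logger hides WARNING-level messages from
# all of its children while still letting real errors through.
logging.getLogger("pdfminer").setLevel(logging.ERROR)
```

Run this before the conversion cell. If the warnings still appear, they are probably routed some other way and this setting will have no effect.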

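The conversion cell writes the extracted text unchanged, so the resulting `.md` files contain no Markdown structure. One possible way to add a title and per-page headings is sketched below. `pages_to_markdown` is a hypothetical helper, not part of the notebook, and the sample strings stand in for the values returned by `page.extract_text()`.

```python
def pages_to_markdown(title: str, page_texts: list[str]) -> str:
    """Join per-page text into one Markdown document with a title and page headings."""
    sections = [f"# {title}"]
    for number, text in enumerate(page_texts, start=1):
        text = text.strip()
        if not text:
            continue  # skip pages with no extractable text
        sections.append(f"## Page {number}\n\n{text}")
    return "\n\n".join(sections) + "\n"

# Hypothetical page strings; the empty second page is dropped.
markdown = pages_to_markdown("Example", ["First page text.", "", "Third page text."])
print(markdown)
```

Keeping the original page numbers in the headings makes it easier to trace a retrieved passage back to its place in the source PDF.
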

In [13]:
import pdfplumber

def analyze_pdfs(input_dir: Path) -> dict:
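    """Sort PDFs into convertible, no-text, and "encrypted" buckets.

    Any exception raised while opening or reading a PDF puts it in the
    "encrypted" bucket, so that label is a guess rather than a confirmed cause.
    """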
    pdf_files = sorted(input_dir.rglob("*.pdf"))
    results = {
        "convertible": [],
        "no_text": [],
        "encrypted": []
    }
    
    for pdf_path in pdf_files:
        rel_path = pdf_path.relative_to(input_dir)
        try:
            with pdfplumber.open(pdf_path) as pdf:
                text = "\n\n".join([page.extract_text() or "" for page in pdf.pages]).strip()
                if text:
                    results["convertible"].append(str(rel_path))
                else:
                    results["no_text"].append(str(rel_path))
        except Exception as e:
            results["encrypted"].append((str(rel_path), type(e).__name__))
    
    return results

# Analyze all PDFs
stats = analyze_pdfs(INPUT_DIR)

print("=" * 70)
print("ENCRYPTED OR IMAGE-BASED PDFs (No extractable text)")
print("=" * 70)
for path in sorted(stats["no_text"]):
    print(f"  {INPUT_DIR / path}")

if stats["encrypted"]:
    print("\n" + "=" * 70)
    print("ERROR (Likely encrypted)")
    print("=" * 70)
    for path, error in stats["encrypted"]:
        print(f"  {INPUT_DIR / path} [{error}]")

print("\n" + "=" * 70)
print(f"Summary: {len(stats['convertible'])} convertible, {len(stats['no_text'])} image-based, {len(stats['encrypted'])} encrypted")
print("=" * 70)

Could not get FontBBox from font descriptor because None cannot be parsed as 4 floats


ENCRYPTED OR IMAGE-BASED PDFs (No extractable text)
  ..\data\raw\Extra cariculam\Drawing grade exam B grade .pdf
  ..\data\raw\Extra cariculam\Elementray Drawing grade exame.pdf
  ..\data\raw\Extra cariculam\Intermidiate.pdf
  ..\data\raw\Research integrity Course Certificates\Engineering and technology.pdf
  ..\data\raw\Research integrity Course Certificates\Natural and physical sciences.pdf
  ..\data\raw\Research integrity Course Certificates\arts and humanities.pdf
  ..\data\raw\Research integrity Course Certificates\biomedical sciences.pdf
  ..\data\raw\Research integrity Course Certificates\social and behavioural sciences.pdf
  ..\data\raw\Transcripts\10th Certificate.pdf
  ..\data\raw\Transcripts\12th Mark Sheet.pdf
  ..\data\raw\Transcripts\Transcripts.pdf
  ..\data\raw\academic achievements\BTech Degree.pdf
  ..\data\raw\academic achievements\Degree Certificate.pdf
  ..\data\raw\internships\NTU singapore internship completion  (1).pdf
  ..\data\raw\internships\Uma internship c