
# 📄 PDF to Text Extraction Notebook

Welcome to the **PDF to Text Extraction** notebook!  
It helps you **extract text** from PDF files using Python, strip accents and special characters, and save the result as a `.txt` file.

### Let's get started! 💪


📦 First, install the PyMuPDF library!

This library is required to extract text from PDF files using the functions below.

In [1]:
!pip install PyMuPDF --quiet


In [2]:
# Import PyMuPDF, which you've just installed (it is exposed as the `fitz` module), to work with PDF files
import fitz

In [13]:
# Replace the path below with the path of any PDF you upload
pdf_path = "/content/Documento para o Desafio de Identificação e Remoção de Dados Sensíveis (1).pdf"
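Since the path is typed by hand, a quick check before extraction can catch typos early. The following is a minimal sketch using only the standard library; `check_pdf_path` is a hypothetical helper name, not part of this notebook or of PyMuPDF.

```python
from pathlib import Path


def check_pdf_path(path: str) -> Path:
    """Return `path` as a Path, or raise if it is not an existing .pdf file.

    Hypothetical helper for illustration only.
    """
    candidate = Path(path)
    # Reject anything that does not end in .pdf (case-insensitive).
    if candidate.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {candidate.name}")
    # Reject paths that do not point at an existing regular file.
    if not candidate.is_file():
        raise FileNotFoundError(f"No file found at: {candidate}")
    return candidate
```

Calling `check_pdf_path(pdf_path)` right after setting `pdf_path` would raise a readable error if the upload is missing or the name is mistyped.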

In [14]:
# Define a function to extract all text from a PDF file. It reads every page and returns the combined text.

def extract_text_from_pdf(path: str) -> str:
    """
    Extracts text content from all pages of a PDF file.

    Parameters:
        path (str): The file path to the PDF document.

    Returns:
        str: The extracted text from the entire PDF.
    """
    text = ""
    with fitz.open(path) as doc:
        for page in doc:
            text += page.get_text()
    return text
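The function above concatenates pages with nothing between them, so page boundaries are lost in the result. If you need them later, one option is to join the page texts with an explicit separator. The sketch below is an assumption-laden illustration: `join_page_texts`, `HasGetText`, and `FakePage` are hypothetical names, and the form feed separator is just one possible choice. It works with any objects that have a `get_text()` method, which PyMuPDF pages do.

```python
from typing import Iterable, Protocol


class HasGetText(Protocol):
    """Anything with a get_text() method, such as a PyMuPDF page."""

    def get_text(self) -> str: ...


def join_page_texts(pages: Iterable[HasGetText], separator: str = "\f") -> str:
    """Join the text of each page, placing `separator` between pages."""
    return separator.join(page.get_text() for page in pages)


class FakePage:
    """Stand-in page so the sketch runs without a real PDF."""

    def __init__(self, text: str) -> None:
        self._text = text

    def get_text(self) -> str:
        return self._text


combined = join_page_texts([FakePage("page one\n"), FakePage("page two\n")])
pages_back = combined.split("\f")  # recover the per-page texts
```

With a real document, the same helper could be called as `join_page_texts(doc)` inside the `with fitz.open(path) as doc:` block.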

In [15]:
# Define a function that removes accents and special characters from the text, producing a plain ASCII version

import unicodedata
import re

def remove_all_special_characters(text: str) -> str:
    """
    Normalizes and cleans a text string by removing accents, punctuation, and special characters.

    Steps:
        1. Converts accented characters to their ASCII equivalents.
        2. Removes all characters except letters, numbers, and spaces.
        3. Collapses multiple spaces into a single space.

    Parameters:
        text (str): The input string to be cleaned.

    Returns:
        str: The cleaned and normalized string.
    """
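    # Decompose accented characters into base letter + combining mark (NFD),
    # then drop every non-ASCII character, including those combining marks.
    text = unicodedata.normalize("NFD", text)
    text = text.encode("ascii", "ignore").decode("ascii")

    # Keep only letters, digits, and whitespace; then collapse whitespace runs.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)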

    return text.strip()
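To see what each of the three steps does, it can help to run them one at a time on a short string. The sketch below repeats the same calls as the function above on a made-up sample (the sample text is an assumption, not taken from any PDF). Note that step 2 also removes `@`, `.`, and `-`, so e-mail addresses, dates, and reference codes lose their separators.

```python
import re
import unicodedata

# Made-up sample string for illustration only.
sample = "Relatório de Admissão: e-mail maria@example.pt, Ref. ADM-2025"

# Step 1: decompose accents (NFD) and drop everything that is not ASCII.
step1 = unicodedata.normalize("NFD", sample).encode("ascii", "ignore").decode("ascii")

# Step 2: keep only letters, digits, and whitespace.
step2 = re.sub(r"[^a-zA-Z0-9\s]", "", step1)

# Step 3: collapse whitespace runs and trim both ends.
step3 = re.sub(r"\s+", " ", step2).strip()

print(step1)  # Relatorio de Admissao: e-mail maria@example.pt, Ref. ADM-2025
print(step3)  # Relatorio de Admissao email mariaexamplept Ref ADM2025
```

If e-mail addresses or dates need to stay recognizable downstream, the character class in step 2 would have to be widened to keep the relevant punctuation.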

In [16]:
# 🚀 Run this cell to extract the text from your chosen PDF!

raw_text = extract_text_from_pdf(pdf_path)

In [17]:
# 🚀 Run this cell to remove all special characters from the text you extracted in the previous step!

raw_text_cleaned = remove_all_special_characters(raw_text)

In [18]:
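# Swap only the file extension; str.replace would also change ".pdf" anywhere else in the path
import os
txt_path = os.path.splitext(pdf_path)[0] + ".txt"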
with open(txt_path, "w", encoding="utf-8") as f:
    f.write(raw_text_cleaned)

print(f"✅ Text extracted and saved to: {txt_path}")

✅ Text extracted and saved to: /content/Documento para o Desafio de Identificação e Remoção de Dados Sensíveis (1).txt


In [19]:
print("\n--- Preview of Extracted Text ---\n")
print(raw_text_cleaned)


--- Preview of Extracted Text ---

Documento para o Desafio de Identificacao e Remocao de Dados Sensiveis Relatorio de Admissao Centro Medico Lisboa Data 15 de abril de 2025 Referencia ADM20250415089 Informacoes do Paciente Nome Maria Conceicao Oliveira Santos Data de Nascimento 12031978 CPF 12345678910 Cartao de Cidadao 123456789ZX0 Morada Rua das Flores 123 Apt 45 Sacavem Lisboa Telefone 351 912 345 678 Email mariasantosemailpessoalpt Numero da Seguranca Social 11223344556 Historico Medico A paciente Maria Santos mulher caucasiana de 47 anos compareceu a consulta relatando dores abdominais intensas Tem historico de hipertensao e diabetes tipo 2 diagnosticada ha 5 anos E HIV positivo desde 2018 atualmente com carga viral indetectavel gracas ao tratamento com antirretrovirais A paciente relatou que sua familia tem historico de cancro da mama mae falecida aos 52 anos e doenca cardiaca pai e avo paterno Exames geneticos realizados em 2022 indicaram predisposicao ao cancro de mama mutaca