# Web-Scale PDF Processing Pipeline - Educational Example

This notebook provides a simplified version of the [Web Scale PDF Processing Pipeline](https://github.com/aisingapore/web_scale_pdf_processing_pipeline), which is used to extract educational web resources for pretraining large language models.

We'll break down the workflow into the following steps:
1. Setup and Environment
2. PDF Collection & Filtering
3. Quality Filtering
4. OCR Text Extraction using Marker
5. Text Post-processing

This simplified version runs on a single GPU, without Spark or any other distributed computing, which keeps it small enough for educational use.

## 1. Setup and Environment

First, let's install the required packages. The main package we'll need is [Marker](https://github.com/VikParuchuri/marker), a GPU-accelerated OCR tool for extracting text from PDFs.

In [None]:
# Install required packages
!pip install marker-pdf pypdf pandas opencv-python openai transformers torch

In [None]:
# Import necessary libraries
import os
import re
import glob
import json
import pandas as pd
import numpy as np
from pypdf import PdfReader
# Note: these imports match the pre-1.0 marker-pdf API; later releases reorganized these modules
from marker.convert import convert_single_pdf
from marker.models import load_all_models
from marker.output import save_markdown
import tempfile
import shutil
import uuid
import time

# Set up directories
input_dir = "./pdf_samples"  # Directory containing your PDF files
output_dir = "./output"      # Directory for outputs

# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

## 2. PDF Collection & Filtering

In this step, we'll find all PDF files in a directory and filter them based on basic criteria like page count.

In [None]:
def list_pdf_files(directory):
    """List all PDF files in the given directory."""
    pdf_files = glob.glob(os.path.join(directory, "*.pdf"))
    return pdf_files

def get_pdf_page_count(pdf_path):
    """Get the number of pages in a PDF file."""
    try:
        with open(pdf_path, "rb") as file:
            pdf_reader = PdfReader(file)
            return len(pdf_reader.pages)
    except Exception as e:
        print(f"Error in get_pdf_page_count for {pdf_path}: {str(e)}")
        return 0

def filter_pdfs_by_page_count(pdf_files, min_pages=2, max_pages=500):
    """Filter PDFs by page count."""
    filtered_pdfs = []
    for pdf_path in pdf_files:
        page_count = get_pdf_page_count(pdf_path)
        if min_pages <= page_count <= max_pages:
            filtered_pdfs.append((pdf_path, page_count))
    return filtered_pdfs

In [None]:
# Get all PDF files
pdf_files = list_pdf_files(input_dir)
print(f"Found {len(pdf_files)} PDF files")

# Filter PDFs by page count
filtered_pdfs = filter_pdfs_by_page_count(pdf_files)
print(f"After filtering by page count: {len(filtered_pdfs)} PDFs")

# Create a DataFrame with the filtered PDFs
pdf_df = pd.DataFrame(filtered_pdfs, columns=["pdf_path", "page_count"])
pdf_df.head()

## 3. Quality Filtering

This step determines if a PDF contains relevant content for our needs. We'll implement two approaches:

1. **Rule-based filtering**: A simple approach using basic text metrics
2. **LLM-based filtering**: More sophisticated approach using language models
   - API-based models (OpenAI's GPT-4o-mini)
   - Open source models (Llama-3 8B)
   
You can choose which filtering method to use based on your needs and available resources.

In [None]:
def extract_basic_text(pdf_path):
    """Extract text from PDF using PyPDF (not as good as Marker but faster for initial filtering)."""
    try:
        with open(pdf_path, "rb") as file:
            pdf_reader = PdfReader(file)
            # Only extract from first few pages for quick filtering
            max_pages = min(5, len(pdf_reader.pages))
            text = ""
            for i in range(max_pages):
                # Guard against pages with no extractable text
                text += (pdf_reader.pages[i].extract_text() or "") + "\n"
        return text
    except Exception as e:
        print(f"Error in extract_basic_text for {pdf_path}: {str(e)}")
        return ""

# 1. RULE-BASED FILTERING

def is_relevant_rule_based(text, keywords=None, min_text_length=100):
    """Simple relevance check based on text length and optional keywords."""
    if len(text) < min_text_length:
        return False
        
    if keywords:
        return any(keyword.lower() in text.lower() for keyword in keywords)
    
    return True

# 2. LLM-BASED FILTERING

# 2.1 OpenAI API Model (GPT-4o-mini)
def is_relevant_openai(text, api_key=None, model="gpt-4o-mini", domain="education"):
    """Use OpenAI's API to determine if a PDF is relevant for a specific domain."""
    try:
        import openai
        
        # Resolve the API key: explicit argument first, then the environment
        api_key = api_key or os.environ.get("OPENAI_API_KEY")
        if not api_key:
            print("Warning: No OpenAI API key provided. Skipping OpenAI filtering.")
            return True  # Default to True if no API key
        
        # Pass the key to the client explicitly
        client = openai.OpenAI(api_key=api_key)
        
        # Truncate the text to avoid excessive token usage
        truncated_text = text[:15000]  # Using first 15k chars, adjust as needed
        
        # Create the prompt based on the domain
        prompt = f"""You are an expert content evaluator. Your task is to determine if the following document is relevant for {domain} content.

Here is a sample of the document:

<document_sample>
{truncated_text}
</document_sample>

Is this document relevant for {domain} purposes? Answer only with 'true' or 'false'."""
        
        # Call the OpenAI API
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You help determine if documents are relevant for specific domains."},
                {"role": "user", "content": prompt}
            ],
            temperature=0
        )
        
        # Extract the answer
        answer = response.choices[0].message.content.strip().lower()
        is_relevant = "true" in answer
        
        return is_relevant
    
    except Exception as e:
        print(f"Error in is_relevant_openai: {str(e)}")
        return True  # Default to True in case of error

# 2.2 Open Source Model (Llama-3)
def is_relevant_llama(text, domain="education"):
    """Use Llama-3 to determine if a PDF is relevant for a specific domain."""
    try:
        from transformers import AutoTokenizer, AutoModelForCausalLM
        import torch
        
        # Truncate the text to avoid excessive token usage
        truncated_text = text[:5000]  # Smaller context for local models
        
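        # Load model and tokenizer (weights are cached on disk after the first download,
        # but they are reloaded into memory on every call; load once outside this
        # function when filtering many PDFs)
        model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # Gated model; or any other appropriate model
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)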
        
        # Move model to GPU if available
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model.to(device)
        
        # Build the prompt with the tokenizer's own chat template
        messages = [
            {"role": "system", "content": "You are an expert content evaluator. You determine if documents are relevant for specific domains."},
            {"role": "user", "content": f"Is the following document relevant for {domain} content? Answer only with 'true' or 'false'.\n\n{truncated_text}"},
        ]
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        
        # Tokenize and generate (the template already adds the special tokens)
        input_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(device)
        
        with torch.no_grad():
            output = model.generate(
                input_ids,
                max_new_tokens=10,
                do_sample=False  # Greedy decoding
            )
        
        # Decode only the newly generated tokens, not the prompt
        new_tokens = output[0][input_ids.shape[1]:]
        assistant_response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip().lower()
        
        is_relevant = "true" in assistant_response
        return is_relevant
    
    except Exception as e:
        print(f"Error in is_relevant_llama: {str(e)}")
        return True  # Default to True in case of error

In [None]:
# Extract text for filtering
pdf_df["ocr_text"] = pdf_df["pdf_path"].apply(extract_basic_text)

# Choose your filtering method
filtering_method = "rule-based"  # Options: "rule-based", "openai", "llama"

# Parameters for filtering
domain = "education"  # Target domain for content
keywords = ["education", "research", "study", "learning"]  # For rule-based filtering

# Apply the selected filtering method
if filtering_method == "rule-based":
    print("Using rule-based filtering...")
    pdf_df["is_relevant"] = pdf_df["ocr_text"].apply(lambda text: is_relevant_rule_based(text, keywords))

elif filtering_method == "openai":
    # You would need to set your API key: os.environ["OPENAI_API_KEY"] = "your-api-key"
    print("Using OpenAI API filtering...")
    # Check a small sample first (comment out for full dataset)
    sample_size = min(3, len(pdf_df))
    pdf_df = pdf_df.head(sample_size).copy()  # For testing API usage; copy avoids pandas chained-assignment warnings
    pdf_df["is_relevant"] = pdf_df["ocr_text"].apply(lambda text: is_relevant_openai(text, domain=domain))

elif filtering_method == "llama":
    print("Using Llama-3 filtering...")
    # Comment out the next line for the full dataset
    pdf_df = pdf_df.head(min(3, len(pdf_df))).copy()  # Small sample for testing
    pdf_df["is_relevant"] = pdf_df["ocr_text"].apply(lambda text: is_relevant_llama(text, domain=domain))

else:
    raise ValueError(f"Unknown filtering_method: {filtering_method!r}")

# Filter relevant PDFs
relevant_pdfs = pdf_df[pdf_df["is_relevant"] == True]
print(f"After relevance check: {len(relevant_pdfs)} PDFs out of {len(pdf_df)} total")
relevant_pdfs.head()
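
Both LLM filters above treat a reply as relevant whenever it contains the substring `true`, so a reply such as "untrue" would also pass. As a minimal sketch of a stricter check, the hypothetical helper below (`parse_relevance_answer` is our own name, not part of the original pipeline) only accepts standalone `true` or `false` words and falls back to a default when the reply is empty or contradictory.

```python
import re

def parse_relevance_answer(answer, default=True):
    """Map a model reply to True/False; fall back to `default` if ambiguous.

    Hypothetical helper for illustration, not part of the original pipeline.
    """
    # Find standalone "true"/"false" words, ignoring case and punctuation
    verdicts = set(re.findall(r"\b(true|false)\b", answer.lower()))
    if verdicts == {"true"}:
        return True
    if verdicts == {"false"}:
        return False
    # Empty or contradictory reply: use the default, mirroring the
    # fail-open behaviour of the filters above
    return default

print(parse_relevance_answer("True."))
print(parse_relevance_answer("false"))
print(parse_relevance_answer("I cannot decide"))
```

Keeping `default=True` preserves the notebook's choice of keeping a document when the filter cannot decide; pass `default=False` for a stricter filter.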

## 4. OCR Text Extraction using Marker

Now we'll use Marker, a GPU-accelerated OCR tool, to extract high-quality text from the PDFs. This is the core of the pipeline and the part that benefits most from GPU acceleration.

In [None]:
# Load all Marker models (only once)
print("Loading Marker models (this may take a while)...")
model_lst = load_all_models()
print("Models loaded!")

In [None]:
def process_pdf_with_marker(pdf_path, model_lst):
    """Process a PDF file using Marker's convert_single_pdf function and return the extracted text."""
    try:
        # Convert PDF to text using Marker
        full_text, images, out_meta = convert_single_pdf(pdf_path, model_lst)
        
        # Generate a unique filename
        unique_filename = f"{uuid.uuid4().hex}_{os.path.basename(pdf_path)}"
        
        # Save the markdown to a temporary directory
        with tempfile.TemporaryDirectory() as temp_dir:
            md_dir = save_markdown(temp_dir, unique_filename, full_text, images, out_meta)
            
            # Construct the path to the .md file inside the created directory
            md_filename = os.path.basename(md_dir) + ".md"
            md_file_path = os.path.join(md_dir, md_filename)
            
            # Read the content of the markdown file
            with open(md_file_path, "r", encoding="utf-8") as md_file:
                md_content = md_file.read()
        
        return md_content
    except Exception as e:
        print(f"Error processing PDF {pdf_path}: {str(e)}")
        return f"Error: Failed to process PDF {pdf_path}"

In [None]:
# Process a sample of PDFs (for quicker execution in this educational example)
# In practice, you might process all relevant PDFs
sample_size = min(5, len(relevant_pdfs))
sample_pdfs = relevant_pdfs.head(sample_size)

# Process each PDF with Marker
start_time = time.time()
results = []

for idx, row in sample_pdfs.iterrows():
    print(f"Processing {row['pdf_path']}...")
    md_content = process_pdf_with_marker(row['pdf_path'], model_lst)
    results.append({
        "pdf_path": row['pdf_path'],
        "page_count": row['page_count'],
        "is_relevant": row['is_relevant'],
        "md_extraction_result": md_content
    })
    
end_time = time.time()
duration = end_time - start_time
print(f"Processed {len(results)} PDFs in {duration:.2f} seconds")

# Create a DataFrame with the results
results_df = pd.DataFrame(results)
results_df.head()

## 5. Text Post-processing

Finally, we'll clean up the extracted text to make it more useful for downstream tasks like language model training.

In [None]:
def extract_meaningful_text(markdown_content):
    """Extract meaningful text from markdown content."""
    # Remove metadata and formatting
    content = re.sub(r'^---.*?---', '', markdown_content, flags=re.DOTALL)
    
    # Remove heading markers (# through ######) but keep the heading text
    content = re.sub(r'^#{1,6}\s*(.*?)\n', r'\1\n', content, flags=re.MULTILINE)
    
    # Remove bold spans (note: this drops the bolded text itself, not just the ** markers)
    content = re.sub(r'\*\*.*?\*\*', '', content)
    
    # Remove inline math formulas
    content = re.sub(r'\$.*?\$', '', content)
    
    # Remove bracketed and parenthesized spans (citations, references, link targets);
    # note that this also drops any ordinary parenthetical text
    content = re.sub(r'\[.*?\]', '', content)
    content = re.sub(r'\(.*?\)', '', content)
    
    # Remove Markdown tables
    content = re.sub(r'\|[^\n]*\|(\n\|[-:| ]+\|)?(\n\|[^\n]*\|)*', '', content)
    
    # Remove Keywords section
    content = re.sub(r'Keywords:.*?(?=\n\n)', '', content, flags=re.DOTALL)
    
    # Remove extra whitespace and newlines
    content = re.sub(r'\s+', ' ', content)
    content = content.strip()
    
    return content

In [None]:
# Apply text post-processing
results_df["extracted_meaningful_text"] = results_df["md_extraction_result"].apply(extract_meaningful_text)

# Save the results to a CSV file
output_file = os.path.join(output_dir, "processed_pdfs.csv")
results_df.to_csv(output_file, index=False)

print(f"Results saved to: {output_file}")

# Display a sample of the extracted text
for idx, row in results_df.head(1).iterrows():
    print(f"Sample extracted text from {os.path.basename(row['pdf_path'])}:")
    print("-" * 80)
    print(row["extracted_meaningful_text"][:500] + "...")
    print("-" * 80)
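
The cleaning function deletes bold spans with the pattern `\*\*.*?\*\*`, which removes the bolded words along with the markers. Whether that is desirable depends on the corpus, so here is a small sketch on a made-up sample string that compares deleting the span with keeping its text. The sample and variable names are ours, chosen only for illustration.

```python
import re

sample = "## Results\nThe **main finding** is stable.\n"

# Strip heading markers of any level (1 to 6 '#') but keep the heading text
no_headers = re.sub(r"^#{1,6}\s*", "", sample, flags=re.MULTILINE)

# Option A: delete bold spans entirely (the behaviour of the function above)
dropped = re.sub(r"\*\*.*?\*\*", "", no_headers)

# Option B: keep the bold text and remove only the ** markers
kept = re.sub(r"\*\*(.*?)\*\*", r"\1", no_headers)

print(repr(dropped))
print(repr(kept))
```

Option A leaves a gap in the sentence, while Option B keeps it readable. Swapping the replacement for `r"\1"` is the only change needed if you prefer to retain emphasized text.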

## Conclusion

This notebook has demonstrated a simplified version of the web-scale PDF processing pipeline, focusing on the core components:

1. **PDF Collection & Filtering**: Finding and filtering PDFs based on page count
2. **Quality Filtering**: Multiple approaches including rule-based and LLM-based filtering (both API and open source models)
3. **OCR Text Extraction**: Using Marker for high-quality text extraction with GPU acceleration
4. **Text Post-processing**: Cleaning the extracted text for downstream use

The full pipeline in the `web_scale_pdf_processing_pipeline` directory includes additional components for distributed processing using Apache Spark and SLURM, which enables scaling to thousands or millions of PDFs.

### Key Differences from Full Pipeline

- **Distribution**: This simplified version runs on a single machine with one GPU, while the full pipeline can distribute work across multiple nodes and GPUs
- **Parallelism**: Our example processes PDFs sequentially, while the full pipeline uses Spark for parallel processing
- **Scale**: This example is designed for educational purposes with a small number of PDFs, while the full pipeline can handle web-scale datasets
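
Between fully sequential processing and a Spark cluster there is a middle ground: per-file steps such as page counting are independent of each other, so a single machine can overlap them with the standard library. The sketch below is an assumption-laden illustration, not the full pipeline's method; `count_pages_stub` is a stand-in for `get_pdf_page_count` so the example runs without any PDF files, and `map_in_parallel` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor

def count_pages_stub(pdf_path):
    """Stand-in for get_pdf_page_count so the sketch runs without any PDFs."""
    return len(pdf_path)  # placeholder "page count"

def map_in_parallel(func, items, max_workers=4):
    """Apply func to every item using a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))

paths = ["a.pdf", "report.pdf", "thesis_final.pdf"]
page_counts = map_in_parallel(count_pages_stub, paths)
print(list(zip(paths, page_counts)))
```

Threads mostly help with I/O-bound work such as reading files from disk; for CPU-heavy parsing, `ProcessPoolExecutor` offers the same `map` interface. GPU-bound OCR with Marker would not speed up this way on a single GPU.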

### Tips for Quality Filtering

- **Rule-based filtering** is fast and requires no API keys or model downloads, but is less accurate
- **OpenAI API filtering** provides high-quality results but requires an API key and has usage costs
- **Open source model filtering** with Llama-3 offers a balance between quality and cost, but requires GPU resources

For production use, you would want to use the full pipeline with proper distribution and parallelism for efficiency at scale.

## 6. Multilingual Translation of Extracted Text

After extracting text from PDFs, you might need to translate content between languages to make it more accessible or to standardize your dataset. This section demonstrates two approaches:

1. **Google Translate API**: Cloud-based translation service with high quality and broad language support
2. **Meta's NLLB Model**: Open-source multilingual model that can translate between 200+ languages locally

Both methods have their advantages:
- Google Translate API is simple to use but requires an API key and has usage costs
- NLLB is free to use and works offline, but requires more computational resources

Let's see how to implement both approaches.

In [None]:
# Install additional required packages
!pip install google-cloud-translate langdetect transformers sentencepiece protobuf sacremoses

In [None]:
# Import translation-related libraries
import langdetect
from google.cloud import translate_v2 as translate
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

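# Repeated from earlier so this section can run on its own
import torch
import re
import os
import warnings
warnings.filterwarnings('ignore')  # Silences all warnings for readability; remove when debugging

# Make langdetect deterministic (its results can otherwise vary between runs)
langdetect.DetectorFactory.seed = 0

# Function to detect language
def detect_language(text):
    """Detect the language of a text using langdetect."""
    try:
        return langdetect.detect(text)
    except Exception:
        # langdetect raises an exception for empty or undetectable text
        return "unknown"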

### 6.1 Google Translate API Implementation

Google Cloud Translation provides a simple, scalable API for translating text with high quality results. It supports over 100 languages and is reliable for production use.

**Note**: You'll need a Google Cloud account with the Cloud Translation API enabled. The code below assumes you've set up authentication via environment variables or service account credentials.

In [None]:
def translate_with_google(text, target_language='en', source_language=None):
    """
    Translate text using Google Cloud Translation API.
    
    Args:
        text (str): Text to translate
        target_language (str): Target language code (e.g., 'en', 'fr', 'zh-CN')
        source_language (str, optional): Source language code. If None, Google will auto-detect
    
    Returns:
        str: Translated text
    """
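    # Check if text is too long for a single request (100K characters is the threshold
    # used here; confirm it against the current Cloud Translation quotas)
    if len(text) > 100000:
        # Split into smaller chunks (for simplicity, this splits by periods)
        chunks = text.split('. ')
        if len(chunks) <= 50:
            # Too few sentence boundaries for the grouping below to shorten the text:
            # fall back to fixed-size slices so the recursion always terminates
            slices = [text[i:i+50000] for i in range(0, len(text), 50000)]
            return ' '.join(translate_with_google(s, target_language, source_language) for s in slices)
        translated_chunks = []
        
        for i in range(0, len(chunks), 50):  # Process 50 sentences at a time
            chunk_text = '. '.join(chunks[i:i+50]) + ('.' if i+50 < len(chunks) else '')
            result = translate_with_google(chunk_text, target_language, source_language)
            translated_chunks.append(result)
            
        return ' '.join(translated_chunks)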
    
    try:
        # Initialize the Google Translate client
        # Note: This assumes you've set up authentication via environment variables
        # or a service account. You may need to adjust based on your setup.
        translate_client = translate.Client()
        
        # Perform the translation (source_language=None lets Google auto-detect).
        # format_="text" asks for plain text back rather than HTML-escaped output.
        result = translate_client.translate(
            text,
            target_language=target_language,
            source_language=source_language,
            format_="text"
        )
        
        # Return the translated text
        return result['translatedText']
    
    except Exception as e:
        print(f"Error in translation: {e}")
        return "Error: Translation failed"

In [None]:
# Example of using Google Translate API
# Set your Google Cloud credentials (if not using a service account)
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your/credentials.json"

# Sample translation with Google API
# Note: This cell calls the API, so set up your credentials before running it

sample_text = results_df["extracted_meaningful_text"].iloc[0][:1000]  # First 1000 chars of first document
detected_lang = detect_language(sample_text)
print(f"Detected language: {detected_lang}")

# Translate to English if not already English
if detected_lang != "en" and detected_lang != "unknown":
    translated_text = translate_with_google(sample_text, target_language="en", source_language=detected_lang)
    print(f"Original text ({detected_lang}): {sample_text[:200]}...")
    print(f"Translated text (en): {translated_text[:200]}...")
else:
    print("Sample text is already in English or language couldn't be detected.")
    # Example of translating English to French
    # translated_text = translate_with_google(sample_text, target_language="fr", source_language="en")
    # print(f"Original text (en): {sample_text[:200]}...")
    # print(f"Translated text (fr): {translated_text[:200]}...")
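
Splitting on periods alone gives chunks of unpredictable size, and a text without periods is never split at all. As a hedged alternative, the sketch below packs sentences into chunks under a character budget and hard-slices any single sentence that exceeds it. `split_by_char_limit` is a hypothetical helper, and the 5000-character default is an illustrative value rather than a documented API limit.

```python
import re

def split_by_char_limit(text, max_chars=5000):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-slice any single sentence that is longer than the limit
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

demo_chunks = split_by_char_limit("One. Two! Three? " + "x" * 12, max_chars=10)
print(demo_chunks)
```

Each chunk can then be passed to `translate_with_google` separately and the results joined with spaces, with no recursion needed.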

### 6.2 Meta's NLLB (No Language Left Behind) Model Implementation

NLLB is an open-source machine translation model developed by Meta AI that can translate between 200+ languages. It's particularly useful for:

- **Low-resource languages** that might not be well-supported by commercial services
- **Offline translation** without requiring API calls
- **Cost-free translation** for large volumes of text

The model is available on Hugging Face in several sizes (distilled 600M, 1.3B, 3.3B). The loader below defaults to the 3.3B-parameter version, while the demo cell uses the distilled 600M model to keep memory requirements low.

In [None]:
def load_nllb_model(model_name="facebook/nllb-200-3.3B", device=None):
    """
    Load the NLLB model and tokenizer.
    
    Args:
        model_name (str): Name or path of the NLLB model to load
        device (str, optional): Device to load the model on ('cuda', 'cpu'). If None, will use CUDA if available.
    
    Returns:
        tuple: (model, tokenizer)
    """
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    
    print(f"Loading NLLB model {model_name} on {device}...")
    
    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Load with lower precision for GPU memory efficiency if using CUDA
    if device == "cuda":
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
    else:
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    
    return model, tokenizer

def translate_with_nllb(text, model, tokenizer, target_language='eng_Latn', source_language=None, device=None):
    """
    Translate text using the NLLB model.
    
    Args:
        text (str): Text to translate
        model: The NLLB model
        tokenizer: The NLLB tokenizer
        target_language (str): Target language code in NLLB format (e.g., 'eng_Latn', 'fra_Latn')
        source_language (str, optional): Source language code. If None, will use langdetect to guess
        device (str, optional): Device to use for translation
        
    Returns:
        str: Translated text
    """
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    
    # If no source language provided, try to detect it
    if source_language is None:
        # Map langdetect codes to NLLB codes (simplified mapping for common languages)
        # Note: langdetect reports Chinese as "zh-cn" / "zh-tw", not "zh"
        lang_map = {
            "en": "eng_Latn", "fr": "fra_Latn", "es": "spa_Latn", "de": "deu_Latn", 
            "zh": "zho_Hans", "zh-cn": "zho_Hans", "zh-tw": "zho_Hant",
            "ja": "jpn_Jpan", "ko": "kor_Hang", "ru": "rus_Cyrl",
            "ar": "arb_Arab", "hi": "hin_Deva", "pt": "por_Latn", "it": "ita_Latn",
        }
        
        detected = detect_language(text)
        source_language = lang_map.get(detected, "eng_Latn")  # Default to English if not found
    
    # Function to split text into manageable chunks to avoid GPU OOM errors
    def chunk_text(text, max_length=800):
        # Split on sentence boundaries so no sentence is cut in the middle.
        # max_length counts whitespace-separated words, not model tokens.
        sentences = re.split(r'(?<=[.!?])\s+', text)
        chunks = []
        current_chunk = []
        current_length = 0
        
        for sentence in sentences:
            sentence_length = len(sentence.split())
            if current_length + sentence_length <= max_length:
                current_chunk.append(sentence)
                current_length += sentence_length
            else:
                if current_chunk:  # Avoid an empty chunk when the first sentence is too long
                    chunks.append(' '.join(current_chunk))
                current_chunk = [sentence]
                current_length = sentence_length
        
        if current_chunk:
            chunks.append(' '.join(current_chunk))
        
        return chunks
    
    # Process text in chunks to avoid OOM errors
    chunks = chunk_text(text)
    translated_chunks = []
    
    for chunk in chunks:
        try:
            # Tell the tokenizer the source language, then tokenize the chunk
            tokenizer.src_lang = source_language
            inputs = tokenizer(chunk, return_tensors="pt").to(device)
            
            # Generate translation
            with torch.no_grad():
                # Set the language we're translating to as forced first token
                forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_language)
                
                # Generate translation
                translated_tokens = model.generate(
                    **inputs, 
                    forced_bos_token_id=forced_bos_token_id,
                    max_length=4096,
                    num_beams=5,
                    length_penalty=1.0
                )
            
            # Decode the translated tokens
            translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
            translated_chunks.append(translated_text)
        
        except Exception as e:
            print(f"Error translating chunk: {e}")
            translated_chunks.append("[Translation Error]")
    
    # Combine the translated chunks
    return ' '.join(translated_chunks)

In [None]:
# Example of using NLLB for translation
# Note: Running this cell downloads and loads the model, which requires
# significant resources, especially VRAM if using the full model

# For educational purposes, you can use a smaller model version like:
# - "facebook/nllb-200-distilled-600M" (smallest, fastest)
# - "facebook/nllb-200-1.3B" (medium size)
# - "facebook/nllb-200-3.3B" (larger, better quality)

# Load the NLLB model (only do this once and reuse for multiple translations)
model, tokenizer = load_nllb_model("facebook/nllb-200-distilled-600M")  # Use smaller model for demo

# Get a sample text
sample_text = results_df["extracted_meaningful_text"].iloc[0][:500]  # First 500 chars of first document

# Detect language with langdetect
detected_lang = detect_language(sample_text)
print(f"Detected language: {detected_lang}")

# Map to NLLB language code
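# (langdetect reports Chinese as "zh-cn" / "zh-tw", not "zh")
nllb_lang_map = {"en": "eng_Latn", "fr": "fra_Latn", "es": "spa_Latn", "de": "deu_Latn", 
                "zh": "zho_Hans", "zh-cn": "zho_Hans", "zh-tw": "zho_Hant",
                "ja": "jpn_Jpan", "ko": "kor_Hang", "ru": "rus_Cyrl"}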
source_lang = nllb_lang_map.get(detected_lang, "eng_Latn")

# Translate to French
translated_text = translate_with_nllb(
    sample_text, 
    model, 
    tokenizer, 
    target_language="fra_Latn",  # French
    source_language=source_lang
)

print(f"Original text ({detected_lang}): {sample_text[:200]}...")
print(f"Translated text (fra_Latn): {translated_text[:200]}...")

# Translate to Spanish
translated_text_es = translate_with_nllb(
    sample_text, 
    model, 
    tokenizer, 
    target_language="spa_Latn",  # Spanish 
    source_language=source_lang
)

print(f"Translated text (spa_Latn): {translated_text_es[:200]}...")
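
The language maps in this section repeat the same entries, and langdetect reports Chinese as `zh-cn` or `zh-tw` rather than `zh`. One way to keep the lookups consistent is a single shared table with a small accessor. The sketch below is illustrative: `LANGDETECT_TO_NLLB` and `to_nllb_code` are hypothetical names, and the table is a subset of the languages NLLB supports.

```python
# Assumed mapping from langdetect-style codes to NLLB codes (illustrative subset)
LANGDETECT_TO_NLLB = {
    "en": "eng_Latn", "fr": "fra_Latn", "es": "spa_Latn", "de": "deu_Latn",
    "zh-cn": "zho_Hans", "zh-tw": "zho_Hant", "ja": "jpn_Jpan", "ko": "kor_Hang",
    "ru": "rus_Cyrl", "ar": "arb_Arab", "hi": "hin_Deva", "pt": "por_Latn",
    "it": "ita_Latn",
}

def to_nllb_code(detected, default=None):
    """Return the NLLB code for a detected language, or `default` if unmapped."""
    return LANGDETECT_TO_NLLB.get(detected.lower(), default)

print(to_nllb_code("zh-cn"))
print(to_nllb_code("unknown"))
```

Returning `None` for unmapped codes, instead of silently assuming English, lets the caller decide whether to skip the document or fall back to another translation method.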

### 6.3 Combined Translation Pipeline

Now let's create a unified translation pipeline that:
1. Automatically detects document language
2. Chooses the appropriate translation method
3. Translates all documents to the target language
4. Updates our dataset with the translated text

This pipeline allows for flexible translation with either the Google Translate API or the NLLB model, depending on your preferences and available resources.

In [None]:
def translate_documents(df, text_column, target_language='en', method='google',
                    nllb_model=None, nllb_tokenizer=None, nllb_model_name="facebook/nllb-200-distilled-600M"):
    """
    Translate documents in a DataFrame to the target language.
    
    Args:
        df (pd.DataFrame): DataFrame containing documents
        text_column (str): Column name containing text to translate
        target_language (str): Target language code 
        method (str): Translation method - 'google', 'nllb', or 'auto'
        nllb_model: Pre-loaded NLLB model (required if method is 'nllb')
        nllb_tokenizer: Pre-loaded NLLB tokenizer (required if method is 'nllb')
        nllb_model_name (str): NLLB model name to load if model not provided
        
    Returns:
        pd.DataFrame: DataFrame with added translated_text column
    """
    # Create a copy of the DataFrame to avoid modifying the original
    result_df = df.copy()
    
    # Add a column for detected language
    result_df['detected_language'] = result_df[text_column].apply(detect_language)
    
    # Map target language code for NLLB if needed
    nllb_target_map = {
        'en': 'eng_Latn', 'fr': 'fra_Latn', 'es': 'spa_Latn', 'de': 'deu_Latn', 
        'zh': 'zho_Hans', 'zh-cn': 'zho_Hans', 'zh-tw': 'zho_Hant',
        'ja': 'jpn_Jpan', 'ko': 'kor_Hang', 'ru': 'rus_Cyrl',
        'ar': 'arb_Arab', 'hi': 'hin_Deva', 'pt': 'por_Latn', 'it': 'ita_Latn'
    }
    
    # Load NLLB model if needed and not provided ('auto' may route documents to NLLB)
    if method in ('nllb', 'auto') and (nllb_model is None or nllb_tokenizer is None):
        nllb_model, nllb_tokenizer = load_nllb_model(nllb_model_name)
    
    # Translate each document
    translated_texts = []
    
    for idx, row in result_df.iterrows():
        text = row[text_column]
        detected_lang = row['detected_language']
        
        # Skip translation if already in target language
        if detected_lang == target_language:
            translated_texts.append(text)
            print(f"Document {idx}: Already in {target_language}, skipping translation")
            continue
        
        print(f"Document {idx}: Translating from {detected_lang} to {target_language}...")
        
        # Choose translation method
        if method == 'auto':
            # Use Google for common languages, NLLB for others
            common_langs = ['en', 'fr', 'es', 'de', 'zh', 'ja', 'pt', 'it']
            chosen_method = 'google' if detected_lang in common_langs else 'nllb'
        else:
            chosen_method = method
        
        # Perform translation
        try:
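            if chosen_method == 'google':
                # Let Google auto-detect when langdetect could not identify the language
                google_source = None if detected_lang == 'unknown' else detected_lang
                translated_text = translate_with_google(text, target_language=target_language, 
                                                      source_language=google_source)
            else:  # nllb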
                nllb_source = nllb_target_map.get(detected_lang, 'eng_Latn')
                nllb_target = nllb_target_map.get(target_language, 'eng_Latn')
                
                translated_text = translate_with_nllb(
                    text, nllb_model, nllb_tokenizer,
                    target_language=nllb_target,
                    source_language=nllb_source
                )
            
            translated_texts.append(translated_text)
            print(f"✓ Translation completed using {chosen_method}")
            
        except Exception as e:
            print(f"Error translating document {idx}: {e}")
            translated_texts.append(f"[Translation Error: {str(e)}]")
    
    # Add translated texts to the DataFrame
    result_df['translated_text'] = translated_texts
    
    return result_df

In [None]:
# Run the translation pipeline on a small sample (this calls the selected translation backend)
# Choose your translation method: 'google', 'nllb', or 'auto'
translation_method = 'google'  # Change as needed

# Target language
target_language = 'en'  # Change as needed

# Select a small sample to demonstrate (for educational purposes)
sample_size = min(3, len(results_df))
sample_df = results_df.head(sample_size)

# NLLB model and tokenizer (only needed for 'nllb' or 'auto' methods)
nllb_model = None
nllb_tokenizer = None

if translation_method in ['nllb', 'auto']:
    # For educational purposes, use a smaller model
    nllb_model, nllb_tokenizer = load_nllb_model("facebook/nllb-200-distilled-600M")

# Run the translation pipeline
translated_df = translate_documents(
    sample_df, 
    text_column='extracted_meaningful_text',
    target_language=target_language,
    method=translation_method,
    nllb_model=nllb_model,
    nllb_tokenizer=nllb_tokenizer
)

# Display results
print(f"\nTranslation Results (method: {translation_method}):")
for idx, row in translated_df.iterrows():
    print('-' * 80)
    print(f"Document {idx} | Original language: {row['detected_language']}")
    
    # Show original and translated snippets
    original_text = row['extracted_meaningful_text'][:200] + "..."
    translated_text = row['translated_text'][:200] + "..."
    
    print(f"\nOriginal: {original_text}")
    print(f"\nTranslated ({target_language}): {translated_text}")

# Save the results
translated_csv = os.path.join(output_dir, f"translated_documents_{translation_method}.csv")
translated_df.to_csv(translated_csv, index=False)
print(f"\nSaved translated documents to: {translated_csv}")

## 7. Conclusion and Next Steps

In this notebook, we've demonstrated the complete pipeline for processing educational PDF content:

1. **PDF Collection & Basic Filtering**: Finding PDFs and filtering by page count
2. **Quality Filtering**: Using rule-based and LLM-based approaches
3. **OCR Text Extraction**: Extracting high-quality text with Marker
4. **Text Post-processing**: Cleaning the extracted text for downstream use
5. **Multilingual Translation**: Translating content with Google Translate API or NLLB model

These components can be used as building blocks for creating comprehensive data pipelines for educational content. The full `web_scale_pdf_processing_pipeline` directory contains additional components for distributed processing at web scale.

### Considerations for Production Use

- **GPU Resources**: Marker and NLLB benefit significantly from GPU acceleration
- **API Costs**: Consider API costs when using Google Translate API for large volumes
- **Language Coverage**: NLLB may provide better coverage for low-resource languages
- **Batch Processing**: Process documents in batches for better efficiency
- **Quality Evaluation**: Implement quality checks for both OCR and translation results
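
To make the batch-processing point concrete, here is a minimal sketch of a batching helper. `batched` is our own illustrative function, and the batch size is arbitrary; the right value depends on GPU memory or API request limits.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

documents = ["doc1", "doc2", "doc3", "doc4", "doc5"]
for batch in batched(documents, batch_size=2):
    print(batch)
```

Each batch could then be handed to the OCR or translation step as a unit, which makes it easier to checkpoint progress between batches.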

### Further Improvements

- Implement parallel processing for faster execution
- Add quality metrics for translated content
- Integrate with document databases for storage and retrieval
- Add support for specialized domain terminology
- Implement caching to avoid redundant translations
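
As a sketch of the caching idea, the wrapper below remembers translations by a hash of the input text and the target language, so repeated inputs are translated only once. All names here are hypothetical, and `fake_translate` is a stand-in that records calls instead of contacting a translation service.

```python
import hashlib

def make_cached_translator(translate_fn):
    """Wrap translate_fn so identical (text, target) pairs are translated once."""
    cache = {}

    def cached(text, target_language="en"):
        # Hash the text so the cache key stays small for long documents
        key = (hashlib.sha256(text.encode("utf-8")).hexdigest(), target_language)
        if key not in cache:
            cache[key] = translate_fn(text, target_language)
        return cache[key]

    return cached

calls = []

def fake_translate(text, target_language):
    """Stand-in translator that records calls instead of contacting a service."""
    calls.append(text)
    return f"[{target_language}] {text}"

translate_cached = make_cached_translator(fake_translate)
first = translate_cached("hola", "en")
second = translate_cached("hola", "en")
print(first, len(calls))
```

The cache lives in memory for the lifetime of the wrapper; persisting it to disk or a database would be a natural extension for long-running jobs.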

The combination of high-quality OCR extraction and translation capabilities makes this pipeline versatile for educational content processing across languages and domains.