
In [1]:
english = """
Machine learning is transforming the way we interact with technology. It powers everything from recommendation systems to autonomous vehicles.

## Basic Concepts

Neural networks are inspired by the human brain! They consist of interconnected nodes that process information in layers.

### Types of Learning

Supervised learning requires labeled data.   Multiple spaces    here should be cleaned.

Unsupervised learning finds patterns without labels.

* This is a bullet point.
* x

Semi-supervised learning combines both approaches...

## Advanced Topics

Deep learning has revolutionized computer vision and natural language processing.

Transfer learning allows models to apply knowledge from one domain to another.

Contact: info@example.com
"""

spanish = """
El aprendizaje automático está transformando la forma en que interactuamos con la tecnología. Impulsa todo, desde sistemas de recomendación hasta vehículos autónomos.

## Conceptos Básicos

¡Las redes neuronales están inspiradas en el cerebro humano! Consisten en nodos interconectados que procesan información en capas.

### Tipos de Aprendizaje

El aprendizaje supervisado requiere datos etiquetados.    Múltiples espacios    aquí deben limpiarse.

El aprendizaje no supervisado encuentra patrones sin etiquetas.

* Este es un punto de viñeta.
* x

El aprendizaje semisupervisado combina ambos enfoques...

## Temas Avanzados

El aprendizaje profundo ha revolucionado la visión por computadora y el procesamiento del lenguaje natural.

La transferencia de aprendizaje permite que los modelos apliquen el conocimiento de un dominio a otro.

Contacto: info@example.com
"""

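# Save the English sample as english.txt (explicit UTF-8 so the result does not depend on the platform default)
with open('english.txt', 'w', encoding='utf-8') as f:
    f.write(english)

# Save the Spanish sample as spanish.txt (contains non-ASCII characters)
with open('spanish.txt', 'w', encoding='utf-8') as f: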
    f.write(spanish)
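
The aligner defined in the next cell scores each candidate block pairing by character length alone. As a minimal sketch of that scoring, the snippet below reproduces the notebook's cost, log(1 + delta^2), as a standalone function. The name `length_cost` and its parameter names are illustrative assumptions, not part of the notebook's classes. This cost is a simplification: the original Gale-Church formulation turns delta into a probability under a normal distribution, whereas the notebook uses this simpler penalty that grows with the length mismatch.

```python
import math

def length_cost(source_len: int, target_len: int,
                mean_ratio: float = 1.0, variance: float = 6.8) -> float:
    """Illustrative length-mismatch cost; lower means a more plausible pairing."""
    if source_len <= 0:
        return float("inf")  # an empty source block cannot be scored
    # Standardized difference between the observed and expected target length
    delta = (target_len - source_len * mean_ratio) / math.sqrt(source_len * variance)
    return math.log(1 + delta * delta)

# Blocks of similar length are cheap to pair; mismatched lengths are penalized
print(length_cost(100, 100))
print(length_cost(100, 105))
print(length_cost(100, 200))
```

Because the cost depends only on lengths, it cannot tell apart two sentences of equal length with different meanings; the dynamic program relies on the ordering of sentences to resolve such cases.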

In [5]:
import json
import math
import chardet
import spacy
from typing import List, Tuple, Dict
from dataclasses import dataclass
from pathlib import Path
from collections import defaultdict

class FileValidationError(Exception):
    pass

@dataclass
class SentencePair:
    source: str
    target: str
    source_index: List[int]
    target_index: List[int]
    alignment_score: float = 0.0

class GaleChurchAligner:
    # Gale-Church length-model constants:
    # expected number of target characters per source character, and its variance
    MEAN_CHARACTERS_RATIO = 1
    VARIANCE_CHARACTERS_RATIO = 6.8

    def __init__(self):
        print("Initializing Gale-Church Aligner...")
        self.log_prob_tables = {}

    def char_length_ratio(self, source_len: int, target_len: int) -> float:
        """Return a length-based score for pairing two blocks (0 is best, more negative is worse)."""
        try:
            ratio = (target_len - source_len * self.MEAN_CHARACTERS_RATIO) / \
                    math.sqrt(source_len * self.VARIANCE_CHARACTERS_RATIO)
            return -math.log(1 + ratio * ratio)
        except (ValueError, ZeroDivisionError):
            return float('-inf')

    def align_blocks(
        self,
        source_sents: List[str],
        target_sents: List[str]
    ) -> List[Tuple[List[int], List[int], float]]:
        print(f"Starting alignment of {len(source_sents)} source and {len(target_sents)} target sentences...")

        n, m = len(source_sents), len(target_sents)

        # Initialize DP tables
        dp = defaultdict(lambda: float('inf'))
        dp[0, 0] = 0
        back = {}

        # Alignment patterns (1-1, 1-2, 2-1, 2-2)
        patterns = [(1,1), (1,2), (2,1), (2,2)]

        # Progress tracking
        total_steps = (n + 1) * (m + 1)
        current_step = 0

        print("Computing optimal alignments...")
        # Fill DP table
        for i in range(n + 1):
            for j in range(m + 1):
                current_step += 1
                if current_step % 100 == 0:
                    print(f"Progress: {current_step}/{total_steps} steps ({(current_step/total_steps)*100:.1f}%)")

                if i == 0 and j == 0:
                    continue

                for si, ti in patterns:
                    if i >= si and j >= ti:
                        source_block = source_sents[i-si:i]
                        target_block = target_sents[j-ti:j]
                        source_len = sum(len(s) for s in source_block)
                        target_len = sum(len(t) for t in target_block)

                        if source_len and target_len:
                            cost = -self.char_length_ratio(source_len, target_len)
                            if dp[i-si, j-ti] + cost < dp[i, j]:
                                dp[i, j] = dp[i-si, j-ti] + cost
                                back[i, j] = (si, ti)

        print("Reconstructing alignments...")
        alignments = []
        i, j = n, m
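        while i > 0 or j > 0:
            # Without a backpointer, fall back to a 1-1 step, clamped so that
            # indices never go negative once one side is exhausted
            si, ti = back.get((i, j), (min(1, i), min(1, j)))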
            source_indices = list(range(i-si, i))
            target_indices = list(range(j-ti, j))
            score = dp[i, j] - dp[i-si, j-ti]
            alignments.append((source_indices, target_indices, score))
            i, j = i-si, j-ti

        print(f"Found {len(alignments)} alignments")
        return list(reversed(alignments))

class SentenceAligner:
    def __init__(self):
        print("Initializing Sentence Aligner...")
        # Create blank spaCy models for both languages
        self.source_nlp = spacy.blank("es")  # Spanish
        self.target_nlp = spacy.blank("en")  # English

        # Add the sentencizer to both models
        self.source_nlp.add_pipe("sentencizer")
        self.target_nlp.add_pipe("sentencizer")

        self.gale_church = GaleChurchAligner()
        print("Initialization complete")

    @staticmethod
    def detect_encoding(file_path: Path) -> str:
        print(f"Detecting encoding for: {file_path}")
        with open(file_path, 'rb') as f:
            raw_data = f.read()
        result = chardet.detect(raw_data)
        print(f"Detected encoding: {result['encoding']} (confidence: {result['confidence']:.2f})")
        return result['encoding']

    @staticmethod
    def validate_file_contents(text: str) -> bool:
        print("Validating file contents...")
        if not text.strip():
            raise FileValidationError("File is empty or contains only whitespace")
        if len(text) > 10_000_000:
            raise FileValidationError("File exceeds size limit")
        print("File validation successful")
        return True

    def clean_text(self, text: str) -> str:
        """Apply whitespace cleaning rules to text."""
        # Replace multiple whitespace characters with single space
        return " ".join(text.split())

    def is_valid_sentence(self, sentence: str) -> bool:
        """Check if sentence meets inclusion criteria."""
        sentence = self.clean_text(sentence)

        words = sentence.split()
        if len(words) <= 1 or len(words) > 50:
            return False

        try:
            alphanumeric_chars = sum(c.isalnum() for c in sentence)
            if alphanumeric_chars / len(sentence) < 0.01:
                return False
        except ZeroDivisionError:
            return False

        return True

    def tokenize_sentences(self, text: str, is_source: bool = True) -> List[str]:
        """Split text into sentences using spaCy."""
        print("\nTokenizing sentences...")
        print(f"Input text length: {len(text)} characters")

        # Clean the text
        text = self.clean_text(text)
        print(f"Cleaned text length: {len(text)} characters")

        # Use appropriate model based on source/target
        nlp = self.source_nlp if is_source else self.target_nlp

        # Process the text and get sentences
        doc = nlp(text)
        sentences = [str(sent).strip() for sent in doc.sents]

        # Debug output
        print(f"Found {len(sentences)} sentences")
        if sentences:
            print("\nFirst few sentences found:")
            for i, sent in enumerate(sentences[:3]):
                print(f"{i+1}. {sent}")
        else:
            print("WARNING: No sentences were found!")
            print("Text sample:", text[:100], "...")

        return sentences

    def align_sentences(
        self,
        source_text: str,
        target_text: str
    ) -> Tuple[List[Dict], List[Dict]]:
        try:
            print("\nStarting sentence alignment process...")

            # Tokenize using appropriate language models
            source_sentences = self.tokenize_sentences(source_text, is_source=True)
            target_sentences = self.tokenize_sentences(target_text, is_source=False)

            alignments = self.gale_church.align_blocks(source_sentences, target_sentences)

            aligned_pairs = []
            excluded_pairs = []

            print("\nProcessing aligned pairs...")
            for source_indices, target_indices, score in alignments:
                source_block = [source_sentences[i] for i in source_indices]
                target_block = [target_sentences[i] for i in target_indices]

                pair_dict = {
                    "source": " ".join(source_block),
                    "target": " ".join(target_block),
                    "source_index": source_indices,
                    "target_index": target_indices,
                    "alignment_score": score
                }

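                # Exclude pairs with an empty side or any sentence that fails validation
                should_exclude = (
                    not source_block
                    or not target_block
                    or any(
                        not self.is_valid_sentence(sent)
                        for sent in source_block + target_block
                    )
                )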

                if should_exclude:
                    excluded_pairs.append(pair_dict)
                else:
                    aligned_pairs.append(pair_dict)

            print("\nAlignment complete:")
            print(f"- Aligned pairs: {len(aligned_pairs)}")
            print(f"- Excluded pairs: {len(excluded_pairs)}")

            return aligned_pairs, excluded_pairs

        except Exception as e:
            print(f"Error in sentence alignment: {str(e)}")
            raise

def process_files(source_path: str, target_path: str, output_dir: str) -> Tuple[Path, Path]:
    """Process source and target files with detailed progress output."""
    try:
        print("\n=== Starting File Processing ===")

        # Convert to Path objects
        source_path = Path(source_path)
        target_path = Path(target_path)
        output_dir = Path(output_dir)

        # Create output directory
        output_dir.mkdir(parents=True, exist_ok=True)
        print(f"Created output directory: {output_dir}")

        # Detect and read files
        source_encoding = SentenceAligner.detect_encoding(source_path)
        target_encoding = SentenceAligner.detect_encoding(target_path)

        print("\nReading input files...")
        with open(source_path, 'r', encoding=source_encoding) as f:
            source_text = f.read()
            print(f"Read source file: {len(source_text):,} characters")

        with open(target_path, 'r', encoding=target_encoding) as f:
            target_text = f.read()
            print(f"Read target file: {len(target_text):,} characters")

        # Validate contents
        SentenceAligner.validate_file_contents(source_text)
        SentenceAligner.validate_file_contents(target_text)

        # Process texts
        aligner = SentenceAligner()
        aligned_pairs, excluded_pairs = aligner.align_sentences(source_text, target_text)

        # Write output files
        print("\nWriting output files...")
        aligned_path = output_dir / 'aligned.jsonl'
        excluded_path = output_dir / 'excluded.jsonl'

        with open(aligned_path, 'w', encoding='utf-8') as f:
            for pair in aligned_pairs:
                f.write(json.dumps(pair, ensure_ascii=False) + '\n')
        print(f"Written {len(aligned_pairs)} pairs to {aligned_path}")

        with open(excluded_path, 'w', encoding='utf-8') as f:
            for pair in excluded_pairs:
                f.write(json.dumps(pair, ensure_ascii=False) + '\n')
        print(f"Written {len(excluded_pairs)} pairs to {excluded_path}")

        print("\n=== File Processing Complete ===")
        return aligned_path, excluded_path

    except Exception as e:
        print(f"\nError: {str(e)}")
        raise

# Example usage in Jupyter notebook
if __name__ == "__main__":
    try:
        # Replace these with your actual file paths
        source_file = "spanish.txt"
        target_file = "english.txt"
        output_directory = "output"

        print("\nProcessing files:")
        print(f"Source: {source_file}")
        print(f"Target: {target_file}")
        print(f"Output: {output_directory}")

        aligned_file, excluded_file = process_files(
            source_file,
            target_file,
            output_directory
        )

        print("\nSuccess!")
        print(f"Aligned sentences: {aligned_file}")
        print(f"Excluded sentences: {excluded_file}")

    except Exception as e:
        print(f"\nFailed to process files: {str(e)}")
        raise



Processing files:
Source: spanish.txt
Target: english.txt
Output: output

=== Starting File Processing ===
Created output directory: output
Detecting encoding for: spanish.txt
Detected encoding: utf-8 (confidence: 0.99)
Detecting encoding for: english.txt
Detected encoding: ascii (confidence: 1.00)

Reading input files...
Read source file: 870 characters
Read target file: 747 characters
Validating file contents...
File validation successful
Validating file contents...
File validation successful
Initializing Sentence Aligner...
Initializing Gale-Church Aligner...
Initialization complete

Starting sentence alignment process...

Tokenizing sentences...
Input text length: 870 characters
Cleaned text length: 851 characters
Found 11 sentences

First few sentences found:
1. El aprendizaje automático está transformando la forma en que interactuamos con la tecnología.
2. Impulsa todo, desde sistemas de recomendación hasta vehículos autónomos. ##
3. Conceptos Básicos ¡Las redes neuronales están