# Czech Language Adaptation of Gemma Language Model

**Author:** Jirka Helmich

**Last Updated:** 2025-01-06

**License:** MIT

## Overview

This notebook prepares the training data for fine-tuning the Gemma language model on Czech language understanding and generation. It downloads and cleans two Czech datasets and converts them to the Alpaca instruction format; fine-tuning and evaluation are listed under Next Steps.

### Key Objectives

1. 🎯 **Primary Goal**: Adapt Gemma for superior Czech language processing
2. 🔄 **Tasks**: Translation, sentiment analysis, text generation
3. 📊 **Evaluation**: Comprehensive benchmarking on Czech-specific metrics

### Technical Requirements

```
Python >= 3.10
polars >= 0.20.0
datasets >= 2.15.0
tqdm >= 4.66.0
```

### Dataset Sources

We utilize multiple high-quality Czech datasets:

1. **ParaCrawl v9**
   - Parallel corpus for EN-CS translation
   - ~52M sentence pairs
   - [Source](https://paracrawl.eu/v9)

2. **Czech Books Descriptions**
   - Book descriptions in Czech
   - [Source](https://huggingface.co/datasets/vojtam/czech_books_descriptions)

## Environment Setup

First, install the required dependencies. The specifiers below set minimum versions; pin exact versions with `==` if you need a fully reproducible environment.

In [None]:
%pip install "datasets>=2.15.0" "polars>=0.20.0" "tqdm>=4.66.0"
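
After installing, it can help to confirm that the environment meets the minimums from the Technical Requirements list. The following is a minimal sketch using only the standard library; the helper names `as_tuple` and `check_minimums` are illustrative and not part of the original notebook, and the version comparison is deliberately simple (it looks only at the first three numeric components).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the Technical Requirements list.
MINIMUMS = {"polars": "0.20.0", "datasets": "2.15.0", "tqdm": "4.66.0"}


def as_tuple(v: str) -> tuple:
    """Turn '2.15.0' into (2, 15, 0), padding short versions with zeros."""
    nums = [int(n) for n in re.findall(r"\d+", v)[:3]]
    return tuple(nums + [0] * (3 - len(nums)))


def check_minimums(minimums: dict) -> dict:
    """Return {package: status} with status 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = as_tuple(installed) >= as_tuple(minimum)
        report[name] = "ok" if ok else "too old"
    return report


print(check_minimums(MINIMUMS))
```

Any package reported as `missing` or `too old` should be reinstalled before running the cells below.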

## Data Processing Components

### 1. ParaCrawl Dataset Loader

The `ParaCrawlDataLoader` class handles downloading and processing of the ParaCrawl translation dataset. Key features:

- ✨ Automatic download with progress tracking
- 🔍 Validation of the downloaded file (readable gzip, tab-separated lines)
- 📊 On-the-fly decompression and efficient processing using Polars
- 💾 Caching of processed data as Parquet

### Dependencies

In [2]:
import polars as pl
from pathlib import Path
import urllib.request
import gzip
import logging
from tqdm import tqdm
from typing import Optional

# Configure logging
logging.basicConfig(level=logging.INFO)

### ParaCrawl Data Loader Class

The main class implementation with detailed documentation:

In [3]:
class ParaCrawlDataLoader:
    """Handles downloading and processing of ParaCrawl translation datasets."""

    def __init__(
        self,
        source_lang: str = "en",
        target_lang: str = "cs",
        data_dir: Optional[str] = None,
        cache_dir: Optional[str] = None
    ):
        """Initialize the ParaCrawl data loader."""
        self.source_lang = source_lang
        self.target_lang = target_lang
        self.base_url = "https://web-language-models.s3.amazonaws.com/paracrawl/release9"

        # Setup directories
        self.data_dir = Path(data_dir or "./data")
        self.cache_dir = Path(cache_dir or "./cache")
        self.data_dir.mkdir(parents=True, exist_ok=True)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

        self.logger = logging.getLogger(__name__)

        # Construct file paths
        self.filename = f"{source_lang}-{target_lang}.txt.gz"
        self.filepath = self.data_dir / self.filename
        self.processed_path = self.cache_dir / f"{source_lang}-{target_lang}.parquet"

### Download and Validation Methods

Methods for downloading data with progress tracking and validation:

In [4]:
def _download_with_progress(self, url: str, filepath: Path) -> None:
    """Download a file with progress bar."""
    try:
        # Open the URL once and stream the body to disk in 1 MiB chunks
        with urllib.request.urlopen(url) as response:
            # Content-Length can be missing; tqdm then shows a bar without a total
            content_length = response.headers.get("Content-Length")
            total_size = int(content_length) if content_length else None
            self.logger.info(f"Total size: {total_size}")
            with open(filepath, "wb") as out, tqdm(
                total=total_size, unit="B", unit_scale=True,
                desc=f"Downloading {filepath.name}",
            ) as pbar:
                while chunk := response.read(1024 * 1024):
                    out.write(chunk)
                    pbar.update(len(chunk))
    except Exception as e:
        self.logger.error(f"Error downloading file: {e}")
        raise

def _validate_file(self, filepath: Path) -> bool:
    """Validate downloaded file integrity."""
    if not filepath.exists():
        return False
        
    try:
        with gzip.open(filepath, 'rt', encoding='utf-8') as f:
            # Each of the first five lines must hold a tab-separated sentence pair
            for _ in range(5):
                line = f.readline()
                if '\t' not in line:
                    return False
        return True
    except Exception:
        return False

ParaCrawlDataLoader._download_with_progress = _download_with_progress
ParaCrawlDataLoader._validate_file = _validate_file
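
To see what the validation step accepts and rejects, here is a small standalone sketch of the same check run against two temporary files. The function name `looks_like_tsv_gz` and the sample sentences are made up for illustration; only the standard library is used.

```python
import gzip
import tempfile
from pathlib import Path


def looks_like_tsv_gz(path: Path, n_lines: int = 5) -> bool:
    """Return True if the first n_lines of a gzip file each contain a tab."""
    try:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for _ in range(n_lines):
                if "\t" not in f.readline():
                    return False
        return True
    except (OSError, EOFError, UnicodeDecodeError):
        # Not a gzip file, truncated archive, or not UTF-8 text
        return False


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.txt.gz"
    bad = Path(tmp) / "bad.txt.gz"
    with gzip.open(good, "wt", encoding="utf-8") as f:
        f.write("Hello\tAhoj\n" * 5)
    bad.write_bytes(b"not a gzip file")
    good_ok = looks_like_tsv_gz(good)
    bad_ok = looks_like_tsv_gz(bad)

print(good_ok, bad_ok)
```

Note that this style of check rejects files with fewer than five lines, because `readline()` returns an empty string at end of file. That is harmless for a corpus of ParaCrawl's size but worth remembering when testing with tiny fixtures.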

### Data Processing Methods

Methods for processing and loading the data:

In [5]:
def _process_raw_file(self) -> None:
    """Process raw gzipped file into Parquet format."""
    if self.processed_path.exists():
        self.logger.info("Using cached processed data")
        return

    self.logger.info("Processing raw data file...")

    chunk_size = 100_000
    chunks = []

    with gzip.open(self.filepath, "rt", encoding="utf-8") as f:
        with tqdm(desc="Processing chunks") as pbar:
            while True:
                lines = [next(f, None) for _ in range(chunk_size)]
                lines = [line for line in lines if line is not None]

                if not lines:
                    break

                pairs = [line.strip().split("\t") for line in lines]
                # Filter out invalid pairs
                pairs = [p for p in pairs if len(p) == 2]

                if not pairs:
                    continue

                # Pre-filter by length before creating DataFrame
                pairs = [
                    p
                    for p in pairs
                    if (0 < len(p[0]) < 1000 and 0 < len(p[1]) < 1000)
                ]

                if not pairs:
                    continue

                chunk_df = pl.DataFrame(
                    pairs,
                    schema=[self.source_lang, self.target_lang],
                    orient="row",  # Explicitly specify orientation
                )

                if len(chunk_df) > 0:
                    chunks.append(chunk_df)
                pbar.update(1)

    if not chunks:
        raise ValueError("No valid data found in the input file")

    df = pl.concat(chunks)
    df.write_parquet(self.processed_path)
    self.logger.info(f"Processed data saved to {self.processed_path}")


ParaCrawlDataLoader._process_raw_file = _process_raw_file
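
The parsing and filtering logic inside `_process_raw_file` can be tried out without Polars or the real corpus. The sketch below reproduces the same rules (exactly two tab-separated fields, each between 1 and 999 characters) on an in-memory list; `iter_pair_chunks` is a hypothetical helper name, and `itertools.islice` is used here as one possible way to read a fixed number of lines per chunk.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def iter_pair_chunks(
    lines: Iterable[str], chunk_size: int = 100_000, max_chars: int = 1000
) -> Iterator[List[Tuple[str, str]]]:
    """Yield lists of valid (source, target) pairs, chunk_size lines at a time."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        pairs = []
        for line in chunk:
            fields = line.strip().split("\t")
            if len(fields) != 2:
                continue  # malformed line: not exactly source<TAB>target
            source, target = fields
            if 0 < len(source) < max_chars and 0 < len(target) < max_chars:
                pairs.append((source, target))
        if pairs:
            yield pairs


sample = [
    "Hello\tAhoj\n",
    "broken line\n",
    "a\tb\tc\n",
    "Good day\tDobrý den\n",
    "\tjen cíl\n",
]
chunks = list(iter_pair_chunks(sample, chunk_size=2))
print(chunks)
```

Lines without a tab, with more than one tab, or with an empty side are dropped, and a chunk that ends up empty is skipped entirely.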

### Public Interface Methods

Methods for downloading and loading the dataset:

In [6]:
def download_data(self) -> None:
    """Download ParaCrawl dataset if not already present."""
    if self.filepath.exists() and self._validate_file(self.filepath):
        self.logger.info("Using existing download")
        return
        
    url = f"{self.base_url}/{self.source_lang}-{self.target_lang}/{self.filename}"
    self.logger.info(f"Downloading from {url}")
    
    self._download_with_progress(url, self.filepath)
    
    if not self._validate_file(self.filepath):
        raise ValueError("Downloaded file appears to be corrupt")

def load_dataframe(self) -> pl.DataFrame:
    """Load the processed ParaCrawl dataset."""
    self.download_data()
    self._process_raw_file()
    
    df = pl.read_parquet(self.processed_path)
    self.logger.info(f"Loaded {len(df):,} translation pairs")
    
    return df

def get_sample(self, n: int = 5) -> pl.DataFrame:
    """Get a sample of n translation pairs."""
    df = self.load_dataframe()
    return df.sample(n)

ParaCrawlDataLoader.download_data = download_data
ParaCrawlDataLoader.load_dataframe = load_dataframe
ParaCrawlDataLoader.get_sample = get_sample

### 2. Alpaca Format Converter Implementation

This section implements a converter that transforms datasets into the Alpaca instruction format (`instruction`, `input`, `output` records), which is widely used for instruction fine-tuning. Key features:

- 🔄 Flexible input handling
- 📝 Customizable instruction templates
- 💾 Efficient JSONL output
- ✨ Data validation and cleaning
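
For orientation, this is roughly what one line of the JSONL output looks like. The sentence pair is made up for illustration, and the instruction string mirrors the default translation template defined in the converter class of this notebook.

```python
import json

# One Alpaca-style record; the sentence pair is an illustrative example.
record = {
    "instruction": "Přelož tento text z en do cs",
    "input": "Good morning.",
    "output": "Dobré ráno.",
}

# ensure_ascii=False keeps Czech diacritics readable in the file
line = json.dumps(record, ensure_ascii=False)
print(line)

# The default (ensure_ascii=True) would escape them as \uXXXX sequences
escaped = json.dumps(record)
print(escaped)
```

Both forms parse back to the same record; `ensure_ascii=False` only affects how readable the file is.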

### Dependencies

In [7]:
import polars as pl
import json
from pathlib import Path
from typing import Optional, Dict, List, Union
from tqdm.auto import tqdm
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

### Alpaca Data Converter Class

Main class for converting datasets to Alpaca instruction format:

In [8]:
class AlpacaConverter:
    """Converts datasets to Alpaca instruction format for fine-tuning.
    
    This class handles the conversion of various dataset formats into the
    Alpaca instruction format, which is suitable for fine-tuning language models.
    """
    
    def __init__(
        self,
        instruction_templates: Optional[Dict[str, str]] = None,
        max_length: int = 2048,
        min_length: int = 3
    ):
        """Initialize the Alpaca converter.
        
        Args:
            instruction_templates: Dictionary of task types to instruction templates
            max_length: Maximum length of input/output text
            min_length: Minimum length of input/output text
        """
        self.instruction_templates = instruction_templates or {
            'translation': "Přelož tento text z {source_lang} do {target_lang}",
            'book_description': "Popiš tuto knihu",
        }
        self.max_length = max_length
        self.min_length = min_length
        self.logger = logging.getLogger(__name__)

### Data Validation Methods

Methods for validating and cleaning input data:

In [9]:
def _validate_text(self, text: str) -> bool:
    """Validate text length and content.
    
    Args:
        text: Input text to validate
        
    Returns:
        bool: True if text is valid
    """
    if not isinstance(text, str):
        return False
        
    # After stripping, a whitespace-only string has length 0,
    # so the length bounds are the only check needed
    length = len(text.strip())

    return self.min_length <= length <= self.max_length

def _clean_text(self, text: str) -> str:
    """Clean and normalize text.
    
    Args:
        text: Input text to clean
        
    Returns:
        str: Cleaned text
    """
    return " ".join(text.strip().split())

AlpacaConverter._validate_text = _validate_text
AlpacaConverter._clean_text = _clean_text
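
A short worked example shows how validation and cleaning behave on typical inputs. The functions below are standalone copies of the same logic, written as a sketch so they can run outside the class; the names `is_valid` and `clean_text` and the sample strings are illustrative.

```python
def clean_text(text: str) -> str:
    """Collapse all runs of whitespace (including newlines) to single spaces."""
    return " ".join(text.strip().split())


def is_valid(text, min_length: int = 3, max_length: int = 2048) -> bool:
    """Accept only strings whose stripped length lies within the bounds."""
    if not isinstance(text, str):
        return False
    return min_length <= len(text.strip()) <= max_length


examples = ["  Dobrý   den,\n světe!  ", "ok", None, "   "]
results = [
    (is_valid(t), clean_text(t) if isinstance(t, str) else None)
    for t in examples
]
for original, result in zip(examples, results):
    print(repr(original), "->", result)
```

The first string is valid and is normalised to `Dobrý den, světe!`; the two-character string, the `None` value, and the whitespace-only string are all rejected.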

### Format Conversion Methods

Core methods for converting data to Alpaca format:

In [10]:
def _create_instruction(self, task_type: str, **kwargs) -> str:
    """Create instruction from template.
    
    Args:
        task_type: Type of task (e.g., 'translation')
        **kwargs: Format parameters for instruction template
        
    Returns:
        str: Formatted instruction
    """
    template = self.instruction_templates.get(task_type)
    if not template:
        raise ValueError(f"Unknown task type: {task_type}")
    return template.format(**kwargs)

def _create_example(self,
    instruction: str,
    output: str,
    input_text: Optional[str] = None
) -> Dict[str, str]:
    """Create a single Alpaca format example.
    
    Args:
        instruction: Task instruction
        output: Expected output text
        input_text: Optional input text
        
    Returns:
        Dict[str, str]: Alpaca format example
    """
    example = {
        "instruction": instruction,
        "output": self._clean_text(output)
    }
    
    if input_text:
        example["input"] = self._clean_text(input_text)
        
    return example

AlpacaConverter._create_instruction = _create_instruction
AlpacaConverter._create_example = _create_example
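
With the default template, passing language codes produces the instruction `Přelož tento text z en do cs`. If a more natural Czech instruction is wanted, one option is to map codes to language names before formatting. The sketch below assumes a hypothetical `LANG_NAMES_CS` mapping (Czech genitive forms) and a hypothetical `build_instruction` helper; neither is part of the original converter.

```python
# Hypothetical mapping from language codes to Czech names in the genitive case.
LANG_NAMES_CS = {"en": "angličtiny", "cs": "češtiny"}

# Same template text as the converter's default 'translation' entry.
TEMPLATE = "Přelož tento text z {source_lang} do {target_lang}"


def build_instruction(source_lang: str, target_lang: str,
                      names: dict = LANG_NAMES_CS) -> str:
    """Format the template, falling back to the raw code for unknown languages."""
    return TEMPLATE.format(
        source_lang=names.get(source_lang, source_lang),
        target_lang=names.get(target_lang, target_lang),
    )


print(build_instruction("en", "cs"))
```

Keeping the mapping separate from the DataFrame column names means the columns can stay `en` and `cs` while the instruction reads as ordinary Czech.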

### Public Interface Methods

Methods for converting different types of datasets:

In [11]:
def convert_translations(
    self,
    df: pl.DataFrame,
    source_lang: str,
    target_lang: str, 
    output_path: Union[str, Path]
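) -> None:
    """Convert translation pairs to Alpaca format.

    Note: `source_lang` and `target_lang` are used both as DataFrame column
    names and as the values substituted into the instruction template.
    """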
    instruction = self._create_instruction(
        'translation',
        source_lang=source_lang,
        target_lang=target_lang
    )
    
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    
    # Process in chunks for memory efficiency
    chunk_size = 10000
    
    with open(output_path, 'w', encoding='utf-8') as f:
        with tqdm(total=len(df), desc="Processing translation rows") as pbar:
            # Resolve column positions once instead of once per row
            source_idx = df.get_column_index(source_lang)
            target_idx = df.get_column_index(target_lang)

            for i in range(0, len(df), chunk_size):
                # Get chunk
                chunk = df.slice(i, chunk_size)

                # Process chunk
                valid_rows = []
                for row in chunk.iter_rows():
                    source = row[source_idx]
                    target = row[target_idx]
                    
                    if self._validate_text(source) and self._validate_text(target):
                        example = {
                            "instruction": instruction,
                            "input": self._clean_text(source),
                            "output": self._clean_text(target)
                        }
                        valid_rows.append(json.dumps(example, ensure_ascii=False))
                
                # Write valid rows
                if valid_rows:
                    f.write('\n'.join(valid_rows) + '\n')
                
                # Update progress
                pbar.update(len(chunk))
                pbar.set_postfix({'valid_rows': len(valid_rows)})

def convert_descriptions(
    self,
    df: pl.DataFrame,
    title_col: str,
    desc_col: str,
    output_path: Union[str, Path]
) -> None:
    """Convert title-description pairs to Alpaca format."""
    instruction = self._create_instruction('book_description')
    
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    
    # Process in chunks for memory efficiency
    chunk_size = 10000
    
    with open(output_path, 'w', encoding='utf-8') as f:
        with tqdm(total=len(df), desc="Processing book description rows") as pbar:
            # Resolve column positions once instead of once per row
            title_idx = df.get_column_index(title_col)
            desc_idx = df.get_column_index(desc_col)

            for i in range(0, len(df), chunk_size):
                # Get chunk
                chunk = df.slice(i, chunk_size)

                # Process chunk
                valid_rows = []
                for row in chunk.iter_rows():
                    title = row[title_idx]
                    desc = row[desc_idx]
                    
                    if self._validate_text(title) and self._validate_text(desc):
                        example = {
                            "instruction": instruction,
                            "input": self._clean_text(title),
                            "output": self._clean_text(desc)
                        }
                        valid_rows.append(json.dumps(example, ensure_ascii=False))
                
                # Write valid rows
                if valid_rows:
                    f.write('\n'.join(valid_rows) + '\n')
                
                # Update progress
                pbar.update(len(chunk))
                pbar.set_postfix({'valid_rows': len(valid_rows)})

AlpacaConverter.convert_translations = convert_translations
AlpacaConverter.convert_descriptions = convert_descriptions

## Data Processing Pipeline 🔄

This section runs the data processing pipeline end to end: it loads both datasets, converts them to the Alpaca format, and writes the results as JSONL files.

### Pipeline Overview 📋

1. 📥 **Load ParaCrawl Corpus**
   - Download EN-CS parallel data
   - Clean and validate entries
   - Remove low-quality pairs

2. 📚 **Process Book Descriptions**
   - Load Czech book dataset
   - Extract titles and descriptions
   - Filter and clean text

3. 🔄 **Format Conversion**
   - Transform to Alpaca format
   - Add instruction templates
   - Validate final structure

4. 💾 **Save Training Data**
   - Export to JSONL format
   - Create data splits (not yet implemented in the cells below)
   - Verify data integrity
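
The cells in this notebook write one JSONL file per dataset and do not create splits. One possible way to add that step is a streaming random split, sketched below under stated assumptions: the helper name `split_jsonl`, the 10% validation fraction, and the seed are illustrative choices, and the demo runs on a temporary file rather than the real output.

```python
import json
import random
import tempfile
from pathlib import Path


def split_jsonl(src: Path, train_path: Path, val_path: Path,
                val_fraction: float = 0.1, seed: int = 42) -> dict:
    """Stream src line by line and route each record to train or validation."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    counts = {"train": 0, "val": 0}
    with open(src, encoding="utf-8") as fin, \
            open(train_path, "w", encoding="utf-8") as ftrain, \
            open(val_path, "w", encoding="utf-8") as fval:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            record = line.rstrip("\n") + "\n"
            if rng.random() < val_fraction:
                fval.write(record)
                counts["val"] += 1
            else:
                ftrain.write(record)
                counts["train"] += 1
    return counts


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src = tmp / "all.jsonl"
    with open(src, "w", encoding="utf-8") as f:
        for i in range(100):
            row = {"instruction": "Popiš tuto knihu",
                   "input": f"Kniha {i}", "output": f"Popis {i}"}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    counts = split_jsonl(src, tmp / "train.jsonl", tmp / "val.jsonl")
    n_train = len((tmp / "train.jsonl").read_text(encoding="utf-8").splitlines())
    n_val = len((tmp / "val.jsonl").read_text(encoding="utf-8").splitlines())

print(counts)
```

Because the file is streamed, memory use stays constant regardless of corpus size; the trade-off is that the validation share is only approximately `val_fraction`.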

### Key Features ✨

- 🧹 Robust data cleaning
- ⚡ Efficient Polars processing
- 🔍 Quality validation steps
- 📊 Progress tracking
- 💪 Scalable pipeline

### 1. Load and Process ParaCrawl Dataset 🌐

In [None]:
# Initialize loader
loader = ParaCrawlDataLoader(source_lang="en", target_lang="cs")

# Load ParaCrawl EN-CS dataset
df_paracrawl = loader.load_dataframe()
print(f"Loaded {len(df_paracrawl):,} translation pairs")
df_paracrawl.head()

### 2. Process Book Descriptions Dataset

In [None]:
from datasets import load_dataset

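# Load Czech book descriptions
ds = load_dataset("vojtam/czech_books_descriptions")
# Note: Dataset.to_polars() may need a newer `datasets` release than the
# 2.15.0 minimum above; pl.from_pandas(ds['train'].to_pandas()) is a fallback.
books_df = ds['train'].to_polars()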
print(f"Loaded {len(books_df):,} book descriptions")
books_df.head()

## Convert to Training Format

Convert our processed datasets to the Alpaca instruction format for fine-tuning.

In [None]:
alpaca_converter = AlpacaConverter()

alpaca_converter.convert_translations(
    df_paracrawl,
    source_lang="en",
    target_lang="cs",
    output_path="data/translation/dataset/paracrawl.jsonl"
)

alpaca_converter.convert_descriptions(
    books_df,
    title_col="title",
    desc_col="description",
    output_path="data/translation/dataset/czech_books.jsonl"
)
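
Once the files are written, a quick structural check can confirm that every line is valid JSON with the three Alpaca keys. This is a minimal sketch: `check_jsonl` is a hypothetical helper, and the demo runs on a small temporary file so it does not depend on the outputs above.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

REQUIRED_KEYS = {"instruction", "input", "output"}


def check_jsonl(path: Path, required: set = REQUIRED_KEYS) -> Counter:
    """Count well-formed records, records missing keys, and unparseable lines."""
    stats = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            stats["total"] += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                stats["bad_json"] += 1
                continue
            if isinstance(record, dict) and required.issubset(record):
                stats["ok"] += 1
            else:
                stats["missing_keys"] += 1
    return stats


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    good = {"instruction": "Popiš tuto knihu", "input": "Babička", "output": "Popis"}
    lines = [
        json.dumps(good, ensure_ascii=False),
        json.dumps(good, ensure_ascii=False),
        json.dumps({"instruction": "Popiš tuto knihu"}, ensure_ascii=False),
        "{not valid json",
    ]
    demo.write_text("\n".join(lines) + "\n", encoding="utf-8")
    stats = check_jsonl(demo)

print(dict(stats))
```

To check the real outputs, call `check_jsonl` with the paths passed to `convert_translations` and `convert_descriptions`.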

## Next Steps

1. **Data Validation**
   - Implement quality checks
   - Remove potential noise

2. **Model Fine-tuning**
   - Configure training parameters
   - Set up evaluation metrics

3. **Evaluation**
   - Benchmark on Czech NLP tasks
   - Compare with baseline models
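
As a starting point for the data validation step, two simple heuristics are often applied to parallel corpora: dropping exact duplicates and dropping pairs whose lengths differ implausibly. The sketch below is only one possible approach; the function names, the sample pairs, and the ratio threshold of 3.0 are assumptions that would need tuning on the real data.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def plausible_pair(source: str, target: str, max_ratio: float = 3.0) -> bool:
    """Reject pairs whose character lengths differ by more than max_ratio."""
    a, b = len(source.strip()), len(target.strip())
    if a == 0 or b == 0:
        return False
    return max(a, b) / min(a, b) <= max_ratio


def dedupe(pairs: List[Pair]) -> List[Pair]:
    """Drop exact duplicate pairs while preserving the original order."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


pairs = [
    ("Hello", "Ahoj"),
    ("Hello", "Ahoj"),
    ("Hi", "Toto je velmi dlouhá věta, která neodpovídá zdroji."),
    ("Good day", "Dobrý den"),
]
filtered = [p for p in dedupe(pairs) if plausible_pair(*p)]
print(filtered)
```

The duplicate and the badly mismatched pair are removed, leaving the two plausible translations.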

## References

1. ParaCrawl (2023). ParaCrawl v9.0. https://paracrawl.eu/v9
2. Gemma (2024). Google AI. https://blog.google/technology/ai/gemma-open-models/
3. Czech Books Descriptions Dataset. https://huggingface.co/datasets/vojtam/czech_books_descriptions