## Introduction: Gemma Czech Adaptation

### Overview
This notebook is part of the **Gemma Czech Adaptation** project, which aims to fine-tune the Gemma language model for better understanding and generation of Czech. The project does not use the original dataset from the Gemma competition. Instead, it compiles custom datasets from the Hugging Face Hub and other public sources such as Europarl and ParaCrawl, chosen to reflect the nuances of Czech text.

### Aims
- **Fine-tuning:** Optimize the Gemma model for Czech text by using diverse datasets that include conversational language, formal writing, and domain-specific terminology.
- **Task Specialization:** Enhance performance in specific tasks such as:
  - Translation from and to Czech.
  - Sentiment analysis on Czech text.
  - Generative language tasks specific to the Czech context.
- **Adaptability:** Make the model robust for varied Czech applications, such as chatbots, content summarization, and text classification.

### Goals
1. Compile and preprocess high-quality Czech text datasets from Hugging Face and other public corpora.
2. Fine-tune the Gemma model using the compiled data to improve language-specific accuracy.
3. Evaluate the fine-tuned model on a variety of tasks and benchmarks to measure its performance.
4. Provide actionable insights for further improvements and adaptations.

### Workflow
1. **Dataset Compilation:** Identify and gather Czech-language datasets from Hugging Face and other public corpora.
2. **Data Preprocessing:** Clean and tokenize the text data, preparing it for model fine-tuning.
3. **Model Training:** Fine-tune the Gemma model using preprocessed datasets and suitable hyperparameters.
4. **Evaluation:** Use a robust evaluation framework to measure the model's performance across multiple tasks.
5. **Deployment:** Save the fine-tuned model in a deployable format for integration into Czech-specific applications.

This notebook serves as a comprehensive guide to achieving these aims and goals while ensuring transparency and reproducibility of the fine-tuning process.

## Dataset Preparation

This section breaks dataset preparation into smaller steps: loading each source, filtering out malformed rows, and converting every dataset to a common format so that multiple datasets can be combined.

Dataset preparation is a crucial part of the fine-tuning process. We will use the following datasets:

- Europarl: https://www.statmt.org/europarl/v10/training/ (Czech-English file: https://www.statmt.org/europarl/v10/training/europarl-v10.cs-en.tsv.gz)
- ParaCrawl: https://web-language-models.s3.amazonaws.com/paracrawl/release9/en-cs/en-cs.txt.gz
- Czech News Simple: https://huggingface.co/datasets/CIIRC-NLP/czech_news_simple-cs, a dataset of Czech news articles and a good starting point for our dataset preparation.

We will also use the following datasets:
...

First we need to install the necessary packages:

In [None]:
%pip install datasets
%pip install polars
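
The preprocessing cells read the translation corpora from local, uncompressed files, while the links above point to gzip archives. The following is a minimal sketch of one way to fetch and unpack them using only the standard library. The helper names `decompress_gzip` and `fetch_corpus` are hypothetical and not part of the original notebook; the target paths are the ones the later cells read from.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def decompress_gzip(src_path: str, dst_path: str) -> Path:
    """Decompress a .gz file to dst_path, creating parent directories as needed."""
    target = Path(dst_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so a large corpus is never held fully in memory.
    with gzip.open(src_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def fetch_corpus(url: str, dst_path: str) -> Path:
    """Download a gzip-compressed corpus and store it decompressed at dst_path."""
    archive_path = dst_path + ".gz"  # hypothetical location for the downloaded archive
    Path(archive_path).parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, archive_path)
    return decompress_gzip(archive_path, dst_path)


# Uncomment to download (the archives are large):
# fetch_corpus("https://www.statmt.org/europarl/v10/training/europarl-v10.cs-en.tsv.gz",
#              "data/translation/europarl.tsv")
# fetch_corpus("https://web-language-models.s3.amazonaws.com/paracrawl/release9/en-cs/en-cs.txt.gz",
#              "data/translation/paracrawl.txt")
```

Keeping the download and the decompression separate means an archive that is already on disk can be unpacked with `decompress_gzip` alone.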

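Next, we preprocess the datasets. Each translation corpus is read as a tab-separated file, rows with an odd number of double quotes in any string column are dropped, and the first two columns are renamed to `lhs` and `rhs`.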

In [None]:
import polars as pl
from functools import reduce
import operator

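def count_quotes(column):
    # Expression that counts the double-quote characters in each value of `column`.
    return pl.col(column).str.count_matches('"')

europarl_filepath = 'data/translation/europarl.tsv'

# quote_char=None disables CSV quote handling, so quotes inside sentences are read literally.
df_europarl = pl.read_csv(europarl_filepath, separator='\t', quote_char=None, ignore_errors=True, truncate_ragged_lines=True)

string_columns = [col for col in df_europarl.columns if df_europarl[col].dtype == pl.Utf8]

# Flag rows where any string column has an odd number of double quotes, then drop them.
mask = reduce(operator.or_, [(count_quotes(col) % 2 != 0) for col in string_columns])
df_europarl = df_europarl.filter(~mask)
# Give the first two columns neutral names shared by all translation datasets.
df_europarl = df_europarl.rename({
    df_europarl.columns[0]: "lhs",
    df_europarl.columns[1]: "rhs",
})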

df_europarl.head()
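
The filter above drops every row in which some string column contains an odd number of double quotes. The rule itself can be illustrated without Polars. This is a minimal sketch on made-up sentence pairs; `has_unbalanced_quotes` is a hypothetical helper that mirrors the `count_quotes(col) % 2 != 0` test.

```python
def has_unbalanced_quotes(text: str) -> bool:
    """Return True when the text contains an odd number of double-quote characters."""
    return text.count('"') % 2 != 0


# Made-up (Czech, English) pairs for illustration only.
rows = [
    ('Řekl: "Dobrý den."', 'He said: "Good day."'),  # balanced in both columns
    ('Řekl: "Dobrý den.', 'He said: "Good day."'),   # Czech side has one stray quote
    ('Bez uvozovek.', 'No quotes.'),                 # no quotes at all
]

# Keep a row only if none of its cells has an odd quote count.
kept = [row for row in rows if not any(has_unbalanced_quotes(cell) for cell in row)]
print(len(kept))  # 2
```

Only the ASCII `"` character is counted, so typographic Czech quotation marks such as `„` and `“` do not affect the result.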


In [None]:
import polars as pl
from functools import reduce
import operator

def count_quotes(column):
    # Expression that counts the double-quote characters in each value of `column`.
    return pl.col(column).str.count_matches('"')

paracrawl_filepath = 'data/translation/paracrawl.txt'

# Same loading options as for Europarl: tab-separated, no CSV quote handling.
df_paracrawl = pl.read_csv(paracrawl_filepath, separator='\t', quote_char=None, ignore_errors=True, truncate_ragged_lines=True)

string_columns = [col for col in df_paracrawl.columns if df_paracrawl[col].dtype == pl.Utf8]

# Drop rows where any string column has an odd number of double quotes.
mask = reduce(operator.or_, [(count_quotes(col) % 2 != 0) for col in string_columns])
df_paracrawl = df_paracrawl.filter(~mask)
df_paracrawl = df_paracrawl.rename({
    df_paracrawl.columns[0]: "lhs",
    df_paracrawl.columns[1]: "rhs",
})

df_paracrawl.head()

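Next, we define a function that converts a dataset to the Alpaca format: one JSON object per line with an `instruction`, an optional `input`, and an `output` field.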

In [4]:
from typing import Optional, List, Dict
import json
import polars as pl

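def convert_to_alpaca(df: pl.DataFrame, instruction: str, output_col: str, input_col: Optional[str] = None, output_path: str = "dataset.jsonl"):
    """Write df to output_path as JSON Lines in Alpaca format.

    Every record gets the same instruction. The "input" field is added only
    when input_col is given and the row's value is non-empty.
    """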
    records: List[Dict] = []

    for row in df.iter_rows(named=True):
        record = {
            "instruction": instruction
        }

        if input_col and row[input_col]:
            record["input"] = row[input_col]

        record["output"] = row[output_col]

        records.append(record)

    with open(output_path, 'w', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

    print(f"Saved to path: {output_path}")
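
Once a file has been written, it can be useful to read it back and confirm that each line parses as JSON and carries the expected fields. This is a minimal sketch of such a check, not part of the original notebook; `iter_jsonl` and `check_alpaca_records` are hypothetical helper names, and the `limit` default is an arbitrary choice.

```python
import json
from typing import Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of the file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def check_alpaca_records(path: str, limit: int = 1000) -> int:
    """Check that the first `limit` records have the required Alpaca fields.

    "input" is optional because convert_to_alpaca omits it for empty values.
    Returns the number of records checked.
    """
    checked = 0
    for record in iter_jsonl(path):
        if checked >= limit:
            break
        missing = {"instruction", "output"} - record.keys()
        if missing:
            raise ValueError(f"Record {checked} is missing fields: {sorted(missing)}")
        checked += 1
    return checked
```

For example, `check_alpaca_records("dataset.jsonl")` would validate a file written with the function's default output path.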

Finally, we convert each dataset to the Alpaca format:

In [None]:
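europarl_jsonl_outputpath = 'data/translation/dataset/europarl.jsonl'

# The instruction is Czech for "Translate this text from English to Czech".
# Europarl is read from a cs-en file, so `rhs` (English) is the input and `lhs` (Czech) is the output.
convert_to_alpaca(df_europarl, instruction="Přelož tento text z angličtiny do češtiny", input_col="rhs", output_col="lhs", output_path=europarl_jsonl_outputpath)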

In [None]:
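paracrawl_jsonl_outputpath = 'data/translation/dataset/paracrawl.jsonl'

# The instruction is Czech for "Translate this text from English to Czech".
# ParaCrawl is read from an en-cs file, so `lhs` (English) is the input and `rhs` (Czech) is the output.
convert_to_alpaca(df_paracrawl, instruction="Přelož tento text z angličtiny do češtiny", input_col="lhs", output_col="rhs", output_path=paracrawl_jsonl_outputpath)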

In [None]:
from datasets import load_dataset

dataset = load_dataset('vojtam/czech_books_descriptions', split='train')

dataset = dataset.rename_columns({
  "title": "lhs",
  "text": "rhs"
})

polars_dataset = dataset.to_polars()

books_jsonl_outputpath = 'data/translation/dataset/books.jsonl'

# The instruction is Czech for "Tell me something about the book".
convert_to_alpaca(polars_dataset, instruction="Řekni mi něco o knize", input_col="lhs", output_col="rhs", output_path=books_jsonl_outputpath)



Saved to path: data/translation/dataset/books.jsonl
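
With each dataset saved as its own JSONL file, the files can be combined into a single training file. This is a minimal sketch, assuming a plain concatenate-and-shuffle is acceptable and that all lines fit in memory. The function name `merge_jsonl`, the seed, and the output path `combined.jsonl` are hypothetical and do not come from the original notebook.

```python
import random
from pathlib import Path
from typing import Iterable


def merge_jsonl(input_paths: Iterable[str], output_path: str, seed: int = 42) -> int:
    """Concatenate JSONL files, shuffle the lines reproducibly, and write them out.

    Returns the number of records written.
    """
    lines = []
    for path in input_paths:
        with open(path, encoding="utf-8") as f:
            # Skip blank lines and normalize line endings.
            lines.extend(line.rstrip("\n") for line in f if line.strip())

    # A fixed seed keeps the shuffled order identical between runs.
    random.Random(seed).shuffle(lines)

    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
    return len(lines)


# Example call with the paths written above (hypothetical output location):
# merge_jsonl(
#     ["data/translation/dataset/europarl.jsonl",
#      "data/translation/dataset/paracrawl.jsonl",
#      "data/translation/dataset/books.jsonl"],
#     "data/translation/dataset/combined.jsonl",
# )
```

Shuffling mixes the translation and book-description records so that no single task dominates a contiguous stretch of the file.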
