# Introduction

Hindi is the world's third-most spoken language. To put that in perspective, roughly 8 in every 100 people speak Hindi. With so many people living their day-to-day lives immersed in this language and its culture, it is disheartening that high-quality language models trained specifically on Hindi data remain scarce.

This project aims to address this gap by adapting Llama-2 to generate fluent and culturally sensitive Hindi text. By fine-tuning the model on a carefully curated dataset of high-quality Hindi text, I aim to improve its performance on NLP tasks such as text generation, translation, and question answering.

# Dataset Creation/Curation

The dataset I will be using to fine-tune the model is [Varta](https://huggingface.co/datasets/rahular/varta), a diverse, challenging, large-scale, multilingual, high-quality headline-generation dataset containing 41.8 million news articles in 14 Indic languages and English. The data is crawled from DailyHunt, a popular news aggregator in India that pulls high-quality articles from multiple trusted and reputed news publishers.

The hundreds of thousands of Hindi articles in this dataset provide an excellent foundation for this project. Its diversity, with articles drawn from many sources, exposes the model to a wide range of linguistic styles, vocabulary, and real-world contexts. By training on this data, we can expect Llama-2 to learn robust representations of the Hindi language, capturing its nuances and complexities.

In [1]:
# %%capture
# !git clone https://github.com/AI4Bharat/IndicTrans2.git

In [2]:
# %%capture
# %cd /content/IndicTrans2/huggingface_interface

In [3]:
# %%capture
# !python3 -m pip install nltk sacremoses pandas regex mock "transformers>=4.33.2" mosestokenizer
# !python3 -c "import nltk; nltk.download('punkt')"
# !python3 -m pip install bitsandbytes scipy accelerate datasets
# !python3 -m pip install sentencepiece

# !git clone https://github.com/VarunGumma/IndicTransToolkit.git
# %cd IndicTransToolkit
# !python3 -m pip install --editable ./
# %cd ..

In [1]:
!pip install datasets



In [2]:
import os  # For making and writing to directory where we will store our data.
from datasets import load_dataset   # Hugging Face Datasets library for loading and managing our dataset.

In [None]:
dataset = load_dataset("rahular/varta", split="train", streaming=True)  # Stream the dataset from Hugging Face.
os.makedirs("data", exist_ok=True)  # Create the data directory.
# Stop at 100,000 documents for space and training reasons.
# Ideally we would train on the whole Hindi corpus.
MAX_DOCS = 100_000

# Open a file with UTF-8 encoding to store the Hindi data.
with open(os.path.join("data", "Hindi.txt"), "w", encoding="utf-8") as f:
    count = 0
    for d in dataset:
        if count == MAX_DOCS:
            break
        if d["langCode"] == "hi":  # Keep only Hindi articles.
            count += 1
            f.write(d["headline"] + "\n" + d["text"] + "\n")
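The filter-and-cap pattern used in the loop above can be tried on a small in-memory list first. This is only a sketch: the `records` list and the `take_language` helper are hypothetical stand-ins for the streamed dataset and are not part of the notebook.

```python
from itertools import islice


def take_language(records, lang_code, limit):
    """Return at most `limit` records whose langCode equals `lang_code`."""
    matching = (r for r in records if r["langCode"] == lang_code)
    return list(islice(matching, limit))


# Hypothetical records that mimic the fields used in the notebook.
records = [
    {"langCode": "hi", "headline": "शीर्षक 1", "text": "पाठ 1"},
    {"langCode": "en", "headline": "Headline", "text": "Text"},
    {"langCode": "hi", "headline": "शीर्षक 2", "text": "पाठ 2"},
    {"langCode": "hi", "headline": "शीर्षक 3", "text": "पाठ 3"},
]

subset = take_language(records, "hi", limit=2)
print([r["headline"] for r in subset])
```

Because `islice` stops pulling from the generator once the limit is reached, the same approach would also avoid reading the rest of a streamed dataset.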


# Tokenization

Now that we have our data, we have to make it LLM-friendly by tokenizing it. We will be using [SentencePiece](https://github.com/google/sentencepiece) to do this.

SentencePiece is an unsupervised text tokenizer and detokenizer designed mainly for neural text generation systems. Instead of splitting text into whole words, it breaks words into smaller units such as prefixes, suffixes, and common character sequences, and it learns this subword vocabulary directly from raw text without any labeled data. It is fast and efficient, and it lets us build a purely end-to-end system that does not depend on language-specific pre- or postprocessing.

But there's a catch: The number of unique tokens is predetermined.

Neural Machine Translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model such that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k. We are going to set the vocabulary size to 16k to allow for efficient model training with lower memory requirements, better generalization, and lower risk of overfitting.
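To make the idea of a fixed-size subword vocabulary concrete, here is a toy Byte Pair Encoding learner in plain Python. It is a simplified sketch and not SentencePiece's actual implementation: the function name `learn_bpe_merges`, the tie-breaking rule, and the tiny English word list are assumptions chosen for illustration. The number of merges plays the role of the vocabulary budget, since each merge adds exactly one new subword.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    # Each word is a tuple of symbols; the Counter stores word frequencies.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(corpus))
```

With a budget of two merges, the frequent word "low" ends up as a single symbol while the rarer suffixes stay split into characters, which is the trade-off the vocabulary size controls.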

In [3]:
import sentencepiece as spm

os.makedirs("hi_tokenizer", exist_ok = True)
tokenizer_name = os.path.join("hi_tokenizer", "tokenizer")

spm.SentencePieceTrainer.train(
    input="data/Hindi.txt",  # Path to the training data.
    model_prefix=tokenizer_name,  # Prefix for the output files: the model file and the subword vocabulary file.
    vocab_size=16000,  # Target vocabulary size.
    num_threads=8,  # Number of parallel processing threads.
    model_type="bpe",  # Byte Pair Encoding algorithm.
    max_sentence_length=1073741824,  # Large value to effectively disable the length limit on input sentences.
    shuffle_input_sentence="true",  # Shuffle the input to avoid ordering bias.
    character_coverage=1.0,  # 1.0 means every character in the data is represented in the vocabulary.
    hard_vocab_limit="false",  # Treat vocab_size as a soft limit rather than a hard one.
)

Our SentencePiece Hindi tokenizer is now ready! We created a new tokenizer because Llama-2's tokenizer isn't adapted to Hindi text, which means that our custom tokenization will have to be merged with the original without disturbing its vocabulary in any way.

Next, we grab the original Llama-2 tokenizer so that we can extend it.

In [5]:
import os
import re

# Protobuf is a language-neutral, platform-neutral, extensible mechanism for serializing structured data.
# SentencePiece stores its model in this format. Select the pure-Python protobuf backend here;
# the variable only takes effect if it is set before protobuf is first imported.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

from huggingface_hub import hf_hub_download
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

original_tokenizer_path = hf_hub_download(repo_id="meta-llama/Llama-2-7b-chat-hf", filename="tokenizer.model", local_dir="original_tokenizer")
original_tokenizer_spm = sp_pb2_model.ModelProto()
with open(original_tokenizer_path, "rb") as f:
    original_tokenizer_spm.ParseFromString(f.read())
new_tokenizer_spm = sp_pb2_model.ModelProto()
with open(os.path.join("hi_tokenizer", "tokenizer.model"), "rb") as f:
    new_tokenizer_spm.ParseFromString(f.read())


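The function below checks whether a given text contains any printable ASCII character (English letters, digits, punctuation, or a plain space). We will use it to keep such pieces out of the new vocabulary that is added to the extended tokenizer.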

In [4]:
import re

def contains_eng(text):
    # The pattern matches any character in the printable ASCII range U+0020 to U+007E
    # (English letters, digits, punctuation, and the space character).
    eng_pattern = re.compile(r"[\u0020-\u007E]+")
    return bool(eng_pattern.search(text))
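A quick check of what this character range accepts and rejects can help when reading the merge step. The sketch below uses the same character class on a few sample pieces; the sample list is made up for illustration.

```python
import re

# Same character class as in contains_eng: printable ASCII, U+0020 through U+007E.
ascii_pattern = re.compile(r"[\u0020-\u007E]")

# Illustrative pieces, including the SentencePiece word-boundary marker "▁" (U+2581).
samples = ["हाथी", "▁हाथी", "hello", "▁the", "123", "हाथी2", "<0xE0>"]
flags = {s: bool(ascii_pattern.search(s)) for s in samples}

for piece, has_ascii in flags.items():
    print(repr(piece), has_ascii)
```

Note that the word-boundary marker is not in the ASCII range, so pure Devanagari pieces that start with it still pass the filter, while pieces containing digits or punctuation are rejected along with English letters.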

Appending new tokens to the original tokenizer:

In [7]:
original_tokenizer_tokenset = set(p.piece for p in original_tokenizer_spm.pieces)
print(f"Number of tokens before merge: {len(original_tokenizer_tokenset)}")
for p in new_tokenizer_spm.pieces:
    piece = p.piece
    if piece not in original_tokenizer_tokenset and not contains_eng(piece):
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        new_p.piece = piece
        new_p.score = 0
        original_tokenizer_spm.pieces.append(new_p)
print(f"Number of tokens after merge: {len(original_tokenizer_spm.pieces)}")

Number of tokens before merge: 32000
Number of tokens after merge: 45992
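The reason for appending, rather than rebuilding the vocabulary, is that every existing token keeps its ID. The following sketch shows the same idea with plain Python lists; `merge_vocab`, the `has_ascii` filter, and the tiny vocabularies are hypothetical and only mirror the logic of the cell above.

```python
def merge_vocab(original, new, should_skip):
    """Append pieces from `new` that are missing from `original`, keeping order."""
    seen = set(original)
    merged = list(original)
    for piece in new:
        if piece not in seen and not should_skip(piece):
            merged.append(piece)
            seen.add(piece)
    return merged


def has_ascii(piece):
    # True if any character is printable ASCII (U+0020 to U+007E).
    return any(" " <= ch <= "~" for ch in piece)


# Hypothetical miniature vocabularies.
original = ["<unk>", "<s>", "</s>", "▁the", "▁", "क"]
new = ["<unk>", "▁", "क", "▁हाथी", "▁एक", "▁the2"]

merged = merge_vocab(original, new, should_skip=has_ascii)
print(merged)
```

The first `len(original)` entries are unchanged, so the position of each original piece is preserved and the new Hindi pieces receive the next free positions.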


The tokenizers are now merged. Let's save the extended tokenizer to disk.

In [8]:
os.makedirs("extended_tokenizer", exist_ok=True)
with open(os.path.join("extended_tokenizer", "tokenizer.model"), "wb") as f:
    f.write(original_tokenizer_spm.SerializeToString())

In [9]:
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer(vocab_file=os.path.join("extended_tokenizer", "tokenizer.model"), legacy=False)
tokenizer.save_pretrained("extended_tokenizer")
print("Tokenizer saved to extended_tokenizer")

Tokenizer saved to extended_tokenizer


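Let's quickly test the extended tokenizer. The first cell below checks that every token ID from the original tokenizer still maps to the same token in the extended one. The two cells after it tokenize the same Hindi sentence, first with the original tokenizer and then with the extended one.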


In [10]:
tok1 = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
tok2 = LlamaTokenizer.from_pretrained("extended_tokenizer")
for i in range(len(tok1)):
    assert tok1.convert_ids_to_tokens(i) == tok2.convert_ids_to_tokens(i), f"Token mismatch at index {i}."


In [11]:
text = "मैं एक अच्छा हाथी हूँ"
tok1.tokenize(text)

['▁',
 'म',
 'ै',
 'ं',
 '▁',
 '<0xE0>',
 '<0xA4>',
 '<0x8F>',
 'क',
 '▁',
 'अ',
 'च',
 '्',
 '<0xE0>',
 '<0xA4>',
 '<0x9B>',
 'ा',
 '▁',
 'ह',
 'ा',
 'थ',
 'ी',
 '▁',
 'ह',
 'ू',
 '<0xE0>',
 '<0xA4>',
 '<0x81>']

In [12]:
tok2.tokenize(text)

['▁मैं', '▁एक', '▁अच', '्', 'छा', '▁हाथी', '▁हूँ']
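The two outputs above can be compared numerically. The sketch below copies both token lists from the cells above and computes tokens per word; it also decodes one of the `<0x..>` byte-fallback groups to show which character the original tokenizer could not represent directly. The helper names are made up for this example.

```python
text = "मैं एक अच्छा हाथी हूँ"

# Token lists copied from the two outputs above.
original_tokens = [
    '▁', 'म', 'ै', 'ं', '▁', '<0xE0>', '<0xA4>', '<0x8F>', 'क', '▁',
    'अ', 'च', '्', '<0xE0>', '<0xA4>', '<0x9B>', 'ा', '▁', 'ह', 'ा',
    'थ', 'ी', '▁', 'ह', 'ू', '<0xE0>', '<0xA4>', '<0x81>',
]
extended_tokens = ['▁मैं', '▁एक', '▁अच', '्', 'छा', '▁हाथी', '▁हूँ']


def tokens_per_word(tokens, sentence):
    """Average number of tokens spent on each whitespace-separated word."""
    return len(tokens) / len(sentence.split())


def decode_byte_tokens(byte_tokens):
    """Turn byte-fallback tokens such as '<0xE0>' back into the character they encode."""
    raw = bytes(int(tok[1:-1], 16) for tok in byte_tokens)
    return raw.decode("utf-8")


print(len(original_tokens), tokens_per_word(original_tokens, text))
print(len(extended_tokens), tokens_per_word(extended_tokens, text))
print(decode_byte_tokens(['<0xE0>', '<0xA4>', '<0x8F>']))
```

Fewer tokens per word means the same Hindi text occupies less of the model's context window and takes fewer generation steps.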

# Pre-Training

I'm going to do the pre-training in two phases.

The first phase will be [translation-based pretraining](https://ojs.aaai.org/index.php/AAAI/article/view/6256). The core idea is to train the model to generate the original text given its translation. This forces the model to learn deep semantic and syntactic relationships between the two languages.

Let's get some English data to translate.

In [3]:
ds = load_dataset("rahular/varta", split="validation", streaming=True)
english_paragraphs = []
for d in ds:
    if d["langCode"] != "en": continue
    english_paragraphs.append(" ".join(d["text"].split("\n")))


We will translate this English text to Hindi using the [IndicTrans2](https://github.com/AI4Bharat/IndicTrans2/tree/main/huggingface_interface) library.

IndicTrans2 is the first open-source transformer-based multilingual NMT model that supports high-quality translations across all 22 scheduled Indic languages, including multiple scripts for low-resource languages like Kashmiri, Manipuri, and Sindhi. It adopts script unification wherever feasible to leverage transfer learning through lexical sharing between languages. Overall, the model supports five scripts: Perso-Arabic (Kashmiri, Sindhi, Urdu), Ol Chiki (Santali), Meitei (Manipuri), Latin (English), and Devanagari (used for all the remaining languages).

Below are some functions we have grabbed from the library that are the core of the translation.

In [8]:
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig, AutoTokenizer
from IndicTransToolkit import IndicProcessor
from mosestokenizer import MosesSentenceSplitter
import nltk
from nltk import sent_tokenize
nltk.download('punkt_tab')
from indicnlp.tokenize.sentence_tokenize import sentence_split, DELIM_PAT_NO_DANDA

BATCH_SIZE = 4
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
quantization = None



In [5]:
def initialize_model_and_tokenizer(ckpt_dir, quantization):
    if quantization == "4-bit":
        qconfig = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    elif quantization == "8-bit":
        qconfig = BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_use_double_quant=True,
            bnb_8bit_compute_dtype=torch.bfloat16,
        )
    else:
        qconfig = None

    tokenizer = AutoTokenizer.from_pretrained(ckpt_dir, trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        ckpt_dir,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        quantization_config=qconfig,
    )

    if qconfig is None:
        model = model.to(DEVICE)
        if DEVICE == "cuda":
            model.half()

    model.eval()

    return tokenizer, model


def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip):
    translations = []
    for i in range(0, len(input_sentences), BATCH_SIZE):
        batch = input_sentences[i : i + BATCH_SIZE]

        # Preprocess the batch and extract entity mappings
        batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)

        # Tokenize the batch and generate input encodings
        inputs = tokenizer(
            batch,
            truncation=True,
            padding="longest",
            return_tensors="pt",
            return_attention_mask=True,
        ).to(DEVICE)

        # Generate translations using the model
        with torch.no_grad():
            generated_tokens = model.generate(
                **inputs,
                use_cache=True,
                min_length=0,
                max_length=256,
                num_beams=5,
                num_return_sequences=1,
            )

        # Decode the generated tokens into text

        with tokenizer.as_target_tokenizer():
            generated_tokens = tokenizer.batch_decode(
                generated_tokens.detach().cpu().tolist(),
                skip_special_tokens=True,
                clean_up_tokenization_spaces=True,
            )

        # Postprocess the translations, including entity replacement
        translations += ip.postprocess_batch(generated_tokens, lang=tgt_lang)

        del inputs
        torch.cuda.empty_cache()

    return translations


def split_sentences(input_text, lang):
    if lang == "eng_Latn":
        # Split with both Moses (English rules) and NLTK, then keep whichever yields fewer sentences.
        with MosesSentenceSplitter("en") as splitter:
            sents_moses = splitter([input_text])
        sents_nltk = sent_tokenize(input_text)
        if len(sents_nltk) < len(sents_moses):
            input_sentences = sents_nltk
        else:
            input_sentences = sents_moses
        # Remove soft hyphens.
        input_sentences = [sent.replace("\xad", "") for sent in input_sentences]
    else:
        # Any other language code is treated as Hindi, the only Indic language used here.
        input_sentences = sentence_split(
            input_text, lang="hi", delim_pat=DELIM_PAT_NO_DANDA
        )
    return input_sentences

def translate_paragraph(input_text, src_lang, tgt_lang, model, tokenizer, ip):
    input_sentences = split_sentences(input_text, src_lang)
    translated_text = batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip)
    return " ".join(translated_text)
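`batch_translate` walks through the sentence list in slices of `BATCH_SIZE`, so the last batch may be shorter than the others. The slicing pattern on its own looks like this; `iter_batches` is a hypothetical helper written for illustration and is not part of IndicTrans2.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Ten placeholder sentences split into batches of four.
sentences = [f"sentence {i}" for i in range(10)]
batches = list(iter_batches(sentences, batch_size=4))
print([len(b) for b in batches])
```

Concatenating the batches in order gives back the original list, which is why the translations can simply be appended batch by batch.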

Our goal is to create data in the format `{translated_paragraph}\n\n{english_paragraph}`.

In [None]:
quantization = ""
en_indic_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"
en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(en_indic_ckpt_dir, quantization)
ip = IndicProcessor(inference=True)

phase1_data = []
for para in english_paragraphs:
    trans_para = translate_paragraph(para, "eng_Latn", "hin_Deva", en_indic_model, en_indic_tokenizer, ip)
    phase1_data.append({"text": f"{trans_para}\n\n{para}"})
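The record layout built in the loop above can be checked without running the translation model. This sketch assumes a hypothetical helper `make_phase1_record` and a hand-written sentence pair used purely as an example.

```python
def make_phase1_record(hindi_paragraph, english_paragraph):
    """Build one training record: Hindi translation first, then the English original."""
    return {"text": f"{hindi_paragraph}\n\n{english_paragraph}"}


# Hand-written example pair (not produced by IndicTrans2).
record = make_phase1_record("मैं एक अच्छा हाथी हूँ।", "I am a good elephant.")
hindi_part, english_part = record["text"].split("\n\n")
print(repr(record["text"]))
```

Putting the Hindi text first means that, during next-token training, the model sees the translation as context and has to produce the English original, which matches the description of translation-based pretraining above.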



In [None]:
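import json
from google.colab import drive, files

drive.mount('/content/drive')
# Write one JSON object per line so the records can be reloaded later.
with open('/content/phase1_data.txt', 'w', encoding='utf-8') as f:
    for item in phase1_data:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

files.download('/content/phase1_data.txt')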

Phase 2 is [bilingual next token prediction](https://huggingface.co/blog/alonsosilva/nexttokenprediction). This involves training a language model on a dataset where sentences alternate between two languages. We do this for two reasons:
*   By learning to predict the next token across languages, the model develops a deeper understanding of the relationships between languages.
*   This approach can potentially improve the model's ability to handle code-switching (mixing languages within a single sentence or conversation), which is common in multilingual settings.

Here is how we can create a dataset with interleaved text:







In [None]:
quantization = ""
en_indic_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"
en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(en_indic_ckpt_dir, quantization)
ip = IndicProcessor(inference=True)

phase2_data = []
for para in english_paragraphs:
    en_sents = split_sentences(para, "eng_Latn")
    trans_sents = batch_translate(en_sents, "eng_Latn", "hin_Deva", en_indic_model, en_indic_tokenizer, ip)
    final_para = []
    for idx, (en_sent, trans_sent) in enumerate(zip(en_sents, trans_sents)):
        sent_to_append = en_sent if idx % 2 == 0 else trans_sent
        final_para.append(sent_to_append)
    phase2_data.append({"text": " ".join(final_para)})
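The alternation rule in the loop above (English at even positions, Hindi at odd positions) can be isolated in a small function. The sketch below uses placeholder strings instead of real translations; `interleave_sentences` is a hypothetical name.

```python
def interleave_sentences(en_sents, hi_sents):
    """Take the English sentence at even indices and the Hindi one at odd indices."""
    mixed = [
        en if idx % 2 == 0 else hi
        for idx, (en, hi) in enumerate(zip(en_sents, hi_sents))
    ]
    return " ".join(mixed)


# Placeholders standing in for aligned English and Hindi sentences.
en_sents = ["E1.", "E2.", "E3."]
hi_sents = ["H1.", "H2.", "H3."]
paragraph = interleave_sentences(en_sents, hi_sents)
print(paragraph)
```

Because `zip` stops at the shorter list, the two sentence lists need to be aligned one-to-one; otherwise trailing sentences are silently dropped.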

In [None]:
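import json
from google.colab import files

# Write one JSON object per line so the records can be reloaded later.
with open('/content/phase2_data.txt', 'w', encoding='utf-8') as f:
    for item in phase2_data:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

files.download('/content/phase2_data.txt')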

# Training

Like the pre-training, the training will be done in two parts. The first part will involve training on phase 1 data, and the second part will involve training on phase 2 data.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, pipeline, logging
from peft import LoraConfig
from trl import SFTTrainer

base_model = "meta-llama/Llama-2-7b-chat-hf"
temp_model = "temp_results/llama-2-7b-chat-Hindi-temp"

compute_dtype = getattr(torch, "float16")
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=compute_dtype, bnb_4bit_use_double_quant=False)

model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=quant_config, device_map={"": 0})
model.config.use_cache = False
model.config.pretraining_tp = 1

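tokenizer = LlamaTokenizer(vocab_file=os.path.join("extended_tokenizer", "tokenizer.model"), legacy=False)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# The extended tokenizer has more tokens than the base model's embedding matrix, so resize it to match.
# Note: with LoRA alone, the new embedding rows are not updated unless they are also made trainable.
model.resize_token_embeddings(len(tokenizer))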

peft_params = LoraConfig(lora_alpha=16, lora_dropout=0.1, r=64, bias="none", task_type="CAUSAL_LM")

training_params = TrainingArguments(output_dir="./temp_results", num_train_epochs=1, per_device_train_batch_size=4, gradient_accumulation_steps=1, optim="paged_adamw_32bit", save_steps=25, logging_steps=25, learning_rate=2e-4, weight_decay=0.001, fp16=False, bf16=False, max_grad_norm=0.3, max_steps=-1, warmup_ratio=0.03, group_by_length=True, lr_scheduler_type="constant", report_to="tensorboard")
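from datasets import Dataset  # SFTTrainer expects a Dataset rather than a plain list of dicts.
trainer = SFTTrainer(model=model, train_dataset=Dataset.from_list(phase1_data), peft_config=peft_params, dataset_text_field="text", max_seq_length=None, tokenizer=tokenizer, args=training_params, packing=False)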
trainer.train()

trainer.model.save_pretrained(temp_model)
trainer.tokenizer.save_pretrained(temp_model)
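To get a feel for the LoRA settings used above (`r=64`, `lora_alpha=16`), the arithmetic below estimates how many parameters one adapter adds to a single square projection matrix. It assumes the Llama-2-7B hidden size of 4096 and the standard LoRA scaling of `lora_alpha / r`; which modules actually receive adapters depends on the PEFT defaults, since the notebook does not set `target_modules`. The helper name is hypothetical.

```python
def lora_param_count(d_out, d_in, r):
    """Parameters added by one LoRA adapter: B is d_out x r and A is r x d_in."""
    return r * (d_out + d_in)


r, lora_alpha = 64, 16      # values from the LoraConfig above
hidden = 4096               # hidden size of Llama-2-7B (assumed target: a square projection)

full = hidden * hidden                       # parameters in the frozen weight matrix
added = lora_param_count(hidden, hidden, r)  # trainable parameters in the adapter
scale = lora_alpha / r                       # factor applied to the low-rank update

print(added, full, added / full, scale)
```

Under these assumptions each adapted matrix gains about 3% extra trainable parameters while the original weights stay frozen, which is what keeps fine-tuning a 4-bit 7B model feasible on a single GPU.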

I saved the model from the first phase in a temporary directory. For the second phase of training, we load the model from that directory instead of the original `base_model`.

In [None]:
new_model = "results/llama-2-7b-chat-Hindi-Finetuned"

compute_dtype = getattr(torch, "float16")
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=compute_dtype, bnb_4bit_use_double_quant=False)

model = AutoModelForCausalLM.from_pretrained(temp_model, quantization_config=quant_config, device_map={"": 0})
model.config.use_cache = False
model.config.pretraining_tp = 1

peft_params = LoraConfig(lora_alpha=16, lora_dropout=0.1, r=64, bias="none", task_type="CAUSAL_LM")

training_params = TrainingArguments(output_dir="./results", num_train_epochs=1, per_device_train_batch_size=4, gradient_accumulation_steps=1, optim="paged_adamw_32bit", save_steps=25, logging_steps=25, learning_rate=2e-4, weight_decay=0.001, fp16=False, bf16=False, max_grad_norm=0.3, max_steps=-1, warmup_ratio=0.03, group_by_length=True, lr_scheduler_type="constant", report_to="tensorboard")
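from datasets import Dataset  # SFTTrainer expects a Dataset rather than a plain list of dicts.
trainer = SFTTrainer(model=model, train_dataset=Dataset.from_list(phase2_data), peft_config=peft_params, dataset_text_field="text", max_seq_length=None, tokenizer=tokenizer, args=training_params, packing=False)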
trainer.train()

trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

# Testing

In [None]:
logging.set_verbosity(logging.CRITICAL)
prompt = "Who is Leonardo Da Vinci?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

In [None]:
# Save the model to Hugging Face Hub
trainer.model.push_to_hub("")

# Save the tokenizer to Hugging Face Hub
trainer.tokenizer.push_to_hub("")

# Links and Resources

### Dataset Creation/Curation:
- [Quick and easy to follow fine-tuning tutorial for the uninitiated](https://www.youtube.com/watch?v=BJQrQT2Xfyo)
- [Extending Llama to a new Language](https://github.com/meta-llama/llama-recipes/blob/0efb8bd31e4359ba9e8f52e8d003d35ff038e081/recipes/multilingual/README.md)
- [OpenHathi's Llama fine-tuning](https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base/tree/main)
- [The dataset](https://huggingface.co/datasets/rahular/varta)
- [UTF-8 encoding to handle Devanagari script](https://www.freecodecamp.org/news/what-is-utf-8-character-encoding/)

### Tokenization:
- [SentencePiece](https://github.com/google/sentencepiece)
- [SentencePiece settings](https://huggingface.co/transformers/v3.0.2/tokenizer_summary.html#:~:text=More%20specifically%2C%20we%20will%20look,BPE\)%2C%20WordPiece%20and%20SentencePiece%2C)
- [Byte Pair Encoding](https://medium.com/@himankvjain/tokenization-byte-pair-encoding-8f92f5d7d86b#:~:text=Many%20state%2Dof%2Dthe%2D,offers%20a%20good%20compromise%20between)
- [Merging custom Chinese tokenizer with Llama tokenizer](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_tokenizer/merge_tokenizers.py)

### Pre-Training:
- [Translation-based pretraining research](https://ojs.aaai.org/index.php/AAAI/article/view/6256)
- [IndicTrans2](https://github.com/AI4Bharat/IndicTrans2/tree/main/huggingface_interface)
- [Next token prediction](https://huggingface.co/blog/alonsosilva/nexttokenprediction)

### Training:
- [Finetuning Llama](https://github.com/rahul-sarvam/llama-recipes/tree/main/recipes/finetuning)
- [Finetuning Llama-2](https://www.run.ai/guides/generative-ai/llama-2-fine-tuning)