# Unsloth CPT on Hindi (`Hindi-data-hub/odaigen_hindi_pre_trained_sp`): Colab

This notebook installs Unsloth, logs in to Hugging Face, **loads the Hindi pre-train data set** from the Hub, and runs **continued pretraining (CPT)** on a base LLM (4-bit QLoRA).

> Note: You may need to **accept the data set conditions** on its Hub page and log in with a token to access it.


## 0) Runtime
In Colab: Runtime → **Change runtime type** → GPU (T4 or A100).
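
Before installing anything, it can help to confirm that the runtime really has a GPU attached. The snippet below is a minimal sketch that assumes the `nvidia-smi` tool is on the PATH whenever an NVIDIA GPU is present (which is the case on Colab GPU runtimes); `gpu_summary` is a hypothetical helper name, not part of Unsloth.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the name and memory of each visible NVIDIA GPU, or None if none is found."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on PATH, so most likely a CPU runtime
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


print(gpu_summary() or "No GPU detected: switch the runtime type before continuing.")
```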

## 1) Install & restart

In [1]:

!pip -q install unsloth datasets accelerate bitsandbytes peft transformers trl huggingface_hub

## 2) Imports, login & basic config

In [2]:
from datasets import load_dataset
from unsloth import (
    FastLanguageModel,
    UnslothTrainer,
    UnslothTrainingArguments,
    is_bfloat16_supported,
)
from huggingface_hub import login
import torch, os, json

# === Login to Hugging Face (paste your token when prompted) ===
# If you've already saved it to Colab secrets or env, you can skip logging in again.
try:
    token = os.environ.get("HF_TOKEN", None)
    if token:
        login(token=token, add_to_git_credential=True)
    else:
        login()  # will prompt in Colab
except Exception as e:
    print("Login skipped or failed:", e)

# === Config ===
DATASET_NAME = "Hindi-data-hub/odaigen_hindi_pre_trained_sp"
MODEL_NAME   = "unsloth/llama-3-8b-bnb-4bit"   # Use a BASE model (not Instruct) for CPT
MAX_SEQ_LEN  = 2048
LOAD_4BIT    = True
DTYPE        = None  # auto-pick bf16/fp16
OUTPUT_DIR   = "cpt_lang_hi"
SEED         = 42
torch.manual_seed(SEED)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!

<torch._C.Generator at 0x7e999b8c8990>

## 3) Load data set from Hugging Face Hub

In [3]:
# Try loading the dataset. If access is restricted, make sure you've accepted conditions on the dataset page.
# We'll first peek the dataset to discover column names and size.
from datasets import get_dataset_config_names, get_dataset_split_names

print("Checking dataset configs and splits...")
try:
    configs = get_dataset_config_names(DATASET_NAME, token=True)
except Exception as e:
    print("Could not list configs (may be gated). Proceeding with default config. Error:", e)
    configs = [None]

print("Configs:", configs)

split_names = []
for cfg in configs:
    try:
        splits = get_dataset_split_names(DATASET_NAME, config_name=cfg, token=True)
        split_names.append((cfg, splits))
    except Exception as e:
        split_names.append((cfg, ["train"]))
print("Splits:", split_names)

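# Load the first 10% of the train split. Eval data comes from a validation split if one exists;
# otherwise it is taken from the loaded subset (or from train[10%:12%] in the fallback path).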
try:
    if configs and configs[0] is not None:
        ds_train = load_dataset(DATASET_NAME, configs[0], split="train[:10%]", token=True)
    else:
        ds_train = load_dataset(DATASET_NAME, split="train[:10%]", token=True)
except Exception as e:
    print("Direct 'train[:10%]' split failed; trying explicit slicing fallback. Error:", e)
    if configs and configs[0] is not None:
        ds_train = load_dataset(DATASET_NAME, configs[0], split="train[:10%]", token=True)
        ds_eval  = load_dataset(DATASET_NAME, configs[0], split="train[10%:12%]", token=True)
    else:
        ds_train = load_dataset(DATASET_NAME, split="train[:10%]", token=True)
        ds_eval  = load_dataset(DATASET_NAME, split="train[10%:12%]", token=True)
else:
    # Make eval split if not created above
    try:
        ds_eval = load_dataset(DATASET_NAME, split="validation", token=True)
    except Exception:
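        # No validation split: use shard 0 of 50 for eval and shard 1 of 50 for training.
        # Note: each shard is ~2% of the 10% subset, so training also sees only ~2% of it
        # (14,165 rows in this run). ds_train.train_test_split(test_size=0.02, seed=SEED)
        # would keep the other ~98% for training instead.
        ds_eval  = ds_train.shard(num_shards=50, index=0)
        ds_train = ds_train.shard(num_shards=50, index=1)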

print(ds_train)
print(ds_eval)

# Detect the primary text column
text_column = None
for cand in ["text", "sentence", "content", "raw_text", "document", "data"]:
    if cand in ds_train.column_names:
        text_column = cand
        break

if text_column is None:
    # Heuristic: pick the first string column
    for name in ds_train.column_names:
        if isinstance(ds_train[0][name], str):
            text_column = name
            break

if text_column is None:
    raise ValueError("Could not find a text column. Please inspect ds_train.column_names and set one.")

print("Using text column:", text_column)

Checking dataset configs and splits...


Configs: ['default']


Splits: [('default', ['train'])]


Dataset({
    features: ['file_name', 'content'],
    num_rows: 14165
})
Dataset({
    features: ['file_name', 'content'],
    num_rows: 14165
})
Using text column: content
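
Both data sets above report the same `num_rows`, because each is a single shard out of 50. The sketch below compares the row counts produced by taking two shards with those produced by a 2% hold-out. It is plain arithmetic, not the `datasets` implementation; `shard_size`, `holdout_sizes` and the subset size of 100,000 are hypothetical and chosen only for illustration, and the library's own rounding may differ by a row.

```python
def shard_size(num_rows: int, num_shards: int, index: int) -> int:
    """Rows in shard `index` when `num_rows` rows are split into `num_shards` near-equal shards."""
    base, extra = divmod(num_rows, num_shards)
    return base + (1 if index < extra else 0)


def holdout_sizes(num_rows: int, eval_fraction: float) -> tuple[int, int]:
    """(train_rows, eval_rows) when a fraction is held out and the rest is kept for training."""
    eval_rows = max(1, round(num_rows * eval_fraction))
    return num_rows - eval_rows, eval_rows


subset_rows = 100_000  # hypothetical subset size, for illustration only

# Two shards out of 50: train and eval each get ~2%, and the other ~96% is unused.
print(shard_size(subset_rows, 50, index=1), shard_size(subset_rows, 50, index=0))

# A 2% hold-out: training keeps the remaining ~98%.
print(holdout_sizes(subset_rows, 0.02))
```

If the goal is to train on the full 10% subset, the hold-out variant is the one to aim for.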


## 4) Tokenizer/model (4-bit)

In [4]:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LEN,
    dtype=DTYPE,
    load_in_4bit=LOAD_4BIT,
)


==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



## 5) Prepare tokenized data set

In [5]:

def tok_fn(batch):
    return tokenizer(
        batch[text_column],
        truncation=True,
        max_length=MAX_SEQ_LEN,
        return_attention_mask=False,
    )

tokenized_train = ds_train.map(tok_fn, batched=True, remove_columns=[c for c in ds_train.column_names if c != text_column])
tokenized_eval  = ds_eval.map(tok_fn, batched=True, remove_columns=[c for c in ds_eval.column_names if c != text_column])
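
The trainer in section 6 is created with `packing=True`, which means tokenized documents are concatenated and cut into fixed-length blocks rather than padded one by one. The sketch below illustrates that general idea on toy token ids. It is not the TRL or Unsloth implementation; `pack_sequences` and its arguments are hypothetical names.

```python
from typing import Iterable, Iterator


def pack_sequences(
    sequences: Iterable[list[int]], block_size: int, eos_id: int
) -> Iterator[list[int]]:
    """Concatenate token-id lists, separated by `eos_id`, and yield full blocks of `block_size`."""
    buffer: list[int] = []
    for ids in sequences:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Leftover tokens shorter than one block are dropped in this sketch.


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # toy token ids, not real tokenizer output
blocks = list(pack_sequences(docs, block_size=4, eos_id=0))
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Packing wastes no compute on padding tokens, at the cost of blocks that may span document boundaries.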



## 6) QLoRA (attention & MLP projections) and train (CPT)

In [6]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=[
        "q_proj","k_proj","v_proj","o_proj",
        "gate_proj","up_proj","down_proj"
    ],
)

args = UnslothTrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=1,                 # raise for more passes over the data
    per_device_train_batch_size=1,      # tune to your GPU
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=16,     # effective batch size = 1 x 16 = 16
    learning_rate=5e-5,                 # main LR for LoRA blocks
    embedding_learning_rate=5e-6,       # smaller LR for embed_tokens/lm_head; neither is in target_modules above
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    gradient_checkpointing=True,
    optim="adamw_8bit",
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    logging_steps=50,
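    eval_strategy="steps",              # named evaluation_strategy in older transformers releases
    eval_steps=1000,                    # this run has only 886 steps, so no mid-run eval is triggered
    save_steps=1000,                    # likewise, no intermediate checkpoint is written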
    save_total_limit=2,
    max_grad_norm=1.0,
)

# Disable the KV cache during training to save memory
model.config.use_cache = False

trainer = UnslothTrainer(
    model=model,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    args=args,
    tokenizer=tokenizer, # Explicitly pass the tokenizer
    packing=True,
)

trainer.train()

Unsloth 2025.11.2 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 14,165 | Num Epochs = 1 | Total steps = 886
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 16
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 16 x 1) = 16
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


TrainOutput(global_step=886, training_loss=1.492715814043652, metrics={'train_runtime': 5916.0532, 'train_samples_per_second': 2.394, 'train_steps_per_second': 0.15, 'total_flos': 2.116353018740736e+16, 'train_loss': 1.492715814043652, 'epoch': 1.0})
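
Two numbers in the training banner above can be reproduced by hand: the trainable parameter count and the total step count. The sketch below assumes the published Llama 3 8B projection shapes (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers) and assumes that a final partial accumulation window counts as one step. `lora_param_count` and `optimizer_steps` are hypothetical helper names.

```python
import math


def lora_param_count(rank: int, layer_shapes: dict[str, tuple[int, int]], num_layers: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted matrix: rank * (in + out) parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return per_layer * num_layers


def optimizer_steps(num_examples: int, per_device_batch: int, grad_accum: int, epochs: int = 1) -> int:
    """Optimizer steps on one GPU, counting a final partial accumulation window as a step."""
    return math.ceil(num_examples / (per_device_batch * grad_accum)) * epochs


# (in_features, out_features) per projection, assuming the published Llama 3 8B shapes.
LLAMA3_8B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

print(lora_param_count(16, LLAMA3_8B_SHAPES, num_layers=32))      # 41943040
print(optimizer_steps(14165, per_device_batch=1, grad_accum=16))  # 886
```

The parameter count matches the banner exactly with only the seven projection matrices adapted, which is consistent with `embed_tokens` and `lm_head` not being trained in this run.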

## 7) Save LoRA and (optional) merged weights

In [7]:

import os
os.makedirs(OUTPUT_DIR, exist_ok=True)

model.save_pretrained(f"{OUTPUT_DIR}/lora")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/lora")

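# Merge LoRA into the base weights and save a single 16-bit checkpoint (optional).
# save_pretrained() on a PEFT model writes only the adapter, so use Unsloth's merge helper.
model.save_pretrained_merged(f"{OUTPUT_DIR}/merged", tokenizer, save_method="merged_16bit")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/merged")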

print("Saved LoRA to", f"{OUTPUT_DIR}/lora")
print("Saved merged to", f"{OUTPUT_DIR}/merged")


Saved LoRA to cpt_lang_hi/lora
Saved merged to cpt_lang_hi/merged


### (Optional) Save to Google Drive
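
A minimal sketch of copying the saved outputs to Google Drive so they survive the end of the Colab session. It assumes the standard Colab mount point `/content/drive`; `copy_outputs` and the `unsloth_runs` folder are hypothetical names, and the Drive branch only runs inside Colab.

```python
import shutil
from pathlib import Path


def copy_outputs(src_dir: str, dest_root: str) -> Path:
    """Copy a local output directory into `dest_root`, keeping its name; return the destination."""
    src = Path(src_dir)
    dest = Path(dest_root) / src.name
    shutil.copytree(src, dest, dirs_exist_ok=True)  # overwrite files from an earlier copy
    return dest


try:
    from google.colab import drive  # only importable inside Colab
except ImportError:
    drive = None

if drive is not None:
    drive.mount("/content/drive")
    # "unsloth_runs" is a hypothetical folder; "cpt_lang_hi" is OUTPUT_DIR from the config cell.
    print("Copied to", copy_outputs("cpt_lang_hi", "/content/drive/MyDrive/unsloth_runs"))
```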

## 8) Quick perplexity & generation sanity check

In [11]:
import math

# Use the trainer's built-in evaluate method which handles packing
metrics = trainer.evaluate()
pp = math.exp(metrics["eval_loss"])
print("Eval Perplexity:", pp)

# Simple generation test (raw text, no chat template during CPT)
# Prompt in English: "Write a short paragraph in Hindi that tests this model's understanding."
prompt = "हिंदी में एक छोटा अनुच्छेद लिखें जो इस मॉडल की समझ का परीक्षण करे।"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


Eval Perplexity: 4.496370995897227
हिंदी में एक छोटा अनुच्छेद लिखें जो इस मॉडल की समझ का परीक्षण करे। (एक्सएमएल का इस्तेमाल करें) 2. जी पी एस सेवाओं की विशेषता हिंदी में 150 शब्दों में एक छोटा अनुच्छेद लिखें। (एक्सएमएल का इस्तेमाल करें) 3. 20 वाक्यों का अनुवाद हिंदी में करें तथा उसका अनुवाद अंग्रेजी में करें (एक्सएमएल का इस्तेमाल करें) 4
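
The perplexity printed above is `exp(eval_loss)`, so a value of about 4.50 corresponds to a mean eval loss of about 1.50 nats per token. When per-batch losses are combined by hand, they should be weighted by token count before exponentiating. The sketch below shows that calculation on toy numbers; `perplexity` is a hypothetical helper, and the inputs are not values from this run.

```python
import math


def perplexity(batch_losses: list[float], batch_token_counts: list[int]) -> float:
    """exp of the token-weighted mean negative log-likelihood (natural log, as in cross-entropy)."""
    total_tokens = sum(batch_token_counts)
    total_nll = sum(loss * n for loss, n in zip(batch_losses, batch_token_counts))
    return math.exp(total_nll / total_tokens)


# Toy per-batch mean losses and token counts, not values from this run.
print(perplexity([1.6, 1.4], [100, 300]))
```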
