# 🌍 Continued Pretraining: Teaching an LLM a New Language with Unsloth

This notebook walks through **continued pretraining** (a.k.a. domain/language adaptation) to help a small LLM acquire proficiency in a **new target language** (e.g., **Hindi**) using **Unsloth** for memory-efficient training.

## 🎯 Learning Objectives
- Understand **continued pretraining** vs. SFT/RL (why and when to use each).
- Prepare a **monolingual corpus** (tokenized, de-duplicated, and length-bounded).
- Configure a **4-bit base model + LoRA** for efficient training with Unsloth.
- Run a short, scalable **pretraining loop** (with bf16/fp16 and gradient checkpointing).
- Track metrics (loss/perplexity) and perform **quick language sanity checks**.
- Save **LoRA adapters** and (optionally) **merge** to a single fp16 checkpoint.


In [None]:
!pip install unsloth datasets transformers accelerate bitsandbytes wandb huggingface_hub

Collecting unsloth
  Downloading unsloth-2025.11.2-py3-none-any.whl.metadata (61 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting unsloth_zoo>=2025.11.3 (from unsloth)
  Downloading unsloth_zoo-2025.11.3-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.35-py3-none-any.whl.metadata (12 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.33-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.2 kB)
Collecting datasets
  Downloading datasets-4.3.0-py3-none-any.whl.metadata (18 kB)
Collecting trl!=0.19.0,<=0.23.0,>=0.18.2 (from unsloth)
  Downloading trl-0.23.0-py3-none-any.whl.metadata (11 kB)
Collecting pyarrow>=21.0.0 (fr

In [None]:
# ===============================================================
# 🔐 Secure Hugging Face Login + Environment Setup
# ===============================================================
import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset, Dataset  # for creating or slicing subsets
import os, gc
from google.colab import userdata
from huggingface_hub import login

# ---------------------------------------------------------------
# Fetch Hugging Face token securely from Colab secrets
# ---------------------------------------------------------------
hf_token = userdata.get("HGFaceApi")  # Retrieve token stored under this key

if not hf_token:
    raise ValueError(
        "❌ Hugging Face token not found in Colab secrets.\n"
        "Add a secret named 'HGFaceApi' in the Colab Secrets panel\n"
        "(key icon in the left sidebar) and enable notebook access for it."
    )

try:
    login(hf_token)
    print("✅ Hugging Face login successful.")
except Exception as e:
    print(f"❌ Hugging Face login failed: {e}")
    print("Please ensure your Hugging Face token is valid.")

print("=== Imports and Secure Login Complete ===")


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.9.0+cu130 with CUDA 1300 (you have 2.9.0+cu128)
    Python  3.10.19 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers was not installed correctly.
Please install xformers separately first.
Then confirm if it's correctly installed by running:
python -m xformers.info

Longer error message:
xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.9.0+cu130 with CUDA 1300 (you have 2.9.0+cu128)
    Python  3.10.19 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
🦥 Unsloth Zoo will now patch everything to make training faster!
✅ Hugging Face login successful.
=== Imports and Secure Login Complete ===


### 🌍 Set Continued Pretraining Configuration

Here we define the core parameters for **continued pretraining**, where the base model will learn to understand and generate text in a **new target language**.

- **`TARGET_LANGUAGE_CODE`** – ISO code of the new language (e.g., `"hi"` for Hindi, `"fr"` for French, `"sw"` for Swahili).  
- **`max_seq_length`** – Maximum number of tokens per training sequence, i.e., how much context the model sees in each example.  
- **`load_in_4bit=True`** – Loads the base weights in 4-bit quantized form to reduce VRAM usage, so the model fits on smaller GPUs.  
- **`model_name`** – Uses a small, efficient base model (`unsloth/Qwen2-0.5B-bnb-4bit`) ideal for experimentation.

> 🧠 Adjust `max_seq_length` and `load_in_4bit` depending on your available GPU memory. Longer sequences help the model learn longer syntactic and semantic dependencies in the new language.


In [None]:
# ===============================================================
# 🌍 Continued Pretraining Configuration
# ===============================================================

# --- Target Language Settings ---
TARGET_LANGUAGE_CODE = "hi"   # ISO code for Hindi ("sw" = Swahili, "fr" = French, etc.)
TARGET_LANGUAGE_NAME = "Hindi"  # Readable name for display and logging

# --- Model & Training Parameters ---
max_seq_length = 2048          # Sequence length cap (increase if GPU memory allows)
dtype = None                   # Let Unsloth auto-detect optimal precision (bf16/fp16)
load_in_4bit = True            # Use 4-bit quantization for efficiency

# --- Base Model Selection ---
model_name = "unsloth/Qwen2-0.5B-bnb-4bit"  # Compact base for continued pretraining

# ---------------------------------------------------------------
# Print configuration summary
# ---------------------------------------------------------------
print("🧩 Configuration Summary")
print(f"  • Base Model: {model_name}")
print(f"  • Target Language: {TARGET_LANGUAGE_NAME} ({TARGET_LANGUAGE_CODE})")
print(f"  • Max Sequence Length: {max_seq_length}")
print(f"  • 4-bit Quantization: {load_in_4bit}")
print("=== Configuration Set ===")


🧩 Configuration Summary
  • Base Model: unsloth/Qwen2-0.5B-bnb-4bit
  • Target Language: Hindi (hi)
  • Max Sequence Length: 2048
  • 4-bit Quantization: True
=== Configuration Set ===


### 🚀 Load the Base Model and Tokenizer

This step loads the **base model** and its tokenizer using **Unsloth’s FastLanguageModel** loader, optimized for:
- **4-bit quantization** (to fit on smaller GPUs),
- **long-context pretraining**, and
- **fast memory-efficient operations**.

> ⚙️ The model is loaded first *without* any PEFT/LoRA adapters.  
> In the next step, we’ll attach LoRA layers to enable lightweight continued pretraining on the new target language.


In [None]:
# ===============================================================
# 🚀 Load Base Model and Tokenizer for Continued Pretraining
# ===============================================================
import time

start_time = time.time()
print(f"🔄 Loading base model: {model_name} ...")

# Load the base model using Unsloth's optimized loader
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = model_name,
    max_seq_length = max_seq_length,
    dtype          = dtype,
    load_in_4bit   = load_in_4bit,
    # token = hf_token,  # Optional: uncomment if HF login fails
)

elapsed = time.time() - start_time
print(f"✅ Model and tokenizer successfully loaded in {elapsed:.2f} seconds.")

# ---------------------------------------------------------------
# Note:
# We only load the base model here.
# LoRA adapters (for efficient fine-tuning) will be added in the next step.
# ---------------------------------------------------------------
print("=== Base Model and Tokenizer Ready ===")


🔄 Loading base model: unsloth/Qwen2-0.5B-bnb-4bit ...
==((====))==  Unsloth 2025.11.2: Fast Qwen2 patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/457M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/167 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/107 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/256 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

✅ Model and tokenizer successfully loaded in 23.60 seconds.
=== Base Model and Tokenizer Ready ===


In [None]:
# ===============================================================
# 📚 Load Raw Text Dataset (mC4 via allenai/c4) — Hindi
# ===============================================================
from datasets import load_dataset, Dataset

dataset_name = "allenai/c4"
subset_name = TARGET_LANGUAGE_CODE      # e.g., "hi"
subset_size = 20_000
min_length = 50
text_column = "text"

print(f"Loading dataset: {dataset_name}, language: {subset_name}")
print(f"Streaming and selecting the first {subset_size} usable examples...")

try:
    # ✅ Use the maintained repo + per-language config
    streamed = load_dataset(
        dataset_name,
        subset_name,              # e.g., "hi"
        split="train",
        streaming=True,
    )
    print("Dataset stream initialized.")

    # Filter by minimum length; over-sample to account for drops
    filtered_iterable = (
        ex for ex in streamed.take(subset_size * 2)
        if len(ex.get(text_column, "")) >= min_length
    )

    # Materialize a small in-memory subset
    dataset_list = [ex for _, ex in zip(range(subset_size), filtered_iterable)]
    if not dataset_list:
        raise ValueError(
            f"No entries >= {min_length} chars found in first {subset_size*2} examples for '{subset_name}'."
        )

    dataset = Dataset.from_list(dataset_list)
    print(f"✅ Built subset with {len(dataset)} rows.")
    print("Features:", dataset.features)

    if text_column in dataset.features:
        print("\nSample text:\n", dataset[0][text_column][:500])
    else:
        print(f"⚠️ Expected '{text_column}' column not found. Check printed features.")

except Exception as e:
    print(f"❌ Error loading '{dataset_name}' / '{subset_name}': {e}")
    print("Tip: confirm that this language config exists in the dataset repo (e.g., 'hi').")
    raise

print("\n=== mC4 (via allenai/c4) subset ready ===")


Loading dataset: allenai/c4, language: hi
Streaming and selecting the first 20000 usable examples...


README.md: 0.00B [00:00, ?B/s]

Resolving data files:   0%|          | 0/1024 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/1024 [00:00<?, ?it/s]

Dataset stream initialized.
✅ Built subset with 20000 rows.
Features: {'text': Value('string'), 'timestamp': Value('timestamp[us]'), 'url': Value('string')}

Sample text:
 6 साल की बच्ची अपनी मां के लिए बनी मां | UPUKLive
6 साल की बच्ची अपनी मां के लिए बनी मां
जो प्यार, करुणा और देखभाल का स्वभाव ईश्वर ने बेटियों को दिया है, वह बेटों को हासिल नहीं है। मां को ब्रेन हैमरेज हो जाने के बाद छह साल की मासूम ने जिस तरह से मां की देखभाल की, उसे देखकर लगता है कि मां असल में बेटी है और बेटी मां है। काई चेंगचेंग जब महज छह साल की थी, तो उसकी मां चेन ली को ब्रेन हैमरेज हो गया था। इसकी वजह से उनकी याददाश्त खराब हो गई।
बीते चार साल से अपनी मां को पढ़ना, लिखना और बोलना सिखाना ही क

=== mC4 (via allenai/c4) subset ready ===
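
The cell above filters by length only. The learning objectives also mention de-duplication, which this notebook does not implement. The following is a minimal sketch, assuming exact duplicates (after whitespace normalization) are the main concern; the helper name `dedup_and_filter` and its parameters are hypothetical and not part of the notebook or of `datasets`.

```python
import hashlib

def dedup_and_filter(records, text_key="text", min_chars=50):
    """Yield records whose text is long enough and not an exact duplicate."""
    seen = set()
    for rec in records:
        text = rec.get(text_key) or ""
        if len(text) < min_chars:
            continue
        # Collapse runs of whitespace so trivially different copies hash identically
        normalized = " ".join(text.split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield rec

# Toy data (hypothetical): two copies differing only in spacing, plus one short row
sample = [
    {"text": "नमस्ते दुनिया " * 5},
    {"text": "नमस्ते  दुनिया " * 5},
    {"text": "छोटा"},
]
kept = list(dedup_and_filter(sample, min_chars=20))
print(f"Kept {len(kept)} of {len(sample)} rows")
```

Because the helper is a generator, it could wrap the streamed iterable before the `zip(range(subset_size), ...)` step, so memory use stays bounded by the set of hashes rather than by the text itself. Near-duplicate detection (e.g., MinHash) is a separate technique and is not covered by this sketch.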


In [None]:
# Even for continued pretraining, PEFT/LoRA is often used with Unsloth
# to make training feasible on limited hardware and manage checkpoints.
# The LoRA adapters will learn the new language patterns.
print("Configuring LoRA adapters for pretraining...")

model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Rank can be higher for pretraining (e.g., 32, 64) as we want to learn broader patterns
    lora_alpha = 64, # Adjust alpha accordingly (often 2*r)
    lora_dropout = 0, # Set to 0 for Unsloth fast patching
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 3407,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
)

print("LoRA configured for pretraining:")
model.print_trainable_parameters()  # prints the summary itself and returns None
print("=== LoRA Configuration Complete ===")

Configuring LoRA adapters for pretraining...


Unsloth 2025.11.2 patched 24 layers with 24 QKV layers, 24 O layers and 24 MLP layers.


LoRA configured for pretraining:
trainable params: 17,596,416 || all params: 511,629,184 || trainable%: 3.4393
=== LoRA Configuration Complete ===
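
The trainable-parameter count in the log can be reproduced by hand. Each adapted linear layer of shape `(d_in, d_out)` gains two low-rank matrices, `A` with `r × d_in` entries and `B` with `d_out × r` entries, so it adds `r * (d_in + d_out)` parameters. The sketch below assumes the Qwen2-0.5B dimensions (hidden size 896, MLP size 4864, key/value projection width 128, 24 layers); these values are not printed by the notebook and are stated here as assumptions. The function name `lora_param_count` is hypothetical.

```python
def lora_param_count(r, shapes, n_layers):
    """Total LoRA parameters: r * (d_in + d_out) per adapted layer, per block."""
    per_block = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * n_layers

# Assumed Qwen2-0.5B dimensions (not taken from the notebook output)
hidden, intermediate, kv_dim, n_layers = 896, 4864, 128, 24

shapes = {
    "q_proj":    (hidden, hidden),
    "k_proj":    (hidden, kv_dim),
    "v_proj":    (hidden, kv_dim),
    "o_proj":    (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj":   (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

total = lora_param_count(r=32, shapes=shapes, n_layers=n_layers)
print(f"trainable params: {total:,}")
```

Under these assumptions the result is 17,596,416, matching the logged value. Doubling `r` doubles the adapter size, which is a quick way to budget memory before changing the rank.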


### 🧰 Configure Continued Pretraining with SFTTrainer

We use **TRL’s `SFTTrainer`** in **packing mode** to pretrain on raw text from the target language:

- **`packing=True`**: efficiently fills each training sequence to `max_seq_length` by concatenating raw samples.
- **Raw text field**: `dataset_text_field="text"` points the trainer to the column that contains plain text.
- **Precision**: automatically prefers **bf16** if the GPU supports it, otherwise **fp16**.
- **Throughput**: small per-device batch with **gradient accumulation** to simulate a larger effective batch.
- **Scheduling**: short warmup with a **linear** LR schedule; adjust `max_steps`/`learning_rate` as needed.
- **Checkpoints**: save every 50 steps for easy resumption and evaluation.

> This setup is optimized for **continued pretraining (CPT)**: we focus on language exposure and next-token prediction, not instruction formats. After CPT, you can stack **SFT** or **RL** for task alignment.
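
To make the idea behind `packing=True` concrete, here is a simplified sketch: tokenized documents are concatenated into one stream, separated by an end-of-sequence token, and the stream is cut into fixed-size blocks. This is an illustration of the concept under stated assumptions, not TRL's actual implementation (which may use a different packing strategy); `pack_sequences` and the toy token IDs are hypothetical.

```python
def pack_sequences(tokenized_docs, block_size, eos_id):
    """Concatenate docs (each followed by EOS) and split into full blocks."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)  # marks the document boundary inside a block
    n_full = len(stream) // block_size
    # This sketch drops any trailing partial block
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]

# Toy token IDs (hypothetical); 0 plays the role of the EOS token
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_sequences(docs, block_size=4, eos_id=0)
for block in blocks:
    print(block)
```

Every block is completely filled, so no compute is spent on padding tokens. The trade-off is that a block can contain the end of one document and the start of another.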


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

# Note: the "llama3" prefix is only a directory label; the base model loaded above is Qwen2-0.5B.
output_directory = f"llama3_base_pretrain_{TARGET_LANGUAGE_CODE}_run1"

print(f"Configuring SFTTrainer for Continued Pretraining. Output directory: {output_directory}")

# Key difference: Use packing=True for efficient pretraining on raw text
# No custom formatting function is needed; SFTTrainer handles text packing.

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,        # The raw text dataset
    dataset_text_field = "text",    # The column containing the raw text
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = True,                 # <<< IMPORTANT: Enable packing for pretraining efficiency

    args = TrainingArguments(
        per_device_train_batch_size = 2,  # Keep batch size low due to sequence length
        gradient_accumulation_steps = 8,  # Increase accumulation (effective batch size 16)
        warmup_steps = 20,                # Slightly more warmup might be beneficial
        max_steps = 200,                  # Hard cap on optimizer steps (adjust as needed)
        num_train_epochs = 1,             # Ignored while max_steps > 0; set max_steps = -1 to train for one full epoch
        learning_rate = 1e-4,             # Learning rate can sometimes be slightly higher or lower for CPT (e.g., 5e-5 to 2e-4)
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = output_directory,
        save_strategy = "steps",
        save_steps = 50,                 # Save checkpoints regularly
        report_to="tensorboard",
    ),
)

print("Trainer configured for continued pretraining.")
if torch.cuda.is_available():
    gpu_stats = torch.cuda.get_device_properties(0)
    start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
    print(f"Pre-Train GPU: {gpu_stats.name}. Max memory reserved: {start_gpu_memory} GB.")
print("=== Trainer Configuration Complete ===")

Configuring SFTTrainer for Continued Pretraining. Output directory: llama3_base_pretrain_hi_run1


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/20000 [00:00<?, ? examples/s]

Trainer configured for continued pretraining.
Pre-Train GPU: Tesla T4. Max memory reserved: 0.555 GB.
=== Trainer Configuration Complete ===
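
The training budget implied by these arguments can be checked with a few lines of arithmetic. The sketch below computes the effective batch size, the fraction of the 20,000-row subset covered by 200 steps (assuming each row counts as one example), and the learning rate at selected steps under linear warmup followed by linear decay. The function names are hypothetical; the schedule formula follows the usual linear-with-warmup rule and is meant as an illustration, not as the trainer's internal code.

```python
def effective_batch_size(per_device, grad_accum, n_gpus=1):
    """Sequences contributing to one optimizer step."""
    return per_device * grad_accum * n_gpus

def epoch_fraction(steps, eff_batch, n_examples):
    """Share of the dataset seen after `steps` optimizer steps."""
    return steps * eff_batch / n_examples

def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

eff = effective_batch_size(per_device=2, grad_accum=8)
frac = epoch_fraction(steps=200, eff_batch=eff, n_examples=20_000)
print(f"effective batch = {eff}, epoch fraction = {frac:.2f}")

for step in (0, 10, 20, 110, 200):
    print(step, linear_lr(step, base_lr=1e-4, warmup_steps=20, total_steps=200))
```

With these settings, 200 steps cover 16% of the subset, which is why `max_steps` rather than `num_train_epochs` determines when training stops.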


In [None]:
# ===============================================================
# 🚀 Run Continued Pretraining
# ===============================================================
import time, gc, torch

print(f"🧠 Starting continued pretraining on {TARGET_LANGUAGE_NAME} text...")
start_time = time.time()

# ---------------------------------------------------------------
# 🧹 Clear cache and memory before training
# ---------------------------------------------------------------
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("✅ Cleared CUDA memory cache.")

# ---------------------------------------------------------------
# 🏋️ Begin training
# ---------------------------------------------------------------
trainer_stats = trainer.train()

elapsed = (time.time() - start_time) / 60
print(f"\n✅ Continued pretraining completed in {elapsed:.2f} minutes.")

# ---------------------------------------------------------------
# 💾 GPU Memory Usage Summary
# ---------------------------------------------------------------
if torch.cuda.is_available():
    used_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
    if "start_gpu_memory" in locals():
        used_memory_for_training = round(used_gpu_memory - start_gpu_memory, 3)
        print(f"Peak GPU memory reserved: {used_gpu_memory} GB "
              f"(+{used_memory_for_training} GB used during training).")
    else:
        print(f"Peak GPU memory reserved: {used_gpu_memory} GB")

# ---------------------------------------------------------------
# 📊 Training Stats
# ---------------------------------------------------------------
print("\n📈 Training statistics:")
print(trainer_stats)
print("\n=== ✅ Continued Pretraining Complete ===")


The model is already on multiple devices. Skipping the move to device specified in `args`.


🧠 Starting continued pretraining on Hindi text...
✅ Cleared CUDA memory cache.


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 20,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 17,596,416 of 511,629,184 (3.44% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,1.9684
20,1.771
30,1.7485
40,1.7528
50,1.7409
60,1.8207
70,1.6892
80,1.7215
90,1.7556
100,1.8182



✅ Continued pretraining completed in 24.77 minutes.
Peak GPU memory reserved: 1.99 GB (+1.435 GB used during training).

📈 Training statistics:
TrainOutput(global_step=200, training_loss=1.7344526290893554, metrics={'train_runtime': 1481.0186, 'train_samples_per_second': 2.161, 'train_steps_per_second': 0.135, 'total_flos': 7174573946496000.0, 'train_loss': 1.7344526290893554, 'epoch': 0.16})

=== ✅ Continued Pretraining Complete ===
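
The learning objectives mention perplexity, but the run only logs loss. Assuming the logged loss is the mean per-token cross-entropy in nats, perplexity is simply its exponential. The sketch below converts a few of the logged values; `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity from a mean per-token cross-entropy measured in nats."""
    return math.exp(mean_cross_entropy)

# Values copied from the training log above
logged_losses = {10: 1.9684, 50: 1.7409, 100: 1.8182}
for step, loss in logged_losses.items():
    print(f"step {step:>3}: loss={loss:.4f}  ppl={perplexity(loss):.2f}")

final_ppl = perplexity(1.7344526290893554)  # average training loss of the run
print(f"run average: ppl={final_ppl:.2f}")
```

The run-average loss of about 1.73 corresponds to a perplexity of roughly 5.7. These are training-set figures; a held-out Hindi split would be needed for a meaningful comparison between runs.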


### 💾 Save Final LoRA Adapters and Tokenizer

After continued pretraining, we export the **LoRA adapter weights** and **tokenizer** for reuse.

- The **LoRA adapters** capture the learned linguistic features of the new target language (e.g., Hindi grammar, vocabulary, syntax).  
- Saving the **tokenizer** ensures consistent tokenization when reloading or fine-tuning.

These files can later be:
- Reattached to the same base model for inference or supervised fine-tuning, or  
- Merged into a single full-precision checkpoint for deployment.

> 💡 Tip: Store each run's adapters under its own run directory (e.g., `<output_directory>/final_adapters`) so you can easily compare multiple language pretraining runs.
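
The notebook saves adapters but does not show the reload or merge path. The following is a hedged sketch of how that could look with Unsloth once the save cell below has run, assuming `FastLanguageModel.from_pretrained` is pointed at the adapter directory and that `save_pretrained_merged` is called with `save_method="merged_16bit"`. The function name `reload_and_merge` and the directory arguments are hypothetical, and the sketch has not been run as part of this notebook.

```python
def reload_and_merge(adapter_dir, merged_dir, max_seq_length=2048):
    """Reload saved LoRA adapters and export a merged 16-bit checkpoint."""
    # Imported lazily so the function can be defined without a GPU runtime
    from unsloth import FastLanguageModel

    # Loading from the adapter directory re-attaches the adapters to the base model
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=adapter_dir,
        max_seq_length=max_seq_length,
        dtype=None,
        load_in_4bit=True,
    )

    # Fold the LoRA weights into the base weights and write a standalone checkpoint
    model.save_pretrained_merged(merged_dir, tokenizer, save_method="merged_16bit")
    return merged_dir
```

A call such as `reload_and_merge(final_adapter_dir, "merged_fp16")` would then produce a checkpoint that no longer needs the adapter files. Merging increases disk usage considerably compared with the adapters alone, so keeping both is usually only worthwhile for deployment.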


In [None]:
# ===============================================================
# 💾 Save Final LoRA Adapters and Tokenizer
# ===============================================================
import os

# Define final adapter directory inside the output folder
final_adapter_dir = f"{output_directory}/final_adapters"
os.makedirs(final_adapter_dir, exist_ok=True)

print(f"\n🔖 Saving final LoRA adapters to: {final_adapter_dir}")

# Save trained LoRA adapter weights (language adaptation layers)
model.save_pretrained(final_adapter_dir)

# Save the tokenizer configuration for future use
tokenizer.save_pretrained(final_adapter_dir)

print(f"✅ LoRA adapters and tokenizer successfully saved to {final_adapter_dir}")
print("=== Continued Pretraining Artifacts Saved ===")



🔖 Saving final LoRA adapters to: llama3_base_pretrain_hi_run1/final_adapters
✅ LoRA adapters and tokenizer successfully saved to llama3_base_pretrain_hi_run1/final_adapters
=== Continued Pretraining Artifacts Saved ===


### 🧪 Inference Sanity Check (Target Language)

We run a quick generation test **after continued pretraining**:

- Use a short **prompt in the target language** (raw text; no chat template).
- Generate up to `max_new_tokens=50` with sampling (`temperature=0.7`, `top_p=0.9`).
- Print both the **new tokens only** and the **full string** (prompt + completion).

> If outputs look repetitive or off-topic, increase training steps, improve corpus cleanliness, or try a slightly higher `max_seq_length` (if VRAM permits).
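
To see what `temperature` and `top_p` do to the next-token distribution, the sketch below applies both to a toy set of logits: logits are divided by the temperature, converted to probabilities, and then only the smallest set of tokens whose cumulative probability reaches `top_p` is kept and renormalized. This illustrates the idea only; it is not the `generate` implementation, and `top_p_distribution` and the toy logits are hypothetical.

```python
import math

def top_p_distribution(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: prob} after temperature scaling and nucleus filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Toy logits (hypothetical) whose softmax is 0.5 / 0.3 / 0.15 / 0.05
logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
warm = top_p_distribution(logits, temperature=1.0, top_p=0.9)
cold = top_p_distribution(logits, temperature=0.7, top_p=0.9)
print("T=1.0:", warm)
print("T=0.7:", cold)
```

Lowering the temperature concentrates probability on the top token, while `top_p` removes the low-probability tail entirely, which is why the two settings together tend to reduce off-topic continuations.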


In [None]:
# ===============================================================
# 🧪 Inference Sanity Check (Raw Text Generation in Target Language)
# ===============================================================
import warnings, torch, gc, time
warnings.filterwarnings("ignore")

print("\n🚦 Running inference smoke test...")

# Pick device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Ensure tokenizer has a pad token
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

# Switch model to inference mode (Unsloth utility) and eval()
FastLanguageModel.for_inference(model)
model.eval()

# Optional: reproducible sampling (comment out for more randomness)
# torch.manual_seed(3407)

# --- Prompt in the target language (edit as needed) ---
prompt_native = "संयुक्त राज्य अमेरिका एक विशाल देश है जहाँ"
print(f"Using prompt in {TARGET_LANGUAGE_NAME}: {prompt_native!r}")

# Tokenize raw text (no chat template for base CPT)
inputs = tokenizer([prompt_native], return_tensors="pt").to(device)

# Generation configuration
gen_kwargs = dict(
    max_new_tokens=50,     # short continuation
    do_sample=True,        # enable sampling
    temperature=0.7,
    top_p=0.9,
    use_cache=True,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

print("\n🧩 Generating...")
t0 = time.time()
with torch.inference_mode():  # also disables gradient tracking
    outputs = model.generate(**inputs, **gen_kwargs)
dt = time.time() - t0

# Decode full sequence (prompt + completion)
full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Decode new tokens only
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
gen_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

print("\n--- Prompt ---")
print(prompt_native)

print("\n--- Generated Continuation (new tokens only) ---")
print(gen_text)

print("\n--- Full Text (prompt + continuation) ---")
print(full_text)

print(f"\n⏱️ Generation time: {dt:.2f}s")

# Cleanup
del inputs, outputs, new_tokens
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("\n=== ✅ Pretraining Inference Test Complete ===")



🚦 Running inference smoke test...
Using prompt in Hindi: 'संयुक्त राज्य अमेरिका एक विशाल देश है जहाँ'

🧩 Generating...

--- Prompt ---
संयुक्त राज्य अमेरिका एक विशाल देश है जहाँ

--- Generated Continuation (new tokens only) ---
 देश को लोगों को रहने की सुनाई देखने का भी विचार है जो �

--- Full Text (prompt + continuation) ---
संयुक्त राज्य अमेरिका एक विशाल देश है जहाँ देश को लोगों को रहने की सुनाई देखने का भी विचार है जो �

⏱️ Generation time: 2.49s

=== ✅ Pretraining Inference Test Complete ===
