A transformer-based language model pretrained from scratch on a cleaned, deduplicated corpus of high-quality, natural Telugu text collected from diverse sources.
MIT
- Telugu (te)
text-generation
The training corpus was carefully prepared using the following steps to ensure data quality, linguistic relevance, and uniqueness:
- From AI4Bharat/Sangraha, only Telugu-native content was selected; the synthetic subset was excluded.
- From allenai/c4 and oscar, only documents identified as Telugu were retained.
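As a rough illustration, language-filtered splits of these datasets can be pulled with the Hugging Face `datasets` library. The subset names below (`data_dir="verified/tel"` for Sangraha, `"te"` for allenai/c4, `"unshuffled_deduplicated_te"` for OSCAR) are assumptions based on the public dataset cards, not the project's actual preprocessing code.

```python
# Sketch only: hub IDs and subset names are assumptions taken from the public
# dataset cards; verify them before running.
from datasets import load_dataset

# Telugu-native portion of Sangraha (the synthetic subset is deliberately skipped).
sangraha_te = load_dataset("ai4bharat/sangraha", data_dir="verified/tel", split="train")

# Telugu documents from multilingual C4 and OSCAR, streamed to avoid a full download.
c4_te = load_dataset("allenai/c4", "te", split="train", streaming=True)
oscar_te = load_dataset("oscar", "unshuffled_deduplicated_te", split="train", streaming=True)
```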
A custom deduplication and cleaning pipeline was developed using MinHash and Locality Sensitive Hashing (LSH) to eliminate near-duplicate documents and maintain a diverse dataset. Steps included:
- Text Normalization:
  - Stripping extra whitespace.
  - Replacing multiple newlines and tabs with a single space.
- MinHash-based Deduplication (a minimal sketch of this step follows the list):
  - A MinHashLSH index was used with num_perm = 128 and similarity_threshold = 0.95.
  - Each document was tokenized at the word level and hashed.
  - Duplicates were detected and dropped rather than added to the final corpus.
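Below is a minimal sketch of the normalization and deduplication steps above, written with the `datasketch` library's `MinHash`/`MinHashLSH`. The helper names and the word-level shingling shown here are illustrative and may differ from the actual pipeline.

```python
# Minimal sketch of the cleaning + MinHash/LSH dedup described above.
# Exact shingling and implementation details in the real pipeline may differ.
import re
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128
lsh = MinHashLSH(threshold=0.95, num_perm=NUM_PERM)

def normalize(text: str) -> str:
    # Collapse newlines, tabs, and repeated spaces into a single space.
    return re.sub(r"\s+", " ", text).strip()

def minhash_of(text: str) -> MinHash:
    # Word-level tokens feed the MinHash signature.
    m = MinHash(num_perm=NUM_PERM)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

def deduplicate(documents):
    unique_docs = []
    for i, doc in enumerate(documents):
        doc = normalize(doc)
        m = minhash_of(doc)
        # Skip documents whose signature is ~95% similar to one already kept.
        if lsh.query(m):
            continue
        lsh.insert(f"doc-{i}", m)
        unique_docs.append(doc)
    return unique_docs
```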
The Sakhi model was trained from scratch with the following configuration:
```yaml
model_parameters:
  embed_dim: 2048
  num_heads: 8
  ff_dim: 4096
  chunk_length: 1024
  num_layers: 10
  vocab_size: 64000
```
- Embedding Dimension: 2048
- Attention Heads: 8
- Feedforward Layer Dimension: 4096 (with SwiGLU activation)
- Context Length: 1024 tokens
- Layers: 10 transformer decoder blocks
- Vocabulary Size: 64,000 (custom Byte-Level BPE)
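To make these numbers concrete, here is a hedged sketch of one decoder block with the listed dimensions and a SwiGLU feed-forward layer. The class names, pre-norm placement, and use of `nn.MultiheadAttention` are assumptions for illustration; they are not taken from the Sakhi source.

```python
# Illustrative decoder block matching the listed dimensions; not the actual
# Sakhi implementation. Normalization placement and attention internals are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward layer with SwiGLU activation: (SiLU(x W1) * x W3) W2."""
    def __init__(self, embed_dim: int = 2048, ff_dim: int = 4096):
        super().__init__()
        self.w1 = nn.Linear(embed_dim, ff_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(embed_dim, ff_dim, bias=False)  # value projection
        self.w2 = nn.Linear(ff_dim, embed_dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class DecoderBlock(nn.Module):
    def __init__(self, embed_dim: int = 2048, num_heads: int = 8, ff_dim: int = 4096):
        super().__init__()
        self.attn_norm = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff_norm = nn.LayerNorm(embed_dim)
        self.ff = SwiGLU(embed_dim, ff_dim)

    def forward(self, x: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        # Pre-norm causal self-attention, then SwiGLU feed-forward, with residuals.
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        x = x + self.ff(self.ff_norm(x))
        return x

# Usage: one forward pass over a full 1024-token chunk.
block = DecoderBlock()
x = torch.randn(2, 1024, 2048)  # (batch, chunk_length, embed_dim)
causal = torch.triu(torch.full((1024, 1024), float("-inf")), diagonal=1)
out = block(x, causal)          # -> (2, 1024, 2048)
```

The full model would stack 10 such blocks behind a 64,000-entry token embedding and a language-modeling head.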
The model was pretrained for 100 hours on 4× A100 GPUs provided by Lambda. Pretraining used PyTorch with mixed precision and DDP (DistributedDataParallel) for efficient multi-GPU scaling.
```yaml
train_parameters:
  batch_size: 12
  num_epochs: 1
  init_learning_rate: 1e-5
  min_learning_rate: 1e-8
  seed: 42
  master_addr: "localhost"
  master_port: "12355"
  num_gpus: -1
  save_every_n_steps: 25000
  log_every_n_steps: 100
  gradient_clipping_max_norm: 3.0
  call_torch_compile_on_model: False
  gradient_accumulation_steps: 2
```
- Effective Batch Size: 24 (batch size 12 × 2 gradient accumulation steps)
- Epochs: 1 (large-scale corpus, 13 billion tokens)
- Learning Rate Schedule: Linear warm-up to 1e-5, cosine decay to 1e-8
- Gradient Clipping: 3.0
- Logging: Every 100 steps using Weights & Biases
- Checkpointing: Every 25,000 steps
- GPUs: 4 × A100 (Lambda)
- Runtime: 100 hours
- Precision: Mixed precision (FP16)
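Below is a hedged sketch of a training step consistent with the parameters above (DDP, FP16 mixed precision, gradient accumulation of 2, clipping at max norm 3.0, linear warm-up followed by cosine decay from 1e-5 to 1e-8). The helper names, the AdamW optimizer, the warm-up length, and the assumption that the model returns its loss are illustrative, not the actual training script.

```python
# Illustrative one-process-per-GPU training loop matching the listed parameters;
# not the actual Sakhi script. Process-group setup (master_addr, master_port),
# the dataloader, checkpointing, and W&B logging are assumed to happen elsewhere.
import math
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

INIT_LR, MIN_LR = 1e-5, 1e-8
ACCUM_STEPS, CLIP_NORM = 2, 3.0

def lr_at(step: int, total_steps: int, warmup_steps: int) -> float:
    # Linear warm-up to INIT_LR, then cosine decay down to MIN_LR.
    # (warmup_steps is not given in the card; it is a placeholder here.)
    if step < warmup_steps:
        return INIT_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return MIN_LR + 0.5 * (INIT_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

def train(model, dataloader, total_steps, warmup_steps, local_rank):
    model = DDP(model.to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=INIT_LR)  # optimizer choice is an assumption
    scaler = torch.cuda.amp.GradScaler()  # FP16 mixed precision

    for step, batch in enumerate(dataloader):
        if step >= total_steps:
            break
        lr = lr_at(step, total_steps, warmup_steps)
        for group in optimizer.param_groups:
            group["lr"] = lr

        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(batch["input_ids"].to(local_rank),
                         batch["labels"].to(local_rank))  # assumes the model returns its loss
        scaler.scale(loss / ACCUM_STEPS).backward()  # average grads over the accumulation window

        if (step + 1) % ACCUM_STEPS == 0:
            scaler.unscale_(optimizer)  # unscale before clipping at max_norm = 3.0
            torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
```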
🚀 GPU costs were partially sponsored by Lambda Labs.
```yaml
paths:
  tokenizer_path: "/"
  dataset_path: "/"
  save_dir: "/"
```
⚠️ The paths above are placeholders; replace them with actual paths before training.