# Sarvam 2B Demo

`sarvamai/sarvam-2b-v0.5` is an early checkpoint of `sarvam-2b`, a small yet powerful language model that is being pre-trained from scratch on 4 trillion tokens. It is trained to perform well in 10 Indic languages plus English. The officially supported Indic languages are Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

sarvam-2b will be trained on a data mixture containing equal parts English (2T) and Indic (2T) tokens. The current checkpoint has seen a total of 2 trillion tokens, and has not undergone any post-training.

## Text completions

In [None]:
from IPython.display import HTML, display

# wrap the output text; horizontal scrolling is a pain!
def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextGenerationPipeline

# initialize a HF pipeline for easy use
model = AutoModelForCausalLM.from_pretrained("sarvamai/sarvam-2b-v0.5")
tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-2b-v0.5")
tokenizer.pad_token_id = tokenizer.eos_token_id
pipe = TextGenerationPipeline(model=model, tokenizer=tokenizer, device="cuda", torch_dtype="bfloat16", return_full_text=False)

In [None]:
gen_kwargs = {
  "temperature": 0.01, # a very low temperature introduces only a small amount of stochasticity into generation
  "repetition_penalty": 1.2, # discourage the model from repeating tokens that have already appeared
  "max_new_tokens": 256, # maximum number of tokens to generate
  "stop_strings": ["</s>", "\n\n"], # stop generating as soon as one of these strings appears
  "tokenizer": tokenizer, # needed so that the stop strings can be matched against the generated text
} # works best with these defaults

# a simple wrapper to get text completions
def gen(prompt):
  # trailing whitespace in the prompt hurts generation quality, so strip it
  prompt = prompt.rstrip()
  output = pipe(prompt, **gen_kwargs)
  print(output[0]["generated_text"])
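
To make the two sampling knobs in `gen_kwargs` more concrete, here is a minimal pure-Python sketch of what a temperature and a repetition penalty do to next-token scores. It is an illustration under simplifying assumptions (a toy four-token vocabulary and the hypothetical helpers `softmax_with_temperature` and `apply_repetition_penalty`), not the code path that runs inside `transformers`. The penalty rule shown, dividing positive scores and multiplying negative ones, follows the commonly documented behaviour.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Make already-seen tokens less likely: shrink positive scores, push negative ones lower."""
    out = list(logits)
    for t in set(seen_token_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

toy_logits = [2.0, 1.5, -0.5, 0.1]  # hypothetical scores for a 4-token vocabulary
near_greedy = softmax_with_temperature(toy_logits, 0.01)
penalized = apply_repetition_penalty(toy_logits, seen_token_ids=[0, 2], penalty=1.2)
print([round(p, 3) for p in near_greedy])
print([round(x, 3) for x in penalized])
```

At a temperature of 0.01 almost all of the probability mass lands on the highest-scoring token, which is why the completions in this notebook are close to deterministic.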

**sarvam-2b-v0.5 is a pre-trained model. It is not instruction fine-tuned or aligned, which means that it cannot answer questions or follow instructions. It is only a text-completion model.**

However, a lot of world knowledge is baked into the model. It can be brought out either by asking the model to complete a properly constructed prompt or by finetuning it on a dataset. First, let us see how we can get the model to complete text for us.

*Note: the model can sometimes hallucinate and give out wrong information.*

In [None]:
gen("Here is a simple sorting algorithm implemented in Python:")


```python 
def bubble_sort(arr):    # Bubble Sort Algorithm
  n = len(arr) # Get the length of array
  for i in range (0, n-1):      # Iterate through each element and swap if needed
    if arr[i] > arr[i+1] :        # If current item is greater than next one then do this...
          temp = arr[i]               #...and assign it to temp variable. Then swap them!
          arr[i], arr[i + 1] = temp, arr[i + 1]       # And finally update indexes accordingly.""""
  return arr     
```
</s>


In [None]:
prompt = """*ಬೆಂಗಳೂರಿನ ಇತಿಹಾಸ*

ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ ಬೆಂಗಳೂರು ಇವತ್ತು"""

gen(prompt)

 ಭಾರತದ ತಂತ್ರಜ್ಞಾನ ಕೇಂದ್ರವಾಗಿ ಮತ್ತು ಐಟಿ ಉದ್ಯಮದ ಹೃದಯಭಾಗವಾಗಿದೆ. ಇದು ತನ್ನ ರೋಮಾಂಚಕ ಸಂಸ್ಕೃತಿ, ಶ್ರೀಮಂತ ಪರಂಪರೆ ಮತ್ತು ಅಭಿವೃದ್ಧಿ ಹೊಂದುತ್ತಿರುವ ಆರ್ಥಿಕತೆಗೆ ಹೆಸರುವಾಸಿಯಾಗಿದೆ. ಈ ನಗರವು ಹಲವಾರು ಐತಿಹಾಸಿಕ ಹೆಗ್ಗುರುತುಗಳು, ಆಧುನಿಕ ಗಗನಚುಂಬಿ ಕಟ್ಟಡಗಳು ಮತ್ತು ಅದರ ನಿವಾಸಿಗಳ ಸೃಜನಶೀಲ ಚೈತನ್ಯವನ್ನು ಹೊಂದಿದೆ.




If you are interested in solving a particular task, you can give a few examples in the prompt so that the model can follow the same pattern. Below, we give three English-to-Oriya translations in the prompt and ask the model to translate a fourth sentence.

In [None]:
prompt = """English: The main motive of the Manhattan Project was to develop a nuclear bomb
Oriya: ମ୍ୟାନହାଟନ ପ୍ରକଳ୍ପର ମୁଖ୍ୟ ଉଦ୍ଦେଶ୍ୟ ଥିଲା ପରମାଣୁ ବୋମା ବିକଶିତ କରିବା

English: It was done during World War II and was a major factor during the last stages of the war.
Oriya: ଏହା ଦ୍ବିତୀୟ ବିଶ୍ୱଯୁଦ୍ଧ ସମୟରେ କରାଯାଇଥିଲା ଏବଂ ଯୁଦ୍ଧର ଶେଷ ପର୍ଯ୍ୟାୟରେ ଏହା ଏକ ପ୍ରମୁଖ କାରଣ ଥିଲା।

English: The current population of earth is about to exceed 8 billion, but experts say that this is not a cause for alarm.
Oriya: ପୃଥିବୀର ବର୍ତ୍ତମାନର ଜନସଂଖ୍ୟା ୮ ବିଲିୟନରୁ ଅଧିକ, କିନ୍ତୁ ବିଶେଷଜ୍ଞମାନେ କହୁଛନ୍ତି ଯେ ଏହା ଚିନ୍ତାର କାରଣ ନୁହେଁ।

English: The first horcrux that Harry Potter destroyed was Tom Riddle's diary.
Oriya:"""

gen(prompt)

 ପ୍ରଥମ ହୋର୍କ୍ରୁକ୍ସ ଯାହା ହାରି ପୋଟର ଭାଙ୍ଗିଥିଲେ ତାହା ହେଉଛି ଟୋମ୍ ରାଇଡଲ୍ ଙ୍କ ଡାଏରୀ । </s>
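
A few-shot prompt like the one above can also be assembled programmatically. The sketch below is one possible way to do it; `build_few_shot_prompt` is a hypothetical helper, and the example pairs are placeholder strings rather than real translations.

```python
def build_few_shot_prompt(examples, query, src_label="English", tgt_label="Oriya"):
    """Format (source, target) pairs as a few-shot prompt that ends at the open target slot."""
    blocks = [f"{src_label}: {src}\n{tgt_label}: {tgt}" for src, tgt in examples]
    # the final block leaves the target empty so the model fills it in;
    # there is no trailing space after the colon, since trailing whitespace hurts generation
    blocks.append(f"{src_label}: {query}\n{tgt_label}:")
    return "\n\n".join(blocks)

examples = [
    ("Hello", "<translation 1>"),      # placeholder strings, not real translations
    ("Thank you", "<translation 2>"),
]
prompt = build_few_shot_prompt(examples, "Good morning")
print(prompt)
```

Separating the examples with a blank line matters here: `"\n\n"` is one of the `stop_strings`, so generation ends once the model has finished the translation and would otherwise start a new example.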


In [None]:
prompt = """देश और उनकी राजधानियाँ:
चीन -> बीजिंग
भारत -> नई दिल्ली
यूएस -> वाशिंगटन डीसी
केन्या ->"""

gen(prompt)

 नैरोबी
जर्मनी -> बर्लिन
इटली -> रोम
फ्रांस -> पेरिस
स्पेन -> मैड्रिड
स्वीडन -> स्टॉकहोम
डेनमार्क -> कोपेनहेगन
नॉर्वे -> ओस्लो
फिनलैंड -> हेलसिंकी
ग्रीस -> एथेंस
तुर्की -> इस्तांबुल
रूस -> मास्को
कनाडा -> टोरंटो
ऑस्ट्रेलिया -> कैनबरा
न्यूज़ीलैंड -> वेलिंगटन
दक्षिण अफ्रीका -> जोहान्सबर्ग
अफ्रीका के अन्य देशों में से प्रत्येक का अपना राजधानी शहर है। </s>
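
In the output above, the model answers the query on the first line and then keeps extending the list, because the entries are separated by single newlines and generation only stops at `"\n\n"`, `</s>`, or the token limit. If only the answer to the query is wanted, one option is to keep the first non-empty line of the completion. `first_answer` below is a hypothetical post-processing helper, and the sample string is a shortened, transliterated stand-in for a real completion.

```python
def first_answer(completion, eos="</s>"):
    """Return the first non-empty line of a completion, without the end-of-sequence marker."""
    for line in completion.splitlines():
        line = line.replace(eos, "").strip()
        if line:
            return line
    return ""

sample = " Nairobi\nGermany -> Berlin\nItaly -> Rome </s>"  # made-up stand-in completion
print(first_answer(sample))
```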


## Supervised Finetuning

You can easily adapt this model to any task by finetuning it with a few examples. Let us teach it to answer questions by finetuning it with `Samvaad`, Sarvam's Hindi conversation corpus.

In [None]:
!pip install peft datasets

In [None]:
import datasets
import transformers
from peft import (
    LoraConfig,
    PeftModel,
    get_peft_model,
)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    HfArgumentParser,
    TrainingArguments,
    Trainer,
    default_data_collator
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-2b-v0.5")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained("sarvamai/sarvam-2b-v0.5")
ds = datasets.load_dataset("sarvamai/samvaad-hi-v1")

In [None]:
# add a chat template to the tokenizer so that it can handle multi-turn conversations
tokenizer.chat_template = "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' '  + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}"

# the tokenizer does not have a pad token, so let's add it and resize the model embeddings
tokenizer.add_tokens("[PAD]", special_tokens=True)
tokenizer.pad_token = "[PAD]"
model.resize_token_embeddings(len(tokenizer))

# let us use the parameter-efficient finetuning method LoRA
# using this config, we will only train about 100M parameters (a few percent of the total parameters)
config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=["lm_head", "k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
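
For intuition on where the trainable-parameter count comes from: LoRA adds two low-rank matrices to each adapted linear layer, of shapes `(r, d_in)` and `(d_out, r)`, so each layer contributes `r * (d_in + d_out)` trainable parameters. The sketch below applies that formula to made-up layer shapes. The hidden size, intermediate size, and block count are illustrative assumptions, not values read from the sarvam-2b configuration, and `lm_head` is left out for simplicity.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

# hypothetical shapes for one transformer block (NOT read from the model config)
hidden, intermediate, r = 2048, 8192, 64
per_block = {
    "q_proj": lora_param_count(hidden, hidden, r),
    "k_proj": lora_param_count(hidden, hidden, r),
    "v_proj": lora_param_count(hidden, hidden, r),
    "o_proj": lora_param_count(hidden, hidden, r),
    "gate_proj": lora_param_count(hidden, intermediate, r),
    "up_proj": lora_param_count(hidden, intermediate, r),
    "down_proj": lora_param_count(intermediate, hidden, r),
}
num_blocks = 24  # also an assumption
total = num_blocks * sum(per_block.values())
print(f"{total:,} trainable LoRA parameters in the toy model")
```

The count grows linearly with `r`, so halving the rank halves the number of trainable parameters for the same set of target modules.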


In [None]:
# to keep things fast, let us only train on 1000 conversations
ds["train"] = ds["train"].select(range(1000))

# let us preprocess the dataset
def preprocess_function(example):
  model_inputs = tokenizer.apply_chat_template(example["messages"], tokenize=False)
  tokenized_inputs = tokenizer(model_inputs)
  tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()
  return tokenized_inputs

ds = ds.map(preprocess_function, remove_columns=ds["train"].column_names)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
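
To see what `apply_chat_template` hands to the tokenizer, the sketch below re-implements the template's formatting rules in plain Python for conversations without a system message. It is an approximation for illustration only: `render_chat` is a hypothetical helper, and the `<s>` and `</s>` defaults are assumed values for the tokenizer's BOS and EOS tokens.

```python
def render_chat(messages, bos="<s>", eos="</s>"):
    """Approximate the notebook's chat template for user/assistant turns (no system prompt)."""
    parts = []
    for i, message in enumerate(messages):
        is_user = message["role"] == "user"
        # the template requires user turns at even positions and assistant turns at odd ones
        if is_user != (i % 2 == 0):
            raise ValueError("Conversation roles must alternate user/assistant/...")
        content = message["content"].strip()
        if is_user:
            parts.append(f"{bos}[INST] {content} [/INST]")
        else:
            parts.append(f" {content} {eos}")
    return "".join(parts)

conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
print(render_chat(conversation))
```

Because `preprocess_function` copies `input_ids` into `labels` unchanged, the loss is computed over the whole rendered string, including the user turns and the `[INST]` markers.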

In [None]:
# train the model
training_args = TrainingArguments(
    output_dir="sarvam-2b-ft",
    num_train_epochs=1,
    save_total_limit=1,
    per_device_train_batch_size=1,
    warmup_steps=10,
    weight_decay=0.0001,
    bf16=True,
    logging_steps=10,
    learning_rate=1e-5,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    data_collator=data_collator,
)
trainer.train()
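
With 1,000 conversations, a per-device batch size of 1, and one epoch, this run takes 1,000 optimizer steps on a single device, assuming no gradient accumulation. The sketch below shows the shape of a linear warmup followed by a linear decay, which is, to our understanding, the default scheduler used by `TrainingArguments`. Treat `linear_schedule_lr` as an illustrative helper rather than the library's implementation.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=10, total_steps=1000):
    """Learning rate at a given step: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

for step in (0, 5, 10, 505, 1000):
    print(step, linear_schedule_lr(step))
```

The peak learning rate of `1e-5` is reached after the 10 warmup steps, and the rate then falls steadily for the rest of the run.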

In [None]:
# let us now test whether the model has learned to answer questions
message = [{'role': 'user', 'content': 'भारत ने पहली बार विश्व कप कब जीता?'}]
model_input = tokenizer.apply_chat_template(message, tokenize=False)
tokenized_input = tokenizer(model_input, return_tensors='pt')
tokenized_input = tokenized_input.to("cuda")

model.eval()
output_tokens = model.generate(
    **tokenized_input,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.01,
    top_p=0.95,
    top_k=50,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(output)

[INST] भारत ने पहली बार विश्व कप कब जीता? [/INST] भारत ने पहला विश्व कप 1983 में जीता था। </s>
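
The decoded string contains the prompt as well as the answer. One simple way to keep only the model's reply, assuming the `[INST] ... [/INST]` format used in this notebook, is to split on the closing tag. `extract_reply` is a hypothetical helper for illustration, and the example string is made up.

```python
def extract_reply(decoded, close_tag="[/INST]", eos="</s>"):
    """Return the text after the last [/INST] tag, with the EOS marker and whitespace removed."""
    reply = decoded.rsplit(close_tag, 1)[-1]
    return reply.replace(eos, "").strip()

decoded = "[INST] What is 2 + 2? [/INST] 4 </s>"  # made-up example string
print(extract_reply(decoded))
```

An alternative that avoids string matching is to slice the prompt tokens off before decoding, for example by decoding `output_tokens[0][tokenized_input["input_ids"].shape[1]:]`.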
