## Sample Inference Script for Llama3_8B_Odia LLM
## Author: Shantipriya Parida

In [1]:
%%capture
# Install Unsloth (the %%capture cell magic must be the first line of the cell)
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Newer GPUs with compute capability 8.0+ (Ampere, Ada, Hopper: RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass
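
The branch on `major_version` above and the `dtype` comment in the next cell follow the same rule of thumb: compute capability 8 or higher gets the extra flash-attention packages and bfloat16, anything older gets float16. A minimal sketch of that rule as a plain function; `plan_for_gpu` is a hypothetical helper written for illustration, and it is not how Unsloth itself detects the dtype.

```python
def plan_for_gpu(major: int) -> dict:
    """Map a CUDA compute-capability major version to a dtype name and package list.

    Hypothetical helper: it only mirrors the comments in this notebook.
    """
    ampere_or_newer = major >= 8
    # Packages installed on every GPU in the cell above.
    extras = ["xformers", "trl", "peft", "accelerate", "bitsandbytes"]
    if ampere_or_newer:
        # Extra build dependencies and flash-attn, installed only on newer GPUs.
        extras = ["packaging", "ninja", "einops", "flash-attn"] + extras
    dtype = "bfloat16" if ampere_or_newer else "float16"
    return {"dtype": dtype, "extras": extras}


# A Tesla T4 reports compute capability 7.5, so the major version is 7.
print(plan_for_gpu(7))
```

In the notebook itself, `dtype = None` leaves this choice to the library, so the helper is only useful for reasoning about what to expect on a given GPU.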

In [2]:
#Load the Hugging Face Model
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "OdiaGenAI-LLM/Llama3_8B_Odia_Unsloth",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

#Define Alpaca style prompt
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


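The inference cells in this notebook fill the same three slots of `alpaca_prompt` each time and always leave the last one empty. A minimal sketch of a wrapper for that pattern; `build_prompt` is a hypothetical name, and the template is copied from the cell above only so the block runs on its own.

```python
# Template copied from the cell above so this sketch is self-contained.
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


def build_prompt(instruction: str, input_text: str = "") -> str:
    """Fill the instruction and input slots; keep the response slot empty for generation."""
    return ALPACA_PROMPT.format(instruction, input_text, "")


prompt = build_prompt("Where is Puri")
print(prompt)
```

The resulting string can be passed to `tokenizer([prompt], return_tensors = "pt")` in place of the inline `alpaca_prompt.format(...)` call.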
In [3]:
#Inference
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        "କୋଭିଡ୍ 19 ର ଲକ୍ଷଣଗୁଡ଼ିକ କ’ଣ?", # instruction (Odia: "What are the symptoms of COVID-19?")
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nକୋଭିଡ୍ 19 ର ଲକ୍ଷଣଗୁଡ଼ିକ କ’ଣ?\n\n### Input:\n\n\n### Response:\nକୋଭିଡ-19 ର ଲକ୍ଷଣଗୁଡ଼ିକ ମଧ୍ୟରେ ଜ୍ୱର, କାଶ, ନିଶ୍ୱାସ ପ୍ରଶ୍ୱାସ ନେବାରେ କଷ୍ଟ ଏବଂ ଗଳା ଯନ୍ତ୍ରଣା ହେବା ସାମିଲ ରହିଛି। ତେବେ ଏହା ଧ୍ୟାନରେ ରଖିବା ଗୁରୁତ୍ୱପୂର୍ଣ୍ଣ ଯେ କୋଭିଡ-19 ର ଲକ୍ଷଣ ବ୍ୟକ୍ତିର ବୟସ, ସ୍ୱାସ୍ଥ୍ୟ ସ୍ଥିତି ଏବଂ ଅନ୍ୟ କୌଣସି ସ୍ୱାସ୍ଥ�']
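
The decoded string above echoes the whole prompt before the answer and ends in a replacement character (`�`), which likely means generation stopped partway through a multi-byte character. A minimal sketch of pulling out just the answer; `extract_response` is a hypothetical helper, the special-token strings are assumed from the Llama 3 tokenizer, and the sample string is made up for illustration.

```python
RESPONSE_MARKER = "### Response:\n"
# Assumed special-token strings; adjust if the tokenizer uses different ones.
SPECIAL_TOKENS = ("<|begin_of_text|>", "<|end_of_text|>")


def extract_response(decoded: str) -> str:
    """Return only the text after the last response marker.

    If the marker is missing, the whole string is treated as the response.
    """
    # Everything before the marker is the echoed prompt.
    _, _, response = decoded.rpartition(RESPONSE_MARKER)
    for token in SPECIAL_TOKENS:
        response = response.replace(token, "")
    # Drop a trailing U+FFFD left by a character that was cut off mid-sequence.
    return response.rstrip("\ufffd").strip()


# Made-up sample in the same shape as the notebook output.
sample = (
    "<|begin_of_text|>Below is an instruction...\n\n"
    "### Instruction:\nWhere is Puri\n\n### Input:\n\n\n"
    "### Response:\nPuri is a city in Odisha.<|end_of_text|>"
)
print(extract_response(sample))
```

In the notebook this would be applied as `extract_response(tokenizer.batch_decode(outputs)[0])`. Passing `skip_special_tokens=True` to `batch_decode` is another way to drop the special tokens, though the prompt text would still need to be split off.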

In [4]:
#Inference
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Write a poem about Odisha", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWrite a poem about Odisha\n\n### Input:\n\n\n### Response:\nଓଡ଼ିଶା, ଏକ ରାଜ୍ୟ ଯାହା ପ୍ରାଚୀନ କାଳରୁ ରହିଆସିଛି,\nଏହାର ପ୍ରାଚୀର ଏବଂ ପାହାଡ଼ ଏତେ ଉଚ୍ଚ ଯେ ଏହା ଆକାଶକୁ ଛୁଇଁଥାଏ।\nଏହାର ମହାନ ପର୍ବତମାଳା, ଯାହା ଏହାର ରାସ୍ତାରେ ପ୍ରସାରିତ ହୋଇଥାଏ,\nଏହା ଏକ ଚମତ୍କାର ଦୃଶ୍ୟ, ଏକ ଚମତ୍କାର ଦୃଶ୍ୟ।\nଏହାର ଉପକୂଳ, ଯ�']

In [5]:
#Inference
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Where is Puri", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhere is Puri\n\n### Input:\n\n\n### Response:\nପୁରୀ ହେଉଛି ଭାରତର ଓଡ଼ିଶା ରାଜ୍ୟର ଏକ ସହର। ଏହା ଭାରତର ପୂର୍ବ ଉପକୂଳରେ ଅବସ୍ଥିତ ଏବଂ ଏହା ଭାରତର ପୂର୍ବ ଉପକୂଳରେ ଅବସ୍ଥିତ। ଏହା ଭାରତର ପୂର୍ବ ଉପକୂଳରେ ଅବସ୍ଥିତ। ଏହା ଭାରତର ପୂର୍ବ ଉପକୂଳରେ ଅବସ୍ଥିତ। ଏହା ଭାରତର ପୂର୍ବ ଉପକୂଳରେ ଅବସ୍ଥିତ। ଏହା ଭାର']
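
The response above repeats the same sentence until the token budget runs out. A minimal sketch of a post-processing step that collapses consecutive duplicate sentences; `collapse_repeats` is a hypothetical helper, it splits on the danda (`।`) by default, and the English sample is made up for illustration.

```python
def collapse_repeats(text: str, sep: str = "।") -> list[str]:
    """Split text into sentences and drop consecutive exact duplicates."""
    kept: list[str] = []
    for sentence in text.split(sep):
        sentence = sentence.strip()
        # Skip empty fragments and sentences identical to the one just kept.
        if sentence and (not kept or sentence != kept[-1]):
            kept.append(sentence)
    return kept


# Made-up English sample with "." as the separator, mimicking the loop above.
sample = "Puri is a city. It is on the east coast. It is on the east coast. It is on"
print(collapse_repeats(sample, sep="."))
```

This only hides the symptom after the fact. On the generation side, `model.generate` accepts `repetition_penalty` and `no_repeat_ngram_size`, which may reduce such loops; suitable values for this model are not given in the notebook and would need experimentation.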