The task for the NPPE was fine-tunning a llama 3.1 8b instruct model, to perform sentimental analysis for senetences in different Indian languages. 
The following are the steps undertaken to make the model as efficent as possible.

Fist we download all the necessary packages

In [26]:
%%capture
!pip install bitsandbytes
!pip install accelerate
!pip install peft
!pip install --upgrade transformers

Now importing the packages

In [27]:
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    BitsAndBytesConfig,
    TextClassificationPipeline
)
from peft import LoraConfig, get_peft_model
import torch
import pandas as pd

Seeting the seed

In [28]:
torch.manual_seed(42)  
torch.cuda.manual_seed_all(42)  

Importing the LLama 3.1 model. The BitsAndBytesConfig is used for quantization in Hugging Face and enables a more efficient memory footprint during model inference and training. We also load the model.

In [29]:
model_name = "/kaggle/input/llama-3.1/transformers/8b-instruct/2"  
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token 

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,  
    llm_int8_enable_fp32_cpu_offload=True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    pad_token_id=tokenizer.eos_token_id,
    quantization_config=bnb_config,
    device_map="auto"  
)


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at /kaggle/input/llama-3.1/transformers/8b-instruct/2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


LoRA (Low-Rank Adaptation) is a technique used to efficiently fine-tune large pre-trained models like transformers, especially those with billions of parameters. The goal of LoRA is to provide a more memory-efficient way of adapting large models to specific tasks without needing to fine-tune the entire model, which can be computationally expensive and resource-intensive.

In [30]:
lora_config = LoraConfig(
    r=16,
    target_modules=["q_proj", "v_proj"],
    inference_mode=False,
    lora_alpha=32,
    lora_dropout=0.05
)
lora_model = get_peft_model(model, lora_config)

Preparing the prompt and data

In [32]:
lora_model = get_peft_model(model, lora_config)

prompt = """Below is a sentence along with its language. Predict the sentiment as 1 for Positive or 0 for Negative.

### Sentence:
{}

### Language:
{}

### Output (0 for Negative, 1 for Positive):
"""


EOS_TOKEN = tokenizer.eos_token  # Ensure EOS_TOKEN is added


df = pd.read_csv('/kaggle/input/multi-lingual-sentiment-analysis/train.csv')

label_mapping = {"Positive": 1, "Negative": 0}
df["label"] = df["label"].map(label_mapping)

dataset = Dataset.from_pandas(df)

def formatting_prompts_func(examples):
    return {"text": [prompt.format(s, l) + EOS_TOKEN for s, l in zip(examples["sentence"], examples["language"])]}

dataset = dataset.map(formatting_prompts_func, batched=True)

dataset = dataset.remove_columns(["ID", "sentence", "language"])  

dataset = dataset.map(lambda x: {"labels": x["label"]}, batched=True)

def tokenize_function(examples):
    return tokenizer(
        examples["text"], 
        padding="max_length",  
        truncation=True, 
        max_length=256
    )

dataset = dataset.map(tokenize_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Spliting and training the model

In [33]:
split = dataset.train_test_split(test_size=0.2)

training_args = TrainingArguments( output_dir='lora_llama_1b_ct',
                                  eval_strategy="steps",
                                  eval_steps=100,
                                  num_train_epochs=1,
                                  per_device_train_batch_size=3,
                                  per_device_eval_batch_size=3,
                                  bf16=False,
                                  fp16=True,
                                  tf32=False,
                                  gradient_accumulation_steps=1,
                                  adam_beta1=0.9,
                                  adam_beta2=0.999,
                                  learning_rate=2e-5,
                                  weight_decay=0.01,
                                  logging_dir='logs',
                                  logging_strategy="steps",
                                  logging_steps = 100,
                                  save_steps=100,
                                  save_total_limit=20,
                                  report_to='none',
                                  label_names=["labels"]
                                )

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    data_collator=data_collator
)

trainer.train()

  trainer = Trainer(


Step,Training Loss,Validation Loss
100,0.7165,0.757425
200,0.666,0.843714


TrainOutput(global_step=267, training_loss=0.7430036023314972, metrics={'train_runtime': 506.684, 'train_samples_per_second': 1.579, 'train_steps_per_second': 0.527, 'total_flos': 8584903104921600.0, 'train_loss': 0.7430036023314972, 'epoch': 1.0})

Creating the classification Pipeline 

In [34]:
from transformers import TextClassificationPipeline


lora_model.config.id2label = {0: "Negative", 1: "Positive"}
lora_model.config.label2id = {"Negative": 0, "Positive": 1}

classifier = TextClassificationPipeline(
    model=lora_model, 
    tokenizer=tokenizer,
    framework='pt',
    task="sentiment-analysis")


Device set to use cuda:0
The model 'PeftModel' is not supported for sentiment-analysis. Supported models are ['AlbertForSequenceClassification', 'BartForSequenceClassification', 'BertForSequenceClassification', 'BigBirdForSequenceClassification', 'BigBirdPegasusForSequenceClassification', 'BioGptForSequenceClassification', 'BloomForSequenceClassification', 'CamembertForSequenceClassification', 'CanineForSequenceClassification', 'LlamaForSequenceClassification', 'ConvBertForSequenceClassification', 'CTRLForSequenceClassification', 'Data2VecTextForSequenceClassification', 'DebertaForSequenceClassification', 'DebertaV2ForSequenceClassification', 'DiffLlamaForSequenceClassification', 'DistilBertForSequenceClassification', 'ElectraForSequenceClassification', 'ErnieForSequenceClassification', 'ErnieMForSequenceClassification', 'EsmForSequenceClassification', 'FalconForSequenceClassification', 'FlaubertForSequenceClassification', 'FNetForSequenceClassification', 'FunnelForSequenceClassificati

Reading the test dataset, preparing for prediction and saving the prediction in a csv.

In [35]:
test_df = pd.read_csv('/kaggle/input/multi-lingual-sentiment-analysis/test.csv')

test_dataset = Dataset.from_pandas(test_df)

test_dataset = test_dataset.map(formatting_prompts_func, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

test_texts = test_dataset['text']
predictions = classifier(test_texts, truncation=True, padding=True, batch_size=8)

test_results = pd.DataFrame({"label": [p['label'] for p in predictions]})
test_results['ID'] = test_dataset["ID"]


submission = test_results[['ID', 'label']]
submission.to_csv("submission.csv", index=False)


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]