# Fine-Tuning Llama for Code Generation
Members: Ved Kokane

## 1. Setup

#### Downloading packages
The packages to install are listed in a separate `requirements.txt` file to keep the notebook clean.

In [2]:
!pip install -r requirements.txt

Collecting accelerate@ git+https://github.com/huggingface/accelerate.git (from -r requirements.txt (line 2))
  Cloning https://github.com/huggingface/accelerate.git to /tmp/pip-install-o2x3u2n5/accelerate_9247f868814a4be4a73893aab6c6eb1c
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate.git /tmp/pip-install-o2x3u2n5/accelerate_9247f868814a4be4a73893aab6c6eb1c
  Resolved https://github.com/huggingface/accelerate.git to commit 9964f90fd7d50577998a22f3dba8590e644d255b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting transformers@ git+https://github.com/huggingface/transformers.git (from -r requirements.txt (line 5))
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-install-o2x3u2n5/transformers_66a41eb07f2846ac941da863d83ee65b
  Running command git clone --filter=blob:none --quiet https://git

#### Logging into Huggingface

In [3]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.ca

#### Importing libraries

In [4]:
import argparse
import bitsandbytes as bnb
from datasets import load_dataset
from functools import partial
import os
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed, Trainer, TrainingArguments, BitsAndBytesConfig, DataCollatorForLanguageModeling
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [5]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Checking if GPU is active

In [6]:
torch.cuda.is_available()

True

In [7]:
torch.cuda.current_device()

0

#### Importing Dataset

In [8]:
from datasets import load_dataset

dataset = load_dataset("neulab/conala",'curated', split='train')



## 2. Basic EDA

In [9]:
dataset

Dataset({
    features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
    num_rows: 2379
})

#### Creating DataFrame and exploring Dataset

In [10]:
data = pd.DataFrame(dataset)

In [11]:
data.shape

(2379, 4)

In [12]:
data.head()

Unnamed: 0,question_id,intent,rewritten_intent,snippet
0,41067960,How to convert a list of multiple integers int...,Concatenate elements of a list 'x' of multiple...,"sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
1,41067960,How to convert a list of multiple integers int...,convert a list of integers into a single integer,"r = int(''.join(map(str, x)))"
2,4170655,how to convert a datetime string back to datet...,convert a DateTime string back to a DateTime o...,datetime.strptime('2010-11-13 10:33:54.227806'...
3,29565452,Averaging the values in a dictionary based on ...,get the average of a list values for each key ...,"[(i, sum(j) / len(j)) for i, j in list(d.items..."
4,13704860,zip lists in python,"zip two lists `[1, 2]` and `[3, 4]` into a lis...","zip([1, 2], [3, 4])"


In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2379 entries, 0 to 2378
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   question_id       2379 non-null   int64 
 1   intent            2379 non-null   object
 2   rewritten_intent  2300 non-null   object
 3   snippet           2379 non-null   object
dtypes: int64(1), object(3)
memory usage: 74.5+ KB


## 3. Basic Preprocessing

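We first replace empty string values with null values and drop the rows that contain them (79 rows have no `rewritten_intent`, which leaves 2300).
Next we compute the character length of each intent.
We then word-tokenize the intent, filter out the stopwords, and store the result in a new column.
The filter on the `prob` column is left commented out, since the curated split loaded above has no `prob` column.
Finally we keep only the columns relevant to our problem.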

In [14]:
def basic_preprocess(data):

  # Treat blank strings as missing, then drop every row that has a missing value.
  data = data.replace(r'^\s*$', np.nan, regex=True)
  data = data.dropna().reset_index(drop=True)

  # Character length of the intent, and its tokens with English stopwords removed.
  data['intent_length'] = data['intent'].apply(len)
  data['tokens'] = data['intent'].apply(lambda sentence: [word for word in word_tokenize(sentence) if word.lower() not in stop_words])

  # The curated split has no 'prob' column, so this filter stays disabled.
  # data = data[data['prob']>=0.15]
  data = data[['intent','rewritten_intent','snippet','intent_length','tokens']]

  return data

In [15]:
preprocessed_data = basic_preprocess(data)

In [16]:
preprocessed_data.head()

Unnamed: 0,intent,rewritten_intent,snippet,intent_length,tokens
0,How to convert a list of multiple integers int...,Concatenate elements of a list 'x' of multiple...,"sum(d * 10 ** i for i, d in enumerate(x[::-1]))",65,"[convert, list, multiple, integers, single, in..."
1,How to convert a list of multiple integers int...,convert a list of integers into a single integer,"r = int(''.join(map(str, x)))",65,"[convert, list, multiple, integers, single, in..."
2,how to convert a datetime string back to datet...,convert a DateTime string back to a DateTime o...,datetime.strptime('2010-11-13 10:33:54.227806'...,57,"[convert, datetime, string, back, datetime, ob..."
3,Averaging the values in a dictionary based on ...,get the average of a list values for each key ...,"[(i, sum(j) / len(j)) for i, j in list(d.items...",53,"[Averaging, values, dictionary, based, key]"
4,zip lists in python,"zip two lists `[1, 2]` and `[3, 4]` into a lis...","zip([1, 2], [3, 4])",19,"[zip, lists, python]"


In [17]:
preprocessed_data.shape

(2300, 5)

## 4. Dataset Preprocessing and Feature Engineering
Llama 2 7B has 7 billion parameters, so fine-tuning it normally requires a lot of computational power. To make this feasible for free on a single GPU, we use Quantized Low-Rank Adaptation (QLoRA): the base model is loaded in 4-bit precision with bitsandbytes, and the PEFT (Parameter-Efficient Fine-Tuning) library trains small LoRA adapter matrices on top of the frozen weights.
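To make the motivation concrete, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is only a sketch: it assumes exactly 7 billion parameters, counts 1 GB as 1e9 bytes, and ignores activations, optimizer state, and the small overhead that quantization constants add.

```python
# Rough weight-memory estimate; ignores activations, optimizer state,
# and quantization overhead (assumptions stated in the text above).
N_PARAMS = 7_000_000_000  # "7 billion parameters"


def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Return the storage needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take a quarter of the space of 16-bit weights, which is what lets the model fit on a single GPU.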



#### Creating Word Vectors
We first fit a TF-IDF vectorizer on the intents and extract the vectors. On inspection, the resulting matrix was largely sparse, with document-frequency values below 0.1. We therefore apply TruncatedSVD for dimensionality reduction.



In [18]:
def feature_extraction(data):

    vec = TfidfVectorizer(strip_accents='unicode', stop_words='english', ngram_range=(1,3))
    vectors = vec.fit_transform(data)

    svd = TruncatedSVD(n_components=100)
    tfidf_matrix_reduced = svd.fit_transform(vectors)

    # Label each SVD component with its highest-weighted term.
    # Kept for inspection only; the reduced columns stay numbered 0..99.
    feature_names = vec.get_feature_names_out()
    selected_feature_names = [feature_names[i] for i in svd.components_.argmax(axis=1)]

    vector_df = pd.DataFrame(tfidf_matrix_reduced)

    return vector_df

In [19]:
vector_df = feature_extraction(preprocessed_data['intent'])

In [20]:
vector_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.115234,-0.072894,-0.020953,0.063086,-0.064805,-0.008653,0.067428,-0.071961,-0.101381,0.0219,...,0.0053,-0.026655,0.048575,0.028241,0.007071,0.035541,-0.011832,-0.012845,-0.046239,-0.023049
1,0.115234,-0.072894,-0.020953,0.063086,-0.064805,-0.008653,0.067428,-0.071961,-0.101381,0.0219,...,0.0053,-0.026655,0.048575,0.028241,0.007071,0.035541,-0.011832,-0.012845,-0.046239,-0.023049
2,0.129822,0.103262,0.007438,-0.002048,-0.005183,-0.05145,0.051748,-0.051611,-0.124212,0.01921,...,-0.108158,-0.029511,0.033283,0.004441,-0.019434,0.042214,-0.057031,0.021576,0.03969,-0.027383
3,0.045495,-0.036494,0.083221,0.01977,0.058961,-0.10278,-0.059004,-0.069194,0.057388,0.003837,...,0.011911,-0.021403,0.007422,-0.006128,-0.000989,0.010577,0.052495,0.009103,0.056043,0.01323
4,0.076266,-0.057203,0.006968,0.001347,0.004858,-0.070055,-0.033758,0.042546,0.019079,-0.030543,...,-0.023497,0.098743,0.005202,0.05086,-0.056703,-0.005938,-0.015144,-0.07041,-0.025622,-0.00868
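The sparsity observation in this section can be checked with a few lines of plain Python. The sketch below is illustrative only: it swaps in whitespace tokenization for `TfidfVectorizer`, works on four toy intents rather than the full dataset, and the helper names `document_frequencies` and `density` are hypothetical.

```python
from collections import Counter


def document_frequencies(docs):
    """Fraction of documents in which each lowercase whitespace token appears."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return {term: count / len(docs) for term, count in df.items()}


def density(docs):
    """Share of non-zero cells in the document-term count matrix."""
    vocab = {token for doc in docs for token in doc.lower().split()}
    nonzero = sum(len(set(doc.lower().split())) for doc in docs)
    return nonzero / (len(docs) * len(vocab))


# Toy intents, for illustration only.
docs = [
    "zip lists in python",
    "convert a list of integers into a single integer",
    "convert a datetime string back to datetime object",
    "averaging the values in a dictionary based on the key",
]

df = document_frequencies(docs)
print(f"document frequency of 'convert': {df['convert']:.2f}")
print(f"matrix density: {density(docs):.3f}")
```

With thousands of short intents and n-grams up to length three, most terms occur in very few documents, so the density drops much further than in this toy case.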


#### Creating config for bitsandbytes
To implement QLoRA, we need a bitsandbytes config that loads the model in 4-bit precision.

In [21]:
def create_bnb_config():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    return bnb_config

#### Using Parameter-Efficient Fine-Tuning (PEFT) for Low-Rank Adaptation (LoRA)

In [22]:
def create_peft_config(modules):

    lr_config = LoraConfig(
        r=16,
        lora_alpha=64,
        target_modules=modules,
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
    )

    return lr_config
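As a rough guide to what `r=16` and `lora_alpha=64` imply, the sketch below counts the adapter parameters. It assumes the adapters are attached to the four attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`), each 4096 x 4096, in all 32 decoder layers, as this notebook does later; the helper `lora_params` is a hypothetical name.

```python
# Values taken from this notebook: r=16, lora_alpha=64, four attention
# projections of shape 4096 x 4096, and 32 decoder layers.
R = 16
LORA_ALPHA = 64
IN_FEATURES = OUT_FEATURES = 4096
TARGETS_PER_LAYER = 4  # q_proj, k_proj, v_proj, o_proj
N_LAYERS = 32


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) beside a frozen weight."""
    return r * in_features + out_features * r


per_module = lora_params(IN_FEATURES, OUT_FEATURES, R)
total = per_module * TARGETS_PER_LAYER * N_LAYERS
frozen = IN_FEATURES * OUT_FEATURES * TARGETS_PER_LAYER * N_LAYERS
scaling = LORA_ALPHA / R  # the adapter update is multiplied by alpha / r

print(f"adapter parameters per module: {per_module:,}")
print(f"adapter parameters in total:   {total:,}")
print(f"share of the targeted weights: {total / frozen:.2%}")
print(f"LoRA scaling factor:           {scaling}")
```

Under these assumptions only about 16.8 million parameters are trained, which is under 1% of the weights in the targeted projections.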

In [23]:
def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit  # the model is loaded in 4-bit, so its linear layers are Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

#### Creating Prompts
The prompts must follow a specific, well-defined format to prompt Llama, so we process the dataset to convert it to the required format.

In [24]:
def create_prompt_formats(sample):
    """Build the training prompt for one row and store it in sample["text"]."""
    INTRO_BLURB = "Below is an instruction. Give a code that appropriately completes the request in the language: "
    INSTRUCTION_KEY = "### Instruction:"
    INPUT_KEY = "Input:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"

    blurb = f"{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}\n{sample['rewritten_intent']}"
    # The CoNaLa rows have no 'context' column, so this branch only matters for other datasets.
    if 'context' in sample.index and sample['context']:
        input_context = f"{INPUT_KEY}\n{sample['context']}"
    else:
        input_context = None
    response = f"{RESPONSE_KEY}\n{sample['snippet']}"
    end = f"{END_KEY}"

    # Drop the empty parts and join the rest with blank lines.
    parts = [part for part in [blurb, instruction, input_context, response, end] if part]
    sample["text"] = "\n\n".join(parts)

    return sample

#### Mapping prompts

Function for finding the maximum sequence length supported by the model



In [25]:
def set_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(conf, length_setting, None)
        if max_length:
            print(f"Found max lenth: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length

Function for batch tokenization of input

In [26]:
def preprocess_batch(batch, tokenizer, max_length):

    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )

In [27]:
def create_prompt(tokenizer: AutoTokenizer, max_length: int, seed, data):
    # tokenizer and max_length are unused here; the text is tokenized later by the trainer.
    data = data.apply(create_prompt_formats, axis=1)
    data = data.drop(columns=["intent","tokens","intent_length"])

    # Shuffle the rows reproducibly and return them as a list of dicts.
    dataset = data.sample(frac=1, random_state=seed)
    dataset = dataset.to_dict(orient='records')
    return dataset

## Loading Llama 2 7b Model
We load the Llama 2 7b hf model using the bitsandbytes 4-bit config we defined earlier.



In [28]:
def load_model(model_name, bnb_config):
    n_gpus = torch.cuda.device_count()
    max_memory = f'{40960}MB'

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        max_memory = {i: max_memory for i in range(n_gpus)},
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer

In [29]:
model_name = "meta-llama/Llama-2-7b-hf"
bnb_config = create_bnb_config()
model, tokenizer = load_model(model_name, bnb_config)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



#### Processing data

In [30]:
max_length = set_max_length(model)
print(max_length)
dataset = create_prompt(tokenizer, max_length, 0, preprocessed_data)

Found max lenth: 4096
4096


In [31]:
len(dataset)

2300

In [32]:
from sklearn.model_selection import train_test_split

items = [list(sample.items()) for sample in dataset]
train_items, test_items = train_test_split(items, test_size=0.2, random_state=0)

In [33]:
type(train_items)

list

In [34]:
train_data = [{key: value for key, value in train_items[i]} for i in range(len(train_items))]
validation_data = [{key: value for key, value in test_items[i]} for i in range(len(test_items))]

In [35]:
from datasets import Dataset

train_df = pd.DataFrame(data=train_data, index=range(len(train_data)))
validation_df = pd.DataFrame(data=validation_data, index=range(len(validation_data)))

train_data = Dataset.from_pandas(train_df)
validation_data = Dataset.from_pandas(validation_df)

# train_dataset = train_dataset.set_index('index')
# validation_dataset = validation_dataset.set_index('index')

In [36]:
len(train_data),len(validation_data)

(1840, 460)

## Zero-Shot Inference

Before fine-tuning, we compare the human-written reference snippet with the base model's zero-shot output.

In [37]:
index = 0

language = dataset[index]['rewritten_intent']
code = dataset[index]['snippet']

prompt = f"""
Below is an instruction. Give a code that appropriately completes the request in the langauge:

### Input:
{language}

### Code:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=100,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{code}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')



---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Below is an instruction. Give a code that appropriately completes the request in the langauge:

### Input:
Concat a list of strings `lst` using string formatting

### Code:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
"""""".join(lst)

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:

Below is an instruction. Give a code that appropriately completes the request in the langauge:

### Input:
Concat a list of strings `lst` using string formatting

### Code:
```python
def str_concatenation(lst):
    result = ''
    for i in lst:
        result += i
    return result
```

### Output:
```python
'a'
'b'
'c'
'd'
'e'
```

### Explanation:
```python
str_concatenation(['a', 'b', 'c', 'd', 'e'])


## Training Model

In [38]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [39]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRM

In [40]:
OUTPUT_DIR = "llama2-docsum-adapter"

%load_ext tensorboard
%tensorboard --logdir llama2-docsum-adapter/runs

<IPython.core.display.Javascript object>

In [41]:
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    logging_steps=1,
    learning_rate=1e-4,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=2,
    evaluation_strategy="steps",
    eval_steps=0.2,
    warmup_ratio=0.05,
    save_strategy="epoch",
    group_by_length=True,
    output_dir=OUTPUT_DIR,
    report_to="tensorboard",
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=42,
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!


In [42]:
lora_config = create_peft_config(['q_proj','k_proj','v_proj','o_proj'])

In [43]:
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=validation_data,
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=training_arguments,
)

Map:   0%|          | 0/1840 [00:00<?, ? examples/s]

Map:   0%|          | 0/460 [00:00<?, ? examples/s]

In [44]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss
46,0.6367,0.671881
92,0.6712,0.64658
138,0.558,0.631414
184,0.6373,0.625671
230,0.4704,0.624073




TrainOutput(global_step=230, training_loss=0.7595583485520404, metrics={'train_runtime': 2371.4416, 'train_samples_per_second': 1.552, 'train_steps_per_second': 0.097, 'total_flos': 1.1283368446328832e+16, 'train_loss': 0.7595583485520404, 'epoch': 2.0})
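The step numbers in the log above can be reproduced by hand from the training arguments. The sketch below is a reading of the logged numbers, not a statement about the trainer's internals: it assumes a single GPU and that leftover batches are dropped by integer division when grouping into accumulation steps.

```python
import math

TRAIN_EXAMPLES = 1840   # size of the training split
PER_DEVICE_BATCH = 4    # per_device_train_batch_size
GRAD_ACCUM_STEPS = 4    # gradient_accumulation_steps
EPOCHS = 2              # num_train_epochs
EVAL_FRACTION = 0.2     # eval_steps given as a fraction of the total steps
N_DEVICES = 1           # assumption: a single GPU

# Examples consumed per optimizer update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * N_DEVICES

# Mini-batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(TRAIN_EXAMPLES / (PER_DEVICE_BATCH * N_DEVICES))
updates_per_epoch = batches_per_epoch // GRAD_ACCUM_STEPS

total_steps = updates_per_epoch * EPOCHS
eval_every = math.ceil(total_steps * EVAL_FRACTION)
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(f"effective batch size: {effective_batch}")
print(f"total optimizer steps: {total_steps}")
print(f"evaluation at steps: {eval_points}")
```

This gives 230 optimizer steps with an evaluation every 46 steps, matching `global_step=230` and the step column of the loss table.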

In [45]:
peft_model_path="./peft-language-code"

trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

('./peft-language-code/tokenizer_config.json',
 './peft-language-code/special_tokens_map.json',
 './peft-language-code/tokenizer.json')

## Inference & Testing

In [46]:
from transformers import TextStreamer
model.config.use_cache = True
model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=16, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=16, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
          (k_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
   

In [47]:
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

peft_model_dir = "peft-language-code"

# load the base model in 4-bit together with the trained LoRA adapter, plus the tokenizer
trained_model = AutoPeftModelForCausalLM.from_pretrained(
    peft_model_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_dir)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [53]:
index = 0

language = dataset[index]['rewritten_intent']
code = dataset[index]['snippet']

prompt = f"""
Summarize the following conversation.

### Input:
{language}

### Summary:
"""

input_ids = tokenizer(prompt, return_tensors='pt',truncation=True).input_ids.cuda()
outputs = trained_model.generate(input_ids=input_ids, max_new_tokens=100)
# Decode and slice off the echoed prompt so only the newly generated text remains.
output = tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN CODE:\n{code}\n')
print(dash_line)
print(f'TRAINED MODEL GENERATED CODE :\n{output}')


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

### Input:
Concat a list of strings `lst` using string formatting

### Summary:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN CODE:
"""""".join(lst)

---------------------------------------------------------------------------------------------------
TRAINED MODEL GENERATED CODE :
print('{:<10}'.format(''.join(lst)))

### End:
print('{:<10}'.format(''.join(lst)))

### End:

### End:

### End:

### End:

### End:

### End:

### End:

### End:

### End:

### End:

##
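Two things stand out in the run above: the prompt uses a summarization wording with `### Input:` and `### Summary:` rather than the `### Instruction:` and `### Response:` layout used for training, and the generation keeps going past the end marker. A minimal sketch of one way to address both follows. The helper names `build_inference_prompt` and `trim_at_end_marker` are hypothetical, and the intro text is passed in as an argument so that it can be copied verbatim from the training cell.

```python
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"


def build_inference_prompt(intro: str, intent: str) -> str:
    """Training layout minus the answer: intro, instruction, open response header."""
    return "\n\n".join([intro, f"{INSTRUCTION_KEY}\n{intent}", f"{RESPONSE_KEY}\n"])


def trim_at_end_marker(generated: str) -> str:
    """Keep only the text before the first end marker."""
    return generated.split(END_KEY, 1)[0].strip()


# Placeholder intro; replace with the exact intro string used for training.
intro = "Below is an instruction."
prompt = build_inference_prompt(
    intro, "Concat a list of strings `lst` using string formatting"
)

# Start of the generation recorded above, used here as sample input.
raw = (
    "print('{:<10}'.format(''.join(lst)))\n\n### End:\n"
    "print('{:<10}'.format(''.join(lst)))\n\n### End:"
)

print(prompt)
print(trim_at_end_marker(raw))
```

Trimming is only a post-processing step; whether a matching prompt also improves the generated code would need to be checked by rerunning the inference cell.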


In [56]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [59]:
!zip -r /content/peft-language-code.zip /content/peft-language-code

  adding: content/peft-language-code/ (stored 0%)
  adding: content/peft-language-code/tokenizer.json (deflated 74%)
  adding: content/peft-language-code/special_tokens_map.json (deflated 73%)
  adding: content/peft-language-code/adapter_model.safetensors (deflated 7%)
  adding: content/peft-language-code/README.md (deflated 66%)
  adding: content/peft-language-code/tokenizer_config.json (deflated 68%)
  adding: content/peft-language-code/adapter_config.json (deflated 50%)



## Personal Contribution and Notes

Since this is a solo project, I divided my time across the different modules of the application. The complete functionality is ready: preprocessing, feature engineering, and fine-tuning of the model have all been completed.