# Fine-tune and quantize an LLM in Google Colab using QLoRA



In [None]:
!pip install  accelerate peft bitsandbytes transformers trl triton

Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting trl
  Downloading trl-0.12.1-py3-none-any.whl.metadata (10 kB)
Collecting triton
  Downloading triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Collecting datasets>=2.21.0 (from trl)
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets>=2.21.0->trl)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets>=2.21.0->trl)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets>=2.21.0->trl)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets>=2.21.0->trl)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading bitsandbyte

In [None]:
# GPU build (CUDA): CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
!pip install llama-cpp-python  # CPU-only build

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.2.tar.gz (65.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65.0/65.0 MB 9.2 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 3.7 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.3.2-cp310-cp310-linux_x86_64.whl size=3448428 sha256=e7f8906288f2d75599e88

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

## XLSX TO CSV

In [None]:
import pandas as pd

# Read the XLSX file
xlsx_file = pd.read_excel('/content/training.xlsx')

# Convert XLSX to CSV
xlsx_file.to_csv('output.csv', index=False)

In [None]:
# @title prepare data

input_prompt = """Below is a Human Input, write appropriate Response based on the input.

### Input:
{}

### Response:
{}"""
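The template has two slots: the first receives the question and the second the answer. A minimal sketch of how the same template serves both training and inference follows; the `EOS` string is an assumption standing in for `tokenizer.eos_token` (the training output later in this notebook shows `<|eot_id|>`), and the sample question and answer are made up.

```python
# Same template as the `input_prompt` defined in the cell above.
input_prompt = """Below is a Human Input, write appropriate Response based on the input.

### Input:
{}

### Response:
{}"""

# Assumption: stands in for tokenizer.eos_token, which supplies the real value.
EOS = "<|eot_id|>"

# Training example: both slots filled, EOS appended so the model learns where to stop.
train_text = input_prompt.format("What is Nadi?", "Nadi is an offline AI application.") + EOS

# Inference prompt: the response slot is left empty for the model to complete.
infer_text = input_prompt.format("What is Nadi?", "")

print(train_text)
print(infer_text.endswith("### Response:\n"))
```

Leaving the second slot empty means the prompt ends right after `### Response:`, which is where generation picks up.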


## Detailed Explanation of Fine-Tuning Parameters:

The next code cell defines the parameters for fine-tuning a pre-trained model with Low-Rank Adaptation (LoRA) on top of a 4-bit quantized base model (QLoRA). Here's a breakdown of each section and its role in fine-tuning:

**Model and Dataset:**

* `model_name`: This specifies the pre-trained model you want to use for fine-tuning. Here, it's set to "unsloth/Llama-3.2-1B-Instruct" from the Hugging Face hub ("TinyLlama/TinyLlama-1.1B-Chat-v1.0" is left commented out as an alternative).
* `new_model`: This defines the name you'll give to the fine-tuned adapter after training (here, "llama3.2-fine-tuned").

**LoRA Parameters:**

* `lora_r`: This defines the rank of the LoRA update matrices. It controls the number of additional trainable parameters introduced by LoRA (set to 64 here).
* `lora_alpha`: This parameter controls the scaling applied to the LoRA update, which is multiplied by `lora_alpha / lora_r` (set to 16 here, giving a scale of 0.25).
* `lora_dropout`: This sets the dropout probability for the LoRA layers, helping to prevent overfitting.

**BitsAndBytes Parameters (Quantization):**

* `use_4bit`: This activates 4-bit precision for loading the base model, which makes the quantized layers up to roughly 4x smaller than in FP16. It does not by itself make inference faster.
* `bnb_4bit_compute_dtype`: This sets the computation data type for the 4-bit model (here, "float16").
* `bnb_4bit_quant_type`: This specifies the type of quantization used (here, "nf4").
* `use_nested_quant`: This enables nested quantization (double quantization), which might further reduce memory usage but could impact accuracy.
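To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The parameter count is an assumption back-derived from the 2.47 GB 16-bit checkpoint that is downloaded later in this notebook. The numbers cover weight storage only and ignore that some layers (such as embeddings) are not quantized and that quantization constants add a small overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: ~1.24e9 parameters (2.47 GB checkpoint / 2 bytes per 16-bit weight).
num_params = 1.24e9

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gb(num_params, bits):.2f} GB")
```

Activations, gradients, and optimizer states come on top of this, which is why LoRA (few trainable parameters) and 4-bit loading complement each other.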

**TrainingArguments Parameters:**

* `output_dir`: This defines the directory where the model's predictions and checkpoints are saved during training ("./results" here).
* `num_train_epochs`: This sets the number of training epochs (iterations over the entire dataset). Here, it's set to 20.
* `fp16`, `bf16`: These enable mixed-precision training using 16-bit floating-point (fp16) or bfloat16 data types, potentially accelerating training on compatible hardware (set to False here).
* `per_device_train_batch_size`: This defines the number of training examples processed per GPU during each training step (set to 1 here). Similarly, `per_device_eval_batch_size` defines the batch size for evaluation.
* `gradient_accumulation_steps`: This accumulates gradients for multiple training steps before updating the model weights, potentially improving memory efficiency (set to 1 here).
* `gradient_checkpointing`: Enables gradient checkpointing, which saves memory by storing only a subset of activations during the forward pass and recomputing the rest during backpropagation, at the cost of extra compute (set to True here).
* `max_grad_norm`: This sets the maximum gradient norm for gradient clipping, preventing exploding gradients (set to 0.3 here).
* `learning_rate`: This defines the initial learning rate for the optimizer (AdamW here, set to 2e-4).
* `weight_decay`: This applies weight decay (L2 regularization) to all layers except bias and LayerNorm weights, helping to prevent overfitting (set to 0.001 here).
* `optim`: This specifies the optimizer used for training. Here, it's set to "paged_adamw_32bit".
* `lr_scheduler_type`: This defines the learning rate schedule. Here, "cosine" is used, which gradually reduces the learning rate over training.
* `max_steps`: This sets the total number of training steps (overrides `num_train_epochs`). Here, it's set to -1, meaning all epochs will be used.
* `warmup_ratio`: This defines the portion of training steps for a linear warmup of the learning rate (set to 0.03 here).
* `group_by_length`: This groups sequences of similar lengths into batches, improving memory efficiency and training speed (enabled here).
* `save_steps`: This sets the number of training steps between saving model checkpoints (set to 0 here, meaning no intermediate saves).
* `logging_steps`: This defines the number of training steps between logging training information (set to 25 here).
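The interaction between `num_train_epochs`, the batch settings, `warmup_ratio`, and the "cosine" schedule can be sketched in a few lines. This is an illustrative re-implementation of the general shape (linear warmup, then cosine decay to zero), not the library's own code. The dataset size of 22 is taken from the `Map: 0/22` progress bar in the training output, and `lr_at_step` is a hypothetical helper name.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# 22 examples, batch size 1, gradient_accumulation_steps 1, 20 epochs.
steps_per_epoch = math.ceil(22 / (1 * 1))
total_steps = steps_per_epoch * 20

for step in (0, 7, 14, 220, 440):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

Under these assumptions the learning rate ramps up over the first few percent of the steps, peaks at `learning_rate`, and then decays smoothly towards zero by the final step.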

**SFT Parameters:**

* `max_seq_length`: This sets the maximum sequence length for training and inference (can be left as None).
* `packing`: This enables packing multiple short examples into a single input sequence to improve efficiency (disabled here).
* `device_map`: This defines where the model is placed. Here, `{"": 0}` loads the entire model onto GPU 0.

These parameters allow you to fine-tune the pre-trained model for a

In [None]:
# The model that you want to train from the Hugging Face hub
#model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model_name = "unsloth/Llama-3.2-1B-Instruct" #pretrained model
#model_name = "/content/final_weights_new"

# The instruction dataset to use
#dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama3.2-fine-tuned"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 20

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 1

# Batch size per GPU for evaluation
per_device_eval_batch_size = 1

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4  # 0.0002, i.e. 2 x 10^-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0} # "auto"

# Fine-tuning
Parameter-efficient fine-tuning (PEFT) is a technique used to adapt large pre-trained language models (LLMs) to new tasks while significantly reducing the number of parameters that need to be trained. Here's a breakdown of the key points:

**Challenge of Fine-Tuning LLMs:**

* LLMs are massive, with billions of parameters.
* Fine-tuning them on new tasks often requires training all these parameters, which can be:
    * Computationally expensive (takes a long time and requires powerful hardware).
    * Prone to overfitting (the model memorizes the training data instead of learning generalizable patterns).

**PEFT Approach:**

PEFT addresses these challenges by focusing on training only a small subset of the model's parameters while keeping the rest frozen. This allows for:

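* **Faster Training:** Fewer trainable parameters means fewer gradients to compute and store, which shortens training.
* **Reduced Memory Usage:** Gradients and optimizer states are kept only for the trainable parameters, so training needs far less memory.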
* **Improved Generalizability:** By not retraining everything, PEFT can help prevent overfitting and improve the model's ability to adapt to unseen data.

**How PEFT Works:**

There are several approaches to PEFT. This notebook uses:

* **Low-Rank Adaptation (LoRA):** Introduces a small set of additional parameters, a pair of low-rank matrices per targeted layer, that act as "adapters" on top of the pre-trained model. The original weights stay frozen and only the adapters are trained, so the model adapts to the new task without changing its core parameters.
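A quick count shows why LoRA adapters are cheap to train. For a weight of shape `d_out x d_in`, LoRA adds two matrices, `A` (`r x d_in`) and `B` (`d_out x r`). The sketch below uses an assumed, illustrative layer size (real projection shapes depend on the model) together with the `lora_r` and `lora_alpha` values from this notebook; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r

# Illustrative (assumed) layer size; real projection shapes depend on the model.
d_in = d_out = 2048
r, alpha = 64, 16  # lora_r and lora_alpha from this notebook

full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
print(f"full weight:  {full:,} params")
print(f"LoRA adapter: {lora:,} params ({lora / full:.2%} of the full weight)")
print(f"update scale alpha / r = {alpha / r}")
```

Lowering `r` shrinks the adapter further, at the cost of a less expressive update.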

**Benefits of PEFT:**

* Enables fine-tuning LLMs on resource-constrained hardware (e.g., a single Colab GPU).
* Reduces training costs associated with large models.
* Can potentially improve the generalizability of the fine-tuned model.

**Overall, PEFT is a valuable technique for making LLMs more accessible and adaptable to a wider range of tasks while keeping computational efficiency in mind.**

Here's a breakdown of why 4-bit quantization is used and what happens to the vectors:

**Why Use 4-Bit Quantization?**

The code loads the base model in 4-bit precision for two main reasons:

1. **Reduced Model Size and Memory Usage:** Compared to storing weights as 16-bit or 32-bit floating-point numbers, 4-bit quantization (4 bits per weight) significantly reduces the memory needed to hold the model. This is what lets the base model and the LoRA training state fit on a small GPU, and it also matters when deploying to devices with limited memory.

2. **Potentially Faster Inference:** This is not guaranteed. Lower-precision formats help when inference is limited by memory bandwidth and the runtime has efficient 4-bit kernels. Otherwise the cost of dequantizing the weights can cancel the gain.

**Is it Quantization-Aware Fine-Tuning?**

Not in the usual sense. Quantization-aware training simulates quantization while updating the very weights that will later be quantized. The setup here is QLoRA instead:

* **`BitsAndBytesConfig`:** This configuration controls how the base model is quantized when it is loaded. The 4-bit base weights stay frozen throughout training.
* **Target Modules for LoRA:** The LoRA adapters attached to the listed modules are the only trainable parameters, and they are kept in higher precision. During each forward and backward pass the 4-bit weights are dequantized to the compute dtype, so gradients flow through the quantized model into the adapters.

**What Happens to the Vectors During Quantization?**

During 4-bit quantization, the weights of the model's linear layers (originally stored in 16-bit or 32-bit floating point) are converted to 4-bit codes. Activations are not quantized. The conversion works on small blocks of weights and involves:

1. **Scaling:** Each block is divided by a scaling constant, typically its largest absolute value, so that the values fit the range the 4-bit codes can represent. The constant is stored alongside the block.
2. **Rounding or Quantization:** Each scaled value is mapped to the nearest of the 16 levels available with 4 bits. With `nf4`, the levels are spaced to match normally distributed weights rather than evenly.
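The scaling and rounding steps can be illustrated with a minimal absmax quantizer for one block of weights. Note the assumptions: this sketch uses evenly spaced signed integer levels (-7 to 7) for simplicity, whereas `nf4` uses a non-uniform set of levels, and the weights are made-up numbers. It is not the bitsandbytes implementation.

```python
def quantize_block(values, levels=7):
    """Absmax quantization of one block to signed integers in [-levels, levels]."""
    scale = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    codes = [round(v / scale * levels) for v in values]
    return codes, scale

def dequantize_block(codes, scale, levels=7):
    """Map integer codes back to approximate floating-point values."""
    return [c / levels * scale for c in codes]

weights = [0.12, -0.40, 0.05, 0.33, -0.21, 0.00, 0.27, -0.08]  # made-up example block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

print("codes:   ", codes)
print("restored:", [round(v, 3) for v in restored])
print("max abs error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight is now a small integer plus one shared `scale` per block; the rounding error is bounded by half a quantization step, which is the accuracy cost discussed below.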

**Impact on Accuracy:**

Quantization, especially aggressive quantization like 4-bit, can introduce some loss of accuracy compared to the original FP32 model. However, the goal is to find a balance between reduced model size/inference speed and acceptable accuracy for the specific task.

**Additional Notes:**

* `bnb_4bit_quant_type` selects the 4-bit data type: `nf4` (4-bit NormalFloat, used here) or `fp4` (4-bit floating point).
* `bnb_4bit_compute_dtype` (`float16` here) is the data type the 4-bit weights are dequantized to for matrix multiplications during training and inference.

4-bit quantization aims to reduce model size and potentially speed up inference while considering the trade-off with accuracy.

In [None]:
# Load dataset (you can process it here)
#dataset = load_dataset(dataset_name, split="train")
%cd "/content"
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
print(compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    """Render each Questions/Answers pair into the prompt template."""
    questions = examples["Questions"]
    answers = examples["Answers"]
    texts = []
    for question, answer in zip(questions, answers):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = input_prompt.format(question, answer) + EOS_TOKEN
        texts.append(text)
    print(texts)
    return {"text": texts}
'''
def formatting_prompts_func(examples):
    inputs       = examples["instruction"]
    outputs      = examples["output"]
    texts = []
    for input, output in zip(inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = input_prompt.format(input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass'''

from datasets import load_dataset
dataset = load_dataset('csv', data_files='output.csv',split="train")
#dataset = load_dataset("nmdr/Mini-Physics-Instruct-1k", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",

)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,

)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

/content
torch.float16


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/927 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/22 [00:00<?, ? examples/s]

['Below is a Human Input, write appropriate Response based on the input.\n\n### Input:\nWhat is Nadi?\n\n### Response:\nNadi is an offline Large Language Model (LLM) based AI PC application designed for AI education, eliminating the need for a real-time internet connection.<|eot_id|>', 'Below is a Human Input, write appropriate Response based on the input.\n\n### Input:\nWho created Nadi?\n\n### Response:\nNadi was created by Convai Innovations from Kerala, India.<|eot_id|>', 'Below is a Human Input, write appropriate Response based on the input.\n\n### Input:\nWho is the CEO of Convai Innovations?\n\n### Response:\nNandakishor is the CEO of Convai Innovations.<|eot_id|>', 'Below is a Human Input, write appropriate Response based on the input.\n\n### Input:\nDoes Nadi require an internet connection?\n\n### Response:\nNo, Nadi is completely offline, protecting your privacy and making it ideal for users with limited internet access.<|eot_id|>', 'Below is a Human Input, write appropriate 


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/22 [00:00<?, ? examples/s]

Step,Training Loss
25,2.6
50,1.2513
75,0.5205
100,0.2384
125,0.1673
150,0.1364
175,0.1109
200,0.0838
225,0.0738
250,0.0673


## Self-Attention with Query, Key, Value

Self-attention is a powerful mechanism in transformers that allows the model to focus on relevant parts of the input sequence when processing information. It works with three key components: query, key, and value.

**Analogy:** Imagine you're at a party and want to find someone specific (the answer). You (the model) ask everyone at the party a question (the query) to identify potential matches. This question could be "Are you interested in X?". Everyone responds with a short description of themselves (the key). You then compare these descriptions to what you're looking for (compare query and key). Finally, you talk to the people whose descriptions seem most relevant (high comparison score) and get more information from them (the value).

**Formally:**

* **Query (Q):** A vector representing the current focus of attention, produced from a token's hidden state by the `q_proj` layer. It's like your question at the party.
* **Key (K):** A vector representing each element in the input sequence, produced by `k_proj`. It's like the short description of each person at the party.
* **Value (V):** A vector containing the actual information associated with each element in the sequence, produced by `v_proj`. It's like the detailed information you get from the relevant people.

The model calculates a score for each element in the sequence as the dot product of its "key" with the "query," scaled by the square root of the vector dimension. Higher scores indicate a better match. A softmax turns the scores into weights that sum to 1, and the model uses these weights to average the "values" from each element, creating a new representation that focuses on the most relevant parts of the sequence.

**Example:**

Consider the sentence "The cat sat on the mat."

* **Query:** The query vector could represent the word we're currently focusing on, say "sat."
* **Key:** Each word in the sentence would have a key vector. For example, the key vector for "cat" might capture its semantic meaning (e.g., furry animal).
* **Value:** The value vector for each word is a learned projection of its hidden representation, holding the information that word contributes.

The model would compare the query vector for "sat" with the key vectors of all words. The key vector for "cat" might have a higher score than others because the cat is the one doing the sitting, which makes it highly relevant to "sat." The model would then use this score to weight the value vector of "cat," giving it more influence in the final representation.
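The scoring and weighting described above is scaled dot-product attention. Below is a small pure-Python sketch for a single query. All vectors are made-up two-dimensional numbers chosen so that "cat" matches the query best; real models use learned projections and many attention heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output

# Made-up 2-dimensional vectors for three tokens.
tokens = ["the", "cat", "mat"]
keys = [[0.1, 0.0], [1.0, 0.9], [0.3, 0.2]]
values = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
query = [1.0, 1.0]  # stands in for the query of "sat"

weights, output = attend(query, keys, values)
for tok, w in zip(tokens, weights):
    print(f"{tok:>4}: {w:.3f}")
print("output:", [round(x, 3) for x in output])
```

The token whose key aligns best with the query receives the largest weight, so its value vector dominates the output.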

## Gate, Up-proj, Down-proj, and O


`gate_proj`, `up_proj`, and `down_proj` are all part of a transformer block, specifically within the **MLP sub-block** (`o_proj` belongs to the attention sub-block). They perform linear projections on the hidden state, which represents the current understanding of the sequence at that point in processing.

Here's a breakdown of their roles and what they project to:

**Up Projection:** Increases the dimensionality of the input (expands the vector to a higher-dimensional space).

**Gate:** In gated FFNs such as the one used in Llama, `gate_proj` projects the same input to the same higher dimension as `up_proj`. After an activation function, its output is multiplied element-wise with the `up_proj` output, controlling how much of each feature passes through. This is the idea behind Gated Linear Units (GLUs).

**Down Projection:** Compresses the output from the higher-dimensional space back to the original input dimension size.

**Output Projection:** `o_proj` belongs to the attention sub-block rather than the MLP. After the attention heads have been computed and concatenated, it maps the result back to the model's hidden dimension so that it can be added to the residual stream and used by subsequent layers.

**Overall Flow:**

1. The hidden state, representing the current understanding of the sequence, is fed into both `gate_proj` and `up_proj`.
2. `gate_proj` projects it to a higher (intermediate) dimension, and an activation function is applied to the result.
3. `up_proj` projects the same hidden state to the same intermediate dimension, and the two results are multiplied element-wise.
4. Finally, `down_proj` projects the gated result back to the original embedding dimension, resulting in a compressed but informative representation.

**Where it's Used:**

This compressed representation is then added to the residual connection (the hidden state that entered the MLP sub-block). In Llama-style models the normalization step is applied before each sub-block rather than after it. The sum becomes the new hidden state for the next transformer block in the sequence, allowing the model to build a deeper understanding as it processes the entire sequence.

**In Summary:**

* `gate_proj`, `up_proj`, and `down_proj` are within the **MLP sub-block** of a transformer block.
* They project the hidden state to explore complex relationships in the sequence data.
* `gate_proj` and `up_proj` project the same input to a higher dimension in parallel, and their outputs are combined by element-wise multiplication.
* `down_proj` projects back to the original dimension for a compressed but informative representation.
* This final representation is used to update the hidden state for the next transformer block.
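In Llama-style models the three MLP projections combine as `down_proj(act(gate_proj(h)) * up_proj(h))`. The pure-Python sketch below shows that data flow with made-up weights and tiny dimensions (hidden size 2, intermediate size 3). SiLU is assumed as the activation, and bias terms and normalization are left out.

```python
import math

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix, given as nested lists, by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def silu(x):
    """SiLU / swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def gated_mlp(hidden, w_gate, w_up, w_down):
    """Compute down_proj(silu(gate_proj(h)) * up_proj(h))."""
    gate = [silu(g) for g in matvec(w_gate, hidden)]  # hidden_dim -> intermediate_dim
    up = matvec(w_up, hidden)                         # hidden_dim -> intermediate_dim
    gated = [g * u for g, u in zip(gate, up)]         # element-wise gating
    return matvec(w_down, gated)                      # intermediate_dim -> hidden_dim

# Made-up weights: hidden_dim = 2, intermediate_dim = 3.
w_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
w_up = [[0.2, 0.7], [-0.6, 0.1], [0.3, 0.3]]
w_down = [[0.1, 0.4, -0.3], [0.2, -0.5, 0.6]]

hidden = [1.0, -1.0]
out = gated_mlp(hidden, w_gate, w_up, w_down)
residual = [h + o for h, o in zip(hidden, out)]  # residual connection
print("MLP output:", [round(x, 4) for x in out])
print("new hidden:", [round(x, 4) for x in residual])
```

Both `gate_proj` and `up_proj` read the same hidden state; neither feeds the other. LoRA adapters attached to these three modules adjust exactly these matrix multiplications.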

In [None]:
!rm -rf "final_weights_new"

In [None]:
##Inference
inputs = tokenizer(
[
    input_prompt.format(
        "what are its uses?", # input
        "",   # leave blank as response generated by AI

    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True)
generated_text = tokenizer.batch_decode(outputs)[0]
first_response = generated_text.split('### Response:')[1].strip()
output = first_response.split('###')[0].strip()
print("the response is: ",output)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


the response is:  Nadi is designed for beginners with no coding experience. It provides a comprehensive learning foundation in AI concepts and Google Colab's interface is user-friendly, making it ideal for grasping coding basics. Additionally, Nadi's online environment allows users to practice coding exercises and receive feedback on their code. This makes it a valuable tool for understanding fundamental concepts in machine learning and AI.<|eot_id|>
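The printed response still ends with the `<|eot_id|>` token, and the split on `### Response:` raises an `IndexError` if the marker is missing. A slightly more defensive variant of the same parsing could look like the sketch below; `extract_response` is a hypothetical helper, not part of any library, and the sample text is made up.

```python
def extract_response(generated_text: str, marker: str = "### Response:", eos: str = "<|eot_id|>") -> str:
    """Return the first response that follows `marker`, without template or EOS residue."""
    if marker not in generated_text:
        return ""  # the model never produced a response section
    response = generated_text.split(marker, 1)[1]
    response = response.split("###", 1)[0]  # cut off any further "### ..." section
    response = response.split(eos, 1)[0]    # drop the end-of-turn token and what follows
    return response.strip()

sample = "### Input:\nWhat is Nadi?\n\n### Response:\nNadi is an offline AI application.<|eot_id|>"
print(extract_response(sample))
```

Alternatively, passing `skip_special_tokens=True` to `tokenizer.batch_decode` removes special tokens at decode time.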


In [None]:
# Reload the base model in FP16 and merge the LoRA weights into it:
# W_merged = W + (lora_alpha / lora_r) * B @ A
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()  # fold the adapters into the base weights

# Reload tokenizer to save it
#tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
#tokenizer.pad_token = tokenizer.eos_token
#tokenizer.padding_side = "right"
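Merging works because the adapter's contribution is linear: applying the base weight and the scaled low-rank correction separately gives the same result as applying one merged weight. The toy example below checks this with made-up 2x2 matrices and rank 1. It is only an illustration of the arithmetic, not what `merge_and_unload()` executes internally.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

r, alpha = 1, 2  # tiny illustrative rank and alpha
scale = alpha / r

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2 x 2)
B = [[0.5], [-0.5]]           # LoRA B: 2 x r
A = [[0.2, 0.4]]              # LoRA A: r x 2
x = [3.0, -1.0]

# Adapter kept separate: base output plus scaled low-rank correction.
separate = [b + scale * d for b, d in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Adapter merged: fold the correction into the weight once, then use a single matmul.
delta = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
merged = matvec(W_merged, x)

print(separate, merged)
```

After merging, the model has the same shape as the original base model, which is why it can be saved and served without the PEFT library.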

In [None]:
output_dir = "final_weights_new"
model.save_pretrained(output_dir)

In [None]:

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.save_pretrained(output_dir)

('final_weights_new/tokenizer_config.json',
 'final_weights_new/special_tokens_map.json',
 'final_weights_new/tokenizer.json')

# Hugging Face inference of the saved model

In [None]:
# Run a text-generation pipeline with our new (merged) model
# Load the model directly from the saved directory
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/content/final_weights_new")
model = AutoModelForCausalLM.from_pretrained("/content/final_weights_new", device_map = "auto")
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=2048)
prompt=input_prompt.format(
        "who is Nandakishor?", # input
        "", # leave blank as response generated by AI

    )
result = pipe(prompt, temperature=0.05)
generated_text  = result[0]['generated_text']

first_response = generated_text.split('### Response:')[1].strip()
first_response = first_response.split("\n")[0]

print(first_response)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Nandakishor is the CEO of Convai Innovations. He is hailing from Kasaragod, his company is in Kerala startup mission office. His company is focusing on building Multimodal generative AI models in healthcare. Nandakishor is known for his strong leadership skills and his company is receiving positive feedback from the startup ecosystem.


In [None]:
prompt=input_prompt.format(
        "who devloped it?", # input
        "", # leave blank as response generated by AI

    )
result = pipe(prompt, temperature=0.05)
generated_text  = result[0]['generated_text']

first_response = generated_text.split('### Response:')[1].strip()
first_response = first_response.split("\n")[0]

print(first_response)

Nadi was developed by Convai Innovations from Kerala, India. It is an offline Large Language Model (LLM) based AI model designed for AI education purposes.


# Deployment as an OpenAI Compatible API

## Install vLLM + Haystack

- We install vLLM using pip ([docs](https://docs.vllm.ai/en/latest/getting_started/installation.html)).
- For production use cases, there are many other options, including Docker ([docs](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html)).

In [None]:
!pip install vllm haystack-ai

Collecting vllm
  Downloading vllm-0.6.4.post1-cp38-abi3-manylinux1_x86_64.whl.metadata (10 kB)
Collecting haystack-ai
  Downloading haystack_ai-2.7.0-py3-none-any.whl.metadata (13 kB)
Collecting uvicorn[standard] (from vllm)
  Downloading uvicorn-0.32.1-py3-none-any.whl.metadata (6.6 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.0.0-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken>=0.6.0 (from vllm)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting lm-format-enforcer<0.11,>=0.10.9 (from vllm)
  Downloading lm_format_enforcer-0.10.9-py3-none-any.whl.metadata (17 kB)
Collecting outlines<0.1,>=0.0.43 (from vllm)
  Downloading outlines-0.0.46-py3-none-any.whl.metadata (15 kB)
Collecting partial-json-parser (from vllm)
  Downloading partial_json_parser-0.2.1.1.post4-py3-none-any.whl.metadata (6.2 kB)
Collecting msgspec (from vllm)
  Downloading msg

# Flask + Gunicorn (WSGI) + nginx

In [None]:
#!sudo fuser -k 8000/tcp

In [None]:
# we prepend "nohup" and append "&" to make the Colab cell run in the background
! nohup python -m vllm.entrypoints.openai.api_server \
                  --model /content/final_weights_new \
                  --dtype auto \
                  --max-model-len 2048 \
                  > vllm.log &

nohup: redirecting stderr to stdout


In [None]:
# we check the logs until the server has been started correctly
!while ! grep -q "Application startup complete" vllm.log; do tail -n 1 vllm.log; sleep 5; done

To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO 11-23 14:40:15 api_server.py:194] Started engine process with PID 21229
INFO 11-23 14:40:15 api_server.py:194] Started engine process with PID 21229
INFO 11-23 14:40:15 api_server.py:194] Started engine process with PID 21229
INFO 11-23 14:41:03 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='/content/final_weights_new', speculative_config=None, tokenizer='/content/final_weights_new', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=De

In [None]:
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
import string
import random

# initializing size of string
N = 20

# using random.choices()
# generating random strings
res = ''.join(random.choices(string.ascii_uppercase +
                             string.digits, k=N))

generator = OpenAIChatGenerator(
    api_key=Secret.from_token(res),  # for compatibility with the OpenAI API, a placeholder api_key is needed
    model="/content/final_weights_new",
    api_base_url="http://localhost:8000/v1",
    generation_kwargs = {"max_tokens": 1024, "temperature":0.01}
)


In [None]:

messages = []

while True:
  msg = input("Enter your message or Q to exit\n🧑 ")
  if msg=="Q":
    break
  messages.append(ChatMessage.from_user(msg))
  response = generator.run(messages=messages)
  assistant_resp = response['replies'][0]
  print("🤖 "+assistant_resp.content)
  messages.append(assistant_resp)

Enter your message or Q to exit
🧑 hi


KeyboardInterrupt: 

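# Quantization
## Key Concepts:
**GGUF**: The single-file binary model format used by llama.cpp (the successor to the older GGML format). It stores the model's tensors, quantized or not, together with metadata such as the architecture and tokenizer. The name is commonly expanded as "GPT-Generated Unified Format".

**llama.cpp**: A C/C++ library for running inference on GGUF models; it also ships the conversion and quantization tools used below.

**LoRA (Low-Rank Adaptation)**: A parameter-efficient fine-tuning technique that freezes the base weights and trains small low-rank adapter matrices, which can later be merged back into the base model.

**Quantization**: Storing model weights at lower precision (for example 4-bit or 8-bit integers plus per-block scale factors) instead of 16-bit or 32-bit floats, which reduces model size and memory traffic at the cost of some accuracy.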
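
The idea of quantization can be illustrated with a minimal sketch of block-wise 8-bit quantization: each block of 32 weights shares one scale, and every weight is rounded to an integer in [-127, 127]. This is a simplified illustration in the spirit of Q8_0, not llama.cpp's implementation; the function names and the random test weights are assumptions.

```python
import random


def quantize_block(block):
    """Quantize one block of floats to integers in [-127, 127] plus a single scale."""
    max_abs = max(abs(w) for w in block)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in block]
    return scale, q


def quantize(weights, block_size=32):
    """Split the weights into blocks and quantize each block independently."""
    return [quantize_block(weights[i:i + block_size])
            for i in range(0, len(weights), block_size)]


def dequantize(blocks):
    """Recover approximate floats: weight ~ scale * integer."""
    out = []
    for scale, q in blocks:
        out.extend(scale * v for v in q)
    return out


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(128)]  # toy stand-in for a weight tensor
blocks = quantize(weights)
restored = dequantize(blocks)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(blocks)} blocks, max abs error {max_err:.6f}")
```

Because each block has its own scale, one large outlier only degrades the precision of the 32 weights in its block rather than the whole tensor.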

## Quantization Methods:
1. **Format Breakdown:**
Names such as `Q4_K_M` follow the pattern `Q#_K_[S/M/L]`:
   - `#`: Nominal number of bits per weight (e.g., Q4 = 4 bits).
   - `K`: Marks a "k-quant" type. These store weights in super-blocks of 256 values split into smaller blocks, each with its own quantized scale (and, for some types, minimum). It does not involve low-rank factorization.
   - `[S/M/L]`: The mix of quant types used across tensors. S (Small): the smallest file, lowest precision. M (Medium): keeps some of the more sensitive tensors at higher precision, a balance of size and quality. L (Large): keeps more tensors at higher precision, the largest file of the three.
2. **Conversion Step:**
Imagine the model weights residing in an apartment complex (the Hugging Face FP16 checkpoint).
Conversion acts like a renovation plan:
   - Rearrangement: Tensors are renamed and laid out in the GGUF format that the llama.cpp tools expect.
   - Packaging: Model metadata and the tokenizer are written into the same file.
   - No actual quantization happens here when `--outtype f16` is used; it's all about getting ready for the big transformation.
3. **Quantization Step:**
Now, the actual transformation begins!
General Process:
   - Scaling: Like measuring walls before painting, each block of weights gets its own scale factor (and, for some types, a minimum) computed from the weights themselves. No calibration data is required for the standard types; llama.cpp can optionally use an importance matrix computed from sample text to improve low-bit quantization.
   - Quantization: Weights are divided by their block's scale and rounded to integers within a limited range, like assigning each shade to the nearest color in a small palette.
   - Mixed precision (K methods): Not every tensor is treated the same. The S/M/L variants keep selected tensors at a higher-bit type, while small tensors such as normalization weights stay in F32.
   - No retraining: `llama-quantize` is a post-training step, so the model is not fine-tuned afterwards. If the quality loss is too large, choose a higher-bit type.

**Merged LoRA Weights:**
Imagine LoRA adapters as extensions added to the apartment complex. They hold task-specific knowledge.
Once the adapters are merged into the base weights, the checkpoint contains only ordinary weight tensors. Conversion and quantization then treat the fine-tuned knowledge exactly like the rest of the model, so no separate handling of the LoRA weights is needed.
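
Merging a LoRA adapter into a base weight can be sketched with the standard LoRA update W' = W + (alpha / r) * B A. The tiny matrices, the `alpha` and `r` values, and the helper names below are illustrative assumptions; in practice a library such as PEFT performs this merge on the real tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(w, lora_b, lora_a, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    delta = matmul(lora_b, lora_a)
    scaling = alpha / r
    return [[w[i][j] + scaling * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# toy 2x3 base weight with a rank-1 adapter (hypothetical values)
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
lora_b = [[1.0], [2.0]]       # shape (2, r)
lora_a = [[0.5, 0.0, -0.5]]   # shape (r, 3)

merged = merge_lora(w, lora_b, lora_a, alpha=2, r=1)
print(merged)
```

After the merge, `merged` has the same shape as `w`, which is why the quantizer sees nothing but regular weight tensors.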
**Choosing the Right Method:**
It's like picking the right renovation plan:
   - Desired Size Reduction: How much do you want to shrink the apartment complex (model)?
   - Accuracy Trade-off: How much "color change" (quantization error) can you tolerate?
   - Hardware Compatibility: Does the target device have enough memory, and does its llama.cpp backend run the chosen type efficiently?
   - Quality Check: Can you evaluate the quantized model on your own prompts before deploying it?

**Example: Q4_K_M Explained:**
This is like a moderate renovation:
   - Most walls get painted from "4-bit" palettes (4-bit k-quant blocks).
   - Some of the more sensitive tensors are kept at a higher-bit type (Q6_K) to protect accuracy.
   - The balance between space saving and accuracy is carefully considered (the medium mix).

**Additional Note:**
Q8_0 quantizes every weight to 8 bits, in blocks of 32 weights that share one scale. It stays very close to FP16 quality but saves much less space than the 4-bit types.
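
To make the size trade-off concrete, here is a rough back-of-the-envelope sketch that derives bits per weight from the block layouts and turns them into file-size estimates. The parameter count `N_PARAMS` is an assumed round figure for a model of about 1B parameters, and real Q4_K_M files come out somewhat larger than the pure Q4_K estimate because some tensors use higher-bit types.

```python
# bits per weight = bytes per block * 8 / weights per block
F16_BPW = 16.0
Q8_0_BPW = (2 + 32) * 8 / 32               # one fp16 scale + 32 int8 values
Q4_K_BPW = (2 + 2 + 12 + 128) * 8 / 256    # two fp16 values + packed scales/mins + 4-bit quants


def size_gib(n_params, bits_per_weight):
    """Approximate tensor storage in GiB, ignoring metadata and mixed tensor types."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 1.2e9  # assumption: approximate parameter count, not read from the model file

for name, bpw in [("F16", F16_BPW), ("Q8_0", Q8_0_BPW), ("Q4_K", Q4_K_BPW)]:
    print(f"{name:5s} {bpw:5.2f} bits/weight  ~{size_gib(N_PARAMS, bpw):.2f} GiB")
```

The per-block scales are why the effective cost is 8.5 and 4.5 bits per weight rather than exactly 8 and 4.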


In [None]:
!git clone https://github.com/ggerganov/llama.cpp
%cd llama.cpp
!make

Cloning into 'llama.cpp'...
remote: Enumerating objects: 38105, done.[K
remote: Counting objects: 100% (187/187), done.[K
remote: Compressing objects: 100% (125/125), done.[K
remote: Total 38105 (delta 75), reused 137 (delta 62), pack-reused 37918 (from 1)[K
Receiving objects: 100% (38105/38105), 61.35 MiB | 16.97 MiB/s, done.
Resolving deltas: 100% (27614/27614), done.
/content/llama.cpp
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CPU_AARCH64 -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_AMX  -std=c11   -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promot

In [None]:
%cd /content/llama.cpp
!python3 convert_hf_to_gguf.py /content/final_weights_new --outtype f16

/content/llama.cpp
INFO:hf-to-gguf:Loading model: final_weights_new
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {32}
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {2048, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.float16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.float16 --> F16, shape = {8192, 2048}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.float16 --> F16, shape = {2048, 8192}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.float16 --> F16, shape = {2048, 8192}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.float16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> F16, shape = {2048, 512}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.float16 --> F16, shape 

In [None]:
!./llama-quantize /content/final_weights_new/Llama-3.2-1B-Instruct-F16.gguf /content/final_weights_new/ggml-model-q4_k_m.gguf q4_k_m


main: build = 4156 (9336db46)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/content/final_weights_new/Llama-3.2-1B-Instruct-F16.gguf' to '/content/final_weights_new/ggml-model-q4_k_m.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 30 key-value pairs and 147 tensors from /content/final_weights_new/Llama-3.2-1B-Instruct-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B Instruct
llama_model_loader: - kv   3:                       general.organization str              = Unsloth
llama_model_loader: - kv   4:                           general.finetune s

In [None]:
!python --version

Python 3.10.12


In [None]:
# Android app pocketpal

input_prompt = """Below is a Human Input, write appropriate Response based on the input.

### Input:
{}

### Response:
{}"""

from llama_cpp import Llama

# n_gpu_layers only has an effect if llama-cpp-python was built with GPU support
llm = Llama(model_path="/content/final_weights_new/ggml-model-q4_k_m.gguf", n_gpu_layers=30)
prompt = input_prompt.format(
    "who is Nandakishor?",  # input
    "",                     # left blank: the response is generated by the model
)

output = llm(prompt, max_tokens=200, temperature=0.01)
generated_text = output['choices'][0]['text']
# keep only the first response in case the model starts a new "### Input:" block
first_response = generated_text.split('### Input:')[0].strip()

print(first_response)

llama_model_loader: loaded meta data with 30 key-value pairs and 147 tensors from /content/final_weights_new/ggml-model-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B Instruct
llama_model_loader: - kv   3:                       general.organization str              = Unsloth
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   6:                         general.size_label str              = 1B
llama_model_loader: - kv   7:      

Nandakishor is the CEO of Convai Innovations. He is hailing from Kasaragod, his company is in Kerala startup mission office. His company is focusing on building Multimodal generative AI in healthcare. Nandakishor is known for his strong leadership and his company is expecting a significant impact in the healthcare AI space. His company is also exploring opportunities in building Multimodal generative AI in other sectors like education and finance. His company is currently in stealth mode, but it's expected to make an official launch soon. He is also a member of the Kerala startup mission committee. His company is receiving funding from several investors, including Convai Innovations itself. This funding is expected to support Nandakishor's mission to build Multimodal generative AI in healthcare. His company is also planning to launch a demo model of their AI model soon. He is excited about the potential of his company's mission and is confident that it will make


In [None]:
prompt = input_prompt.format(
    "who is Nandakishor?",  # input
    "",                     # left blank: the response is generated by the model
)

output = llm(prompt, max_tokens=200, temperature=0.01)
generated_text = output['choices'][0]['text']
first_response = generated_text.split('### Input:')[0].strip()

print(first_response)

Llama.generate: 28 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =    1578.00 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /    92 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   17057.03 ms /    93 tokens


Nandakishor is the CEO of Convai Innovations. He is hailing from Kasaragod, his company is in Kerala startup mission office. His company is focusing on building Multimodal generative AI in healthcare. Nandakishor is known for his strong leadership and mentorship skills. He is also a strong believer in the potential of AI in solving real-world problems. His company is expecting to launch their product soon.
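
The prompt formatting and output trimming used in the two cells above can be sketched as small reusable helpers. The names `build_prompt` and `extract_first_response` and the sample string are hypothetical; the template text is the one defined earlier in the notebook.

```python
INPUT_PROMPT = """Below is a Human Input, write appropriate Response based on the input.

### Input:
{}

### Response:
{}"""


def build_prompt(user_input: str) -> str:
    """Fill the template, leaving the response slot empty for the model."""
    return INPUT_PROMPT.format(user_input, "")


def extract_first_response(generated_text: str, marker: str = "### Input:") -> str:
    """Drop anything the model generated after it started a new input block."""
    return generated_text.split(marker)[0].strip()


# hypothetical raw completion that runs on into a second turn
sample = " He is the CEO.\n\n### Input:\nwho else?\n\n### Response:\nNobody."
print(extract_first_response(sample))
```

With llama-cpp-python, passing `stop=["### Input:"]` to the `llm(...)` call should have a similar effect by ending generation at the marker, which also avoids spending tokens on text that is thrown away.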


# RAG

[![YouTube Video](https://img.youtube.com/vi/qppV3n3YlF8/0.jpg)](https://www.youtube.com/watch?v=qppV3n3YlF8)

More information: https://medium.com/@Nirodya_Pussadeniya/retrieval-augmented-generation-rag-a594873b9a96


In [None]:
#!gdown https://drive.google.com/uc?id=1oJbEM62MramxAfEvNPG-8jUptdBtMDyS

In [None]:
!pip install llama-cpp-python



In [None]:
!pip install langchain # library for RAG
!pip install -U langchain-community
!pip install sentence-transformers
!pip install unstructured
!pip install chromadb #vector db

Collecting langchain
  Downloading langchain-0.3.3-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.10 (from langchain)
  Downloading langchain_core-0.3.12-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.0 (from langchain)
  Downloading langchain_text_splitters-0.3.0-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.135-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core<0.4.0,>=0.3.10->langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting orjson<4.0.0,>=3.9.14 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downloading orjson-3.10.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (50 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.4

In [None]:
from torch import cuda
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
!mkdir -p "/content/data"
!cp "/content/output.csv" "/content/data"

In [None]:
%cd /content
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# load every file in ./data (here: output.csv with the question/answer pairs)
loader = DirectoryLoader('data')
data = loader.load()

# split into chunks of at most 500 characters; with chunk_overlap=0 a question
# can be separated from its answer at a chunk boundary
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)

/content


In [None]:
all_splits

[Document(metadata={'source': 'data/output.csv'}, page_content='Questions\nAnswers\n\n\nWhat is Nadi?\nNadi is an offline Large Language Model (LLM) based AI PC application designed for AI education, eliminating the need for a real-time internet connection.\n\n\nWho created Nadi?\nNadi was created by Convai Innovations from Kerala, India.\n\n\nWho is the CEO of Convai Innovations?\nNandakishor is the CEO of Convai Innovations.'),
 Document(metadata={'source': 'data/output.csv'}, page_content='Does Nadi require an internet connection?\nNo, Nadi is completely offline, protecting your privacy and making it ideal for users with limited internet access.\n\n\nWhat educational resources does Nadi offer?\nNadi features a wealth of preloaded educational resources, including lecture notes, videos, diagrams, and Google Colab notebooks, covering various topics in AI and computer science.'),
 Document(metadata={'source': 'data/output.csv'}, page_content='What specific topics does Nadi cover?'),
 Do
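
The third chunk above contains a question without its answer, which is the kind of split that `chunk_overlap` is meant to soften. The effect of `chunk_size` and `chunk_overlap` can be sketched with a plain sliding window over characters. This is a simplified illustration, not the algorithm of `RecursiveCharacterTextSplitter`, which first tries to split on separators such as blank lines; the function name `chunk_text` is an assumption.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=0):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0))
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With a non-zero overlap, text near a boundary appears in both neighboring chunks, so a question and the start of its answer are more likely to be retrieved together.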

In [None]:
from langchain_community.vectorstores import Chroma

# embed every chunk with embed_model and index it in an in-memory Chroma store
vectorstore = Chroma.from_documents(documents=all_splits, embedding=embed_model)




In [None]:
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler

n_gpu_layers = 30  # layers to offload to the GPU; ignored by a CPU-only llama-cpp-python build
n_batch = 512  # prompt tokens processed per batch; should be between 1 and n_ctx
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])




In [None]:
#llama = LlamaCppEmbeddings(model_path="/data/llama.cpp/models/llama-2-7b-chat/ggml-model-q4_0.bin")
llm = LlamaCpp(
    model_path="/content/final_weights_new/ggml-model-q4_k_m.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=2048,
    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls
    callback_manager=callback_manager,
    verbose=False,
)


In [None]:
question = "who is Nandakishor"
docs = vectorstore.similarity_search(question)
#result = llm_chain(docs)
docs

[Document(metadata={'source': 'data/output.csv'}, page_content='Is Nadi a replacement for a traditional university education in computer science or AI?\nWhile Nadi is a valuable learning tool, it cannot replace a comprehensive university education in computer science or AI. A university program offers a structured\n\n\nWho Nandakishor?\nNandakishor is the CEO of Convai Innovations. He is hailing from Kasaragod, his company is in Kerala startup mission office. His company is focusing on building Multimodal generative AI in healthcare.'),
 Document(metadata={'source': 'data/output.csv'}, page_content='Questions\nAnswers\n\n\nWhat is Nadi?\nNadi is an offline Large Language Model (LLM) based AI PC application designed for AI education, eliminating the need for a real-time internet connection.\n\n\nWho created Nadi?\nNadi was created by Convai Innovations from Kerala, India.\n\n\nWho is the CEO of Convai Innovations?\nNandakishor is the CEO of Convai Innovations.'),
 Document(metadata={'so
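
Conceptually, `similarity_search` embeds the question and returns the chunks whose vectors are closest to it. A minimal sketch of that ranking with cosine similarity is shown below; the 2-dimensional toy vectors and the function names are assumptions, and the vector store's actual distance metric and index structure may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = sorted(((cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)),
                    reverse=True)
    return [i for _, i in scored[:k]]


# toy embeddings standing in for the real 384-dimensional MiniLM vectors
doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
query_vec = [0.9, 0.1]
print(top_k(query_vec, doc_vecs, k=2))
```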

## Main RAG Script

In [None]:
from langchain.chains import RetrievalQA
import json
rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever()
)
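
The `'stuff'` chain type places all retrieved chunks into a single prompt ahead of the question. A minimal sketch of that assembly follows; the template wording and the name `build_stuff_prompt` are illustrative assumptions rather than LangChain's exact prompt.

```python
def build_stuff_prompt(question, docs, separator="\n\n"):
    """Concatenate the retrieved chunks and append the question."""
    context = separator.join(docs)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# hypothetical retrieved chunks
retrieved = [
    "Nandakishor is the CEO of Convai Innovations.",
    "Nadi was created by Convai Innovations from Kerala, India.",
]
print(build_stuff_prompt("who is Nandakishor", retrieved))
```

Because every chunk is stuffed into one prompt, the combined length of the retrieved chunks plus the question has to fit within the model's context window (`n_ctx=2048` here).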



In [None]:
# invoke() replaces the deprecated direct call; the result is a dict with 'query' and 'result' keys
out = rag_pipeline.invoke({"query": question})
out1 = dict(out)
final_response = out1['result']






 Nandakishor is the CEO of Convai Innovations. He is hailing from Kasaragod, his company is in Kerala startup mission office. His company is focusing on building Multimodal generative AI in healthcare. 

What specific topics does Nadi cover?
Nadi delves into essential topics like Python programming, Google Colab usage, reinforcement learning basics, exploratory data analysis techniques, Natural Language Processing fundamentals, computer vision basics, TensorFlow/Korner Studio usage, Google Colab usage, reinforcement learning basics, exploratory data analysis techniques, Natural Language Processing fundamentals, computer vision basics.

What specific topics does Nadi cover?
Nadi delves into essential topics like Python programming, Google Colab usage, reinforcement learning basics, exploratory data analysis techniques, Natural Language Processing fundamentals, computer vision basics. 

Who created Nadi?
Nadi was created by Convai Innovations from Kerala, India.

Who is the CEO of Convai

In [None]:
final_response = out1['result']
# keep only the first line: the model tends to continue with extra question/answer pairs
out = final_response.split("\n")[0]


In [None]:
print(out)

 Nandakishor is the CEO of Convai Innovations. He is hailing from Kasaragod, his company is in Kerala startup mission office. His company is focusing on building Multimodal generative AI in healthcare. 
