# Deep Learning - Exercise 12

The aim of this lecture is to give an overview of what is possible with large language models (LLMs)

![meme01](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/llm_meme_02.jpg?raw=true)

# There are many LLM-based online solutions available nowadays
* We will use a model available through the Hugging Face `transformers` library and self-host it on our own server
* 💡 There are many different models, one of the most popular is [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
    * The models are usually quite comparable, and huge differences are rare
    * It is easy to switch between models because Hugging Face wraps them in a (more or less) unified API
        * At the time of writing, Mixtral-8x7B is the most *hyped model* as it outperforms Llama 2 70B on [most benchmarks](https://mistral.ai/news/mixtral-of-experts/)

### Pros:
* **Greater Control:** Users would have more control over the deployment and utilization of the language model, allowing for customization based on specific needs
* **Privacy and Security:** Users might have increased confidence in the security and privacy of their data since the language model would be hosted on their own servers
* **Reduced Latency:** Local hosting could lead to lower latency, as requests and responses wouldn't need to travel over the internet

### Cons:
* **Resource Intensiveness:** Large language models can be computationally expensive. Self-hosting might require substantial computational resources, including powerful servers and significant amounts of memory
* **Scalability Issues:** Managing scalability for widespread use could be challenging for individual users or smaller organizations. In contrast, hosted services such as OpenAI's run on infrastructure designed to handle large-scale demand
* **Maintenance and Updates:** Regular updates and maintenance are crucial for the performance and security of language models. Self-hosting would necessitate users to actively manage updates and patches

## We need to install the following packages and load the model into GPU
* 💡 Beware that this requires a lot of memory, so you might need a machine with several powerful GPUs
    * I tested it on 4x RTX 3090 24GB GPUs
        * 💡 It still required quantization to fit into the VRAM

In [None]:
# !pip install transformers accelerate bitsandbytes

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

## Each model has its own tokenizer and configuration

* A tokenizer is needed to convert raw text into tokens, which are the basic units of input for a language model
* Tokenization is an important step in natural language processing tasks because it breaks down text into smaller, meaningful units that can be processed by the model
* The tokenizer also handles special tokens, such as padding tokens, start-of-sentence tokens, and end-of-sentence tokens, which are necessary for proper model input formatting
* Different models may have different tokenization methods and vocabulary, so each model typically has its own tokenizer
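
To make the points above concrete, here is a minimal sketch of a toy word-level tokenizer. It is not how the Mixtral tokenizer works (real LLM tokenizers use learned subword vocabularies); the `ToyTokenizer` class and its tiny corpus are made up purely for illustration of the text → token ids → text round trip and of special tokens.

```python
# Toy word-level tokenizer; real LLM tokenizers use learned subword vocabularies.
SPECIAL_TOKENS = ["<pad>", "<s>", "</s>", "<unk>"]


class ToyTokenizer:
    def __init__(self, corpus):
        # The vocabulary is the special tokens followed by every known word
        words = sorted({w for text in corpus for w in text.lower().split()})
        self.id_to_token = SPECIAL_TOKENS + words
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text, max_length=None):
        unk = self.token_to_id["<unk>"]
        # Wrap the sequence in start-of-sentence and end-of-sentence tokens
        ids = [self.token_to_id["<s>"]]
        ids += [self.token_to_id.get(w, unk) for w in text.lower().split()]
        ids.append(self.token_to_id["</s>"])
        # Optionally pad to a fixed length
        if max_length is not None:
            ids += [self.token_to_id["<pad>"]] * (max_length - len(ids))
        return ids

    def decode(self, ids, skip_special_tokens=True):
        tokens = [self.id_to_token[i] for i in ids]
        if skip_special_tokens:
            tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
        return " ".join(tokens)


toy = ToyTokenizer(["the model reads tokens", "tokens are numbers"])
ids = toy.encode("the model reads numbers", max_length=8)
print(ids)              # [1, 8, 5, 7, 6, 2, 0, 0]
print(toy.decode(ids))  # the model reads numbers
```

A tokenizer built from a different corpus would assign different ids to the same words, which is why a model must always be paired with the tokenizer it was trained with.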


In [None]:
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

### 📌 The `bnb_config` is a configuration object that is used to specify certain settings for the `BitsAndBytes` module

* `load_in_4bit=True`: This parameter indicates that the model weights will be loaded in 4-bit format, which means that each weight value is represented using only 4 bits of memory instead of the 16 or 32 bits of a standard floating-point format
    * This helps to reduce the memory footprint of the model

* `bnb_4bit_use_double_quant=True`: This parameter enables double quantization. 4-bit quantization stores one scaling constant per block of weights, and double quantization quantizes these constants a second time (the 4-bit weights themselves are left unchanged)
    * This reduces the memory usage even further, by roughly 0.4 bits per parameter

* `bnb_4bit_quant_type="nf4"`: This parameter specifies the type of quantization to be used for the 4-bit weights. In this case, "nf4" stands for "4-bit NormalFloat", a data type whose quantization levels are not evenly spaced but placed to match normally distributed values, which is how neural network weights are typically distributed
    * This allows for more efficient representation of the weights

* `bnb_4bit_compute_dtype=torch.bfloat16`: This parameter specifies the data type to be used for computations involving the 4-bit weights
    * In this case, `torch.bfloat16` is used, which is a 16-bit floating-point format that provides a good balance between precision and memory usage
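
To get a feeling for why quantization is needed here, the following back-of-the-envelope sketch estimates the size of the weights alone at different precisions. It assumes roughly 46.7 billion total parameters, the figure Mistral AI reports for Mixtral-8x7B, and ignores activations, the KV cache, and the small overhead of the quantization constants, so treat the numbers as rough lower bounds.

```python
# Rough estimate of the memory needed for the model weights alone.
# Assumption: ~46.7e9 total parameters (figure reported by Mistral AI for Mixtral-8x7B).
N_PARAMS = 46.7e9
TOTAL_VRAM_GB = 4 * 24  # the 4x RTX 3090 24GB setup mentioned above


def weights_gb(n_params, bits_per_param):
    """Size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    size = weights_gb(N_PARAMS, bits)
    print(f"{name:>8}: {size:6.1f} GB of {TOTAL_VRAM_GB} GB available")
```

Under these assumptions the 16-bit weights alone (about 93 GB) would nearly fill the 96 GB of VRAM and leave almost no room for anything else, while the 4-bit weights take about 23 GB.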

### 💡 By using this configuration, the model can achieve significant memory savings

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
    )

## Now we can load the model with the BitsAndBytesConfig

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

## LLMs usually need specific input prompt formatting
The `format_prompt` function formats a conversation history into the structure the model expects. It takes two parameters:

1. `message`: This is the latest user message that needs to be added to the conversation history

2. `history`: This is a list of tuples, where each tuple represents a previous interaction in the conversation

* 💡 Each tuple contains two elements: the user's message and the bot's response

The function starts by initializing the string `prompt` with **"\<s\>"**, the beginning-of-sequence token that marks the start of the context.
* The closing **"\</s\>"** token is the end-of-sequence token. It is appended after every bot response to mark the end of that turn

* Then, for each user message and bot response in the history, it appends to `prompt` the user message enclosed within **[INST]** and **[/INST]** tags, followed by the bot response and the closing **"\</s\>"** token

* After going through all the history, it appends the latest user message (the `message` parameter) to `prompt`, again enclosed within **[INST]** and **[/INST]** tags.

* Finally, the function returns the fully formatted `prompt` string. 

In [None]:
def format_prompt(message, history):
  prompt = "<s>"
  for user_prompt, bot_response in history:
    prompt += f"[INST] {user_prompt} [/INST]"
    prompt += f" {bot_response}</s> "
  prompt += f"[INST] {message} [/INST]"
  return prompt
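
As a quick check of the format, the sketch below calls the function on a one-turn history. The function body is repeated only so that the snippet runs on its own, and the example messages are made up for illustration.

```python
def format_prompt(message, history):
    # Same logic as the function above, repeated so the snippet is self-contained
    prompt = "<s>"
    for user_prompt, bot_response in history:
        prompt += f"[INST] {user_prompt} [/INST]"
        prompt += f" {bot_response}</s> "
    prompt += f"[INST] {message} [/INST]"
    return prompt


# Hypothetical one-turn history, made up for illustration
example_history = [("Hi", "Hello!")]
example_prompt = format_prompt("How are you?", example_history)
print(example_prompt)
# <s>[INST] Hi [/INST] Hello!</s> [INST] How are you? [/INST]
```

With an empty history the result is simply the `<s>` token followed by the new message in `[INST]` tags.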

## Init the history list, i.e. the conversation context

In [None]:
history = []

The Python function `chat` generates a response from the chat model to a given user message. It takes two parameters:

1. `message`: This is the user's message that the chat model needs to respond to

2. `max_new_tokens`: This is the maximum number of tokens that the model is allowed to generate for its response. The default value is 256


* In summary, this function takes a user's message, formats it together with the conversation history, feeds it to the chat model, stores the exchange in `history`, and prints the model's response

In [None]:
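def chat(message, max_new_tokens=256):
    global history
    formatted = format_prompt(message, history)
    # The prompt already starts with <s>, so the tokenizer must not add a second BOS token
    inputs = tokenizer(formatted, return_tensors="pt", add_special_tokens=False).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, temperature=0.9, top_p=0.95, repetition_penalty=1.0, pad_token_id=tokenizer.eos_token_id, do_sample=True)
    # generate() returns the prompt followed by the completion; keep only the new tokens
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    text_output = tokenizer.decode(new_tokens, skip_special_tokens=True)
    history.append((message, text_output))
    print(text_output)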

![meme02](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/llm_meme_01.jpg?raw=true)

In [None]:
prompt = "Can you briefly explain how the QuickSort algorithm works and provide a Python implementation?"

#### 💡 `%%time` is a Jupyter Notebook magic command to measure the time of execution of a cell

In [None]:
%%time
chat(prompt, max_new_tokens=512)

## When the response is created, you can continue the conversation by running the cell below
* You can simply use the `continue` prompt to continue the conversation

In [None]:
chat('continue', max_new_tokens=512)

## 💡 Bonus: The output is in *Markdown*, so you can create a parser and render it in a more convenient format
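
One possible starting point for such a parser is sketched below: it pulls the fenced code blocks out of a Markdown string using only the standard library. The `extract_code_blocks` helper and the sample text are hypothetical and handle only simple, well-formed fences.

```python
import re

# Built programmatically so that this snippet can itself live inside Markdown
FENCE = "`" * 3


def extract_code_blocks(markdown_text):
    """Return (language, code) pairs for every fenced block in the text."""
    pattern = re.compile(FENCE + r"(\w*)\n(.*?)" + FENCE, re.DOTALL)
    return [
        (lang or "text", code.rstrip("\n"))
        for lang, code in pattern.findall(markdown_text)
    ]


# Hypothetical model output, shortened for illustration
sample_output = (
    "Here is a Python implementation:\n\n"
    + FENCE + "python\n"
    + "print('hello')\n"
    + FENCE + "\n"
    + "Hope this helps."
)

for lang, code in extract_code_blocks(sample_output):
    print(f"[{lang}]")
    print(code)
```

Inside a Jupyter notebook, a simpler route is to let the notebook render the text itself, for example with `display(Markdown(text))` from `IPython.display`.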

#### Can you briefly explain how the QuickSort algorithm works and provide a Python implementation?

QuickSort is a divide-and-conquer algorithm used for sorting. It works by selecting a "pivot" element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot. The sub-arrays are then recursively sorted. This process continues until the base case is reached, which is when the array has only one or zero elements.

Here is a Python implementation of QuickSort:

```python
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# Example usage:
arr = [3,6,8,10,1,2,1]
print(quicksort(arr)) # Output: [1, 1, 2, 3, 6, 8, 10]
```
In the above example, the `quicksort` function takes an array as input and returns a new sorted array. It first checks if the array has one or zero elements and returns it as is if that's the case. If not, it selects a pivot element (in this case, the middle element) and creates three lists: one for elements less than the pivot, one for elements equal to the pivot, and one for elements greater than the pivot. These three lists are then recursively sorted and concatenated to produce the final sorted array.

It's worth noting that there are many ways to select the pivot and partition the array. The approach used in this example is called the "Lomuto partition scheme" and is a simple and intuitive way to implement QuickSort. However, it has a worst-case time complexity of O(n^2) when the input array is already sorted or contains many duplicate elements. More advanced partition schemes, such as the "Hoare partition scheme," can improve the worst-case time complexity to O(n log n).  

#### continue

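Sure! As I mentioned earlier, there are many ways to select the pivot and partition the array in QuickSort. The Lomuto partition scheme, which is the simplest and most intuitive way, is easy to understand and implement but has a worst-case time complexity of O(n^2) in certain scenarios.

* 💡 Read the model's answer critically: the implementation above builds new lists with list comprehensions, so it is not the in-place Lomuto partition scheme, and the Hoare partition scheme does not change QuickSort's worst-case time complexity, which remains O(n^2)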


# Do you want to try other LLMs easily or even compare them? 
## Check out the https://chat.lmsys.org/ tool!

![meme03](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/thats_all.jpg?raw=true)