# LLaMA-ipynb

## Install dependencies

In [None]:
%pip install -q -U bitsandbytes
%pip install -q -U transformers
%pip install -q -U peft
%pip install -q -U accelerate
%pip install sentencepiece
%pip install langchain

## Log in to Hugging Face

Get an API token at https://huggingface.co/settings/tokens and use it to load the model. You also need to be granted access to the meta-llama/Llama-2-7b-chat-hf repository so that you can download the model.

In [None]:
!huggingface-cli login

## Import

In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

## LLaMA class

This example uses 4-bit quantization and CUDA for computation, but you can choose a different quantization setting and computational backend.

### Quantization
Change ```load_in_4bit``` to ```load_in_8bit```, or remove this parameter to load the model without quantization. Quantization through bitsandbytes requires a GPU.
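
The three options above can be sketched as a small helper that builds the extra keyword arguments for ```from_pretrained```. The name ```quantization_kwargs``` is hypothetical and not part of transformers; only ```load_in_4bit``` and ```load_in_8bit``` come from this notebook.

```python
def quantization_kwargs(bits=None):
    """Return extra keyword arguments for from_pretrained (hypothetical helper)."""
    if bits == 4:
        return {"load_in_4bit": True}
    if bits == 8:
        return {"load_in_8bit": True}
    if bits is None:
        return {}  # no quantization parameter at all
    raise ValueError(f"unsupported quantization: {bits!r}")


# Intended use inside the class (not executed here):
# AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", **quantization_kwargs(8))
kwargs = quantization_kwargs(8)
```

Passing ```None``` corresponds to removing the parameter, so the same call works for all three cases.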

### Computational backend
In this line:
```python
input_ids = self.tokenizer.apply_chat_template(self.chat, return_tensors="pt").to("cuda")
```
change ```cuda``` to ```mps``` or ```cpu```.
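
Instead of editing the string by hand, the backend could also be picked at runtime. This is only a sketch: ```pick_device``` is a hypothetical helper name, and the ```mps``` check assumes a PyTorch release that provides ```torch.backends.mps```. Since quantization through bitsandbytes requires a GPU, the quantization parameter may have to be removed when the result is not ```cuda```.

```python
def pick_device() -> str:
    """Return "cuda", "mps" or "cpu", whichever is found first (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return "cpu"  # lets the snippet run standalone, without PyTorch installed
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch releases
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
# In the class, the result would replace the literal string:
# ...apply_chat_template(self.chat, return_tensors="pt").to(device)
```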

In [8]:
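class LLaMA:
    def __init__(self, temp: float = 0.1, max_new_tokens: int = 128, system_prompt: str = "You are LLaMA. Your answers must be clear, short and precise."):
        model_id = "meta-llama/Llama-2-7b-chat-hf"

        # Load the model with 4-bit quantization; device_map="auto" places it on the available device.
        self.model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

        # Keep the generation settings so that prompt() applies them.
        self.temp = temp
        self.max_new_tokens = max_new_tokens

        # Text-generation pipeline with the same settings; prompt() below calls generate() directly.
        self.pipe = pipeline(
            model=self.model, tokenizer=self.tokenizer,
            return_full_text=True,
            task="text-generation",
            temperature=temp,
            max_new_tokens=max_new_tokens,
            repetition_penalty=1.1
        )

        # The dialog history starts with the system prompt.
        self.chat = [
            {"role": "system", "content": system_prompt},
        ]

    def prompt(self, text: str) -> str:
        self.chat.append({"role": "user", "content": text})

        # Tokenize the whole dialog with the model's chat template.
        input_ids = self.tokenizer.apply_chat_template(self.chat, return_tensors="pt").to("cuda")
        output = self.model.generate(
            input_ids=input_ids,
            do_sample=True,
            temperature=self.temp,
            max_new_tokens=self.max_new_tokens,
            pad_token_id=0,
        )
        # Decode only the newly generated tokens, skipping the prompt.
        prompt_len = input_ids.shape[-1]
        answer = self.tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

        # Store the answer so that the next call continues the same dialog.
        self.chat.append({"role": "assistant", "content": answer})
        return answer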

In [None]:
model = LLaMA()

## Use model

Note that the LLaMA class keeps the whole dialog in ```self.chat```, so consecutive calls to ```prompt``` continue the same conversation.

In [None]:
text = "Write me a poem about LLaMA AI"
print(f"LLaMA: {model.prompt(text)}")
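
How the history grows over several turns can be illustrated without loading the model. The sketch below uses a stand-in class, ```EchoChat``` (hypothetical, not part of this notebook), that mimics only the bookkeeping of ```LLaMA.prompt``` and replaces generation with a fixed reply.

```python
class EchoChat:
    """Stand-in for the LLaMA class: same history handling, no model (hypothetical)."""

    def __init__(self, system_prompt: str = "You are LLaMA."):
        self.chat = [{"role": "system", "content": system_prompt}]

    def prompt(self, text: str) -> str:
        self.chat.append({"role": "user", "content": text})
        # A real model would generate the answer from the whole self.chat here.
        answer = f"(reply to: {text})"
        self.chat.append({"role": "assistant", "content": answer})
        return answer


bot = EchoChat()
bot.prompt("Write me a poem about LLaMA AI")
bot.prompt("Make it shorter")  # the second turn is appended to the same history

roles = [message["role"] for message in bot.chat]
print(roles)  # ['system', 'user', 'assistant', 'user', 'assistant']
```

With the real class the same applies: a follow-up call to ```model.prompt``` is answered in the context of the earlier turns.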