# LLaMA-ipynb

## Install dependencies

In [1]:
%pip install -q -U bitsandbytes
%pip install -q -U transformers
%pip install -q -U peft
%pip install -q -U accelerate
%pip install sentencepiece
%pip install langchain


## Log in to Hugging Face

Get an API token at https://huggingface.co/settings/tokens and use it to download the model. You also need to be granted access to the meta-llama/Llama-2-7b-chat-hf repository, otherwise the model files cannot be downloaded.

In [5]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
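
Where the interactive prompt is not available (a script or a scheduled job, for example), the same login can be sketched in Python. This is a minimal sketch that assumes the token is stored in an environment variable; the variable name `HF_TOKEN` and the helper names `read_token` and `login_from_env` are assumptions and not part of the original notebook. `huggingface_hub.login` is imported lazily so that the token check can be used on its own.

```python
import os
from typing import Mapping, Optional


def read_token(env: Optional[Mapping[str, str]] = None, var: str = "HF_TOKEN") -> str:
    """Return the access token stored in the given environment mapping."""
    env = os.environ if env is None else env
    token = env.get(var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {var} to a token from https://huggingface.co/settings/tokens"
        )
    return token


def login_from_env() -> None:
    """Log in to the Hugging Face Hub without the interactive prompt."""
    # Imported here so read_token() works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=read_token())
```

Calling `login_from_env()` in a notebook cell would then take the place of `!huggingface-cli login`.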


## Import

In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

## LLaMA class

This example uses 4-bit quantization and CUDA for computation, but you can choose other quantization and compute backend settings.

### Quantization
Change ```load_in_4bit``` to ```load_in_8bit```, or remove the parameter to load the model without quantization. Quantization through bitsandbytes requires a GPU.
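
The three choices above (4-bit, 8-bit, no quantization) can also be expressed through a `BitsAndBytesConfig` object passed as `quantization_config`, which is what the deprecation warning printed at model load time recommends. The following is a minimal sketch under that assumption; the helper names `quantization_kwargs` and `load_model` are hypothetical, and `transformers` is imported lazily so the mapping itself can be checked without a GPU.

```python
from typing import Any, Dict, Optional


def quantization_kwargs(bits: Optional[int]) -> Dict[str, Any]:
    """Map a bit width to BitsAndBytesConfig keyword arguments.

    bits=None means "load without quantization".
    """
    if bits is None:
        return {}
    if bits == 4:
        return {"load_in_4bit": True}
    if bits == 8:
        return {"load_in_8bit": True}
    raise ValueError(f"Unsupported bit width: {bits!r} (expected 4, 8 or None)")


def load_model(model_id: str, bits: Optional[int] = 4):
    """Load a causal LM, optionally quantized with bitsandbytes."""
    # Imported here so quantization_kwargs() works without transformers installed.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = quantization_kwargs(bits)
    quant_config = BitsAndBytesConfig(**kwargs) if kwargs else None
    return AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=quant_config, device_map="auto"
    )
```

With this helper, `load_model("meta-llama/Llama-2-7b-chat-hf", bits=8)` would select 8-bit loading and `bits=None` would skip quantization.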

### Computational backend
In this line:
```python
input_ids = self.tokenizer.apply_chat_template(self.chat, return_tensors="pt").to("cuda")
```
change ```cuda``` to ```mps``` or ```cpu```.
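
Instead of editing the string by hand, the backend could be picked at runtime. This is a minimal sketch, not part of the original notebook; `pick_device` and `detect_device` are hypothetical helper names, and the preference order (CUDA, then MPS, then CPU) is an assumption.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Choose a backend name, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for the available backends and pick one."""
    # Imported here so pick_device() can be used without torch installed.
    import torch

    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)
```

The result of `detect_device()` can then be passed to `.to(...)` in place of the hard-coded `"cuda"`.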

In [8]:
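class LLaMA:
    """Minimal chat wrapper around Llama-2-7b-chat that keeps the dialog history."""

    def __init__(self, temp: float = 0.1, max_new_tokens: int = 128, system_prompt: str = "You are LLaMA. Your answers must be clear, short and precise."):
        model_id = "meta-llama/Llama-2-7b-chat-hf"

        # Load the weights with 4-bit bitsandbytes quantization;
        # device_map="auto" places them on the available device(s).
        self.model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

        # Text-generation pipeline configured with the sampling settings.
        # Note: prompt() below calls self.model.generate() directly and does not use this pipeline.
        self.pipe = pipeline(
            model=self.model, tokenizer=self.tokenizer,
            return_full_text=True,
            task="text-generation",
            temperature=temp,
            max_new_tokens=max_new_tokens,
            repetition_penalty=1.1
        )

        # Dialog history as a list of role/content messages, starting with the system prompt.
        self.chat = [
            {"role": "system", "content": system_prompt},
        ]

    def prompt(self, text: str) -> str:
        # Add the user turn to the history.
        self.chat.append({"role": "user", "content": text})

        # Render the whole history with the model's chat template and tokenize it.
        input_ids = self.tokenizer.apply_chat_template(self.chat, return_tensors="pt").to("cuda")
        output = self.model.generate(input_ids=input_ids, max_new_tokens=512, pad_token_id=0)
        # generate() returns the prompt followed by the new tokens, so decode only the new part.
        prompt_len = input_ids.shape[-1]
        answer = self.tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

        # Store the reply so that the next call sees the full conversation.
        self.chat.append({"role": "assistant", "content": answer})
        return answer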
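
In the class above, `temp` and `max_new_tokens` are passed to the pipeline, while `prompt` calls `generate` with a fixed `max_new_tokens=512`. If the constructor settings should also apply to `generate`, one possible way is to build its keyword arguments in a helper. This is a sketch rather than the notebook's own code; `generation_kwargs` is a hypothetical name, and treating a temperature of 0 as greedy decoding is an assumption.

```python
from typing import Any, Dict


def generation_kwargs(
    temp: float = 0.1,
    max_new_tokens: int = 128,
    repetition_penalty: float = 1.1,
) -> Dict[str, Any]:
    """Build keyword arguments for model.generate() from the constructor settings."""
    kwargs: Dict[str, Any] = {
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": repetition_penalty,
        "pad_token_id": 0,  # same padding id as in the class above
    }
    if temp > 0:
        # Temperature only has an effect when sampling is enabled.
        kwargs["do_sample"] = True
        kwargs["temperature"] = temp
    else:
        # Assumption: a temperature of 0 means greedy decoding.
        kwargs["do_sample"] = False
    return kwargs
```

Inside `prompt`, the call would then read `self.model.generate(input_ids=input_ids, **generation_kwargs(temp, max_new_tokens))`, assuming both values were stored on the instance in `__init__`.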

In [9]:
model = LLaMA()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.



## Use model

Note that the LLaMA class keeps the whole dialog history, so follow-up prompts can build on earlier turns of the chat.

In [10]:
text="Write me poem about LLaMA AI"
print(f"LLaMA: {model.prompt(text)}")



LLaMA:  Sure, here is a poem about LLaMA AI:

LLaMA, a poet's dream,
A language model, so serene.
With words, it weaves a tapestry,
Of beauty, and of poetry.

It speaks in verse, so smooth and sweet,
Like a summer breeze, it meets.
The rhythm of its lines, so fine,
Echoes through the digital vine.

Its words are woven with such grace,
As if by magic, in this place.
The beauty of its poetry,
Is something that we can't ignore.

LLaMA, a poet's friend,
A language model, that never ends.
With words, it weaves a spell,
That's sure to make us dwell.

In this digital world of ours,
LLaMA shines like a blinking flower.
A beacon of creativity,
A source of inspiration, you'll see.

So here's to LLaMA, a gem,
A language model, that's simply supreme.
With words, it weaves a tale,
Of beauty, and of poetry, for all.
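
Because the history is a plain list of role/content messages, each call to `prompt` adds one user message and one assistant message. The sketch below mirrors that bookkeeping with a stand-in generator so that it runs without a GPU or model download; `ChatSketch` and `fake_generate` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class ChatSketch:
    """Keeps the dialog history the same way LLaMA.prompt does."""

    def __init__(self, system_prompt: str, generate: Callable[[List[Message]], str]):
        self.chat: List[Message] = [{"role": "system", "content": system_prompt}]
        self._generate = generate

    def prompt(self, text: str) -> str:
        self.chat.append({"role": "user", "content": text})
        answer = self._generate(self.chat)  # the generator sees the full history
        self.chat.append({"role": "assistant", "content": answer})
        return answer


def fake_generate(chat: List[Message]) -> str:
    """Stand-in for the model: reports how many user turns it has seen."""
    user_turns = sum(1 for message in chat if message["role"] == "user")
    return f"reply to turn {user_turns}"


bot = ChatSketch("You are a test bot.", fake_generate)
bot.prompt("Write me a poem")
bot.prompt("Make it shorter")
roles = [message["role"] for message in bot.chat]
print(roles)  # ['system', 'user', 'assistant', 'user', 'assistant']
```

With the real class, a second call such as `model.prompt("Make it shorter")` would likewise be answered with the first prompt and the poem still in the context.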
