<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/developing-kaggle-notebooks/10-GenAI/02_quantized_model_cpp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'llama-2/pytorch/7b-chat-hf/1:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-models-data%2F3093%2F4298%2Fbundle%2Farchive.tar.gz%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240130%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240130T075110Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D3254183c64a2d49515910c6c10a9ee9d46701e52c85e30aa3a340e33c2ae5d454eed566d16c26fa37b12007d8759b7a3bf58ac3e5445eb12a8244d9b697c5a777961beb5fc2d925c4b6b174cf4959871e86e2b2003ac7e2cfc672b06852a6509fd317e2906db8d1ffe1452bacdfdd152ffea0eac7c761c55f996e85521d26e22fbbdd0072ac3a4381d27dac7c3c946e02af9a261c27db79453adacbd0e0289ebe969bef2c888468cdb7cb87854d91ed4d8acd20e4453c93af90d93a6f82ad44465fcb13a9291c475b25471286da8b246afcc30a5d4c695b1d3f54d640fbe1a7d1a32623a3d0406a2f92f76d180ae4f61a64de81561e98a244ae55f7ba470b772'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              # Use a distinct name so the tarfile module is not shadowed
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


Downloading llama-2/pytorch/7b-chat-hf/1, 20836388871 bytes compressed
Downloaded and uncompressed: llama-2/pytorch/7b-chat-hf/1
Data source import complete.


# Introduction

In other notebooks we demonstrated how to use the **Llama 2** model for various tasks: testing it on math problems, building a sequential task chain (where the output of one task is passed as a parameter to the input of the next), and creating a Retrieval Augmented Generation system with Llama 2 as the LLM, ChromaDB as the vector database, and LangChain as the task-chaining framework.

In this notebook we will experiment with llama.cpp. This library helps us run Llama and other models on lower-performance (consumer) hardware. Its tooling converts the Llama model to the GGUF format and can quantize the weights in the process.
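To give a feel for what 8-bit quantization means, here is a minimal conceptual sketch. It is not llama.cpp's actual code: the function names are hypothetical, and it assumes a scheme in which a small block of weights shares one scale derived from the block's largest absolute value.

```python
def quantize_block(values):
    """Map a block of floats to int8-range integers plus one shared scale (illustrative only)."""
    amax = max(abs(v) for v in values)
    # Assumption: the largest magnitude in the block is mapped to 127.
    scale = amax / 127.0 if amax > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate floats from the integers and the shared scale."""
    return [scale * q for q in quantized]


block = [0.5, -1.0, 0.25, 0.0]  # made-up weights, for illustration only
scale, quantized = quantize_block(block)
restored = dequantize_block(scale, quantized)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(quantized, max_error)
```

Each weight is stored as a small integer, and the rounding error per weight stays within half of the block's scale. That is the trade-off quantization makes: less memory in exchange for a small loss of precision.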

# Installation


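We start by installing the `llama-cpp-python` bindings (built with cuBLAS support) and cloning the llama.cpp repository, which provides the conversion script.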

In [None]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
!git clone https://github.com/ggerganov/llama.cpp.git

In [None]:
!pip install sentencepiece

Then we convert our model to the GGUF format used by **llama.cpp**, quantizing the weights to 8 bits (`q8_0`).

In [None]:
!python llama.cpp/convert.py /kaggle/input/llama-2/pytorch/7b-chat-hf/1 \
  --outfile llama-7b.gguf \
  --outtype q8_0

```log
model.layers.31.self_attn.v_proj.weight          -> blk.31.attn_v.weight                     | F16    | [4096, 4096]
model.norm.weight                                -> output_norm.weight                       | F16    | [4096]
Writing llama-7b.gguf, format 7
Traceback (most recent call last):
  File "/content/llama.cpp/convert.py", line 1474, in <module>
    main()
  File "/content/llama.cpp/convert.py", line 1468, in main
    OutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab,
  File "/content/llama.cpp/convert.py", line 1113, in write_all
    check_vocab_size(params, vocab, pad_vocab=pad_vocab)
  File "/content/llama.cpp/convert.py", line 959, in check_vocab_size
    raise Exception(msg)
Exception: Vocab size mismatch (model has 32000, but /kaggle/input/llama-2/pytorch/7b-chat-hf/1/tokenizer.model has 32001).
```
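The log above ends with a vocabulary size mismatch: the model reports 32000 entries while the tokenizer side reports 32001. A minimal sketch for inspecting the checkpoint folder might look like the following. It assumes the usual Hugging Face layout (a `config.json` with a `vocab_size` field and an optional `added_tokens.json`); the helper name `inspect_vocab_files` is hypothetical. Whether an extra added token is the cause here is an assumption that the log does not confirm.

```python
import json
from pathlib import Path


def inspect_vocab_files(model_dir):
    """Collect vocabulary-related facts from a Hugging Face style checkpoint folder."""
    model_dir = Path(model_dir)
    report = {"config_vocab_size": None, "added_tokens": {}}

    # Vocabulary size declared by the model configuration, if present.
    config_path = model_dir / "config.json"
    if config_path.exists():
        config = json.loads(config_path.read_text())
        report["config_vocab_size"] = config.get("vocab_size")

    # Extra tokens declared on top of the base tokenizer, if present.
    added_path = model_dir / "added_tokens.json"
    if added_path.exists():
        report["added_tokens"] = json.loads(added_path.read_text())

    return report


# Path taken from the conversion command above.
report = inspect_vocab_files("/kaggle/input/llama-2/pytorch/7b-chat-hf/1")
print(report)
```

If the report lists added tokens while `config_vocab_size` is 32000, that would be consistent with the mismatch in the traceback and gives a starting point for further investigation.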

# Import packages

In [2]:
from llama_cpp import Llama

# Test the model


Let's quickly test the model. First, we initialize it from the converted GGUF file.

In [None]:
llm = Llama(model_path="/kaggle/working/llama-7b.gguf")

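Let's ask a question. We cap the response at 38 tokens, stop at a new question or a newline, and echo the prompt in the output.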

In [None]:
output = llm("Q: Name three capital cities in Europe? A: ", max_tokens=38, stop=["Q:", "\n"], echo=True)

And now let's see the output.

In [None]:
output
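The returned object is a dictionary whose generated text sits under `choices[0]["text"]`. Because the call above uses `echo=True`, that text starts with the prompt. A small helper for pulling out just the answer could look like this sketch. The name `completion_text` is hypothetical, and the example dictionary is a hand-written placeholder rather than real model output.

```python
def completion_text(output, prompt=None):
    """Return the generated text of the first choice, without the echoed prompt."""
    text = output["choices"][0]["text"]
    # With echo=True the prompt is prepended to the completion, so strip it if present.
    if prompt is not None and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Hypothetical response used only to illustrate the structure; not real model output.
prompt = "Q: Name three capital cities in Europe? A: "
example_output = {
    "choices": [
        {"text": prompt + "<generated answer>", "index": 0, "finish_reason": "stop"}
    ],
}
print(completion_text(example_output, prompt))
```

With a real response, the call would be `completion_text(output, "Q: Name three capital cities in Europe? A: ")`.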

Next, let's run a math question.

In [None]:
output = llm("If a circle has the radius 3, what is its area?")

Let's check the answer.

In [None]:
print(output['choices'][0]['text'])
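To judge the model's answer, it helps to have the reference value at hand. The area of a circle is $A = \pi r^2$, so for a radius of 3 the expected result is $9\pi$. The short check below computes it and is independent of the model.

```python
import math

radius = 3
area = math.pi * radius ** 2  # A = pi * r^2
print(f"Expected area: {area:.2f}")
```

An answer close to 28.27 (or expressed as 9π) can be considered correct.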