# Running a Local Model

This notebook shows how to run a local model on your computer.

We will be using llama.cpp, through the `llama-cpp-python` bindings.

References:

- https://docs.llamaindex.ai/en/stable/getting_started/starter_example.html
- https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/llm/llama_2_llama_cpp.ipynb
- https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html
- https://colab.research.google.com/drive/1UoPcoiA5EOBghxWKWduQhChliMHxla7U?usp=sharing#scrollTo=V1DwVkHCNDgT

## Step-1: Installing llama-cpp-python

See the detailed install guide here: https://github.com/abetlen/llama-cpp-python

### Quick install for CPU 


```bash
# adjust this to match your environment name
conda activate atlas-1

pip install llama-cpp-python
```

### Quick install for GPU

For this you need an Nvidia GPU with the appropriate CUDA drivers installed. How to install these varies from OS to OS and is beyond the scope of this guide, so please make sure this is done before you attempt to use the GPU.

```bash
conda activate atlas-1

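# NOTE: newer llama-cpp-python releases have renamed this CMake flag (to -DGGML_CUDA=on);
# check the install guide linked above if the build does not pick up CUDA
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python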

# if you need to force a reinstall
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir llama-cpp-python
```
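
To confirm that the install landed in the active environment, a quick check like the following may help. It is a minimal sketch that relies only on the standard library; `installed_version` is a hypothetical helper name, not part of llama-cpp-python.

```python
from importlib import metadata


def installed_version(package_name: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version if the package is installed in this environment
print("llama-cpp-python:", installed_version("llama-cpp-python") or "not installed")
```

Note that this only confirms the package is present; it does not tell you whether it was built with GPU support.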

## Step-2: How Do I Know if the Model Is Using the GPU?

To turn off GPU usage completely, set this variable before initializing the model:

```python
import os

# hide all CUDA devices from this process
os.environ["CUDA_VISIBLE_DEVICES"] = ""
```

The checks below help confirm whether a GPU is being used.

Keep in mind: even if the code below reports that a GPU is available, the model won't use it unless llama-cpp-python was built with GPU support.

So double-check and verify!

### 2.1 - Check the logs when the model is initialized

You can check the output printed when the model is initialized. If you see something like this, the model is using the GPU:

```text
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
```
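
A small helper can scan captured initialization output for that line. This is only a sketch: `cuda_devices_in_log` is a hypothetical name, and it assumes the log wording matches the sample above, which may differ between llama.cpp versions.

```python
import re


def cuda_devices_in_log(log_text: str) -> int:
    """Return the CUDA device count reported in the init log, or 0 if no such line is found."""
    match = re.search(r"found (\d+) CUDA devices", log_text)
    return int(match.group(1)) if match else 0


# Sample text copied from the log excerpt above
sample_log = """ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
"""
print(cuda_devices_in_log(sample_log))
```

A result of 0 means the marker line was not found, which may indicate a CPU-only build or simply a different log format.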

### 2.2 - Use `nvidia-smi` tool

Use the `nvidia-smi` tool (run it from a terminal) to see if GPU memory is being used.

Here is a sample output:

![missing image](../images/nvidia-smi-1.png)
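
The same information can be read from Python. The sketch below assumes `nvidia-smi` is on the `PATH` and uses its `--query-gpu` CSV output; `gpu_memory_report` is a hypothetical helper name. It returns `None` when the tool is missing or fails, so it is safe to run on a CPU-only machine.

```python
import shutil
import subprocess


def gpu_memory_report():
    """Return one 'name, memory used, memory total' line per GPU, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.used,memory.total",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print(gpu_memory_report())
```

Running this before and after the model is loaded should show the used memory grow if layers were actually offloaded.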

In [1]:
# Check whether PyTorch can see a CUDA GPU.
# Note: this only reports what PyTorch sees; it does not prove that
# llama-cpp-python was built with GPU support.
import os
import torch

# To disable the GPU and experiment, uncomment the following line.
# Normally you would want to use the GPU if one is available.
# os.environ["CUDA_VISIBLE_DEVICES"] = ""

print("using CUDA/GPU: ", torch.cuda.is_available())

for i in range(torch.cuda.device_count()):
    print("device ", i, torch.cuda.get_device_properties(i).name)

using CUDA/GPU:  True
device  0 NVIDIA GeForce RTX 2070


## Step-3: Choosing a Model

Running full-sized models requires a lot of resources, so we will be using **quantized** models. These are 'slimmed down' models that are smaller, at the cost of some output quality.

See the available options here: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF

The smallest recommended model is the 4-bit quantized **Q4_K_M**: a medium-sized variant that provides balanced quality.

So the filename is: **mistral-7b-instruct-v0.2.Q4_K_M.gguf**

Go ahead and download the model from that page.

Here is a direct link to the file: [mistral-7b-instruct-v0.2.Q4_K_M.gguf](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf)

![missing image](../images/mistral-model-selection-1.png)
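
As an alternative to downloading through the browser, the file can probably be fetched from Python. The sketch below assumes the `huggingface_hub` package is installed (`pip install huggingface_hub`); it is not used elsewhere in this notebook. `download_model` and `expected_model_path` are hypothetical helper names, and the `../models` directory layout is chosen to match the path used in the next cell.

```python
from pathlib import Path

REPO_ID = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
FILENAME = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"


def expected_model_path(models_dir: str = "../models") -> Path:
    """Where the model file should end up on disk."""
    return Path(models_dir) / REPO_ID / FILENAME


def download_model(models_dir: str = "../models") -> str:
    """Download the model file (several GB) and return its local path."""
    # Imported lazily so the helpers above work without huggingface_hub installed
    from huggingface_hub import hf_hub_download

    target_dir = Path(models_dir) / REPO_ID
    return hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=str(target_dir))


print(expected_model_path())
# Uncomment to actually download the file:
# print(download_model())
```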

In [2]:
model_file_path = '../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf'
# model_file_path = '../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q5_K_M.gguf'
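
Before loading, it can be worth checking that the download is a real model file. GGUF files begin with the four ASCII bytes `GGUF`, so a sketch like the following catches a missing file or an HTML/pointer file saved by mistake. `looks_like_gguf` is a hypothetical helper name, and the check says nothing about whether the rest of the file is intact.

```python
from pathlib import Path


def looks_like_gguf(path: str) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == b"GGUF"


# Same path as in the cell above
candidate = "../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
print("valid GGUF header:", looks_like_gguf(candidate))
```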

## Step-4: Initialize the Model

Here we initialize the model with the following parameters:

- **model_path**: points to where the model file was downloaded
- **n_gpu_layers**: controls how many model layers are offloaded to the GPU
    - set to at least 1 to use the GPU
    - try values such as 1, 10, 20, 30, 40
    - set to -1 to offload all layers to the GPU
    - for an Nvidia GeForce RTX 2070 with 8 GB of GPU RAM, 40 works well
    - `mistral-7b-instruct-v0.2.Q4_K_M.gguf` consumes about 5 GB of GPU RAM with full offload (-1)
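
If the GPU cannot hold the whole model, a rough starting value for `n_gpu_layers` can be estimated. The sketch below is a crude heuristic, not how llama.cpp allocates memory: it assumes every layer takes an equal share of the full-offload footprint and ignores everything else. `estimate_gpu_layers` is a hypothetical helper; the 5 GB figure comes from the list above and the 32 layers from `llama.block_count` in the loader log.

```python
import math


def estimate_gpu_layers(free_vram_gb: float, model_size_gb: float, n_layers: int) -> int:
    """Crude estimate of how many layers fit, assuming equal memory per layer."""
    if free_vram_gb <= 0 or model_size_gb <= 0 or n_layers <= 0:
        return 0
    per_layer_gb = model_size_gb / n_layers
    return min(n_layers, math.floor(free_vram_gb / per_layer_gb))


# About 5 GB at full offload, 32 layers, and a hypothetical 3 GB of free GPU memory
print(estimate_gpu_layers(free_vram_gb=3.0, model_size_gb=5.0, n_layers=32))
```

Treat the result as a first guess, then adjust up or down while watching GPU memory with `nvidia-smi`.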


In [3]:
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    # model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=model_file_path,
    temperature=0.1,
    max_new_tokens=256,
    # the loader log below reports a context length of 32768 for this model;
    # we use a much smaller window here (the value comes from the Llama 2 example)
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use the GPU, 0 for no GPU, -1 to offload everything to the GPU
    # try values such as 1, 10, 20, 30, 40
    # for an Nvidia GeForce RTX 2070 with 8 GB of GPU RAM, 40 works well
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

  from .autonotebook import tqdm as notebook_tqdm
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:               

## Step-5: Let's test the model

In [4]:
%%time

import pprint

prompt = "What is the capital of France"

response = llm.complete(prompt)
print ('-----', flush=True)
print(response.text)
print ('-----', flush=True)
pprint.pprint (response.__dict__, indent=4)


llama_print_timings:        load time =     135.94 ms
llama_print_timings:      sample time =       3.11 ms /     9 runs   (    0.35 ms per token,  2893.89 tokens per second)
llama_print_timings: prompt eval time =     135.73 ms /    69 tokens (    1.97 ms per token,   508.35 tokens per second)
llama_print_timings:        eval time =     128.94 ms /     8 runs   (   16.12 ms per token,    62.04 tokens per second)
llama_print_timings:       total time =     287.69 ms /    77 tokens


-----
 The capital city of France is Paris.
-----
{   'additional_kwargs': {},
    'delta': None,
    'logprobs': None,
    'raw': {   'choices': [   {   'finish_reason': 'stop',
                                  'index': 0,
                                  'logprobs': None,
                                  'text': ' The capital city of France is '
                                          'Paris.'}],
               'created': 1711177447,
               'id': 'cmpl-a853953a-739b-4320-93a0-cfcfdb9a1474',
               'model': '../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf',
               'object': 'text_completion',
               'usage': {   'completion_tokens': 8,
                            'prompt_tokens': 69,
                            'total_tokens': 77}},
    'text': ' The capital city of France is Paris.'}
CPU times: user 253 ms, sys: 40.7 ms, total: 294 ms
Wall time: 292 ms
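
The `llama_print_timings` lines above already report throughput, but the figure can be recomputed from the elapsed time and token count, which is handy when comparing different `n_gpu_layers` settings. This is a minimal sketch: `tokens_per_second` is a hypothetical helper, and the regular expression assumes the line layout shown in the output above.

```python
import re


def tokens_per_second(timing_line: str) -> float:
    """Compute tokens/second from a 'llama_print_timings' line like the ones above."""
    match = re.search(r"=\s*([\d.]+) ms /\s*(\d+) (?:runs|tokens)", timing_line)
    if match is None:
        raise ValueError("unrecognized timing line")
    elapsed_ms = float(match.group(1))
    count = int(match.group(2))
    return count / (elapsed_ms / 1000.0)


# The 'eval time' line copied from the output above
line = (
    "llama_print_timings:        eval time =     128.94 ms /     8 runs   "
    "(   16.12 ms per token,    62.04 tokens per second)"
)
print(round(tokens_per_second(line), 2))  # prints 62.04
```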


In [5]:
%%time

# prompt = "Hello! Can you tell me a poem about cats and dogs?"
prompt = "Write a haiku about cats"

response = llm.complete(prompt)
print ('-----', flush=True)
print(response.text)
print ('-----', flush=True)
pprint.pprint (response.__dict__, indent=4)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     135.94 ms
llama_print_timings:      sample time =       8.69 ms /    26 runs   (    0.33 ms per token,  2991.26 tokens per second)
llama_print_timings: prompt eval time =      84.93 ms /    10 tokens (    8.49 ms per token,   117.75 tokens per second)
llama_print_timings:        eval time =     399.03 ms /    25 runs   (   15.96 ms per token,    62.65 tokens per second)
llama_print_timings:       total time =     533.98 ms /    35 tokens


-----
 Silent paws glide,
Soft purrs soothe troubled hearts,
Grace in feline form.
-----
{   'additional_kwargs': {},
    'delta': None,
    'logprobs': None,
    'raw': {   'choices': [   {   'finish_reason': 'stop',
                                  'index': 0,
                                  'logprobs': None,
                                  'text': ' Silent paws glide,\n'
                                          'Soft purrs soothe troubled hearts,\n'
                                          'Grace in feline form.'}],
               'created': 1711177447,
               'id': 'cmpl-a26f73cc-a2fc-4ff2-98dd-b6d3157d3d5b',
               'model': '../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf',
               'object': 'text_completion',
               'usage': {   'completion_tokens': 25,
                            'prompt_tokens': 69,
                            'total_tokens': 94}},
    'text': ' Silent paws glide,\n'
            'Soft p

In [6]:
%%time

# prompt = "Q: Name the planets in Solar system.  A:"
prompt = "Name the planets in Solar system"

response = llm.complete(prompt)
print ('-----', flush=True)
print(response.text)
print ('-----', flush=True)
pprint.pprint (response.__dict__, indent=4)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     135.94 ms
llama_print_timings:      sample time =      45.84 ms /   133 runs   (    0.34 ms per token,  2901.21 tokens per second)
llama_print_timings: prompt eval time =      86.62 ms /    10 tokens (    8.66 ms per token,   115.45 tokens per second)
llama_print_timings:        eval time =    2125.97 ms /   132 runs   (   16.11 ms per token,    62.09 tokens per second)
llama_print_timings:       total time =    2539.75 ms /   142 tokens


-----
 In our Solar System, there are eight planets. In order from the Sun: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. Each of these planets has unique characteristics that distinguish it from the others. Mercury is the smallest and closest to the Sun, while Jupiter is the largest and known for its Great Red Spot. Saturn is famous for its prominent ring system, while Uranus is unique for being tilted on its side. Neptune is known for its deep blue color and violent weather patterns. I'm here to help with any questions you might have!
-----
{   'additional_kwargs': {},
    'delta': None,
    'logprobs': None,
    'raw': {   'choices': [   {   'finish_reason': 'stop',
                                  'index': 0,
                                  'logprobs': None,
                                  'text': ' In our Solar System, there are '
                                          'eight planets. In order from the '
                                          'Sun: 