# Running a Local Model

This notebook shows how to run a model locally on your own computer.

We will use llama.cpp through its Python bindings, llama-cpp-python.

References:

- https://docs.llamaindex.ai/en/stable/getting_started/starter_example.html
- https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/llm/llama_2_llama_cpp.ipynb
- https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html
- https://colab.research.google.com/drive/1UoPcoiA5EOBghxWKWduQhChliMHxla7U?usp=sharing#scrollTo=V1DwVkHCNDgT

## Installing llama-cpp-python

For a detailed install guide, see https://github.com/abetlen/llama-cpp-python

### Quick install for CPU

```bash
# adjust this to match your environment name
conda activate atlas-1

pip install llama-cpp-python
```
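To confirm that the install succeeded, one option is to ask Python for the installed package version. The snippet below is a minimal sketch using only the standard library. The helper name `installed_version` is hypothetical, and it returns `None` rather than raising when the package is missing.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "llama-cpp-python") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("llama-cpp-python is not installed in this environment")
    else:
        print("llama-cpp-python version:", version)
```

Run it inside the same conda environment you installed into. A different environment may report a different result.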

### Quick install for GPU

This requires an NVIDIA GPU with the appropriate CUDA drivers installed. The installation steps vary from OS to OS and are beyond the scope of this guide, so make sure the drivers are working before you attempt to use the GPU.

```bash
conda activate atlas-1

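# Build with CUDA support. Note: the CMake flag name has changed across
# llama-cpp-python releases; check the install guide if this one is rejected.
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

# if you need to force a reinstall
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir llama-cpp-python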
```

## How do I know if the GPU is being used?

See below for the GPU checks we run.

Keep in mind that the check in this notebook only reports whether PyTorch can see a CUDA device. Even if it says a GPU is available, the model won't use the GPU unless llama-cpp-python was built with GPU support.

So double-check and verify!
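One way to check the llama-cpp-python build itself is to ask the binding whether it was compiled with GPU offload. The sketch below assumes the installed release exposes `llama_cpp.llama_supports_gpu_offload`. That holds for recent releases as far as we know, but treat it as an assumption. The wrapper name `llama_cpp_gpu_offload_supported` is hypothetical, and it returns `None` when the answer cannot be determined.

```python
from typing import Optional


def llama_cpp_gpu_offload_supported() -> Optional[bool]:
    """Return True/False if the build reports GPU offload support, None if unknown."""
    try:
        import llama_cpp
    except (ImportError, OSError, RuntimeError):
        return None  # llama-cpp-python is missing or its native library failed to load

    # Assumption: this helper exists only in newer bindings, so look it up defensively.
    check = getattr(llama_cpp, "llama_supports_gpu_offload", None)
    if check is None:
        return None
    return bool(check())


if __name__ == "__main__":
    print("llama-cpp-python GPU offload support:", llama_cpp_gpu_offload_supported())
```

A `False` result suggests the package was built for CPU only and should be reinstalled with the GPU build flags.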

### Experimenting

To turn off GPU usage completely, set this environment variable before any CUDA code runs:

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
```

Once the GPU is enabled, we can experiment with how much of the computation is offloaded onto the GPU. This is controlled by the `n_gpu_layers` entry in `model_kwargs`, for example `model_kwargs={"n_gpu_layers": 10}`.

- set to 0 for no GPU, to at least 1 to offload that many layers to the GPU, or to -1 to offload all layers
- try values such as 1, 10, 20, 30, 40
- for an NVIDIA GeForce RTX 2070 with 8 GB of VRAM, 40 works well
    - the `mistral-7b-instruct-v0.2.Q4_K_S.gguf` model consumes about 5.6 GB of GPU memory with 40 GPU layers
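The experiment described above can be sketched as a small timing loop. This is an illustration, not part of the original notebook: `time_gpu_layer_settings` and its `make_llm` factory argument are hypothetical names, and the only assumption about the model object is that it has a `complete(prompt)` method, as `LlamaCPP` does later in this notebook.

```python
import time
from typing import Any, Callable, Dict, Iterable


def time_gpu_layer_settings(
    make_llm: Callable[[int], Any],
    prompt: str,
    layer_counts: Iterable[int] = (1, 10, 20, 30, 40),
) -> Dict[int, float]:
    """Return wall-clock seconds for one completion per n_gpu_layers value."""
    timings: Dict[int, float] = {}
    for n_layers in layer_counts:
        llm = make_llm(n_layers)       # load the model with this offload setting
        start = time.perf_counter()
        llm.complete(prompt)           # time a single completion
        timings[n_layers] = time.perf_counter() - start
        del llm                        # drop the reference before loading the next one
    return timings


# Example use in this notebook (assumes LlamaCPP and model_file_path are defined):
# timings = time_gpu_layer_settings(
#     lambda n: LlamaCPP(model_path=model_file_path, model_kwargs={"n_gpu_layers": n}),
#     "Write a haiku about cats",
# )
```

A single completion per setting is a rough measurement. Repeat the run a few times before drawing conclusions.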

Use the `nvidia-smi` tool (run it from a terminal) to check whether GPU memory is being used.
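The same check can be run from Python. The sketch below shells out to `nvidia-smi` with its CSV query options and parses the used and total memory in MiB for each GPU. The function names are hypothetical, and it returns an empty list when `nvidia-smi` is not on the `PATH`.

```python
import shutil
import subprocess
from typing import List, Tuple


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one GPU per line; skip unparsable lines."""
    pairs = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        try:
            used, total = int(fields[0]), int(fields[1])
        except (ValueError, IndexError):
            continue
        pairs.append((used, total))
    return pairs


def gpu_memory_usage() -> List[Tuple[int, int]]:
    """Return (used_mib, total_mib) per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for index, (used, total) in enumerate(gpu_memory_usage()):
        print(f"GPU {index}: {used} MiB used of {total} MiB")
```

Calling `gpu_memory_usage()` before and after loading the model shows how much memory the offloaded layers take.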

In [1]:
# Check if GPU is enabled
import os
import torch

# To disable the GPU and experiment, uncomment the following line.
# Normally, you would want to use the GPU if one is available.
# os.environ["CUDA_VISIBLE_DEVICES"] = ""

print("using CUDA/GPU: ", torch.cuda.is_available())

for i in range(torch.cuda.device_count()):
    print("device ", i, torch.cuda.get_device_properties(i).name)

using CUDA/GPU:  True
device  0 NVIDIA GeForce RTX 2070


In [2]:
from dotenv import load_dotenv, find_dotenv

# _ = load_dotenv(find_dotenv())  # read local .env file

# Path to a pre-downloaded GGUF model file
model_file_path = 'models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_S.gguf'

# Keep the LlamaIndex cache in a 'cache' folder under the current working directory
os.environ['LLAMA_INDEX_CACHE_DIR'] = os.path.join(os.path.abspath(''), 'cache')
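Before loading, it can help to confirm that the model file is where the path says it is. The helper below is a minimal sketch with a hypothetical name, `describe_model_file`. It only inspects the file system and does not load the model.

```python
from pathlib import Path


def describe_model_file(path: str) -> str:
    """Report whether the model file exists and, if so, its size on disk."""
    model_path = Path(path)
    if not model_path.is_file():
        return f"missing: {model_path}"
    size_gb = model_path.stat().st_size / 1e9
    return f"found: {model_path} ({size_gb:.2f} GB)"


# Example use in this notebook (assumes model_file_path is defined as above):
# print(describe_model_file(model_file_path))
```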


In [3]:
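# These import paths match llama-index releases before 0.10; newer releases
# may expose LlamaCPP from llama_index.llms.llama_cpp instead.
from llama_index.llms import LlamaCPP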
from llama_index.llms.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    # model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=model_file_path,
    temperature=0.1,
    max_new_tokens=256,
    # the original Llama 2 example assumes a 4096-token context window and sets this
    # slightly lower for headroom; the Mistral model loaded here reports 32768
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # n_gpu_layers: 0 for no GPU, at least 1 to offload layers, -1 to offload all layers
    # try values such as 1, 10, 20, 30, 40
    # for an NVIDIA GeForce RTX 2070 with 8 GB of VRAM, 40 works well
    model_kwargs={"n_gpu_layers": 10},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              =

In [4]:
%%time

# prompt = "Hello! Can you tell me a poem about cats and dogs?"
prompt = "Write a haiku about cats"

response = llm.complete(prompt)
print(response.text)
print('-----', flush=True)
# type(response.dict)
# pprint.pprint (response.dict, indent=4)

 Soft paws on cool tile,
Silent hunters in the night,
Curiosity's grace.
-----
CPU times: user 570 ms, sys: 37.9 ms, total: 608 ms
Wall time: 607 ms



llama_print_timings:        load time =     137.58 ms
llama_print_timings:      sample time =       9.16 ms /    25 runs   (    0.37 ms per token,  2729.56 tokens per second)
llama_print_timings: prompt eval time =     137.43 ms /    69 tokens (    1.99 ms per token,   502.08 tokens per second)
llama_print_timings:        eval time =     420.05 ms /    24 runs   (   17.50 ms per token,    57.14 tokens per second)
llama_print_timings:       total time =     604.69 ms /    93 tokens


In [5]:
%%time

# prompt = "Q: Name the planets in Solar system.  A:"
prompt = "Name the planets in Solar system"

response = llm.complete(prompt)
print(response.text)
print('-----', flush=True)
# type(response.dict)
# pprint.pprint (response.dict, indent=4)

Llama.generate: prefix-match hit


 In order from the Sun, the planets in our Solar System are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. Pluto was once considered a planet but it was reclassified as a dwarf planet in 2006.
-----
CPU times: user 1.34 s, sys: 6.78 ms, total: 1.34 s
Wall time: 1.34 s



llama_print_timings:        load time =     137.58 ms
llama_print_timings:      sample time =      23.36 ms /    65 runs   (    0.36 ms per token,  2782.42 tokens per second)
llama_print_timings: prompt eval time =      84.65 ms /    10 tokens (    8.46 ms per token,   118.13 tokens per second)
llama_print_timings:        eval time =    1128.14 ms /    64 runs   (   17.63 ms per token,    56.73 tokens per second)
llama_print_timings:       total time =    1339.24 ms /    74 tokens
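The throughput figures in the `llama_print_timings` lines above follow directly from the counts and durations: tokens per second is the count divided by the time in seconds. The sketch below recomputes them from the log text. The function name `tokens_per_second` and the regular expression are our own, written against the log format shown in this notebook, so other llama.cpp versions may print a different layout.

```python
import re
from typing import Dict

# Matches e.g. "llama_print_timings:  eval time =  420.05 ms /  24 runs"
_TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<name>[a-z ]+?) time =\s*(?P<ms>[\d.]+) ms"
    r"(?: /\s*(?P<count>\d+) (?:runs|tokens))?"
)


def tokens_per_second(log_text: str) -> Dict[str, float]:
    """Return count / seconds for each timing stage that reports a count."""
    rates: Dict[str, float] = {}
    for match in _TIMING_RE.finditer(log_text):
        name = match.group("name")
        if match.group("count") is None or name == "total":
            continue  # "load" has no count; "total" mixes prompt and generated tokens
        milliseconds = float(match.group("ms"))
        if milliseconds > 0:
            rates[name] = int(match.group("count")) / (milliseconds / 1000.0)
    return rates
```

For the haiku run, the `eval` stage generated 24 tokens in 420.05 ms, which is 24 / 0.42005 ≈ 57.1 tokens per second, in line with the 57.14 reported in the log. Small differences in the last digits come from the log rounding the millisecond values.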
