<a href="https://colab.research.google.com/github/yernenip/phi2-gguf/blob/main/Phi2_GGUF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading and Merging Phi-2 with fine-tuned LoRA adapters

In [None]:
!pip install peft
!pip install --upgrade torch transformers


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-2"
torch.set_default_device("cuda")

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.float16,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)



In [None]:
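from peft import PeftModel

# Load the fine-tuned LoRA adapter weights from the Hub and attach them to the base model
model_adapters = "praveeny/phi2-webglm-qlora"
model = PeftModel.from_pretrained(model, model_adapters)

# Fold the adapter weights into the base weights and drop the PEFT wrappers
model = model.merge_and_unload()
# Save the merged model locally
model.save_pretrained("updated_adapters")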
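
For intuition about what the merge in the cell above does: a LoRA adapter stores two small matrices per adapted layer, and merging folds their scaled product into the base weight so the adapter is no longer needed at inference time. The sketch below shows that arithmetic in plain Python on a toy 2x2 weight. The function names, the toy values, and the `alpha / rank` scaling convention are my assumptions for illustration; this is not PEFT's actual code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example (hypothetical values): a 2x2 base weight with a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (rank, in_features)
lora_b = [[0.5], [0.25]]   # shape (out_features, rank)
merged = merge_lora(base, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

After merging, the result has the same shape as the base weight, which is why the merged model can be saved and converted like an ordinary checkpoint.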


In [None]:
model.push_to_hub("phi2-webglm-guava", private=True,
                  commit_message="merged model")

tokenizer.push_to_hub("phi2-webglm-guava", private=True,
                  commit_message="tokenizer")

# Setting up llama.cpp and saving the model in GGUF format

**Note:** At this point, I recommend disconnecting and deleting the runtime. Merging the model and pushing it to the Hub (as shown above) takes up a lot of resources.

That is why the required packages are installed again below.

In [None]:
from huggingface_hub import snapshot_download

model_id="praveeny/phi2-webglm-guava"
#Download the repository to local_dir
snapshot_download(repo_id=model_id, local_dir="phi2",
                  local_dir_use_symlinks=False)



Fetching 13 files:   0%|          | 0/13 [00:00<?, ?it/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/119 [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/897 [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/564M [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/441 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/7.34k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/35.7k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

'/content/phi2'
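
Before converting, it can help to confirm that the download is complete. The helper below is a small sketch that walks a local directory and reports each file with its size; `list_files` is a hypothetical name of mine, and it only assumes the `phi2` directory used in the cell above.

```python
from pathlib import Path


def list_files(local_dir):
    """Return (relative_path, size_in_bytes) for every file under local_dir."""
    root = Path(local_dir)
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


# Only prints something if the download directory from the cell above exists.
if Path("phi2").is_dir():
    for name, size in list_files("phi2"):
        print(f"{size:>14,d}  {name}")
```

Both safetensors shards and the tokenizer files should appear in the listing before you move on to the conversion step.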

In [None]:
# Setup Llama.cpp and install required packages
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

Cloning into 'llama.cpp'...
remote: Enumerating objects: 19344, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 19344 (delta 0), reused 1 (delta 0), pack-reused 19341
Receiving objects: 100% (19344/19344), 22.71 MiB | 14.41 MiB/s, done.
Resolving deltas: 100% (13524/13524), done.
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=

In [None]:
!python llama.cpp/convert-hf-to-gguf.py phi2 --outfile "phi2/phi2-v2-fp16.bin" --outtype f16

Loading model: phi2
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Adding 50000 merge(s).
gguf: Setting special token type bos to 50256
gguf: Setting special token type eos to 50256
gguf: Setting special token type unk to 50256
Exporting model to 'phi2/phi2-v2-fp16.bin'
gguf: loading model part 'model-00001-of-00002.safetensors'
token_embd.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_norm.bias, n_dims = 1, torch.float16 --> float32
blk.0.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.0.ffn_up.bias, n_dims = 1, torch.float16 --> float32
blk.0.ffn_up.weig
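
A quick sanity check on the converted file is to read its header. As far as I know, a GGUF file starts with the 4-byte magic `GGUF`, followed by a 32-bit version and two 64-bit counts (tensors and metadata key-value pairs), all little-endian in files like the one written here. The sketch below parses only that fixed header; `read_gguf_header` is a hypothetical helper name, not part of llama.cpp.

```python
import struct

# magic, version, tensor count, metadata key-value count (little-endian)
GGUF_HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(path):
    """Parse the fixed-size header at the start of a little-endian GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(GGUF_HEADER.size)
    if len(raw) < GGUF_HEADER.size:
        raise ValueError("file too short to be GGUF")
    magic, version, n_tensors, n_kv = GGUF_HEADER.unpack(raw)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

Calling `read_gguf_header("phi2/phi2-v2-fp16.bin")` should report values that agree with what the llama.cpp loader prints when it opens the same file.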

In [None]:
!./llama.cpp/quantize "phi2/phi2-v2-fp16.bin" "phi2/phi2-v2-Q5_K_M.gguf" "q5_k_m"

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
main: build = 2254 (9e359a4f)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'phi2/phi2-v2-fp16.bin' to 'phi2/phi2-v2-Q5_K_M.gguf' as Q5_K_M
llama_model_loader: loaded meta data with 19 key-value pairs and 453 tensors from phi2/phi2-v2-fp16.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = Phi2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
llama_mod
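
To give a feel for what quantization trades away, here is a deliberately simplified sketch: each block of weights shares one scale, and every weight is rounded to a small signed integer. This is a plain symmetric per-block scheme that I am using for illustration only. The real Q5_K_M format is more elaborate, and the function names and sample values below are hypothetical.

```python
def quantize_block(values, bits=5):
    """Quantize floats to signed integers that share one scale per block."""
    qmax = 2 ** (bits - 1) - 1          # 15 for 5 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from the scale and integer codes."""
    return [scale * c for c in codes]


# Hypothetical block of weights.
block = [0.12, -0.40, 0.33, 0.05, -0.27, 0.40, -0.01, 0.18]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes)
print(f"scale={scale:.5f}  max abs error={max_error:.5f}")
```

In this scheme the reconstruction error of any single weight is bounded by half the scale, so blocks with a smaller value range are stored more precisely.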

In [None]:
!pip install huggingface_hub

from huggingface_hub import HfApi
api = HfApi()

model_id = "praveeny/phi2-webglm-gguf"
api.create_repo(model_id, exist_ok=True, repo_type="model")
api.upload_file(
    path_or_fileobj="phi2/phi2-v2-Q5_K_M.gguf",
    path_in_repo="phi2-v2-Q5_K_M.gguf",
    repo_id=model_id,
)



phi2-v2-Q5_K_M.gguf:   0%|          | 0.00/2.00G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/praveeny/phi2-webglm-gguf/commit/897b106fc7b4c287aaf66b14e93e1f8bac5e9f29', commit_message='Upload phi2-v2-Q5_K_M.gguf with huggingface_hub', commit_description='', oid='897b106fc7b4c287aaf66b14e93e1f8bac5e9f29', pr_url=None, pr_revision=None, pr_num=None)
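
The file sizes shown in the progress bars give a rough picture of what quantization bought. The arithmetic below is a back-of-the-envelope sketch: it assumes the bars use decimal units (1G = 1e9 bytes) and ignores the small amount of metadata and tokenizer data stored alongside the weights.

```python
# Sizes as displayed in the download/upload progress bars in this notebook.
# Assumption: decimal units (1G = 1e9 bytes, 1M = 1e6 bytes).
fp16_bytes = 5.00e9 + 564e6      # the two safetensors shards
quant_bytes = 2.00e9             # phi2-v2-Q5_K_M.gguf

approx_params = fp16_bytes / 2   # fp16 stores 2 bytes per weight
bits_per_weight = quant_bytes * 8 / approx_params
compression = fp16_bytes / quant_bytes

print(f"approx. parameters : {approx_params / 1e9:.2f} billion")
print(f"bits per weight    : {bits_per_weight:.2f}")
print(f"size reduction     : {compression:.2f}x")
```

Under those assumptions the quantized file averages a little under 6 bits per weight, compared with 16 bits in the fp16 checkpoint.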

# Running Inference with LangChain, llama.cpp and GGUF

At this point, I recommend disconnecting and deleting the runtime again. The code below can be run separately: it re-downloads the GGUF file from the Hugging Face Hub and then works with the local copy.

I am also running this part on a CPU instance instead of a GPU.

In [None]:
!pip install huggingface_hub
!pip install langchain
!pip install llama-cpp-python

Collecting langchain
  Downloading langchain-0.1.9-py3-none-any.whl (816 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 817.0/817.0 kB 7.3 MB/s eta 0:00:00
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-community<0.1,>=0.0.21 (from langchain)
  Downloading langchain_community-0.0.24-py3-none-any.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 15.5 MB/s eta 0:00:00
Collecting langchain-core<0.2,>=0.1.26 (from langchain)
  Downloading langchain_core-0.1.26-py3-none-any.whl (246 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 246.4/246.4 kB 21.1 MB/s eta 0:00:00
Collecting langsmith<0.2.0,>=0.1.0 (from langchain)
  Downloading langsmith

In [None]:
from huggingface_hub import snapshot_download

model_id="praveeny/phi2-webglm-gguf"
#Download the repository to local_dir
snapshot_download(repo_id=model_id, local_dir="phi2-gguf",
                  local_dir_use_symlinks=False)

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

phi2-v2-Q5_K_M.gguf:   0%|          | 0.00/2.00G [00:00<?, ?B/s]

'/content/phi2-gguf'

## Setting up LangChain and prompt

In [None]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import LlamaCpp

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

## Running inference with Llamacpp

In [None]:
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="phi2-gguf/phi2-v2-Q5_K_M.gguf",
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)


prompt = """###System:
Read the references provided and answer the corresponding question.
###References:
[1] For most people, the act of reading is a reward in itself. However, studies show that reading books also has benefits that range from a longer life to career success. If you’re looking for reasons to pick up a book, read on for seven science-backed reasons why reading is good for your health, relationships and happiness.
[2] As per a study, one of the prime benefits of reading books is slowing down mental disorders such as Alzheimer’s and Dementia  It happens since reading stimulates the brain and keeps it active, which allows it to retain its power and capacity.
[3] Another one of the benefits of reading books is that they can improve our ability to empathize with others. And empathy has many benefits – it can reduce stress, improve our relationships, and inform our moral compasses.
[4] Here are 10 benefits of reading that illustrate the importance of reading books. When you read every day you:
[5] Why is reading good for you? Reading is good for you because it improves your focus, memory, empathy, and communication skills. It can reduce stress, improve your mental health, and help you live longer. Reading also allows you to learn new things to help you succeed in your work and relationships.
###Question:
Why is reading books widely considered to be beneficial?
###Answer:
"""


llm.invoke(prompt)

llama_model_loader: loaded meta data with 20 key-value pairs and 453 tensors from phi2-gguf/phi2-v2-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = Phi2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 10240
llama_model_loader: - kv   5:                           phi2.block_count u32              = 32
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32           

Reading books is widely considered to be beneficial because it can improve focus, memory, empathy, and communication skills[5], reduce stress, improve mental health, and help you live longer[5], and allow you to learn new things to help you succeed in your work and relationships[5]. It can also slow down mental disorders such as Alzheimer’s and Dementia by stimulating the brain and keeping it active[2], and improve our ability to empathize with others which can reduce stress, improve our relationships, and inform our moral compasses[3]. Additionally, it can improve our ability to comprehend and retain information, which can help us succeed academically[4]. Finally, it can be rewarding in itself as it can be an enjoyable activity[1].


llama_print_timings:        load time =    2947.97 ms
llama_print_timings:      sample time =     161.15 ms /   155 runs   (    1.04 ms per token,   961.82 tokens per second)
llama_print_timings: prompt eval time =   80168.12 ms /   302 tokens (  265.46 ms per token,     3.77 tokens per second)
llama_print_timings:        eval time =   56772.21 ms /   154 runs   (  368.65 ms per token,     2.71 tokens per second)
llama_print_timings:       total time =  138518.75 ms /   456 tokens
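
The timing lines above are easier to compare across runs once they are parsed into numbers. The sketch below pulls the stage name, elapsed milliseconds, and item count out of each `llama_print_timings` line with a regular expression. The pattern is my own and is written against the log format shown here, so it may need adjusting for other llama.cpp versions.

```python
import re

TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<stage>[\w ]+?) time =\s*(?P<ms>[\d.]+) ms"
    r"(?: /\s*(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text):
    """Map each stage name to (milliseconds, item count or None)."""
    result = {}
    for m in TIMING_RE.finditer(log_text):
        count = int(m["count"]) if m["count"] else None
        result[m["stage"].strip()] = (float(m["ms"]), count)
    return result


# The timing lines copied from the output above.
log = """
llama_print_timings:        load time =    2947.97 ms
llama_print_timings:      sample time =     161.15 ms /   155 runs   (    1.04 ms per token,   961.82 tokens per second)
llama_print_timings: prompt eval time =   80168.12 ms /   302 tokens (  265.46 ms per token,     3.77 tokens per second)
llama_print_timings:        eval time =   56772.21 ms /   154 runs   (  368.65 ms per token,     2.71 tokens per second)
llama_print_timings:       total time =  138518.75 ms /   456 tokens
"""

timings = parse_timings(log)
eval_ms, eval_tokens = timings["eval"]
print(f"generation speed: {eval_tokens / (eval_ms / 1000):.2f} tokens/s")
```

For this run the computed generation speed matches the 2.71 tokens per second that llama.cpp reports on the `eval time` line.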


'Reading books is widely considered to be beneficial because it can improve focus, memory, empathy, and communication skills[5], reduce stress, improve mental health, and help you live longer[5], and allow you to learn new things to help you succeed in your work and relationships[5]. It can also slow down mental disorders such as Alzheimer’s and Dementia by stimulating the brain and keeping it active[2], and improve our ability to empathize with others which can reduce stress, improve our relationships, and inform our moral compasses[3]. Additionally, it can improve our ability to comprehend and retain information, which can help us succeed academically[4]. Finally, it can be rewarding in itself as it can be an enjoyable activity[1].'
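
The prompt used in this section follows a fixed layout: a system instruction, a numbered list of references, the question, and an empty answer slot. If you want to try other questions, a small helper like the one below can assemble that layout. `build_prompt` is a hypothetical name of mine, and the shortened references in the example are placeholders rather than the ones used above.

```python
def build_prompt(references, question,
                 system="Read the references provided and answer the corresponding question."):
    """Assemble a prompt in the ###System / ###References / ###Question layout."""
    numbered = "\n".join(f"[{i}] {ref}" for i, ref in enumerate(references, start=1))
    return (
        f"###System:\n{system}\n"
        f"###References:\n{numbered}\n"
        f"###Question:\n{question}\n"
        f"###Answer:\n"
    )


# Placeholder references for illustration only.
example = build_prompt(
    references=["Reading improves focus and memory.", "Reading can reduce stress."],
    question="Why is reading good for you?",
)
print(example)
```

The returned string can be passed to `llm.invoke(...)` in the same way as the hand-written prompt.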