
# Exploring LlamaIndex

## Environment API Keys

In [None]:
import os

os.environ["HUGGINGFACE_API_KEY"] = ""
os.environ["OPENAI_API_KEY"] = ""

## HuggingFace LLM - Camel-5b

Ref: https://docs.llamaindex.ai/en/stable/examples/customization/llms/SimpleIndexDemo-Huggingface_camel.html

In [None]:
%pip install llama-index-llms-huggingface
!pip install llama-index

from google.colab import output
output.clear()

In [None]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import Settings

### Downloading Data

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-03-03 12:03:31--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-03-03 12:03:31 (46.7 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



In [None]:
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

### Setting Up Prompt Template

In [None]:
# setup prompts - specific to Camel-5b
from llama_index.core import PromptTemplate

# This will wrap the default prompts that are internal to llama-index
# taken from https://huggingface.co/Writer/camel-5b-hf
query_wrapper_prompt = PromptTemplate(
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)
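To make the wrapping step concrete, here is a minimal sketch in plain Python of what filling the `{query_str}` placeholder produces. The template text is copied from the cell above; `wrap_query` is a hypothetical helper written for illustration, not part of llama-index.

```python
# Template text copied from the cell above.
QUERY_WRAPPER = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)


def wrap_query(query_str: str) -> str:
    """Hypothetical helper: fill the {query_str} placeholder in the template."""
    return QUERY_WRAPPER.format(query_str=query_str)


wrapped = wrap_query("What did the author do growing up?")
print(wrapped)
```

The model then sees the instruction framing it was fine-tuned on, with the question placed between the `### Instruction:` and `### Response:` markers.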

### Setting Up LLM

In [None]:
import torch

llm = HuggingFaceLLM(
    context_window=2048,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.25, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="Writer/camel-5b-hf",
    model_name="Writer/camel-5b-hf",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    # float16 weights reduce memory usage on CUDA; offload_folder lets weights spill to disk
    model_kwargs={
        "torch_dtype": torch.float16,
        "offload_folder": "offload"
    }
)

Settings.chunk_size = 512
Settings.llm = llm

Some parameters are on the meta device device because they were offloaded to the cpu.



In [None]:
index = VectorStoreIndex.from_documents(documents)
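The settings above (`context_window=2048`, `max_new_tokens=256`, `chunk_size=512`) interact: the prompt and the generated answer have to share the context window. The following is a rough back-of-the-envelope sketch of that arithmetic. Both function names and the `overhead_tokens` value are assumptions for illustration, and the real prompt-packing logic in llama-index may differ.

```python
def prompt_token_budget(context_window: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt once room for the answer is reserved."""
    return context_window - max_new_tokens


def chunks_that_fit(budget: int, chunk_size: int, overhead_tokens: int = 0) -> int:
    """How many full chunks fit in the budget after template overhead (assumed value)."""
    return max(0, (budget - overhead_tokens) // chunk_size)


budget = prompt_token_budget(context_window=2048, max_new_tokens=256)
n_chunks = chunks_that_fit(budget, chunk_size=512, overhead_tokens=100)
print(budget, n_chunks)
```

Under these assumptions, roughly three 512-token chunks can accompany a question in a single call.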

### Query Index

In [None]:
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")

Token indices sequence length is longer than the specified maximum sequence length for this model (983 > 512). Running this sequence through the model will result in indexing errors
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
print(response)

The author grew up in Italy, where he learned to string together a lot of abstract concepts with a few simple verbs, which led to his interest in everyday words differing from their Italian cognates.


### Query Index - Streaming

In [None]:
query_engine = index.as_query_engine(streaming=True)

In [None]:
# set Logging to DEBUG for more detailed outputs
response_stream = query_engine.query("What did the author do growing up?")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
# can be slower to start streaming since llama-index often involves many LLM calls
response_stream.print_response_stream()

The author grew up in Italy, where he learned to string together a lot of abstract concepts with a few simple verbs, which led to his interest in everyday words differing from their Italian cognates.<|endoftext|>

In [None]:
# can also get a normal response object
response = response_stream.get_response()
print(response)

The author grew up in Italy, where he learned to string together a lot of abstract concepts with a few simple verbs, which led to his interest in everyday words differing from their Italian cognates.<|endoftext|>


In [None]:
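# can also iterate over the token generator yourself;
# the stream above is already consumed, so run a fresh streaming query
response_stream = query_engine.query("What did the author do growing up?")
generated_text = ""
for text in response_stream.response_gen:
    generated_text += text
print(generated_text)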

The author grew up in Italy, where he learned to string together a lot of abstract concepts with a few simple verbs, which led to his interest in everyday words differing from their Italian cognates.<|endoftext|>
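A streamed response is delivered through a generator, and a Python generator can be consumed only once. The sketch below illustrates this with a stand-in generator (`fake_token_stream` is hypothetical, not a llama-index object), assuming the streaming response behaves like a standard generator.

```python
def fake_token_stream():
    """Hypothetical stand-in for a streaming response generator."""
    yield from ["The ", "author ", "grew ", "up."]


stream = fake_token_stream()
first_pass = "".join(stream)   # consumes every token
second_pass = "".join(stream)  # nothing left to read

print(repr(first_pass))
print(repr(second_pass))
```

This is why a stream that has already been printed cannot be iterated a second time; a new streaming query is needed to obtain a fresh generator.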


## HuggingFace LLM - StableLM

Ref:
https://docs.llamaindex.ai/en/stable/examples/customization/llms/SimpleIndexDemo-Huggingface_stablelm.html

### Installation

In [None]:
%pip install llama-index-llms-huggingface
!pip install llama-index

from google.colab import output
output.clear()

In [None]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import Settings

### Downloading Data

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-03-03 12:56:55--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-03-03 12:56:55 (35.0 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



In [None]:
# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

### Setting Up Prompt Template

In [None]:
# setup prompts - specific to StableLM
from llama_index.core import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

### Setting Up LLM

In [None]:
import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # float16 weights reduce memory usage on CUDA
    # Ref offload_folder: https://github.com/nomic-ai/gpt4all/issues/239
    # Ref offload_folder: https://huggingface.co/tiiuae/falcon-7b/discussions/82
    model_kwargs={"torch_dtype": torch.float16, "offload_folder": "offload"},
)

Settings.llm = llm
Settings.chunk_size = 1024

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]



pytorch_model-00001-of-00002.bin:   0%|          | 0.00/10.2G [00:00<?, ?B/s]

OSError: [Errno 28] No space left on device

### Build VectorStoreIndex

In [None]:
index = VectorStoreIndex.from_documents(
    documents,
)

### Query Index

In [None]:
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")

In [None]:
print(response)

### Query Index - Streaming

In [None]:
query_engine = index.as_query_engine(streaming=True)

In [None]:
# set Logging to DEBUG for more detailed outputs
response_stream = query_engine.query("What did the author do growing up?")

In [None]:
# can be slower to start streaming since llama-index often involves many LLM calls
response_stream.print_response_stream()

In [None]:
# can also get a normal response object
response = response_stream.get_response()
print(response)

In [None]:
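# can also iterate over the token generator yourself;
# the stream above is already consumed, so run a fresh streaming query
response_stream = query_engine.query("What did the author do growing up?")
generated_text = ""
for text in response_stream.response_gen:
    generated_text += text
print(generated_text)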

## LangChain LLM

Ref: https://docs.llamaindex.ai/en/stable/examples/llm/langchain.html

In [None]:
%pip install llama-index-llms-langchain

from google.colab import output
output.clear()

### OpenAI

In [None]:
from langchain.llms import OpenAI
from llama_index.llms.langchain import LangChainLLM

In [None]:
llm = LangChainLLM(llm=OpenAI())

  warn_deprecated(


In [None]:
# the cell above emitted the following deprecation warning (not an error):
# /usr/local/lib/python3.10/dist-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.llms.openai.OpenAI` was deprecated in langchain-community 0.0.10 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAI`.
#   warn_deprecated(
!pip install -U langchain-openai
from langchain_openai import OpenAI

from google.colab import output
output.clear()

llm = LangChainLLM(llm=OpenAI())

In [None]:
response_gen = llm.stream_complete("Hi. What do you know about Universitas Indonesia?")

In [None]:
for delta in response_gen:
    print(delta.delta, end="")



Universitas Indonesia (UI) is a top public university located in Depok, West Java, Indonesia. It was founded in 1849 as the first university in Indonesia and is now considered the oldest and most prestigious university in the country.

UI has 15 faculties, including Faculty of Law, Faculty of Economics and Business, Faculty of Medicine, and Faculty of Humanities. It also has several international programs, such as the Faculty of Social and Political Sciences International Program and the Faculty of Humanities International Program.

The university has a strong reputation for academic excellence and is ranked among the top universities in Southeast Asia and the world. It is also known for its research and innovation, with many of its faculty members being recognized nationally and internationally.

UI has a diverse student body, with students coming from all parts of Indonesia and from other countries as well. It offers a wide range of extracurricular activities and has a vibrant camp

## Local Embeddings with HuggingFace

Ref: https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface.html

In [None]:
%pip install llama-index-embeddings-huggingface
%pip install llama-index-embeddings-instructor
!pip install llama-index

from google.colab import output
output.clear()

### HuggingFaceEmbedding

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# loads BAAI/bge-small-en
# embed_model = HuggingFaceEmbedding()

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")


In [None]:
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])

384
[-0.003275663824751973, -0.011690725572407246, 0.04155917093157768, -0.03814810514450073, 0.02418305166065693]
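Embedding vectors such as the 384-dimensional one above are typically compared with cosine similarity. Here is a minimal pure-Python sketch; the three toy vectors are made up for illustration and are not real model outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not real embeddings).
query_vec = [1.0, 0.0, 1.0]
similar_doc = [2.0, 0.0, 2.0]    # same direction as the query
unrelated_doc = [0.0, 1.0, 0.0]  # orthogonal to the query

sim_close = cosine_similarity(query_vec, similar_doc)
sim_far = cosine_similarity(query_vec, unrelated_doc)
print(sim_close, sim_far)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.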


### InstructorEmbedding

Instructor embeddings come from models trained to tailor an embedding to a natural-language instruction that describes the task. By default, queries are embedded with `query_instruction="Represent the question for retrieving supporting documents: "` and documents with `text_instruction="Represent the document for retrieval: "`.
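Conceptually, each input is paired with its instruction before encoding. The sketch below only builds those pairs in plain Python; `build_pairs` is a hypothetical helper, and the `[instruction, text]` pair layout is an assumption based on how the INSTRUCTOR models are commonly called, not something this notebook verifies.

```python
# Default instructions quoted in the paragraph above.
QUERY_INSTRUCTION = "Represent the question for retrieving supporting documents: "
TEXT_INSTRUCTION = "Represent the document for retrieval: "


def build_pairs(instruction, texts):
    """Hypothetical helper: pair one instruction with every input text."""
    return [[instruction, text] for text in texts]


query_pairs = build_pairs(QUERY_INSTRUCTION, ["What is data science?"])
doc_pairs = build_pairs(TEXT_INSTRUCTION, ["Data science studies data.", "Hello World!"])
print(query_pairs)
print(doc_pairs)
```

Because queries and documents carry different instructions, the same sentence can receive a different embedding depending on the role it plays.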

In [None]:
!pip install InstructorEmbedding
!pip install -U sentence-transformers

from google.colab import output
output.clear()

In [None]:
!pip show sentence-transformers

Name: sentence-transformers
Version: 2.5.1
Summary: Multilingual text embeddings
Home-page: https://www.SBERT.net
Author: Nils Reimers
Author-email: info@nils-reimers.de
License: Apache License 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: huggingface-hub, numpy, Pillow, scikit-learn, scipy, torch, tqdm, transformers
Required-by: llama-index-finetuning


In [None]:
!pip install sentence-transformers==2.2.2
%pip install llama-index-embeddings-instructor
!pip install llama-index

# Pin sentence-transformers to 2.2.2 to work around the following error:
# TypeError: INSTRUCTOR._load_sbert_model() got an unexpected keyword argument 'token'
# ref: https://github.com/run-llama/llama_index/issues/11037#issuecomment-1954720330

In [None]:
from llama_index.embeddings.instructor import InstructorEmbedding

embed_model = InstructorEmbedding(model_name="hkunlp/instructor-base")

  from tqdm.autonotebook import trange
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



load INSTRUCTOR_Transformer
max_seq_length  512


In [None]:
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])

768
[0.021553607657551765, -0.06098218262195587, 0.01796206459403038, 0.05490903556346893, 0.015269058756530285]


### Base HuggingFace Embeddings

In [None]:
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20.7M  100 20.7M    0     0   427k      0  0:00:49  0:00:49 --:--:--  449k


In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings

documents = SimpleDirectoryReader(
    input_files=["IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()

In [None]:
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
test_embeds = embed_model.get_text_embedding("Hello World!")

Settings.embed_model = embed_model


In [None]:
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(documents, show_progress=True)

Parsing nodes:   0%|          | 0/172 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/459 [00:00<?, ?it/s]

9.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
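As far as I can tell, `%%timeit` runs the cell body in its own scope, so the `index` built in the timed cell is not available afterwards, which would explain why the index is rebuilt in the next cell. A minimal alternative sketch that times a call once and keeps its result is shown here; `timed` is a hypothetical helper, and `sum` stands in for the index build.

```python
import time


def timed(fn, *args, **kwargs):
    """Hypothetical helper: call fn once, return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this could be the index build instead.
result, elapsed = timed(sum, [1, 2, 3])
print(result, f"{elapsed:.6f} s")
```

With a helper like this, something along the lines of `index, seconds = timed(VectorStoreIndex.from_documents, documents, show_progress=True)` would time the build and keep the index in a single pass.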


In [None]:
index = VectorStoreIndex.from_documents(documents, show_progress=True)

Parsing nodes:   0%|          | 0/172 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/459 [00:00<?, ?it/s]

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
print(response)

The document discusses various scientific studies and research findings related to climate change, biodiversity, marine ecosystems, sustainable development, and the impacts of environmental changes on different species and ecosystems. It also covers topics such as adaptation strategies, governance of high-seas resources, social and ecological risks, and the interactions between different environmental factors.


## Embedding from HuggingFace Inference API

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceInferenceAPIEmbedding
# the import itself succeeds

In [None]:
# embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
embed_model = HuggingFaceInferenceAPIEmbedding(model_name="BAAI/bge-small-en-v1.5")
test_embeds = embed_model.get_text_embedding("Hello World!")

RuntimeError: asyncio.run() cannot be called from a running event loop
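The error above is typical of notebooks: Colab already runs an event loop in the main thread, and the synchronous embedding call appears to invoke `asyncio.run` internally, which is not allowed while a loop is running. One possible workaround, sketched below with the standard library only, is to run the coroutine on a fresh loop in a worker thread. `run_in_thread` and `fake_embedding_call` are hypothetical names; the latter merely stands in for an async API request.

```python
import asyncio
import threading


def run_in_thread(coro):
    """Run a coroutine on a fresh event loop in a worker thread and return its result."""
    box = {}

    def worker():
        # The worker thread has no running loop, so asyncio.run is allowed here.
        box["value"] = asyncio.run(coro)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return box["value"]


async def fake_embedding_call(text):
    """Hypothetical stand-in for an async API request."""
    await asyncio.sleep(0)
    return [float(len(text))]


vector = run_in_thread(fake_embedding_call("Hello World!"))
print(vector)
```

Another commonly used option in notebooks is calling `nest_asyncio.apply()` once before the embedding call, which patches the running loop to allow nested `asyncio.run` calls.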

In [None]:
HuggingFaceInferenceAPIEmbedding(model_name="BAAI/bge-small-en-v1.5").get_text_embedding

In [None]:
print(len(test_embeds))
test_embeds[:10]
# it works!!!

384


[-0.003275663824751973,
 -0.011690725572407246,
 0.04155917093157768,
 -0.03814810514450073,
 0.02418305166065693,
 0.013644285500049591,
 0.0111179044470191,
 0.04811961576342583,
 0.02140955626964569,
 0.014174910262227058]

## LangChain Embeddings

Ref: https://docs.llamaindex.ai/en/stable/examples/embeddings/Langchain.html

In [None]:
%pip install llama-index-embeddings-langchain
!pip install llama-index

from google.colab import output
output.clear()

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings.langchain import LangchainEmbedding

lc_embed_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
embed_model = LangchainEmbedding(lc_embed_model)


In [None]:
# Basic embedding example
embeddings = embed_model.get_text_embedding(
    "It is raining cats and dogs here!"
)
print(len(embeddings), embeddings[:10])

768 [-0.005906173028051853, 0.04911916330456734, -0.04757879301905632, -0.04320327565073967, 0.02837086096405983, -0.01737167499959469, -0.04422018676996231, -0.01903551258146763, 0.049416132271289825, -0.038391221314668655]


## OpenAI Embeddings

Ref: https://docs.llamaindex.ai/en/stable/examples/embeddings/OpenAI.html

In [None]:
%pip install llama-index-embeddings-openai
!pip install llama-index

from google.colab import output
output.clear()

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

embed_model = OpenAIEmbedding(embed_batch_size=10)
Settings.embed_model = embed_model

In [None]:
# get API key and create embeddings
from llama_index.embeddings.openai import OpenAIEmbedding

# embed_model = OpenAIEmbedding(model="text-embedding-3-large")  # gives 3072 dimensions
embed_model = OpenAIEmbedding(model="text-embedding-3-small")  # gives 1536 dimensions
# embed_model = OpenAIEmbedding(model="text-embedding-3-large", dimensions=512)  # gives 512 dimensions

embeddings = embed_model.get_text_embedding(
    "Open AI new Embeddings models is great."
)

In [None]:
print(len(embeddings))

1536
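The `dimensions` option in the commented-out line above requests a shorter vector from the API. If a full-length embedding is shortened by hand instead, it generally needs to be re-normalized to unit length afterwards. The following is a minimal sketch of that idea with a toy vector; `shorten_embedding` is a hypothetical helper, and it is not claimed to reproduce exactly what the API does internally.

```python
import math


def shorten_embedding(vec, dims):
    """Hypothetical helper: keep the first `dims` values and rescale to unit length."""
    truncated = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        return truncated
    return [x / norm for x in truncated]


# Toy vector (made-up values, not a real embedding).
short = shorten_embedding([3.0, 4.0, 12.0], dims=2)
print(short)
```

Re-normalizing keeps cosine and dot-product comparisons consistent between shortened vectors.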


In [None]:
display(embeddings[:10])

## Quantized Model - LlamaCPP

Models:
- TheBloke/Mistral-7B-Instruct-v0.2-GGUF

References:
- https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html

In [None]:
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-llama-cpp
!pip install llama-index

from google.colab import output
output.clear()

In [None]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

In [None]:
!huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf to /root/.cache/huggingface/hub/tmpf_4quad_
mistral-7b-instruct-v0.2.Q4_K_M.gguf: 100% 4.37G/4.37G [00:33<00:00, 132MB/s]
./mistral-7b-instruct-v0.2.Q4_K_M.gguf


In [None]:
llm = LlamaCPP(
    # You can pass in the URL to a GGML model to download it automatically
    # model_url="https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin",
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path="/content/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    temperature=0.1,
    max_new_tokens=256,
    # the model metadata reports a 32768-token context length; 3900 is carried over
    # from the Llama 2 example and keeps the KV cache small
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 1},
    # format inputs with the Llama 2 prompt helpers
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /content/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:         
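The cell above passes prompt helpers written for the Llama 2 chat format while loading a Mistral instruct model. Mistral's instruct models are documented as using `[INST] ... [/INST]` markers, so a model-specific formatter might look like the sketch below. Both functions are hypothetical illustrations that take plain strings and `(role, content)` tuples rather than llama-index message objects, and the sketch assumes the tokenizer adds the beginning-of-sequence token itself.

```python
def mistral_completion_to_prompt(completion: str) -> str:
    """Hypothetical formatter: wrap a single instruction in Mistral instruct markers."""
    return f"[INST] {completion.strip()} [/INST]"


def mistral_messages_to_prompt(messages) -> str:
    """Hypothetical formatter for a list of (role, content) tuples."""
    prompt = ""
    for role, content in messages:
        if role == "user":
            prompt += f"[INST] {content.strip()} [/INST]"
        elif role == "assistant":
            # Completed assistant turns are closed with the end-of-sequence marker.
            prompt += f" {content.strip()}</s>"
    return prompt


single = mistral_completion_to_prompt("Can you write me a poem about fast cars?")
chat = mistral_messages_to_prompt(
    [("user", "Hi"), ("assistant", "Hello!"), ("user", "Write a poem")]
)
print(single)
print(chat)
```

Functions with the signatures that `LlamaCPP` expects could then be supplied through its `messages_to_prompt` and `completion_to_prompt` arguments in place of the Llama 2 helpers.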

In [None]:
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)


llama_print_timings:        load time =    1704.74 ms
llama_print_timings:      sample time =     164.91 ms /   256 runs   (    0.64 ms per token,  1552.36 tokens per second)
llama_print_timings: prompt eval time =    1704.45 ms /    76 tokens (   22.43 ms per token,    44.59 tokens per second)
llama_print_timings:        eval time =  177787.95 ms /   255 runs   (  697.21 ms per token,     1.43 tokens per second)
llama_print_timings:       total time =  180748.96 ms /   331 tokens


 Certainly! Here's a light-hearted poem about the friendship between cats and dogs:

In a world where fur meets fur,
Where playful paws and purrs do stir,
There's an unlikely bond that's pure,
Between the cat and the dog, it's true allure.

The cat with grace and elegance,
Sleek and slender, with a gentle sense,
And the dog with heart so vast and wide,
In their differences, they find their stride.

The cat with eyes that gleam and glint,
In the sunbeam's warm and gentle hint,
And the dog with wagging tail so bright,
Basking in the joy of day and night.

They frolic in the fields of green,
Chasing butterflies in a serene scene,
And when the day is through and night descends,
They curl up close, their bond never ends.

So here's to cats and dogs, so different yet the same,
In their unique and beautiful, wondrous game,
May their friendship be a source of endless delight,
A testament to love


In [None]:
response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)

Llama.generate: prefix-match hit


 In the realm where the asphalt meets the sky,
Where horsepower reigns and time seems to fly,
Lie the kings of the road, the fast cars, so sly,
Their engines roaring, their tires never shy.

Through the curves they dance with grace and might,
Their frames sculpted by the hands of skilled artisans,
A symphony of speed in the cool of the night,
Their headlights cutting through the darkness like swans.

With every rev of their powerful hearts,
They leave the world behind in their wake,
A blur of color, a work of art,
Their beauty and power, an intoxicating cake.

So if you're ever feeling small and lost,
Just close your eyes and let the roar of the engine be your guide,
Let the wind whip through your hair, your heart unthawed,
And let the fast cars take you on a wild, exhilarating ride.


llama_print_timings:        load time =    1704.74 ms
llama_print_timings:      sample time =     133.60 ms /   216 runs   (    0.62 ms per token,  1616.81 tokens per second)
llama_print_timings: prompt eval time =    8131.57 ms /    14 tokens (  580.83 ms per token,     1.72 tokens per second)
llama_print_timings:        eval time =  147136.95 ms /   215 runs   (  684.36 ms per token,     1.46 tokens per second)
llama_print_timings:       total time =  156752.99 ms /   229 tokens


## RAG Implementation: Chatting to Files Using Quantized Mistral-7B

https://www.youtube.com/watch?v=1mH1BvBJCl0&list=WL&index=216

In [None]:
!pip install -q pypdf python-dotenv transformers
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
%pip install llama-index-llms-huggingface
!pip install -q llama-index
!pip install -q sentence-transformers
!pip install langchain langchain-community

from google.colab import output
output.clear()

In [None]:
import torch

In [None]:
import logging
import sys

# note the lowercase "b": logging.basicConfig, not logging.BasicConfig
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
# Fix: https://stackoverflow.com/questions/77984729/importerror-cannot-import-name-vectorstoreindex-from-llama-index-unknown-l

In [None]:
documents = SimpleDirectoryReader("/content/Data").load_data()

In [None]:
len(documents)
documents[0]

Document(id_='fb5e7b71-89bc-4651-b2ec-f8cf05bcb84a', embedding=None, metadata={'file_path': '/content/Data/ds.txt', 'file_name': '/content/Data/ds.txt', 'file_type': 'text/plain', 'file_size': 1542, 'creation_date': '2024-03-03', 'last_modified_date': '2024-03-03'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Perkembangan pesat di bidang teknologi diikuti pula oleh perkembangan jumlah data. Hal ini terjadi di berbagai bidang ilmu. Oleh karena itu dibutuhkan suatu disiplin ilmu yang dapat digunakan untuk memproses jumlah data yang semakin masif. Disiplin ilmu tersebut adalah data science (Jifo & Lingling, 2014). Data science merupakan cabang ilmu yang mempelajari berbagai metode dan teknik yang dapat digunakan untuk menarik manfaat dari data. Beberapa

In [None]:
from llama_index.core.llms import LLM
from llama_index.core.llms.chatml_utils import messages_to_prompt, completion_to_prompt
from llama_index.llms.huggingface import HuggingFaceLLM

In [None]:
import torch

# from llama_index.llms import LlamaCPP
# from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import messages_to_prompt, completion_to_prompt

llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url='https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf',
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # the model metadata reports a 32768-token context length; 3900 is carried over
    # from the Llama 2 example and keeps the KV cache small
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # n_gpu_layers=-1 offloads all layers to the GPU
    model_kwargs={"n_gpu_layers": -1},
    # format inputs with the Llama 2 prompt helpers
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)


Downloading url https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf to path /tmp/llama_index/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf
total size (MB): 4368.44


4167it [00:36, 113.35it/s]                         
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /tmp/llama_index/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
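
The `context_window` and `max_new_tokens` values passed to `LlamaCPP` together bound how much text (prompt template, retrieved chunks, and question) can be sent to the model. A minimal sketch of that arithmetic, assuming the prompt and the generated tokens share a single window; the helper name `prompt_budget` is hypothetical and not part of LlamaIndex:

```python
def prompt_budget(context_window: int, max_new_tokens: int) -> int:
    """Return the number of tokens left for the prompt after
    reserving room for the generated completion."""
    if context_window <= 0 or max_new_tokens < 0:
        raise ValueError("context_window must be positive and max_new_tokens non-negative")
    if max_new_tokens >= context_window:
        raise ValueError("max_new_tokens must be smaller than context_window")
    return context_window - max_new_tokens


# Values used in the LlamaCPP configuration above
budget = prompt_budget(context_window=3900, max_new_tokens=256)
print(budget)  # 3644
```

Anything beyond this budget has to be truncated or spread over several LLM calls, which is one reason to keep the retrieved chunks small.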

In [None]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index.embeddings.langchain import LangchainEmbedding

# Wrap a LangChain HuggingFace embedding model so LlamaIndex can use it
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="thenlper/gte-large")
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
from llama_index.core import ServiceContext

# ServiceContext is deprecated in llama-index 0.10+ in favor of Settings
service_context = ServiceContext.from_defaults(
    chunk_size=256,
    llm=llm,
    embed_model=embed_model
)
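
`chunk_size=256` asks LlamaIndex to split each document into pieces of at most about 256 tokens before embedding them. The idea can be sketched roughly as follows, assuming whitespace-separated words stand in for tokens. LlamaIndex's real splitter uses a tokenizer and tries to respect sentence boundaries, so this is only an illustration; `chunk_words` is a hypothetical helper:

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


sample = "Data science studies methods for extracting value from data"
for chunk in chunk_words(sample, chunk_size=4):
    print(chunk)
```

Smaller chunks give more precise retrieval hits but carry less surrounding context into the prompt.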


In [None]:
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
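
Conceptually, `VectorStoreIndex` embeds every chunk and, at query time, returns the chunks whose embeddings are most similar to the query embedding. A toy sketch of that retrieval step using cosine similarity on hand-made 3-dimensional vectors; the vectors, labels, and helper names are made up for illustration, and the real embeddings come from the `thenlper/gte-large` model:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the labels of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Hypothetical chunk embeddings
toy_chunks = {
    "ml": [0.9, 0.1, 0.0],
    "ds": [0.6, 0.4, 0.0],
    "other": [0.0, 0.1, 0.9],
}
hits = top_k([1.0, 0.0, 0.0], toy_chunks, k=2)
print(hits)  # ['ml', 'ds']
```

The retrieved chunks are then placed into the prompt that the query engine sends to the LLM.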

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("What is Machine Learning?")
print(response)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     668.32 ms
llama_print_timings:      sample time =     143.88 ms /   205 runs   (    0.70 ms per token,  1424.75 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    5710.67 ms /   205 runs   (   27.86 ms per token,    35.90 tokens per second)
llama_print_timings:       total time =    6880.19 ms /   206 tokens


 Machine learning (ML) is a subfield of Artificial Intelligence (AI) that involves using algorithms to systematically combine relationships between data and information (Awad & Khanna, 2015). It is one of the methods that can be used to process data. In machine learning, there are three types or methods of learning, which are supervised learning, unsupervised learning, and reinforcement learning (Colliot, 2023). Supervised learning involves focusing on mapping data input X to data label as output y. An algorithm or machine learning model will learn the patterns in the training data that consists of pairs of input X and output y. Unsupervised learning involves training a model to learn patterns in the data without any label to serve as a reference for the model to learn from (Colliot, 2023). Reinforcement learning focuses on maximizing the reward obtained based on the conditions of the learning environment (Colliot, 2023).
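
The `llama_print_timings` lines printed before the answer report elapsed time and token counts per stage. As a sanity check, throughput can be recomputed from them. The sketch below parses one such line; the regular expression and `parse_timing` are assumptions based on the log layout shown in this notebook, not a llama.cpp API:

```python
import re

# Matches e.g. "llama_print_timings:  eval time =  5710.67 ms /  205 runs (...)"
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<stage>[\w ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timing(line: str):
    """Return (stage, milliseconds, tokens_per_second or None) for a timing line,
    or None if the line is not a timing line."""
    match = TIMING_RE.search(line)
    if match is None:
        return None
    ms = float(match["ms"])
    count = match["count"]
    tps = int(count) / (ms / 1000) if count is not None and ms > 0 else None
    return match["stage"], ms, tps


line = ("llama_print_timings:        eval time =    5710.67 ms /   205 runs"
        "   (   27.86 ms per token,    35.90 tokens per second)")
stage, ms, tps = parse_timing(line)
print(stage, ms, f"{tps:.2f}")  # eval 5710.67 35.90
```

The recomputed value agrees with the 35.90 tokens per second that llama.cpp reports for the generation stage.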


In [None]:
query_engine = index.as_query_engine()
# Indonesian query: "What are the subfields of artificial intelligence?"
response = query_engine.query("Apa yang menjadi subbidang dari kecerdasan buatan?")
print(response)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     668.32 ms
llama_print_timings:      sample time =      43.45 ms /    77 runs   (    0.56 ms per token,  1772.19 tokens per second)
llama_print_timings: prompt eval time =     877.82 ms /   626 tokens (    1.40 ms per token,   713.13 tokens per second)
llama_print_timings:        eval time =    2121.67 ms /    76 runs   (   27.92 ms per token,    35.82 tokens per second)
llama_print_timings:       total time =    3288.45 ms /   702 tokens


 Based on the provided context information, subbidang dari kecerdasan buatan adalah matematika dan statistika, ilmu komputer, dan domain expert atau ahli khusus. These three disciplines form the basis of data science, which involves various techniques for processing, analyzing, and interpreting data to extract useful insights.


In [None]:
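query_engine = index.as_query_engine()
# Indonesian query: "Answer using the given text. What is the difference between ensemble learning and machine learning?"
response = query_engine.query("Jawab menggunakan teks yang diberikan. Apa perbedaan ensemble learning dan machine learning?")
print(response)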

Llama.generate: prefix-match hit

llama_print_timings:        load time =     668.32 ms
llama_print_timings:      sample time =      79.60 ms /   114 runs   (    0.70 ms per token,  1432.21 tokens per second)
llama_print_timings: prompt eval time =     273.27 ms /    36 tokens (    7.59 ms per token,   131.74 tokens per second)
llama_print_timings:        eval time =    3260.51 ms /   113 runs   (   28.85 ms per token,    34.66 tokens per second)
llama_print_timings:       total time =    4151.36 ms /   149 tokens


 Ensemble learning is a subfield of machine learning that involves combining multiple models to create a more accurate and robust model. Machine learning, on the other hand, refers to the process of training models to make predictions or decisions based on data. Ensemble learning can be used for both training and inference processes, while machine learning typically only involves training processes. In ensemble learning, weak learners (models that are simple and easy to train) are combined to create a stronger model, while in machine learning, models are trained independently and then combined to create an ensemble.


## References

- https://twitter.com/llama_index/status/1762158562657374227
- Talk to Your Documents, Powered by Llama-Index
  - https://www.youtube.com/watch?v=WL7V9JUy2sE&list=WL&index=214&pp=gAQBiAQB
- RAG Implementation Medical Chatbot with Mistral 7B LLM LlamaIndex GTE Colab Demo
  - https://www.youtube.com/watch?v=1mH1BvBJCl0&list=WL&index=215&pp=gAQBiAQB
- Building A RAG System with Gemma, MongoDB and Open Source Models
  - https://huggingface.co/learn/cookbook/rag_with_hugging_face_gemma_mongodb
- Effortless Company Research Using Open Source — Using Llama Index, Huggingface Embeddings & Llama 2 LLM On News
  - https://medium.com/scrapehero/effortless-company-research-using-open-source-using-llama-index-huggingface-embeddings-llama-1725a60da117
- https://docs.llamaindex.ai/en/stable/examples/llm/mistralai.html
- https://docs.llamaindex.ai/en/stable/examples/customization/llms/SimpleIndexDemo-Huggingface_camel.html
- Models
  - https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
  - https://huggingface.co/GritLM/GritLM-7B