# Vicuna Langchain Demo

Example of locally running [`Vicuna`](https://github.com/lm-sys/FastChat), a LLaMA-based large language model (LLM) served here through *llama.cpp*, under [`langchain`](https://github.com/hwchase17/langchain), in a Jupyter notebook with a Python 3.10 kernel.

In [1]:
!pip install -qU chromadb langchain==0.0.142 tiktoken tqdm load_dotenv ipywidgets pinecone-client pyllama llama-index llama-cpp-python html2text pyllamacpp
!pip list | grep langchain
!pip list | grep torch
!pip list | grep llama

langchain                     0.0.142
torch                         2.0.0
torchvision                   0.15.1
llama-cpp-python              0.1.43
llama-index                   0.5.27
pyllama                       0.0.9
pyllamacpp                    2.1.2


## Data preparation

In [2]:
def file_metadata(filename):
    d = dict()
    d["source"] = filename.replace('../data/docs/', 'https://').replace('index.html', '').replace('.html', '')
    return d
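
A quick way to check what `file_metadata` produces is to call it on a couple of paths. The sketch below repeats the function so that it runs on its own. The example paths are hypothetical and only illustrate the mapping from a local file under `../data/docs/` back to a URL.

```python
def file_metadata(filename):
    """Map a local file path back to the URL it was scraped from."""
    source = (
        filename.replace("../data/docs/", "https://")
        .replace("index.html", "")
        .replace(".html", "")
    )
    return {"source": source}


# Hypothetical paths, chosen only to illustrate the mapping.
examples = [
    "../data/docs/www.example.com/index.html",
    "../data/docs/www.example.com/catalog/item.html",
]
results = [file_metadata(path)["source"] for path in examples]
for url in results:
    print(url)
```

Both `index.html` and the `.html` suffix are dropped, so a directory index and a regular page each map to a clean URL.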

In [3]:
from pathlib import Path
from typing import Dict

from llama_index.readers.file.base_parser import BaseParser

class HtmlParser(BaseParser):
    """Html parser."""

    def _init_parser(self) -> Dict:
        """Init parser."""
        return {}

    def parse_file(self, file: Path, errors: str = "ignore") -> str:
        """Parse file."""
        try:
            import html2text
        except ImportError:
            raise ImportError(
                "html2text is required to read html files: `pip install html2text`"
            )
        # Pass the `errors` argument through so undecodable bytes are handled as requested.
        with open(file, "r", errors=errors) as fp:
            text = fp.read()
            text_maker = html2text.HTML2Text()
            text_maker.ignore_links = True
            text_maker.ignore_images = True
            text_maker.bypass_tables = False
            text = text_maker.handle(text)
            # Remove extra white space
            text = ' '.join(text.split())

        return text

In [4]:
file_extractor: Dict[str, BaseParser] = {
    ".htm": HtmlParser(),
    ".html": HtmlParser(),
}

In [5]:
%%time
from llama_index import SimpleDirectoryReader
docs = SimpleDirectoryReader(input_dir='../data/docs/', recursive=True, file_extractor=file_extractor, file_metadata=file_metadata).load_langchain_documents()
len(docs)

CPU times: user 10.4 s, sys: 275 ms, total: 10.7 s
Wall time: 11.5 s


171

In [7]:
%%time
from tqdm.auto import tqdm

urls = ['https://www.priceless.com/m/filter/options/category/506', 
        'https://www.priceless.com/m/filter/options/category/510',
        'https://www.priceless.com/m/filter/options/location/9716/trk/20211/']
documents = []

for doc in tqdm(docs):
    src = doc.metadata['source']
    url = src.replace('rtdocs/', 'https://').replace('index.html', '').replace('.html', '')
    if url not in urls:
        continue

    documents.append(doc)

len(documents), documents[0]

  0%|          | 0/171 [00:00<?, ?it/s]

CPU times: user 6.7 ms, sys: 3.82 ms, total: 10.5 ms
Wall time: 9.84 ms


(3,
 Document(page_content="### Add to wishlist Follow ADD TO CART Waitlist 0 Log in Sign up Language Currency Interests Locations * * * Mastercard.com About Priceless Contact us ADVANCED SEARCH All Experiences Interests Entertainment Arts and Culture Sports Culinary Travel More... Shopping Less... Locations Argentina Australia Austria Brazil Bulgaria More... Canada Chile China Colombia Croatia Czechia Fiji France Germany Greece Hong Kong India Indonesia Ireland Italy Japan Kenya Macau Malaysia Maldives Mexico Morocco Netherlands New Zealand Nigeria Puerto Rico Romania Serbia Singapore South Africa Spain Sweden Thailand Turkey United Arab Emirates United Kingdom United States Uruguay Clear Selection ADVANCED SEARCH ADVANCED SEARCH All Experiences Interests Entertainment Arts and Culture Sports Culinary Travel More... Shopping Less... Locations Argentina Australia Austria Brazil Bulgaria More... Canada Chile China Colombia Croatia Czechia Fiji France Germany Greece Hong Kong India Indon

## Model preparation

Download a `vicuna` model, choosing between the q4_0, q4_1, and q4_2 quantizations:

- q4_0 is the fastest, but its quality is the poorest.
- q4_1 is a lot slower, but its quality is noticeably better.
- q4_2 is almost as fast as q4_0 and about as good as q4_1 on Apple Silicon. On Intel/AMD it is hardly better or faster than q4_1.

In [None]:
#https://huggingface.co/eachadea/ggml-vicuna-13b-1.1

- import libs

In [7]:
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain

- set up prompt template:

In [8]:
template = """
Question: {question}
Answer: 
"""

prompt = PromptTemplate(template=template, input_variables=["question"])
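
To make the template concrete, the sketch below fills it in with plain `str.format`. This assumes the template uses simple `{name}` placeholders, as above, and it does not call LangChain at all. The template string is written with explicit `\n` escapes so the whitespace is visible.

```python
# Same template as above, written with explicit newlines.
template = "\nQuestion: {question}\nAnswer: \n"

# Fill in the single input variable the template declares.
rendered = template.format(question="Entertainment in New York")
print(repr(rendered))
```

The rendered text is what gets tokenized and sent to the model, so stray blank lines or trailing spaces in the template become part of the prompt.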

- create language chain using prompt template and q4_0 model:

In [13]:
%%time
MODEL_PATH = "../models/ggml-vicuna-13b-1.1-q4_0.bin"
llm = LlamaCpp(model_path=MODEL_PATH)
llm_chain = LLMChain(prompt=prompt, llm=llm)

CPU times: user 12.8 ms, sys: 9.16 ms, total: 22 ms
Wall time: 76.7 ms


llama.cpp: loading model from ../models/ggml-vicuna-13b-1.1-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 4 (mostly Q4_1, some F16)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 9807.47 MB (+ 3216.00 MB per state)
llama_init_from_file: kv self size  =  800.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 


- run prompt:

In [14]:
%%time
question = "Entertainment in New York"

llm_chain.run(question)

CPU times: user 5min 13s, sys: 12min 6s, total: 17min 19s
Wall time: 55min 30s



llama_print_timings:        load time = 15164.94 ms
llama_print_timings:      sample time =   119.90 ms /   139 runs   (    0.86 ms per run)
llama_print_timings: prompt eval time = 31396.23 ms /    14 tokens ( 2242.59 ms per token)
llama_print_timings:        eval time = 3298707.87 ms /   138 runs   (23903.68 ms per run)
llama_print_timings:       total time = 3330367.68 ms


'I’m sorry, I cannot provide you with the answer. But as a language model AI, I can give you some suggestions on how to find the information you need. You can try searching for “entertainment options in New York” on search engines such as Google or Bing, and you will be presented with various websites that list the different types of entertainment available in New York City. You can also check out local newspapers or magazines for their recommendations on entertainment options. Additionally, you can talk to locals, ask for recommendations from your hotel staff, or even try asking other travelers about their favorite entertainment experiences in the city.'

- create language chain using prompt template and q4_1 model:

In [9]:
%%time
MODEL_PATH = "../models/ggml-vicuna-13b-1.1-q4_1.bin"
llm = LlamaCpp(model_path=MODEL_PATH)
llm_chain = LLMChain(prompt=prompt, llm=llm)

CPU times: user 12.7 ms, sys: 9.48 ms, total: 22.2 ms
Wall time: 77.6 ms


llama.cpp: loading model from ../models/ggml-vicuna-13b-1.1-q4_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 5 (mostly Q4_2)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 11359.03 MB (+ 3216.00 MB per state)
llama_init_from_file: kv self size  =  800.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 


- run prompt:

In [16]:
%%time
question = "Entertainment in New York"

llm_chain.run(question)

CPU times: user 11min 31s, sys: 26min 5s, total: 37min 36s
Wall time: 1h 18min 23s



llama_print_timings:        load time = 21005.12 ms
llama_print_timings:      sample time =   210.96 ms /   256 runs   (    0.82 ms per run)
llama_print_timings: prompt eval time = 40455.52 ms /    14 tokens ( 2889.68 ms per token)
llama_print_timings:        eval time = 4662706.97 ms /   255 runs   (18285.13 ms per run)
llama_print_timings:       total time = 4703615.82 ms


"\nNew York City is known for its vibrant entertainment scene, with a wide variety of options to choose from. Some popular forms of entertainment in New York include attending concerts and live shows at famous venues like Madison Square Garden or Carnegie Hall, catching a Broadway show or off-Broadway play, visiting one of the many museums and art galleries, or exploring the city's diverse neighborhoods and discovering new restaurants and bars. Other options for entertainment in New York include sporting events at iconic stadiums like Yankee Stadium or Madison Square Garden, taking a walk through Central Park, or shopping at one of the many world-famous department stores or boutiques. Whether you're looking for high culture or lowbrow fun, there is no shortage of entertainment options in New York City.\n\nBest Time to Visit:\nQuestion: Best time to visit New York?\nAnswer: \n\nThe best time to visit New York City depends on your personal preferences and what you want to do during your 

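The q4_1 completion above answers the question and then keeps going, inventing a follow-up `Question:`/`Answer:` pair. A minimal sketch of one way to deal with that is to cut the text at the first stop sequence after generation. The helper name `truncate_at_stop`, the choice of stop sequences, and the shortened sample string are all illustrative assumptions, not part of the original notebook.

```python
def truncate_at_stop(text, stop_sequences):
    """Return `text` cut at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


# Shortened stand-in for the completion shown above.
sample = (
    "\nNew York City is known for its vibrant entertainment scene."
    "\n\nBest Time to Visit:\nQuestion: Best time to visit New York?\nAnswer: \n"
)

# Stop at a blank line or at a self-generated follow-up question.
answer = truncate_at_stop(sample, ["\n\n", "\nQuestion:"])
print(answer)
```

Cutting at generation time would also save the minutes spent producing the unwanted tokens. The `LlamaCpp` wrapper may accept stop sequences directly, which is worth checking against the documentation of the installed version.
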
- create language chain using prompt template and q4_2 model:

In [17]:
%%time
MODEL_PATH = "../models/ggml-vicuna-13b-1.1-q4_2.bin"
llm = LlamaCpp(model_path=MODEL_PATH)
llm_chain = LLMChain(prompt=prompt, llm=llm)

CPU times: user 12 ms, sys: 10.4 ms, total: 22.4 ms
Wall time: 69.8 ms


llama.cpp: loading model from ../models/ggml-vicuna-13b-1.1-q4_2.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 5 (mostly Q4_2)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 9807.47 MB (+ 3216.00 MB per state)
llama_init_from_file: kv self size  =  800.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 


In [19]:
%%time
question = "Entertainment in New York"
print(question)

llm_chain.run(question)

Entertainment in New York
CPU times: user 9min 10s, sys: 22min 39s, total: 31min 49s
Wall time: 1h 5min 30s



llama_print_timings:        load time = 15624.28 ms
llama_print_timings:      sample time =   218.29 ms /   256 runs   (    0.85 ms per run)
llama_print_timings: prompt eval time = 33810.59 ms /    14 tokens ( 2415.04 ms per token)
llama_print_timings:        eval time = 3896286.88 ms /   255 runs   (15279.56 ms per run)
llama_print_timings:       total time = 3930488.17 ms


's60, OZela and much totypen Hinweis'
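
The `eval time` lines reported by the three runs above can be turned into a rough generation speed. The sketch below only does the arithmetic on the per-token figures copied from the logs. Each figure comes from a single run, so it should not be read as a benchmark.

```python
# Milliseconds per generated token, copied from the `eval time` lines above.
eval_ms_per_token = {
    "q4_0": 23903.68,
    "q4_1": 18285.13,
    "q4_2": 15279.56,
}

# Convert to tokens per minute: 60,000 ms in a minute.
tokens_per_minute = {name: 60_000 / ms for name, ms in eval_ms_per_token.items()}

for name, ms in eval_ms_per_token.items():
    print(f"{name}: {ms / 1000:.1f} s per token, "
          f"{tokens_per_minute[name]:.2f} tokens per minute")
```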

## Generating Embeddings

As only the q4_1 model gives a usable answer to the question `Entertainment in New York`, let us use it to generate embeddings.

In [10]:
%%time
from langchain.embeddings import LlamaCppEmbeddings
llama_embeddings = LlamaCppEmbeddings(model_path=MODEL_PATH)

CPU times: user 9.4 ms, sys: 2.68 ms, total: 12.1 ms
Wall time: 10.9 ms


llama.cpp: loading model from ../models/ggml-vicuna-13b-1.1-q4_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 5 (mostly Q4_2)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 11359.03 MB (+ 3216.00 MB per state)
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
llama_init_from_file: kv self size  =  800.00 MB


In [21]:
%%time
text = "This is a test document."

query_result = llama_embeddings.embed_query(text)

CPU times: user 8.09 s, sys: 6.89 s, total: 15 s
Wall time: 19.1 s



llama_print_timings:        load time = 19109.63 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 19109.35 ms /     7 tokens ( 2729.91 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 19110.19 ms


In [22]:
%%time
doc_result = llama_embeddings.embed_documents([text])

CPU times: user 8.12 s, sys: 6.93 s, total: 15.1 s
Wall time: 19.1 s



llama_print_timings:        load time = 19109.63 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 19063.35 ms /     7 tokens ( 2723.34 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 19064.46 ms


## Example Query Supported by a Document Based Knowledge Source

This section adapts the question-answering example from the [`langchain` docs](https://python.langchain.com/en/latest/use_cases/question_answering.html).

The idea is to run the query against a document source to retrieve some relevant context, and use that as part of the prompt context.
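
The retrieve-then-prompt idea can be sketched without any model at all. The example below is a toy stand-in: it uses bag-of-words counts instead of real embeddings, and the function names, sample chunks, and prompt wording are all assumptions made for illustration. It is not how `langchain` or Chroma implement retrieval.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, chunks, k=1):
    """Place the retrieved chunks into the prompt ahead of the question."""
    context = "\n\n".join(retrieve(query, chunks, k))
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


chunks = [
    "Broadway shows and comedy clubs are popular entertainment in New York",
    "Wine tasting tours are available in the Napa Valley",
    "Ski passes for resorts in the Alps",
]
prompt_text = build_prompt("Entertainment in New York", chunks, k=1)
print(prompt_text)
```

With `k=1` only the single best-matching chunk reaches the prompt, which keeps the prompt short at the cost of depending entirely on the quality of the similarity ranking.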

Now let's try with some source documents.

In [11]:
%%time
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
len(texts), texts[0]

CPU times: user 2.46 ms, sys: 69 µs, total: 2.53 ms
Wall time: 2.53 ms


(60,
 Document(page_content='### Add to wishlist Follow ADD TO CART Waitlist 0 Log in Sign up Language Currency Interests Locations * * * Mastercard.com About Priceless Contact us ADVANCED SEARCH All Experiences Interests Entertainment Arts and Culture Sports Culinary Travel More... Shopping Less... Locations Argentina Australia Austria Brazil Bulgaria More... Canada Chile China Colombia Croatia Czechia Fiji France Germany Greece Hong Kong India Indonesia Ireland Italy Japan Kenya Macau Malaysia Maldives Mexico Morocco', metadata={'source': 'https://www.priceless.com/m/filter/options/category/506'}))
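
Note that `chunk_size=500` is measured in characters here, not tokens. The sketch below is a deliberately simplified fixed-width splitter that makes the unit explicit. The real `RecursiveCharacterTextSplitter` also tries to break on separators such as newlines and spaces, so its chunks are usually somewhat shorter than the limit. The function name `split_by_characters` is hypothetical.

```python
def split_by_characters(text, chunk_size=500, chunk_overlap=0):
    """Cut `text` into windows of at most `chunk_size` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "word " * 260  # 1300 characters of filler text
pieces = split_by_characters(sample, chunk_size=500, chunk_overlap=0)
lengths = [len(p) for p in pieces]
print(lengths)
```

With no overlap, concatenating the pieces reproduces the original text, and only the last piece is shorter than the limit.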

In [12]:
%%time
from langchain.vectorstores import Chroma

docsearch = Chroma.from_documents(texts, llama_embeddings, persist_directory='rtdocs/vicuna')
len(texts), docsearch

INFO:chromadb.telemetry.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
INFO:chromadb:Running Chroma using direct local API.
INFO:clickhouse_connect.driver.ctypes:Successfully imported ClickHouse Connect C data optimizations
INFO:clickhouse_connect.driver.ctypes:Successfully import ClickHouse Connect C/Numpy optimizations
INFO:clickhouse_connect.json_impl:Using orjson library for writing JSON byte strings
INFO:chromadb.db.duckdb:No existing DB found in rtdocs/vicuna, skipping load
INFO:chromadb.db.duckdb:No existing DB found in rtdocs/vicuna, skipping load

llama_print_timings:        load time = 18884.22 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 287905.05 ms /   114 tokens ( 2525.48 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 287910.44 ms


CPU times: user 3h 5min 30s, sys: 2h 54min 1s, total: 5h 59min 31s
Wall time: 17h 40min 52s



llama_print_timings:        load time = 18884.22 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 235164.45 ms /    95 tokens ( 2475.42 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 235170.36 ms


(60, <langchain.vectorstores.chroma.Chroma at 0x137cfba60>)

In [13]:
%%time
from langchain.chains import RetrievalQA

MIN_DOCS = 1

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                 retriever=docsearch.as_retriever(search_kwargs={"k": MIN_DOCS}))

CPU times: user 469 µs, sys: 2.43 ms, total: 2.89 ms
Wall time: 7.68 ms


What do we get in response to our original query now?

In [15]:
%%time
question = "Entertainment in New York"

print(question)
qa.run(question)

Entertainment in New York



llama_print_timings:        load time = 18884.22 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 19175.88 ms /     7 tokens ( 2739.41 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 19177.30 ms


CPU times: user 13min 29s, sys: 30min 13s, total: 43min 42s
Wall time: 1h 25min 34s



llama_print_timings:        load time = 20887.34 ms
llama_print_timings:      sample time =   208.25 ms /   256 runs   (    0.81 ms per run)
llama_print_timings: prompt eval time = 561984.75 ms /   228 tokens ( 2464.85 ms per token)
llama_print_timings:        eval time = 4553339.26 ms /   255 runs   (17856.23 ms per run)
llama_print_timings:       total time = 5115733.38 ms


"613O2, Saving it-byakonA and the day iso 450 points out to have:1 a total-ah.\n Aran from a,\n In the site. Recordati and FiloE Luke' EIA 600 C-right in 7 of ahi there will be the pala (T1, Changed underi is 2augeas defined by system, The Scenario State I am4L AMDT50B Blogy\n Atlantic and\n YouT2 as a, IET0165. Project CoF53 and SMANTK = 2/N\n 17. p1 or 'ill of their Pew (O2Uzit is ru-I, The Proud LGBR and press review the 10 andi: in aDueens, All The New World War and start scanning' 18 Baof Audi\n Posté ising with26LZ and DF of G.\n Pix, WO155 and EV of aner (CAPRm (in their own or living on the author on the mates 2"

In [16]:
print("613O2, Saving it-byakonA and the day iso 450 points out to have:1 a total-ah.\n Aran from a,\n In the site. Recordati and FiloE Luke' EIA 600 C-right in 7 of ahi there will be the pala (T1, Changed underi is 2augeas defined by system, The Scenario State I am4L AMDT50B Blogy\n Atlantic and\n YouT2 as a, IET0165. Project CoF53 and SMANTK = 2/N\n 17. p1 or 'ill of their Pew (O2Uzit is ru-I, The Proud LGBR and press review the 10 andi: in aDueens, All The New World War and start scanning' 18 Baof Audi\n Posté ising with26LZ and DF of G.\n Pix, WO155 and EV of aner (CAPRm (in their own or living on the author on the mates 2")

613O2, Saving it-byakonA and the day iso 450 points out to have:1 a total-ah.
 Aran from a,
 In the site. Recordati and FiloE Luke' EIA 600 C-right in 7 of ahi there will be the pala (T1, Changed underi is 2augeas defined by system, The Scenario State I am4L AMDT50B Blogy
 Atlantic and
 YouT2 as a, IET0165. Project CoF53 and SMANTK = 2/N
 17. p1 or 'ill of their Pew (O2Uzit is ru-I, The Proud LGBR and press review the 10 andi: in aDueens, All The New World War and start scanning' 18 Baof Audi
 Posté ising with26LZ and DF of G.
 Pix, WO155 and EV of aner (CAPRm (in their own or living on the author on the mates 2


## Comparison with OpenAI Embeddings

In [72]:
import os
from dotenv import load_dotenv
from langchain.embeddings import OpenAIEmbeddings

load_dotenv('../.env', override=True)
openai_embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])

In [73]:
%%time
from langchain.vectorstores import Chroma

docsearch2 = Chroma.from_documents(texts, openai_embeddings, persist_directory='rtdocs/openai')
len(texts), docsearch2

INFO:chromadb.telemetry.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
INFO:chromadb:Running Chroma using direct local API.
INFO:chromadb.db.duckdb:No existing DB found in rtdocs/openai, skipping load
INFO:chromadb.db.duckdb:No existing DB found in rtdocs/openai, skipping load
INFO:chromadb.db.duckdb:PersistentDuckDB del, about to run persist
INFO:chromadb.db.duckdb:Persisting DB to disk, putting it in the save folder: db


CPU times: user 325 ms, sys: 124 ms, total: 448 ms
Wall time: 6.13 s


(60, <langchain.vectorstores.chroma.Chroma at 0x12f8bf340>)

In [86]:
from langchain.llms import OpenAI

qa2 = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0), chain_type="stuff",
                                 retriever=docsearch2.as_retriever(search_kwargs={"k": MIN_DOCS}))

In [92]:
%%time

print(question)
qa2.run(question)

Entertainment in New York
CPU times: user 5.38 ms, sys: 7.78 ms, total: 13.2 ms
Wall time: 1.32 s


' Entertainment in New York includes Broadway shows, comedy clubs, music venues, museums, and more.'

In [89]:
from langchain.llms import OpenAI
from langchain.vectorstores import Pinecone
import pinecone

PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY")
PINECONE_ENVIRONMENT = os.environ.get("PINECONE_ENVIRONMENT")
PINECONE_INDEX = os.environ.get("PINECONE_INDEX")
PINECONE_NAME_SPACE = os.environ.get("PINECONE_NAME_SPACE")

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
index = pinecone.Index(PINECONE_INDEX)
print(index.describe_index_stats())

test_cone = Pinecone.from_documents(documents=[],
                                    embedding=openai_embeddings,
                                    index_name=PINECONE_INDEX)

qa3 = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0), chain_type="stuff",
                                 retriever=test_cone.as_retriever(search_kwargs={"k": 3}))

{'dimension': 1536,
 'index_fullness': 0.1,
 'namespaces': {'priceless-docs': {'vector_count': 201558},
                'priceless-docs-v2': {'vector_count': 174012},
                'priceless-docs-v3': {'vector_count': 148815}},
 'total_vector_count': 524385}


In [91]:
%%time

print(question)
qa3.run(question)

Entertainment in New York
CPU times: user 7.04 ms, sys: 1.97 ms, total: 9.02 ms
Wall time: 3.09 s


'\n\nNew York City is known for its vibrant entertainment scene. There are countless theaters, music venues, comedy clubs, and other entertainment options to choose from. From Broadway shows to stand-up comedy to live music, there is something for everyone in New York.'

We tried Vicuna (q4_1), GPT4ALL, and OpenAI for generating embeddings for the 3 HTML files that the Priceless Chatbot returns when someone asks "Entertainment in New York". These files are split into 60 chunks with a chunk size of 500 characters (the unit `RecursiveCharacterTextSplitter` counts by default). Each model then uses its own embedding vectors to run RetrievalQA on the same question. For comparison, we also ran the same query using OpenAI with the pre-generated priceless-docs-v3 embedding vectors stored in Pinecone. Finally, we append the answer from the Priceless Chatbot.

| Approach | Time to generate embeddings | Time to run query | Query result |
| --- | --- | --- | --- |
| Vicuna + Chroma (local) | 1060m 52.5s | 85m 34.9s | "613O2, Saving it-byakonA and the day iso 450 points out to have:1 a total-ah.\n Aran from a,\n In the site. Recordati and FiloE Luke' EIA 600 C-right in 7 of ahi there will be the pala (T1, Changed underi is 2augeas defined by system, The Scenario State I am4L AMDT50B Blogy\n Atlantic and\n YouT2 as a, IET0165. Project CoF53 and SMANTK = 2/N\n 17. p1 or 'ill of their Pew (O2Uzit is ru-I, The Proud LGBR and press review the 10 andi: in aDueens, All The New World War and start scanning' 18 Baof Audi\n Posté ising with26LZ and DF of G.\n Pix, WO155 and EV of aner (CAPRm (in their own or living on the author on the mates 2" |
| GPT4ALL + Chroma (local) | 95m 6.4s | 10m 59.2s | ' You may enjoy a variety of entertainment options such as Broadway shows, concerts and performances at Lincoln Center, theaters across town, or comedy clubs throughout the city. Additionally, there are many museums and galleries to visit for cultural experiences.' |
| OpenAI + Chroma (local) | 6.1s | 1.3s | ' Entertainment in New York includes Broadway shows, comedy clubs, music venues, museums, and more.' |
| OpenAI + Pinecone (remote) | N/A | 3.1s | '\n\nNew York City is known for its vibrant entertainment scene. There are countless theaters, music venues, comedy clubs, and other entertainment options to choose from. From Broadway shows to stand-up comedy to live music, there is something for everyone in New York.' |
| OpenAI/GPT-4 + Pinecone (remote) | N/A | N/A | 'Entertainment in New York includes an ultra-glamorous, intimate 150-seat theater that showcases talent across various forms of entertainment such as magic, music, comedy, and Broadway cabarets. Additionally, there is an elegant and lively restaurant and bar called Hidden Leaf, as well as the Midnight Cafe, a cocktail bar with a beverage program directed by Giuseppe Santochirico.' |
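
The embedding times in the table are easier to compare once they are normalized. The sketch below converts the three reported totals to seconds, divides by the 60 chunks, and expresses each as a multiple of the OpenAI time. It only does arithmetic on the figures above. The OpenAI figure includes network latency, so the ratios are indicative at best.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xm Ys' duration to seconds."""
    return minutes * 60 + seconds


N_CHUNKS = 60  # number of chunks embedded in each run

# Embedding times copied from the table above.
embedding_seconds = {
    "Vicuna + Chroma (local)": to_seconds(1060, 52.5),
    "GPT4ALL + Chroma (local)": to_seconds(95, 6.4),
    "OpenAI + Chroma (local)": to_seconds(0, 6.1),
}

baseline = embedding_seconds["OpenAI + Chroma (local)"]
for name, total in embedding_seconds.items():
    print(f"{name}: {total / N_CHUNKS:.1f} s per chunk, "
          f"{total / baseline:.0f}x the OpenAI time")
```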