# GPT4All Langchain Demo

Example of running [`GPT4All`](https://github.com/nomic-ai/gpt4all), a 4GB, *llama.cpp*-based large language model (LLM), locally under [`langchain`](https://github.com/hwchase17/langchain) in a Jupyter notebook with a Python 3.10 kernel.

*Tested on a mid-2015 16GB MacBook Pro, concurrently running Docker (a single container running a separate Jupyter server) and Chrome with approx. 40 open tabs.*

In [3]:
!pip install -qU chromadb langchain==0.0.142 tiktoken tqdm load_dotenv ipywidgets pinecone-client pyllama llama-index llama-cpp-python html2text pyllamacpp
!pip list | grep langchain
!pip list | grep torch
!pip list | grep llama

langchain                     0.0.142
torch                         2.0.0
torchvision                   0.15.1
llama-cpp-python              0.1.43
llama-index                   0.5.27
pyllama                       0.0.9
pyllamacpp                    2.1.2


## Data preparation

In [9]:
def file_metadata(filename):
    d = dict()
    d["source"] = filename.replace('../data/docs/', 'https://').replace('index.html', '').replace('.html', '')
    return d
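
As a quick illustration of what `file_metadata` produces, the sketch below applies the same logic to two hypothetical paths under `../data/docs/`. The file names are assumptions (one directory per host name), chosen so that the first maps to one of the URLs filtered on later in the notebook.

```python
def file_metadata(filename):
    # Same logic as the cell above: turn a local path back into its source URL.
    d = dict()
    d["source"] = filename.replace('../data/docs/', 'https://').replace('index.html', '').replace('.html', '')
    return d

# Hypothetical local paths, assuming the crawl mirrors each site under its host name.
examples = [
    "../data/docs/www.priceless.com/m/filter/options/category/506.html",
    "../data/docs/www.priceless.com/index.html",
]

sources = [file_metadata(path)["source"] for path in examples]
for source in sources:
    print(source)
# https://www.priceless.com/m/filter/options/category/506
# https://www.priceless.com/
```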

In [4]:
from pathlib import Path
from typing import Dict

from llama_index.readers.file.base_parser import BaseParser

class HtmlParser(BaseParser):
    """Html parser."""

    def _init_parser(self) -> Dict:
        """Init parser."""
        return {}

    def parse_file(self, file: Path, errors: str = "ignore") -> str:
        """Parse file."""
        try:
            import html2text
        except ImportError:
            raise ImportError(
                "html2text is required to read html files: `pip install html2text`"
            )
        # Pass `errors` through so undecodable bytes are handled as the signature promises
        with open(file, "r", errors=errors) as fp:
            text = fp.read()
            text_maker = html2text.HTML2Text()
            text_maker.ignore_links = True
            text_maker.ignore_images = True
            text_maker.bypass_tables = False
            text = text_maker.handle(text)
            # Remove extra white space
            text = ' '.join(text.split())

        return text

In [8]:
file_extractor: Dict[str, BaseParser] = {
    ".htm": HtmlParser(),
    ".html": HtmlParser(),
}

In [10]:
%%time
from llama_index import SimpleDirectoryReader
docs = SimpleDirectoryReader(input_dir='../data/docs/', recursive=True, file_extractor=file_extractor, file_metadata=file_metadata).load_langchain_documents()
len(docs)

CPU times: user 10.4 s, sys: 250 ms, total: 10.7 s
Wall time: 10.9 s


171

In [21]:
%%time
from tqdm.auto import tqdm

urls = ['https://www.priceless.com/m/filter/options/category/506', 
        'https://www.priceless.com/m/filter/options/category/510',
        'https://www.priceless.com/m/filter/options/location/9716/trk/20211/']
documents = []

for doc in tqdm(docs):
    src = doc.metadata['source']
    url = src.replace('rtdocs/', 'https://').replace('index.html', '').replace('.html', '')
    if url not in urls:
        continue

    documents.append(doc)

len(documents), documents[0]

  0%|          | 0/171 [00:00<?, ?it/s]

CPU times: user 5.62 ms, sys: 2.38 ms, total: 8 ms
Wall time: 12 ms


(3,
 Document(page_content="### Add to wishlist Follow ADD TO CART Waitlist 0 Log in Sign up Language Currency Interests Locations * * * Mastercard.com About Priceless Contact us ADVANCED SEARCH All Experiences Interests Entertainment Arts and Culture Sports Culinary Travel More... Shopping Less... Locations Argentina Australia Austria Brazil Bulgaria More... Canada Chile China Colombia Croatia Czechia Fiji France Germany Greece Hong Kong India Indonesia Ireland Italy Japan Kenya Macau Malaysia Maldives Mexico Morocco Netherlands New Zealand Nigeria Puerto Rico Romania Serbia Singapore South Africa Spain Sweden Thailand Turkey United Arab Emirates United Kingdom United States Uruguay Clear Selection ADVANCED SEARCH ADVANCED SEARCH All Experiences Interests Entertainment Arts and Culture Sports Culinary Travel More... Shopping Less... Locations Argentina Australia Austria Brazil Bulgaria More... Canada Chile China Colombia Croatia Czechia Fiji France Germany Greece Hong Kong India Indon

## Model preparation

- download `gpt4all` model:

In [None]:
#https://the-eye.eu/public/AI/models/nomic-ai/gpt4all/gpt4all-lora-quantized.bin

- download `llama.cpp` 7B model

In [None]:
!python3.10 -m llama.download --model_size 7B --folder ../../models/llama/

- transform `gpt4all` model:

In [None]:
!pip install pyllamacpp
!pyllamacpp-convert-gpt4all ../../models/gpt4all-lora-quantized.bin ../../models/llama/tokenizer.model ../../models/gpt4all-lora-q-converted.bin

## `langchain` Demo

Example of running a prompt using `langchain`.

In [17]:
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain

- set up prompt template:

In [18]:
template = """
Question: {question}
Answer: 
"""

prompt = PromptTemplate(template=template, input_variables=["question"])
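
To see the exact text the model receives, the sketch below fills the same template with plain `str.format`. This is a stand-in for `prompt.format(question=...)`, assuming the template uses the default f-string style. The template is written with explicit `\n` escapes here only to make the leading newline and the trailing space after `Answer:` visible.

```python
# Same string as the triple-quoted template above, written with explicit escapes.
template = "\nQuestion: {question}\nAnswer: \n"

def render(question: str) -> str:
    """Fill the single {question} placeholder, as the prompt template would."""
    return template.format(question=question)

rendered = render("Entertainment in New York")
print(repr(rendered))
# '\nQuestion: Entertainment in New York\nAnswer: \n'
```

The surrounding newlines are part of the prompt and count toward the prompt tokens reported by `llama_print_timings`.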

- load model:

In [19]:
GPT4ALL_MODEL_PATH = "../../models/gpt4all-lora-q-converted.bin"

In [20]:
%%time
llm = LlamaCpp(model_path=GPT4ALL_MODEL_PATH)

CPU times: user 12.5 ms, sys: 7.89 ms, total: 20.4 ms
Wall time: 60.8 ms


llama.cpp: loading model from ../../models/gpt4all-lora-q-converted.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 5809.34 MB (+ 2052.00 MB per state)
llama_init_from_file: kv self size  =  512.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 


- create language chain using prompt template and loaded model:

In [5]:
llm_chain = LLMChain(prompt=prompt, llm=llm)

- run prompt:

In [6]:
%%time
question = "Entertainment in New York"

llm_chain.run(question)

CPU times: user 1min 34s, sys: 54 s, total: 2min 28s
Wall time: 2min 53s



llama_print_timings:        load time =  7761.79 ms
llama_print_timings:      sample time =   187.98 ms /   247 runs   (    0.76 ms per run)
llama_print_timings: prompt eval time = 15628.07 ms /    14 tokens ( 1116.29 ms per token)
llama_print_timings:        eval time = 157936.61 ms /   246 runs   (  642.02 ms per run)
llama_print_timings:       total time = 173805.38 ms


" There are many ways to enjoy entertainment in New York. Here are some suggestions for things to do:\n1. Visit a Broadway show: With over 40 venues to choose from, there is always a great performance happening on Broadway.\n2. Attend a concert or festival: New York has a huge music scene with many concerts and festivals throughout the year.\n3. Explore museums and galleries: From world-class art museums like The Met, MoMA and Guggenheim to historical institutions like the National September 11 Memorial & Museum, there are endless opportunities to learn and be inspired in New York City.\n4. Experience live sports: If you are a sports fanatic, New York is home to many professional teams such as the Knicks, Rangers, Mets, Yankees, and Jets who offer tickets for public sale. \n5. Have dinner at one of NYC's best restaurants: Whether you're in the mood for Italian, Japanese, or Korean cuisine, New York is home to some of the world’s most iconic and acclaimed restaurants."

In [45]:
%%time
question = "纽约的娱乐"

llm_chain.run(question)

CPU times: user 5min 9s, sys: 9min 32s, total: 14min 42s
Wall time: 55min 18s



llama_print_timings:        load time =  7867.97 ms
llama_print_timings:      sample time =   220.69 ms /   256 runs   (    0.86 ms per run)
llama_print_timings: prompt eval time = 23563.58 ms /    22 tokens ( 1071.07 ms per token)
llama_print_timings:        eval time = 3294263.72 ms /   255 runs   (12918.68 ms per run)
llama_print_timings:       total time = 3318321.43 ms


"1. Broadway Shows and Theatre Districts: New York City is well-known for its theatres, especially on Broadway, which has been home to some of the most famous musicals in history such as The Phantom of the Opera, Les Misérables, Wicked, Hamilton, and The Lion King. Beyond Broadway, there are also other theatre districts that showcase various productions from all over the world.\n2. Live Music Scene: New York City is a hub for live music in the United States. From legendary jazz clubs like The Village Vanguard to iconic rock venues such as Madison Square Garden and Webster Hall, there are endless opportunities to experience great live music all over NYC.\n3. Festivals and Events: There are tons of festivals and events happening in New York City throughout the year, ranging from the world-famous Macy's Thanksgiving Day Parade to NYC Pride, The Tribeca Film Festival, and The New York Comic Con.\n4. Sports Teams: New York is home to some of the most famous sports teams in the United States

## Generating Embeddings

We can also use the model to generate embeddings.

In [22]:
%%time
from langchain.embeddings import LlamaCppEmbeddings
llama_embeddings = LlamaCppEmbeddings(model_path=GPT4ALL_MODEL_PATH)

CPU times: user 9.98 ms, sys: 6.52 ms, total: 16.5 ms
Wall time: 49.3 ms


llama.cpp: loading model from ../../models/gpt4all-lora-q-converted.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 5809.34 MB (+ 2052.00 MB per state)
llama_init_from_file: kv self size  =  512.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 


In [21]:
%%time
text = "This is a test document."

query_result = llama_embeddings.embed_query(text)

CPU times: user 2.53 s, sys: 2.35 s, total: 4.88 s
Wall time: 6.59 s



llama_print_timings:        load time =  6586.26 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =  6586.00 ms /     7 tokens (  940.86 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  6587.02 ms


In [22]:
%%time
doc_result = llama_embeddings.embed_documents([text])

CPU times: user 2.54 s, sys: 2.36 s, total: 4.91 s
Wall time: 7.08 s



llama_print_timings:        load time =  6586.26 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =  7074.05 ms /     7 tokens ( 1010.58 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  7074.87 ms
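
A quick sanity check on vectors such as `query_result` and `doc_result[0]` is their cosine similarity: since both were computed from the same text, the value would be expected to be close to 1. The sketch below is a minimal pure-Python version that uses short made-up vectors in place of the real ones (whose length should match `n_embd = 4096` from the load log), so it runs without the model.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy stand-ins for query_result and doc_result[0] (assumed values, not model output).
same = cosine_similarity([0.1, -0.2, 0.3], [0.1, -0.2, 0.3])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same, orthogonal)
```

With the real vectors, the call would be `cosine_similarity(query_result, doc_result[0])`.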


## Example Query Supported by a Document Based Knowledge Source

Example document query using the example from the [`langchain` docs](https://python.langchain.com/en/latest/use_cases/question_answering.html).

The idea is to run the query against a document source to retrieve some relevant context, and use that as part of the prompt context.
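
The retrieve-then-prompt flow can be sketched without any model: score each chunk against the query, keep the top `k`, and paste those chunks into the prompt. Everything in this sketch is an illustrative assumption. The word-overlap score stands in for embedding similarity, and the prompt wording is not the template that `langchain` uses internally.

```python
import re
from typing import List

def words(text: str) -> set:
    """Lower-cased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that occur in the chunk."""
    q = words(query)
    return len(q & words(chunk)) / len(q) if q else 0.0

def retrieve(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks with the highest score for the query."""
    return sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)[:k]

def build_prompt(query: str, context: List[str]) -> str:
    """Put the retrieved chunks into the prompt ahead of the question."""
    joined = "\n\n".join(context)
    return f"Use the context to answer the question.\n\nContext:\n{joined}\n\nQuestion: {query}\nAnswer:"

# Made-up chunks standing in for the split documents.
chunks = [
    "Broadway shows and comedy clubs are popular entertainment in New York.",
    "Culinary experiences in Tokyo include sushi counters and izakaya.",
    "Sports events in London include football and tennis.",
]
query = "Entertainment in New York"
top = retrieve(query, chunks, k=1)
prompt_text = build_prompt(query, top)
print(prompt_text)
```

Here `k` plays the role of `search_kwargs={"k": ...}` on the retriever: a larger `k` gives the model more context but also a longer prompt to evaluate.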

Now let's try with some source documents.

In [68]:
%%time
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
len(texts), texts[0]

CPU times: user 2.76 ms, sys: 2.19 ms, total: 4.95 ms
Wall time: 9.14 ms


(60,
 Document(page_content='### Add to wishlist Follow ADD TO CART Waitlist 0 Log in Sign up Language Currency Interests Locations * * * Mastercard.com About Priceless Contact us ADVANCED SEARCH All Experiences Interests Entertainment Arts and Culture Sports Culinary Travel More... Shopping Less... Locations Argentina Australia Austria Brazil Bulgaria More... Canada Chile China Colombia Croatia Czechia Fiji France Germany Greece Hong Kong India Indonesia Ireland Italy Japan Kenya Macau Malaysia Maldives Mexico Morocco', metadata={'source': 'https://www.priceless.com/m/filter/options/category/506'}))

In [69]:
%%time
from langchain.vectorstores import Chroma

docsearch = Chroma.from_documents(texts, llama_embeddings, persist_directory='../data/docs/gpt4all')
len(texts), docsearch

INFO:chromadb.telemetry.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
INFO:chromadb:Running Chroma using direct local API.
INFO:chromadb.db.duckdb:No existing DB found in rtdocs/gpt4all, skipping load
INFO:chromadb.db.duckdb:No existing DB found in rtdocs/gpt4all, skipping load

llama_print_timings:        load time =  6586.26 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 130048.66 ms /   114 tokens ( 1140.78 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 130062.01 ms

llama_print_timings:        load time =  6586.26 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 110476.68 ms /   103 tokens ( 1072.59 ms per token)
llama_print_timings:        eval time =     0.00 ms /   

CPU times: user 47min 24s, sys: 29min 12s, total: 1h 16min 36s
Wall time: 1h 35min 6s


(60, <langchain.vectorstores.chroma.Chroma at 0x12f8befb0>)

In [70]:
%%time
from langchain.chains import RetrievalQA

MIN_DOCS = 1

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                 retriever=docsearch.as_retriever(search_kwargs={"k": MIN_DOCS}))

CPU times: user 251 µs, sys: 1.79 ms, total: 2.04 ms
Wall time: 7.79 ms


What do we get in response to our original query now?

In [71]:
%%time

print(question)
qa.run(question)

Entertainment in New York



llama_print_timings:        load time =  6586.26 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =  6873.69 ms /     7 tokens (  981.96 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  6874.77 ms


CPU times: user 2min 45s, sys: 3min 39s, total: 6min 25s
Wall time: 10min 59s



llama_print_timings:        load time =  7911.82 ms
llama_print_timings:      sample time =    44.57 ms /    51 runs   (    0.87 ms per run)
llama_print_timings: prompt eval time = 251544.37 ms /   228 tokens ( 1103.26 ms per token)
llama_print_timings:        eval time = 400569.15 ms /    50 runs   ( 8011.38 ms per run)
llama_print_timings:       total time = 652214.27 ms


' You may enjoy a variety of entertainment options such as Broadway shows, concerts and performances at Lincoln Center, theaters across town, or comedy clubs throughout the city. Additionally, there are many museums and galleries to visit for cultural experiences.'

## Comparison with OpenAI Embeddings

In [72]:
import os
from dotenv import load_dotenv
from langchain.embeddings import OpenAIEmbeddings

load_dotenv('../.env', override=True)
openai_embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])

In [73]:
%%time
from langchain.vectorstores import Chroma

docsearch2 = Chroma.from_documents(texts, openai_embeddings, persist_directory='rtdocs/openai')
len(texts), docsearch2

INFO:chromadb.telemetry.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
INFO:chromadb:Running Chroma using direct local API.
INFO:chromadb.db.duckdb:No existing DB found in rtdocs/openai, skipping load
INFO:chromadb.db.duckdb:No existing DB found in rtdocs/openai, skipping load
INFO:chromadb.db.duckdb:PersistentDuckDB del, about to run persist
INFO:chromadb.db.duckdb:Persisting DB to disk, putting it in the save folder: db


CPU times: user 325 ms, sys: 124 ms, total: 448 ms
Wall time: 6.13 s


(60, <langchain.vectorstores.chroma.Chroma at 0x12f8bf340>)

In [86]:
from langchain.llms import OpenAI

qa2 = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0), chain_type="stuff",
                                 retriever=docsearch2.as_retriever(search_kwargs={"k": MIN_DOCS}))

In [92]:
%%time

print(question)
qa2.run(question)

Entertainment in New York
CPU times: user 5.38 ms, sys: 7.78 ms, total: 13.2 ms
Wall time: 1.32 s


' Entertainment in New York includes Broadway shows, comedy clubs, music venues, museums, and more.'

In [89]:
from langchain.llms import OpenAI
from langchain.vectorstores import Pinecone
import pinecone

PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY")
PINECONE_ENVIRONMENT = os.environ.get("PINECONE_ENVIRONMENT")
PINECONE_INDEX = os.environ.get("PINECONE_INDEX")
PINECONE_NAME_SPACE = os.environ.get("PINECONE_NAME_SPACE")

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
index = pinecone.Index(PINECONE_INDEX)
print(index.describe_index_stats())

test_cone = Pinecone.from_documents(documents=[],
                                    embedding=openai_embeddings,
                                    index_name=PINECONE_INDEX)

qa3 = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0), chain_type="stuff",
                                 retriever=test_cone.as_retriever(search_kwargs={"k": 3}))

{'dimension': 1536,
 'index_fullness': 0.1,
 'namespaces': {'priceless-docs': {'vector_count': 201558},
                'priceless-docs-v2': {'vector_count': 174012},
                'priceless-docs-v3': {'vector_count': 148815}},
 'total_vector_count': 524385}


In [91]:
%%time

print(question)
qa3.run(question)

Entertainment in New York
CPU times: user 7.04 ms, sys: 1.97 ms, total: 9.02 ms
Wall time: 3.09 s


'\n\nNew York City is known for its vibrant entertainment scene. There are countless theaters, music venues, comedy clubs, and other entertainment options to choose from. From Broadway shows to stand-up comedy to live music, there is something for everyone in New York.'

We used GPT4ALL and OpenAI to generate embeddings for the 3 HTML files that the Priceless Chatbot returns when someone asks "Entertainment in New York". These files are split into 60 chunks with a chunk size of 500 characters (`RecursiveCharacterTextSplitter` measures `chunk_size` in characters by default, not tokens). Each model then uses its own embedding vectors to run `RetrievalQA` on the same question. For comparison, we also ran the same query with OpenAI against the pre-generated `priceless-docs-v3` embedding vectors stored in Pinecone. Finally, we append the answer from the Priceless Chatbot.

| Approach | Time to generate embeddings | Time to run query | Query result |
| --- | --- | --- | --- |
| GPT4ALL + Chroma (local) | 95m 6.4s | 10m 59.2s | ' You may enjoy a variety of entertainment options such as Broadway shows, concerts and performances at Lincoln Center, theaters across town, or comedy clubs throughout the city. Additionally, there are many museums and galleries to visit for cultural experiences.' |
| OpenAI + Chroma (local) | 6.1s | 1.3s | ' Entertainment in New York includes Broadway shows, comedy clubs, music venues, museums, and more.' |
| OpenAI + Pinecone (remote) | N/A | 3.1s | '\n\nNew York City is known for its vibrant entertainment scene. There are countless theaters, music venues, comedy clubs, and other entertainment options to choose from. From Broadway shows to stand-up comedy to live music, there is something for everyone in New York.' |
| OpenAI/GPT-4 + Pinecone (remote) | N/A | N/A | 'Entertainment in New York includes an ultra-glamorous, intimate 150-seat theater that showcases talent across various forms of entertainment such as magic, music, comedy, and Broadway cabarets. Additionally, there is an elegant and lively restaurant and bar called Hidden Leaf, as well as the Midnight Cafe, a cocktail bar with a beverage program directed by Giuseppe Santochirico.' |
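
To put the table in proportion, the sketch below converts the reported wall times to seconds and computes how many times slower the local GPT4ALL run was than OpenAI + Chroma. The timings are copied from the table. The helper name `to_seconds` and the dictionary layout are only for illustration.

```python
def to_seconds(minutes: float = 0.0, seconds: float = 0.0) -> float:
    """Convert a 'Xm Ys' wall time from the table into seconds."""
    return minutes * 60.0 + seconds

# Wall times taken from the first two rows of the table.
timings = {
    "GPT4ALL + Chroma (local)": {"embed": to_seconds(95, 6.4), "query": to_seconds(10, 59.2)},
    "OpenAI + Chroma (local)": {"embed": to_seconds(seconds=6.1), "query": to_seconds(seconds=1.3)},
}

local = timings["GPT4ALL + Chroma (local)"]
hosted = timings["OpenAI + Chroma (local)"]

embed_ratio = local["embed"] / hosted["embed"]
query_ratio = local["query"] / hosted["query"]
print(f"embedding: {embed_ratio:.0f}x slower, query: {query_ratio:.0f}x slower")
# embedding: 935x slower, query: 507x slower
```

These ratios describe this one machine and this one run, so they are best read as an order of magnitude rather than a benchmark.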