# Compare OpenAI and a local LLM (GPT4All-J + HF Instructor Embeddings)

## OpenAI Embeddings vs. Local Embeddings
Performance Comparison
- OpenAI's Embedding Model
- InstructorEmbedding (https://huggingface.co/hkunlp/instructor-xl)

In [8]:
!pip install -qU langchain openai tiktoken chromadb unstructured sentence_transformers InstructorEmbedding
!pip show langchain

Name: langchain
Version: 0.0.168
Summary: Building applications with LLMs through composability
Home-page: https://www.github.com/hwchase17/langchain
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.11/site-packages
Requires: aiohttp, dataclasses-json, numexpr, numpy, openapi-schema-pydantic, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: llama-index


## Setting up LangChain 

In [1]:
from dotenv import load_dotenv
load_dotenv('../.env', override=True)

True

In [2]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.document_loaders import DirectoryLoader
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings



## Load and process multiple documents

In [36]:
loader = DirectoryLoader('../data/docs/', glob="**/*.html", loader_cls=UnstructuredHTMLLoader)
all_docs = loader.load()
len(all_docs), all_docs[0].metadata

(171, {'source': '../data/docs/www.priceless.com/m/tour-the-city.html'})

In [16]:
%%time
from tqdm.auto import tqdm

urls = ['https://www.priceless.com/m/filter/options/category/506', 
        'https://www.priceless.com/m/filter/options/category/510',
        'https://www.priceless.com/m/filter/options/location/9716/trk/20211/']
documents = []

for doc in tqdm(all_docs):
    src = doc.metadata['source']
    url = src.replace('../data/docs/', 'https://').replace('index.html', '').replace('.html', '')
    if url not in urls:
        continue
    doc.metadata = dict()
    doc.metadata['url'] = url
    documents.append(doc)

len(documents), documents[0].metadata

100%|██████████| 171/171 [00:00<00:00, 345484.58it/s]

CPU times: user 2.55 ms, sys: 3.02 ms, total: 5.57 ms
Wall time: 4.61 ms





(3, {'url': 'https://www.priceless.com/m/filter/options/category/506'})
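
The path-to-URL mapping used in the loop above can be pulled out into a small helper so it can be checked on its own. This is a minimal sketch: the function name `source_to_url` and the `root` parameter are hypothetical and not part of the original notebook; the replacement chain is the same one the cell uses.

```python
def source_to_url(src: str, root: str = "../data/docs/") -> str:
    """Map a locally saved HTML path back to the URL it was downloaded from."""
    # Same chain of replacements as in the notebook cell:
    # local root -> scheme, then drop 'index.html' and the '.html' suffix.
    return src.replace(root, "https://").replace("index.html", "").replace(".html", "")


print(source_to_url("../data/docs/www.priceless.com/m/tour-the-city.html"))
# https://www.priceless.com/m/tour-the-city
```

Because `str.replace` substitutes every occurrence, a path that contains `.html` somewhere other than the suffix would also be altered; for the file names shown here that does not occur.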

In [17]:
# Split the documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=32)
texts = text_splitter.split_documents(documents)
len(texts)

264

In [18]:
texts[100], texts[100].metadata

(Document(page_content='Chile (0) \n        \n                                                            \n                                                                    \n                                     All locations \n                                            \n                                                                        \n                                     Santiago (0)', metadata={'url': 'https://www.priceless.com/m/filter/options/category/510'}),
 {'url': 'https://www.priceless.com/m/filter/options/category/510'})
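
The chunk shown above is mostly padding left over from HTML extraction, so a large share of the 512-character budget is spent on whitespace. One option, sketched here as an assumption rather than something the notebook does, is to collapse whitespace before splitting. The helper name `collapse_whitespace` is hypothetical.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and drop blank lines left by HTML extraction."""
    # Trim each line and squeeze internal runs of spaces or tabs to one space
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # Keep only the lines that still contain text
    return "\n".join(line for line in lines if line)


sample = "Chile (0) \n        \n      All locations \n      \n     Santiago (0)"
print(collapse_whitespace(sample))
# Chile (0)
# All locations
# Santiago (0)
```

In the notebook this could be applied to each document's `page_content` before calling `split_documents`, which would change the chunk count and the embeddings, so the timings below would no longer be directly comparable.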

## HF Instructor Embeddings

In [8]:
from langchain.embeddings import HuggingFaceInstructEmbeddings

instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl", 
                                                      model_kwargs={"device": "cpu"})

load INSTRUCTOR_Transformer
max_seq_length  512


In [13]:
## utility functions

import textwrap
import os

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\nSources:')
    for source in llm_response["source_documents"]:
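        # 'page' is not set for the HTML documents loaded here, so print it only when present
        page = source.metadata.get('page')
        print(source.metadata['url'] + ("" if page is None else " page: " + str(page)))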
        
def mkdir(folder):
    if not os.path.isdir(folder):
        print('The directory is not present. Creating a new one..')
        os.mkdir(folder)
    else:
        print('The directory is present.')

In [19]:
%%time

persist_directory =  '../data/chromadb_hf'
mkdir(persist_directory)

chromadb_instructor_embeddings = Chroma.from_documents(documents=texts, 
                                 embedding=instructor_embeddings,
                                 persist_directory=persist_directory)

chromadb_instructor_embeddings.persist()

Using embedded DuckDB with persistence: data will be stored in: ../data/chromadb


The directory is present.
CPU times: user 13min 47s, sys: 27.2 s, total: 14min 14s
Wall time: 11min 46s


In [20]:
chromadb_retriever = chromadb_instructor_embeddings.as_retriever(search_kwargs={"k": 4})

In [31]:
%%time

docs = chromadb_retriever.get_relevant_documents("Entertainment in New York")
docs

CPU times: user 780 ms, sys: 51.1 ms, total: 831 ms
Wall time: 826 ms


[Document(page_content='Los Angeles (0) \n                                            \n                                                                        \n                                     Miami (0) \n                                            \n                                                                        \n                                     New York (0)', metadata={'url': 'https://www.priceless.com/m/filter/options/category/506'}),
 Document(page_content='Los Angeles (0) \n                                            \n                                                                        \n                                     Miami (0) \n                                            \n                                                                        \n                                     New York (0)', metadata={'url': 'https://www.priceless.com/m/filter/options/category/506'}),
 Document(page_content='Los Angeles (0) \n                                    

## OpenAI's Embeddings

In [26]:
from langchain.embeddings import OpenAIEmbeddings
openai_embeddings = OpenAIEmbeddings()

In [27]:
persist_directory =  '../data/chromadb_openai'
mkdir(persist_directory)

chromadb_openai_embeddings = Chroma.from_documents(documents=texts, 
                                 embedding=openai_embeddings,
                                 persist_directory=persist_directory)

Using embedded DuckDB with persistence: data will be stored in: ../data/chromadb_openai


The directory is not present. Creating a new one..


In [28]:
chromadb_retriever_openai = chromadb_openai_embeddings.as_retriever(search_kwargs={"k": 4})

In [32]:
%%time

docs = chromadb_retriever_openai.get_relevant_documents("Entertainment in New York")
docs

CPU times: user 5.09 ms, sys: 1.4 ms, total: 6.49 ms
Wall time: 356 ms


[Document(page_content='Los Angeles (0) \n                                            \n                                                                        \n                                     Miami (0) \n                                            \n                                                                        \n                                     New York (0)', metadata={'url': 'https://www.priceless.com/m/filter/options/category/506'}),
 Document(page_content='Los Angeles (0) \n                                            \n                                                                        \n                                     Miami (0) \n                                            \n                                                                        \n                                     New York (0)', metadata={'url': 'https://www.priceless.com/m/filter/options/category/506'}),
 Document(page_content='Los Angeles (0) \n                                    
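
With both embedding models, the first two hits above are identical. The notebook does not say why; one possible cause (an assumption) is that the same chunks were added to the collection more than once. Whatever the cause, repeated hits can be filtered after retrieval. The sketch below uses a small stand-in class `Doc` with the same `page_content` and `metadata` attributes as a LangChain `Document`; `Doc` and `dedupe_documents` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_documents(docs):
    """Drop repeated (page_content, url) pairs, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for doc in docs:
        key = (doc.page_content, doc.metadata.get("url"))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


hits = [
    Doc("New York (0)", {"url": "https://www.priceless.com/m/filter/options/category/506"}),
    Doc("New York (0)", {"url": "https://www.priceless.com/m/filter/options/category/506"}),
    Doc("Santiago (0)", {"url": "https://www.priceless.com/m/filter/options/category/510"}),
]
print(len(dedupe_documents(hits)))  # 2
```

Filtering after retrieval means fewer than `k` documents may remain; requesting a larger `k` and trimming afterwards is one way to compensate.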

## Process all docs with GPT4ALL-J & Chroma

In [37]:
%%time
from tqdm.auto import tqdm

documents = []

for doc in tqdm(all_docs):
    if 'source' in doc.metadata:
        src = doc.metadata['source']
        url = src.replace('../data/docs/', 'https://').replace('index.html', '').replace('.html', '')
        doc.metadata = dict()
        doc.metadata['url'] = url
        
    documents.append(doc)

len(documents), documents[0].metadata

100%|██████████| 171/171 [00:00<00:00, 275750.09it/s]

CPU times: user 2.18 ms, sys: 1.15 ms, total: 3.33 ms
Wall time: 2.53 ms





(171, {'url': 'https://www.priceless.com/m/tour-the-city'})

In [38]:
# Split the documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=32)
texts = text_splitter.split_documents(documents)
len(texts)

5488

In [39]:
%%time

persist_directory =  '../data/chromadb'
mkdir(persist_directory)

chromadb_instructor_embeddings = Chroma.from_documents(documents=texts, 
                                 embedding=instructor_embeddings,
                                 persist_directory=persist_directory)

chromadb_instructor_embeddings.persist()

Using embedded DuckDB with persistence: data will be stored in: ../data/chromadb


The directory is not present. Creating a new one..
CPU times: user 5h 13min 24s, sys: 8min 40s, total: 5h 22min 5s
Wall time: 4h 30min 27s
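
The two wall times reported for Instructor embeddings on CPU (264 chunks and 5488 chunks) can be turned into a per-chunk cost to see whether embedding time scales roughly linearly. This is plain arithmetic on the numbers printed above; the helper names are hypothetical.

```python
def seconds(h: int = 0, m: int = 0, s: int = 0) -> int:
    """Convert hours/minutes/seconds to seconds."""
    return h * 3600 + m * 60 + s


def seconds_per_chunk(chunks: int, wall_seconds: int) -> float:
    """Average wall-clock seconds spent per embedded chunk."""
    return wall_seconds / chunks


runs = {
    "3 pages, 264 chunks": (264, seconds(m=11, s=46)),
    "171 pages, 5488 chunks": (5488, seconds(h=4, m=30, s=27)),
}
for name, (chunks, wall) in runs.items():
    print(f"{name}: {seconds_per_chunk(chunks, wall):.2f} s per chunk")
# 3 pages, 264 chunks: 2.67 s per chunk
# 171 pages, 5488 chunks: 2.96 s per chunk
```

Both runs land near 3 seconds per chunk, so on this machine the total time for a new corpus can be estimated as roughly the chunk count times 3 seconds.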


In [41]:
chromadb_retriever = chromadb_instructor_embeddings.as_retriever(search_kwargs={"k": 4})

In [42]:
%%time

docs = chromadb_retriever.get_relevant_documents("Entertainment in New York")
docs

CPU times: user 842 ms, sys: 57 ms, total: 899 ms
Wall time: 893 ms


[Document(page_content="Midnight Theatre\n\nFollow\n\nReceive star treatment at New York City's Midnight Theatre\n\nSavor signature cocktails and dim sum snacks\n\nMore...\n\nLess\n\nNew York, New York\n\n2.5 hours\n\nThe Midnight Theatre Map it\n\n3/31 @ 9:30 p.m. Bingo Night at MT  - Sales closed!\n\n4/07 @ 10 p.m. - Rachel Feinstein  - Sales closed!\n\n4/10 @ 10 p.m. - Tribulations, Roy Wood  - Sales closed!\n\n4/28 @ 10 p.m. - Midnight Follies\n\n6/15 @ 10:30 p.m. - Laura Marano\n\n5/19 @ 7 p.m. - American Immigrants\n\n5/20 @ 7 p.m. - American Immigrants", metadata={'url': 'https://www.priceless.com/m/entertainment/product/181398/the-midnight-theatre-priceless-night/511/78'}),
 Document(page_content='Distances are displayed to the nearest 0.1 mile and kilometer. \n Times Square - 0.1 km / 0.1 mi \n Broadway - 0.1 km / 0.1 mi \n Minskoff Theater - 0.1 km / 0.1 mi \n Lunt-Fontanne Theatre - 0.1 km / 0.1 mi \n Richard Rodgers Theatre - 0.2 km / 0.1 mi \n Shubert Theatre - 0.2 km / 0.

## Compare GPT4ALL-J and OpenAI

In [1]:
!pip install -qU pygpt4all

In [43]:
from langchain import PromptTemplate, LLMChain
from langchain.llms import GPT4All
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

In [47]:
local_path = '../../models/ggml-gpt4all-j.bin'  # replace with your desired local file path

In [48]:
# Callbacks support token-wise streaming
callbacks = [StreamingStdOutCallbackHandler()]
# verbose=True is required for output to be passed to the callback manager
# llm = GPT4All(model=local_path, callbacks=callbacks, verbose=True)
# To use the GPT4All-J model, add the backend parameter
llm = GPT4All(model=local_path, backend='gptj', callbacks=callbacks, verbose=True)

gptj_model_load: loading model from '../../models/ggml-gpt4all-j.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285


In [51]:
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

qa = ConversationalRetrievalChain.from_llm(
    llm, chromadb_instructor_embeddings.as_retriever(), max_tokens_limit=400, 
    memory=memory, return_source_documents=True)

In [52]:
%%time

question = "What's Mastercard Priceless?"
llm_response = qa({"question": question})
llm_response


In [28]:
process_llm_response(llm_response)

 The main differences between PCI v3 and v4 are:

1. Enhanced Security Requirements: PCI v4 has stricter security requirements compared to PCI v3. This includes
mandatory access control (MAC) access control, which is required to protect cardholder data from being
accessed by unauthorized parties.

2. More Secure Storage: PCI v4 requires storage to be secured and protected, which includes encrypting data
stored on storage devices, as well as storing cardholder data in a secure manner.

3. New Requirement for PCI v4: The PCI v4 standard now requires that any device that processes cardholder data
must have a security control that ensures the protection of cardholder data at rest and in transit.

4. Enhanced Compliance Requirements: PCI v4 requires compliance with more comprehensive compliance standards,
including ISO/IEC standards and the Payment Card Industry Data Security Standard (PCI DSS).

5. Enhanced Security Features: PCI v4 has a more extensive set of security features compared to