# Compare OpenAI and local LLM (GPT4ALL-J + HF Instructor Embeddings)

## OpenAI Embeddings vs. Local Embeddings
Performance Comparison
- OpenAI's Embedding Model
- InstructorEmbedding (https://huggingface.co/hkunlp/instructor-xl)

In [1]:
!pip install -qU langchain openai tiktoken chromadb unstructured sentence_transformers InstructorEmbedding
!pip show langchain

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llama-index 0.5.27 requires langchain==0.0.142, but you have langchain 0.0.169 which is incompatible.
Name: langchain
Version: 0.0.169
Summary: Building applications with LLMs through composability
Home-page: https://www.github.com/hwchase17/langchain
Author: 
Author-email: 
License: MIT
Location: /Users/inflaton/miniconda3/lib/python3.10/site-packages
Requires: aiohttp, async-timeout, dataclasses-json, numexpr, numpy, openapi-schema-pydantic, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: llama-index


## Setting up LangChain 

In [2]:
from dotenv import load_dotenv
load_dotenv('../.env', override=True)

True

In [3]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.document_loaders import DirectoryLoader
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings



## Load and process multiple documents

In [4]:
loader = DirectoryLoader('../data/docs/', glob="**/*.html", loader_cls=UnstructuredHTMLLoader)
all_docs = loader.load()
len(all_docs), all_docs[0].metadata

(171, {'source': '../data/docs/www.priceless.com/m/tour-the-city.html'})

In [16]:
%%time
from tqdm.auto import tqdm

urls = ['https://www.priceless.com/m/filter/options/category/506', 
        'https://www.priceless.com/m/filter/options/category/510',
        'https://www.priceless.com/m/filter/options/location/9716/trk/20211/']
documents = []

for doc in tqdm(all_docs):
    src = doc.metadata['source']
    url = src.replace('../data/docs/', 'https://').replace('index.html', '').replace('.html', '')
    if url not in urls:
        continue
    doc.metadata = dict()
    doc.metadata['url'] = url
    documents.append(doc)

len(documents), documents[0].metadata

100%|██████████| 171/171 [00:00<00:00, 345484.58it/s]

CPU times: user 2.55 ms, sys: 3.02 ms, total: 5.57 ms
Wall time: 4.61 ms





(3, {'url': 'https://www.priceless.com/m/filter/options/category/506'})
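The path-to-URL mapping used in the cell above can be pulled out into a small helper so it is easy to check in isolation. This is a minimal sketch: the function name `source_to_url` and the `root` parameter are illustrative and not part of the notebook; the replace chain itself is copied from the cell.

```python
def source_to_url(src: str, root: str = "../data/docs/") -> str:
    """Map a mirrored file path back to the URL it was saved from.

    `source_to_url` and `root` are hypothetical names; the replace chain
    is the same one the notebook cell applies to doc.metadata['source'].
    """
    return src.replace(root, "https://").replace("index.html", "").replace(".html", "")


# The first document loaded by the DirectoryLoader in this notebook
print(source_to_url("../data/docs/www.priceless.com/m/tour-the-city.html"))
# https://www.priceless.com/m/tour-the-city
```

Note that `str.replace` rewrites every occurrence, so a path that contains `.html` or `index.html` somewhere in the middle would also be altered.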

In [17]:
# split the documents into chunks of up to 512 characters with a 32-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=32)
texts = text_splitter.split_documents(documents)
len(texts)

264
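To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified character-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks); the function `sliding_chunks` is a hypothetical illustration only.

```python
def sliding_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 32) -> list[str]:
    """Cut text into fixed-size windows where neighbours share `chunk_overlap` characters.

    Simplified illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(chr(65 + i % 26) for i in range(1000))  # 1000 characters of dummy text
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])          # [512, 512, 40]
print(chunks[0][-32:] == chunks[1][:32])  # True: neighbours share 32 characters
```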

In [18]:
texts[100], texts[100].metadata

(Document(page_content='Chile (0) \n        \n                                                            \n                                                                    \n                                     All locations \n                                            \n                                                                        \n                                     Santiago (0)', metadata={'url': 'https://www.priceless.com/m/filter/options/category/510'}),
 {'url': 'https://www.priceless.com/m/filter/options/category/510'})

## HF Instructor Embeddings

In [5]:
from langchain.embeddings import HuggingFaceInstructEmbeddings

instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl", 
                                                      model_kwargs={"device": "cpu"})

Downloading pytorch_model.bin:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)7f436/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.40k [00:00<?, ?B/s]

Downloading (…)f57f436/modules.json:   0%|          | 0.00/461 [00:00<?, ?B/s]

load INSTRUCTOR_Transformer
max_seq_length  512


In [13]:
## utility functions

import textwrap
import os

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\nSources:')
    for source in llm_response["source_documents"]:
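        # chunks built in this notebook carry only a 'url' key, so read 'page' defensively
        page = source.metadata.get('page')
        print(source.metadata['url'] + (f" page: {page}" if page is not None else ""))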
        
def mkdir(folder):
    if not os.path.isdir(folder):
        print('The directory is not present. Creating a new one..')
        os.mkdir(folder)
    else:
        print('The directory is present.')

In [19]:
%%time

persist_directory =  '../data/chromadb_hf'
mkdir(persist_directory)

chromadb_instructor_embeddings = Chroma.from_documents(documents=texts, 
                                 embedding=instructor_embeddings,
                                 persist_directory=persist_directory)

chromadb_instructor_embeddings.persist()

Using embedded DuckDB with persistence: data will be stored in: ../data/chromadb


The directory is present.
CPU times: user 13min 47s, sys: 27.2 s, total: 14min 14s
Wall time: 11min 46s


In [20]:
chromadb_retriever = chromadb_instructor_embeddings.as_retriever(search_kwargs={"k": 4})

In [31]:
%%time

docs = chromadb_retriever.get_relevant_documents("Entertainment in New York")
docs

CPU times: user 780 ms, sys: 51.1 ms, total: 831 ms
Wall time: 826 ms


[Document(page_content='Los Angeles (0) \n                                            \n                                                                        \n                                     Miami (0) \n                                            \n                                                                        \n                                     New York (0)', metadata={'url': 'https://www.priceless.com/m/filter/options/category/506'}),
 Document(page_content='Los Angeles (0) \n                                            \n                                                                        \n                                     Miami (0) \n                                            \n                                                                        \n                                     New York (0)', metadata={'url': 'https://www.priceless.com/m/filter/options/category/506'}),
 Document(page_content='Los Angeles (0) \n                                    

## OpenAI's Embeddings

In [26]:
from langchain.embeddings import OpenAIEmbeddings
openai_embeddings = OpenAIEmbeddings()

In [27]:
persist_directory =  '../data/chromadb_openai'
mkdir(persist_directory)

chromadb_openai_embeddings = Chroma.from_documents(documents=texts, 
                                 embedding=openai_embeddings,
                                 persist_directory=persist_directory)

Using embedded DuckDB with persistence: data will be stored in: ../data/chromadb_openai


The directory is not present. Creating a new one..


In [28]:
chromadb_retriever_openai = chromadb_openai_embeddings.as_retriever(search_kwargs={"k": 4})

In [32]:
%%time

docs = chromadb_retriever_openai.get_relevant_documents("Entertainment in New York")
docs

CPU times: user 5.09 ms, sys: 1.4 ms, total: 6.49 ms
Wall time: 356 ms


[Document(page_content='Los Angeles (0) \n                                            \n                                                                        \n                                     Miami (0) \n                                            \n                                                                        \n                                     New York (0)', metadata={'url': 'https://www.priceless.com/m/filter/options/category/506'}),
 Document(page_content='Los Angeles (0) \n                                            \n                                                                        \n                                     Miami (0) \n                                            \n                                                                        \n                                     New York (0)', metadata={'url': 'https://www.priceless.com/m/filter/options/category/506'}),
 Document(page_content='Los Angeles (0) \n                                    
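In both retrieval outputs above, the first two hits are identical. One possible cause, which is only a guess, is that the same chunks were added to the collection more than once. Whatever the reason, duplicates can be filtered before comparing the two retrievers. The sketch below uses a hypothetical `Doc` dataclass as a stand-in for LangChain's `Document` so that it runs on its own; `unique_docs` is likewise an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_docs(docs):
    """Drop hits whose text and URL repeat an earlier hit, keeping the original order."""
    seen = set()
    unique = []
    for doc in docs:
        key = (doc.page_content, doc.metadata.get("url"))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Made-up hits that mimic the repeated results shown above
hits = [
    Doc("New York (0)", {"url": "https://www.priceless.com/m/filter/options/category/506"}),
    Doc("New York (0)", {"url": "https://www.priceless.com/m/filter/options/category/506"}),
    Doc("Miami (0)", {"url": "https://www.priceless.com/m/filter/options/category/506"}),
]
print(len(hits), "->", len(unique_docs(hits)))  # 3 -> 2
```

Because the helper only reads `page_content` and `metadata`, it should also accept the objects returned by `get_relevant_documents`.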

## Process all docs with GPT4ALL-J & Chroma

In [37]:
%%time
from tqdm.auto import tqdm

documents = []

for doc in tqdm(all_docs):
    if 'source' in doc.metadata:
        src = doc.metadata['source']
        url = src.replace('../data/docs/', 'https://').replace('index.html', '').replace('.html', '')
        doc.metadata = dict()
        doc.metadata['url'] = url
        
    documents.append(doc)

len(documents), documents[0].metadata

100%|██████████| 171/171 [00:00<00:00, 275750.09it/s]

CPU times: user 2.18 ms, sys: 1.15 ms, total: 3.33 ms
Wall time: 2.53 ms





(171, {'url': 'https://www.priceless.com/m/tour-the-city'})

In [38]:
# split the documents into chunks of up to 512 characters with a 32-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=32)
texts = text_splitter.split_documents(documents)
len(texts)

5488

In [39]:
%%time

persist_directory =  '../data/chromadb'
mkdir(persist_directory)

chromadb_instructor_embeddings = Chroma.from_documents(documents=texts, 
                                 embedding=instructor_embeddings,
                                 persist_directory=persist_directory)

chromadb_instructor_embeddings.persist()

Using embedded DuckDB with persistence: data will be stored in: ../data/chromadb


The directory is not present. Creating a new one..
CPU times: user 5h 13min 24s, sys: 8min 40s, total: 5h 22min 5s
Wall time: 4h 30min 27s
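The two indexing runs in this notebook can be put on a common scale by dividing wall time by the number of chunks. The snippet below is plain arithmetic on the figures reported above (264 chunks in 11min 46s, 5488 chunks in 4h 30min 27s); the helper `to_seconds` is an illustrative name.

```python
def to_seconds(hours: int = 0, minutes: int = 0, secs: int = 0) -> int:
    """Convert an h/min/s wall time into seconds."""
    return hours * 3600 + minutes * 60 + secs


# Wall times and chunk counts as reported by the two Chroma.from_documents runs
small_run = to_seconds(minutes=11, secs=46) / 264           # 3-document subset
full_run = to_seconds(hours=4, minutes=30, secs=27) / 5488  # all 171 documents

print(f"{small_run:.2f} s/chunk")  # 2.67 s/chunk
print(f"{full_run:.2f} s/chunk")   # 2.96 s/chunk
```

On this machine, CPU embedding with `hkunlp/instructor-xl` therefore costs roughly 2.7 to 3 seconds per chunk, so indexing time grows about linearly with the number of chunks.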


In [41]:
chromadb_retriever = chromadb_instructor_embeddings.as_retriever(search_kwargs={"k": 4})

In [42]:
%%time

docs = chromadb_retriever.get_relevant_documents("Entertainment in New York")
docs

CPU times: user 842 ms, sys: 57 ms, total: 899 ms
Wall time: 893 ms


[Document(page_content="Midnight Theatre\n\nFollow\n\nReceive star treatment at New York City's Midnight Theatre\n\nSavor signature cocktails and dim sum snacks\n\nMore...\n\nLess\n\nNew York, New York\n\n2.5 hours\n\nThe Midnight Theatre Map it\n\n3/31 @ 9:30 p.m. Bingo Night at MT  - Sales closed!\n\n4/07 @ 10 p.m. - Rachel Feinstein  - Sales closed!\n\n4/10 @ 10 p.m. - Tribulations, Roy Wood  - Sales closed!\n\n4/28 @ 10 p.m. - Midnight Follies\n\n6/15 @ 10:30 p.m. - Laura Marano\n\n5/19 @ 7 p.m. - American Immigrants\n\n5/20 @ 7 p.m. - American Immigrants", metadata={'url': 'https://www.priceless.com/m/entertainment/product/181398/the-midnight-theatre-priceless-night/511/78'}),
 Document(page_content='Distances are displayed to the nearest 0.1 mile and kilometer. \n Times Square - 0.1 km / 0.1 mi \n Broadway - 0.1 km / 0.1 mi \n Minskoff Theater - 0.1 km / 0.1 mi \n Lunt-Fontanne Theatre - 0.1 km / 0.1 mi \n Richard Rodgers Theatre - 0.2 km / 0.1 mi \n Shubert Theatre - 0.2 km / 0.

## Compare GPT4ALL-J and OpenAI

TODO