# OpenAI vs Local Embeddings

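This notebook compares how two embedding models perform in a retrieval question-answering pipeline:
- OpenAI's embeddings model
- InstructorEmbedding (`hkunlp/instructor-xl`), available on [Hugging Face](https://huggingface.co/hkunlp/instructor-xl)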

![](images/RAG.png)

### Install required packages

In [1]:
!pip install -r requirements.txt


### Set up the environment variables and import libraries

In [2]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']  # read the OpenAI API key from the environment

In [3]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader, TextLoader, AirbyteJSONLoader
from langchain.document_loaders import DirectoryLoader

In [4]:
# InstructorEmbedding 
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings
# OpenAI Embedding
from langchain.embeddings import OpenAIEmbeddings



### Load Multiple files from Directory (json)

In [5]:
from langchain.document_loaders import AirbyteJSONLoader

In [7]:
root_dir = "/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md"

In [7]:
# Load all repos
# from langchain.document_loaders.generic import GenericLoader
# from langchain.document_loaders.parsers import LanguageParser
# from langchain.text_splitter import Language
# loader = GenericLoader.from_filesystem(
#     root_dir + "/mpdl_collection",
#     glob="**/*",
#     suffixes=[".json"],
#     parser=LanguageParser(language=Language.PYTHON, parser_threshold=500),
# )
# documents = loader.load()
# len(documents)

3

In [9]:
loader = TextLoader(root_dir)
documents = loader.load()
len(documents)


1

### Divide and split text

In [11]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

texts = text_splitter.split_documents(documents)
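To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines). `chunk_with_overlap` is a hypothetical helper that only illustrates what `chunk_size` and `chunk_overlap` mean.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into character windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk starts with the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.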

### RetrievalQA

To answer questions over the documents, we need to store them in a form that supports semantic search over their content.

The most common approach is to embed each chunk of text, then store the embedding together with the text in a vector store.
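The idea can be sketched without any external library. In the toy example below, `toy_embed` is a stand-in that uses word counts instead of a learned embedding, and `ToyVectorStore` is a hypothetical class, not the FAISS API. It only shows the pattern: embed each text, keep the vector next to the text, and rank by cosine similarity at query time.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: word counts (real models return dense float vectors)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Keeps (embedding, text) pairs and returns the k most similar texts."""

    def __init__(self):
        self.entries = []

    def add(self, text):
        self.entries.append((toy_embed(text), text))

    def search(self, query, k=1):
        query_vector = toy_embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add("install the package from PyPI with pip")
store.add("open a pull request to contribute a fix")
store.add("docker is required to build images")

top = store.search("how do I install with pip", k=1)
print(top)
```

A real embedding model replaces `toy_embed`, which is what lets the search match on meaning rather than on shared words.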

### Get embeddings for our document

In [12]:
import pickle
import faiss  # similarity search library: https://faiss.ai/index.html
from langchain.vectorstores import FAISS

In [13]:
def store_embeddings(docs, embeddings, store_name, path):
    # Build a FAISS index from the documents and pickle it to disk
    vectorStore = FAISS.from_documents(docs, embeddings)

    with open(f"{path}/faiss_{store_name}.pkl", "wb") as f:
        pickle.dump(vectorStore, f)

In [15]:
def load_embeddings(store_name, path):
    # Load a previously pickled FAISS index from disk
    with open(f"{path}/faiss_{store_name}.pkl", "rb") as f:
        VectorStore = pickle.load(f)
    return VectorStore
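The two helpers above are defined but never called in this notebook. The sketch below shows the same save-then-load round trip on a plain dictionary, so it runs without FAISS. `save_object`, `load_object`, and `stand_in_store` are hypothetical names used only for illustration, and the `faiss_<name>.pkl` file naming simply mirrors the helpers above.

```python
import os
import pickle
import tempfile


def save_object(obj, store_name, path):
    """Pickle obj to <path>/faiss_<store_name>.pkl, creating the folder if needed."""
    os.makedirs(path, exist_ok=True)
    file_path = os.path.join(path, f"faiss_{store_name}.pkl")
    with open(file_path, "wb") as f:
        pickle.dump(obj, f)
    return file_path


def load_object(store_name, path):
    """Read back an object written by save_object."""
    with open(os.path.join(path, f"faiss_{store_name}.pkl"), "rb") as f:
        return pickle.load(f)


# A plain dict stands in for the vector store so the sketch needs no FAISS install.
stand_in_store = {"texts": ["chunk one", "chunk two"], "dim": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_object(stand_in_store, "instructEmbeddings", tmp_dir)
    restored = load_object("instructEmbeddings", tmp_dir)

print(restored == stand_in_store)
```

Unpickling executes code stored in the file, so only load pickle files that you created yourself or otherwise trust.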

### HF Instructor Embeddings

In [16]:
from langchain.embeddings import HuggingFaceInstructEmbeddings
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl",
                                                      model_kwargs={"device": "cpu"})

load INSTRUCTOR_Transformer
max_seq_length  512


In [17]:
Embedding_store_path = f"{root_dir}/Embedding_store"
print(Embedding_store_path)

/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md/Embedding_store


In [18]:
db_instructEmbedd = FAISS.from_documents(texts, instructor_embeddings)
print(db_instructEmbedd)

<langchain.vectorstores.faiss.FAISS object at 0x2950e65d0>


In [19]:
retriever = db_instructEmbedd.as_retriever(search_kwargs={"k":3})

In [20]:
print(retriever)

tags=['FAISS', 'HuggingFaceInstructEmbeddings'] vectorstore=<langchain.vectorstores.faiss.FAISS object at 0x2950e65d0> search_kwargs={'k': 3}


In [21]:
retriever.search_type

'similarity'

In [22]:
docs = retriever.get_relevant_documents("Who are the authors of this software?")
docs[0]

Document(page_content='Data collected will include number of contributors, number of PRs, time taken to\nclose/merge these PRs, and issues closed.\n\nFor more information, please visit\n[our informational page](https://sustainable-open-science-and-software.github.io/) or download our [participant information sheet](https://sustainable-open-science-and-software.github.io/assets/PIS_sustainable_software.pdf).\n\n---\n\n## Using repo2docker\n\n### Prerequisites\n\n1. Docker to build & run the repositories. The [community edition](https://store.docker.com/search?type=edition&offering=community)\n   is recommended.\n2. Python 3.6+.\n\nSupported on Linux and macOS. [See documentation note about Windows support.](http://repo2docker.readthedocs.io/en/latest/install.html#note-about-windows-support)\n\n### Installation\n\nThis a quick guide to installing `repo2docker`, see our documentation for [a full guide](https://repo2docker.readthedocs.io/en/latest/install.html).\n\nTo install from PyPI:\n\

In [23]:
# create the chain to answer questions
qa_chain_instrucEmbed = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.2),
                                                    chain_type="stuff",
                                                    retriever=retriever,
                                                    return_source_documents=True)

## OpenAI's embeddings

In [24]:
from langchain.embeddings import OpenAIEmbeddings

In [25]:
embeddings = OpenAIEmbeddings()

In [26]:
db_openAIEmbedd= FAISS.from_documents(texts, embeddings)
retriever_openai = db_openAIEmbedd.as_retriever(search_kwargs={"k": 3})

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [31]:
os.environ["TOKENIZERS_PARALLELISM"] = "false" # disable warning message
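In this notebook the variable is set only after the warning has already appeared. A minimal sketch of the alternative, assuming the assignment is moved to the first cell so that it runs before any tokenizer or embedding model is used:

```python
import os

# Set this in the first cell, before any tokenizer or embedding model is used.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Later cells can confirm the setting is in place.
parallelism_setting = os.environ.get("TOKENIZERS_PARALLELISM")
print(parallelism_setting)
```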

In [27]:
# create the chain to answer questions
qa_chain_openai = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.2),
                                              chain_type="stuff",
                                              retriever=retriever_openai,
                                              return_source_documents=True)

### Testing both models

In [28]:
## Cite sources

import textwrap # text wrapping and filling
def wrap_text_preserve_newlines(text, width=79):
    # Split the input text into lines based on newline characters
    lines = text.split("\n")

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width) for line in lines]

    # Join the wrapped lines back into a single string using newline characters
    wrapped_text = '\n'.join(wrapped_lines)
    return wrapped_text

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\nSources:')
    for source in llm_response['source_documents']:
        print(source.metadata['source'])
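Because every retrieved chunk comes from the same file, the source list printed by `process_llm_response` repeats the same path once per chunk. One possible refinement is sketched below. `unique_sources` is a hypothetical helper, and `SimpleNamespace` objects stand in for the retrieved documents, assuming only that each one has a `metadata` dictionary with a `source` key as used above.

```python
from types import SimpleNamespace


def unique_sources(source_documents):
    """Return each source path once, in the order it first appears."""
    seen = []
    for doc in source_documents:
        source = doc.metadata.get("source", "<unknown>")
        if source not in seen:
            seen.append(source)
    return seen


# Stand-in documents that mimic three chunks retrieved from two files.
fake_documents = [
    SimpleNamespace(metadata={"source": "data/readme.md"}),
    SimpleNamespace(metadata={"source": "data/readme.md"}),
    SimpleNamespace(metadata={"source": "data/other.md"}),
]

sources = unique_sources(fake_documents)
print(sources)
```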

In [32]:
query = 'Who is the creator of the github repo?'

print('--------------Instructor Embeddings-------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)


--------------Instructor Embeddings-------------

 The creator of the github repo is JupyterHub.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md


In [33]:
query = 'Who is the creator of the github repo?'

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 The creator of the github repo is not specified in the given context.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md






In [34]:
query = 'How can I contribute to this software?'

print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 You can contribute to this software by following the instructions in the
contributing guide, which can be found here:
https://repo2docker.readthedocs.io/en/latest/contributing/contributing.html.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md


In [35]:
query = "How can I contribute to this software?"

# print('-------------------OpenAI Embeddings------------------')
# llm_response = qa_chain_openai(query)
# process_llm_response(llm_response)
# print('\n\n\n')
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 You can contribute to this software by following the instructions in the
Contributing Guide, which can be found at
https://repo2docker.readthedocs.io/en/latest/contributing/contributing.html.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md


In [36]:
query = "How the software is packaged and distributed?"

# print('-------------------OpenAI Embeddings------------------')
# llm_response = qa_chain_openai(query)
# process_llm_response(llm_response)
# print('\n\n\n')
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 repo2docker is available on PyPI and can be installed from source using the
instructions provided.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md


In [37]:
query = "How the software is packaged and distributed?"

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 repo2docker is distributed as a Python package on PyPI.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md






In [38]:
query = "can you give me the installation instructions?"

# print('-------------------OpenAI Embeddings------------------')
# llm_response = qa_chain_openai(query)
# process_llm_response(llm_response)
# print('\n\n\n')
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 To install from PyPI, use the command `pip install jupyter-repo2docker`. To
install from source, use the commands `git clone
https://github.com/jupyterhub/repo2docker.git`, `cd repo2docker`, and `pip
install -e .`.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md


In [40]:
query = "this repository contains the installation instructions? If so, please list the library dependencies"

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 Yes, this repository contains the installation instructions. The library
dependencies are Docker, Python 3.6+, and Linux or macOS.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/maSMP-LLM/data/mpdl_collection/readme.md
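
The cells above run each query through one chain at a time, which makes the answers hard to compare side by side. A minimal sketch of a comparison helper follows. `compare_chains` is a hypothetical function, and the two stand-in chains only mimic the dictionary shape (`result`, `source_documents`) that the chains in this notebook return. In the notebook itself, `qa_chain_instrucEmbed` and `qa_chain_openai` would take their place.

```python
def compare_chains(query, chains):
    """Run one query through every chain and collect the answers by label."""
    answers = {}
    for label, chain in chains.items():
        response = chain(query)
        answers[label] = response["result"].strip()
    return answers


# Stand-in chains (hypothetical) so the sketch runs without any API key.
def fake_instructor_chain(query):
    return {"result": " answer A for: " + query, "source_documents": []}


def fake_openai_chain(query):
    return {"result": " answer B for: " + query, "source_documents": []}


answers = compare_chains(
    "Who is the creator of the github repo?",
    {
        "Instructor Embeddings": fake_instructor_chain,
        "OpenAI Embeddings": fake_openai_chain,
    },
)

for label, answer in answers.items():
    print(f"{label}: {answer}")
```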






## LLAMA Model

In [22]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.llms import LlamaCpp
from langchain.memory import ConversationSummaryMemory
from langchain.prompts import PromptTemplate

In [20]:
!pip install llama-cpp-python


In [25]:
!pip install llama-cpp-python  --upgrade --force-reinstall --no-cache-dir


In [29]:
from llama_cpp import Llama
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = Llama(
    model_path="/Users/rlm/Desktop/Code/llama/code-llama/codellama-13b-instruct.Q4_K_M.gguf",
    n_ctx=5000,
    n_gpu_layers=1,
    n_batch=512,
    f16_kv=True,  # must be True, otherwise problems appear after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
)

ValueError: Model path does not exist: ./models/7B/llama-model.gguf
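
The error means the model file was not found at the path passed to the loader. A minimal sketch of checking the path before loading, so that the failure names the missing file up front: `resolve_model_path` is a hypothetical helper, and the temporary `model.gguf` file is an empty placeholder created only so the example runs.

```python
import tempfile
from pathlib import Path


def resolve_model_path(candidate):
    """Return the path as a string if the file exists, otherwise raise early."""
    path = Path(candidate).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    return str(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Empty placeholder file standing in for a downloaded .gguf model.
    fake_model = Path(tmp_dir) / "model.gguf"
    fake_model.write_bytes(b"")
    found = resolve_model_path(fake_model)

    try:
        resolve_model_path(Path(tmp_dir) / "missing.gguf")
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True

print(found, missing_detected)
```

The returned string can then be passed as `model_path` when constructing the model.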