# OpenAI vs Local Embeddings

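This notebook compares the retrieval performance of two embedding models:
- OpenAI's embeddings model (`text-embedding-ada-002`)
- InstructorEmbedding (`hkunlp/instructor-xl`) from [Hugging Face](https://huggingface.co/hkunlp/instructor-xl)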

### Install required packages

In [1]:
!pip install -r requirements.txt


### Set up the environment variables and import libraries

In [35]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [36]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader, TextLoader, AirbyteJSONLoader
from langchain.document_loaders import DirectoryLoader

In [37]:
# InstructorEmbedding 
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings
# OpenAI Embedding
from langchain.embeddings import OpenAIEmbeddings

### Load Multiple files from Directory (json)

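We load all MPDL project files with `GenericLoader` from `langchain.document_loaders.generic`. The cells below walk the `mpdl_collection_raw` directory and load every `.json` file; the loaded content is returned as LangChain `Document` objects (the **documents**). An earlier attempt with `DirectoryLoader` and `TextLoader` is kept as commented-out code.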

In [81]:
# from langchain.document_loaders import AirbyteJSONLoader
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import LanguageParser
from langchain.text_splitter import Language

In [39]:
root_dir = "/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm"

In [80]:
# loader = DirectoryLoader(f'{root_dir}/mpdl_collection/', glob='**/*.json', show_progress=True, loader_cls=TextLoader)
# documents = loader.load()
# print(len(documents))

100%|██████████| 3/3 [00:00<00:00, 543.28it/s]

3





In [93]:
# Load all repos
loader = GenericLoader.from_filesystem(
    root_dir + "/mpdl_collection_raw",
    glob="**/*",
    suffixes=[".json"],
    parser=LanguageParser(language=Language.PYTHON, parser_threshold=500),
)
documents = loader.load()
len(documents)

58
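For reference, the directory walk that the loader performs can be approximated with the standard library alone. This is only a sketch under stated assumptions: `load_json_documents` is a hypothetical helper, the file names in the demo are made up, and it parses each file as JSON instead of producing LangChain `Document` objects.

```python
import json
import tempfile
from pathlib import Path


def load_json_documents(directory):
    """Return (path, parsed JSON) pairs for every .json file under directory."""
    documents = []
    for path in sorted(Path(directory).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            documents.append((str(path), json.load(handle)))
    return documents


# Self-contained demo on a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "a.json").write_text('{"name": "a"}', encoding="utf-8")
    (Path(tmp) / "nested" / "b.json").write_text('{"name": "b"}', encoding="utf-8")
    (Path(tmp) / "notes.txt").write_text("ignored", encoding="utf-8")
    loaded = load_json_documents(tmp)

print(len(loaded))  # only the two .json files are picked up
```

Parsing the JSON up front, as this sketch does, would also make it possible to select individual fields before embedding, whereas the notebook embeds the raw file text.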

### Splitting

Split each `Document` into chunks for embedding and vector storage. We use `RecursiveCharacterTextSplitter.from_language`, which splits on separators specific to the given language.

In [124]:
# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=1000,
#     chunk_overlap=100
# )
# texts = text_splitter.split_documents(documents)

from langchain.text_splitter import RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=2000, chunk_overlap=200
)
texts = python_splitter.split_documents(documents)
len(texts)

738
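To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not LangChain's implementation: `RecursiveCharacterTextSplitter` also tries to break on separators, which this sketch ignores. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters,
    where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means that text near a chunk boundary appears in both neighbouring chunks, so a sentence or JSON field cut at the boundary is still retrievable in one piece from at least one of them.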

In [89]:
print(texts[22])

page_content='"technique": "GitHub_API"\n        }\n    ],\n    "download_url": [\n        {\n            "result": {\n                "value": "https://github.com/MPDL/rdmo-catalog/releases",\n                "type": "Url"\n            },\n            "confidence": 1,\n            "technique": "GitHub_API"\n        }\n    ],\n    "programming_languages": [\n        {\n            "result": {\n                "value": "R",\n                "name": "R",\n                "type": "Programming_language",\n                "size": 2111\n            },\n            "confidence": 1,\n            "technique": "GitHub_API"\n        },\n        {\n            "result": {\n                "value": "Shell",\n                "name": "Shell",\n                "type": "Programming_language",\n                "size": 1602\n            },\n            "confidence": 1,\n            "technique": "GitHub_API"\n        },\n        {\n            "result": {\n                "value": "Python",\n             

In [90]:
len(texts)
for element in texts:
    print(element)


page_content='{\n    "somef_provenance": {\n        "somef_version": "0.9.4",\n        "somef_schema_version": "1.0.0",\n        "date": "2023-11-28 15:24:34"\n    },\n    "code_repository": [\n        {\n            "result": {\n                "value": "https://github.com/MPDL/screenshot-service",\n                "type": "Url"\n            },\n            "confidence": 1,\n            "technique": "GitHub_API"\n        }\n    ],\n    "owner": [\n        {\n            "result": {\n                "value": "MPDL",\n                "type": "Organization"\n            },\n            "confidence": 1,\n            "technique": "GitHub_API"\n        }\n    ],\n    "date_created": [\n        {\n            "result": {\n                "value": "2014-08-05T09:14:41Z",\n                "type": "Date"\n            },\n            "confidence": 1,\n            "technique": "GitHub_API"\n        }\n    ],\n    "date_updated": [\n        {\n            "result": {\n                "value": "201

### RetrievalQA

We need to store the documents so that their content can be searched semantically. The most common approach is to embed the contents of each chunk and store the embedding together with the chunk in a vector store. Two retriever configurations appear in this notebook:
- Max marginal relevance (`search_type="mmr"`) returning 8 documents, kept as commented-out code in the next cell
- The default similarity search returning 3 documents (`k=3`), which the chains below actually use
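The difference between plain similarity ranking and max marginal relevance (MMR) can be illustrated on toy vectors. The sketch below is an assumption-laden illustration, not LangChain's or FAISS's implementation: `cosine` and `mmr_select` are hypothetical helpers, and the 2-D vectors stand in for real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 points elsewhere.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.01], [0.6, 0.8]]

print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]: pure relevance
print(mmr_select(query, docs, k=2, lambda_mult=0.3))  # [0, 2]: diversity favoured
```

With `lambda_mult=1.0` the selection reduces to similarity ranking and returns the two near-duplicates. Lowering it penalises redundancy, which matters here because many chunks of the same JSON file look alike.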

In [96]:
# from langchain.embeddings.openai import OpenAIEmbeddings
# from langchain.vectorstores import Chroma

# db = Chroma.from_documents(texts, OpenAIEmbeddings(disallowed_special=()))
# retriever = db.as_retriever(
#     search_type="mmr",  # Also test "similarity"
#     search_kwargs={"k": 8},
# )

### Get Embeddings for MPDL document(s)

In [97]:
import pickle
import faiss  # for similarity search: https://faiss.ai/index.html
from langchain.vectorstores import FAISS

In [98]:
def store_embeddings(docs, embeddings, store_name, path):
    # Build a FAISS index from the documents and pickle it to disk
    vectorStore = FAISS.from_documents(docs, embeddings)

    with open(f"{path}/faiss_{store_name}.pkl", "wb") as f:
        pickle.dump(vectorStore, f)

In [99]:
def load_embeddings(store_name, path):
    # Load a previously pickled FAISS vector store from disk
    with open(f"{path}/faiss_{store_name}.pkl", "rb") as f:
        VectorStore = pickle.load(f)
    return VectorStore

### HF Instructor Embeddings

In [100]:
from langchain.embeddings import HuggingFaceInstructEmbeddings
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl",
                                                      model_kwargs={"device": "cpu"})

load INSTRUCTOR_Transformer
max_seq_length  512


In [101]:
Embedding_store_path = f"{root_dir}/Embedding_store"
print(Embedding_store_path)

/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/Embedding_store


In [102]:
db_instructEmbedd = FAISS.from_documents(texts, instructor_embeddings)
print(db_instructEmbedd)  # building the index takes a long time on CPU

<langchain.vectorstores.faiss.FAISS object at 0x175b17d50>


In [103]:
retriever = db_instructEmbedd.as_retriever(search_kwargs={"k":3})

In [104]:
print(retriever)

tags=['FAISS', 'HuggingFaceInstructEmbeddings'] vectorstore=<langchain.vectorstores.faiss.FAISS object at 0x175b17d50> search_kwargs={'k': 3}


In [105]:
retriever.search_type

'similarity'

In [106]:
docs = retriever.get_relevant_documents("Who are the authors of this software?")
docs[0]

Document(page_content='"value": "Copyright 2013, NAN contributors:\\n  - Rod Vagg <https://github.com/rvagg>\\n  - Benjamin Byholm <https://github.com/kkoopa>\\n  - Trevor Norris <https://github.com/trevnorris>\\n(the \\"Original Author\\")\\nAll rights reserved.\\n\\nMIT +no-false-attribs License\\n\\nPermission is hereby granted, free of charge, to any person\\nobtaining a copy of this software and associated documentation\\nfiles (the \\"Software\\"), to deal in the Software without\\nrestriction, including without limitation the rights to use,\\ncopy, modify, merge, publish, distribute, sublicense, and/or sell\\ncopies of the Software, and to permit persons to whom the\\nSoftware is furnished to do so, subject to the following\\nconditions:\\n\\nThe above copyright notice and this permission notice shall be\\nincluded in all copies or substantial portions of the Software.\\n\\nDistributions of all or part of the Software intended to be used\\nby the recipients as they would use the

In [107]:
# create the chain to answer questions
qa_chain_instrucEmbed = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.2),
                                                    chain_type="stuff",
                                                    retriever=retriever,
                                                    return_source_documents=True)

### OpenAI's embeddings

In [108]:
# The libraries used here emit warnings about parallelism and stack logging.
# Disabling tokenizer parallelism addresses the parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [109]:
from langchain.embeddings import OpenAIEmbeddings

In [110]:
embeddings = OpenAIEmbeddings()

The following cell can fail with a `RateLimitError` when the request exceeds the rate limit of the OpenAI embeddings API (`text-embedding-ada-002`). The API caps the number of tokens that can be processed per minute: in the run below the limit was 150,000 tokens per minute and the request contained 294,554. As the error message suggests, the fix is to reduce the input size, i.e. send fewer tokens per request so that it stays within the limit.

In [123]:
db_openAIEmbedd = FAISS.from_documents(texts, embeddings)
retriever_openai = db_openAIEmbedd.as_retriever(search_kwargs={"k": 3})

RateLimitError: Error code: 429 - {'error': {'message': 'Request too large for text-embedding-ada-002 in organization org-9JUjLVRoZdIJ2OZjz7Nx4iOK on tokens per min (TPM): Limit 150000, Requested 294554. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more. You can increase your rate limit by adding a payment method to your account at https://platform.openai.com/account/billing.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
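One way to stay under a tokens-per-minute limit is to embed the chunks in several smaller batches instead of a single request. The sketch below only shows the batching step. Its names (`batch_by_token_budget`, `rough_token_count`) are hypothetical, and the estimate of about four characters per token is a rough assumption, not the tokenizer OpenAI uses.

```python
def batch_by_token_budget(chunks, max_tokens_per_batch, estimate_tokens):
    """Group chunks so that each batch stays within max_tokens_per_batch.

    A single chunk larger than the budget still forms its own batch.
    """
    batches, current, current_tokens = [], [], 0
    for chunk in chunks:
        tokens = estimate_tokens(chunk)
        if current and current_tokens + tokens > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(chunk)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches


def rough_token_count(text):
    # Assumption: roughly 4 characters per token; a real tokenizer is more accurate.
    return max(1, len(text) // 4)


# Toy chunks of about 100, 100, 100 and 25 estimated tokens.
toy_chunks = ["a" * 400, "b" * 400, "c" * 400, "d" * 100]
batches = batch_by_token_budget(
    toy_chunks, max_tokens_per_batch=200, estimate_tokens=rough_token_count
)
print([len(batch) for batch in batches])  # [2, 2]
```

In the notebook, one could presumably build the index from the first batch with `FAISS.from_documents`, add the remaining batches with the vector store's `add_documents` method, and pause about a minute between batches. That wiring is untested here and would need a budget safely below the 150,000 TPM limit.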

In [112]:
# create the chain to answer questions
qa_chain_openai = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.2),
                                              chain_type="stuff",
                                              retriever=retriever_openai,
                                              return_source_documents=True)

### Testing both models

In [113]:
## Cite sources

import textwrap # text wrapping and filling
def wrap_text_preserve_newlines(text, width=79):
    # Split the input text into lines based on newline characters
    lines = text.split("\n")

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width) for line in lines]

    # Join the wrapped lines back into a single string using newline characters
    wrapped_text = '\n'.join(wrapped_lines)
    return wrapped_text

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\nSources:')
    for source in llm_response['source_documents']:
        print(source.metadata['source'])

In [126]:
query = 'Can you give me the installation instructions?'

print('--------------Instructor Embeddings-------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)


--------------Instructor Embeddings-------------

 You can find a very practicable installation guide on http://www.compass-
style.org/install/. If you get a failure (e.g. on Debian systems) with "$ gem
update --system", please try the following steps in front of update system:
sudo gem install rubygems-update, sudo update_rubygems, sudo gem update
--system. If you get a failure for installing compass, try sudo apt-get install
ruby-dev, sudo gem install sass, sudo gem install compass.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection_raw/MPDL_imeji_sass_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection_raw/MPDL_MPG-Reader_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection_raw/MPDL_r2-ui-prototype_2023-11-28.json


In [115]:
query = 'Can you give me the installation instructions?'

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------


python(86223) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


 No, I don't know the installation instructions.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection/MPDL_fedora-genericsearch-2-2_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection/MPDL_fedora-genericsearch-2-2_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection/MPDL_fedora-genericsearch-2-2_2023-11-28.json






In [116]:
query = 'Can you give me the installation requirements of fedora?'

print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 No, I don't know the installation requirements of fedora.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection_raw/MPDL_imeji_sass_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection_raw/MPDL_fedora-genericsearch-2-2_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection_raw/MPDL_MPG-Reader_2023-11-28.json


In [117]:
query = "Can you give me the installation requirements of fedora?"

# print('-------------------OpenAI Embeddings------------------')
# llm_response = qa_chain_openai(query)
# process_llm_response(llm_response)
# print('\n\n\n')
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 No, I don't know the installation requirements of fedora.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection_raw/MPDL_imeji_sass_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection_raw/MPDL_fedora-genericsearch-2-2_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection_raw/MPDL_MPG-Reader_2023-11-28.json


In [118]:
query = "How the software is packaged and distributed?"

# print('-------------------OpenAI Embeddings------------------')
# llm_response = qa_chain_openai(query)
# process_llm_response(llm_response)
# print('\n\n\n')
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 You may distribute the Executable form of the Covered Software under the terms
of this License or under the terms of a license of Your choice, which may
contain terms different from this License, provided that You are in compliance
with the terms of this License and that the license for the Executable form
does not attempt to limit or alter the recipient's rights in the Source Code
form from the rights set forth in this License.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection_raw/MPDL_INGe_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection_raw/MPDL_INGe_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection_raw/MPDL_INGe_2023-11-28.json


In [119]:
query = "How the software is packaged and distributed?"

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 FedoraGSearch is subject to the same open source license as the Fedora
Repository System at www.fedora-commons.org and is distributed by The Technical
University of Denmark.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection/MPDL_fedora-genericsearch-2-2_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection/MPDL_fedora-genericsearch-2-2_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection/MPDL_fedora-genericsearch-2-2_2023-11-28.json






In [120]:
query = "Please provide a link to the documentation for each project?"

# print('-------------------OpenAI Embeddings------------------')
# llm_response = qa_chain_openai(query)
# process_llm_response(llm_response)
# print('\n\n\n')
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------



RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for text-davinci-003 in organization org-9JUjLVRoZdIJ2OZjz7Nx4iOK on requests per min (RPM): Limit 3, Used 3, Requested 1. Please try again in 20s. Visit https://platform.openai.com/account/rate-limits to learn more. You can increase your rate limit by adding a payment method to your account at https://platform.openai.com/account/billing.', 'type': 'requests', 'param': None, 'code': 'rate_limit_exceeded'}}
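This error comes from the completion model (`text-davinci-003`) that both chains use to generate the answer, not from the embedding model: the account allowed 3 requests per minute and asked to retry after 20 seconds. A common workaround is to retry with a growing delay. The sketch below is a generic illustration with hypothetical names (`call_with_retry`, `FakeRateLimitError`, `flaky_chain`); the fake error and chain stand in for the real ones so that the example runs without network access.

```python
import time


def call_with_retry(fn, *args, max_attempts=5, base_delay=20.0,
                    retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call fn, retrying with exponential backoff when it raises retry_on."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts:
                raise
            # Wait 20s, 40s, 80s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


class FakeRateLimitError(Exception):
    """Stand-in for the rate-limit exception raised by the API client."""


attempts = {"count": 0}


def flaky_chain(query):
    # Stand-in for a QA chain: fails twice, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("rate limit reached")
    return {"result": f"answer to: {query}"}


waits = []  # record the delays instead of actually sleeping
response = call_with_retry(flaky_chain, "test query",
                           retry_on=(FakeRateLimitError,), sleep=waits.append)
print(response["result"], waits)  # answer to: test query [20.0, 40.0]
```

In the notebook, `fn` would be `qa_chain_instrucEmbed` or `qa_chain_openai`, and `retry_on` would be the rate-limit exception class exposed by the installed `openai` package. The default `sleep=time.sleep` then performs the actual waiting.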

In [121]:
query = "Please provide a link to the documentation for each project?"

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 The documentation for each project can be found on the GitHub Wiki page for
the project. The link for the bio-formats-service project is
https://github.com/MPDL/bio-formats-service/wiki.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection/MPDL_fedora-genericsearch-2-2_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection/MPDL_auth_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection/MPDL_bio-formats-service_2023-11-28.json






In [122]:
query = "Can you give me the list of names for all the project you find?"

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 No, I cannot give you the list of names for all the projects I find.

Sources:
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection/MPDL_fedora-genericsearch-2-2_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection/MPDL_fedora-genericsearch-2-2_2023-11-28.json
/Users/ccugutrillague/Documents/outreach/smp-hackthaton/smp-llm/mpdl_collection/MPDL_auth_2023-11-28.json




