Following the LangChain quickstart tutorial: https://python.langchain.com/v0.1/docs/use_cases/question_answering/quickstart/

Other tutorials: https://medium.com/@thakermadhav/build-your-own-rag-with-mistral-7b-and-langchain-97d0c92fa146

Goal: build a RAG (retrieval-augmented generation) pipeline to answer queries about research being done at UCSD. The database information is pulled from Dimensions.

Using a Mistral-7B model from Hugging Face. Installation instructions: https://huggingface.co/docs/transformers/installation

Pip installations:
pip install --upgrade huggingface_hub
install PyTorch (for example, pip install torch)
pip install transformers

For LangChain:
pip install --upgrade --quiet langchain langchain-community langchainhub langchain-openai langchain-chroma bs4

The notebook also uses FAISS and python-dotenv:
pip install --upgrade --quiet faiss-cpu python-dotenv

LangChain is used to build and prototype the application. LangSmith is used on the way to production: it traces and monitors runs to make the application more reliable, and it provides a UI for inspecting what the LLM is doing.

In [50]:
import os
from dotenv import load_dotenv

load_dotenv()  # read variables from the .env file into the environment

hf_access_token = os.getenv('HF_TOKEN') # get access token from .env file
#openai_api_key = os.getenv('OPENAI_API_KEY')
lc_api_key = os.getenv('LANGCHAIN_API_KEY') # LangSmith API key

In [52]:
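# start logging traces to use LangSmith
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ only accepts strings, so fail early if the token is missing from .env
if hf_access_token is None:
    raise RuntimeError("HF_TOKEN is not set")
os.environ['HUGGINGFACEHUB_API_TOKEN'] = hf_access_token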

In [7]:
from langchain import hub
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings

In [8]:
# use DocumentLoaders: objects that load data from a source and return a list of Documents for our vector database
# one Document has page_content and metadata
# TODO: make a csv file with publications that we want to load in 
# https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/csv/ 

In [11]:
# large fields in the CSV file -> increase field size limit
# https://stackoverflow.com/questions/15063936/csv-error-field-larger-than-field-limit-131072

import sys
import csv
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)

In [13]:
# STEP 1: indexing, load data
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="publication_docs.csv")
data = loader.load()

In [15]:
len(data)

5014

In [17]:
# STEP 2: indexing, split data
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(data)

In [19]:
all_splits[0]

Document(page_content='title: Missing Wedge Completion via Unsupervised Learning with Coordinate Networks', metadata={'source': 'publication_docs.csv', 'row': 0, 'start_index': 0})

In [21]:
len(all_splits)

118738
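
The jump from 5,014 rows to 118,738 chunks follows from the splitter settings. As a rough illustration of how `chunk_size=1000` and `chunk_overlap=200` interact, here is a minimal sketch of a fixed-size sliding window. It is a simplification: `RecursiveCharacterTextSplitter` also tries to split on separators such as newlines first, which is presumably why the first chunk above holds only the title line. `sliding_chunks` is a hypothetical helper, not part of LangChain.

```python
from typing import Iterator, Tuple


def sliding_chunks(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> Iterator[Tuple[int, str]]:
    """Yield (start_index, chunk) pairs from a fixed-size window with overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new chunk starts this far after the last
    start = 0
    while start < len(text):
        yield start, text[start:start + chunk_size]
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step


# a 2500-character field becomes three overlapping chunks
starts = [start for start, _ in sliding_chunks("x" * 2500)]
print(starts)  # prints [0, 800, 1600]
```

With these settings each extra 800 characters in a field adds roughly one more chunk, so rows with very long fields account for most of the total.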

I tried to embed and store all the chunks, but the call fails with a JSON decoder error. I tried batching the data as shown below but still got the error, so for now the index is built from the first 1,000 chunks only.

In [24]:
# STEP 3: indexing, store
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings

embeddings = HuggingFaceInferenceAPIEmbeddings(api_key=hf_access_token, model_name="sentence-transformers/all-MiniLM-L6-v2")
print("Hugging Face Embeddings model initialized successfully.")
# index only the first 1000 chunks for now (the full set hits the JSON decoder error)
db = FAISS.from_documents(all_splits[0:1000], embeddings)

Hugging Face Embeddings model initialized successfully.


In [26]:
# split the chunks into batches of 10
batch_size = 10
batched_data = [
    all_splits[i:i + batch_size] for i in range(0, len(all_splits), batch_size)
]

len(batched_data)

11874
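
The batches above are built but not yet sent anywhere. One way to use them is sketched below, assuming (this is a guess, not confirmed by the error itself) that the JSON decoder error comes from the inference API occasionally returning a non-JSON response, for example while the model is loading or when requests are rate limited. The helper retries each batch a few times before giving up. `embed_in_batches`, `embed_fn`, `max_retries`, and `wait_seconds` are hypothetical names; `embed_fn` stands in for something like `embeddings.embed_documents`.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 10,
    max_retries: int = 3,
    wait_seconds: float = 1.0,
) -> List[List[float]]:
    """Embed texts batch by batch, retrying a batch when decoding the response fails."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        for attempt in range(1, max_retries + 1):
            try:
                vectors.extend(embed_fn(list(batch)))
                break  # batch succeeded, move on to the next one
            except ValueError:
                # json.JSONDecodeError is a subclass of ValueError
                if attempt == max_retries:
                    raise  # give up and surface the original error
                time.sleep(wait_seconds * attempt)  # wait a bit longer each time
    return vectors
```

If the error persists on the same batch after several retries, the batch contents themselves (for example an empty or extremely long chunk) would be the next thing to inspect.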

In [28]:
# STEP 4: RETRIEVAL
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [34]:
# STEP 5: GENERATION
from langchain_community.llms import HuggingFaceHub

# zephyr-7b-beta is a fine-tuned version of Mistral-7B
llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    model_kwargs={"temperature": 0.1, "max_length": 1000},
)

In [38]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

from langchain import hub

prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [40]:
print(prompt)

input_variables=['context', 'question'] metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'} messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))]


In [54]:
for chunk in rag_chain.stream("What papers cover malaria?"):
    print(chunk, end="", flush=True)

Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: What papers cover malaria? 
Context: category_for: [{'id': '80034', 'name': '3101 Biochemistry and Cell Biology'}, {'id': '80002', 'name': '31 Biological Sciences'}, {'id': '80068', 'name': '3404 Medicinal and Biomolecular Chemistry'}, {'id': '80005', 'name': '34 Chemical Sciences'}, {'id': '80040', 'name': '3107 Microbiology'}]
publisher: MDPI
research_org_names: ['Stanford Synchrotron Radiation Lightsource', 'Baylor College of Medicine', 'Salk Institute for Biological Studies', 'University of California, San Diego', 'Stanford University', 'University of Michigan–Ann Arbor']
times_cited: 0
year: 2024
journal.id: jour.1028874
journal.title: International Journal of Molecular Sciences
funders:

category_for: [{'id': '80002', 'name': '31

In [None]:
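# basic RAG chain with limited results:
# the output above echoes the full prompt, and the visible retrieved context is publication metadata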
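
Since the generated text above repeats the whole prompt before the model's reply, one simple post-processing step is to keep only what follows the last `Answer:` marker, which is where the `rlm/rag-prompt` template ends. This is a minimal sketch; `extract_answer` is a hypothetical helper, and it would cut too much if the model's own reply contained the marker again.

```python
def extract_answer(generated: str, marker: str = "Answer:") -> str:
    """Return the text after the last marker, or the whole text if the marker is absent."""
    # rpartition returns ('', '', generated) when the marker is not found,
    # so the last element is always the part we want to keep
    return generated.rpartition(marker)[2].strip()


raw = (
    "Human: You are an assistant for question-answering tasks.\n"
    "Question: What papers cover malaria? \n"
    "Context: title: An example title \n"
    "Answer: I don't know."
)
print(extract_answer(raw))  # prints I don't know.
```

It is meant to be applied to the full string, for example the result of `rag_chain.invoke(...)`, rather than to individual streamed chunks.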