# Immunology Vector Search Demo

by [Clifford Anderson](https://www.cliffordanderson.net/), Jean and Alexander Heard Libraries, Vanderbilt University

Latest Revision Date: May 10, 2023

## Document Search and Embeddings with Pinecone and OpenAI

This notebook demonstrates how to perform a similarity search on a set of documents using Pinecone and OpenAI embeddings. It showcases the process of loading a PDF file, extracting and splitting its text, creating an index in Pinecone, generating embeddings using OpenAI, and conducting a similarity search based on a user query. The results are then displayed in a visually appealing HTML format. The purpose of this notebook is to illustrate an effective method for retrieving relevant information from a collection of documents using modern language models and vector search techniques.


### Install Prerequisites

This cell installs six Python packages using pip:

* **pypdf**: A package for working with PDF files, allowing you to extract text, metadata, and more. This package is useful when dealing with PDF documents in Python.
* **langchain**: A framework for building applications on top of large language models. It supplies the document loaders, text splitters, embedding wrappers, vector store integrations, and question-answering chains used throughout this notebook.
* **openai**: The official Python library for the OpenAI API, which provides access to OpenAI's powerful AI models, such as GPT-3 and Codex. You can use this package to interact with these models for various tasks, like text completion, summarization, translation, and more.
* **pinecone-client**: The Python client for Pinecone, a managed vector database service. This package allows you to interact with Pinecone's vector database for tasks like similarity search, nearest neighbors, and more.
* **tiktoken**: A package for tokenizing text, which is useful for various natural language processing tasks. 
* **pandas**: A package for conducting data analysis with Python.

The -q flag is used to run pip install in "quiet" mode, which means it will only display error messages and suppress the usual installation output. This is useful when you want to keep the output clean and focus on any potential issues.

In [None]:
!pip install -q pypdf
!pip install -q langchain
!pip install -q openai
!pip install -q pinecone-client
!pip install -q tiktoken
!pip install -q pandas


### Helper Packages for Security

This cell installs three Python packages: **ndg-httpsclient**, **pyopenssl**, and **pyasn1**. These packages are related to SSL/TLS support and security, providing additional functionality for secure communication over HTTPS.

In [None]:
!pip install -q ndg-httpsclient
!pip install -q pyopenssl
!pip install -q pyasn1

### Importing LangChain Modules

This cell imports the LangChain components used in the rest of the notebook. OpenAIEmbeddings generates embeddings using OpenAI's models, while CharacterTextSplitter handles text splitting based on characters. Pinecone is the wrapper for the Pinecone vector database service, and TextLoader and PyPDFLoader load text and PDF documents, respectively. OpenAI and ChatOpenAI are wrappers for OpenAI's completion and chat models. Lastly, RetrievalQA, RetrievalQAWithSourcesChain, and load_qa_chain provide the chains used for question answering over the indexed documents.

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.document_loaders import TextLoader
from langchain.llms import OpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chains.question_answering import load_qa_chain

### Setting API Keys and Environment Variables

This cell imports the os module and stores the credentials for the OpenAI and Pinecone APIs. The OpenAI API key is written to the OPENAI_API_KEY environment variable, where LangChain's OpenAI classes look for it. The Pinecone API key (PINECONE_API_KEY) and the Pinecone environment (PINECONE_ENV), which identifies the region where the Pinecone project is deployed, are stored as ordinary Python variables and passed to Pinecone explicitly in a later cell. Real keys should never be published: before this notebook is shared, they should be replaced with placeholders so that users may substitute their own keys.

In [None]:
import os

# Replace the placeholders with your own keys; never publish real keys in a shared notebook.
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
PINECONE_API_KEY = "YOUR_PINECONE_API_KEY"
PINECONE_ENV="us-west4-gcp"
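
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. The following is a minimal sketch of one alternative, assuming the keys have been set as environment variables outside the notebook. The helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and are not part of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running this notebook."
        )
    return value


# Demonstration with a dummy value; real keys would be set outside the notebook.
os.environ["DEMO_API_KEY"] = "dummy-value"
demo_key = require_env("DEMO_API_KEY")
```

With a helper like this, a missing key stops the notebook at the first cell instead of causing an authentication error further down.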

### Initializing OpenAI's Language Model

This cell initializes a chat model named `llm` using OpenAI's `gpt-3.5-turbo` model through LangChain's `ChatOpenAI` class. With the temperature set to 0.0, the model favors the most probable output at each step, so its responses are focused and largely repeatable. The `llm` object is passed to the question-answering chains later in the notebook.

In [None]:
llm = ChatOpenAI(
    model_name='gpt-3.5-turbo',
    temperature=0.0
)
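
To see why a low temperature makes output more predictable, the sketch below applies temperature scaling to a few made-up token scores. It is a toy illustration of the general idea, not OpenAI's implementation; the function name and the scores are assumptions chosen for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all probability on the highest score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
greedy = softmax_with_temperature(toy_logits, 0.0)
warm = softmax_with_temperature(toy_logits, 1.0)
```

At temperature 0 the highest-scoring candidate is always chosen, while at temperature 1 the other candidates keep a share of the probability and can be sampled.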

### Initializing OpenAI Embeddings Instance

This cell creates an instance of the OpenAIEmbeddings class, which requests embeddings from OpenAI's embeddings API. Because no model is specified, LangChain's default embedding model is used (`text-embedding-ada-002` at the time of writing, which returns 1536-dimensional vectors). These embeddings can be used for various natural language processing tasks, like similarity search, clustering, or classification.

In [None]:
embeddings = OpenAIEmbeddings()

### Reading Grants Data from GitHub

This cell reads a CSV file hosted on a GitHub Gist with the pandas library, keeps the `Project Number` and `Project Abstract` columns, and drops any rows in which either value is missing.

In [None]:
import pandas as pd

url = 'https://gist.githubusercontent.com/CliffordAnderson/48bc0acb1be3df3538bb30e5d559710e/raw/'
df = pd.read_csv(url)

project_info = df[['Project Number', 'Project Abstract']]
project_info = project_info.dropna(subset=['Project Number', 'Project Abstract'])
 

In [None]:
from langchain.document_loaders import DataFrameLoader

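# page_content_column selects the column that becomes each document's text
# (here the project number); the remaining columns are stored as metadata.
grants_loader = DataFrameLoader(project_info, page_content_column="Project Number")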
grant_docs = grants_loader.load_and_split()

### Uploading a File in Google Colab

This cell imports the `files` module from google.colab, which enables file uploading in Google Colab. Calling files.upload() prompts the user to upload a file. The file_name variable is then set to the expected name of the uploaded PDF, "philosophy_of_immunology.pdf", so the uploaded file must have that name for the following cells to find it.

In [None]:
from google.colab import files

uploaded = files.upload()
file_name = "philosophy_of_immunology.pdf" # Download from http://philsci-archive.pitt.edu/18291/1/philosophy_of_immunology.pdf

### Loading and Splitting PDF with PyPDFLoader

This cell creates an instance of the PyPDFLoader class, passing the file_name variable, which contains the name of the previously uploaded PDF file. It then calls the load_and_split() method on the loader instance to load the PDF document and split it into smaller text segments, storing the resulting list of text segments in the `literature_docs` variable. This process makes it easier to work with the document's content and perform further analysis or processing on individual text segments.

In [None]:
literature_loader = PyPDFLoader(file_name)
literature_docs = literature_loader.load_and_split()
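
The idea behind splitting can be sketched in plain Python. The function below is a simplified, hypothetical stand-in and is not LangChain's implementation, which also tries to break on separators such as paragraph boundaries. It cuts a string into fixed-size chunks that overlap slightly, so that a sentence falling on a boundary still appears whole in at least one chunk.

```python
def split_text(text, chunk_size, overlap):
    """Split text into chunks of at most chunk_size characters sharing `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_text(sample, chunk_size=10, overlap=3)
# chunks == ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Smaller chunks give more precise search hits, while larger chunks give the language model more surrounding context; the right balance depends on the documents.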

### Initializing Pinecone SDK

This cell imports the pinecone package and initializes the Pinecone SDK using the pinecone.init() function. By passing the PINECONE_API_KEY and PINECONE_ENV variables defined earlier as arguments, it sets up the connection to Pinecone's services. This enables the notebook to interact with Pinecone's vector database service, allowing you to store, search, and manage high-dimensional vectors generated from the text data.

In [None]:
import pinecone 

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)

### Creating a Pinecone Index

This cell creates a Pinecone index to store and search high-dimensional vectors efficiently. It imports the pinecone package and sets the index_name to "immunology". The pinecone.create_index() function is called with several parameters: the index name, the dimension of the vectors (1536), the similarity metric ('cosine'), the number of pods (1), the number of replicas (1), and the pod type ('p1.x1'). These settings create an index optimized for cosine similarity searches on 1536-dimensional vectors, which will be useful for storing and querying embeddings generated from the text data.

In [None]:
import pinecone

index_name = "immunology" 

pinecone.create_index(index_name, 
                      dimension=1536, 
                      metric='cosine', 
                      pods=1, 
                      replicas=1, 
                      pod_type='p1.x1')
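
The 'cosine' metric scores two vectors by the angle between them rather than by their length. As a small worked example, independent of Pinecone, the sketch below computes cosine similarity for toy vectors; the function name is chosen for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction score 1 regardless of their length.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The dimension check mirrors the index's own constraint: every vector stored in or queried against the index must have the dimension the index was created with, here 1536.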


### Creating Pinecone Document Search Instance

These cells create two Pinecone document search instances with the `Pinecone.from_documents()` method, one for each collection. Each call takes the documents to index (literature_docs or grant_docs), the embeddings instance of the OpenAIEmbeddings class, the index_name, which is set to "immunology", and a namespace ("literature" or "grants") that keeps the two collections separate within the same index. The method embeds each text segment and uploads the resulting vectors to the index so that they can later be retrieved by similarity search.

In [None]:
literature_search = Pinecone.from_documents(literature_docs, embeddings, index_name=index_name, namespace="literature")

In [None]:
grant_search = Pinecone.from_documents(grant_docs, embeddings, index_name=index_name, namespace="grants")

### Use Existing Pinecone Index

Use this code cell instead of the previous two if, and only if, you have already created and populated the index in Pinecone's vector database. The cell assumes that the index_name variable is set, and it reinitializes the OpenAIEmbeddings class; the embedding model must be the same one that was used when the index was populated. `Pinecone.from_existing_index()` then reconnects to the "literature" and "grants" namespaces without embedding or uploading the documents again.

In [None]:
embeddings = OpenAIEmbeddings()

literature_search = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings, namespace="literature")
grant_search = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings, namespace="grants")

### Performing a Similarity Search with Pinecone

In this cell, a similarity search is performed using the literature_search instance of the Pinecone document search class. The query variable is set to "Please define immunology in simple terms". The similarity_search() method is called with the query and the "literature" namespace as its arguments. This method embeds the query and searches the Pinecone index for the documents whose embeddings are most similar to it. The results are stored in the docs variable, which can then be used to display or further analyze the most relevant documents found.

In [None]:
query = "Please define immunology in simple terms"
docs = literature_search.similarity_search(query, namespace="literature")
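
Conceptually, a similarity search ranks every stored vector against the query vector and returns the best matches. The sketch below shows that ranking step on a toy in-memory "index" with two-dimensional vectors. The document ids and vectors are invented for illustration, and a real vector database uses approximate search structures rather than this exhaustive scan.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def top_k(query_vector, index, k):
    """Return the k (score, doc_id) pairs with the highest similarity to the query."""
    q = normalize(query_vector)
    scored = (
        (sum(a * b for a, b in zip(q, vec)), doc_id)
        for doc_id, vec in index.items()
    )
    return heapq.nlargest(k, scored)


# Toy 2-dimensional "embeddings" keyed by hypothetical document ids.
toy_index = {
    "doc-a": normalize([1.0, 0.0]),
    "doc-b": normalize([1.0, 1.0]),
    "doc-c": normalize([0.0, 1.0]),
}
results = top_k([1.0, 0.2], toy_index, k=2)
ranked_ids = [doc_id for _, doc_id in results]
```

Here `ranked_ids` lists the documents from most to least similar, which is the order in which search results are returned.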

### Displaying Similarity Search Results

This cell iterates through the documents found in the docs variable and creates an HTML output to display the results in a user-friendly format. The html_output variable is initialized with a `<div>` element to set the font family. For each document, the cell extracts the page_content, page, and source from the document's metadata. It then constructs an HTML string that includes a styled `<div>` containing the page number, source, and content of each document. The cell concludes by displaying the formatted HTML output using the display() function from the IPython library, which presents the results in a visually appealing manner.

In [None]:
from IPython.display import display, HTML

html_output = "<div style='font-family: Arial, sans-serif;'>"
for document in docs:
    content = document.page_content
    page = document.metadata["page"]
    source = document.metadata["source"]
    
    html_output += "<div style='border: 1px solid #ccc; padding: 15px; margin: 15px;'>"
    html_output += f"<h3 style='margin-bottom: 5px;'>Page: {page}</h3>"
    html_output += f"<p style='margin-bottom: 5px;'>Source: {source}</p>"
    html_output += f"<p>{content}</p>"
    html_output += "</div>"
html_output += "</div>"

display(HTML(html_output))
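
Text extracted from a PDF can contain characters such as `<` or `&`, which a browser would interpret as markup when they are placed in the HTML unchanged. The sketch below is one possible variant of the card-building step that escapes each value with the standard library's `html.escape`; the helper name `render_result` and the sample values are hypothetical.

```python
import html


def render_result(content, page, source):
    """Build one result card, escaping text so stray markup cannot break the layout."""
    return (
        "<div style='border: 1px solid #ccc; padding: 15px; margin: 15px;'>"
        f"<h3 style='margin-bottom: 5px;'>Page: {html.escape(str(page))}</h3>"
        f"<p style='margin-bottom: 5px;'>Source: {html.escape(str(source))}</p>"
        f"<p>{html.escape(content)}</p>"
        "</div>"
    )


# Made-up values standing in for one search result.
card = render_result("T cells < B cells & more", 3, "example.pdf")
```

The escaped card can be concatenated into html_output in the same way as the unescaped strings in the cell above.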

### Initializing Pinecone Vector Store

These cells wrap the existing index in two Pinecone vector stores, `immunology_vectorstore` and `grants_vectorstore`, one for each namespace. Each cell creates a Pinecone index object named `index` from index_name and passes it to the `Pinecone` constructor together with embeddings.embed_query, which embeds incoming queries, and text_field, which names the metadata field from which each document's text is read: "text" for the literature namespace and "Project Narrative" for the grants namespace.

In [None]:
text_field = "text" # See https://github.com/hwchase17/langchain/issues/3800

index = pinecone.Index(index_name)

immunology_vectorstore = Pinecone(
    index, embeddings.embed_query, text_field, namespace = "literature"
)

In [None]:
text_field = "Project Narrative" # See https://github.com/hwchase17/langchain/issues/3800

index = pinecone.Index(index_name)

grants_vectorstore = Pinecone(
    index, embeddings.embed_query, text_field, namespace = "grants"
)

### Initialize Retrieval-based Q&A

These cells initialize two retrieval-based question answering chains, `qa_with_literature` and `qa_with_grants`, using the RetrievalQAWithSourcesChain class. Each chain is configured with the underlying language model llm, a chain type of "stuff", which places all of the retrieved documents into a single prompt, and a retriever built from the corresponding vector store. The chains can now be used to answer questions and provide supporting sources based on the vector store's indexed data and the language model's capabilities.

In [None]:
qa_with_literature = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=immunology_vectorstore.as_retriever()
)

In [None]:
qa_with_grants = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=grants_vectorstore.as_retriever()
)
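
The "stuff" strategy can be pictured as simple string assembly: every retrieved passage is placed, with its source, into one prompt that is sent to the model in a single call. The sketch below illustrates that assembly step only. The prompt wording, the function name, and the sample passages are invented for illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Concatenate every retrieved passage into one prompt, tagging each with its source."""
    context = "\n\n".join(
        f"Content: {doc['text']}\nSource: {doc['source']}" for doc in documents
    )
    return f"{context}\n\nQuestion: {question}\nAnswer with sources:"


# Made-up passages standing in for retrieved documents.
toy_docs = [
    {"text": "Immunology studies the immune system.", "source": "page 1"},
    {"text": "Immunity protects against pathogens.", "source": "page 2"},
]
stuff_prompt = build_stuff_prompt("What is immunology?", toy_docs)
```

Because everything goes into one prompt, this strategy is simple and needs only one model call, but the combined passages must fit within the model's context window.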

### Querying the Retrieval-based Question Answering Model with Sources

Calling `qa_with_literature(literature_query)` sends a query to the retrieval-based question answering chain. The chain uses the configured retriever to fetch relevant passages from the vector store and the underlying language model llm to generate an answer, and it returns the supporting sources along with the answer.

In [None]:
literature_query = "What is the definition of immunology"

qa_with_literature(literature_query)

### Extracting Information from the Grants Dataset

This cell uses a LangChain question-answering (QA) chain, backed by an OpenAI model, to process and interpret a dataset of research grant documents, with a specific focus on immunology.

The first line of code defines the question of interest: "what are major areas of immunology research?" This is the question that the chain asks of each document in the grants dataset.

The second line of code creates a QA chain with LangChain's `OpenAI` wrapper. Because no model name is passed, the wrapper's default completion model is used rather than the gpt-3.5-turbo chat model initialized earlier. The chain is of the "map_reduce" type, which means it employs a two-step process: it first 'maps' or processes each document independently, and then 'reduces' or combines the results into a single output.

The final line of code runs the QA chain, using as input the documents loaded from the grants DataFrame. The 'load' method of the 'grants_loader' object is used to load these documents. The chain applies the question defined earlier to each document and then combines the relevant information into one answer.

In [None]:
grants_query = "what are major areas of immunology research?"

chain = load_qa_chain(llm=OpenAI(), chain_type="map_reduce")
chain.run(input_documents=grants_loader.load(), question=grants_query)
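
The map-reduce pattern itself does not depend on a language model. In the sketch below, simple keyword functions stand in for the two model calls so that the flow can be run and inspected locally. The abstracts, the keyword list, and the function names are all made up for illustration.

```python
def map_reduce(documents, map_fn, reduce_fn):
    """Apply map_fn to each document independently, then combine the partial results."""
    partial_results = [map_fn(doc) for doc in documents]
    return reduce_fn(partial_results)


def extract_keywords(doc):
    """Map step: stand-in for asking the model about a single document."""
    known_areas = ["vaccine", "autoimmunity", "cancer"]
    return [area for area in known_areas if area in doc.lower()]


def merge_keywords(partials):
    """Reduce step: stand-in for asking the model to combine the partial answers."""
    merged = set()
    for keywords in partials:
        merged.update(keywords)
    return sorted(merged)


toy_abstracts = [
    "A vaccine trial for influenza.",
    "Cancer immunotherapy and vaccine adjuvants.",
    "Mechanisms of autoimmunity in lupus.",
]
areas = map_reduce(toy_abstracts, extract_keywords, merge_keywords)
# areas == ['autoimmunity', 'cancer', 'vaccine']
```

Since each document is processed on its own, this approach can handle collections too large for a single prompt, at the cost of one model call per document plus the combining step.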

In [None]:
grants_query = "You are an expert in immunology who is writing a grant to fund cancer immunology research. Please provide an introduction section of the grant proposal based on the documents given."
chain.run(input_documents=grants_loader.load(), question=grants_query)

In [None]:
# try out new chain - https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_text_generation.html
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
prompt_template = """Use the context below to write a 200-500 word introduction to a grant proposal about the topic below:
    Context: {context}
    Topic: {topic}
    Grant:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "topic"]
)

llm = OpenAI(temperature=0)

chain = LLMChain(llm=llm, prompt=PROMPT)
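
To preview what the model receives, the same template can be filled in with Python's built-in `str.format`, which uses the same curly-brace placeholders. This is a sketch for inspection only: the context string is a made-up placeholder, not text from the grants dataset.

```python
template = """Use the context below to write a 200-500 word introduction to a grant proposal about the topic below:
    Context: {context}
    Topic: {topic}
    Grant:"""

# Placeholder values; in the notebook the context comes from a retrieved document.
filled = template.format(
    context="(text of one retrieved document)",
    topic="cancer immunology research",
)
print(filled)
```

Each retrieved document fills the template separately, so a search that returns four documents produces four prompts and four generated introductions.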

In [None]:
def generate_grant_section(topic):
    docs = grant_search.similarity_search(topic, k=4)
    inputs = [{"context": doc.page_content, "topic": topic} for doc in docs]
    print(chain.apply(inputs))

In [None]:
generate_grant_section("cancer immunology research")

[{'text': ' National Institutes of Health\n\nCancer immunology research is a rapidly growing field of study that has the potential to revolutionize the way we treat cancer. The National Institutes of Health (NIH) has recognized the importance of this research and has provided funding for the 5R01AI106002-02 grant to support further research in this area.\n\nThe goal of this grant is to advance our understanding of the role of the immune system in cancer and to develop new treatments that can be used to fight cancer. This grant will fund research into the mechanisms of cancer immunology, including the development of new therapies and treatments. It will also support research into the development of new diagnostic tools and biomarkers that can be used to identify and monitor cancer.\n\nThe research funded by this grant will be conducted by a team of scientists from a variety of disciplines, including immunology, oncology, and molecular biology. The team will work together to develop new treatments and therapies that can be used to fight cancer. The team will also work to develop new diagnostic tools and biomarkers that can be used to identify and monitor cancer.\n\nThis grant will provide the necessary resources to support the research team in their efforts to advance our understanding of cancer immunology and to develop'}, {'text': ' National Institutes of Health\n\nCancer immunology research is a rapidly growing field of study that has the potential to revolutionize the way we treat cancer. The National Institutes of Health (NIH) has recognized the importance of this research and has provided funding for the 5R01AI106002-02 grant to support further research in this area.\n\nThe goal of this grant is to advance our understanding of the role of the immune system in cancer and to develop new treatments that can be used to fight cancer. This grant will fund research into the mechanisms of cancer immunology, including the development of new therapies and treatments. 
It will also support research into the development of new diagnostic tools and biomarkers that can be used to identify and monitor cancer.\n\nThe research funded by this grant will be conducted by a team of scientists from a variety of disciplines, including immunology, oncology, and molecular biology. The team will work together to develop new treatments and therapies that can be used to fight cancer. The team will also work to develop new diagnostic tools and biomarkers that can be used to identify and monitor cancer.\n\nThis grant will provide the necessary resources to support the research team in their efforts to advance our understanding of cancer immunology and to develop'}, {'text': " National Institutes of Health\n\nCancer immunology research is a rapidly growing field of study that has the potential to revolutionize the way we treat cancer. The National Institutes of Health (NIH) has recognized the importance of this research and has provided funding for the 5R01AI106002-02 grant to support further research in this area.\n\nThis grant proposal seeks to build on the existing research in cancer immunology and to develop new treatments and therapies that can be used to fight cancer. The goal of this proposal is to develop a comprehensive understanding of the immune system's role in cancer and to develop new treatments that can be used to fight cancer.\n\nThe proposal will focus on three main areas of research: understanding the role of the immune system in cancer, developing new treatments and therapies, and exploring the potential of immunotherapy. 
The research will be conducted in collaboration with leading experts in the field of cancer immunology and will involve a combination of laboratory experiments, clinical trials, and epidemiological studies.\n\nThe research proposed in this grant proposal will provide a comprehensive understanding of the immune system's role in cancer and will lead to the development of new treatments and therapies that can be used to fight cancer. This research will also provide insight into the potential of"}, {'text': ' National Institutes of Health\n\nCancer immunology research is a rapidly growing field of study that has the potential to revolutionize the way we treat cancer. The National Institutes of Health (NIH) has recognized the importance of this research and has provided funding for the 5R01AI106002-02 grant to support further research in this area.\n\nThe goal of this grant is to advance our understanding of the role of the immune system in cancer and to develop new treatments that can be used to fight cancer. This grant will fund research into the mechanisms of cancer immunology, including the development of new therapies and treatments. It will also support research into the development of new diagnostic tools and biomarkers that can be used to identify and monitor cancer.\n\nThe research funded by this grant will be conducted by a team of scientists from a variety of disciplines, including immunology, oncology, and molecular biology. The team will work together to develop new treatments and therapies that can be used to fight cancer. The team will also work to develop new diagnostic tools and biomarkers that can be used to identify and monitor cancer.\n\nThis grant will provide the necessary resources to support the research team in their efforts to advance our understanding of cancer immunology and to develop'}]
