# Augmenting OpenAI with Specialized Research Papers

ChatGPT (GPT-3.5) accepts at most 4096 tokens per query, so many papers are too long to pass in full. Each token also has an associated cost when using the OpenAI API.

This notebook does the following:
* Task 1: breaks the text from one or more papers into smaller chunks; creates vector embeddings for each chunk; and stores the text and vector indices in an online database (and saves the embeddings to a local pickle file)
* Task 2: conducts queries, such as a similarity search, between a user input text/question and the database to find related chunks of text
* Task 3: passes the related chunks of text to ChatGPT to augment its knowledge and to use this information in user-interactions via natural language chat

There are probably fancier ways to do this, and the number of options is growing quickly. This notebook is a fairly straightforward presentation of the steps to augment ChatGPT with specialized knowledge.

Sarah T. Stewart<br>
April 9, 2023


## 1. Prepare accounts and python environment

### Make an OpenAI account
You need an OpenAI API key. Go to https://platform.openai.com and make an account and generate an API key. 

OpenAI charges by the token for each API transaction.

If you have a new account, you will have a small starting balance. When you spend it all, you will need to set up billing to continue to use it. (or set up another account.....)

Documentation for how to query the API: https://platform.openai.com/docs/api-reference

This notebook uses OpenAI to generate indices for vector embeddings of the source text (task 1) and of the questions or sample text to match against the source text (task 2).

Currently, ChatGPT-3.5 has a limit of 4096 tokens in any query (this includes history for conversations), so the papers need to be broken into smaller chunks to be able to pass small amounts (<4096 tokens) at a time to augment ChatGPT's knowledge base.

### Make a Pinecone account to store your vector database
You will need to store the text and embedding indices in a database for search and queries. There are multiple options. I am using https://pinecone.io. A free account will let you create one index database.

Create an index (this is going to hold your database):
* Give your index a name
* metric = cosine # simplest way to find matching text using vector embedding indices
* Pod Type = p1 # optimized for speed
* Dimensions = 1536 # this matches the OpenAI model 'text-embedding-ada-002'

Docs are here: https://docs.pinecone.io/docs/overview

This notebook uses pinecone functions to conduct a similarity search (task 2). The index holds all of the indices from task 1, the corresponding source text, and other metadata. The similarity search finds chunks of text in the database that have a high match to your search query text.

Description of a vector database: https://www.pinecone.io/learn/vector-database/
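
To make the similarity search less of a black box, here is a minimal pure-Python sketch of what a cosine-metric search does: score every stored vector against the query vector and keep the best few. The names `cosine_similarity`, `top_k`, and `toy_entries` are made up for this illustration, and the 3-dimensional vectors are toy values. Pinecone does this at scale with its own indexing; this is not its implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, entries, k=5):
    """Return the k best (score, id) pairs; entries is a list of (id, vector)."""
    scored = [(cosine_similarity(query_vector, vec), entry_id)
              for entry_id, vec in entries]
    return sorted(scored, reverse=True)[:k]

# toy 3-dimensional "embeddings"; the real ones have 1536 dimensions
toy_entries = [
    ('1.0', [1.0, 0.0, 0.0]),
    ('1.1', [0.7, 0.7, 0.0]),
    ('2.0', [0.0, 0.0, 1.0]),
]
best = top_k([1.0, 0.1, 0.0], toy_entries, k=2)
print(best)
```

A score near 1 means the two vectors point in nearly the same direction, which is why the metric works regardless of the length of the text that produced them.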

### Create a ~/.env file
Store the following information in your .env file:
```
export OPENAI_API_KEY='long-string'
export PINECONE_API_KEY='long-string'
# The environment will be randomly assigned by pinecone; look in your index settings for 'environment'
export PINECONE_API_ENV='us-east4-gcp' 
```
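
Before running the rest of the notebook, it can help to confirm that all three settings are visible to Python. The sketch below assumes the variable names listed above; `missing_keys` is a hypothetical helper written for this note, not part of python-dotenv. It reports only the names that are missing and never prints the key values.

```python
import os

REQUIRED_KEYS = ('OPENAI_API_KEY', 'PINECONE_API_KEY', 'PINECONE_API_ENV')

def missing_keys(env=None):
    """Return the names of required settings that are absent or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]

# run this after load_dotenv(); an empty list means everything was found
print('Missing settings:', missing_keys() or 'none')
```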

### Langchain and Tiktoken
This notebook uses langchain, a very new python package that is only available on pip. It has the ability to chain together content to provide context or to pass information between tasks. It may or may not survive the GPT user package wars. Others will have similar capabilities.

Langchain is used to form reasonable chunks of text from the original papers. Tiktoken is a package to turn the original text into tokens. OpenAI uses tokens as the unit of text.

Docs are here https://python.langchain.com/


In [1]:
# some pages that inspired this notebook
# https://www.mlq.ai/gpt-3-enabled-research-assistant-langchain-pinecone/
# https://github.com/pinecone-io/examples/blob/master/analytics-and-ml/data-mining/sentiment-analysis/sentiment-analysis.ipynb
# https://betterprogramming.pub/openais-embedding-model-with-vector-database-b69014f04433
#
# for this notebook, create a python environment with these packages
# note: langchain not available on conda at this time, and I use conda so I have an ugly mixed environment
# conda create -n langchain python=3.9
# conda install -c conda-forge python-dotenv
# conda install -c conda-forge jupyterlab
# conda install -c anaconda ipywidgets
# pip install langchain
# pip install unstructured
# pip install 'unstructured[local-inference]'
# pip install pinecone-client
# pip install openai
# pip install tiktoken

In [35]:
# Load python packages
# to load .env file
import os
from dotenv import load_dotenv
# langchain packages to work with the text
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# package to turn text into tokens
import tiktoken
# openai for creating vector embeddings
import openai
# pinecone database client
import pinecone
# this is a status bar; used in a lot of examples from langchain; needs to load twice for some reason
# from tqdm.autonotebook import tqdm
# langchain packages to work with vector embeddings
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
# pickle used to save vector embeddings for a single paper
import pickle
# check if file exists
from os.path import exists
from copy import deepcopy
# for conversations with ChatGPT
from langchain import OpenAI, ConversationChain

In [5]:
# Load your API keys from ~/.env file
# protect your personal keys because others could spend your $ or corrupt your database
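load_dotenv()
#print(os.environ) # to see what is there; don't share your API keys
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
PINECONE_API_KEY = os.environ['PINECONE_API_KEY']
PINECONE_API_ENV = os.environ['PINECONE_API_ENV']
# pass the key to the openai package explicitly, in case openai was imported before the .env file was loaded
openai.api_key = OPENAI_API_KEY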


## Choose OpenAI model

The world is changing fast and the model will change tomorrow. Today, the model named 'text-embedding-ada-002' is recommended for calculating vector embeddings of a text.

See https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
```
MODEL NAME              TOKENIZER    MAX INPUT TOKENS  OUTPUT DIMENSIONS
text-embedding-ada-002  cl100k_base  8191              1536
```

Tiktoken needs the tokenizer name. When you set up the pinecone index, its dimensions need to match the output dimensions of this model. You need a separate index for each database that uses a different number of dimensions.

The chunk size determines the length of text (in units of tokens) that is stored in each entry of the pinecone database. We will try to split the text intelligently, but sometimes sentences are broken in the middle, so the overlap value provides some protection for poor token splits.
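
The effect of the overlap value can be seen with a simplified sliding-window sketch. This is an illustration only: `overlapping_windows` is a hypothetical helper, the integers stand in for tokens, and the langchain splitter builds its overlap from pieces cut at separators rather than from fixed windows.

```python
def overlapping_windows(tokens, chunk_size, overlap):
    """Cut a token list into windows where neighbours share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError('overlap must be non-negative and smaller than chunk_size')
    if not tokens:
        return []
    step = chunk_size - overlap
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
        start += step
    return windows

# ten stand-in tokens, windows of 4 with 1 shared token
windows = overlapping_windows(list(range(10)), chunk_size=4, overlap=1)
print(windows)
```

With a chunk size of 800 and an overlap of 30, each window in this simplified picture would advance by 770 tokens, so a sentence cut at one chunk boundary has a chance of appearing whole at the start of the next chunk.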

In the tiktoken cell below, the command
```
separators=['\n\n', '\n', ' ', '']
```
first tries to split the text at two new lines, then one new line, then spaces, then anywhere to chunk the source text into manageable lengths.
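
The idea behind that separator list can be sketched in a few lines of plain Python. This is a simplified stand-in written for illustration, not langchain's implementation: `recursive_split` is a hypothetical function, it measures length in characters rather than tokens, and it leaves out the overlap.

```python
def recursive_split(text, separators=('\n\n', '\n', ' ', ''), max_len=20):
    """Split text into pieces of at most max_len characters,
    preferring the earliest separator in the list that works."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == '':
        # last resort: cut anywhere
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, max_len)
    chunks, current = [], ''
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ''
        if len(piece) <= max_len:
            current = piece
        else:
            # this piece is still too long: try the finer separators
            chunks.extend(recursive_split(piece, rest, max_len))
    if current:
        chunks.append(current)
    return chunks

sample = 'para one is here\n\npara two is a bit longer than the limit allows\n\nshort'
pieces = recursive_split(sample, max_len=20)
print(pieces)
```

The short paragraphs survive intact, and only the long middle paragraph is cut at spaces, which is the behavior the separator order is meant to produce.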

In [6]:
OPENAI_MODEL = 'text-embedding-ada-002'
OPENAI_ENCODING = 'cl100k_base'
OPENAI_CHUNKSIZE = 800
OPENAI_CHUNKOVERLAP = 30

## Set up Tiktoken 

In [7]:
# ref
# https://github.com/pinecone-io/examples/blob/master/generation/langchain/handbook/xx-langchain-chunking.ipynb

# decide how to chunk the source text
# 'cl100k_base' is the encoding used by the 'text-embedding-ada-002' model
tokenizer = tiktoken.get_encoding(OPENAI_ENCODING)
# look up the encoding from the model name as a cross-check (the result is not stored)
tiktoken.encoding_for_model(OPENAI_MODEL)

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

# make a function to split your text into chunks that fit the OpenAI model
# RecursiveCharacterTextSplitter is a langchain function
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=OPENAI_CHUNKSIZE, 
    chunk_overlap=OPENAI_CHUNKOVERLAP,
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']) 


## Prepare your data = research papers

You want to provide specialized information to ChatGPT so that you can use its natural language processing capabilities in a specific area.

To augment ChatGPT's knowledge, you need to provide high quality source material. The source material needs to be plain text, e.g., what you can type into the web interface, although here we will use python commands to pass the text around.

Unfortunately, high quality plain text can require some effort.

You have options:
* Use one of various PDF readers to scrape the file for plain text. This will be a mixed bag depending on the complexity of your source.
    * You can use a PDF editor to delete extraneous items in the PDF file. The problem here is that PDF does not necessarily group words on a page in sentence order, so the language processing can be corrupted. 
* Use a manuscript version of the paper with minimal extraneous things in the file. Again, remove extraneous material.
* Use the html version of the paper to gather plain text. This is what I did for recent papers.
    * Go to the publisher web site html version
    * I used 'Reader view' plug-in in my browser to remove all the extraneous material from the web page
    * I cut and pasted this text into a plain text document.
    * I removed the figures and references (references are not natural language).
    * Save the plain text document with utf8 encoding.
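
Part of the cleanup in the last option can be scripted. The sketch below is only one possible approach and makes assumptions that will not hold for every paper: the reference list is assumed to start at a line that reads exactly "References", and `clean_paper_text` is a hypothetical helper name.

```python
import re

def clean_paper_text(raw, stop_heading='References'):
    """Drop everything from the stop heading onward and tidy the whitespace."""
    kept = []
    for line in raw.splitlines():
        if line.strip().lower() == stop_heading.lower():
            break  # assume the rest of the file is the reference list
        kept.append(line.rstrip())
    text = '\n'.join(kept)
    # collapse runs of blank lines so paragraphs are separated by one blank line
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip() + '\n'

raw_text = 'Title\n\n\n\nBody text.  \nReferences\nSmith 2000\n'
cleaned = clean_paper_text(raw_text)
print(repr(cleaned))

# to save the result with utf8 encoding:
# with open('./papers/Lock2018.txt', 'w', encoding='utf-8') as f:
#     f.write(cleaned)
```

Keeping paragraphs separated by a single blank line matters later, because the text splitter tries to break at double newlines first.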

## Put one paper into the pinecone database

### A. Open your connections to the pinecone database and openai
My index is called 'papers'.


In [8]:
# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to api key in console
)
active_indexes = pinecone.list_indexes()
list(active_indexes)

['papers']

In [9]:
# set this to your pinecone index name
INDEX_NAME = 'papers'
# select the pinecone index to use
index = pinecone.Index(INDEX_NAME)

In [10]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)


### B. Read in the text; use bibtex for metadata

In [11]:
pathtopapers = './papers/'
# loader = OnlinePDFLoader("https://--full-path--.pdf")
# loader = UnstructuredPDFLoader("./papers/Lock2018.pdf")
loader = TextLoader(pathtopapers+'Lock2018.txt')

# remember the parameters used for the vector database in this metadata
paper_meta = [{
    'key' : 'Lock2018',
    'doi' : '10.1002/2017JE005333',
    'bibtex' : "@article{Lock2018,\
        author = {Lock, Simon J. and Stewart, Sarah T. and Petaev, Michail I. and Leinhardt, Zo{\"{e}} and Mace, Mia T. and Jacobsen, Stein B. and Cuk, Matija},\
        doi = {10.1002/2017JE005333},\
        journal = {Journal of Geophysical Research: Planets},\
        keywords = {angular momentum,giant impact,isotopic equilibration,lunar accretion,lunar chemistry,volatile elements},\
        month = {apr},\
        number = {4},\
        pages = {910--951},\
        title = {{The Origin of the Moon Within a Terrestrial Synestia}},\
        volume = {123},\
        year = {2018}\
        }",\
    'openai_model' : OPENAI_MODEL,\
    'openai_encoding' : OPENAI_ENCODING,\
    'openai_chunksize' : str(OPENAI_CHUNKSIZE),\
    'openai_chunkoverlap' : str(OPENAI_CHUNKOVERLAP),\
    'text' : 'text'
    }]
          

In [12]:
paper_meta[0]

{'key': 'Lock2018',
 'doi': '10.1002/2017JE005333',
 'bibtex': '@article{Lock2018,        author = {Lock, Simon J. and Stewart, Sarah T. and Petaev, Michail I. and Leinhardt, Zo{"{e}} and Mace, Mia T. and Jacobsen, Stein B. and Cuk, Matija},        doi = {10.1002/2017JE005333},        journal = {Journal of Geophysical Research: Planets},        keywords = {angular momentum,giant impact,isotopic equilibration,lunar accretion,lunar chemistry,volatile elements},        month = {apr},        number = {4},        pages = {910--951},        title = {{The Origin of the Moon Within a Terrestrial Synestia}},        volume = {123},        year = {2018}        }',
 'openai_model': 'text-embedding-ada-002',
 'openai_encoding': 'cl100k_base',
 'openai_chunksize': '800',
 'openai_chunkoverlap': '30',
 'text': 'text'}

In [13]:
data = loader.load() # this is a langchain document object
print (f'You have {len(data)} document(s) in your data')
print (f'There are {len(data[0].page_content)} characters in your document')

You have 1 document(s) in your data
There are 156440 characters in your document


In [14]:
# uncomment to see the document
# data
# the text is in
# data[0].page_content
# metadata is the file name and path
# data[0].metadata

### C. Chunk your data up into smaller documents

In [15]:
texts = text_splitter.split_documents(data)
print (f'Now you have {len(texts)} documents')

Now you have 59 documents


In [16]:
# look at the first chunk
texts[0]

Document(page_content="Title\nThe Origin of the Moon Within a Terrestrial Synestia\nAuthors\nSimon J. Lock and Sarah T. Stewart\nAbstract\nThe giant impact hypothesis remains the leading theory for lunar origin. However, current models struggle to explain the Moon's composition and isotopic similarity with Earth. Here we present a new lunar origin model. High-energy, high-angular-momentum giant impacts can create a post-impact structure that exceeds the corotation limit, which defines the hottest thermal state and angular momentum possible for a corotating body. In a typical super-corotation-limit body, traditional definitions of mantle, atmosphere, and disk are not appropriate, and the body forms a new type of planetary structure, named a synestia. Using simulations of cooling synestias combined with dynamic, thermodynamic, and geochemical calculations, we show that satellite formation from a synestia can produce the main features of our Moon. We find that cooling drives mixing of the

### D. Create and save vector embeddings for each chunk

This step costs $ because it calls the OpenAI API to create a vector embedding for each chunk.

The database entry is uploaded with the pinecone upsert function:
```
to_upsert = zip(idlist, embedslist, metalist)
# upsert to Pinecone
index.upsert(vectors=list(to_upsert))
```

Each vector has the format
```
[id, indices, metadata]
```

You can batch upload database entries, so all the chunks from a paper can be uploaded at once.
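
For larger collections, it may be safer to send the vectors in several smaller requests. The sketch below shows the batching logic with made-up data; `batched` is a hypothetical helper, and the default batch size of 100 is an assumption to check against the current pinecone documentation, not a documented limit quoted from it.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of items, each with at most batch_size entries."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# made-up ids, 2-dimensional "embeddings", and metadata for five chunks
ids = ['1.0', '1.1', '1.2', '1.3', '1.4']
embeds = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
metas = [{'key': 'Lock2018', 'text': f'chunk {i}'} for i in range(5)]

# each entry has the form (id, indices, metadata)
vectors = list(zip(ids, embeds, metas))
batch_sizes = [len(batch) for batch in batched(vectors, batch_size=2)]
print(batch_sizes)

# with a real index, the loop would be:
# for batch in batched(vectors):
#     index.upsert(vectors=batch)
```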

In [17]:
# first create big lists of the components to pass to pinecone

# format id numbers as paper number.chunk number
# this is for convenience to find paper chunks in the pinecone web browser interface later
# also include the text content in the metadata
paperno = 1
idlist = []
contentlist = []
metalist = []
for i in range(len(texts)):
    idlist.append(str(paperno)+'.'+str(i))
    contentlist.append(texts[i].page_content)
    # put the source text for this chunk in the metadata
    # print(i,texts[i].page_content[0:10])
    paper_meta[0]['text']=''+texts[i].page_content[:]
    temp = deepcopy(paper_meta[0]) # otherwise all the metadata points to the last text chunk
    metalist.append(temp) 
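
The copy in the loop above matters because appending the same dictionary object repeatedly stores many references to one object. A small self-contained demonstration, with made-up chunk text:

```python
from copy import deepcopy

template = {'key': 'Lock2018', 'text': ''}
aliased, copied = [], []
for chunk in ['first chunk', 'second chunk']:
    template['text'] = chunk
    aliased.append(template)           # same object appended every time
    copied.append(deepcopy(template))  # independent snapshot of this chunk

print([m['text'] for m in aliased])
print([m['text'] for m in copied])
```

Every entry in `aliased` shows the last chunk, while `copied` keeps one chunk per entry. Because the metadata here is a flat dictionary of strings, a shallow copy such as `dict(template)` would also work; `deepcopy` is the safer habit if nested values are added later.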

In [18]:
# check the metalist to make sure the chunks are copied correctly into the list
metalist[0]

{'key': 'Lock2018',
 'doi': '10.1002/2017JE005333',
 'bibtex': '@article{Lock2018,        author = {Lock, Simon J. and Stewart, Sarah T. and Petaev, Michail I. and Leinhardt, Zo{"{e}} and Mace, Mia T. and Jacobsen, Stein B. and Cuk, Matija},        doi = {10.1002/2017JE005333},        journal = {Journal of Geophysical Research: Planets},        keywords = {angular momentum,giant impact,isotopic equilibration,lunar accretion,lunar chemistry,volatile elements},        month = {apr},        number = {4},        pages = {910--951},        title = {{The Origin of the Moon Within a Terrestrial Synestia}},        volume = {123},        year = {2018}        }',
 'openai_model': 'text-embedding-ada-002',
 'openai_encoding': 'cl100k_base',
 'openai_chunksize': '800',
 'openai_chunkoverlap': '30',
 'text': "Title\nThe Origin of the Moon Within a Terrestrial Synestia\nAuthors\nSimon J. Lock and Sarah T. Stewart\nAbstract\nThe giant impact hypothesis remains the leading theory for lunar origin. How

In [19]:
# file to save/load openai embeddings for this paper
openai_file = pathtopapers+paper_meta[0]['key']+'.pickle'

# check if openai results already exist for this file
if exists(openai_file):
    # Open the file in binary mode
    with open(openai_file, 'rb') as file:
        # Call load method to deserialize
        res = pickle.load(file)
    print('Loaded previously saved embeddings: ',openai_file)
else:
    # this costs $
    # take the list of text chunks and generate vector embeddings
    # res is an openai object
    res = openai.Embedding.create(input=contentlist, engine=OPENAI_MODEL)
    # save the openai results because you don't want to spend the $ to make it again if you don't have to
    # Open a file and use dump()
    with open(openai_file, 'wb') as file:
        # A new file will be created
        pickle.dump(res, file)


Loaded previously saved embeddings:  ./papers/Lock2018.pickle
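
The load-or-create pattern in the cell above can be wrapped in a small reusable function. This is a sketch, and `load_or_compute` is a hypothetical helper name; the `compute` argument is any function that takes no arguments and returns the object to cache.

```python
import pickle
from pathlib import Path

def load_or_compute(cache_path, compute):
    """Return the pickled result at cache_path, creating it with compute() if needed."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open('rb') as file:
            return pickle.load(file)
    result = compute()  # the expensive step, e.g. a paid API call
    with cache_path.open('wb') as file:
        pickle.dump(result, file)
    return result

# possible use with the variables defined in this notebook:
# res = load_or_compute(openai_file,
#                       lambda: openai.Embedding.create(input=contentlist, engine=OPENAI_MODEL))
```

Only load pickle files that you created yourself, because unpickling a file can execute arbitrary code. Delete the cached file if you change the chunk size or the model, otherwise the stale embeddings will be reused.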


In [20]:
# look at openai object
#res

In [21]:
# extract the vectors into this list to pass to pinecone
# this is the openai list of vector indices for each chunk
# it is paired with the idlist and metalist generated above
embedslist = [record['embedding'] for record in res['data']]
len(embedslist)

59

### E. Upload vectors to Pinecone database

You do not need to do this again if this paper is already in the database.

In [22]:
# zip all the chunks into one big vector list for pinecone
to_upsert = zip(idlist, embedslist, metalist)
# upsert to Pinecone all the vectors for this one paper
index.upsert(vectors=list(to_upsert))

{'upserted_count': 59}

## Repeat for additional papers to put in the database

This process can be turned into a list of papers to iterate over and upload to the database.
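
One way to structure such a loop is sketched below. The function names are hypothetical, and the three steps that need outside services (splitting, embedding, uploading) are passed in as plain functions so that the logic can be tried with stand-ins first. In this notebook they would wrap `text_splitter`, `openai.Embedding.create`, and `index.upsert`.

```python
from copy import deepcopy

def build_records(paperno, paper_meta, chunks):
    """Make ids of the form 'paper.chunk' and one metadata dict per chunk."""
    ids = [f'{paperno}.{i}' for i in range(len(chunks))]
    metas = []
    for chunk in chunks:
        meta = deepcopy(paper_meta)  # leave the caller's dict untouched
        meta['text'] = chunk
        metas.append(meta)
    return ids, metas

def upload_papers(papers, split, embed, upsert):
    """papers is a list of (metadata dict, full text) pairs.
    split(text) -> list of chunks; embed(chunks) -> list of vectors;
    upsert(vectors) stores a list of (id, vector, metadata) tuples."""
    total = 0
    for paperno, (meta, text) in enumerate(papers, start=1):
        chunks = split(text)
        ids, metas = build_records(paperno, meta, chunks)
        upsert(list(zip(ids, embed(chunks), metas)))
        total += len(chunks)
    return total

# dry run with stand-ins: split on spaces, fake 1-dimensional vectors, store in a list
stored = []
demo_papers = [({'key': 'A'}, 'aa bb'), ({'key': 'B'}, 'cc')]
count = upload_papers(demo_papers,
                      split=str.split,
                      embed=lambda chunks: [[float(len(c))] for c in chunks],
                      upsert=stored.extend)
print(count, [record[0] for record in stored])
```

The cells below keep the explicit copy-and-paste form so that each paper's metadata and output remain visible.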

Paper 2:

In [23]:
# will look for the text in ./papers/[key].txt
paperno = 2
paper_meta = [{
    'key' : 'Lock2017',
    'doi' : '10.1002/2016JE005239',
    'bibtex' : "@article{Lock2017,\
        author = {Lock, Simon J. and Stewart, Sarah T.},\
        doi = {10.1002/2016JE005239},\
        journal = {Journal of Geophysical Research: Planets},\
        keywords = {exoplanets,formation,impacts,rotation,structure,synestias},\
        month = {may},\
        number = {5},\
        pages = {950--982},\
        title = {{The structure of terrestrial bodies: Impact heating, corotation limits, and synestias}},\
        volume = {122},\
        year = {2017}\
        }",
    'openai_model' : OPENAI_MODEL,
    'openai_encoding' : OPENAI_ENCODING,
    'openai_chunksize' : str(OPENAI_CHUNKSIZE),
    'openai_chunkoverlap' : str(OPENAI_CHUNKOVERLAP),
    'text' : 'text'
    }] 

In [24]:
loader = TextLoader(pathtopapers+paper_meta[0]['key']+'.txt')
data = loader.load() # this is a langchain document object
print (f'You have {len(data)} document(s) in your data')
print (f'There are {len(data[0].page_content)} characters in your original document')
texts = text_splitter.split_documents(data)
print (f'Now you have {len(texts)} document chunks')
idlist = []
contentlist = []
metalist = []
for i in range(len(texts)):
    idlist.append(str(paperno)+'.'+str(i))
    contentlist.append(texts[i].page_content)
    # put the source text for this chunk in the metadata
    # print(i,texts[i].page_content[0:10])
    paper_meta[0]['text']=''+texts[i].page_content[:]
    temp = deepcopy(paper_meta[0]) # otherwise all the metadata points to the last text chunk
    metalist.append(temp)
    
# file to save/load openai embeddings for this paper
openai_file = pathtopapers+paper_meta[0]['key']+'.pickle'

# check if openai results already exist for this file
if exists(openai_file):
    # Open the file in binary mode
    with open(openai_file, 'rb') as file:
        # Call load method to deserialize
        res = pickle.load(file)
    print('Loaded previously saved embeddings: ',openai_file)
else:
    # this costs $
    # take the list of text chunks and generate vector embeddings
    # res is an openai object
    res = openai.Embedding.create(input=contentlist, engine=OPENAI_MODEL)
    # save the openai results because you don't want to spend the $ to make it again if you don't have to
    # Open a file and use dump()
    with open(openai_file, 'wb') as file:
        # A new file will be created
        pickle.dump(res, file)

embedslist = [record['embedding'] for record in res['data']]
# zip all the chunks into one big vector list for pinecone
to_upsert = zip(idlist, embedslist, metalist)
# upsert to Pinecone all the vectors for this one paper
index.upsert(vectors=list(to_upsert))

print('finished paper upload to pinecone')

You have 1 document(s) in your data
There are 120375 characters in your original document
Now you have 46 document chunks
Loaded previously saved embeddings:  ./papers/Lock2017.pickle
finished paper upload to pinecone


Paper 3:

In [25]:
# will look for the text in ./papers/[key].txt
paperno = 3
paper_meta = [{
    'key' : 'Lock2019',
    'doi' : '10.1126/sciadv.aav3746',
    'bibtex' : "@article{Lock2019,\
        author = {Lock, Simon J. and Stewart, Sarah T.},\
        doi = {10.1126/sciadv.aav3746},\
        journal = {Science Advances},\
        month = {sep},\
        number = {9},\
        pages = {eaav3746},\
        title = {{Giant impacts stochastically change the internal pressures of terrestrial planets}},\
        volume = {5},\
        year = {2019}\
        }",
    'openai_model' : OPENAI_MODEL,
    'openai_encoding' : OPENAI_ENCODING,
    'openai_chunksize' : str(OPENAI_CHUNKSIZE),
    'openai_chunkoverlap' : str(OPENAI_CHUNKOVERLAP),
    'text' : 'text'
    }] 

In [26]:
loader = TextLoader(pathtopapers+paper_meta[0]['key']+'.txt')
data = loader.load() # this is a langchain document object
print (f'You have {len(data)} document(s) in your data')
print (f'There are {len(data[0].page_content)} characters in your original document')
texts = text_splitter.split_documents(data)
print (f'Now you have {len(texts)} document chunks')
idlist = []
contentlist = []
metalist = []
for i in range(len(texts)):
    idlist.append(str(paperno)+'.'+str(i))
    contentlist.append(texts[i].page_content)
    # put the source text for this chunk in the metadata
    # print(i,texts[i].page_content[0:10])
    paper_meta[0]['text']=''+texts[i].page_content[:]
    temp = deepcopy(paper_meta[0]) # otherwise all the metadata points to the last text chunk
    metalist.append(temp)
    
# file to save/load openai embeddings for this paper
openai_file = pathtopapers+paper_meta[0]['key']+'.pickle'

# check if openai results already exist for this file
if exists(openai_file):
    # Open the file in binary mode
    with open(openai_file, 'rb') as file:
        # Call load method to deserialize
        res = pickle.load(file)
    print('Loaded previously saved embeddings: ',openai_file)
else:
    # this costs $
    # take the list of text chunks and generate vector embeddings
    # res is an openai object
    res = openai.Embedding.create(input=contentlist, engine=OPENAI_MODEL)
    # save the openai results because you don't want to spend the $ to make it again if you don't have to
    # Open a file and use dump()
    with open(openai_file, 'wb') as file:
        # A new file will be created
        pickle.dump(res, file)

embedslist = [record['embedding'] for record in res['data']]
# zip all the chunks into one big vector list for pinecone
to_upsert = zip(idlist, embedslist, metalist)
# upsert to Pinecone all the vectors for this one paper
index.upsert(vectors=list(to_upsert))

print('finished paper upload to pinecone')

You have 1 document(s) in your data
There are 61571 characters in your original document
Now you have 21 document chunks
finished paper upload to pinecone


Paper 4:

In [27]:
# will look for the text in ./papers/[key].txt
paperno = 4
paper_meta = [{
    'key' : 'Lock2020',
    'doi' : '10.1016/j.epsl.2019.115885',
    'bibtex' : "@article{Lock2020,\
        author = {Lock, Simon J. and Stewart, Sarah T. and {\'{C}}uk, Matija},\
        doi = {10.1016/j.epsl.2019.115885},\
        journal = {Earth and Planetary Science Letters},\
        keywords = {Moon,energy,figure,impact,tides},\
        month = {jan},\
        pages = {115885},\
        title = {{The energy budget and figure of Earth during recovery from the Moon-forming giant impact}},\
        volume = {530},\
        year = {2020}\
        }",
    'openai_model' : OPENAI_MODEL,
    'openai_encoding' : OPENAI_ENCODING,
    'openai_chunksize' : str(OPENAI_CHUNKSIZE),
    'openai_chunkoverlap' : str(OPENAI_CHUNKOVERLAP),
    'text' : 'text'
    }] 



In [28]:
loader = TextLoader(pathtopapers+paper_meta[0]['key']+'.txt')
data = loader.load() # this is a langchain document object
print (f'You have {len(data)} document(s) in your data')
print (f'There are {len(data[0].page_content)} characters in your original document')
texts = text_splitter.split_documents(data)
print (f'Now you have {len(texts)} document chunks')
idlist = []
contentlist = []
metalist = []
for i in range(len(texts)):
    idlist.append(str(paperno)+'.'+str(i))
    contentlist.append(texts[i].page_content)
    # put the source text for this chunk in the metadata
    # print(i,texts[i].page_content[0:10])
    paper_meta[0]['text']=''+texts[i].page_content[:]
    temp = deepcopy(paper_meta[0]) # otherwise all the metadata points to the last text chunk
    metalist.append(temp)
    
# file to save/load openai embeddings for this paper
openai_file = pathtopapers+paper_meta[0]['key']+'.pickle'

# check if openai results already exist for this file
if exists(openai_file):
    # Open the file in binary mode
    with open(openai_file, 'rb') as file:
        # Call load method to deserialize
        res = pickle.load(file)
    print('Loaded previously saved embeddings: ',openai_file)
else:
    # this costs $
    # take the list of text chunks and generate vector embeddings
    # res is an openai object
    res = openai.Embedding.create(input=contentlist, engine=OPENAI_MODEL)
    # save the openai results because you don't want to spend the $ to make it again if you don't have to
    # Open a file and use dump()
    with open(openai_file, 'wb') as file:
        # A new file will be created
        pickle.dump(res, file)

embedslist = [record['embedding'] for record in res['data']]
# zip all the chunks into one big vector list for pinecone
to_upsert = zip(idlist, embedslist, metalist)
# upsert to Pinecone all the vectors for this one paper
index.upsert(vectors=list(to_upsert))

print('finished paper upload to pinecone')

You have 1 document(s) in your data
There are 55613 characters in your original document
Now you have 21 document chunks
finished paper upload to pinecone


Paper 5:

In [29]:
# will look for the text in ./papers/[key].txt
paperno = 5
paper_meta = [{
    'key' : 'Carter2020',
    'doi' : '10.1029/2019JE006042',
    'bibtex' : "@article{Carter2020,\
        author = {Carter, P. J. and Lock, S. J. and Stewart, S. T.},\
        doi = {10.1029/2019JE006042},\
        journal = {Journal of Geophysical Research: Planets},\
        month = {jan},\
        number = {1},\
        title = {{The Energy Budgets of Giant Impacts}},\
        volume = {125},\
        year = {2020}\
        }",
    'openai_model' : OPENAI_MODEL,
    'openai_encoding' : OPENAI_ENCODING,
    'openai_chunksize' : str(OPENAI_CHUNKSIZE),
    'openai_chunkoverlap' : str(OPENAI_CHUNKOVERLAP),
    'text' : 'text'
    }] 

In [30]:
loader = TextLoader(pathtopapers+paper_meta[0]['key']+'.txt')
data = loader.load() # this is a langchain document object
print (f'You have {len(data)} document(s) in your data')
print (f'There are {len(data[0].page_content)} characters in your original document')
texts = text_splitter.split_documents(data)
print (f'Now you have {len(texts)} document chunks')
idlist = []
contentlist = []
metalist = []
for i in range(len(texts)):
    idlist.append(str(paperno)+'.'+str(i))
    contentlist.append(texts[i].page_content)
    # put the source text for this chunk in the metadata
    # print(i,texts[i].page_content[0:10])
    paper_meta[0]['text']=''+texts[i].page_content[:]
    temp = deepcopy(paper_meta[0]) # otherwise all the metadata points to the last text chunk
    metalist.append(temp)
    
# file to save/load openai embeddings for this paper
openai_file = pathtopapers+paper_meta[0]['key']+'.pickle'

# check if openai results already exist for this file
if exists(openai_file):
    # Open the file in binary mode
    with open(openai_file, 'rb') as file:
        # Call load method to deserialize
        res = pickle.load(file)
    print('Loaded previously saved embeddings: ',openai_file)
else:
    # this costs $
    # take the list of text chunks and generate vector embeddings
    # res is an openai object
    res = openai.Embedding.create(input=contentlist, engine=OPENAI_MODEL)
    # save the openai results because you don't want to spend the $ to make it again if you don't have to
    # Open a file and use dump()
    with open(openai_file, 'wb') as file:
        # A new file will be created
        pickle.dump(res, file)

embedslist = [record['embedding'] for record in res['data']]
# zip all the chunks into one big vector list for pinecone
to_upsert = zip(idlist, embedslist, metalist)
# upsert to Pinecone all the vectors for this one paper
index.upsert(vectors=list(to_upsert))

print('finished paper upload to pinecone')

You have 1 document(s) in your data
There are 60001 characters in your original document
Now you have 24 document chunks
finished paper upload to pinecone


## Now query the pinecone index for similar text

Here we ask for the 5 best matches to the query.
You can use the metadata to restrict the search to different entries (e.g., papers) in the database.
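
As a sketch of how a metadata restriction might look, the helper below builds a filter on the `key` field that this notebook stores with every chunk. `paper_filter` is a hypothetical name, and the sketch assumes that the installed pinecone client accepts a `filter` argument with `$eq` and `$in` operators as described in Pinecone's metadata filtering documentation; check the docs for your client version.

```python
def paper_filter(keys):
    """Build a metadata filter that matches chunks from the given paper keys."""
    keys = list(keys)
    if not keys:
        raise ValueError('give at least one paper key')
    if len(keys) == 1:
        return {'key': {'$eq': keys[0]}}
    return {'key': {'$in': keys}}

print(paper_filter(['Lock2018']))
print(paper_filter(['Lock2017', 'Lock2018']))

# assumed usage with the index and query vector from this notebook:
# result = index.query(queryvector, top_k=5, include_metadata=True,
#                      filter=paper_filter(['Lock2018']))
```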

In [32]:
# generate dense vector embeddings for the query
query = "What is the Earth like when it is transformed into a synestia?"
# queryvector holds the vector indices for the query text
# the create function returns an openai object; the data dict has the embedding data
queryvector = openai.Embedding.create(input=query, engine=OPENAI_MODEL)["data"][0]["embedding"]
# query pinecone for similar indices
# return the top 5 matches
# index is the variable that points to the pinecone database
# query is the pinecone function to find similar text
result = index.query(queryvector, top_k=5, include_metadata=True)


In [44]:
result

{'matches': [{'id': '1.6',
              'metadata': {'bibtex': '@article{Lock2018,        author = '
                                     '{Lock, Simon J. and Stewart, Sarah T. '
                                     'and Petaev, Michail I. and Leinhardt, '
                                     'Zo{"{e}} and Mace, Mia T. and Jacobsen, '
                                     'Stein B. and Cuk, Matija},        doi = '
                                     '{10.1002/2017JE005333},        journal = '
                                     '{Journal of Geophysical Research: '
                                     'Planets},        keywords = {angular '
                                     'momentum,giant impact,isotopic '
                                     'equilibration,lunar accretion,lunar '
                                     'chemistry,volatile elements},        '
                                     'month = {apr},        number = '
                                     '{4},        pages

In [55]:
textforchatgpt = ''
for i in range(len(result['matches'])):
    textforchatgpt=textforchatgpt+' '+(result['matches'][i]['metadata']['text'])
textforchatgpt

" Here we present a new model for lunar origin within a terrestrial synestia, an impact-generated structure with Earth mass and composition that exceeds the corotation limit (CoRoL). Synestias are formed by a range of high-energy, high-AM collisions during the giant impact stage of planet formation (Lock & Stewart, 2017, hereafter LS17). A synestia is a distinct dynamical structure compared to a planet with a condensate-dominated circumplanetary disk, and, as a result, different processes dominate the early evolution of a synestia. Note that preliminary versions of this work (e.g., Lock et al., 2016; Petaev et al., 2016) used different nomenclature than is used here. In particular, synestias were referred to as continuous mantle-atmosphere-disk (MAD) structures.\nAt present, no single calculation can fully capture the dynamics, thermodynamics, and chemistry of lunar accretion. Therefore, our approach is to link the physics and chemistry of satellite accretion from a terrestrial synesti

## Now pass this information to ChatGPT to augment its knowledge base

This step uses the ConversationChain class from the langchain package to initialize a large language model conversation.
Lower the temperature if you want less random (and often less verbose) answers.
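
To see why temperature changes the answers, it helps to look at how temperature is usually described for language-model sampling: the model's scores for candidate tokens are divided by the temperature before being turned into probabilities. The sketch below is the textbook softmax-with-temperature calculation, not OpenAI's actual implementation, and the three scores are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Temperature 0 is commonly treated as "always pick the top score",
        # which this formula cannot express directly.
        raise ValueError("temperature must be positive in this sketch")
    scaled = [score / temperature for score in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
probs_hot = softmax_with_temperature(logits, 0.9)
probs_cold = softmax_with_temperature(logits, 0.1)
print([round(p, 3) for p in probs_hot])
print([round(p, 3) for p in probs_cold])
```

At the lower temperature nearly all of the probability lands on the top-scoring token, so repeated runs give nearly the same answer. At 0.9 the alternatives keep a real chance of being picked.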

In [37]:
llm = OpenAI(temperature=0.9)
conversation = ConversationChain(llm=llm, verbose=True)

In [38]:
conversation.predict(input="I am a professor of planetary science. You are a very helpful question and answer assistant. I am going to give you text excerpt from science papers and ask you questions about the text.")

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:

H: I am a professor of planetary science. You are a very helpful question and answer assistant. I am going to give you text excerpt from science papers and ask you questions about the text.
AI:

> Finished chain.


" Hi Professor! It's nice to meet you. That sounds like a great way to get started! What type of text excerpt are you looking for?"

In [56]:
conversation.predict(input='Here is my text:'+textforchatgpt+' Now answer my question: '+query)

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: I am a professor of planetary science. You are a very helpful question and answer assistant. I am going to give you text excerpt from science papers and ask you questions about the text.
AI:  Hi Professor! It's nice to meet you. That sounds like a great way to get started! What type of text excerpt are you looking for?
Human: Here is my text: Here we present a new model for lunar origin within a terrestrial synestia, an impact-generated structure with Earth mass and composition that exceeds the corotation limit (CoRoL). Synestias are formed by a range of high-energy, high-AM collisions during the giant impact stage of planet formation (Lock 

' When Earth is transformed into a synestia, it is an impact-generated structure with Earth mass and composition that exceeds the corotation limit (CoRoL). Synestias are formed by a range of high-energy, high-AM collisions during the giant impact stage of planet formation and have a distinct dynamical structure compared to a planet with a condensate-dominated circumplanetary disk. Synestias are composed of a mixture of vapor and condensates, and have a photosphere at low pressures (10^-6 to 10^-2 bar) that radiates at a temperature of about Trad = 2,300 K. Inside the synestia, the pressure gradient force perpendicular to the rotation axis is large enough to cause condensates to fall into the structure. As the synestia cools, condensates form from the vapor at the photosphere and interact with the vapor, which perturbs their motion. The size of the droplets is determined by the balance between shear from the vapor and surface tension. Eventually, the structure cools below the CoRoL and 

## This conversation has taken up most of the token limit

To continue, you can start a new conversation and pass in new information, shorten the amount of material being passed, or ask GPT to summarize the previous material and carry only the summary forward.
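
One way to shorten the material is to estimate the token cost of each chunk and stop adding chunks once a budget is reached. This is a rough sketch, not an exact count: it assumes about 4 characters per token, a common rule of thumb for English text, and an exact count would need a real tokenizer such as the tiktoken package. The function names, the 500-token reserve for the answer, and the sample chunks are all made up for illustration.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate (assumes ~4 characters per token)."""
    if not text:
        return 0
    return max(1, len(text) // chars_per_token)


def fit_chunks_to_budget(chunks, question, token_limit=4096, reserve_for_answer=500):
    """Keep chunks, in order, until the estimated token budget is used up.

    The budget is the model limit minus room for the answer and the question.
    Conversation history is not counted here; subtract it too if it matters.
    """
    budget = token_limit - reserve_for_answer - estimate_tokens(question)
    kept = []
    used = 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return kept, used


# Made-up chunks of 4000, 4000, 8000 and 4000 characters
chunks = ['a' * 4000, 'b' * 4000, 'c' * 8000, 'd' * 4000]
kept, used = fit_chunks_to_budget(chunks, question='x' * 40)
print(len(kept), used)
```

Stopping at the first chunk that does not fit keeps the best-ranked matches, since the similarity search returns them in order of relevance.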

## Ask more questions

The cells below ask similar questions at temperature = 0.9 and then at temperature = 0.0 so the answers can be compared.

In [62]:
query = "When does a synestia turn back into a planet?"
llm = OpenAI(temperature=0.9)
conversation2 = ConversationChain(llm=llm, verbose=True)
conversation2.predict(input="I am a professor of planetary science. \
You are a very helpful question and answer assistant. \
I am going to give you text excerpt from science papers and ask you questions about the text.")

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:

H: I am a professor of planetary science. You are a very helpful question and answer assistant. I am going to give you text excerpt from science papers and ask you questions about the text.
AI:

> Finished chain.


' That sounds like a great challenge! Please provide me with the text extract and your questions, and I will do my best to answer them.'

In [None]:
# Embed the question, then fetch the 4 most similar chunks from the index
queryvector = openai.Embedding.create(input=query, engine=OPENAI_MODEL)["data"][0]["embedding"]
result = index.query(queryvector, top_k=4, include_metadata=True)
# Concatenate the matched chunk text to pass to the conversation
textforchatgpt = ''
for match in result['matches']:
    textforchatgpt += ' ' + match['metadata']['text']
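
Since the embed, query, and concatenate steps are repeated for every new question, they can be wrapped in one function. This is a sketch under assumptions: retrieve_context is a made-up name, and the embedding function and index are passed in as arguments so the example can run offline with the stand-in FakeIndex class, which is not part of Pinecone.

```python
def retrieve_context(query, embed, index, top_k=4):
    """Embed the query, search the index, and join the matched chunk text.

    embed: any callable that maps text to a vector.
    index: any object with a query(vector, top_k=..., include_metadata=True)
           method that returns {'matches': [...]}, as in the cell above.
    """
    queryvector = embed(query)
    result = index.query(queryvector, top_k=top_k, include_metadata=True)
    return ' '.join(match['metadata']['text'] for match in result['matches'])


class FakeIndex:
    """Offline stand-in for a vector index, for illustration only."""

    def query(self, vector, top_k=4, include_metadata=True):
        return {'matches': [{'metadata': {'text': 'chunk %d' % i}}
                            for i in range(top_k)]}


context = retrieve_context('When does a synestia turn back into a planet?',
                           embed=lambda text: [0.0, 0.0],
                           index=FakeIndex(),
                           top_k=2)
print(context)
```

With the notebook's objects, embed would be something like lambda text: openai.Embedding.create(input=text, engine=OPENAI_MODEL)["data"][0]["embedding"], and index would be the Pinecone index used above.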


In [64]:
# Leading spaces keep the excerpt, the instruction, and the question from running together
conversation2.predict(input="Here is the text: "+textforchatgpt+\
                      " Only answer based on this text. If you don't know, say you don't know."+\
                      " Question: "+query)

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: I am a professor of planetary science. You are a very helpful question and answer assistant. I am going to give you text excerpt from science papers and ask you questions about the text.
AI:  That sounds like a great challenge! Please provide me with the text extract and your questions, and I will do my best to answer them.
Human: Here is the text:  Super-Earths are also expected to form via giant impacts in a manner similar to the accretion of our terrestrial planets [Chambers, 2010]. As noted in section 5, Kokubo and Genda [2010] found that 0.25 to 1.75 MEarth planets had a similar average angular velocity at the end of accretion. Accounti

' A synestia is expected to cool to a state below its corotation limit (CoRoL) in approximately 102 to 103 years, after which it can begin to transition back into a planet. This process involves the removal of angular momentum from the structure by tides, interactions with the host star, or three body interactions with the host star and a moon. Once the angular momentum has been reduced to a level below the CoRoL, the structure will begin to contract, releasing kinetic and potential energy, as well as reducing its specific entropy. The whole process is expected to take around 103-109 years to complete.'
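
The instruction "Only answer based on this text" is what asks the model to stay within the retrieved excerpts. A small helper can keep the excerpt, the instruction, and the question clearly separated each time the prompt is built. This is a sketch: build_grounded_prompt is a made-up name and the blank-line separators are my choice, not something langchain requires.

```python
def build_grounded_prompt(context, question):
    """Assemble excerpt, instruction, and question with explicit separators."""
    return (
        "Here is the text: " + context.strip() + "\n\n"
        "Only answer based on this text. "
        "If you don't know, say you don't know.\n\n"
        "Question: " + question.strip()
    )


prompt = build_grounded_prompt(
    ' A synestia cools below the corotation limit. ',  # made-up excerpt
    'When does a synestia turn back into a planet?',
)
print(prompt)
```

The result would then be passed as conversation2.predict(input=prompt).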

In [65]:
conversation2.predict(input="Is Mars a synestia?")

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: I am a professor of planetary science. You are a very helpful question and answer assistant. I am going to give you text excerpt from science papers and ask you questions about the text.
AI:  That sounds like a great challenge! Please provide me with the text extract and your questions, and I will do my best to answer them.
Human: Here is the text:  Super-Earths are also expected to form via giant impacts in a manner similar to the accretion of our terrestrial planets [Chambers, 2010]. As noted in section 5, Kokubo and Genda [2010] found that 0.25 to 1.75 MEarth planets had a similar average angular velocity at the end of accretion. Accounti

'No, Mars is not a synestia. Synestias are post-impact states produced during the accretion of terrestrial planets. Mars has been in its current structure for around 4.5 billion years and has not experienced any kind of impact large enough to produce a synestia.'

In [66]:
conversation2.predict(input="Can my cat turn into a synestia?")

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: I am a professor of planetary science. You are a very helpful question and answer assistant. I am going to give you text excerpt from science papers and ask you questions about the text.
AI:  That sounds like a great challenge! Please provide me with the text extract and your questions, and I will do my best to answer them.
Human: Here is the text:  Super-Earths are also expected to form via giant impacts in a manner similar to the accretion of our terrestrial planets [Chambers, 2010]. As noted in section 5, Kokubo and Genda [2010] found that 0.25 to 1.75 MEarth planets had a similar average angular velocity at the end of accretion. Accounti

' No, animals and living beings cannot turn into synestias. Synestias are post-impact states produced during the accretion of terrestrial planets and are made up of rock and/or metal vapor.'

In [67]:
conversation2.predict(input="Have we observed a synestia in an exoplanetary system?")

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: I am a professor of planetary science. You are a very helpful question and answer assistant. I am going to give you text excerpt from science papers and ask you questions about the text.
AI:  That sounds like a great challenge! Please provide me with the text extract and your questions, and I will do my best to answer them.
Human: Here is the text:  Super-Earths are also expected to form via giant impacts in a manner similar to the accretion of our terrestrial planets [Chambers, 2010]. As noted in section 5, Kokubo and Genda [2010] found that 0.25 to 1.75 MEarth planets had a similar average angular velocity at the end of accretion. Accounti

' Not to my knowledge. Synestias are post-impact states produced during the accretion of terrestrial planets, and usually last for tens to thousands of years before cooling to a state below the corotation limit. Currently, we do not have the capabilities to detect exoplanetary synestias, but it could be possible in the future.'

## Compare temperatures

Repeat the conversation with temperature = 0.0, reusing the text retrieved for the previous query.

In [68]:
query = "When does a synestia turn back into a planet?"
llm = OpenAI(temperature=0.)
conversation3 = ConversationChain(llm=llm, verbose=True)
conversation3.predict(input="I am a professor of planetary science. \
You are a very helpful question and answer assistant. \
I am going to give you text excerpt from science papers and ask you questions about the text.")

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:

H: I am a professor of planetary science. You are a very helpful question and answer assistant. I am going to give you text excerpt from science papers and ask you questions about the text.
AI:

> Finished chain.


" That sounds like a great idea! I'm looking forward to helping you out. What is the first text excerpt you would like me to analyze?"

In [69]:
# Leading spaces keep the excerpt, the instruction, and the question from running together
conversation3.predict(input="Here is the text: "+textforchatgpt+\
                      " Only answer based on this text. If you don't know, say you don't know."+\
                      " Question: "+query)

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: I am a professor of planetary science. You are a very helpful question and answer assistant. I am going to give you text excerpt from science papers and ask you questions about the text.
AI:  That sounds like a great idea! I'm looking forward to helping you out. What is the first text excerpt you would like me to analyze?
Human: Here is the text:  Super-Earths are also expected to form via giant impacts in a manner similar to the accretion of our terrestrial planets [Chambers, 2010]. As noted in section 5, Kokubo and Genda [2010] found that 0.25 to 1.75 MEarth planets had a similar average angular velocity at the end of accretion. Accounting

' A synestia typically turns back into a planet when it has cooled to a state below the CoRoL (Corotation Limit). This process typically takes tens to thousands of years, depending on the radiative surface area of the structure and the amount of energy that needs to be radiated to fall below the CoRoL.'

In [71]:
conversation3.predict(input="Is Mars a synestia?")

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: I am a professor of planetary science. You are a very helpful question and answer assistant. I am going to give you text excerpt from science papers and ask you questions about the text.
AI:  That sounds like a great idea! I'm looking forward to helping you out. What is the first text excerpt you would like me to analyze?
Human: Here is the text:  Super-Earths are also expected to form via giant impacts in a manner similar to the accretion of our terrestrial planets [Chambers, 2010]. As noted in section 5, Kokubo and Genda [2010] found that 0.25 to 1.75 MEarth planets had a similar average angular velocity at the end of accretion. Accounting

' No, Mars is not a synestia. Synestias are post-impact structures that form during the accretion of terrestrial planets. Mars is a terrestrial planet that has already formed.'

In [72]:
conversation3.predict(input="Are planets sad?")

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: I am a professor of planetary science. You are a very helpful question and answer assistant. I am going to give you text excerpt from science papers and ask you questions about the text.
AI:  That sounds like a great idea! I'm looking forward to helping you out. What is the first text excerpt you would like me to analyze?
Human: Here is the text:  Super-Earths are also expected to form via giant impacts in a manner similar to the accretion of our terrestrial planets [Chambers, 2010]. As noted in section 5, Kokubo and Genda [2010] found that 0.25 to 1.75 MEarth planets had a similar average angular velocity at the end of accretion. Accounting

" I don't know."
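
The two runs above differ only in the temperature setting, so the comparison can be scripted rather than repeated by hand. This is a minimal sketch: compare_temperatures is a made-up name, and EchoConversation is an offline stand-in that only mimics the predict(input=...) call used above so the example runs without an API key.

```python
def compare_temperatures(make_conversation, opening, questions, temperatures=(0.9, 0.0)):
    """Ask the same questions in a fresh conversation at each temperature.

    make_conversation: callable that takes a temperature and returns an
    object with a predict(input=...) method.
    Returns a dict mapping each temperature to its list of answers.
    """
    answers = {}
    for temperature in temperatures:
        conversation = make_conversation(temperature)
        conversation.predict(input=opening)  # set up the assistant's role first
        answers[temperature] = [conversation.predict(input=q) for q in questions]
    return answers


class EchoConversation:
    """Offline stand-in for a conversation object, for illustration only."""

    def __init__(self, temperature):
        self.temperature = temperature

    def predict(self, input):
        return 'T=%s: %s' % (self.temperature, input)


answers = compare_temperatures(EchoConversation,
                               'I am a professor of planetary science.',
                               ['Is Mars a synestia?'])
for temperature, replies in answers.items():
    print(temperature, replies)
```

With the notebook's objects, make_conversation could be lambda t: ConversationChain(llm=OpenAI(temperature=t), verbose=True).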