# Extracting literature knowledge using LLMs

In this task, you will use large language models to extract knowledge from the scientific literature. Your goal is to develop an approach that can answer a series of chemistry and physics exam paper questions.

## How to complete this task

There are two ways to complete this task:

1. Using ChatGPT, a fully trained large language model hosted through OpenAI.
2. Using paper-qa, a package for answering questions based on PDF or text files.

More information on each of these approaches is outlined below.

### Option 1: ChatGPT

The simplest option is to use ChatGPT to answer the questions. To do this, navigate to https://chat.openai.com and create an account. You can then ask the large language model questions directly using the chat box.

While pasting a question directly into the model will always get you an answer, in many cases it is unlikely to be the correct one. To achieve better results, you can try "prompt engineering": adding more information to the prompt (the question) to improve the reliability and accuracy of the results. Research has shown that simply asking the model to respond as if it were an expert can improve the answers given.

A quick introduction to prompt engineering is available here: https://www.datacamp.com/tutorial/a-beginners-guide-to-chatgpt-prompt-engineering
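
As a minimal sketch of this idea, the snippet below wraps a raw exam question in an expert persona and a word limit before it is pasted into the chat box. The helper name `build_prompt` and the exact wording of the persona are illustrative assumptions, not a tested recipe.

```python
def build_prompt(
    question: str,
    persona: str = "an expert inorganic chemist",  # assumed wording, adjust per question
    max_words: int = 100,
) -> str:
    """Wrap a raw exam question with a persona and a word limit."""
    return (
        f"You are {persona}. "
        f"Answer the following exam question in fewer than {max_words} words, "
        "on a single line.\n\n"
        f"Question: {question}"
    )


prompt = build_prompt("Describe the 1H NMR spectrum of GeH4.")
print(prompt)
```

Changing `persona` per subject (for example, a physicist for the physics questions) is one simple variation to try.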

### Option 2: Paper-qa

If you have previous Python programming experience, we recommend trying the paper-qa approach. [Paper-qa](https://github.com/whitead/paper-qa) is a package for extracting and synthesising information contained in PDF and text files. Under the hood, it uses large language models (like ChatGPT) to:
1. Decide which PDF files are relevant to a question.
2. Extract the relevant information from PDF files.
3. Summarise the extracted information into a final response.

A benefit of paper-qa is that it provides references showing where its answer originates, unlike ChatGPT, which can confidently state incorrect information.

## The questions

The following questions have been taken from past chemistry and physics exam papers. Several of them require understanding and summarising different aspects of a subject, which can be difficult for a model like ChatGPT.

**Please ensure that all answers are fewer than 100 words. All answers will be truncated to this length when being marked. You can directly instruct the model to provide answers within this word count.**
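
A quick way to check an answer against the limit before submitting is sketched below. How the marking script counts words is not specified here, so this assumes words are whitespace-separated tokens; `word_count` and `truncate_words` are hypothetical helper names.

```python
def word_count(answer: str) -> int:
    """Count whitespace-separated words (assumed to match the marker's definition)."""
    return len(answer.split())


def truncate_words(answer: str, limit: int = 100) -> str:
    """Keep at most `limit` words, joined by single spaces."""
    return " ".join(answer.split()[:limit])


draft = "Fluorine has an anomalously weak bond because of lone pair repulsion."
print(word_count(draft), "words")
```

Truncating yourself is a last resort; an answer cut mid-sentence is likely to read worse than one written to fit.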

The list of questions is as follows:

**Chemistry questions**

1. Account for the variation in bond strengths of the Group 17 diatomic molecules (given in kJ mol-1): F2 (158), Cl2 (242), Br2 (192), I2 (151).
2. What is the oxidation state and hybridisation of the Cl centres in ClF3 and ClF5?
3. Carbon monoxide is a good ligand. Why is the isoelectronic N2 molecule not a good ligand?
4. Describe the 1H NMR spectrum of GeH4.
5. What is the expected maximum stable oxidation state for (a) Ba, (b) As, (c) Ti, (d) Cr?

**Physics questions**

6. A
7. B
8. C
9. D
10. E

## Uploading your results

Once you have the list of your 10 answers, you should add them to your GitHub pull request for automated scoring. See the automated scoring documentation for more details on how this process works.

Each answer must be on a single line (with no line breaks inside it), and the answers must appear in the same order as the questions above. The file you upload should therefore contain exactly 10 lines. If it contains more or fewer, an error message will be shown. An example answers file is shown below:

```
This is my answer to question 1.
This is my answer to question 2.
This is my answer to question 3.
This is my answer to question 4.
This is my answer to question 5.
This is my answer to question 6.
This is my answer to question 7.
This is my answer to question 8.
This is my answer to question 9.
This is my answer to question 10.
```

You should name your file: `task2.txt`.
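
If your answers are held in a Python list, a sketch like the one below can write the file and catch the most likely formatting mistakes first. The helper name `write_answers` is hypothetical, and omitting the trailing newline is an assumption made so that naive line counters still see exactly 10 lines.

```python
from pathlib import Path


def write_answers(answers, path="task2.txt", expected=10):
    """Write one answer per line after checking the count and flattening newlines."""
    if len(answers) != expected:
        raise ValueError(f"expected {expected} answers, got {len(answers)}")
    # Collapse internal newlines and runs of whitespace so each answer is one line.
    lines = [" ".join(answer.split()) for answer in answers]
    if any(not line for line in lines):
        raise ValueError("every answer must contain some text")
    # No trailing newline (assumption: the scorer may count an empty final line).
    Path(path).write_text("\n".join(lines), encoding="utf-8")
    return lines


# Example usage (commented out so nothing is written by accident):
# write_answers(my_ten_answers)
```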

## Using paper-qa

The rest of this notebook gives a quick introduction to using paper-qa, and should be used as the starting point for groups following option 2.

First, we need to download and install the necessary packages to run the notebook.

In [None]:
! pip install paper-qa openai

The next step is adding your OpenAI API key. This is necessary for paper-qa to formulate responses to the prompts, and to enable extraction of literature information. You should receive your group's personalised API key from the hackathon organisers.

**Each group has a fixed budget for API requests. Adding new documents and asking more questions will each generate multiple requests, so be mindful when using the model.**
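
One way to avoid paying twice for the same question is to cache results in memory, as in the sketch below. `make_cached` is a hypothetical helper, not part of paper-qa; it wraps any function that takes a question string, such as the `docs.query` method used later in this notebook.

```python
def make_cached(query_fn):
    """Return a wrapper that calls `query_fn` at most once per distinct question."""
    cache = {}

    def cached(question: str):
        # Normalise case and whitespace so trivially different spellings share an entry.
        key = " ".join(question.split()).lower()
        if key not in cache:
            cache[key] = query_fn(question)
        return cache[key]

    return cached


# Example usage once `docs` has been loaded (commented out here):
# cached_query = make_cached(docs.query)
# answer = cached_query("What is an oxidation state?")
```

The cache lives only for the current session, and it returns the first answer it saw, so bypass it when you deliberately want to re-ask a question.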

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "PUT_API_KEY_HERE"

Next, we set up paper-qa for use in notebook mode. It is essential that you run this code, otherwise the rest of the notebook will not work as expected.

In [None]:
import nest_asyncio

nest_asyncio.apply()

### Loading the Docs object and adding documents

First, we load a pre-prepared paperqa `Docs` object. See the [paper-qa documentation](https://github.com/whitead/paper-qa/tree/main#usage) for more details on this object. It is recommended that you use this docs object as the starting point for your queries.

This object has already been configured to include the following textbooks:

1. Inorganic Chemistry (2014) *Shriver, Weller, Overton, Rourke, Armstrong*, 6th Ed.

For reference, the object was created using the following code:

```python
from paperqa import Docs

docs = Docs(llm='gpt-3.5-turbo')
docs.add(
    "Inorganic Chemistry.pdf", 
    citation="Inorganic Chemistry, Shriver, 2014"
)
```



In [None]:
import pickle

with open("docs.p", "rb") as f:
    docs = pickle.load(f)

New documents (PDFs and text files) can be added to the `docs` object using the following code:

```python
docs.add("my_file.pdf")
```

You should add any documents that you think will help answer the questions. These can be papers taken from the scientific literature, text from websites, or any other sources you see fit.
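
When adding several files, a small loop keeps track of which ones failed to load. The sketch below uses only the two call forms shown above, `docs.add(path)` and `docs.add(path, citation=...)`; the helper name `add_documents` and the example file names are hypothetical.

```python
def add_documents(docs, sources):
    """Add each file in `sources` (a dict of path -> citation or None) to `docs`.

    Returns a list of (path, error message) pairs for files that could not be added.
    """
    failed = []
    for path, citation in sources.items():
        try:
            if citation is None:
                docs.add(path)
            else:
                docs.add(path, citation=citation)
        except Exception as exc:  # keep going so one bad file does not stop the rest
            failed.append((path, str(exc)))
    return failed


# Example usage (file names are placeholders):
# failed = add_documents(docs, {"review.pdf": "My Review, Author, 2020", "notes.txt": None})
# print(failed)
```

Remember that each added document uses part of your API budget, so add files selectively.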

### Querying the text corpus

You can query the `docs` object to help answer questions. When you do so, paper-qa will perform the following steps:

1. Search all documents for the 10 passages most relevant to the query (using ChatGPT).
2. Create a summary of each passage relevant to the query (using ChatGPT).
3. Put the summaries into a context.
4. Generate an answer taking into account the context (using ChatGPT).
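
The flow of the four steps above can be sketched with stand-ins. This is a toy illustration, not paper-qa's implementation: relevance is approximated by simple word overlap rather than a language model, and the `summarise` and `generate` arguments are placeholders for the model calls in steps 2 and 4. All function names are hypothetical.

```python
def score(passage: str, query: str) -> float:
    """Toy relevance: fraction of query words that also appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


def top_k(passages, query, k=10):
    """Step 1: keep the k highest-scoring passages."""
    return sorted(passages, key=lambda p: score(p, query), reverse=True)[:k]


def build_context(summaries):
    """Step 3: join the numbered summaries into one context string."""
    return "\n\n".join(f"[{i}] {s}" for i, s in enumerate(summaries, 1))


def answer_query(passages, query, summarise, generate, k=10):
    selected = top_k(passages, query, k)                  # step 1
    summaries = [summarise(p, query) for p in selected]   # step 2 (model call)
    context = build_context(summaries)                    # step 3
    return generate(context, query)                       # step 4 (model call)
```

Seeing the steps laid out this way makes it clearer why a single query produces several API requests: one per summarised passage, plus the final answer.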

An example of using the `docs` object is shown below:

In [None]:
answer = docs.query("What is an oxidation state?")
print(answer)

You can inspect the context (the selected passages) that paper-qa found relevant to your query using the `context` attribute.

In [None]:
print(answer.context)

### Customising prompts

Steps 1, 2, and 4 outlined above each use ChatGPT to extract information. Each step uses a custom prompt to achieve its goal. All of these prompts are configurable in paper-qa.

Below, we have reproduced the prompts that paper-qa uses. If you edit and rerun the cell, the prompts will be updated and you can tune how information is extracted. This can be an effective way of extracting more information for your query.

In [None]:
from langchain.prompts import PromptTemplate
from paperqa.prompts import _get_datetime
from paperqa.types import PromptCollection

summary_prompt = PromptTemplate(
    input_variables=["text", "citation", "question", "summary_length"],
    template="Summarize the text below to help answer a question. "
    "Do not directly answer the question, instead summarize "
    "to give evidence to help answer the question. "
    'Reply "Not applicable" if text is irrelevant. '
    "Use {summary_length}. At the end of your response, provide a score from 1-10 on a newline "
    "indicating relevance to question. Do not explain your score. "
    "\n\n"
    "{text}\n\n"
    "Excerpt from {citation}\n"
    "Question: {question}\n"
    "Relevant Information Summary:",
)

qa_prompt = PromptTemplate(
    input_variables=["context", "answer_length", "question"],
    template="Write an answer ({answer_length}) "
    "for the question below based on the provided context. "
    "If the context provides insufficient information, "
    'reply "I cannot answer". '
    "For each part of your answer, indicate which sources most support it "
    "via valid citation markers at the end of sentences, like (Example2012). "
    "Answer in an unbiased, comprehensive, and scholarly tone. "
    "If the question is subjective, provide an opinionated answer in the concluding 1-2 sentences. \n\n"
    "{context}\n"
    "Question: {question}\n"
    "Answer: ",
)

select_paper_prompt = PromptTemplate(
    input_variables=["question", "papers"],
    template="Select papers that may help answer the question below. "
    "Papers are listed as $KEY: $PAPER_INFO. "
    "Return a list of keys, separated by commas. "
    'Return "None", if no papers are applicable. '
    "Choose papers that are relevant, from reputable sources, and timely "
    "(if the question requires timely information). \n\n"
    "Question: {question}\n\n"
    "{papers}\n\n"
    "Selected keys:",
)

citation_prompt = PromptTemplate(
    input_variables=["text"],
    template="Provide the citation for the following text in MLA Format. Today's date is {date}\n"
    "{text}\n\n"
    "Citation:",
    partial_variables={"date": _get_datetime},
)

docs.prompts = PromptCollection(
    summary=summary_prompt,
    qa=qa_prompt,
    select=select_paper_prompt,
    cite=citation_prompt,
)

Any new queries to the `docs` object will use the updated prompts.
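
Before spending API requests on an edited prompt, it can help to preview what the filled-in text looks like. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate`, on the assumption that simple `{placeholder}` templates like these are filled the same way. The template is a shortened variant of the `qa` prompt with an added single-line instruction, and the context string is a made-up placeholder.

```python
# Shortened variant of the qa prompt above, with one extra instruction added.
qa_template = (
    "Write an answer ({answer_length}) "
    "for the question below based on the provided context. "
    "If the context provides insufficient information, "
    'reply "I cannot answer". '
    "Give the answer on a single line. \n\n"
    "{context}\n"
    "Question: {question}\n"
    "Answer: "
)

# Fill the placeholders by hand to see exactly what the model would receive.
filled = qa_template.format(
    answer_length="fewer than 100 words",
    context="(summaries of the selected passages would appear here)",
    question="What is an oxidation state?",
)
print(filled)
```

Any placeholder you add to a real template must also be listed in its `input_variables`, as in the cell above.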

### Querying ChatGPT and other OpenAI LLMs

You may find that paper-qa is too restrictive. If you want to query ChatGPT directly you can use the `OpenAI` object from the `langchain` package.

Below, we create a model to query the `text-davinci-003` OpenAI model. This is similar to ChatGPT but is less conversational. More information on the models available in OpenAI can be found on the OpenAI [documentation page](https://platform.openai.com/docs/models).

The `temperature` parameter adjusts the randomness of the output. Higher values like 0.9 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
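
To build intuition for this parameter, the sketch below applies temperature to a softmax over some made-up scores. Dividing the scores by the temperature before normalising is a common textbook formulation; whether OpenAI's models apply it in exactly this way is an assumption, and the scores are invented for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.2, 0.9):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature almost all of the probability sits on the top-scoring candidate, while the higher temperature spreads it more evenly, which is why the output becomes more varied.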

In [None]:
from langchain import OpenAI

model = OpenAI(model_name="text-davinci-003", temperature=0.9)

You can query the model as follows:

In [None]:
response = model("What is an oxidation state?")
print(response)

Querying OpenAI models directly may give you an answer where paper-qa does not. However, without the source documents to constrain them, OpenAI models are more prone to giving incorrect information, so check the results carefully.

### Prompt engineering

ChatGPT, paper-qa, and OpenAI models can all be tuned using prompt engineering. It may be better to ask your question in multiple parts, to state the expected audience of your question, or to ask the model to respond as an expert. A quick introduction to prompt engineering is available here: https://www.datacamp.com/tutorial/a-beginners-guide-to-chatgpt-prompt-engineering
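
As one example of asking a question in multiple parts, the sketch below splits a question with lettered parts, such as chemistry question 5, into separate sub-questions that can be asked one at a time. The helper name `split_parts` is hypothetical, and the pattern only handles parts labelled `(a)`, `(b)`, and so on.

```python
import re


def split_parts(question: str):
    """Split 'stem (a) X, (b) Y?' into one sub-question per lettered part."""
    pieces = re.split(r"\(([a-z])\)", question)
    if len(pieces) == 1:
        return [question]  # no lettered parts found
    stem = pieces[0].strip()
    sub_questions = []
    # After the stem, pieces alternate between a label and the text that follows it.
    for label, body in zip(pieces[1::2], pieces[2::2]):
        body = body.strip(" ,;?.")
        sub_questions.append(f"{stem} {body}? (part {label})")
    return sub_questions


question = (
    "What is the expected maximum stable oxidation state for "
    "(a) Ba, (b) As, (c) Ti, (d) Cr?"
)
for sub in split_parts(question):
    print(sub)
```

The separate answers then need combining into a single line that still fits the word limit, and each sub-question costs its own requests.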