## **Chapter 5 – Literature and Knowledge Mining with LLMs**

The explosion of chemical literature has made it difficult for researchers to keep up with new findings. Large language models (LLMs) provide a powerful way to mine this vast body of knowledge. This chapter explores how LLMs change the way researchers interact with the literature, starting with the scale and complexity of today’s journals and the limits of manual or rule-based mining. We then cover automated techniques powered by LLMs, such as chemical named entity recognition, relationship extraction, and parsing of experimental sections.  

We further show how extracted data can form **knowledge bases or graphs**, organizing reactions, properties, and relationships beyond traditional databases. LLM-driven analysis of literature can also inspire new hypotheses and guide research directions. Practical examples, code snippets, and case studies demonstrate these ideas, highlighting that LLMs are not just tools for text, but catalysts for **knowledge discovery in chemistry**.  


### 5.2.1 Entity and Relationship Extraction  

A core task in text mining is **entity extraction**—identifying mentions of key items within text. In chemistry, these entities often include compounds, materials, reactions, properties, experimental conditions, and instruments. For example, **chemical named entity recognition (CNER)** targets chemical names such as *aspirin*, *H₂SO₄*, or *acetylsalicylic acid*.  

Building on this, **relationship extraction** identifies how entities connect. Examples include linking a compound to a property (*“aspirin has a melting point of 135 °C”*) or mapping reagents to a reaction outcome (*“benzene under condition X yields phenol”*).  
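
One common way to store such a relationship is as a subject, predicate, value triple. The sketch below is only illustrative: the `Relation` type, the `MELTING_POINT` pattern, and the `extract_melting_point` helper are hypothetical names, and the pattern covers just the single sentence form shown above.

```python
import re
from typing import NamedTuple, Optional


class Relation(NamedTuple):
    subject: str
    predicate: str
    value: str


# Illustrative pattern for sentences of the form
# "<compound> has a melting point of <number> °C"
MELTING_POINT = re.compile(
    r"(?P<compound>[\w\- ]+?) has a melting point of (?P<value>\d+(?:\.\d+)?)\s*°C",
    re.IGNORECASE,
)


def extract_melting_point(sentence: str) -> Optional[Relation]:
    """Return a (compound, melting_point, value) triple, or None if no match."""
    match = MELTING_POINT.search(sentence)
    if match is None:
        return None
    return Relation(
        match.group("compound").strip(),
        "melting_point",
        match.group("value") + " °C",
    )


relation = extract_melting_point("aspirin has a melting point of 135 °C")
print(relation)
```

A hand-written pattern like this breaks as soon as the phrasing changes, which is the gap that prompt-based extraction is meant to close.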

LLMs can serve as powerful chemical entity recognizers. With well-designed prompts, an LLM can perform CNER dynamically, leveraging its contextual understanding gained during training. Unlike classical NER systems that depend on fixed vocabularies or labeled datasets, an LLM can infer roles from context. For instance, given the sentence:  

*“We dissolved the sample in ethyl acetate.”*  

A traditional algorithm might miss the solvent if it were not in its dictionary, whereas an LLM can infer that *ethyl acetate* functions as a solvent in this context.  
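
To make the contrast concrete, here is a minimal sketch of a dictionary-based tagger. The vocabulary `SOLVENT_DICTIONARY` and the function `dictionary_tag` are hypothetical and deliberately tiny; real rule-based systems use far larger lexicons, but the failure mode is the same.

```python
import re

# Hypothetical, deliberately small solvent vocabulary (illustration only)
SOLVENT_DICTIONARY = {"methanol", "ethanol", "dichloromethane"}


def dictionary_tag(sentence, vocabulary):
    """Return the vocabulary terms that appear in the sentence as whole words."""
    lowered = sentence.lower()
    found = []
    for term in sorted(vocabulary):
        # Word boundaries avoid matching "ethanol" inside "methanol"
        if re.search(rf"\b{re.escape(term)}\b", lowered):
            found.append(term)
    return found


known = dictionary_tag("The product was recrystallized from ethanol.", SOLVENT_DICTIONARY)
missed = dictionary_tag("We dissolved the sample in ethyl acetate.", SOLVENT_DICTIONARY)
print(known)   # the solvent is in the vocabulary, so it is tagged
print(missed)  # ethyl acetate is not in the vocabulary, so nothing is tagged
```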

**Example:** Consider the sentence:  
*“The mixture of benzaldehyde and NaBH₄ was stirred in methanol at 0 °C.”*  

A prompt to an LLM could extract:  
- **Chemicals**: benzaldehyde, NaBH₄, methanol  
- **Condition**: 0 °C  


In [1]:
from openai import OpenAI
from google.colab import userdata
import os
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Initialize client
client = OpenAI()

# Input text
text = "The mixture of benzaldehyde and NaBH4 was stirred in methanol at 0°C."

# Build prompt
prompt = f"List all chemical compounds and conditions in this text: '{text}'"

# Request completion
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)

# Print answer
print(response.choices[0].message.content)



Chemical Compounds:
1. Benzaldehyde
2. NaBH4 (Sodium Borohydride)
3. Methanol

Conditions:
1. The mixture was stirred
2. The process was carried out at a temperature of 0°C.


Let’s illustrate with a practical example using an LLM API. Suppose we have a paragraph from a paper and want to pull out specific facts. Below, we simulate this with steps from synthetic chemistry procedures and use an LLM to extract one entity from each step: the reaction temperature. First, define a list of procedure steps (as strings):

In [None]:
# Example list of synthetic procedure steps
synthetic_steps = [
    "To a solution of aniline (0.1 mol) in 50 mL of ethanol, 0.12 mol of acetic anhydride was added dropwise at 5°C under stirring. After 2 hours, the product was filtered to yield acetanilide.",
    "Weigh 2.5 g of benzaldehyde and add it to a flask with 15 mL of 95% ethanol. Add 3 g of NaOH and heat to 60°C for 30 minutes. Allow to cool and collect crystals by filtration.",
    # ... more steps
]

# Preview the steps defined above
for i, step in enumerate(synthetic_steps, 1):
    print(f"Step {i}: {step}")


Step 1: To a solution of aniline (0.1 mol) in 50 mL of ethanol, 0.12 mol of acetic anhydride was added dropwise at 5°C under stirring. After 2 hours, the product was filtered to yield acetanilide.
Step 2: Weigh 2.5 g of benzaldehyde and add it to a flask with 15 mL of 95% ethanol. Add 3 g of NaOH and heat to 60°C for 30 minutes. Allow to cool and collect crystals by filtration.


Putting the pieces together, we loop over the steps and ask the model for the reaction temperature in each one:

In [None]:
from openai import OpenAI

client = OpenAI()

# Loop through each synthetic step and extract temperature
for step in synthetic_steps:
    prompt = f"Extract the reaction temperature mentioned in the following procedure step (if any):\n{step}"

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    temp = response.choices[0].message.content.strip()
    print("Step text:", step[:30], "... -> Temperature:", temp)


Step text: To a solution of aniline (0.1  ... -> Temperature: The reaction temperature mentioned in the procedure step is 5°C.
Step text: Weigh 2.5 g of benzaldehyde an ... -> Temperature: 60°C
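
The two replies above differ in form: one is a full sentence and the other a bare value. Before storing such answers, it can help to normalize them. The sketch below assumes the temperature is reported in °C; `parse_celsius` is a hypothetical helper, and it does not handle ranges, Unicode minus signs, or phrases such as "room temperature".

```python
import re
from typing import Optional

# Matches values such as "5°C", "60 °C", or "-78 °C"
CELSIUS = re.compile(r"(-?\d+(?:\.\d+)?)\s*°\s*C")


def parse_celsius(reply: str) -> Optional[float]:
    """Return the first Celsius temperature found in a model reply, or None."""
    match = CELSIUS.search(reply)
    return float(match.group(1)) if match else None


replies = [
    "The reaction temperature mentioned in the procedure step is 5°C.",
    "60°C",
]
temperatures = [parse_celsius(reply) for reply in replies]
print(temperatures)
```

An alternative is to constrain the prompt itself, for example by asking for the number only, and keep a parser like this as a safety net.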


### 5.2.2 Reaction and Methodology Mining  

Extracting reactions and experimental methods from literature is one of the most valuable applications of text mining in chemistry. Systematically capturing details from patents and papers enables the creation of reaction databases, which in turn support tasks such as training predictive models and analyzing trends.  

Traditional approaches break the text into parts: identifying reagents, products, conditions (temperature, catalysts), and actions (“added”, “heated”, “stirred”), then reconstructing structured formats (e.g., reaction SMILES or procedural steps). While accurate, these methods require heavy annotation and predefined categories.  
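
A toy version of such a rule-based pipeline is sketched below. The verb list `ACTION_VERBS`, the two patterns, and the `tag_step` function are hypothetical stand-ins for the predefined categories a real system would derive from annotated corpora.

```python
import re

# Hypothetical predefined categories (illustration only)
ACTION_VERBS = ("added", "heated", "stirred", "cooled", "filtered")
TEMPERATURE = re.compile(r"-?\d+(?:\.\d+)?\s*°C")
DURATION = re.compile(r"\d+(?:\.\d+)?\s*(?:hours?|minutes?|h|min)\b")


def tag_step(sentence):
    """Tag actions, temperatures, and durations using fixed rules only."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {
        "actions": [word for word in words if word in ACTION_VERBS],
        "temperatures": TEMPERATURE.findall(sentence),
        "durations": DURATION.findall(sentence),
    }


tags = tag_step("The mixture was stirred at 60 °C for 30 minutes, then cooled and filtered.")
print(tags)

# "refluxed" and "overnight" fall outside the predefined rules and are missed
unseen = tag_step("The solution was refluxed overnight.")
print(unseen)
```

Every new verb or way of writing a condition requires another rule, which is the maintenance cost described above.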

LLMs offer a more flexible solution. With prompting, they can read free-form text and output structured data (e.g., JSON with reactants, products, solvents, and conditions) or even step-by-step procedures. This transforms unstructured text into searchable, machine-readable records.  

For example, instead of manually scanning a paper, an LLM could answer:  
*“What yield did they report for the oxidation of benzyl alcohol to benzaldehyde?”* → *“82% yield under condition X.”*  

Such capabilities make literature more searchable, actionable, and directly integrable into electronic lab notebooks and ML workflows.  


In [None]:
from openai import OpenAI

client = OpenAI()

procedure = (
    "Benzyl alcohol (1 mmol) was dissolved in 5 mL of dichloromethane. "
    "TEMPO (5 mol%) and potassium bromide (0.5 mmol) were added. "
    "The mixture was cooled to 0°C and 2 mmol of bleach (sodium hypochlorite) solution "
    "was added dropwise. The reaction was stirred for 1 hour, warming to room temperature. "
    "The organic layer was separated, dried, and evaporated to give benzaldehyde in 75% yield."
)

prompt = (
    "Extract the key details from this procedure as JSON with fields: "
    "reactant, product, catalyst, other_reagents, temperature, time, yield.\n\n"
    f"{procedure}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0
)

result = response.choices[0].message.content.strip()
print(result)


{
"reactant": "Benzyl alcohol",
"product": "Benzaldehyde",
"catalyst": "TEMPO",
"other_reagents": ["Dichloromethane", "Potassium bromide", "Bleach (sodium hypochlorite)"],
"temperature": ["0°C", "Room temperature"],
"time": "1 hour",
"yield": "75%"
}
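
Because the model returns plain text, it is worth parsing and checking the reply before storing it. The following is a minimal sketch using Python's standard `json` module; `EXPECTED_FIELDS` mirrors the fields requested in the prompt, while `parse_record` and the leading sentence in `raw` are assumptions added for illustration.

```python
import json

# Field names requested in the prompt
EXPECTED_FIELDS = {"reactant", "product", "catalyst", "other_reagents",
                   "temperature", "time", "yield"}


def parse_record(raw_text):
    """Parse a model reply into a dict and check that all fields are present."""
    # Keep only the outermost {...} in case the reply contains extra text
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in reply")
    record = json.loads(raw_text[start:end + 1])
    missing = EXPECTED_FIELDS - set(record)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record


# The reply shown above, with an extra leading sentence added to show
# that surrounding text is tolerated
raw = '''Here is the result:
{
"reactant": "Benzyl alcohol",
"product": "Benzaldehyde",
"catalyst": "TEMPO",
"other_reagents": ["Dichloromethane", "Potassium bromide", "Bleach (sodium hypochlorite)"],
"temperature": ["0°C", "Room temperature"],
"time": "1 hour",
"yield": "75%"
}'''

record = parse_record(raw)
yield_percent = float(record["yield"].rstrip("%"))
print(record["product"], yield_percent)
```

Parsing only checks the structure. Chemical correctness still needs review: in the reply above, for instance, the solvent dichloromethane is listed under `other_reagents`.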
