## **Chapter 4 – Large Language Models in Chemistry**


This chapter explores how transformer-based LLMs can be applied in the chemical sciences. We cover how they process chemical information, strategies for prompting and fine-tuning, and building chemistry-focused conversational agents. Extensions to multimodal data (structures, spectra) and integration with external tools are discussed, along with evaluation, reliability, and safety (e.g., avoiding hallucinations). By the end, readers will understand how to apply and adapt LLMs to chemical problems responsibly.  


### 4.2.2 Domain-Specific Fine-Tuning Pipelines  
**Step 1: Prepare Your Dataset**

Gather examples of the task you want the model to learn, typically question–answer pairs.  
For instance, if you want a model that assists with chemical reaction questions, you could collect problems as **user** messages and their solutions as **assistant** replies.  

The dataset must be in **JSONL (JSON Lines)** format, where each line is one JSON object. For chat models such as `gpt-3.5-turbo`, each object holds a `"messages"` list of `system`, `user`, and `assistant` turns: the `user` content is the input you would provide at runtime, and the `assistant` content is the expected response. (The older `"prompt"`/`"completion"` layout applies only to legacy completion models.)  

Be sure to include any formatting in the prompt that you plan to use during inference.  
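
Before uploading, it can help to check the file locally. The following is a minimal sketch of such a check, assuming the chat layout described above; the function name `validate_chat_jsonl` and the specific rules it enforces are illustrative choices, not an official validator.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_chat_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((lineno, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            messages = record.get("messages") if isinstance(record, dict) else None
            if not isinstance(messages, list) or not messages:
                problems.append((lineno, "missing or empty 'messages' list"))
                continue
            for m in messages:
                if not isinstance(m, dict) or m.get("role") not in ALLOWED_ROLES:
                    problems.append((lineno, "message with missing or unexpected role"))
                    break
                if not isinstance(m.get("content"), str):
                    problems.append((lineno, "message content is not a string"))
                    break
            else:
                # Only reached when every message passed the checks above
                if messages[-1]["role"] != "assistant":
                    problems.append((lineno, "last message is not from the assistant"))
    return problems
```

Calling `validate_chat_jsonl("chem_tuning_data_chat.jsonl")` once the file has been written should return an empty list for a well-formed dataset.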


In [None]:
import json
data = [
    {
        "messages": [
            {"role": "system", "content": "You are a chemistry tutor."},
            {"role": "user", "content": "What is the formula of acetone?"},
            {"role": "assistant", "content": "C3H6O"}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a chemistry tutor."},
            {"role": "user", "content": "How many 1H NMR signals would benzene have?"},
            {"role": "assistant", "content": "1 (all hydrogens are equivalent)"}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a chemistry tutor."},
            {"role": "user", "content": "What functional group is present in ethanol?"},
            {"role": "assistant", "content": "A hydroxyl (-OH) group"}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a chemistry tutor."},
            {"role": "user", "content": "What is the pKa of acetic acid approximately?"},
            {"role": "assistant", "content": "Around 4.76"}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a chemistry tutor."},
            {"role": "user", "content": "What is the molecular geometry of methane?"},
            {"role": "assistant", "content": "Tetrahedral"}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a chemistry tutor."},
            {"role": "user", "content": "What hybridization does the carbon atom in ethyne (C2H2) have?"},
            {"role": "assistant", "content": "sp hybridization"}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a chemistry tutor."},
            {"role": "user", "content": "Which reagent is commonly used to test for unsaturation (double bonds) in hydrocarbons?"},
            {"role": "assistant", "content": "Bromine water (aqueous Br2)"}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a chemistry tutor."},
            {"role": "user", "content": "What type of bond is formed between Na+ and Cl- in NaCl?"},
            {"role": "assistant", "content": "An ionic bond"}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a chemistry tutor."},
            {"role": "user", "content": "What is the IUPAC name of CH3-CH2-CH2-OH?"},
            {"role": "assistant", "content": "Propan-1-ol"}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a chemistry tutor."},
            {"role": "user", "content": "Which gas is released when hydrochloric acid reacts with sodium carbonate?"},
            {"role": "assistant", "content": "Carbon dioxide (CO2)"}
        ]
    }
]



with open("chem_tuning_data_chat.jsonl", "w", encoding="utf-8") as f:
    for row in data:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

print("Wrote chat-format file!")
!head -n 5 chem_tuning_data_chat.jsonl


Wrote chat-format file!
{"messages": [{"role": "system", "content": "You are a chemistry tutor."}, {"role": "user", "content": "What is the formula of acetone?"}, {"role": "assistant", "content": "C3H6O"}]}
{"messages": [{"role": "system", "content": "You are a chemistry tutor."}, {"role": "user", "content": "How many 1H NMR signals would benzene have?"}, {"role": "assistant", "content": "1 (all hydrogens are equivalent)"}]}
{"messages": [{"role": "system", "content": "You are a chemistry tutor."}, {"role": "user", "content": "What functional group is present in ethanol?"}, {"role": "assistant", "content": "A hydroxyl (-OH) group"}]}
{"messages": [{"role": "system", "content": "You are a chemistry tutor."}, {"role": "user", "content": "What is the pKa of acetic acid approximately?"}, {"role": "assistant", "content": "Around 4.76"}]}
{"messages": [{"role": "system", "content": "You are a chemistry tutor."}, {"role": "user", "content": "What is the molecular geometry of methane?"}, {"rol

**Step 2: Upload and Validate the Dataset**

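Once you have a `.jsonl` file (e.g., `chem_tuning_data_chat.jsonl`), you can upload it with the **OpenAI CLI**. The command below uploads the file and marks it for fine-tuning; the file is then validated when the fine-tuning job starts (the `validating_files` status in Step 3):  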

```bash
openai api files.create -f chem_tuning_data_chat.jsonl -p fine-tune
```
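To run this successfully, your API key must be available as the `OPENAI_API_KEY` environment variable.

In Colab, you can use the **Secrets tab** (🔑) to securely store your API key.  
Colab secrets are not exported to the environment automatically: read the secret with `userdata.get` and copy it into `os.environ`. Once `OPENAI_API_KEY` is set there, the CLI and the Python library pick it up without further configuration.  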

In general, the steps to get started are:  
1. Create an **OpenAI account**  
2. Generate an **API key** from your account dashboard  
3. Set the key as an environment variable (`OPENAI_API_KEY`) in your working environment  

You can set it in different ways:  
```python
import os
os.environ["OPENAI_API_KEY"] = "sk-your-api-key"
```
or
```bash
export OPENAI_API_KEY="sk-your-api-key"
```
⚠️ Important: API keys grant access to your OpenAI account. Never share them in public notebooks, commits, or screenshots. Use environment variables or Colab’s secret manager to keep them secure.
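
As a small safeguard, a notebook can fail early when the key is missing and avoid printing it in full. This is a minimal sketch; the helper names `require_api_key` and `mask` are hypothetical and not part of any OpenAI library.

```python
import os

def require_api_key(var_name="OPENAI_API_KEY"):
    """Return the key from the environment, or raise with a helpful message."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it or load it from your secret manager.")
    return key

def mask(key, visible=4):
    """Hide all but the last few characters so the key can appear in logs safely."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```

For example, `print(mask(require_api_key()))` confirms that a key is loaded without revealing it.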



In [4]:
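# google.colab ships with Colab, so nothing needs to be installed here
from google.colab import userdata
import os

# Copy the Colab secret into the environment so the CLI and SDK can find it
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')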



In [None]:
!openai api files.create -f chem_tuning_data_chat.jsonl -p fine-tune

Upload progress: 100% 2.16k/2.16k [00:00<00:00, 3.62kit/s]
{
  "id": "file-FFgRh4gh7pBug9Lsrb7m6e",
  "bytes": 2163,
  "created_at": 1759130284,
  "filename": "chem_tuning_data_chat.jsonl",
  "object": "file",
  "purpose": "fine-tune",
  "status": "processed",
  "expires_at": null,
  "status_details": null
}


**Step 3: Initiate the Fine-Tuning Job**  

Now you can start the fine-tuning process by specifying a base model.  
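As of 2025, OpenAI supports fine-tuning on selected GPT-3.5 and GPT-4-family models (e.g., `gpt-3.5-turbo` and certain dated `gpt-4o` snapshots). The list changes over time, so check the current fine-tuning documentation before choosing a base model.  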

For example, to fine-tune `gpt-3.5-turbo` with your uploaded file:  

```bash
openai api fine_tuning.jobs.create -F "<YOUR FILE ID>" -m "gpt-3.5-turbo"
```
Replace `<YOUR FILE ID>` with the ID you received from the upload step (it typically starts with `file-`).

This command kicks off the fine-tuning job. By default, OpenAI chooses the training hyperparameters automatically: `n_epochs`, `learning_rate_multiplier`, and `batch_size` all default to `"auto"`. You may override them if you have special requirements, but the defaults are usually fine for a first run.

Around 4 epochs is a common starting point. For very small datasets the service tends to pick more (the 10-example job in this section ended up with `n_epochs` of 10), which helps the model fit the examples but raises the risk of overfitting rather than improving generalization.

OpenAI will queue and run the job on their servers — training may take minutes to an hour depending on dataset size and model complexity.
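
The same upload-and-start sequence can be scripted with the Python SDK. The sketch below assumes the `openai` package (version 1.x) and wraps its `files.create` and `fine_tuning.jobs.create` calls in a hypothetical helper, `start_fine_tune`, that takes the client as an argument so it can be exercised without network access.

```python
def start_fine_tune(client, jsonl_path, base_model="gpt-3.5-turbo", n_epochs=None):
    """Upload a training file and start a fine-tuning job; returns the job object."""
    with open(jsonl_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="fine-tune")

    kwargs = {"training_file": uploaded.id, "model": base_model}
    if n_epochs is not None:
        # Leave this out to keep the service's "auto" setting
        kwargs["hyperparameters"] = {"n_epochs": n_epochs}
    return client.fine_tuning.jobs.create(**kwargs)

# Usage (requires OPENAI_API_KEY and network access):
#   from openai import OpenAI
#   job = start_fine_tune(OpenAI(), "chem_tuning_data_chat.jsonl")
#   print(job.id, job.status)
```

Passing the client in, instead of creating it inside the function, is a design choice made here for testability.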

In [None]:
!openai api fine_tuning.jobs.create -F "file-FFgRh4gh7pBug9Lsrb7m6e" -m gpt-3.5-turbo

{
  "id": "ftjob-McgUcNQdqzxhlImTVVgoNWGl",
  "created_at": 1759132118,
  "error": {
    "code": null,
    "message": null,
    "param": null
  },
  "fine_tuned_model": null,
  "finished_at": null,
  "hyperparameters": {
    "batch_size": "auto",
    "learning_rate_multiplier": "auto",
    "n_epochs": "auto"
  },
  "model": "gpt-3.5-turbo-0125",
  "object": "fine_tuning.job",
  "organization_id": "org-vVj7NX8f155Kazao4Phi3tol",
  "result_files": [],
  "seed": 1417228779,
  "status": "validating_files",
  "trained_tokens": null,
  "training_file": "file-FFgRh4gh7pBug9Lsrb7m6e",
  "validation_file": null,
  "estimated_finish": null,
  "integrations": [],
  "metadata": null,
  "method": {
    "type": "supervised",
    "dpo": null,
    "reinforcement": null,
    "supervised": {
      "hyperparameters": {
        "batch_size": "auto",
        "learning_rate_multiplier": "auto",
        "n_epochs": "auto"
      }
    }
  },
  "user_provided_suffix": null,
  "usage_metrics": null,
  "shared_w

**Step 4: Monitor the Fine-Tune and Wait**

You can check the status of your fine-tuning jobs with:  

```bash
openai api fine_tuning.jobs.list
```
This lists both active and finished fine-tuning jobs. The `status` field moves from `validating_files` through `queued` and `running` to `succeeded` (or `failed`/`cancelled`).

OpenAI typically also sends an email when training finishes.
If successful, the job's `fine_tuned_model` field holds a new model ID of the form `ft:<base-model>:<your-org>:<suffix>:<id>` (the suffix is empty if you did not set one).


This fine-tuned model is now stored on OpenAI’s side and ready to use in API calls.
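
Rather than re-running the list command by hand, a script can poll the job until it reaches a terminal status. This is a minimal sketch assuming the `openai` 1.x SDK's `fine_tuning.jobs.retrieve` call; the helper name `wait_for_job`, the polling interval, and the timeout are arbitrary choices.

```python
import time

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}

def wait_for_job(client, job_id, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll a fine-tuning job until it finishes; returns the final job object."""
    waited = 0
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"job {job_id} still '{job.status}' after {waited} s")
        sleep(poll_seconds)
        waited += poll_seconds

# Usage (requires OPENAI_API_KEY and network access):
#   from openai import OpenAI
#   job = wait_for_job(OpenAI(), "<YOUR JOB ID>")
#   print(job.status, job.fine_tuned_model)
```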


In [None]:
!openai api fine_tuning.jobs.list

{
  "data": [
    {
      "id": "ftjob-McgUcNQdqzxhlImTVVgoNWGl",
      "created_at": 1759132118,
      "error": {
        "code": null,
        "message": null,
        "param": null
      },
      "fine_tuned_model": null,
      "finished_at": 1759132488,
      "hyperparameters": {
        "batch_size": 1,
        "learning_rate_multiplier": 2.0,
        "n_epochs": 10
      },
      "model": "gpt-3.5-turbo-0125",
      "object": "fine_tuning.job",
      "organization_id": "org-vVj7NX8f155Kazao4Phi3tol",
      "result_files": [],
      "seed": 1417228779,
      "status": "running",
      "trained_tokens": null,
      "training_file": "file-FFgRh4gh7pBug9Lsrb7m6e",
      "validation_file": null,
      "estimated_finish": null,
      "integrations": [],
      "metadata": null,
      "method": {
        "type": "supervised",
        "dpo": null,
        "reinforcement": null,
        "supervised": {
          "hyperparameters": {
            "batch_size": 1,
            "learning_rate_m

**Step 5: Using the Fine-Tuned Model**  
Once the job has succeeded, you call the fine-tuned model like any other model in the API, passing its ID as the `model` argument. For example, in Python:





In [None]:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0125:washington-university-in-st-louis-zheng-group::CL2YjEu2",
    messages=[
        {"role": "user", "content": "Q: What is the formula of ethanol?\nA:"}
    ]
)

print(response.choices[0].message.content)

C2H5OH
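
Note that the call above wraps the question in a `Q: ... A:` template that does not appear in the training examples, which used a system message followed by a plain question. Fine-tuned models generally behave most predictably when inference requests mirror the training format, so one option is a small helper that rebuilds that format. The sketch below is illustrative; `build_messages` is a hypothetical name.

```python
SYSTEM_PROMPT = "You are a chemistry tutor."  # same system message as the training examples

def build_messages(question, system_prompt=SYSTEM_PROMPT):
    """Format a question the same way the training data was formatted."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question.strip()},
    ]

# Usage: pass the result as the `messages` argument of client.chat.completions.create
messages = build_messages("What is the formula of ethanol?")
```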


**Step 6: Test and Refine**  

Evaluate your fine-tuned model on examples not in the training set to check performance. If weaknesses appear (e.g., struggles with organometallic nomenclature), add more high-quality, consistent examples and fine-tune again. Iteration is normal, but avoid over-tuning on inconsistent data. You can also fine-tune in stages by using your fine-tuned model as the new base.  
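
A held-out check can be as simple as comparing model answers with reference answers. The sketch below computes exact-match accuracy after light normalization; the `ask` argument is an assumed callable that sends a question to your model and returns its text, and exact match is a deliberately crude metric that will undercount correct answers phrased differently.

```python
def normalize(text):
    """Lowercase, collapse whitespace, and drop a trailing period."""
    return " ".join(text.lower().split()).rstrip(".")

def exact_match_accuracy(examples, ask):
    """examples: list of (question, expected_answer); ask: callable question -> answer text."""
    if not examples:
        raise ValueError("need at least one held-out example")
    hits = sum(normalize(ask(q)) == normalize(expected) for q, expected in examples)
    return hits / len(examples)
```

In practice `ask` would wrap a `client.chat.completions.create` call against the fine-tuned model, and the same function can be run against the base model for comparison.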


**Step 7: Cost and Considerations**  

Fine-tuning has costs: you pay for the tokens processed during training and a higher per-token rate when calling the custom model. Weigh the benefits against using a base model with careful prompting. Remember that fine-tuning shapes style and task specialization, not general knowledge: it won't add facts beyond the base model's cutoff unless they are explicitly provided in the training data.  
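
Training cost scales with the number of tokens in the file multiplied by the number of epochs. The sketch below gives a rough estimate under two loud assumptions: about 4 characters per token (a rule of thumb for English text, not an exact count) and a price per million tokens that you must look up yourself, since the value used in the usage line is a placeholder.

```python
def estimate_training_tokens(records, n_epochs, chars_per_token=4.0):
    """Rough token count: message characters / chars_per_token, times epochs."""
    chars = sum(len(m["content"]) for r in records for m in r["messages"])
    return int(chars / chars_per_token * n_epochs)

def estimate_cost_usd(tokens, usd_per_million_tokens):
    """Convert a token count to dollars at a given (assumed) price."""
    return tokens / 1_000_000 * usd_per_million_tokens

# Usage with a placeholder price (check the current pricing page):
#   tokens = estimate_training_tokens(data, n_epochs=10)
#   print(estimate_cost_usd(tokens, usd_per_million_tokens=8.0))
```

For an exact count, a tokenizer library would replace the characters-per-token heuristic.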




### 4.3 Building a Chemistry-Focused Conversational Agent  

LLMs can power interactive chemistry assistants that maintain dialogue, use tools, and serve students or researchers. To improve reliability, combine them with **retrieval-augmented generation (RAG)**, where the system fetches relevant passages (e.g., lab manuals, reaction guides) and supplies them to the model before it answers. This reduces hallucinations and helps keep responses grounded in your own chemistry resources.  


**Step 1: Prepare and Chunk Your Chemistry Content**

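Split your chemistry sources into small, meaningful text chunks (e.g., by paragraph). Each chunk must fit within the embedding model's input limit (8,191 tokens for `text-embedding-ada-002`), and the retrieved chunks together must leave room in the chat model's context window, so chunks of a few hundred tokens are a common choice.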

In [None]:
# Example documents
documents = [
    {
        "title": "IR Guide",
        "text": "A broad peak around 3200-3600 cm^-1 suggests O–H..."
    },
    {
        "title": "Reactions",
        "text": "To dehydrate alcohols, use H2SO4 or POCl3..."
    }
]

# Break documents into chunks (split on blank lines, skipping empty pieces)
chunks = [
    {"title": doc["title"], "text": para.strip()}
    for doc in documents
    for para in doc["text"].split("\n\n")
    if para.strip()
]
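
Splitting on blank lines alone can produce chunks that are too small or too large. One possible refinement, sketched below, packs consecutive paragraphs into chunks up to a character budget. The function names and the `max_chars` default are illustrative assumptions, and characters are used as a cheap stand-in for tokens.

```python
def split_paragraphs(text):
    """Split on blank lines and drop empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def pack_paragraphs(paragraphs, max_chars=1200):
    """Greedily merge consecutive paragraphs while the result stays within max_chars."""
    packed, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                packed.append(current)
            current = para  # an oversized paragraph becomes its own chunk
    if current:
        packed.append(current)
    return packed

def chunk_documents(documents, max_chars=1200):
    """Return chunk dicts that keep the source title alongside the text."""
    return [
        {"title": doc["title"], "text": piece}
        for doc in documents
        for piece in pack_paragraphs(split_paragraphs(doc["text"]), max_chars)
    ]
```

Keeping the title with each chunk makes it possible to show the user where a retrieved passage came from.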


**Step 2: Generate Embeddings with OpenAI**

Use an embedding model such as `text-embedding-ada-002` to convert each chunk to a vector.

In [None]:
from openai import OpenAI

# Initialize the client (make sure OPENAI_API_KEY is set in your environment)
client = OpenAI()

# Example: assume you already have a list of dicts called 'chunks'
# Each chunk looks like: {"text": "some text here"}
texts = [c["text"] for c in chunks]

# Create embeddings with the latest API
resp = client.embeddings.create(
    model="text-embedding-ada-002",  # or "text-embedding-3-small"/"text-embedding-3-large"
    input=texts
)

# Add embeddings back into chunks
for i, emb in enumerate(resp.data):
    chunks[i]["embedding"] = emb.embedding

# Optional: check the first embedding length
print("First embedding dimension:", len(chunks[0]["embedding"]))


First embedding dimension: 1536


**Step 3: Embed the Question and Retrieve Similar Chunks**

Embed the user's question, then use cosine similarity to find top-matching chunks.

In [None]:
import numpy as np
from numpy.linalg import norm
from openai import OpenAI

# Initialize client (requires OPENAI_API_KEY in your environment)
client = OpenAI()

# Example query
query = "What does a strong IR peak at 1700 mean?"

# Get embedding for the query
qresp = client.embeddings.create(
    input=[query],
    model="text-embedding-ada-002"  # or "text-embedding-3-small"/"text-embedding-3-large"
)
qvec = qresp.data[0].embedding

# Define cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

# Rank chunks by similarity
scores = [(cosine_similarity(qvec, c["embedding"]), c["text"]) for c in chunks]

# Take top 2 most relevant chunks
top_chunks = sorted(scores, key=lambda x: x[0], reverse=True)[:2]
context = "\n\n".join([text for _, text in top_chunks])

print("Most relevant context:\n", context)


Most relevant context:
 A broad peak around 3200-3600 cm^-1 suggests O–H...

To dehydrate alcohols, use H2SO4 or POCl3...
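
In the output above, neither retrieved chunk mentions a peak at 1700 cm^-1: with only two documents, the top-2 search simply returns everything. One way to avoid passing irrelevant context is to drop chunks below a similarity threshold. The sketch below uses plain Python; the `min_score` value is an assumption that has to be tuned for each embedding model, since typical similarity ranges differ between models.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, chunks, k=2, min_score=0.0):
    """Return up to k (score, text) pairs with score >= min_score, best first."""
    scored = [(cosine(query_vec, c["embedding"]), c["text"]) for c in chunks]
    scored = [s for s in scored if s[0] >= min_score]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]
```

If `retrieve` returns an empty list, the assistant can say that its sources do not cover the question instead of answering from unrelated passages.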


**Step 4: Create Prompt and Call ChatGPT**

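Construct a prompt that includes the retrieved context, then pass it to a chat model. Keep in mind that with only two toy documents the retrieved chunks do not actually mention a peak at 1700 cm^-1, so the answer shown below comes from the model's own knowledge rather than from the context; a real corpus should cover the questions you expect.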

In [None]:
from openai import OpenAI

# Initialize client
client = OpenAI()

# Build the prompt with retrieved context
prompt = f"""
You are a chemistry assistant. Use the context below to answer the question.

Context:
{context}

Question:
{query}

Answer:
"""

# Get completion
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0
)

# Print the assistant's answer
print(response.choices[0].message.content)


A strong IR peak at 1700 cm^-1 typically indicates the presence of a carbonyl group (C=O) in the molecule.
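
To make it visible when an answer is not supported by the retrieved text, the prompt can number the passages and tell the model to cite them or admit that the context is insufficient. The sketch below builds such a message list; the exact wording of the instruction and the name `build_grounded_messages` are illustrative assumptions, and models do not always follow such instructions.

```python
def build_grounded_messages(question, passages):
    """Build chat messages that ask for answers grounded in numbered passages."""
    numbered = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    system = (
        "You are a chemistry assistant. Answer using only the numbered context "
        "passages and cite them like [1]. If the context does not contain the "
        "answer, say that it does not."
    )
    user = f"Context:\n{numbered}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# Usage: pass the result as `messages` to client.chat.completions.create
```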


### 4.5 Tool Use and Function-Calling  

LLMs become far more powerful when connected to external tools—functions, libraries, or databases—that provide precise results. OpenAI’s **function calling** feature lets models output structured calls instead of guessing, enabling verifiable answers (e.g., chemical properties from RDKit or PubChem).  

For example, when asked *“What is the molecular weight of caffeine?”*, the model can return a structured call that names the function `get_molecular_weight` and supplies its arguments as JSON, e.g. `{"compound_name": "caffeine"}`. Your system executes this function, retrieves the result, and returns it to the model, which then produces the final answer.  

This approach transforms a chatbot from speculation to computation, making it a reliable assistant for chemistry.  



In [None]:
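def get_molecular_weight(compound_name):
    """Return the molecular weight in g/mol, or None if the compound is unknown.

    Toy lookup table standing in for a real backend such as RDKit or PubChem.
    """
    weights = {
        "water": 18.015,
        "caffeine": 194.19
    }
    return weights.get(compound_name.strip().lower())


# Example function call (the step your code runs when the model requests the tool)
compound = "caffeine"
mw = get_molecular_weight(compound)

if mw is not None:
    print(f"The molecular weight of {compound} is approximately {mw} g/mol.")
else:
    print(f"Molecular weight for {compound} is not in the database.")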


The molecular weight of caffeine is approximately 194.19 g/mol.
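
The cell above only runs the local function. To connect it to a model, you describe the function in a JSON schema and dispatch the calls the model requests. The sketch below shows both pieces without contacting the API: `TOOLS` follows the shape of the Chat Completions `tools` parameter, while `REGISTRY` and `run_tool_call` are hypothetical helper names, and the tool call at the end is simulated by hand.

```python
import json

def get_molecular_weight(compound_name):
    """Toy lookup; returns g/mol or None if the compound is unknown."""
    weights = {"water": 18.015, "caffeine": 194.19}
    return weights.get(compound_name.strip().lower())

# Schema the model sees; pass it as `tools=TOOLS` in a chat completion request
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_molecular_weight",
        "description": "Look up the molecular weight of a compound in g/mol.",
        "parameters": {
            "type": "object",
            "properties": {"compound_name": {"type": "string"}},
            "required": ["compound_name"],
        },
    },
}]

# Maps tool names to the local Python functions that implement them
REGISTRY = {"get_molecular_weight": get_molecular_weight}

def run_tool_call(name, arguments_json):
    """Execute a requested tool and return a JSON string to send back to the model."""
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    return json.dumps({"result": REGISTRY[name](**args)})

# Simulated request, shaped like the name/arguments pair a model would produce
print(run_tool_call("get_molecular_weight", '{"compound_name": "caffeine"}'))
```

In a full loop, the returned string would be appended to the conversation as a `tool` message so the model can phrase the final answer.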
