# LLM and data extraction

In this notebook, we will explore how to use the OpenAI API to extract metadata from scientific papers. We will use a PDF file as input and convert it to markdown text. Then, we will use the OpenAI API to extract the title, authors, and abstract from the markdown text.

We will compare different methods to extract metadata from scientific papers using the OpenAI API, including:
- Asking the API to extract the metadata directly from the markdown text.
- Asking the API to extract the metadata and return the result in JSON format.
- Using a JSON schema to define the expected output format.
- Using Pydantic models to define the expected output format.
- Using function calls to extract metadata from the markdown text.

In all the following examples, we extract the same information from each article:
- Title
- Authors
- Abstract

## Why JSON?

- **Interoperability**: JSON is language-agnostic and easily parsed in Python, R, and other languages.
- **API Integration**: Many data sources and web services provide data in JSON format, making it essential for fetching and processing external data.
- **Hierarchical Structure**: Supports nested data, making it ideal for representing complex datasets like configurations or structured logs.
- **Integration with Pandas**: Python's `pandas` library provides seamless methods (`pd.read_json`, `to_json`) for handling JSON data.
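
As a small illustration of the hierarchical-structure point, the sketch below round-trips a nested record through the standard `json` module. The record contents are made up for the example and are not taken from the articles used later.

```python
import json

# Hypothetical nested record, shaped like the metadata extracted later in this notebook.
record = {
    "title": "An example article",
    "authors": ["A. Author", "B. Author"],
    "doi": {"id": "10.0000/example", "link": "https://doi.org/10.0000/example"},
}

# Serialize to a JSON string, then parse it back into Python objects.
serialized = json.dumps(record, indent=2)
restored = json.loads(serialized)

# Nested values survive the round trip and are reachable by key.
print(restored["doi"]["link"])
print(restored == record)
```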

## Initialize the OpenAI client and load the libraries

In [None]:
!pip install -r requirements.txt

In [1]:
import json
import re
import pymupdf4llm
import os
import getpass

from pydantic import BaseModel, Field
from typing import List

from openai import OpenAI

from src.pdf_extraction_api import PDFExtractorAPI

In [2]:
os.environ["OPENAI_API_KEY"] = getpass.getpass()

In [3]:
client = OpenAI()
MODEL="gpt-4o-mini"

## Load pdf and convert to markdown

The commented-out line below shows how the `pymupdf4llm` library converts a PDF file to markdown; the cell itself uses the `PDFExtractorAPI` helper imported from `src.pdf_extraction_api`. There are other alternatives, such as `textract` and `docling`, that can extract text from PDF files. After the workshop, feel free to try different libraries and compare the results.

We downloaded two articles from PubMed to showcase the data extraction process. Feel free to try both articles and compare the results.

In [4]:
# Load the PDF file
pdf_path = "../data/Explainable_machine_learning_prediction_of_edema_a.pdf"
# pdf_path = "../data/Modeling tumor size dynamics based on real‐world electronic health records.pdf"

# Convert the PDF file to markdown
# markdown_text = pymupdf4llm.to_markdown(pdf_path)

data_extractor = PDFExtractorAPI()
_, markdown_text, _ = data_extractor.extract_text_and_images(pdf_path)

In [5]:
print(markdown_text[:10000])

DOI: [10.1111/cts.70010](https://doi.org/10.1111/cts.70010)

### **ARTICLE**

![](_page_0_Picture_4.jpeg)

# **Explainable machine learning prediction of edema adverse events in patients treated with tepotinib**

**Federico Amato[1](#page-0-0)** | **Rainer Strotmann[2](#page-0-1)** | **Roberto Castell[o1](#page-0-0)** | **Rolf Bruns[2](#page-0-1)** | **Vishal Ghori[3](#page-0-2)** | **Andreas John[e2](#page-0-1)** | **Karin Berghoff[2](#page-0-1)** | **Karthik Venkatakrishna[n4](#page-0-3)** | **Nadia Terranova[5](#page-0-4)**

<span id="page-0-0"></span>1 Swiss Data Science Center (EPFL and ETH Zurich), Lausanne, Switzerland

<span id="page-0-1"></span>2 The healthcare business of Merck KGaA, Darmstadt, Germany

<span id="page-0-2"></span>3 Ares Trading S.A., Eysins, Switzerland, an affiliate of Merck KGaA, Darmstadt, Germany

<span id="page-0-3"></span>4 EMD Serono, Billerica, Massachusetts, USA

<span id="page-0-4"></span>5 Quantitative Pharmacology, Ares Trading S.A., Lausanne, Swi

## Default extraction

In this case, we will provide a prompt asking the API to extract the title, authors, and abstract from the markdown text. No extra indications are given to the model.



In [6]:
def generate_completion(message: str):
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": message}],
    )

In [7]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Title
- Authors
- Abstract
- DOI (with the link)

Markdown text:
{markdown_text}
"""

completion = generate_completion(prompt)

In [8]:
print(completion.choices[0].message.content)

Here are the extracted details from the provided markdown text:

- **Title:** Explainable machine learning prediction of edema adverse events in patients treated with tepotinib

- **Authors:** Federico Amato, Rainer Strotmann, Roberto Castello, Rolf Bruns, Vishal Ghori, Andreas John, Karin Berghoff, Karthik Venkatakrishnan, Nadia Terranova

- **Abstract:** Tepotinib is approved for the treatment of patients with non-small-cell lung cancer harboring *MET* exon 14 skipping alterations. While edema is the most prevalent adverse event (AE) and a known class effect of MET inhibitors including tepotinib, there is still limited understanding about the factors contributing to its occurrence. Herein, we apply machine learning (ML)-based approaches to predict the likelihood of occurrence of edema in patients undergoing tepotinib treatment, and to identify factors influencing its development over time. Data from 612 patients receiving tepotinib in five Phase I/II studies were modeled with two ML 

#### Result

We can see that the model was able to extract the title, authors, and abstract from the markdown text. The result comes back as free-form markdown text. This format is loosely structured and needs additional processing before the individual fields can be used in code.
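
To give a sense of what that additional processing involves, here is a rough sketch that pulls the fields out of a bulleted markdown answer with a regular expression. The `sample_answer` string and the `parse_markdown_fields` helper are hypothetical; the pattern only works if the model keeps the `- **Field:** value` layout, which nothing guarantees.

```python
import re

# Hypothetical answer, imitating the "- **Field:** value" layout of the output above.
sample_answer = """Here are the extracted details:

- **Title:** An example article

- **Authors:** A. Author, B. Author

- **Abstract:** A short abstract.
"""

def parse_markdown_fields(text: str) -> dict:
    """Collect '- **Key:** value' bullets into a dict with lowercase keys."""
    pattern = re.compile(r"^- \*\*(?P<key>[^:*]+):\*\* *(?P<value>.+)$", re.MULTILINE)
    return {m.group("key").strip().lower(): m.group("value").strip()
            for m in pattern.finditer(text)}

fields = parse_markdown_fields(sample_answer)
# Authors arrive as one comma-separated string and must be split by hand.
fields["authors"] = [name.strip() for name in fields["authors"].split(",")]
print(fields)
```

Any change in the model's wording or bullet style breaks this kind of parser, which is the motivation for the structured approaches that follow.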

## Asking for JSON format

Here we go one step further and ask the LLM to return the result in JSON format. The output should then be more structured, and the information easier to extract.

In [9]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Title
- Authors
- Abstract
- DOI (with the link)

Markdown text:
{markdown_text}

Give me the result in JSON format.
"""

completion = generate_completion(prompt)

In [10]:
print(completion.choices[0].message.content)

```json
{
  "Title": "Explainable machine learning prediction of edema adverse events in patients treated with tepotinib",
  "Authors": [
    "Federico Amato",
    "Rainer Strotmann",
    "Roberto Castello",
    "Rolf Bruns",
    "Vishal Ghori",
    "Andreas Johne",
    "Karin Berghoff",
    "Karthik Venkatakrishnan",
    "Nadia Terranova"
  ],
  "Abstract": "Tepotinib is approved for the treatment of patients with non-small-cell lung cancer harboring MET exon 14 skipping alterations. While edema is the most prevalent adverse event (AE) and a known class effect of MET inhibitors including tepotinib, there is still limited understanding about the factors contributing to its occurrence. Herein, we apply machine learning (ML)-based approaches to predict the likelihood of occurrence of edema in patients undergoing tepotinib treatment, and to identify factors influencing its development over time. Data from 612 patients receiving tepotinib in five Phase I/II studies were modeled with two ML

In [11]:
json.loads(completion.choices[0].message.content)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

#### Result

The result cannot be parsed directly as JSON. The model returns plain text in which the JSON is wrapped in a markdown code block, so `json.loads()` fails on the very first character. The data itself is structured as JSON, but we need to strip the code block markers before parsing it.

In [12]:
# Remove the markdown code block markers
json_str = re.sub(r"^```(?:json)?\s*", "", completion.choices[0].message.content)
json_str = re.sub(r"\s*```$", "", json_str)

# Parse the JSON string
json.loads(json_str)

{'Title': 'Explainable machine learning prediction of edema adverse events in patients treated with tepotinib',
 'Authors': ['Federico Amato',
  'Rainer Strotmann',
  'Roberto Castello',
  'Rolf Bruns',
  'Vishal Ghori',
  'Andreas Johne',
  'Karin Berghoff',
  'Karthik Venkatakrishnan',
  'Nadia Terranova'],
 'Abstract': 'Tepotinib is approved for the treatment of patients with non-small-cell lung cancer harboring MET exon 14 skipping alterations. While edema is the most prevalent adverse event (AE) and a known class effect of MET inhibitors including tepotinib, there is still limited understanding about the factors contributing to its occurrence. Herein, we apply machine learning (ML)-based approaches to predict the likelihood of occurrence of edema in patients undergoing tepotinib treatment, and to identify factors influencing its development over time. Data from 612 patients receiving tepotinib in five Phase I/II studies were modeled with two ML algorithms, Random Forest, and Gradi
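
The two substitutions above assume the fence opens at the very first character and closes at the very last one. A slightly more tolerant sketch, assuming the answer contains at most one fenced block that may be surrounded by extra sentences, could look like this. `extract_json` is a hypothetical helper name and `raw` is a made-up response.

```python
import json
import re

FENCE = "`" * 3  # the code fence marker, built this way to keep the example readable

def extract_json(text: str):
    """Parse JSON from a model answer, with or without a markdown code fence."""
    pattern = FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    # Fall back to the whole text when no fenced block is found.
    payload = match.group(1) if match else text.strip()
    return json.loads(payload)

raw = ("Sure, here is the result:\n" + FENCE + "json\n"
       + '{"Title": "An example"}\n' + FENCE + "\nLet me know!")
print(extract_json(raw))
print(extract_json('{"Title": "No fence"}'))
```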

## Enter `response_format`

OpenAI lets us set the response format to `json_object` (JSON mode). The model is then constrained to return syntactically valid JSON, which makes the result easier to parse. JSON mode requires the word "JSON" to appear somewhere in the messages, which is the case in the prompt below.

In [13]:
def generate_completion_json(message: str):
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": message}],
        response_format={"type": "json_object"},
    )

In [14]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Title
- Authors
- Abstract
- DOI (with the link)

Markdown text:
{markdown_text}

Give me the result in JSON format.
"""

completion = generate_completion_json(prompt)

In [15]:
print(completion.choices[0].message.content)

{
  "title": "Explainable machine learning prediction of edema adverse events in patients treated with tepotinib",
  "authors": [
    "Federico Amato",
    "Rainer Strotmann",
    "Roberto Castell",
    "Rolf Bruns",
    "Vishal Ghori",
    "Andreas Johne",
    "Karin Berghoff",
    "Karthik Venkatakrishnan",
    "Nadia Terranova"
  ],
  "abstract": "Tepotinib is approved for the treatment of patients with non-small-cell lung cancer harboring MET exon 14 skipping alterations. While edema is the most prevalent adverse event (AE) and a known class effect of MET inhibitors including tepotinib, there is still limited understanding about the factors contributing to its occurrence. Herein, we apply machine learning (ML)-based approaches to predict the likelihood of occurrence of edema in patients undergoing tepotinib treatment, and to identify factors influencing its development over time. Data from 612 patients receiving tepotinib in five Phase I/II studies were modeled with two ML algorith

In [16]:
json.loads(completion.choices[0].message.content)

{'title': 'Explainable machine learning prediction of edema adverse events in patients treated with tepotinib',
 'authors': ['Federico Amato',
  'Rainer Strotmann',
  'Roberto Castell',
  'Rolf Bruns',
  'Vishal Ghori',
  'Andreas Johne',
  'Karin Berghoff',
  'Karthik Venkatakrishnan',
  'Nadia Terranova'],
 'abstract': 'Tepotinib is approved for the treatment of patients with non-small-cell lung cancer harboring MET exon 14 skipping alterations. While edema is the most prevalent adverse event (AE) and a known class effect of MET inhibitors including tepotinib, there is still limited understanding about the factors contributing to its occurrence. Herein, we apply machine learning (ML)-based approaches to predict the likelihood of occurrence of edema in patients undergoing tepotinib treatment, and to identify factors influencing its development over time. Data from 612 patients receiving tepotinib in five Phase I/II studies were modeled with two ML algorithms, Random Forest, and Gradie

#### Result

This time the result is returned in JSON format as requested. We can parse it directly with `json.loads()`. However, there is no guarantee that the JSON object will have the expected structure. The model may return the data in a different shape than the one we expect, for example with different key casing (compare `"title"` here with `"Title"` in the previous section).
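
One way to cope with that variability is to normalize the keys and check that the expected fields are present before using the result. The sketch below assumes the four fields requested in the prompt; `normalize_keys` and `missing_fields` are hypothetical helpers, not part of the OpenAI client.

```python
EXPECTED_FIELDS = {"title", "authors", "abstract", "doi"}

def normalize_keys(data: dict) -> dict:
    """Lowercase top-level keys so 'Title' and 'title' are treated alike."""
    return {key.strip().lower(): value for key, value in data.items()}

def missing_fields(data: dict) -> set:
    """Return the expected fields that are absent from the normalized result."""
    return EXPECTED_FIELDS - set(normalize_keys(data))

# Made-up result with capitalized keys and no DOI, as JSON mode might return.
result = {"Title": "An example", "Authors": ["A. Author"], "Abstract": "..."}
print(normalize_keys(result))
print(missing_fields(result))
```

This only patches over differences in key names; it does nothing about wrong types or unexpected nesting.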

## Custom JSON schema

This time we'll pass a JSON schema to the model, as defined at https://json-schema.org/. This lets us request a specific output structure and provide descriptions, types, and default values for each field.


In [17]:
def generate_completion_json_schema(message: str, schema: dict):
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": message}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "ExtractedData",
                "schema": schema
            },
        },
    )

In [18]:
schema = {
    "type": "object",
    "name": "ExtractedData",
    "description": "Metadata for a research article including its title, list of authors, and abstract summary.",
    "properties": {
        "title": {
            "type": "string",
            "description": "The title of the research article.",
            "default": "Unknown"
        },
        "authors": {
            "type": "array",
            "items": {"type": "string"},
            "description": "A list of authors who contributed to the article."
        },
        "abstract": {
            "type": "string",
            "description": "A brief summary of the article's content and findings."
        },
        "doi": {
            "type": "object",
            "description": "The DOI of the document.",
            "properties": {
                "id": {
                    "type": "string",
                    "description": "The DOI identifier.",
                },
                "link": {
                    "type": "string",
                    "description": "The url corresponding to the DOI.",
                },
            }
        },
    },
    "additionalProperties": False,
}

prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Title
- Authors
- Abstract
- DOI (with the link)

Markdown text:
{markdown_text}

Give me the result in JSON format.
"""

completion = generate_completion_json_schema(prompt, schema)

In [19]:
json.loads(completion.choices[0].message.content)

{'title': 'Explainable machine learning prediction of edema adverse events in patients treated with tepotinib',
 'authors': ['Federico Amato',
  'Rainer Strotmann',
  'Roberto Castello',
  'Rolf Bruns',
  'Vishal Ghori',
  'Andreas Johne',
  'Karin Berghoff',
  'Karthik Venkatakrishnan',
  'Nadia Terranova'],
 'abstract': 'Tepotinib is approved for the treatment of patients with non-small-cell lung cancer harboring MET exon 14 skipping alterations. While edema is the most prevalent adverse event (AE) and a known class effect of MET inhibitors including tepotinib, there is still limited understanding about the factors contributing to its occurrence. Herein, we apply machine learning (ML)-based approaches to predict the likelihood of occurrence of edema in patients undergoing tepotinib treatment, and to identify factors influencing its development over time. Data from 612 patients receiving tepotinib in five Phase I/II studies were modeled with two ML algorithms, Random Forest, and Gradi

#### Result

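With the schema supplied, the model no longer has to guess the shape of the output, and the result should match the expected structure. Note that the API only guarantees adherence to the schema when `"strict": True` is set inside `json_schema`; the call above does not set it, so the match is best-effort.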
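
In the OpenAI API, strict schema adherence is requested by adding `"strict": True` next to the schema. As far as we understand the documentation, strict mode expects every object to list all of its properties in `required` and to set `additionalProperties` to `False`. The sketch below checks a schema against those two rules before sending it. `strict_violations` is a hypothetical helper and covers only these two rules, not the full set of strict-mode restrictions.

```python
def strict_violations(schema: dict, path: str = "$") -> list:
    """List places where an object schema breaks the two rules checked here."""
    problems = []
    if schema.get("type") == "object":
        properties = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be False")
        unlisted = sorted(set(properties) - set(schema.get("required", [])))
        if unlisted:
            problems.append(f"{path}: not in required: {unlisted}")
        # Nested objects must follow the same rules.
        for name, sub_schema in properties.items():
            problems.extend(strict_violations(sub_schema, f"{path}.{name}"))
    elif schema.get("type") == "array":
        problems.extend(strict_violations(schema.get("items", {}), f"{path}[]"))
    return problems

# Reduced version of the article schema, written to satisfy both rules.
strict_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "authors": {"type": "array", "items": {"type": "string"}},
        "doi": {
            "type": "object",
            "properties": {"id": {"type": "string"}, "link": {"type": "string"}},
            "required": ["id", "link"],
            "additionalProperties": False,
        },
    },
    "required": ["title", "authors", "doi"],
    "additionalProperties": False,
}

# Value that could be passed as `response_format` to request strict adherence.
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "ExtractedData", "schema": strict_schema, "strict": True},
}
print(strict_violations(strict_schema))
```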

## Using the Types

Alternatively, we can use Pydantic models to define the expected output format (cf. https://docs.pydantic.dev/latest/). This way we can enforce the structure of the output and provide additional type checking. This can be particularly useful when working with APIs that expect a specific format.


In [20]:
class DOIData(BaseModel):
    id: str = Field(..., description="The DOI identifier.")
    link: str = Field(..., description="The url corresponding to the DOI.")

class ExtractedData(BaseModel):
    title: str = Field(..., description="The title of the research article.")
    authors: List[str] = Field(..., description="A list of authors who contributed to the article.")
    abstract: str = Field(..., description="A brief summary of the article's content and findings.")
    doi: DOIData = Field(..., description="The DOI of the document.")

def generate_completion_pydantic(message: str):
    return client.beta.chat.completions.parse(
        model=MODEL,
        messages=[{"role": "user", "content": message}],
        response_format=ExtractedData,
    )



In [21]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Title
- Authors
- Abstract
- DOI (with the link)

Markdown text:
{markdown_text}
"""

completion = generate_completion_pydantic(prompt)

In [22]:
json.loads(completion.choices[0].message.content)

{'title': 'Explainable machine learning prediction of edema adverse events in patients treated with tepotinib',
 'authors': ['Federico Amato',
  'Rainer Strotmann',
  'Roberto Castell',
  'Rolf Bruns',
  'Vishal Ghori',
  'Andreas Johne',
  'Karin Berghoff',
  'Karthik Venkatakrishnan',
  'Nadia Terranova'],
 'abstract': 'Tepotinib is approved for the treatment of patients with non-small-cell lung cancer harboring MET exon 14 skipping alterations. While edema is the most prevalent adverse event (AE) and a known class effect of MET inhibitors including tepotinib, there is still limited understanding about the factors contributing to its occurrence. Herein, we apply machine learning (ML)-based approaches to predict the likelihood of occurrence of edema in patients undergoing tepotinib treatment, and to identify factors influencing its development over time. Data from 612 patients receiving tepotinib in five Phase I/II studies were modeled with two ML algorithms, Random Forest, and Gradie

## Function calling

Not every LLM supports structured output. In that case, you can ask the LLM to call a function (a tool) instead: you define the function signature, and the LLM responds with a call whose arguments contain the extracted information.

In [23]:
article_extraction_function_description = {
    "type": "function",
    "function": {
        "name": "extract_article_data",
        "description": "Extract article metadata from markdown text.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {
                    "type": "string",
                    "description": "The title of the research article."
                },
                "authors": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "A list of authors who contributed to the article."
                },
                "abstract": {
                    "type": "string",
                    "description": "A brief summary of the article's content and findings."
                },
                "doi": {
                    "type": "object",
                    "properties": {
                        "id": {
                            "type": "string",
                            "description": "The DOI identifier."
                        },
                        "link": {
                            "type": "string",
                            "description": "The url corresponding to the DOI."
                        }
                    }
                }
            },
            "required": ["title", "authors", "abstract", "doi"]
        }
    }
}

def generate_completion_tool_calls(message: str):
    return client.chat.completions.create(
        model=MODEL, 
        messages=[{"role": "user", "content": message}],
        tools=[article_extraction_function_description],
        tool_choice="auto",
    )

In [24]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please use the given tools to extract the following details:
- Title
- Authors
- Abstract
- DOI (with the link)

Markdown text:
{markdown_text}
"""

completion = generate_completion_tool_calls(prompt)

In [25]:
json.loads(completion.choices[0].message.tool_calls[0].function.arguments)

{'title': 'Explainable machine learning prediction of edema adverse events in patients treated with tepotinib',
 'authors': ['Federico Amato',
  'Rainer Strotmann',
  'Roberto Castello',
  'Rolf Bruns',
  'Vishal Ghori',
  'Andreas Joha',
  'Karin Berghoff',
  'Karthik Venkatakrishnan',
  'Nadia Terranova'],
 'abstract': 'Tepotinib is approved for the treatment of patients with non-small-cell lung cancer harboring MET exon 14 skipping alterations. While edema is the most prevalent adverse event (AE) and a known class effect of MET inhibitors including tepotinib, there is still limited understanding about the factors contributing to its occurrence. Herein, we apply machine learning (ML)-based approaches to predict the likelihood of occurrence of edema in patients undergoing tepotinib treatment, and to identify factors influencing its development over time. Data from 612 patients receiving tepotinib in five Phase I/II studies were modeled with two ML algorithms, Random Forest, and Gradie

#### Notes

Note that we never actually execute the function. The tool definition is not used as a real tool here, only as a way to make the model return the data extracted from the document in a structured form.
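
With `tool_choice="auto"` the model may also answer in plain text, in which case `tool_calls` is `None` and indexing it fails. The Chat Completions API accepts a `tool_choice` value that names one function, which forces that call. The sketch below shows such a value together with a defensive parser. `first_tool_arguments` is a hypothetical helper, and the `SimpleNamespace` objects merely stand in for a real response so that the example runs without an API call.

```python
import json
from types import SimpleNamespace

# Forces the model to call `extract_article_data` instead of leaving the choice to it.
forced_tool_choice = {"type": "function", "function": {"name": "extract_article_data"}}

def first_tool_arguments(completion):
    """Return the first tool call's arguments as a dict, or None when no tool was called."""
    tool_calls = completion.choices[0].message.tool_calls or []
    if not tool_calls:
        return None
    return json.loads(tool_calls[0].function.arguments)

def fake_completion(tool_calls):
    """Build a stand-in object with the same attribute layout as a chat completion."""
    message = SimpleNamespace(tool_calls=tool_calls, content=None)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])

call = SimpleNamespace(function=SimpleNamespace(
    name="extract_article_data", arguments='{"title": "An example"}'))
print(first_tool_arguments(fake_completion([call])))
print(first_tool_arguments(fake_completion(None)))
```

Passing `tool_choice=forced_tool_choice` in place of `"auto"` is a reasonable default when the function is used purely as an extraction schema.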

## Multiple function calls

You can also ask the LLM to call several functions in the same request. This gives a more modular approach to data extraction.

![Tool Calling](../data/tool_calling.png)

In [26]:
import json

# Define separate function descriptions for each property.
title_extraction_function_description = {
    "type": "function",
    "function": {
        "name": "extract_title",
        "description": "Extract the title from markdown text.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {
                    "type": "string",
                    "description": "The title of the research article."
                }
            },
            "required": ["title"]
        }
    }
}

authors_extraction_function_description = {
    "type": "function",
    "function": {
        "name": "extract_authors",
        "description": "Extract the list of authors from markdown text.",
        "parameters": {
            "type": "object",
            "properties": {
                "authors": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "A list of authors who contributed to the article."
                }
            },
            "required": ["authors"]
        }
    }
}

abstract_extraction_function_description = {
    "type": "function",
    "function": {
        "name": "extract_abstract",
        "description": "Extract the abstract from markdown text.",
        "parameters": {
            "type": "object",
            "properties": {
                "abstract": {
                    "type": "string",
                    "description": "A brief summary of the article's content and findings."
                }
            },
            "required": ["abstract"]
        }
    }
}

doi_extraction_function_description = {
    "type": "function",
    "function": {
        "name": "extract_doi",
        "description": "Extract the DOI from markdown text.",
        "parameters": {
            "type": "object",
            "properties": {
                "doi": {
                    "type": "object",
                    "properties": {
                        "id": {
                            "type": "string",
                            "description": "The DOI identifier."
                        },
                        "link": {
                            "type": "string",
                            "description": "The url corresponding to the DOI."
                        }
                    }
                }
            },
            "required": ["doi"]
        }
    }
}


In [27]:
def generate_completion_multiple_tool_calls(message: str):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": message}],
        tools=[
            title_extraction_function_description,
            authors_extraction_function_description,
            abstract_extraction_function_description,
            doi_extraction_function_description,
        ],
        tool_choice="auto",
    )

    # Combine the outputs from each function call.
    extracted_data = {}
    # `tool_calls` is None when the model answers in plain text instead of calling a tool.
    tool_calls = response.choices[0].message.tool_calls or []
    for tool_call in tool_calls:
        function_name = tool_call.function.name
        # Parse the JSON string of arguments.
        arguments = json.loads(tool_call.function.arguments)
        if function_name == "extract_title":
            extracted_data["title"] = arguments["title"]
        elif function_name == "extract_authors":
            extracted_data["authors"] = arguments["authors"]
        elif function_name == "extract_abstract":
            extracted_data["abstract"] = arguments["abstract"]
        elif function_name == "extract_doi":
            extracted_data["doi"] = arguments["doi"]

    return extracted_data

In [28]:
# Example prompt that provides markdown text.
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please use the given tools to extract the following details:
- Title
- Authors
- Abstract
- DOI (with the link)

Markdown text:
{markdown_text}
"""

extracted_data = generate_completion_multiple_tool_calls(prompt)

In [29]:
extracted_data

{'title': 'Explainable machine learning prediction of edema adverse events in patients treated with tepotinib',
 'authors': ['Federico Amato',
  'Rainer Strotmann',
  'Roberto Castello',
  'Rolf Bruns',
  'Vishal Ghori',
  'Andreas Johannes',
  'Karin Berghoff',
  'Karthik Venkatakrishnan',
  'Nadia Terranova'],
 'abstract': 'Tepotinib is approved for the treatment of patients with non-small-cell lung cancer harboring MET exon 14 skipping alterations. While edema is the most prevalent adverse event (AE) and a known class effect of MET inhibitors including tepotinib, there is still limited understanding about the factors contributing to its occurrence. Herein, we apply machine learning (ML)-based approaches to predict the likelihood of occurrence of edema in patients undergoing tepotinib treatment, and to identify factors influencing its development over time. Data from 612 patients receiving tepotinib in five Phase I/II studies were modeled with two ML algorithms, Random Forest, and Gr
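
Because each tool here returns a single top-level key, the `if`/`elif` chain can be condensed into a lookup table, and the caller can then check which fields the model skipped. This is a sketch under that assumption; `merge_tool_calls` and the stand-in tool calls are hypothetical.

```python
import json
from types import SimpleNamespace

# Maps each tool name to the single field it is expected to return.
EXPECTED_TOOLS = {
    "extract_title": "title",
    "extract_authors": "authors",
    "extract_abstract": "abstract",
    "extract_doi": "doi",
}

def merge_tool_calls(tool_calls):
    """Merge the arguments of known tools and report the fields that were not returned."""
    extracted = {}
    for tool_call in tool_calls or []:
        field = EXPECTED_TOOLS.get(tool_call.function.name)
        if field is None:
            continue  # ignore calls to tools we did not define
        arguments = json.loads(tool_call.function.arguments)
        if field in arguments:
            extracted[field] = arguments[field]
    missing = sorted(set(EXPECTED_TOOLS.values()) - set(extracted))
    return extracted, missing

def fake_call(name, arguments):
    """Stand-in for a tool call object, so the example runs without the API."""
    return SimpleNamespace(function=SimpleNamespace(name=name, arguments=json.dumps(arguments)))

calls = [
    fake_call("extract_title", {"title": "An example"}),
    fake_call("extract_authors", {"authors": ["A. Author"]}),
]
extracted, missing = merge_tool_calls(calls)
print(extracted)
print(missing)
```

Reporting the missing fields makes it possible to retry only the tools the model did not call.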

## Cost

Let's compute the cost of the completion. We'll use the following pricing: https://openai.com/api/pricing/

In [30]:
def compute_chatgpt_4o_cost(completion, verbose: bool = False) -> float:
    input_tokens = completion.usage.prompt_tokens
    output_tokens = completion.usage.completion_tokens

    # gpt-4o-mini prices in USD per 1M tokens at the time of writing; check the pricing page.
    cost_per_1M_input_tokens = 0.15
    cost_per_1M_output_tokens = 0.60

    total_cost = (input_tokens / 1e6) * cost_per_1M_input_tokens
    total_cost += (output_tokens / 1e6) * cost_per_1M_output_tokens

    if verbose:
        print(f"Total input tokens: {input_tokens}")
        print(f"Total output tokens: {output_tokens}")
        print(f"Total tokens: {input_tokens+output_tokens}")
        print(f"Estimated cost: ${total_cost:.4f}")

    return total_cost


In [31]:
compute_chatgpt_4o_cost(completion, verbose=True)

Total input tokens: 13666
Total output tokens: 443
Total tokens: 14109
Estimated cost: $0.0023


0.0023157

As you can see, most of the cost comes from the input tokens. Here we pass the whole document to the LLM, so the input accounts for about 97% of the tokens and close to 90% of the cost. In the next part we'll see how to reduce the cost by passing only the relevant information to the LLM (RAG).
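
To make the split explicit, the sketch below recomputes the cost from the token counts printed above and reports the share of the cost due to input tokens. The default prices are the ones hard-coded in the function above; `cost_breakdown` is a hypothetical helper.

```python
def cost_breakdown(input_tokens: int, output_tokens: int,
                   input_price_per_1m: float = 0.15,
                   output_price_per_1m: float = 0.60) -> dict:
    """Split the estimated cost (USD) into its input and output parts."""
    input_cost = input_tokens / 1e6 * input_price_per_1m
    output_cost = output_tokens / 1e6 * output_price_per_1m
    total = input_cost + output_cost
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": total,
        "input_share": input_cost / total,
    }

breakdown = cost_breakdown(13666, 443)
print(f"Total: ${breakdown['total_cost']:.7f}")
print(f"Input share of cost: {breakdown['input_share']:.1%}")
```

Output tokens are priced four times higher than input tokens here, which is why the input share of the cost is lower than the input share of the token count.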

## Conclusion

Structured output helps the LLM produce better and more interpretable results. The chart below shows the relative performance of each method, measured by how reliably the output matches the expected JSON format.

![output_reliability](../data/output_reliability.png)

# Exercises

#### Exercise 1: Update the different methods to also extract the DOI

Guideline:
* Ask for the DOI id with the link.


#### Exercise 2: Extract the Bibliography.

Guideline:
* First define a JSON schema that will guide the data extraction.
* Then define a tool call that will extract the data from one cited paper.
* Finally call the tool multiple times and aggregate the results.

## Solutions

#### Exercise 1

Solution already added to the code above.

#### Exercise 2



In [32]:
bibliography_extraction_function_description = {
    "type": "function",
    "function": {
        "name": "extract_bibliography_item",
        "description": "Extract a single bibliography item from markdown text.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {
                    "type": "string",
                    "description": "The title of the cited work."
                },
                "authors": {
                    "type": "array",
                    "items": {
                        "type": "string"
                    },
                    "description": "A list of authors of the cited work."
                },
                "year": {
                    "type": "integer",
                    "description": "The publication year of the cited work."
                },
                "doi": {
                    "type": "string",
                    "description": "The DOI (Digital Object Identifier) of the cited work, if available."
                }
            },
            "required": ["title", "authors", "year"]
        }
    }
}


In [33]:
def generate_completion_multiple_tool_calls(message: str):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": message}],
        tools=[bibliography_extraction_function_description],
        tool_choice="auto",
    )


    extracted_items = []
    # `tool_calls` is None when the model answers in plain text instead of calling a tool.
    tool_calls = response.choices[0].message.tool_calls or []
    for tool_call in tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)

        if function_name == "extract_bibliography_item":
            extracted_items.append(arguments)

    return extracted_items


In [34]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please use the given tools to extract ALL the items of the bibliography

Markdown text:
{markdown_text}
"""

extracted_data = generate_completion_multiple_tool_calls(prompt)

In [35]:
extracted_data

[{'title': 'Drug-disease modeling in the pharmaceutical industry – where mechanistic systems pharmacology and statistical pharmacometrics meet',
  'authors': ['Helmlinger G', 'Al-Huniti N', 'Aksenov S'],
  'year': 2017},
 {'title': 'Basic concepts of pharmacokinetic/ pharmacodynamic (PK/PD) modelling',
  'authors': ['Meibohm B', 'Dorendorf H'],
  'year': 1997},
 {'title': 'Machine learning in modeling disease trajectory and treatment outcomes: an emerging enabler for model-informed precision medicine',
  'authors': ['Terranova N', 'Venkatakrishnan K'],
  'year': 2023},
 {'title': 'Application of machine learning in translational medicine: current status and future opportunities',
  'authors': ['Terranova N', 'Venkatakrishnan K', 'Benincosa LJ'],
  'year': 2021},
 {'title': 'Diversity and inclusion in drug development: rethinking intrinsic and extrinsic factors with patient centricity',
  'authors': ['Venkatakrishnan K', 'Benincosa LJ'],
  'year': 2022},
 {'title': 'Current status and f