# Extracting information from article files with the OpenAI API

## Accessing APIs
API stands for Application Programming Interface. It is a way to communicate with websites, databases and software from a terminal or from a script written in a programming language. Many applications on the internet offer an API so that other programs can communicate with them in a faster and more efficient way. For a good start on what an API is and how it works, see [this wiki](https://nl.wikipedia.org/wiki/Application_programming_interface) (in Dutch).

To use an API, it is essential that you master some programming skills. Many APIs have implementations in R and Python, and for this reason alone it will pay off to learn about programming.

## OpenAI API use case
In the demo we saw that we can extract information from a PDF file by uploading it into the ChatGPT GUI. But what if we have hundreds, or even thousands, of PDF files? Extracting specific information from them with this approach would be very tedious and laborious. Luckily for us, there is a better and more efficient way.

In the demonstration below we will walk through the following steps:

 1. Defining an environment variable containing the API key for interacting with OpenAI's GPT models
 2. Setting up a custom GPT in Python
 3. Defining a prompt in Python
 4. Launching the prompt against several documents (.pdf)
 5. Collecting the results together in a structured dataframe

## Prompt
As you may have seen in your own interactions with ChatGPT, one way of interacting with generative AI models is through a so-called 'prompt'. This is basically a query or a request entered in a chat-like fashion. Engineering a good prompt takes practice and iteration.

See the full prompt and interactions with GPT-4 [here](https://chat.openai.com/share/206a1f9d-554c-4b11-be3d-631cde900370).
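When a prompt is sent from code rather than typed into a chat window, it is usually split into a fixed set of instructions and the variable input. The sketch below illustrates that split with a hypothetical helper, `build_messages()`. The instruction text is a shortened illustration, not the prompt from the linked chat.

```python
def build_messages(instructions, page_text):
    """Return a chat-style message list: fixed instructions plus variable input."""
    return [
        # The 'system' message carries the instructions that stay the same
        {"role": "system", "content": instructions.strip()},
        # The 'user' message carries the text that changes for every page
        {"role": "user", "content": "raw pdf text; extract and format data:\n" + page_text},
    ]


messages = build_messages(
    "You are a PDF data extractor. Output two comma-separated columns: topic, methods.",
    "Example page text about NLP methods.",
)
for message in messages:
    print(message["role"], "->", message["content"][:40])
```

Keeping the instructions separate from the input makes it easier to iterate on the prompt wording without touching the code that feeds in the documents.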

## API
An API, or Application Programming Interface, is a piece of software that allows controlled access to computational resources. It looks up each request to check whether the requested procedure is allowed. It works a bit like a waiter presenting you with a menu: you can order anything that is on the menu, but nothing that is not, and the cook will prepare only what you ordered. APIs usually exist to 'talk' efficiently to computational resources. Using an API from a programming language (in this case Python) lets you repeat and scale up tasks and interactions with the LLM that would be very tedious and error-prone to perform in the OpenAI GUI.

In this case, we also need a specific password (a key) that grants access to the LLM. The API we are using here is thus also a gateway: a waiter and a bouncer in one person. The key is stored in a secret file, so that I do not have to write the key in the code. Python (the `dotenv` library) takes care of reading the key for me.
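The waiter-and-bouncer idea can be sketched in a few lines of plain Python. The class `ToyKitchenAPI`, its key and its menu are all hypothetical and exist only to illustrate the two checks (is the key valid, is the request on the menu). A real web API performs these checks on a remote server.

```python
class ToyKitchenAPI:
    """Toy stand-in for an API: checks the key (bouncer) and the menu (waiter)."""

    def __init__(self, valid_keys, menu):
        self._valid_keys = set(valid_keys)
        self._menu = dict(menu)

    def order(self, api_key, dish):
        # Bouncer: refuse callers without a valid key
        if api_key not in self._valid_keys:
            raise PermissionError("invalid API key")
        # Waiter: only items on the menu can be ordered
        if dish not in self._menu:
            raise ValueError(f"'{dish}' is not on the menu")
        # Cook: prepare exactly what was ordered
        return self._menu[dish]()


api = ToyKitchenAPI(
    valid_keys={"secret-123"},
    menu={"soup": lambda: "a bowl of soup"},
)
print(api.order("secret-123", "soup"))
```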

## Load Libraries
When we want to use features from Python libraries, which are collections of functions and/or data, we first need to install and then load these libraries with the `import` command. Here the libraries were already installed in a virtual environment.

In [8]:
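# Import Python libraries
# Note: the code below uses openai.ChatCompletion, which requires a pre-1.0 version of the openai package
import openai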
import PyPDF2
from langchain.llms import OpenAI
from dotenv import load_dotenv
import os
import pandas as pd
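
Before running the imports, it can help to check which libraries are actually available in the active environment. The following is a minimal sketch using only the standard library. The helper name `missing_packages()` is hypothetical, and the list of module names is taken from the imports in this notebook.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be imported in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


required = ["openai", "PyPDF2", "dotenv", "pandas"]
missing = missing_packages(required)
if missing:
    print("Install these first (for example with pip):", ", ".join(missing))
else:
    print("All required libraries are available.")
```

Note that the module name and the install name can differ: the `dotenv` module is provided by the `python-dotenv` package.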

## Load OpenAI API key from a secret file
To ensure your secret API key is not included in the code, you first write the key to a hidden file called `.env`. We then use the `dotenv` package to load the key from this file and store it as an environment variable. The `os` library makes it possible to read this environment variable from Python and load it as a Python string that we can then use to call the OpenAI API.

In [9]:
# Load OpenAI API key variables from .env file
load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
openai.api_key = OPENAI_API_KEY
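
If the `.env` file is missing or the variable name is misspelled, `os.getenv()` quietly returns `None` and the error only surfaces later, at the first API call. A small guard can make this fail early with a clear message. This is a sketch: the helper `require_env()` and the variable `DEMO_API_KEY` are hypothetical, and the `.env` file is assumed to contain a line of the form `OPENAI_API_KEY=...`.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Demonstration with a dummy value (hypothetical, not a real key)
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY")[:3] + "...")
```

Printing only the first few characters avoids leaking the full key into notebook output.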

## Define the assistant
The next step is to define the characteristics of the agent that will interact with a selected model via the API. Here we define that agent, with its specific behaviour, in a Python function (`aiprocessor()`). The result of that interaction is then passed to a second function that processes the information retrieved from the model into a structured form (`parse_ai_text_to_structured_data()`).

In [10]:

my_ai_model = "gpt-3.5-turbo"

directory_path = "./pdf"  # Replace with your directory path
#pdf_file = "GDL-nieuwsbrief-december.pdf"

def aiprocessor(page_no, text):
    print(f"\n\n..AI processing page {page_no}")
    messages = [
        {
            "role": "system",
            "content": """You are a PDF data extractor, a backend processor.
- User input is messy raw text extracted from a PDF page by page with PyPDF2.
- We are interested in data about methods used
- Your task is to output the retrieved information strictly in a structured pandas dataframe format.
- Output only the structured data, nothing else.
- Output the data as a dataframe with two columns:
1. 'topic' and 2. 'methods'
- Separate the columns in the data by a comma."""
        },
        {
            "role": "user",
            "content": "raw pdf text; extract and format data:" + text
        }
    ]

    # Temperature is a request parameter; asking for it inside the prompt has no effect
    api_params = {"model": my_ai_model, "messages": messages, "stream": True, "temperature": 0}
    try:
        api_response = openai.ChatCompletion.create(**api_params)
        reply = ""
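        for chunk in api_response:
            # Some chunks (for example the first and the last) may carry no 'content'
            word = chunk['choices'][0]['delta'].get('content') or ""
            reply += word
            print(word, end="")
        print()  # end the streamed line so later messages start on a new line
        return reply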
    except Exception as err:
        error_message = f"API Error page {page_no}: {str(err)}"
        print(error_message)
        return None

# Function to parse AI-processed text into a structured format
def parse_ai_text_to_structured_data(ai_text):
    structured_data = []
    for line in ai_text.split('\n'):
        if line.strip():
            columns = line.split(',')
            if len(columns) == 2:  # Ensure only two columns per row
                structured_data.append(columns)
            else:
                # Handle irregular data (e.g., log an error, skip, or take corrective action)
                print(f"Skipping irregular line: {line}")
    return structured_data

# List all PDF files in the directory
pdf_files = [f for f in os.listdir(directory_path) if f.endswith('.pdf')]
pdf_files


['teunis-corradi.pdf']
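
The parser above skips every line that contains more than one comma, so rows whose 'methods' text itself contains a comma are lost. A more tolerant variant could split on the first comma only and drop the header line that the model tends to repeat. This is a sketch under the assumption that the topic never contains a comma; the function name `parse_two_columns()` is hypothetical.

```python
def parse_two_columns(ai_text, header=("topic", "methods")):
    """Split each line on the first comma only; skip blank lines and header echoes."""
    rows = []
    for line in ai_text.splitlines():
        if "," not in line:
            continue  # blank or irregular line without a separator
        topic, methods = (part.strip() for part in line.split(",", 1))
        if (topic.lower(), methods.lower()) == header:
            continue  # the model repeated the column names
        rows.append([topic, methods])
    return rows


sample = "topic,methods\nNLP, Scanning literature, extracting knowledge\n\nno separator here"
print(parse_two_columns(sample))
# prints [['NLP', 'Scanning literature, extracting knowledge']]
```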

## Adapt the code to handle multiple PDFs
To ensure we can run the analysis for more than one PDF in a row, we adapt the code to:

 - Iterate through each file in the pdf directory.
 - Extract text using the PyPDF2 library, page by page, for each PDF file.
 - Pass the extracted text of each page to `aiprocessor()`, which sends the prompt to the OpenAI API.
 - Parse the reply into rows with `parse_ai_text_to_structured_data()`.
 - Save the rows for each page to a CSV file and print a confirmation. Pages without extractable text, or without a usable reply, are skipped.

In [11]:
import openai
import os
import PyPDF2
import pandas as pd

# Uses the aiprocessor() and parse_ai_text_to_structured_data() functions defined above

# List all PDF files in the directory
pdf_files = [f for f in os.listdir(directory_path) if f.endswith('.pdf')]

# Process each PDF file
for pdf_file in pdf_files:
    pdf_file_path = os.path.join(directory_path, pdf_file)
    
    with open(pdf_file_path, 'rb') as pdf_file_obj:
        pdf_reader = PyPDF2.PdfReader(pdf_file_obj)
        file_base_name = os.path.splitext(pdf_file)[0]

        # Iterate over each page in the PDF
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            page_text = page.extract_text()
            
            if page_text:
                ai_processed_text = aiprocessor(page_num, page_text)
                if ai_processed_text:
                    structured_data = parse_ai_text_to_structured_data(ai_processed_text)
                    if structured_data:
                        # Create DataFrame for the current page
                        df = pd.DataFrame(structured_data, columns=["topic", "methods"])
                        
                        # Save DataFrame to CSV for the current page
                        output_csv_file = f"{file_base_name}-page-{page_num + 1}.csv"
                        df.to_csv(output_csv_file, index=False)
                        print(f"Page {page_num + 1} AI-processed text saved to {output_csv_file}")




..AI processing page 0
topic,methods
Natural Language Processing, Delineating adverse outcome pathways and guiding the application of new approach methodologies
Adverse Outcome Pathways, Tying an initial perturbation (molecular initiating event) to a phenotypic toxicological manifestation (adverse outcome), through a series of steps (key events)
New Approach Methodologies, Supporting the development of NAMs, which aim to reduce the use of animal testing for toxicology purposes
Information Extraction, Developing methods to scan existing literature and extract relevant knowledge using machine learning
Automated Text Analysis, Facilitating the process of knowledge extraction for AOP building, leading to the development of NAMs
NLP, Supporting AOPs development in the scope of current projects ONTOX and VHP4Safety
Skipping irregular line: Adverse Outcome Pathways, Tying an initial perturbation (molecular initiating event) to a phenotypic toxicological manifestation (adverse outcome), throu

## Concatenate all extracted information into one dataset
Now that we have all extracted information from each page in a structured form, we can clean up the individual datasets and glue them together. 

In [12]:
import pandas as pd
import os
import glob

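# Directory containing your CSV files, and the directory for the combined output
directory_in = './'
directory_out = './data_out'
os.makedirs(directory_out, exist_ok=True)  # make sure the output directory exists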

# Use glob to list all CSV files in the directory
csv_files = glob.glob(os.path.join(directory_in, '*.csv'))
csv_files



['./teunis-corradi-page-2.csv',
 './teunis-corradi-page-5.csv',
 './teunis-corradi-page-4.csv',
 './teunis-corradi-page-6.csv',
 './teunis-corradi-page-3.csv',
 './teunis-corradi-page-7.csv',
 './teunis-corradi-page-1.csv']
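
As the listing shows, `glob` returns the files in no particular order. If the page order matters in the combined dataset, the file names can be sorted by their page number first. This is a minimal sketch that assumes the `-page-N.csv` naming used above. The helper `page_number()` is hypothetical, and the page-10 file name is made up to show why a plain alphabetical sort would not be enough.

```python
import re


def page_number(path):
    """Extract the page number from a name like 'report-page-12.csv'."""
    match = re.search(r"-page-(\d+)\.csv$", path)
    # Files that do not follow the naming pattern are sorted to the end
    return int(match.group(1)) if match else float("inf")


files = [
    "./teunis-corradi-page-2.csv",
    "./teunis-corradi-page-10.csv",  # hypothetical file, for illustration only
    "./teunis-corradi-page-1.csv",
]
print(sorted(files, key=page_number))
```

With several PDFs in the same directory, the sort key would also need to include the base name of the file.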

In [6]:

# Empty list to store dataframes
dataframes = []

for file in csv_files:
    # Read the CSV file; its first line is used as the header
    df = pd.read_csv(file)
    # Drop the first data row, which normally repeats the column names echoed by the model
    df = df[1:]
    # Append to the list of dataframes
    dataframes.append(df)

dataframes



[                                           topic         methods
 1  Adverse outcome pathways in modern toxicology  What are AOPs?
 2                                  AOPs and NAMs             NaN
 3               Current approaches building AOPs             NaN
 4        Advances in Natural Language Processing             NaN,
 Empty DataFrame
 Columns: [topic, methods]
 Index: [],
 Empty DataFrame
 Columns: [topic, methods]
 Index: [],
                               topic  \
 1          AOP developer inspection   
 2             Information gathering   
 3                   Graph databases   
 4                      NAMs support   
 5                      WoE for KERs   
 6       Limitations of the approach   
 7   Animal testing-based toxicology   
 8                               NLP   
 9       Machine learning techniques   
 10                          Funding   
 11                       References   
 
                                               methods  
 1     ensure mode

In [13]:
# Concatenate all dataframes
combined_df = pd.concat(dataframes, ignore_index=True)

# Save the combined dataframe to a new CSV file
combined_df.to_csv(os.path.join(directory_out, 'combined_csv.csv'), index=False)
combined_df

Unnamed: 0,topic,methods
0,Adverse outcome pathways in modern toxicology,What are AOPs?
1,AOPs and NAMs,
2,Current approaches building AOPs,
3,Advances in Natural Language Processing,
4,AOP developer inspection,ensure models are providing correct information
5,Information gathering,store information in machine-readable format
6,Graph databases,enable easy visualization of relationships and...
7,NAMs support,can be used as input for similarity approaches...
8,WoE for KERs,help in the discovery of novel events and comb...
9,Limitations of the approach,availability of open-science format publicatio...
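
After concatenation it is no longer visible which page a row came from, and dropping the first row by position removes real data whenever the model did not repeat the header. A possible alternative is to tag each row with its source file and to remove header rows by content. This is a sketch, not the method used above: `combine_pages()` is a hypothetical helper and the small frames are made-up examples.

```python
import pandas as pd


def combine_pages(frames_by_file):
    """Concatenate per-page frames, tagging rows with their source file."""
    tagged = []
    for file_name, df in frames_by_file.items():
        # Detect rows that merely repeat the column name instead of holding data
        is_header = df["topic"].astype(str).str.strip().str.lower() == "topic"
        tagged.append(df[~is_header].assign(source_file=file_name))
    return pd.concat(tagged, ignore_index=True)


# Made-up example frames standing in for two per-page CSV files
frames = {
    "a-page-1.csv": pd.DataFrame(
        {"topic": ["topic", "NLP"], "methods": ["methods", "text mining"]}
    ),
    "a-page-2.csv": pd.DataFrame({"topic": ["AOPs"], "methods": [None]}),
}
combined = combine_pages(frames)
print(combined)
```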
