# Overview

The goal of this program is to perform semantic search and causal map extraction from PDF documents efficiently and effectively. In particular, we focus on identifying sentences that support and align with the causal relationships depicted in causal maps. Causal maps are graphical representations that illustrate the cause-and-effect relationships between variables in a system.

Our program consists of several key steps, each designed to save time and ensure quality results:

Preprocessing PDF Documents: We start by converting the PDF documents into a machine-readable format. During this step, we extract textual content and perform essential preprocessing, such as removing any content that is not useful for analysis (e.g., citations, references).

Embedding and Similarity Search: The extracted sentences are then processed using OpenAI's embeddings to create vector representations of the text. By applying cosine similarity, we compare the embeddings of the sentences to the embeddings of queries derived from the causal maps. This process enables us to identify sentences that closely match the causal relationships described in the maps.

Extracting Causal Maps: In addition to identifying supportive sentences, we also extract causal maps from sentences that describe multiple effects of a given cause. This extraction process is crucial for preventing hallucinations—spurious or ungrounded information—in machine learning models trained on the data. Sentences with multiple effects provide a detailed view of the causal system, allowing models to better understand the complexity of real-world causality.

Analysis and Visualization: We analyze and visualize the distribution of similarity scores obtained from the semantic search process. This step provides valuable insights into the quality of the semantic search results and helps us determine an appropriate cutoff score for filtering sentences.

Overall, this program provides a comprehensive and efficient approach to enhancing the quality of causal data used in machine learning models. By identifying sentences that support causal maps, extracting detailed causal information, and streamlining the analysis process, we improve the interpretability and reliability of insights derived from the models while saving time in the overall workflow.
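The queries mentioned in the similarity-search step are built from the edges of a causal map. The sketch below shows one way to do that, assuming each edge is a (source, target, weight) triple where a positive weight is read as "causes" and a negative weight as "prevents". The function names and the example edges are hypothetical and only serve as an illustration.

```python
def edge_to_query(source, target, weight):
    """Turn one signed edge into a natural-language query (None for weight 0)."""
    if weight > 0:
        return f"{source} causes {target}"
    if weight < 0:
        return f"{source} prevents {target}"
    return None


def edges_to_queries(edges):
    """Build queries for all edges, skipping edges without a sign."""
    queries = [edge_to_query(s, t, w) for s, t, w in edges]
    return [q for q in queries if q is not None]


# Made-up edges, for illustration only
example_edges = [
    ("physical activity", "obesity", -1),
    ("screen time", "sedentary behavior", 1),
    ("income", "stress", 0),
]
example_queries = edges_to_queries(example_edges)
print(example_queries)
```

Each resulting query is later embedded and compared against the sentence embeddings.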

# Setup and Configuration

Before proceeding with the main steps, we install the necessary Python libraries and dependencies required for this project. 

In [None]:
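!pip install PyMuPDF
# openai is pinned because openai.embeddings_utils, imported below, was removed in openai 1.0
!pip install openai==0.26.5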

# Importing necessary libraries
# Built-in Python libraries, no versions required
import csv  
import json  
import os 

import networkx as nx  # networkx==3.0
import matplotlib.pyplot as plt  # matplotlib==3.5.1
import nltk  # nltk==3.8.1
nltk.download('punkt')
import pandas as pd  # pandas==1.5.3 
import openai  # openai==0.26.5
import numpy as np  # numpy==1.24.1
from operator import itemgetter  
from matplotlib import patches  # matplotlib==3.5.1
from openai.embeddings_utils import get_embedding, cosine_similarity

# Importing PyMuPDF, aliased as fitz
import fitz  # PyMuPDF

Next, we set the OpenAI API key.

In [None]:
# Replace "YOUR_API_KEY_HERE" with your actual OpenAI API key
openai.api_key = "YOUR_API_KEY_HERE"

In [None]:
directories_to_create = ["pdf_files", "json_files", "csv_files", "embedded_files", "maps", "results"]
# Create each working directory in the current folder if it does not already exist
def create_directories(dir_list):
    for directory in dir_list:
        os.makedirs(directory, exist_ok=True)

create_directories(directories_to_create)

# File Conversion and Processing

In this step, we extract and clean the textual content from PDF documents for further analysis:

Font Information: We identify different font sizes in the document to distinguish between headers, paragraphs, and other text elements.

Extracting Text: We extract text, skipping sections like "works cited" or "bibliography" that don't contain valuable information. 

Cleaning Text: We remove special characters, extra whitespaces, and unwanted sentences (e.g., with URLs) from the text. We also tokenize the text into individual sentences.

Saving Results: The clean sentences are saved in a structured format for further use.

As a result, we obtain a list of clean sentences from the PDF document, ready for the next steps of analysis, including generating embeddings and similarity searches.
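The cleaning rules described above can be sketched on plain strings, without a PDF. This is a minimal illustration that assumes the sentences have already been split. The helper name `clean_sentences` and the sample sentences are hypothetical; the length limit and the list of URL fragments mirror the values used in the processing cell of this notebook.

```python
URL_FRAGMENTS = ['http', 'www.', '.com', '.org', '.edu', '.pdf', '....']


def clean_sentences(sentences, min_length=50):
    """Normalize whitespace, drop short or URL-like sentences, and deduplicate."""
    cleaned = []
    for sentence in sentences:
        # Drop non-ASCII characters and collapse runs of whitespace
        sentence = sentence.encode('ascii', 'ignore').decode('ascii')
        sentence = " ".join(sentence.split())
        if len(sentence) <= min_length:
            continue
        if any(fragment in sentence.lower() for fragment in URL_FRAGMENTS):
            continue
        cleaned.append(sentence)
    # dict.fromkeys removes duplicates while keeping the original order
    return list(dict.fromkeys(cleaned))


# Made-up input, for illustration only
raw = [
    "Regular   physical activity reduces the risk of childhood obesity in school-age children.",
    "Regular physical activity reduces the risk of childhood obesity in school-age children.",
    "See www.example.org for the full report and supplementary materials on this topic.",
    "Too short.",
]
cleaned = clean_sentences(raw)
print(cleaned)
```

Only one sentence survives here: the first two inputs become identical after whitespace normalization, the third contains URL fragments, and the last is too short.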

In [None]:
def fonts(doc, granularity=False):
    """Extracts fonts and their usage in PDF documents.
    :param doc: PDF document to iterate through
    :type doc: <class 'fitz.fitz.Document'>
    :param granularity: also use 'font', 'flags' and 'color' to discriminate text
    :type granularity: bool
    :rtype: [(font_size, count), (font_size, count)], dict
    :return: most used fonts sorted by count, font style information
    """
    styles = {}
    font_counts = {}

    for page in doc:
        blocks = page.get_text("dict")["blocks"]
        for b in blocks:  # iterate through the text blocks
            if b['type'] == 0:  # block contains text
                for l in b["lines"]:  # iterate through the text lines
                    for s in l["spans"]:  # iterate through the text spans
                        if granularity:
                            identifier = "{0}_{1}_{2}_{3}".format(s['size'], s['flags'], s['font'], s['color'])
                            styles[identifier] = {'size': s['size'], 'flags': s['flags'], 'font': s['font'],
                                                  'color': s['color']}
                        else:
                            identifier = "{0}".format(s['size'])
                            styles[identifier] = {'size': s['size'], 'font': s['font']}

                        font_counts[identifier] = font_counts.get(identifier, 0) + 1  # count the fonts usage

    font_counts = sorted(font_counts.items(), key=itemgetter(1), reverse=True)

    if len(font_counts) < 1:
        raise ValueError("Zero discriminating fonts found!")

    return font_counts, styles


def font_tags(font_counts, styles):
    """Returns dictionary with font sizes as keys and tags as value.
    :param font_counts: (font_size, count) for all fonts occurring in document
    :type font_counts: list
    :param styles: all styles found in the document
    :type styles: dict
    :rtype: dict
    :return: all element tags based on font-sizes
    """
    p_style = styles[font_counts[0][0]]  # get style for most used font by count (paragraph)
    p_size = p_style['size']  # get the paragraph's size

    # sorting the font sizes high to low, so that we can append the right integer to each tag
    font_sizes = []
    for (font_size, count) in font_counts:
        font_sizes.append(float(font_size))
    font_sizes.sort(reverse=True)

    # aggregating the tags for each font size
    idx = 0
    size_tag = {}
    for size in font_sizes:
        idx += 1
        if size == p_size:
            idx = 0
            size_tag[size] = '<p>'
        if size > p_size:
            size_tag[size] = '<h{0}>'.format(idx)
        elif size < p_size:
            size_tag[size] = '<s{0}>'.format(idx)

    return size_tag


def block_ended(current_size, new_size, new_text, skip_words):
    return new_size != current_size or any(word in new_text.lower() for word in skip_words)

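def headers_para(doc, size_tag):
    """Collects the document text into blocks of consecutive spans that share a font size.
    Sections introduced by one of the skip words are left out.
    The size_tag argument is accepted but not used by this function.
    """
    # Sections that match these words will be removed
    skip_words = ['works cited', 'bibliography', 'reference', 'citation']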
    header_para = []
    current_size = None
    block_string = ""
    skip_section = False
    skip_section_size = 0

    for page in doc:
        blocks = page.get_text("dict")["blocks"]
        for b in blocks:
            if b['type'] == 0:
                for l in b["lines"]:
                    for s in l["spans"]:
                        text = s['text'].strip()
                        if not text:
                            continue

                        # Skip block if it only contains pipes
                        if all((c == "|") for c in text):
                            continue

                        text = text.encode('ascii', 'ignore').decode('ascii')
                        text = " ".join(text.split())  # Replace multiple whitespaces with single whitespace
                        text_size = s['size']
                        
                        if skip_section:
                            if text_size <= skip_section_size:
                                continue
                            # Larger text marks the next heading, so the skipped section has ended
                            skip_section = False
                            skip_section_size = 0

                        if any(word in text.lower() for word in skip_words):
                            skip_section = True
                            skip_section_size = text_size
                            continue
                        
                        if block_ended(current_size, text_size, text, skip_words):
                            if block_string:
                                header_para.append(block_string.strip())
                            block_string = text
                            current_size = text_size
                        else:
                            block_string += " " + text

    if block_string:
        header_para.append(block_string.strip())

    return header_para

In [None]:
def process_files():
    """Converts PDF files in 'pdf_files' directory to JSON files in 'json_files' directory.
    Each sentence is split into a new line using tokenization. Unwanted sentences are filtered out."""
    pdf_dir = 'pdf_files'
    json_dir = 'json_files'
    
    print("Converting pdfs to json")
    for filename in os.listdir(pdf_dir):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(pdf_dir, filename)
            with fitz.open(pdf_path) as doc:
                font_counts, styles = fonts(doc, granularity=False)
                size_tag = font_tags(font_counts, styles)
                elements = headers_para(doc, size_tag)

                # Split each sentence into a new line using tokenization
                elements = [nltk.sent_tokenize(element) for element in elements]

                # Flatten the list
                elements = [item for sublist in elements for item in sublist]

                # Filter conditions for unwanted sentences that remain
                remove_list = ['http', 'www.', '.com', '.org', '.edu', '.pdf', '....']
                elements = [element for element in elements if len(element) > 50 and not any(word in element.lower() for word in remove_list)]

                # Name the json file the same as the pdf file
                json_filename = filename.replace('.pdf', '.json')
                json_path = os.path.join(json_dir, json_filename)

                # Remove duplicate elements
                elements = list(dict.fromkeys(elements))

                # Write the elements to a JSON file
                with open(json_path, 'w') as f:
                    # Format the json file
                    json.dump(elements, f, indent=4)
          
process_files()

# Text Embedding and Search

In this step, we use OpenAI embeddings to convert the sentences extracted from the PDF documents into continuous vector representations. These embeddings capture the semantic meaning of the text and are the basis for computing similarity between sentences and queries. We call OpenAI's get_embedding function on each sentence and save the resulting embeddings, alongside the text, as CSV files in the embedded_files directory for later use in the semantic search process.
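To make the similarity computation concrete, here is a minimal sketch of cosine similarity on toy vectors. It uses only the standard library and three-dimensional made-up vectors; real embeddings are much longer, and the notebook itself relies on the cosine_similarity helper imported from openai.embeddings_utils. The names `cosine_sim`, `toy_query`, and `toy_sentences` are hypothetical.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, for illustration only
toy_query = [1.0, 0.0, 1.0]
toy_sentences = {
    "same direction": [2.0, 0.0, 2.0],
    "orthogonal": [0.0, 1.0, 0.0],
    "opposite": [-1.0, 0.0, -1.0],
}
scores = {name: cosine_sim(vec, toy_query) for name, vec in toy_sentences.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

The score depends only on direction, not length: a vector pointing the same way as the query scores close to 1, an orthogonal one close to 0, and an opposite one close to -1.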

In [None]:
# function to convert json files to csv
def json_to_csv(filename: str) -> None:
    df = pd.read_json(f"json_files/{filename}")
    df = df.rename(columns={0: "text"})
    csv_filename = filename.replace(".json", ".csv")
    df.to_csv(f"csv_files/{csv_filename}", index=False)

# function to embed text data in a csv file using OpenAI embeddings
def embed_file(filename: str) -> None:
    df = pd.read_csv(f"csv_files/{filename}")
    df['embedding'] = df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
    embed_filename = filename.replace(".csv", "_embed.csv")
    df.to_csv(f"embedded_files/{embed_filename}", index=False)


In [None]:
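# Convert the json files to csv files
for filename in os.listdir("json_files"):
    if not filename.endswith(".json"):
        continue
    try:
        json_to_csv(filename)
    except ValueError as e:
        # pd.read_json raises ValueError when the file is not valid JSON
        print(f"Skipping {filename}: {e}")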
    
# Check if the file has already been embedded; otherwise, embed it
for filename in os.listdir("csv_files"):
    if not os.path.exists(f"embedded_files/{filename.replace('.csv', '_embed.csv')}"):
        print(f"Embedding {filename}")
        embed_file(filename)

# Analysis and Visualization

In this step, we run the semantic search and then analyze and visualize the distribution of the resulting similarity scores. Each edge of a causal map is turned into a query, and every sentence is scored against it. We then plot a histogram of the scores within a specified range. The visualization shows how closely the sentences in the document align with the queries derived from the causal maps and gives a way to assess the overall quality of the semantic search results. It also helps determine an appropriate cutoff score, so that only sentences with similarity scores above the chosen threshold are retained for further analysis.
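Before plotting, it can help to see how strongly the choice of cutoff affects the number of retained sentences. The sketch below counts how many scores lie strictly above each candidate cutoff. The scores are made up for illustration, and `retained_counts` is a hypothetical helper, not part of the notebook.

```python
def retained_counts(scores, cutoffs):
    """Number of scores strictly above each candidate cutoff."""
    return {cutoff: sum(1 for s in scores if s > cutoff) for cutoff in cutoffs}


# Made-up similarity scores, for illustration only
toy_scores = [0.91, 0.89, 0.885, 0.87, 0.865, 0.84, 0.80]
counts = retained_counts(toy_scores, [0.86, 0.87, 0.88, 0.89])
for cutoff, count in counts.items():
    print(f"cutoff {cutoff}: {count} sentences retained")
```

Because the comparison is strict, a score exactly equal to the cutoff is not retained, which matches the `>` comparisons used in this notebook.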

In [None]:
def search_embedded_file_from_map(embedded_filename: str, map_filename: str) -> None:
    min_results = 30  # Minimum number of total results required
    similarity_threshold = 0.88  # Initial similarity threshold
    similarity_step = 0.01  # Step size for decreasing similarity threshold
    min_similarity_threshold = 0.7  # Minimum similarity threshold allowed

    df = pd.read_csv(f"embedded_files/{embedded_filename}")
    # Embeddings were saved as the text of a list of floats, which json.loads parses safely
    df['embedding'] = df['embedding'].apply(lambda x: np.array(json.loads(x)))
    map_df = pd.read_csv(f"maps/{map_filename}")

    result_filename = f"{embedded_filename.replace('_embed.csv', '')}_{map_filename}"

    # Function to score every sentence against the query for an individual edge
    def score_edge(query: str) -> pd.DataFrame:
        # get the embedding for the query and calculate cosine similarity
        query_embedding = get_embedding(query, engine='text-embedding-ada-002')
        scored = df[['text']].copy()
        scored['query'] = query
        scored['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, query_embedding))
        return scored

    # Iterate over the rows of the map file; the first three columns are read as
    # source, target, and signed weight
    res = []
    for _, row in map_df.iterrows():
        source, target, weight = row.iloc[0], row.iloc[1], row.iloc[2]
        if weight > 0:
            query = f"{source} causes {target}"
        elif weight < 0:
            query = f"{source} prevents {target}"
        else:
            continue
        res.append(score_edge(query))

    if not res:
        print(f"No signed edges found in {map_filename}; nothing to search.")
        return

    # Scores are computed once, so relaxing the threshold below needs no further API calls
    all_scores = pd.concat(res, ignore_index=True)
    result_df = all_scores[all_scores['similarity'] > similarity_threshold]

    # Relax the similarity threshold if total results are fewer than min_results
    while len(result_df) < min_results and similarity_threshold > min_similarity_threshold:
        similarity_threshold -= similarity_step
        result_df = all_scores[all_scores['similarity'] > similarity_threshold]

    # Sort by similarity and only include the sentence and similarity columns in the output
    result_df = result_df.sort_values(by='similarity', ascending=False)
    result_df = result_df[['text', 'similarity']].round({'similarity': 4})
    result_df.to_csv(f'results/{result_filename}', index=False)

def search_all_embedded_files(map_filename: str):
    for embedded_filename in os.listdir("embedded_files"):
        if embedded_filename.endswith("_embed.csv"):
            search_embedded_file_from_map(embedded_filename, map_filename)

# Execution (make sure to provide the correct map_filename and have necessary CSV files)
map_filename = "Drasic et al (edges).csv"  # modify as needed
search_all_embedded_files(map_filename)

In [None]:
# function to plot the distribution of similarity scores from a result file
# and save the sentences that score above the cutoff
def get_distribution(filename: str, sim_range: tuple, cutoff: float = .87) -> None:
    df = pd.read_csv(f"results/{filename}")
    # plot a histogram of the similarity scores within the specified range
    plt.hist(df['similarity'], bins=30, range=sim_range)
    plt.title(f"Distribution of Similarity Scores of \n{filename}")
    plt.xlabel("Similarity Score")
    plt.ylabel("Frequency")
    plt.show()

    # keep only the sentences above the cutoff; the filtered copy is written to the
    # current working directory, so the file in results/ is left unchanged
    df = df[df['similarity'] > cutoff]
    df.to_csv(filename, index=False)

get_distribution("[2012-36] Kentucky Task Force on Childhood Obesity_Drasic et al (edges).csv", sim_range=(.85, 1))

In [None]:
def makeViolinPlot():
    """Plots the similarity scores of false positives above each candidate threshold.
    Reads "PDFExtractorTop20.csv"; rows whose first column is "0" are treated as
    false positives, and the fourth column holds the similarity score."""
    thresholds = [.86, .87, .88, .89]
    false_positives = {threshold: [] for threshold in thresholds}

    # Read in the csv file and collect the false positives above each threshold
    with open("PDFExtractorTop20.csv", 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            if row[0] != "0":
                continue
            score = float(row[3])
            for threshold in thresholds:
                if score > threshold:
                    false_positives[threshold].append(score)

    # Make a violin plot with one violin per threshold
    data = [false_positives[threshold] for threshold in thresholds]
    labels = ["86%", "87%", "88%", "89%"]
    plt.violinplot(data, showmeans=True, showmedians=True)
    plt.xticks([1, 2, 3, 4], labels)
    plt.xlabel("Similarity Score Threshold")
    plt.ylabel("False Positive Similarity Score")
    plt.title("False Positives in the Top 20 Results")
    plt.show()

makeViolinPlot()

By analyzing the distribution of similarity scores and visualizing the results using histograms and violin plots, we can empirically determine an appropriate cutoff score for filtering sentences. The graphical visualizations support our choice of cutoff score. For example, using a cutoff score of .88 strikes a good balance between minimizing false positives and maximizing the number of relevant results, thereby ensuring the accuracy and effectiveness of our semantic search process.
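The trade-off described above can also be checked numerically when a hand-labeled sample is available. The sketch below computes, for each candidate cutoff, the share of retained sentences that were labeled relevant. The labeled scores are made up for illustration and are not results from this project; `precision_at_cutoff` is a hypothetical helper.

```python
def precision_at_cutoff(labeled_scores, cutoff):
    """Share of retained sentences labeled relevant, or None if nothing is retained."""
    kept = [is_relevant for score, is_relevant in labeled_scores if score > cutoff]
    if not kept:
        return None
    return sum(kept) / len(kept)


# Made-up (similarity, is_relevant) pairs, for illustration only
toy_labeled = [
    (0.92, True), (0.90, True), (0.885, False),
    (0.875, True), (0.865, False), (0.862, False),
]
precisions = {c: precision_at_cutoff(toy_labeled, c) for c in [0.86, 0.87, 0.88, 0.89]}
for cutoff, precision in precisions.items():
    print(f"cutoff {cutoff}: precision {precision:.2f}")
```

A higher cutoff generally removes more false positives but also keeps fewer sentences, so the retained count should be read alongside the precision.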

# Extracting Maps from Sentences to Prevent Hallucinations

A crucial aspect of our program is the extraction of causal maps from sentences that describe cause-and-effect relationships. This step is essential in preventing "hallucinations" in machine learning models, which are instances where the model produces spurious or ungrounded information that is not supported by the data.

To achieve this, we identify sentences that describe multiple effects of a given cause, as these sentences provide a detailed view of the causal system. We then extract the causal relationships from these sentences, creating a structured representation of the cause-and-effect links.

The extracted causal maps contribute to the overall quality of the causal data used in training machine learning models. By ensuring that the data accurately reflects real-world causality and minimizing the risk of hallucinations, we enhance the interpretability and reliability of the insights derived from the models.

This extraction process is a vital step towards building models that can effectively analyze and understand the complexity of real-world causal systems, leading to more accurate and actionable predictions.
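The structured representation used in this notebook is a tagged string in which each cause-effect link is written as `<H> cause <POS or NEG> <T> effect` between `<S>` and `<E>`. As a minimal sketch of that format, the function below parses a reply shaped like the examples in the prompt with ast.literal_eval. This is an alternative to the string splitting used in the notebook's own cell, and it assumes the reply is a well-formed Python list of three-element tuples; literal_eval raises an error otherwise. The name `tuples_to_subgraph` is hypothetical.

```python
import ast


def tuples_to_subgraph(reply):
    """Convert "[('1', 'cause', 'effect a, effect b'), ...]" into a tagged string."""
    triples = ast.literal_eval(reply)
    parts = ["<S>"]
    for correlation, cause, effects in triples:
        tag = "<POS>" if str(correlation) == "1" else "<NEG>"
        # Effects for one cause arrive as a single comma-separated string
        for effect in effects.split(", "):
            parts.append(f"<H> {cause} {tag} <T> {effect.strip()}")
    parts.append("<E>")
    return " ".join(parts)


example_reply = "[('1', 'deforestation', 'soil erosion, loss of biodiversity, climate change')]"
subgraph = tuples_to_subgraph(example_reply)
print(subgraph)
```

Each effect becomes its own head-tail pair, so a sentence with three effects of one cause yields three links.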

In [None]:
import time
import traceback

def format_text(input_text, sentence):
    """Converts the model's reply, e.g. "[('1', 'cause', 'effect a, effect b')]", into a
    tagged string such as "<S> <H> cause <POS> <T> effect a <H> cause <POS> <T> effect b <E>".
    The sentence argument is not used; it is kept so that existing calls still work."""
    try:
        formatted_text = "<S> "

        # Drop the outer list and tuple brackets, then split input_text into
        # individual tuples using "), (" as the delimiter
        input_tuples = input_text.strip().strip("[]()").split("), (")

        for item in input_tuples:
            input_tuple = item.split(", ")

            correlation = input_tuple[0].strip("'")
            cause = input_tuple[1].strip("'")

            # Extract all elements starting from the third element as effects and strip single quotes
            effects = [effect.strip("'") for effect in input_tuple[2:]]

            correlation_tag = "<POS>" if correlation == '1' else "<NEG>"

            # Iterate through each effect for a given cause
            for effect in effects:
                formatted_text += f"<H> {cause} {correlation_tag} <T> {effect.strip()} "

        formatted_text += "<E>"
        return formatted_text
    except Exception as e:
        # Print exception message and traceback, and return an empty subgraph
        print(str(e))
        traceback.print_exc()
        return ""


def extract_cause_and_effect(sentence):
  # Define the prompt
  prompt = f'''
  Extract the cause and effect from the following sentence: '{sentence}' and print the causes and effects as a tuple, taking into account closely related causes when identifying cause and effect relationships. If a cause leads to multiple positive effects, group them together and label the correlation as 1. If a cause leads to multiple negative effects, group them together and label the correlation as -1.
  For example, if the sentence is 'Exposure to air pollution is associated with an increased risk of respiratory diseases, cardiovascular diseases, and premature death. However, living in areas with good air quality reduces the risk of respiratory diseases.', the output should be [('1', 'air pollution', 'respiratory diseases, cardiovascular diseases, premature death'), ('-1', 'good air quality', 'respiratory diseases')].
  When extracting causes, consider the context and identify closely related causes. If you find causes that are similar or closely related, you must group them. The output should be a string. Each effect should be separate so that the output is a list of tuples.
  For example, if the sentence is 'The environmental consequences of deforestation include soil erosion, loss of biodiversity, and climate change.', the output should be [('1', 'deforestation', 'soil erosion, loss of biodiversity, climate change')].
  For example, if the sentence is 'Excessive screen time in children and adolescents can lead to various health issues, including eye strain, poor posture, and sleep disturbances. On the other hand, engaging in physical activities can improve overall health and enhance mental well-being.', the output should be [('1', 'excessive screen time', 'eye strain, poor posture, sleep disturbances'), ('-1', 'physical activities', 'mental well-being')].
  Note that the causes and effects should be in lowercase and should not contain any punctuation. The cause and effect should be separated by a comma and a space. The cause-effect pairs should be separated by a space. The cause and the effect can not be 1 or -1.
  '''

  retries = 5
  delay = 10
  backoff_factor = 3
  last_error = None

  # Retry with exponential backoff; any API error, not only a rate limit, triggers a retry
  while retries > 0:
      try:
          response = openai.ChatCompletion.create(
              # gpt-4
              model="gpt-4",
              messages=[
                  {"role": "system", "content": "You are a helpful assistant that only returns a tuple"},
                  {"role": "user", "content": prompt},
              ]
          )

          # Extract the assistant's reply
          assistant_reply = response['choices'][0]['message']['content']
          return assistant_reply

      except Exception as e:
          last_error = e
          retries -= 1
          # Wait before the next attempt, but not after the final failure
          if retries > 0:
              time.sleep(delay)
              delay *= backoff_factor

  raise Exception(f"Maximum retries reached; last error: {last_error}")

def process_sentences_from_results_folder(input_folder: str, output_filename: str) -> None:
    unique_sentences = set()

    with open(output_filename, 'w', newline='') as output_file:
        csv_writer = csv.writer(output_file)
        csv_writer.writerow(['subgraph', 'sentence'])

        for file in os.listdir(input_folder):
            if file.endswith('.csv'):
                file_path = os.path.join(input_folder, file)
                df = pd.read_csv(file_path)

                # Assuming one sentence per line in the 'text' column
                for index, row in df.iterrows():
                    sentence = row['text']
                    
                    # Skip the sentence if it has already been processed
                    if sentence in unique_sentences:
                        continue

                    unique_sentences.add(sentence)
                    cause_and_effect = extract_cause_and_effect(sentence)  # Call the extract_cause_and_effect function
                    formatted_text = format_text(cause_and_effect, sentence)  # Apply the format_text function

                    # Write the processed sentence to the output file
                    csv_writer.writerow([formatted_text, sentence])

process_sentences_from_results_folder('results', 'finalresult.csv')

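Finally, we clean the extracted subgraphs: we remove leftover brackets and quotes, correct the polarity tag for effects phrased with words such as "reduce", and drop rows in which no effect was extracted.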

In [None]:
import pandas as pd
import re

def remove_brackets_and_chars_from_csv(file_path, column_name):
    def remove_brackets_and_chars(text):
        text = text.replace("(", "").replace(")", "")  # Remove round brackets
        text = text.replace("[", "").replace("]", "")  # Remove square brackets
        text = text.replace("'", "")                   # Remove single quotes
        text = re.sub(r'(?<![0-9A-Za-z_])-1(?![0-9A-Za-z_])', '', text)  # Remove standalone -1
        return text

    def change_pos_to_neg(text):
        words_to_negate = ['reduce', 'decrease', 'prevent']  # Put your words here
        for word in words_to_negate:
            text = re.sub(rf'<POS>\s*<T>\s*\b{word}\b', '<NEG> <T>', text, count=1)
        return text

    def remove_certain_words(text):
        words_to_remove = ['increase', 'increased', 'improve']  # Put your words here
        for word in words_to_remove:
            text = re.sub(rf'(<POS>\s*<T>\s*)\b{word}\b', r'\1', text)
        return text

    text_to_remove = 'are no specific causes and effects mentioned to be extracted'

    try:
        df = pd.read_csv(file_path)

        if column_name not in df.columns:
            return f"The column '{column_name}' was not found in the CSV file."

        df[column_name] = df[column_name].apply(remove_brackets_and_chars)
        df[column_name] = df[column_name].apply(change_pos_to_neg)
        df[column_name] = df[column_name].apply(remove_certain_words)

        df = df[~df[column_name].str.contains(r'\<T\>\s+\<E\>', na=False, regex=True)]
        
        df = df[~df.applymap(lambda x: text_to_remove in str(x)).any(axis=1)]

        df.to_csv(file_path, index=False)

        return f"Successfully modified and saved changes to '{file_path}'."
    except FileNotFoundError:
        return f"The file '{file_path}' was not found."

# Example usage:
remove_brackets_and_chars_from_csv('finalresult.csv', 'subgraph')


# Conclusion

In conclusion, this program performs semantic search and causal map extraction on PDF documents to support the analysis of causal relationships in text. Its steps are preprocessing the PDF documents, embedding the sentences and searching them by similarity, and extracting causal maps; together they identify sentences that align with the causal relationships depicted in the maps. Extracting maps from sentences that describe multiple effects is intended to reduce the risk of hallucinations in machine learning models trained on the data, so that the models produce more reliable and interpretable insights.

This makes the program a useful tool for researchers, data scientists, and domain experts who need to extract causal information from textual documents. Automating the extraction saves time and resources, and the similarity cutoff gives a way to control the quality of the sentences that are retained. More broadly, the program can contribute to machine learning models that handle complex causal systems, at the intersection of natural language processing and causal inference.