
### Evaluation of Large Language Models as a Data Validation Tool 

This notebook extracts, transforms, and compares data from Wikidata and DBpedia about scientists and their doctoral students. The main objective is to identify inconsistencies between the two sources and then use LLM-generated answers to validate them.



#### Initialization
The first cell sets up the required packages and initializes a Spark session.

In [None]:
from SPARQLWrapper import SPARQLWrapper, JSON
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import col, concat, udf
import pandas as pd

# Create (or reuse) the Spark session used throughout the notebook.
spark = SparkSession.builder \
    .appName("Wikidata_and_DBpedia_Queries") \
    .getOrCreate()

#### Helper Functions

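1. **`get_redirected_url`**: Despite its name, this function does not follow HTTP redirects (the request is commented out). It only reduces Wikipedia URLs to a consistent `/wiki/<Title>` path so that links from both sources can be compared.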

In [4]:
def get_redirected_url(url):
    # Redirect resolution is disabled: no HTTP request is made,
    # the URL is only normalized.
    if url is None:
        return url
    if "/wiki/" in url:
        # Keep only the article path so that links from both sources share one format.
        return "/wiki/" + url.split("/wiki/")[-1]
    return url

get_redirected_url_udf = udf(get_redirected_url, StringType())
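The normalization above compares titles exactly as each endpoint returns them. If the two sources encode non-ASCII titles differently (for example percent-encoded in one and plain Unicode in the other, an assumption worth checking against the actual data), otherwise identical links would fail to match. A minimal sketch of a stricter normalizer using only the standard library follows; `normalize_wiki_path` and the example URLs are hypothetical and not part of the notebook.

```python
from urllib.parse import unquote, urlsplit


def normalize_wiki_path(url):
    """Reduce a Wikipedia URL to a decoded '/wiki/<Title>' path (hypothetical helper)."""
    if url is None:
        return None
    path = urlsplit(url).path  # drops scheme, host, query string and fragment
    if "/wiki/" not in path:
        return url  # leave non-article URLs untouched
    title = path.split("/wiki/", 1)[1]
    # Decode percent-escapes and use underscores instead of spaces.
    return "/wiki/" + unquote(title).replace(" ", "_")


print(normalize_wiki_path("https://en.wikipedia.org/wiki/Abdus_Salam"))
print(normalize_wiki_path("http://en.wikipedia.org/wiki/Ren%C3%A9_Example"))
```

Both the `http` and `https` forms, and both the encoded and decoded spellings of a title, collapse to the same path, which is what the join key needs.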

2. **`execute_sparql_query`**: Executes a SPARQL query and returns the results as a pandas DataFrame.

In [5]:
def execute_sparql_query(endpoint_url, query):
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    data = []
    for result in results["results"]["bindings"]:
        row = {field: result[field]['value'] for field in results['head']['vars']}
        data.append(row)
    return pd.DataFrame(data)
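In the SPARQL JSON results format, a variable that is unbound for a given solution is simply absent from that binding, so indexing `result[field]` raises a `KeyError` as soon as a query uses `OPTIONAL`. The queries in this notebook bind every variable, so the function works as written. The sketch below shows a more defensive flattening step on a hand-written sample response; `bindings_to_rows` and the `example.org` URIs are hypothetical.

```python
def bindings_to_rows(results):
    """Flatten a SPARQL JSON result into a list of dicts, using None for unbound variables."""
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        # binding.get(var, {}) tolerates variables missing from this solution.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


# Hand-written sample in the SPARQL JSON results layout (not real query output).
sample = {
    "head": {"vars": ["scientist", "doctoralStudent"]},
    "results": {"bindings": [
        {"scientist": {"type": "uri", "value": "http://example.org/A"},
         "doctoralStudent": {"type": "uri", "value": "http://example.org/B"}},
        {"scientist": {"type": "uri", "value": "http://example.org/C"}},
    ]},
}

rows = bindings_to_rows(sample)
for row in rows:
    print(row)
```

The resulting list can be passed to `pd.DataFrame(rows)` in the same way as `data` in the function above.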


3. **`rename_columns`**: Adds a prefix to each column name in a DataFrame.

In [6]:
def rename_columns(df, prefix):
    for column in df.columns:
        df = df.withColumnRenamed(column, f"{prefix}{column}")
    return df


4. **`extract_name_from_link`**: Extracts and formats a name from a Wikipedia link.

In [7]:
def extract_name_from_link(wiki_link):
    if wiki_link and "/wiki/" in wiki_link:
        name_part = wiki_link.split("/wiki/")[-1]  # Extract the part after /wiki/
        formatted_name = name_part.replace('_', ' ').title()  # Replace underscores with spaces and title case
        return formatted_name
    return None
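One caveat with `str.title()` is that it lowercases every letter after the first in each word, which alters names with internal capitals or lowercase particles. The sketch below contrasts that behaviour with a variant that keeps the original casing and decodes percent-escapes; `display_name_from_link` and the example link are hypothetical.

```python
from urllib.parse import unquote


def display_name_from_link(wiki_link):
    """Turn '/wiki/Some_Title' into 'Some Title' without changing letter case."""
    if not wiki_link or "/wiki/" not in wiki_link:
        return None
    title = unquote(wiki_link.split("/wiki/")[-1])
    return title.replace("_", " ")


link = "/wiki/Jane_McExample"  # made-up title for illustration
title_cased = link.split("/wiki/")[-1].replace("_", " ").title()
case_preserved = display_name_from_link(link)

print(title_cased)      # title() rewrites the internal capital
print(case_preserved)   # original casing is kept
```

Whether the casing difference changes an LLM's answer is an open question; preserving the title as written simply removes one source of variation.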

5. **`prompt_generator`**: Generates a formatted string from a Spark DataFrame to be used as a prompt.

In [8]:
def prompt_generator(spark_df):
    """
    Generate a formatted string from a Spark DataFrame consisting of Wikipedia links to scientists and their doctoral students.

    Parameters:
    spark_df (pyspark.sql.DataFrame): Spark DataFrame with columns 'wikidata_scientistWikipediaLink' and 'wikidata_doctoralStudentWikipediaLink'.

    Returns:
    str: A single string containing all the formatted questions.
    """
    pandas_df = spark_df.toPandas()
    entities = []
    
    for index, row in pandas_df.iterrows():
        scientist_name = extract_name_from_link(row['wikidata_scientistWikipediaLink'])
        student_name = extract_name_from_link(row['wikidata_doctoralStudentWikipediaLink'])
        
        if scientist_name and student_name:
            question = f"<question>Is {student_name} a student of {scientist_name}?</question>"
            entity = "<entity>"+question+"<answer></answer>"+"</entity>"
            entities.append(entity)
    
    return '\n'.join(entities)
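Because `prompt_generator` returns one string with an `<entity>` line per row, a DataFrame with thousands of rows produces a very long prompt. One option is to split the string into smaller batches before sending it to a model. The sketch below is only an illustration: `batch_entities` and the batch size of 50 are assumptions, not something the notebook prescribes.

```python
def batch_entities(prompt_text, batch_size=50):
    """Split a newline-separated prompt into chunks of at most batch_size entities."""
    entities = [line for line in prompt_text.split("\n") if line.strip()]
    return [
        "\n".join(entities[start:start + batch_size])
        for start in range(0, len(entities), batch_size)
    ]


# Five placeholder entities in the same layout that prompt_generator emits.
demo_prompt = "\n".join(
    f"<entity><question>Is Student {i} a student of Scientist {i}?</question><answer></answer></entity>"
    for i in range(5)
)

batches = batch_entities(demo_prompt, batch_size=2)
print(len(batches))
```

Each batch keeps whole `<entity>` elements, so the model never sees a question cut in half.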

#### Queries to extract data from DBpedia and Wikidata

Both queries extract scientists and their doctoral students who have an English Wikipedia page, from Wikidata and DBpedia respectively.

In [9]:
wikidata_query = """
SELECT Distinct ?scientist ?scientistWikipediaLink ?doctoralStudent ?doctoralStudentWikipediaLink WHERE {
  ?scientist wdt:P31 wd:Q5;        # Instance of human
             wdt:P106 ?occupation; # Occupation: scientist, physicist, chemist, mathematician
             wdt:P185 ?doctoralStudent.  # Must have a doctoral student

  VALUES ?occupation {wd:Q901 wd:Q169470 wd:Q593644 wd:Q170790}  # Occupations include scientist, physicist, chemist, mathematician

  # Get the English Wikipedia link for the scientist
  ?scientistWikipediaLink schema:about ?scientist;
                          schema:inLanguage "en";
                          schema:isPartOf <https://en.wikipedia.org/>.

  

  # Get the English Wikipedia link for the doctoral student
  ?doctoralStudentWikipediaLink schema:about ?doctoralStudent;
                                    schema:inLanguage "en";
                                    schema:isPartOf <https://en.wikipedia.org/>.

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY ?scientistWikipediaLink
"""
dbpedia_query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?scientist  ?scientistLink ?doctoralStudent  ?doctoralStudentLink WHERE {
  ?scientist a dbo:Scientist;                     # Instance of Scientist
             dbo:doctoralStudent ?doctoralStudent.  # Has a doctoral student
  
  ?scientist foaf:isPrimaryTopicOf ?scientistLink.
  ?doctoralStudent foaf:isPrimaryTopicOf ?doctoralStudentLink.
}
ORDER BY ?scientist

"""


#### Data Retrieval and Preparation

1. **Executing Queries**:

In [None]:
wikidata_df = execute_sparql_query("https://query.wikidata.org/sparql", wikidata_query)
dbpedia_df = execute_sparql_query("https://dbpedia.org/sparql", dbpedia_query)
wikidata_sdf = spark.createDataFrame(wikidata_df)
dbpedia_sdf = spark.createDataFrame(dbpedia_df)

2. **Renaming Columns**:

In [11]:
wikidata_df = rename_columns(wikidata_sdf, "wikidata_")
dbpedia_df = rename_columns(dbpedia_sdf, "dbpedia_")

3. **Cleaning URLs**:

In [12]:
wikidata_df = wikidata_df.withColumn("wikidata_scientistWikipediaLink", get_redirected_url_udf("wikidata_scientistWikipediaLink"))
wikidata_df = wikidata_df.withColumn("wikidata_doctoralStudentWikipediaLink", get_redirected_url_udf("wikidata_doctoralStudentWikipediaLink"))
dbpedia_df = dbpedia_df.withColumn("dbpedia_scientistLink", get_redirected_url_udf("dbpedia_scientistLink"))
dbpedia_df = dbpedia_df.withColumn("dbpedia_doctoralStudentLink", get_redirected_url_udf("dbpedia_doctoralStudentLink"))

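4. **Creating Join Keys** (stored in the `*_fk` columns; each key concatenates the scientist link and the doctoral student link):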

In [13]:
wikidata_df = wikidata_df.withColumn("wikidata_fk", concat("wikidata_scientistWikipediaLink", "wikidata_doctoralStudentWikipediaLink"))
dbpedia_df = dbpedia_df.withColumn("dbpedia_fk", concat("dbpedia_scientistLink", "dbpedia_doctoralStudentLink"))

#### Data Analysis

1. **Displaying Entry Counts**:

In [14]:
print("Entries in WikiData: "+str(wikidata_df.count()))
print("Entries in DBPedia: "+str(dbpedia_df.count()))
dbpedia_df.show()

Entries in WikiData: 11243
Entries in DBPedia: 8646

+--------------------+---------------------+-----------------------+---------------------------+--------------------+
|   dbpedia_scientist|dbpedia_scientistLink|dbpedia_doctoralStudent|dbpedia_doctoralStudentLink|          dbpedia_fk|
+--------------------+---------------------+-----------------------+---------------------------+--------------------+
|http://dbpedia.or...| /wiki/A._P._Balac...|   http://dbpedia.or...|        /wiki/Pierre_Ramond|/wiki/A._P._Balac...|
|http://dbpedia.or...| /wiki/A._Ronald_G...|   http://dbpedia.or...|       /wiki/Víctor_Agui...|/wiki/A._Ronald_G...|
|http://dbpedia.or...| /wiki/A._W._F._Ed...|   http://dbpedia.or...|       /wiki/Elizabeth_A...|/wiki/A._W._F._Ed...|
|http://dbpedia.or...| /wiki/Aaron_John_...|   http://dbpedia.or...|       /wiki/Allen_C._Sk...|/wiki/Aaron_John_...|
|http://dbpedia.or...| /wiki/Aaron_John_...|   http://dbpedia.or...|       /wiki/Daniel_H._N...|/wiki/Aaron_John_...|
|http://dbpedia.or...|    /wiki/Abdus_Salam|   http://db

2. **Combining Data**:

In [15]:
combined_df = wikidata_df.join(dbpedia_df, wikidata_df.wikidata_fk == dbpedia_df.dbpedia_fk, "full_outer").distinct()
output_path = "Combined_Knowledgebase_entries.csv"
combined_df.write.csv(output_path, header=True, mode="overwrite")
consistant_df = combined_df.filter(col("dbpedia_fk").isNotNull() & col("wikidata_fk").isNotNull())
inconsistent_dbpedia_df = combined_df.filter(col("dbpedia_fk").isNull())
inconsistent_wikidata_df = combined_df.filter(col("wikidata_fk").isNull())
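The full outer join on the key columns is equivalent to comparing two sets of (scientist, student) pairs: pairs in both sets are consistent, and pairs in only one set are missing from the other source. A small pure-Python sketch of that logic with made-up pairs follows; `classify_pairs` and the sample titles are hypothetical.

```python
def classify_pairs(wikidata_pairs, dbpedia_pairs):
    """Split (scientist, student) pairs into consistent and one-sided groups."""
    wikidata_pairs, dbpedia_pairs = set(wikidata_pairs), set(dbpedia_pairs)
    return {
        "consistent": wikidata_pairs & dbpedia_pairs,           # present in both sources
        "missing_in_dbpedia": wikidata_pairs - dbpedia_pairs,   # only in Wikidata
        "missing_in_wikidata": dbpedia_pairs - wikidata_pairs,  # only in DBpedia
    }


# Placeholder pairs, not taken from either knowledge base.
wikidata_sample = [("/wiki/A", "/wiki/B"), ("/wiki/A", "/wiki/C")]
dbpedia_sample = [("/wiki/A", "/wiki/B"), ("/wiki/D", "/wiki/E")]

groups = classify_pairs(wikidata_sample, dbpedia_sample)
for name, pairs in groups.items():
    print(name, sorted(pairs))
```

The three groups correspond to `consistant_df`, `inconsistent_dbpedia_df`, and `inconsistent_wikidata_df` in the cell above.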

                                                                                

3. **Displaying Inconsistencies**:

In [16]:
print("Missing in DBPedia: "+str(combined_df.filter(col("dbpedia_fk").isNull()).count()))
print("Missing in WikiData : "+str(combined_df.filter(col("wikidata_fk").isNull()).count()))
print("Consistent Data: "+str(combined_df.filter(col("dbpedia_fk").isNotNull() & col("wikidata_fk").isNotNull()).count()))
consistant_df.show()

Missing in DBPedia: 7445
Missing in WikiData : 4848
Consistent Data: 3798

+--------------------+-------------------------------+------------------------+-------------------------------------+--------------------+--------------------+---------------------+-----------------------+---------------------------+--------------------+
|  wikidata_scientist|wikidata_scientistWikipediaLink|wikidata_doctoralStudent|wikidata_doctoralStudentWikipediaLink|         wikidata_fk|   dbpedia_scientist|dbpedia_scientistLink|dbpedia_doctoralStudent|dbpedia_doctoralStudentLink|          dbpedia_fk|
+--------------------+-------------------------------+------------------------+-------------------------------------+--------------------+--------------------+---------------------+-----------------------+---------------------------+--------------------+
|http://www.wikida...|           /wiki/Andreas_von...|    http://www.wikida...|                     /wiki/Ernst_Mach|/wiki/Andreas_von...|http://dbpedia.or...| /wiki/Andreas_von...|   http://dbpedia.or...|           /wiki/Ernst_Mach|/w
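The reported counts can be cross-checked arithmetically: every Wikidata entry is either consistent or missing in DBpedia, and every DBpedia entry is either consistent or missing in Wikidata. The short check below uses only the numbers printed in this notebook; the variable names are chosen for illustration.

```python
# Counts copied from the notebook output.
wikidata_total, dbpedia_total = 11243, 8646
missing_in_dbpedia, missing_in_wikidata, consistent = 7445, 4848, 3798

# Each source splits into "shared with the other source" and "only in this source".
assert missing_in_dbpedia + consistent == wikidata_total
assert missing_in_wikidata + consistent == dbpedia_total

# Share of each source that the other source confirms.
overlap_wikidata = consistent / wikidata_total
overlap_dbpedia = consistent / dbpedia_total
print(f"{overlap_wikidata:.1%} of Wikidata pairs also appear in DBpedia")
print(f"{overlap_dbpedia:.1%} of DBpedia pairs also appear in Wikidata")
```

Both sums hold, so the join produced no duplicated or dropped rows, and roughly a third of the Wikidata pairs and somewhat under half of the DBpedia pairs are shared.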

4. **Generating Prompts**:

In [17]:
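# prompt_generator reads the wikidata_ link columns, so only rows that carry
# Wikidata values (consistent rows and rows missing in DBpedia) yield questions.
consistant_data_prompt = prompt_generator(consistant_df)
inconsistant_dbpedia_prompt = prompt_generator(inconsistent_dbpedia_df)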

In [None]:
print(consistant_data_prompt)
print(inconsistant_dbpedia_prompt)

#### Prompt Execution and Analysis


**GPT-4o** Results for Consistent Data:

https://chatgpt.com/share/70343e5e-557a-4af5-8fcf-ae07a7fcadb7

**Gemini 1.5 Pro** Results for Consistent Data: 

 https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221ortkXnziU0rDDOEucLrHqWlinFRFKYK-%22%5D,%22action%22:%22open%22,%22userId%22:%22102353347198161437106%22,%22resourceKeys%22:%7B%7D%7D&usp=sharing
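
Once a model has filled in the `<answer>` tags, the responses still have to be tallied. The notebook does not specify the response format, so the sketch below assumes the model returns the same `<entity>` structure with answers that start with "Yes" or "No". `tally_answers` and the sample response are hypothetical.

```python
import re
from collections import Counter

# Matches one <entity> element in the layout produced by prompt_generator.
ENTITY_RE = re.compile(
    r"<entity>\s*<question>(.*?)</question>\s*<answer>(.*?)</answer>\s*</entity>",
    re.DOTALL,
)


def tally_answers(response_text):
    """Count yes / no / other answers in a model response (assumed format)."""
    counts = Counter()
    for _question, answer in ENTITY_RE.findall(response_text):
        label = answer.strip().lower()
        if label.startswith("yes"):
            counts["yes"] += 1
        elif label.startswith("no"):
            counts["no"] += 1
        else:
            counts["other"] += 1  # empty or free-text answers
    return counts


# Made-up response text for illustration only.
sample_response = """
<entity><question>Is Student A a student of Scientist X?</question><answer>Yes</answer></entity>
<entity><question>Is Student B a student of Scientist Y?</question><answer>No</answer></entity>
<entity><question>Is Student C a student of Scientist Z?</question><answer></answer></entity>
"""

counts = tally_answers(sample_response)
print(dict(counts))
```

Comparing the "yes" rate on the consistent pairs with the rate on the pairs missing in DBpedia would give a first indication of how well a model separates the two groups.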
