# Fact Checking using Knowledge Graphs

This notebook builds a fact verification system that uses a knowledge graph to evaluate the authenticity of a given claim.

The system extracts the relevant information from the knowledge graph and uses it to verify the claim. Evidence may either support or refute a claim, and this determines whether the claim is classified as valid or invalid.

The fact verification engine takes a claim as input. Based on the data in the knowledge graph, the model reports whether the claim is valid or invalid, along with the evidence that supports or refutes it.

This is achieved by creating a knowledge graph from the FEVER (Fact Extraction and Verification) dataset and training a model on it after suitable preprocessing and feature extraction.

### Mounting Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Link for the dataset:
[FEVER dataset](https://drive.google.com/file/d/1oF-L881NDs_Qkqijqeg-13cddDPWzVER/view?usp=sharing)

## DATASET

The dataset used is the FEVER (Fact Extraction and Verification) dataset.

FEVER was chosen because a literature review turned up no prior work that had built a knowledge graph from this dataset.


*   It consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from.

*   The claims are classified as Supported, Refuted or NotEnoughInfo. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment.

Dataset Source: [Source to FEVER](https://fever.ai/dataset/fever.html)


### Installing necessary packages

In [None]:
!pip install transformers==4.20.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.20.1
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.12.1 transformers-4.20.1


In [None]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Necessary imports

In [None]:
import pandas as pd
import numpy as np
import json
import networkx as nx
import math 

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from scipy.sparse import hstack

import torch
import torch.nn
import seaborn as sns
import transformers

from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader, TensorDataset
from torch.nn.utils.rnn import pad_sequence

import logging
logging.basicConfig(level=logging.ERROR)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)  # use the GPU if one is available

cuda


### Traversing the FEVER Dataset to create the knowledge graph

The dataset consists of the following columns:
* id: The ID of the claim

* label: The annotated label for the claim. Can be one of SUPPORTS|REFUTES|NOT ENOUGH INFO.

* claim: The text of the claim.

* evidence: A list of evidence sets (lists of [Annotation ID, Evidence ID, Wikipedia URL, sentence ID] tuples) or a [Annotation ID, Evidence ID, null, null] tuple if the label is NOT ENOUGH INFO.
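
To make the `evidence` field concrete, the sketch below parses one record in this shape and flattens its evidence sets. The record is a made-up illustration (the IDs, page title, and sentence indices are hypothetical, not taken from the dataset), and `iter_evidence` is a helper name introduced here.

```python
import json

# Hypothetical record in the FEVER JSONL layout; every value is illustrative.
line = (
    '{"id": 1, "label": "SUPPORTS", "claim": "Example claim.", '
    '"evidence": [[[10, 20, "Example_Page", 0]], [[11, 21, "Example_Page", 3]]]}'
)

def iter_evidence(record):
    """Yield (evidence_id, page_title, sentence_id) for each evidence entry."""
    for evidence_set in record["evidence"]:
        for _annotation_id, evidence_id, page, sentence_id in evidence_set:
            # NOT ENOUGH INFO entries carry null for page and sentence
            if page is not None:
                yield evidence_id, page, sentence_id

record = json.loads(line)
flat = list(iter_evidence(record))
print(flat)
```

Each inner list is one evidence set, so a claim can be backed by several alternative sets, each holding one or more sentences.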

We have defined the knowledge graph as a networkx graph with the following features:

* The nodes of the knowledge graph are claims and evidence items.

* Each edge connects a claim to one of its evidence items and carries a label indicating whether the evidence "supports" or "refutes" the claim.

* Non-verifiable claims (NOT ENOUGH INFO) are kept as claim nodes, but their evidence entries are omitted, so they have no edges in the KG.


In [None]:
# Read the JSONL file and parse the data
def read_fever_jsonl(file_path):
    data = []
    with open(file_path, 'r') as file:
        for line in file:
            data.append(json.loads(line))
    return data

# Create the knowledge graph
def create_knowledge_graph(data):
    kg = nx.DiGraph()
    
    for item in data:
        claim_id = item['id']
        claim_text = item['claim']
        label = item['label']
        
        # Add claim node
        kg.add_node(claim_id, label="claim", text=claim_text)
        
        if label != "NOT ENOUGH INFO":
            for evidence_group in item['evidence']:
                for evidence in evidence_group:
                    evidence_id = evidence[1]
                    evidence_title = evidence[2]
                    evidence_sentence_num = evidence[3]
                    
                    # Add evidence node
                    kg.add_node(evidence_id, label="evidence", title=evidence_title, sentence_num=evidence_sentence_num)
                    
                    # Add edge between claim and evidence with the relationship label
                    kg.add_edge(claim_id, evidence_id, label=label)
    
    return kg

In [None]:
file_path = "/content/drive/MyDrive/NAM/train.jsonl"
data = read_fever_jsonl(file_path)
knowledge_graph = create_knowledge_graph(data)

### Exploratory Data Analysis

Total number of nodes and edges in the knowledge graph

In [None]:
print("Nodes in the knowledge graph:", knowledge_graph.number_of_nodes())
print("Edges in the knowledge graph:", knowledge_graph.number_of_edges())

Nodes in the knowledge graph: 268910
Edges in the knowledge graph: 221476


### Total number of edges with "REFUTES" label

In [None]:
def count_refutes_edges(kg):
    count = 0
    for _, _, edge_data in kg.edges(data=True):
        if edge_data['label'] == 'REFUTES':
            count += 1
    return count

refutes_count = count_refutes_edges(knowledge_graph)
print("Number of edges with a REFUTES label:", refutes_count)

Number of edges with a REFUTES label: 60227



### Total number of edges with "SUPPORTS" label

In [None]:
def count_supports_edges(kg):
    count = 0
    for _, _, edge_data in kg.edges(data=True):
        if edge_data['label'] == 'SUPPORTS':
            count += 1
    return count


supports_count = count_supports_edges(knowledge_graph)
print("Number of edges with a SUPPORTS label:", supports_count)

Number of edges with a SUPPORTS label: 161249
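
Both label counts can also be gathered in a single pass. The following is a minimal sketch, assuming edges arrive as `(source, target, attributes)` triples, which is the shape `kg.edges(data=True)` yields; `count_edge_labels` and the toy edges are illustrative names and data, not part of the notebook.

```python
from collections import Counter

def count_edge_labels(edges):
    """Tally the 'label' attribute over (source, target, attributes) triples."""
    return Counter(attrs["label"] for _, _, attrs in edges)

# Toy triples standing in for kg.edges(data=True)
toy_edges = [
    (1, 101, {"label": "SUPPORTS"}),
    (1, 102, {"label": "SUPPORTS"}),
    (2, 103, {"label": "REFUTES"}),
]
label_counts = count_edge_labels(toy_edges)
print(label_counts["SUPPORTS"], label_counts["REFUTES"])
```

On the real graph the call would be `count_edge_labels(knowledge_graph.edges(data=True))`.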


## Preprocessing

The following preprocessing steps are applied:

* Tokenization and lowercasing
* Removing non-alphabetic tokens
* Removing stop words
* Lemmatization

In [None]:

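# Build the stop-word set and lemmatizer once rather than on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    # Lowercase and tokenize, then keep alphabetic tokens only
    tokens = nltk.word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha()]
    # Drop stop words and lemmatize what remains
    tokens = [token for token in tokens if token not in STOP_WORDS]
    tokens = [LEMMATIZER.lemmatize(token) for token in tokens]
    return ' '.join(tokens)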


### Finding top-k evidences for a given claim

In [None]:
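def find_top_k_evidences(kg, claim_id, k=5):
    # Collect (evidence_id, edge label) pairs for every evidence linked to the claim
    edges = [(evidence_id, data['label']) for _, evidence_id, data in kg.out_edges(claim_id, data=True)]
    # Sort by label string (no relevance ranking is applied) and keep at most k
    evidences = sorted(edges, key=lambda x: x[1], reverse=True)[:k]
    return evidences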


In [None]:
def prepare_dataset(kg):
    dataset = []
    for claim_id, data in kg.nodes(data=True):
        if data['label'] == 'claim':
            claim_text = preprocess_text(data['text'])
            evidences = find_top_k_evidences(kg, claim_id)
            for evidence_id, relationship in evidences:
                evidence_data = kg.nodes[evidence_id]
                evidence_text = preprocess_text(evidence_data['title'])
                dataset.append((claim_text, evidence_text, relationship))
    return dataset

dataset = prepare_dataset(knowledge_graph)
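
One detail worth checking in `prepare_dataset`: the evidence text is a Wikipedia page title, and titles in the outputs further down use underscores (for example `City_of_London`). Since `preprocess_text` keeps only tokens for which `str.isalpha()` is true, a title containing an underscore is likely filtered out entirely, assuming the tokenizer leaves it as a single token. The sketch below shows the effect with plain `str.split` standing in for `nltk.word_tokenize`; `alpha_tokens` and `normalize_title` are hypothetical helpers, not part of the notebook.

```python
def alpha_tokens(text):
    """Lowercase, split on whitespace, and keep purely alphabetic tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]

def normalize_title(title):
    """Turn an underscore-separated page title into space-separated text."""
    return title.replace("_", " ")

title = "City_of_London"
print(alpha_tokens(title))                   # the underscore makes isalpha() False
print(alpha_tokens(normalize_title(title)))  # the words survive once separated
```

If this applies to the real tokenizer as well, replacing underscores before calling `preprocess_text` would let multi-word titles contribute features.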


### Feature Extraction

In [None]:
def extract_features(dataset):
    vectorizer = TfidfVectorizer()
    claims, evidences, labels = zip(*dataset)
    claim_features = vectorizer.fit_transform(claims)
    evidence_features = vectorizer.transform(evidences)
    return claim_features, evidence_features, labels, vectorizer

claim_features, evidence_features, labels, vectorizer = extract_features(dataset)


### Dividing training and testing sets

In [None]:
X = hstack([claim_features, evidence_features])
y = [1 if label == "SUPPORTS" else 0 for label in labels]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Training the model

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Evaluation metrics on the test set

In [None]:

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))


Accuracy: 0.8110532407407407
Precision: 0.8108741504569955
Recall: 0.9699484189280108
F1 Score: 0.8833065277884148
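
As a sanity check on these numbers, the sketch below recomputes F1 from the reported precision and recall and compares accuracy with a majority-class baseline derived from the edge counts above. The baseline is an approximation: it uses label proportions over all edges, whereas the model was trained on at most five evidences per claim.

```python
precision = 0.8108741504569955
recall = 0.9699484189280108

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Share of SUPPORTS edges in the graph, i.e. the accuracy of always predicting SUPPORTS
supports_edges, refutes_edges = 161249, 60227
majority_baseline = supports_edges / (supports_edges + refutes_edges)

print(f"F1: {f1:.4f}")
print(f"Majority baseline: {majority_baseline:.4f}")
```

Recall being much higher than precision, with a baseline of roughly 0.73 against an accuracy of 0.81, suggests the classifier leans toward the SUPPORTS class.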


### Predicting claim veracity function

In [None]:
def predict_claim_veracity(claim_text, kg, model, vectorizer):
    claim_id = claims(kg, claim_text)
    if claim_id is None:
        print("Claim not found in the knowledge graph.")
        # Return a pair so callers can always unpack the result
        return None, None

    claim = preprocess_text(claim_text)
    print("Claim:", claim)
    claim_features = vectorizer.transform([claim])
    #print("Claim features shape:", claim_features.shape)

    top_k_evidences = find_top_k_evidences(kg, claim_id)
    predictions = []
    for evidence_id, relationship in top_k_evidences:
        evidence_data = kg.nodes[evidence_id]
        evidence_text = evidence_data['title']
        #evidence_text = preprocess_text(evidence_data['title'])
        # print("Evidence text:", evidence_text)
        evidence_features = vectorizer.transform([evidence_text])
        #print("Evidence features shape:", evidence_features.shape)
        features = hstack([claim_features, evidence_features])
        # print("Features shape:", features.shape)

        prediction = model.predict(features)
        probability = model.predict_proba(features)

        if prediction[0] == 1:
            predicted_relationship = "SUPPORTS"
        else:
            predicted_relationship = "REFUTES"

        confidence = max(probability[0])

        predictions.append((predicted_relationship, evidence_text, confidence))
        #print("Prediction:", prediction)
        print("Probability:", probability)

    return predictions, top_k_evidences

def claims(kg, claim_text):
    for node_id, data in kg.nodes(data=True):
        if data['label'] == 'claim' and data['text'] == claim_text:
            return node_id
    return None
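
The `claims` lookup scans every node on each call. If many claims are checked, a text-to-ID index built once may be preferable. The sketch below assumes nodes arrive as `(node_id, attributes)` pairs, the shape `kg.nodes(data=True)` yields; `build_claim_index` and the toy nodes are illustrative, not part of the notebook.

```python
def build_claim_index(nodes):
    """Map claim text to node ID from (node_id, attributes) pairs."""
    index = {}
    for node_id, attrs in nodes:
        if attrs.get("label") == "claim":
            # setdefault keeps the first node seen, as the linear scan does
            index.setdefault(attrs["text"], node_id)
    return index

# Toy pairs standing in for kg.nodes(data=True)
toy_nodes = [
    (1, {"label": "claim", "text": "Example claim A."}),
    (101, {"label": "evidence", "title": "Example_Page", "sentence_num": 0}),
    (2, {"label": "claim", "text": "Example claim B."}),
]
claim_index = build_claim_index(toy_nodes)
print(claim_index.get("Example claim B."))  # ID of the matching claim node
print(claim_index.get("Unknown claim."))    # None when the text is absent
```

Like the original lookup, this matches on the exact claim string, so a claim must be typed as it appears in the dataset.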


### Main fact checking function

In [None]:
# Fact Checking function
def fact_check(claim_text):
    predictions, top_k_evidences = predict_claim_veracity(claim_text, knowledge_graph, model, vectorizer)
    if predictions:  # skip when the claim is missing or has no linked evidence
        supports_count = 0
        refutes_count = 0
        for predicted_relationship, evidence_text, confidence in predictions:
            print(f"{predicted_relationship} with evidence: {evidence_text} and confidence: {confidence:.2f}")
            if predicted_relationship == "SUPPORTS":
                supports_count += 1
            elif predicted_relationship == "REFUTES":
                refutes_count += 1
        if supports_count > refutes_count:
            print("Claim is valid.")
        else:
            print("Claim is invalid.")  
    else:
        print("No predictions were made for the claim.")
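
The voting rule in `fact_check` can also be written as a pure function that returns a verdict instead of printing it, which makes it easier to test. This is a sketch under the assumption that each row is `[P(REFUTES), P(SUPPORTS)]`, matching the 0/1 label encoding used for training; `aggregate_verdict` is a hypothetical name.

```python
def aggregate_verdict(probability_rows):
    """Turn per-evidence [P(REFUTES), P(SUPPORTS)] rows into a verdict."""
    if not probability_rows:
        return "no evidence"
    # An evidence votes SUPPORTS when its SUPPORTS probability is the larger one
    supports = sum(1 for p_refutes, p_supports in probability_rows
                   if p_supports > p_refutes)
    refutes = len(probability_rows) - supports
    # A tie counts as invalid, mirroring the comparison in fact_check
    return "valid" if supports > refutes else "invalid"

print(aggregate_verdict([[0.25, 0.75], [0.40, 0.60]]))
print(aggregate_verdict([[0.80, 0.20]]))
print(aggregate_verdict([]))
```

Returning a separate value for the empty case keeps claims without evidence from being reported as invalid.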
 
 

## Working

The cells below run the system on a few test claims to demonstrate how it works.

For every claim, the output has the following format:

* Claim (after preprocessing)
* Probability: [P(REFUTES), P(SUPPORTS)] for each evidence
* Evidence supporting or refuting the claim, with the model's confidence
* Conclusion



### Test Claim-1: Tetris has sold millions of physical copies.

In [None]:
fact_check("Tetris has sold millions of physical copies.")

Claim: tetri sold million physical copy
Probability: [[0.25977798 0.74022202]]
SUPPORTS with evidence: Tetris and confidence: 0.74
Claim is valid.


### Test Claim-2: Ariana Grande never lent her voice to animated television and films.

In [None]:
fact_check("Ariana Grande never lent her voice to animated television and films.")

Claim: ariana grande never lent voice animated television film
Probability: [[0.73749738 0.26250262]]
REFUTES with evidence: Ariana_Grande and confidence: 0.74
Claim is invalid.


### Test Claim-3: Roman Atwood is a content creator.

In [None]:
fact_check("Roman Atwood is a content creator.")

Claim: roman atwood content creator
Probability: [[0.24535581 0.75464419]]
Probability: [[0.24535581 0.75464419]]
SUPPORTS with evidence: Roman_Atwood and confidence: 0.75
SUPPORTS with evidence: Roman_Atwood and confidence: 0.75
Claim is valid.


### Test Claim-4: London is the location of zero enclaves.

In [None]:
fact_check("London is the location of zero enclaves.")

Claim: london location zero enclave
Probability: [[0.78579529 0.21420471]]
REFUTES with evidence: City_of_London and confidence: 0.79
Claim is invalid.


A fact verification system built on knowledge graphs, such as the one above, has many real-world applications, including detecting fake news, identifying misinformation, and verifying claims in legal or political contexts. These applications can help promote transparency, accountability, and trust in public discourse.