How can I use Knowledge Graphs to mitigate the hallucinations in large language models?

One way to mitigate hallucinations in large language models (LLMs) is to cross-reference each factual claim in the model's output against the facts stored in a Knowledge Graph (KG). When a claim contradicts the KG, you can flag it or replace it with the KG's value. Here's a simplified example in Python:

In [None]:
!pip install -q transformers torch

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode context the generation is conditioned on
input_ids = tokenizer.encode('The capital of France is', return_tensors='pt')

# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Generate text until the output length (which includes the context length) reaches 50
outputs = model.generate(input_ids, max_length=50)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Simulated Knowledge Graph (in reality, this would be much larger and more complex)
knowledge_graph = {
  'France': {'capital': 'Paris'},
  'Germany': {'capital': 'Berlin'},
  # ... more knowledge
}

# Function to check LLM output against the Knowledge Graph
def validate_with_kg(text, kg):
  words = text.split()
  for i, word in enumerate(words):
    # Strip punctuation so that "France," still matches the KG key "France"
    subject = word.strip('.,;:!?"\'')
    if subject in kg:
      # Toy assumption: the text reads "<subject> is <value>", so the claimed
      # value is the word two positions after the subject
      claimed = words[i+2].strip('.,;:!?"\'') if i+2 < len(words) else None
      # Validate the claimed value against every fact stored for this subject
      for relation, value in kg[subject].items():
        if claimed == value:
          return f"Correct: {subject} -> {relation}: {value}"
      # No stored fact matched: report the first stored fact as the correction
      relation, value = next(iter(kg[subject].items()))
      return f"Error: {subject} -> {claimed}. Corrected: {subject} -> {relation}: {value}"
  return "No knowledge graph match found."

# Example usage
result = validate_with_kg(text, knowledge_graph)
print(result)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Error: France -> the. Corrected: France -> capital: Paris


In this example, GPT-2 generates a continuation of the sentence "The capital of France is". The validate_with_kg function finds the first KG entity mentioned in the generated text (France) and checks the word two positions after it against the KG entry for that entity. GPT-2 continued with "the" rather than "Paris", so the function reports an error and returns the value stored in the KG.
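The function above only reports a mismatch; it does not change the generated text. As a minimal sketch of the correction step, the snippet below rewrites a contradicted claim in place. The name correct_with_kg and the pattern CLAIM_PATTERN are hypothetical, and the pattern assumes claims are phrased as "<relation> of <subject> is <value>" with single-word values, which real model output will often violate.

```python
import re

# Toy KG with the same shape as the dictionary used above
knowledge_graph = {
    "France": {"capital": "Paris"},
    "Germany": {"capital": "Berlin"},
}

# Hypothetical pattern: matches "<relation> of <subject> is <value>"
CLAIM_PATTERN = re.compile(r"\b(\w+) of (\w+) is (\w+)")

def correct_with_kg(text, kg):
    """Return text with contradicted claims replaced by the KG value."""
    def fix(match):
        relation, subject, claimed = match.groups()
        expected = kg.get(subject, {}).get(relation.lower())
        if expected is None or expected == claimed:
            # Unknown fact or already correct: leave the claim untouched
            return match.group(0)
        return f"{relation} of {subject} is {expected}"
    return CLAIM_PATTERN.sub(fix, text)

print(correct_with_kg("The capital of France is the", knowledge_graph))
# prints: The capital of France is Paris
```

Claims about subjects or relations that are missing from the KG pass through unchanged, so the sketch never invents a correction it cannot back with a stored fact.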

Please note that this is a highly simplified example. In practice, you would need a more sophisticated method for parsing and understanding the text, such as entity linking and relation extraction, as well as a much larger and more complex KG. Additionally, real-world KGs are usually queried through an API or a database rather than a simple Python dictionary.
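To illustrate the database-backed variant, here is a minimal sketch that stores the KG as (subject, relation, object) triples in an in-memory SQLite table and checks a claim with a query. The table layout and the helper names build_kg, lookup, and check_claim are assumptions made for this example; a production KG would more likely sit behind a graph database or a SPARQL endpoint.

```python
import sqlite3

def build_kg(triples):
    """Create an in-memory SQLite table of (subject, relation, object) triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subject TEXT, relation TEXT, object TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    conn.commit()
    return conn

def lookup(conn, subject, relation):
    """Return the stored object for (subject, relation), or None if absent."""
    row = conn.execute(
        "SELECT object FROM triples WHERE subject = ? AND relation = ?",
        (subject, relation),
    ).fetchone()
    return row[0] if row else None

def check_claim(conn, subject, relation, claimed):
    """Classify a claim as 'supported', 'contradicted', or 'unknown'."""
    expected = lookup(conn, subject, relation)
    if expected is None:
        return "unknown"
    return "supported" if expected == claimed else "contradicted"

kg = build_kg([("France", "capital", "Paris"), ("Germany", "capital", "Berlin")])
print(check_claim(kg, "France", "capital", "the"))
# prints: contradicted
```

Returning "unknown" separately from "contradicted" matters here: a fact missing from the KG is not evidence that the model hallucinated.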

The same procedure can be written as pseudocode. The LaTeX below assumes the algorithm and algpseudocode packages and is meant to be compiled with LaTeX, not run as a Python cell:

\begin{algorithm}
\caption{Mitigate LLM Hallucinations using Knowledge Graph}
\begin{algorithmic}[1]
\Require Pre-trained LLM model, Knowledge Graph (KG)
\Ensure Corrected output from LLM based on KG

\Function{GenerateText}{$model, tokenizer, inputText$}
  \State $inputIds \gets \Call{Encode}{tokenizer, inputText}$
  \State $output \gets \Call{ModelGenerate}{model, inputIds}$
  \State $text \gets \Call{Decode}{tokenizer, output}$
  \State \Return $text$
\EndFunction

\Function{ValidateWithKG}{$text, kg$}
  \State $words \gets \Call{Split}{text}$
  \ForAll{$word \in words$}
    \If{$word \in kg$}
      \State $claimed \gets \Call{WordAfter}{words, word, 2}$ \Comment{word two positions after $word$}
      \If{$claimed \in \Call{Values}{kg[word]}$}
        \State \Return \Call{FormatCorrect}{$word, claimed, kg$}
      \Else
        \State $relation \gets \Call{FirstRelation}{kg, word}$
        \State $correctValue \gets kg[word][relation]$
        \State \Return \Call{FormatError}{$word, claimed, relation, correctValue$}
      \EndIf
    \EndIf
  \EndFor
  \State \Return \texttt{"No KG match found."}
\EndFunction

\State $model \gets \Call{LoadModel}{\texttt{"gpt2"}}$
\State $tokenizer \gets \Call{LoadTokenizer}{\texttt{"gpt2"}}$
\State $kg \gets \Call{LoadKnowledgeGraph}{}$
\State $text \gets \Call{GenerateText}{model, tokenizer, \texttt{"The capital of France is"}}$
\State $result \gets \Call{ValidateWithKG}{text, kg}$
\State \Call{Print}{$result$}

\end{algorithmic}
\end{algorithm}