# Knowledge Extraction with open-source LLMs
## Fine-tuned RoBERTa-large and RAG Integration

As part of a proof of concept, this notebook presents example code that integrates a fine-tuned question-answering (QA) model with retrieval-augmented generation (RAG).


In [None]:
# Install requirements
!pip install datasets transformers torch torchvision torchaudio tqdm
# Quote the version specifier so the shell does not treat ">" as a redirect
!pip install "requests>=2.32.1"
!pip install accelerate
!pip install langchain
!pip install chromadb
!pip install tiktoken
!pip install sentence_transformers==2.7.0
!pip install python-dotenv
!pip install langchain_community
!pip install langchain_huggingface



In [None]:
# Import libraries
import json
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
# HuggingFaceEmbeddings and HuggingFacePipeline come from the langchain_huggingface package installed above
from langchain_huggingface import HuggingFaceEmbeddings
import chromadb
from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain_huggingface import HuggingFacePipeline
import torch
from google.colab import drive
import random
from datasets import Dataset, DatasetDict

In [None]:
# Set seeds for reproducibility
seed = 123
random.seed(seed)
# Seed the CPU generator unconditionally, not only when a GPU is present
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)


In [None]:
# Mount Google Drive (specific to Google Colab environment)
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load the SQuAD JSON files
with open('/content/drive/MyDrive/Colab Notebooks/E.ON_Data_Challenge/SQuAD/dev-v2.0.json') as g:
    dev_data = json.load(g)

In [None]:
# Overall, this function takes raw data with articles, questions, and answers,
# and transforms it into a structured dictionary separating titles, contexts,
# questions, answer texts, and answer starting positions.

# Function to transform the data into the required format
def transform_data(data):
    transformed_data = {
        'id': [],
        'title': [],
        'context': [],
        'question': [],
        'answers': []
    }
    for article in data['data']:
        title = article['title']
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                transformed_data['id'].append(qa['id'])
                transformed_data['title'].append(title)
                transformed_data['context'].append(context)
                transformed_data['question'].append(qa['question'])
                transformed_data['answers'].append({
                    'text': [answer['text'] for answer in qa['answers']],
                    'answer_start': [answer['answer_start'] for answer in qa['answers']]
                })
    return transformed_data
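
To make the output shape concrete, the sketch below applies the same flattening logic to a tiny hand-written record in the SQuAD v2.0 layout. The sample article, the question IDs, and the `flatten_squad` name are invented for illustration; the second question has an empty answer list, as unanswerable questions do in SQuAD v2.0.

```python
# Hypothetical miniature of the SQuAD v2.0 JSON layout (not real dataset rows).
sample = {
    "data": [{
        "title": "Normans",
        "paragraphs": [{
            "context": "Normandy is a region in France.",
            "qas": [
                {"id": "q1", "question": "In what country is Normandy located?",
                 "answers": [{"text": "France", "answer_start": 24}]},
                {"id": "q2", "question": "Who founded Normandy in 2001?",
                 "answers": []},  # unanswerable question: no gold answers
            ],
        }],
    }]
}


def flatten_squad(data):
    """Flatten articles -> paragraphs -> questions into parallel column lists."""
    columns = {"id": [], "title": [], "context": [], "question": [], "answers": []}
    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                columns["id"].append(qa["id"])
                columns["title"].append(article["title"])
                columns["context"].append(paragraph["context"])
                columns["question"].append(qa["question"])
                columns["answers"].append({
                    "text": [a["text"] for a in qa["answers"]],
                    "answer_start": [a["answer_start"] for a in qa["answers"]],
                })
    return columns


flat = flatten_squad(sample)
print(flat["id"])
print(flat["answers"][0])
print(flat["answers"][1])
```

Every column has one entry per question, so the context string is repeated for each question asked about the same paragraph. For a row with no gold answers, indexing `['text'][0]` raises an `IndexError`, so such rows need a guard before the first answer is read.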


In [None]:
# Transform the data
dev_transformed = transform_data(dev_data)

# Create Dataset objects
dev_dataset = Dataset.from_dict(dev_transformed)


In [None]:
# Select the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [None]:
# Choose a fine-tuned QA model from the Hugging Face Hub
model_name = "ozgurkk/roberta-large-finetuned-squad"
# Initialize the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
tokenizer.is_fast

True

# RAG

In [None]:
# Extract the contexts from the dataset to load them as documents into the vector store,
# since no other text corpus is available for this proof of concept
dev_contexts = [item['context'] for item in dev_dataset]

In [None]:
# Create a unique list of contexts (dict.fromkeys keeps the original order)
dev_contexts = list(dict.fromkeys(dev_contexts))
len(dev_contexts)

1204
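
The deduplication relies on `dict.fromkeys`, which keeps the first occurrence of each key and preserves insertion order (guaranteed since Python 3.7), so the unique contexts stay in dataset order. A toy illustration with made-up strings:

```python
# Made-up stand-ins for repeated paragraph contexts.
contexts = ["ctx A", "ctx B", "ctx A", "ctx C", "ctx B"]

# Keys of a dict are unique and ordered by first insertion.
unique_contexts = list(dict.fromkeys(contexts))
print(unique_contexts)  # ['ctx A', 'ctx B', 'ctx C']
```

A plain `set(contexts)` would also remove duplicates, but it does not guarantee any particular order.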

In [None]:
# Convert contexts to documents for ingestion
# Text splitter to create chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,
    chunk_overlap = 150) # chunk_size and chunk_overlap can be tuned, but that is out of scope for this project
documents = text_splitter.create_documents(dev_contexts)
all_splits = text_splitter.split_documents(documents)
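
The effect of `chunk_size` and `chunk_overlap` can be pictured with a deliberately simplified sliding-window splitter. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to break on separators such as paragraph and line breaks); the `sliding_chunks` helper is a hypothetical stand-in that only shows how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the purpose of the overlap: a sentence cut at a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.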

In [None]:

# Use embeddings from Hugging Face
# Other options exist (e.g. OpenAI), but for this proof of concept we use Hugging Face embeddings
# all-MiniLM-L6-v2 is a widely used sentence-transformers model on Hugging Face
# The choice of embedding model can be optimized
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Initialize ChromaDB client
chroma_client = chromadb.Client()

# Add documents to ChromaDB
# For this proof of concept, all contexts in the SQuAD evaluation dataset are loaded into the vector store
vectorstore = Chroma.from_documents(
    documents=all_splits,
    collection_name="squad-rag-chroma",
    embedding=embeddings,
    client=chroma_client,  # use the client created above so list_collections() reflects this collection
)

print("Ingestion completed. Collections in ChromaDB: ", chroma_client.list_collections())



Ingestion completed. Collections in ChromaDB:  [Collection(id=6136034b-d620-4c95-b17c-9384fc9910b8, name=squad-rag-chroma)]


In [None]:
# Example of a similarity search
docs = vectorstore.similarity_search(dev_dataset[0]['question'], k=1)
print("Question: " + dev_dataset[0]['question'])
print( "Actual Context: " + dev_dataset[0]['context'])
print( "Related context found: " + " ".join([doc.page_content for doc in docs]))

Question: In what country is Normandy located?
Actual Context: The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.
Related context found: In the course of the 10th century, the initially destructive incursions of Norse war bands into the rivers of France evolved into more permanent encampments that included 
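
The retrieved chunk above is not the paragraph the question was written for: similarity search returns whichever stored vector lies closest to the question embedding. The sketch below mimics this with made-up three-dimensional vectors and cosine similarity; the vectors, document names, and the `top_k` helper are hypothetical, and the metric actually used by a Chroma collection depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy "embeddings": two related passages and one unrelated passage.
doc_vecs = {
    "normandy_region": [0.9, 0.1, 0.0],
    "norse_incursions": [0.8, 0.3, 0.1],
    "unrelated": [0.0, 0.1, 0.9],
}
query_vec = [0.85, 0.3, 0.05]

ranking = top_k(query_vec, doc_vecs, k=2)
print(ranking)
```

In this toy setup the passage that a human would call the right one ranks second, which is why raising `k` or re-ranking the candidates is a common adjustment.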

In [None]:
# Create the HuggingFace pipeline for Q&A
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer, device=device)

In [None]:
# Define a function that predicts the answer to a question

def generate_answer(question, retriever, qa_pipeline, device=None):
    # `device` is unused (the pipeline already holds its device); kept for call compatibility
    # Retrieve the most relevant chunk(s) from the vector store
    docs = retriever.similarity_search(question, k=1)  # k can be tuned
    # MMR is another way to select documents from the vector space:
    # docs = retriever.max_marginal_relevance_search(question, k=2, fetch_k=3)
    context = " ".join(doc.page_content for doc in docs)

    # Without retrieved context there is nothing to extract from, and the
    # pipeline rejects an empty context, so return an empty answer directly
    if not context:
        return {'answer': "", 'score': 0.0, 'context': context}

    # Run the extractive QA pipeline on the retrieved context
    answer = qa_pipeline(question=question, context=context)
    return {'answer': answer['answer'], 'score': answer['score'], 'context': context}
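
The MMR alternative mentioned in the comment refers to maximal marginal relevance, which trades relevance to the query against redundancy with documents that were already selected. A minimal greedy sketch over toy two-dimensional vectors follows, assuming cosine similarity; the `mmr_select` helper and the vectors are hypothetical, and LangChain's own implementation may differ in details.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


doc_vecs = [
    [1.0, 0.0],  # doc 0: relevant, near-duplicate of doc 1
    [1.0, 0.1],  # doc 1: most relevant to the query
    [0.6, 0.8],  # doc 2: less relevant, but points in a different direction
]
query_vec = [1.0, 0.2]

diverse = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking and returns the two near-duplicates; with `lambda_mult=0.5` the second pick switches to the more distinct document.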




In [None]:
# Example inference
demo_question = dev_dataset[0]['question']
demo_answer = dev_dataset[0]['answers']['text'][0]
print("Context:" + dev_dataset[0]['context'])
print("Question:" + demo_question)
print("Actual Answer:" + demo_answer)

# Perform inference
answer = generate_answer(demo_question, vectorstore, qa_pipeline, device)
print(answer['answer'])


Context:The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.
Question:In what country is Normandy located?
Actual Answer:France
Neustria
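
In this run the reader returned "Neustria" while the gold answer is "France", presumably because the answer was extracted from the retrieved chunk and not from the question's original context. One hedged way to quantify such misses is a normalized exact-match check in the style of the SQuAD evaluation (lowercasing, stripping punctuation and English articles). The `normalize_answer` and `exact_match` helpers below are an illustrative sketch, not the official evaluation script.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the prediction equals any gold answer after normalization."""
    if not gold_answers:
        # Unanswerable question: only an empty prediction counts as correct.
        return normalize_answer(prediction) == ""
    return any(normalize_answer(prediction) == normalize_answer(gold)
               for gold in gold_answers)


print(exact_match("Neustria", ["France"]))     # prediction from the run above
print(exact_match("the France.", ["France"]))  # differs only in article and punctuation
print(exact_match("", []))                     # unanswerable question, empty prediction
```

Averaging this check over the evaluation questions would give an exact-match rate for the combined retriever and reader, which can then be compared against the reader's score when it is given the gold context.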
