# Evaluation

For evaluation, we use three metrics to judge whether our RAG pipeline is effective:

1. Recall for document retrieval: for a given course, the fraction of its related documents that the retriever finds.
2. Precision for document retrieval: for a given course, the fraction of the top 5 retrieved articles that are actually relevant.
3. Final response similarity with a benchmark: using NotebookLM as the benchmark, the similarity between its response and our RAG response.
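
As a rough illustration of how the two retrieval metrics can be computed for a single course, here is a minimal sketch. The function names and the toy titles are hypothetical and not part of this notebook; it assumes a set of relevant article titles is available for the course.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved titles that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for title in top_k if title in relevant)
    return hits / len(top_k)


def recall_at_k(retrieved, relevant, k=5):
    """Fraction of all relevant titles that appear among the top-k retrieved titles."""
    if not relevant:
        return 0.0
    hits = sum(1 for title in retrieved[:k] if title in relevant)
    return hits / len(relevant)


# Toy data, made up for illustration only
retrieved = ["Cipher suite", "BLS digital signature", "Motor skill", "Data retrieval", "Text graph"]
relevant = {"Cipher suite", "BLS digital signature", "Block cipher", "Public-key cryptography"}

print(precision_at_k(retrieved, relevant))  # 2 hits out of 5 retrieved
print(recall_at_k(retrieved, relevant))     # 2 hits out of 4 relevant
```

Averaging these per-course values over all sampled courses gives the aggregate figures reported later in this section.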

In [1]:
from transformers import AutoTokenizer, AutoModel
from datasets import load_dataset
import torch
from torch.nn import CosineSimilarity
import pandas as pd
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

In [6]:
import pandas as pd
from keybert import KeyBERT
import torch
from transformers import AutoTokenizer, AutoModel
from torch.nn import CosineSimilarity

## Data

In [2]:
file_path = 'Combined_course_data.csv'
course = pd.read_csv(file_path)
course

| | Title | Description | Subject |
|---|---|---|---|
| 0 | Introduction to Business Analytics | This course provides students with an introduc... | Computer Science |
| 1 | Business Analytics Immersion Programme | This course aims to equip students with a firs... | Computer Science |
| 2 | Econometrics Modeling for Business Analytics | This course provides the foundations to econom... | Computer Science |
| 3 | Data Management and Visualisation | This course aims to provide students with prac... | Computer Science |
| 4 | Feature Engineering for Machine Learning | This course covers topics that are important f... | Computer Science |
| ... | ... | ... | ... |
| 1911 | Introduction to Hyperledger Sovereign Identity... | To the surprise of absolutely no one, trust is... | Computer Science |
| 1912 | A System View of Communications: From Signals ... | Have you ever wondered how information is tran... | Computer Science |
| 1913 | Scripting and Programming Foundations | Computer programs are abundant in many people'... | Computer Science |
| 1914 | Using GPUs to Scale and Speed-up Deep Learning | Training acomplex deep learning model with a v... | Data Science |


In [3]:
file_path = 'wikidata.csv'
wikidata = pd.read_csv(file_path)
wikidata

| | text | url | title |
|---|---|---|---|
| 0 | Becurtovirus is a genus of viruses, in the fam... | https://en.wikipedia.org/wiki/Becurtovirus | Becurtovirus |
| 1 | Cyprinivirus is a genus of viruses in the orde... | https://en.wikipedia.org/wiki/Cyprinivirus | Cyprinivirus |
| 2 | Glossinavirus is a genus of viruses, in the fa... | https://en.wikipedia.org/wiki/Glossinavirus | Glossinavirus |
| 3 | Ichtadenovirus is a genus of viruses, in the f... | https://en.wikipedia.org/wiki/Ichtadenovirus | Ichtadenovirus |
| 4 | Lambdatorquevirus is a genus of viruses, in th... | https://en.wikipedia.org/wiki/Lambdatorquevirus | Lambdatorquevirus |
| ... | ... | ... | ... |
| 131044 | A non-blanching rash (NBR) is a skin rash that... | https://en.wikipedia.org/wiki/Non-blanching%20... | Non-blanching rash |
| 131045 | In organic chemistry, the term cyanomethyl (cy... | https://en.wikipedia.org/wiki/Cyanomethyl | Cyanomethyl |
| 131046 | Remaiten is malware which infects Linux on emb... | https://en.wikipedia.org/wiki/Remaiten | Remaiten |
| 131047 | Gradient-enhanced kriging (GEK) is a surrogate... | https://en.wikipedia.org/wiki/Gradient-enhance... | Gradient-enhanced kriging |


In [4]:
course_transformed = pd.DataFrame({
    "content": course.apply(lambda row: ' | '.join([f"{col}: {row[col]}" for col in course.columns]), axis=1)
})

# Transform `wikidata` DataFrame to a single-column format
wikidata_transformed = pd.DataFrame({
    "content": wikidata.apply(lambda row: ' | '.join([f"{col}: {row[col]}" for col in wikidata.columns]), axis=1)
})

# Display the transformed tables
print("Transformed Course Data:")
print(course_transformed.head())

print("\nTransformed Wikidata:")
print(wikidata_transformed.head())

Transformed Course Data:
                                             content
0  Title: Introduction to Business Analytics | De...
1  Title: Business Analytics Immersion Programme ...
2  Title: Econometrics Modeling for Business Anal...
3  Title: Data Management and Visualisation | Des...
4  Title: Feature Engineering for Machine Learnin...

Transformed Wikidata:
                                             content
0  text: Becurtovirus is a genus of viruses, in t...
1  text: Cyprinivirus is a genus of viruses in th...
2  text: Glossinavirus is a genus of viruses, in ...
3  text: Ichtadenovirus is a genus of viruses, in...
4  text: Lambdatorquevirus is a genus of viruses,...
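
To make the flattened format concrete, the sketch below applies the same `column: value` joining to a single row represented as a plain dictionary. The helper name `flatten_row` and the sample values are made up for illustration; the notebook itself does this with `DataFrame.apply`.

```python
def flatten_row(row):
    """Join a row's fields as 'column: value' pairs separated by ' | '."""
    return " | ".join(f"{col}: {value}" for col, value in row.items())


# Hypothetical row with the same column order as `wikidata` (text, url, title)
wiki_row = {
    "text": "Example article body.",
    "url": "https://example.org/wiki/Example",
    "title": "Example",
}

flattened = flatten_row(wiki_row)
print(flattened)
```

Because `title` is the last column, each flattened Wikipedia string ends with ` | title: <title>`, which matters later when the titles are extracted from the reranked articles.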


In [5]:
wiki_embeddings_file = 'wiki_title_embeddings.npy'
wiki_title_embeddings = np.load(wiki_embeddings_file)
wiki_title_embeddings

array([[-0.01740786,  0.00442912, -0.09215238, ..., -0.02191604,
         0.07291625, -0.02235293],
       [-0.10091388,  0.0783674 , -0.04533364, ..., -0.1075331 ,
         0.04686709,  0.07207245],
       [-0.10018466, -0.00640676, -0.0114509 , ..., -0.14957273,
         0.06115797,  0.02614287],
       ...,
       [-0.03868212,  0.05411112,  0.00084907, ...,  0.01953804,
        -0.01381   , -0.04266216],
       [-0.09186076, -0.1078757 ,  0.04518463, ..., -0.042975  ,
        -0.03663828,  0.01403402],
       [-0.06280275,  0.0021886 , -0.00058878, ..., -0.0114022 ,
        -0.0395432 , -0.0105731 ]], dtype=float32)

In [7]:
# Build the FAISS index and load the embedding model once, outside the function, for efficiency
dimension = wiki_title_embeddings.shape[1]
faiss_index = faiss.IndexFlatL2(dimension)
faiss_index.add(wiki_title_embeddings)
title_model = SentenceTransformer('all-MiniLM-L6-v2')

def wiki_title_filter_with_course_info(num_candidates, course_info, wikidata_transformed, model=title_model):
    """
    Filters the top relevant Wikipedia titles based on the course information.

    Parameters:
    - num_candidates (int): The number of top candidates to retrieve.
    - course_info (str): The course information text.
    - wikidata_transformed (DataFrame): The transformed DataFrame containing Wikipedia data.
    - model (SentenceTransformer): The model used to embed the course information.

    Returns:
    - DataFrame: A DataFrame containing the top relevant Wikipedia entries.
    """
    # Step 1: Encode the course info to create an embedding
    course_embedding = model.encode(course_info)

    # Step 2: Search for the most relevant Wikipedia titles using FAISS
    # (FAISS expects a 2-D float32 array of shape (n_queries, dimension))
    query = np.asarray([course_embedding], dtype=np.float32)
    _, top_k_indices = faiss_index.search(query, num_candidates)

    # Step 3: Filter the top Wikipedia entries
    top_wikidata = wikidata_transformed.iloc[top_k_indices[0]].reset_index(drop=True)

    return top_wikidata
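
`faiss.IndexFlatL2` performs an exact, brute-force search and ranks by squared Euclidean distance. The following pure-Python sketch reproduces that ranking on a few toy vectors; the helper names and the vectors are hypothetical. It also checks that, for unit-length vectors, squared L2 distance equals `2 - 2 * cosine`, so the L2 ranking matches a cosine ranking in that case. Whether the stored title embeddings are unit-normalized is not shown in this notebook, so treat that part as an assumption.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_l2(query, vectors, k):
    """Indices of the k vectors closest to the query by squared L2 distance."""
    ranked = sorted(range(len(vectors)), key=lambda i: squared_l2(query, vectors[i]))
    return ranked[:k]


# Toy, unit-normalized 3-d "embeddings" (made up for illustration)
raw = ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])
vectors = [normalize(v) for v in raw]
query = normalize([1.0, 0.05, 0.0])

nearest = top_k_l2(query, vectors, k=2)
print(nearest)

# For unit vectors: squared L2 distance == 2 - 2 * cosine similarity
identity_gaps = [abs(squared_l2(query, v) - (2 - 2 * dot(query, v))) for v in vectors]
print(max(identity_gaps))
```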



In [12]:
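# Define your utility functions
def encode_text(text, tokenizer, model):
    # `device` is the module-level torch.device defined in the evaluation cell
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64).to(device)
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        embeddings = model(**inputs).last_hidden_state[:, 0, :]  # CLS token embedding
    return embeddings

def extract_keywords(content, model):
    # Extract five 3-word keyphrases and merge them into a single string
    keywords = model.extract_keywords(content, keyphrase_ngram_range=(3, 3), stop_words='english',
                                      use_maxsum=True, nr_candidates=20, top_n=5)
    merged_keywords = " ".join([kw[0] for kw in keywords])
    return merged_keywords


def refine_user_query_1(query, kw):
    # Keep spaces around the joined parts so the query and keywords do not run together
    return query + " which has the following keywords: " + kw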

In [9]:
def get_top_candidates(course_info, user_query, top_500_wikidata, tokenizer, query_model, document_model, kw_model, top_n=50):
    """
    Ranks wiki documents based on similarity to the user query and returns the top candidates.

    Parameters:
    - course_info (str): The course information text, used for keyword extraction.
    - user_query (str): The user's query.
    - top_500_wikidata (DataFrame): DataFrame containing the top 500 filtered wiki data.
    - tokenizer (Tokenizer): Tokenizer for encoding text.
    - query_model (Model): Model for encoding query text.
    - document_model (Model): Model for encoding document text.
    - kw_model (KeyBERT): Model for extracting keywords from the course information.
    - top_n (int): Number of top candidates to return.

    Returns:
    - List[Dict]: List of dictionaries containing content and similarity score for top candidates.
    """
    # Refine the user query with keywords from the course info, then encode it with the query model
    merged_keywords = extract_keywords(course_info, kw_model)
    user_query = refine_user_query_1(user_query, merged_keywords)
    query_embedding = encode_text(user_query, tokenizer, query_model)
    
    top_candidates = []

    # Iterate over each document in the top 500 filtered data
    for _, row in top_500_wikidata.iterrows():
        # Embed the document content using the document model
        doc_embedding = encode_text(row['content'], tokenizer, document_model)
        
        # Calculate similarity score between query and document embeddings
        similarity_score = cosine_sim(query_embedding, doc_embedding).item()
        
        # Append content and similarity score to the list
        top_candidates.append({
            "Content": row['content'],
            "Similarity Score": similarity_score
        })
    
    # Sort candidates by similarity score in descending order and select top N
    top_candidates = sorted(top_candidates, key=lambda x: x["Similarity Score"], reverse=True)[:top_n]
    
    return top_candidates
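
The final step above sorts every candidate and then slices the first `top_n`. As an optional alternative, `heapq.nlargest` selects the same top entries without sorting the whole list. The sketch below uses made-up candidates and scores to show that the two approaches agree; it is an illustration, not a change to the notebook's method.

```python
import heapq

# Hypothetical candidates in the same shape that get_top_candidates builds
candidates = [
    {"Content": "doc A", "Similarity Score": 0.41},
    {"Content": "doc B", "Similarity Score": 0.87},
    {"Content": "doc C", "Similarity Score": 0.15},
    {"Content": "doc D", "Similarity Score": 0.63},
]


def score(candidate):
    return candidate["Similarity Score"]


# Full sort followed by a slice, as in the function above
top_by_sort = sorted(candidates, key=score, reverse=True)[:2]

# Partial selection of the two highest-scoring candidates
top_by_heap = heapq.nlargest(2, candidates, key=score)

print([c["Content"] for c in top_by_heap])
```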


In [8]:
from rerankers import Reranker
import pandas as pd

# Initialize the cross-encoder ranker
ranker = Reranker('cross-encoder')

def rerank_with_cross_encoder(user_query, top_candidates_df, top_n=5):
    """
    Reranks the top articles based on the user query using a cross-encoder model.

    Parameters:
    - user_query (str): The user's query.
    - top_candidates_df (DataFrame): DataFrame containing the initial top candidate articles.
    - top_n (int): Number of top articles to return after reranking.

    Returns:
    - List[str]: A list containing the content of the top re-ranked articles.
    """
    # Prepare the documents and their IDs for the ranking function
    docs = top_candidates_df["Content"].tolist()
    doc_ids = list(range(len(docs)))

    # Use the rank method to get scores and ranks for each document
    results = ranker.rank(query=user_query, docs=docs, doc_ids=doc_ids)

    # Extract the content of the top N re-ranked articles based on their ranks
    top_articles = [result.document.text for result in sorted(results.results, key=lambda x: x.rank)[:top_n]]
    
    return top_articles



Loading default cross-encoder model for language en
Default Model: mixedbread-ai/mxbai-rerank-base-v1
Loading TransformerRanker model mixedbread-ai/mxbai-rerank-base-v1 (this message can be suppressed by setting verbose=0)
No device set
Using device cpu
No dtype set
Using dtype torch.float32
Loaded model mixedbread-ai/mxbai-rerank-base-v1
Using device cpu.
Using dtype torch.float32.


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
query_model = AutoModel.from_pretrained("query_model_scidocs").to(device)
document_model = AutoModel.from_pretrained("document_model_scidocs").to(device)
tokenizer = AutoTokenizer.from_pretrained("tokenizer_scidocs")
cosine_sim = CosineSimilarity(dim=1)
kw_model = KeyBERT()
ranker = Reranker('cross-encoder')

# User query for the study plan
user_query = "Can you help me make a study plan for this course?"

# Initialize a list to store rows of data for the DataFrame
results_data = []

for i in range(100):
    # Randomly select a course document
    course_row = course_transformed.sample(n=1).iloc[0]
    course_document = course_row['content']
    
    # Filter the top 500 Wikidata entries related to the course document
    top_500_wikidata = wiki_title_filter_with_course_info(500, course_document, wikidata_transformed)
    
    # Get the top 50 candidates based on similarity scores
    top_50_candidates = get_top_candidates(
        course_document,
        user_query,
        top_500_wikidata,
        tokenizer,
        query_model,
        document_model,
        kw_model,
        50
    )
    
    # get_top_candidates already returns the top 50 sorted by score, so just wrap them in a DataFrame
    top_50_candidates_df = pd.DataFrame(top_50_candidates)
    
    # Rerank the top 50 candidates using the cross-encoder model
    top_5_articles = rerank_with_cross_encoder(user_query, top_50_candidates_df, top_n=5)
    
    # Extract the article titles after "title" for the top 5 articles
    top_5_titles = [article.split("title", 1)[-1].strip() for article in top_5_articles]
    
    # Prepare the row data for this iteration
    row = {
        "Course Information": course_document,
        "Top Article 1": top_5_titles[0] if len(top_5_titles) > 0 else None,
        "Top Article 2": top_5_titles[1] if len(top_5_titles) > 1 else None,
        "Top Article 3": top_5_titles[2] if len(top_5_titles) > 2 else None,
        "Top Article 4": top_5_titles[3] if len(top_5_titles) > 3 else None,
        "Top Article 5": top_5_titles[4] if len(top_5_titles) > 4 else None,
    }
    
    # Append the row data to the results list
    results_data.append(row)


In [None]:
# Convert results to a DataFrame
results_df = pd.DataFrame(results_data)

# Display the DataFrame to check the results
results_df.head(20)

| | Course Information | Top Article 1 | Top Article 2 | Top Article 3 | Top Article 4 | Top Article 5 |
|---|---|---|---|---|---|---|
| 0 | Title: Applied Machine Learning for Business A... | : Rotation model of learning | : Learning pathway | : Learning nugget | : Biological computing | : Regularization perspectives on support-vecto... |
| 1 | Title: Convolutional Neural Networks in Tensor... | : Rotation model of learning | : Motor skill | : Rule based DFM analysis for deep drawing | : Bare machine computing | : Perceptual computing |
| 2 | Title: Developing Android Apps with App Invent... | : Microtechnique | : ISO 10303 Application Modules | : Search activity concept | : Java (programming language) | : List of Java APIs |
| 3 | Title: Analysis for Business Systems \| Descrip... | : System integration testing | : Model for assessment of telemedicine | : History of information technology auditing | : Cognitive infocommunications | : MIPI Debug Architecture |
| 4 | Title: Design of Experiments for Product Desig... | : Microarchitecture simulation | s such as green packaging and environmentally... | : Design closure | : Causal research | : Blockmodeling |
| 5 | Title: Introduction to Containers w/ Docker, K... | : Single-pilot resource management | : Security level management | : Element management system | : Information-centric networking | : Distributed file system for cloud |
| 6 | Title: Classical Cryptosystems and Core Concep... | : Security level management | : Levels of identity security | : Cipher suite | Secure Hash Standard, FIPS PUB 180, by U.S. go... | : BLS digital signature |
| 7 | Title: Pattern Discovery in Data Mining \| Desc... | : Text graph | : KPI-driven code analysis | : Regularization perspectives on support-vecto... | : Mining software repositories | : Data retrieval |
| 8 | Title: Save, Load and Export Models with Keras... | : Circumplex model of group tasks | : Incremental build model | : Bare machine computing | : Energy based model | : Basic Linear Algebra Subprograms |
| 9 | Title: Big Data Integration and Processing \| D... | : Synthetic air data system | : Stub (distributed computing) | : Computing with Memory | : Advanced SCSI Programming Interface | : Relational data stream management system |
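
A few cells in the table above (for example row 4, Top Article 2, and row 6, Top Article 4) hold body text instead of a title, and the other cells keep a leading colon. Both appear consistent with `split("title", 1)` cutting at the first occurrence of the word "title", which can sit inside the article text. A more robust extraction might look like the sketch below. It assumes each flattened article ends with ` | title: <title>`, as produced by the transform cell; the helper name `extract_title` and the sample string are hypothetical.

```python
def extract_title(article, marker=" | title: "):
    """Return the text after the last ' | title: ' marker, or None if it is absent."""
    _, found, title = article.rpartition(marker)
    return title.strip() if found else None


# Made-up flattened article whose body happens to contain the word "title"
article = (
    "text: The title of a standard may appear in the body. "
    "| url: https://example.org/wiki/Example | title: Example standard"
)

original = article.split("title", 1)[-1].strip()  # approach used in the loop above
robust = extract_title(article)

print(original)
print(robust)
```

Splitting from the right on the full marker returns only the title field, whereas the original call returns everything after the first "title" it encounters.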


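### Recall (assuming each course has 10 related articles)

recall = 430 / (10 * 100) = 43%

### Precision

precision = 430 / (5 * 100) = 86%

In both formulas, 430 is the number of retrieved articles counted as relevant across the 100 sampled courses, with 5 articles retrieved per course.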
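
The arithmetic behind the two figures can be checked with a short sketch. The variable names are illustrative; the counts are the ones used in the formulas above. It also computes the highest recall that is reachable when only 5 articles are retrieved per course and 10 are assumed to be relevant.

```python
num_courses = 100
top_k = 5                          # articles retrieved per course
assumed_relevant_per_course = 10   # assumption stated in the recall heading
relevant_retrieved = 430           # numerator used in both formulas

precision = relevant_retrieved / (top_k * num_courses)
recall = relevant_retrieved / (assumed_relevant_per_course * num_courses)

# With only top_k articles retrieved per course, recall cannot exceed this ceiling
recall_ceiling = top_k / assumed_relevant_per_course

print(f"precision = {precision:.0%}")
print(f"recall = {recall:.0%} (ceiling {recall_ceiling:.0%})")
```

Read against that ceiling of 50%, a recall of 43% means the pipeline finds most of the relevant articles it could possibly return in a top-5 list.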

### Benchmark response comparison:

In [3]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load the CSV file
file_path = 'Response.csv'  # Replace with your file path
data = pd.read_csv(file_path)

# Initialize the model for embedding generation
model = SentenceTransformer('all-MiniLM-L6-v2')  # Lightweight model suitable for similarity tasks

# Generate embeddings for each text in the "Response" and "Benchmark" columns
response_embeddings = model.encode(data['Response'].fillna('').tolist())
benchmark_embeddings = model.encode(data['Benchmark'].fillna('').tolist())

# Calculate cosine similarity between "Response" and "Benchmark" embeddings
similarity_scores = [cosine_similarity([resp], [bench])[0][0] 
                     for resp, bench in zip(response_embeddings, benchmark_embeddings)]

# Convert similarity scores to a numpy array for statistical calculations
similarity_scores = np.array(similarity_scores)

# Calculate statistical metrics
average_similarity = np.mean(similarity_scores)
variance_similarity = np.var(similarity_scores)
max_similarity = np.max(similarity_scores)
min_similarity = np.min(similarity_scores)
std_dev_similarity = np.std(similarity_scores)

# Display the statistical metrics
print("Average Similarity Score:", average_similarity)
print("Variance of Similarity Scores:", variance_similarity)
print("Maximum Similarity Score:", max_similarity)
print("Minimum Similarity Score:", min_similarity)
print("Standard Deviation of Similarity Scores:", std_dev_similarity)





Average Similarity Score: 0.9020047
Variance of Similarity Scores: 0.0024638644
Maximum Similarity Score: 0.97262836
Minimum Similarity Score: 0.828565
Standard Deviation of Similarity Scores: 0.04963733
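
For readers who want to see what the summary statistics measure, here is a small pure-Python sketch of pairwise cosine similarity followed by the same statistics. `np.var` and `np.std` use the population formulas by default (`ddof=0`), so `statistics.pvariance` and `statistics.pstdev` are the matching stdlib functions. The toy vectors are made up; the last two constants are the variance and standard deviation printed above, used only to confirm that they are consistent with each other.

```python
import math
import statistics


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy (response, benchmark) embedding pairs, made up for illustration
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [1.0, 1.0]),
    ([0.0, 2.0], [1.0, 1.0]),
]
scores = [cosine(resp, bench) for resp, bench in pairs]

mean_score = statistics.mean(scores)
variance = statistics.pvariance(scores)  # population variance, like np.var
std_dev = statistics.pstdev(scores)      # population std dev, like np.std

print(mean_score, variance, std_dev)

# Values printed by the notebook: the std dev should be the square root of the variance
reported_variance = 0.0024638644
reported_std = 0.04963733
print(abs(math.sqrt(reported_variance) - reported_std) < 1e-6)
```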
