# Comparison of RAG vs No-RAG Performance
**Experiment Setup:**

We evaluated two setups:

- **With RAG:** retrieval-augmented generation over the article content.
- **Without RAG:** the LLM is given the full article directly, with no retrieval step.

Both setups used the same dataset and the same questions, so the results are directly comparable.

Answers were graded by a WatsonX-based grading system that judges whether an answer addresses the question correctly, rather than whether it matches the ground-truth text verbatim.
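Because the comparison only holds if both runs cover exactly the same items, a quick pairing check along these lines can be run before any numbers are compared. This is a minimal sketch: the `article_url` and `question` field names are taken from the analysis code in this notebook, but whether every record carries both is an assumption, and the sample records are made up.

```python
def item_keys(records, key_fields=("article_url", "question")):
    """Return the set of (article, question) keys found in a list of result records."""
    return {tuple(rec.get(field) for field in key_fields) for rec in records}


def pairing_report(rag_records, no_rag_records):
    """Compare two result sets and list the items present in only one of them."""
    rag_keys = item_keys(rag_records)
    no_rag_keys = item_keys(no_rag_records)
    return {
        "shared": len(rag_keys & no_rag_keys),
        "only_in_rag": sorted(rag_keys - no_rag_keys, key=repr),
        "only_in_no_rag": sorted(no_rag_keys - rag_keys, key=repr),
    }


# Hypothetical records, for illustration only (not taken from the dataset).
rag_sample = [
    {"article_url": "https://example.com/a", "question": "What kind of malware is it?"},
    {"article_url": "https://example.com/a", "question": "Is it related to wallet stealer?"},
]
no_rag_sample = [
    {"article_url": "https://example.com/a", "question": "What kind of malware is it?"},
]

report = pairing_report(rag_sample, no_rag_sample)
print(report)
```

An empty `only_in_rag` and `only_in_no_rag` would indicate that the two result files are properly paired.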



# With RAG:
Using retrieval-augmented generation over the article content.
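For readers unfamiliar with the setup, the retrieval step can be pictured roughly as follows. This is a toy sketch only, not the pipeline used in this experiment: the sentence-level chunking, the word-overlap scoring, the `top_k` value, the prompt wording, and the article text are all assumptions chosen for illustration (a real system would typically rank chunks with an embedding model).

```python
import re


def split_into_sentences(text):
    """Split text into sentence-level chunks (a deliberately simple chunking rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, top_k=1):
    """Keep the `top_k` chunks that share the most tokens with the question."""
    question_tokens = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(question_tokens & tokenize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Assemble the text the model would see: retrieved context, then the question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Made-up article text, for illustration only.
article = (
    "The skimmer installs a malicious browser extension on the victim machine. "
    "Stolen card data is forwarded to a server controlled by the attacker. "
    "The campaign mainly targets users in one country."
)
question = "Where is the stolen data forwarded?"

chunks = split_into_sentences(article)
retrieved = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, retrieved)
print(prompt)
```

The model only sees the retrieved chunks. If the relevant passage is not among them, it has nothing to answer from, which may be one reason for RAG responses that begin "The answer cannot be determined" in the sample output below.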

In [27]:
import json
import pandas as pd

# Load the evaluation file
with open('/content/answers_with_eval.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Convert to DataFrame
df = pd.DataFrame(data)

# Resolve the blog URL for each entry: prefer 'article_url', fall back to 'url'
if 'article_url' in df.columns:
    df['blog_url'] = df['article_url']
elif 'url' in df.columns:
    df['blog_url'] = df['url']
else:
    df['blog_url'] = None  # no URL field available; the blog count will be 0

# Basic statistics
num_questions = len(df)
num_blogs = df['blog_url'].nunique()
eval_counts = df['evaluation'].value_counts()
percentages = df['evaluation'].value_counts(normalize=True) * 100

print(f"Total unique blogs: {num_blogs}")
print(f"Total questions evaluated: {num_questions}\n")

# Overall average grade
overall_avg_grade = df['grade'].mean()
print(f"Overall average grade: {overall_avg_grade:.3f}\n")

print("Evaluation Breakdown:")
for label in eval_counts.index:
    print(f" - {label}: {eval_counts[label]} ({percentages[label]:.1f}%)")

# Show side-by-side answers
print("\nSample comparison of RAG vs Ground Truth:")
display(df[['question', 'ground_truth_answer', 'rag_answer', 'evaluation']].head(10))

# Share of answers labeled 'Correct' for each question, highest first
accuracy_per_question = (
    df.groupby('question')['evaluation']
    .apply(lambda x: (x == 'Correct').mean())
    .reset_index()
    .rename(columns={'evaluation': 'accuracy'})
    .sort_values(by='accuracy', ascending=False)
)

print("\nAccuracy per question:")
display(accuracy_per_question)


Total unique blogs: 35
Total questions evaluated: 350

Overall average grade: 0.724

Evaluation Breakdown:
 - Correct: 217 (62.0%)
 - Incorrect: 75 (21.4%)
 - Partially Correct: 57 (16.3%)
 - Invalid: 1 (0.3%)

Sample comparison of RAG vs Ground Truth:


|   | question | ground_truth_answer | rag_answer | evaluation |
|---|---|---|---|---|
| 0 | What kind of malware is it? | It is a credit card skimming malware, specific... | The malware is a type of web-based credit card... | Correct |
| 1 | What kind of attack does it involve? | It involves a web-based credit card skimming a... | The attack involves web-based credit card skim... | Partially Correct |
| 2 | Does the malware using web inject? | Yes, the malware uses web inject to steal sens... | The answer cannot be determined from the provi... | Incorrect |
| 3 | Is there credential theft from the browser? | Yes, the malware is designed to steal form dat... | Yes, the attack steals form data, login creden... | Correct |
| 4 | Does the malware install malicious chrome exte... | Yes, the malware installs malicious browser ex... | Yes, the malware installs a malicious Chrome e... | Correct |
| 5 | Does is steals credit cards or bank information? | The RolandSkimmer threat steals sensitive fina... | It collects the victim's sensitive information... | Correct |
| 6 | Is it related to wallet stealer? | The RolandSkimmer is related to credit card sk... | The answer cannot be determined from the provi... | Incorrect |
| 7 | What kind of information the malware steals? | The malware, known as RolandSkimmer, steals se... | The malware steals sensitive financial data, s... | Partially Correct |
| 8 | Where does it forward the stolen information? | The stolen information is sent to the attacker... | The answer cannot be determined from the provi... | Incorrect |
| 9 | What are the targeted located (regions, banks ... | The threat actor targets users in Bulgaria. Th... | The answer cannot be determined from the provi... | Incorrect |



Accuracy per question:


|   | question | accuracy |
|---|---|---|
| 1 | Does the malware install malicious chrome exte... | 0.857143 |
| 2 | Does the malware using web inject? | 0.8 |
| 4 | Is there credential theft from the browser? | 0.771429 |
| 3 | Is it related to wallet stealer? | 0.742857 |
| 0 | Does is steals credit cards or bank information? | 0.628571 |
| 9 | Where does it forward the stolen information? | 0.628571 |
| 5 | What are the targeted located (regions, banks ... | 0.6 |
| 7 | What kind of information the malware steals? | 0.514286 |
| 8 | What kind of malware is it? | 0.485714 |
| 6 | What kind of attack does it involve? | 0.171429 |
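The percentages in the evaluation breakdown follow directly from the counts, and the counts can also be set against the reported average grade. The sketch below recomputes the percentages and then asks what the mean would be if labels were scored 1 / 0.5 / 0. That scoring is purely a hypothetical assumption for illustration, since the grader's scale is not described here.

```python
# Counts reported in the "With RAG" evaluation breakdown above.
rag_counts = {"Correct": 217, "Incorrect": 75, "Partially Correct": 57, "Invalid": 1}
total = sum(rag_counts.values())

# Percentage of all graded answers that received each label.
percentages = {label: round(100 * n / total, 1) for label, n in rag_counts.items()}
print(total, percentages)

# Hypothetical label-to-score mapping (an assumption, not the grader's actual scale).
assumed_scores = {"Correct": 1.0, "Partially Correct": 0.5, "Incorrect": 0.0, "Invalid": 0.0}
label_based_mean = sum(assumed_scores[label] * n for label, n in rag_counts.items()) / total
print(f"Mean under the assumed mapping: {label_based_mean:.3f}")
```

Under that assumed mapping the mean would be about 0.701 rather than the reported 0.724, so the `grade` field appears to carry finer-grained information than the label alone. This is an inference from the printed numbers, not something stated in the experiment description.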


# Without RAG:
Using the LLM directly on the full article without retrieval.

In [24]:
import json
import pandas as pd

# Load the evaluation file
with open('/content/answers_no_rag_eval.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Convert to DataFrame
df = pd.DataFrame(data)

# Resolve the blog URL for each entry: prefer 'article_url', fall back to 'url'
if 'article_url' in df.columns:
    df['blog_url'] = df['article_url']
elif 'url' in df.columns:
    df['blog_url'] = df['url']
else:
    df['blog_url'] = None  # no URL field available; the blog count will be 0

# Basic statistics
num_questions = len(df)
num_blogs = df['blog_url'].nunique()
eval_counts = df['evaluation'].value_counts()
percentages = df['evaluation'].value_counts(normalize=True) * 100

print(f"Total unique blogs: {num_blogs}")
print(f"Total questions evaluated: {num_questions}\n")

# Overall average grade
overall_avg_grade = df['grade'].mean()
print(f"Overall average grade: {overall_avg_grade:.3f}\n")

print("Evaluation Breakdown:")
for label in eval_counts.index:
    print(f" - {label}: {eval_counts[label]} ({percentages[label]:.1f}%)")

# Share of answers labeled 'Correct' for each question, highest first
accuracy_per_question = (
    df.groupby('question')['evaluation']
    .apply(lambda x: (x == 'Correct').mean())
    .reset_index()
    .rename(columns={'evaluation': 'accuracy'})
    .sort_values(by='accuracy', ascending=False)
)

print("\nAccuracy per question:")
display(accuracy_per_question)


Total unique blogs: 35
Total questions evaluated: 350

Overall average grade: 0.735

Evaluation Breakdown:
 - Correct: 217 (62.0%)
 - Incorrect: 69 (19.7%)
 - Partially Correct: 63 (18.0%)
 - Invalid: 1 (0.3%)

Accuracy per question:


|   | question | accuracy |
|---|---|---|
| 1 | Does the malware install malicious chrome exte... | 0.885714 |
| 2 | Does the malware using web inject? | 0.857143 |
| 4 | Is there credential theft from the browser? | 0.742857 |
| 3 | Is it related to wallet stealer? | 0.742857 |
| 0 | Does is steals credit cards or bank information? | 0.657143 |
| 5 | What are the targeted located (regions, banks ... | 0.571429 |
| 7 | What kind of information the malware steals? | 0.514286 |
| 8 | What kind of malware is it? | 0.485714 |
| 9 | Where does it forward the stolen information? | 0.485714 |
| 6 | What kind of attack does it involve? | 0.257143 |
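To read the two per-question tables against each other, the reported accuracies can be placed side by side. The sketch below uses only the numbers printed above, keyed by the index shown in the first column of each table (the index-to-question mapping is the same in both tables). The variable names are introduced here for illustration.

```python
# Per-question accuracy as printed in the two tables above, keyed by the
# index shown in the first column of each table.
rag_accuracy = {
    0: 0.628571, 1: 0.857143, 2: 0.8, 3: 0.742857, 4: 0.771429,
    5: 0.6, 6: 0.171429, 7: 0.514286, 8: 0.485714, 9: 0.628571,
}
no_rag_accuracy = {
    0: 0.657143, 1: 0.885714, 2: 0.857143, 3: 0.742857, 4: 0.742857,
    5: 0.571429, 6: 0.257143, 7: 0.514286, 8: 0.485714, 9: 0.485714,
}

# Difference per question: positive means RAG scored higher.
delta = {idx: rag_accuracy[idx] - no_rag_accuracy[idx] for idx in rag_accuracy}

for idx, diff in sorted(delta.items(), key=lambda item: item[1], reverse=True):
    print(
        f"question {idx}: RAG {rag_accuracy[idx]:.3f} | "
        f"no RAG {no_rag_accuracy[idx]:.3f} | delta {diff:+.3f}"
    )

mean_rag = sum(rag_accuracy.values()) / len(rag_accuracy)
mean_no_rag = sum(no_rag_accuracy.values()) / len(no_rag_accuracy)
largest_gain = max(delta, key=delta.get)
largest_loss = min(delta, key=delta.get)
print(f"mean accuracy: RAG {mean_rag:.3f}, no RAG {mean_no_rag:.3f}")
print(f"largest RAG gain: question {largest_gain}; largest RAG loss: question {largest_loss}")
```

Both setups average 0.620, matching the identical 62.0% "Correct" share, so the difference lies in which questions are answered correctly. RAG gains most on question 9 ("Where does it forward the stolen information?") and loses most on question 6 ("What kind of attack does it involve?"). Assuming 35 answers per question (350 questions across 10 question types), a gap of about 0.029 corresponds to a single answer, so most of the per-question differences are small.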
