<a href="https://colab.research.google.com/github/yug-sinha/ANSWER-RETRIEVAL-EVALUATOR-SYSTEM/blob/main/ClearFeed(ml_ans).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pandas scikit-learn google-generativeai rouge-score bert-score

In [None]:
import json
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import google.generativeai as genai
from rouge_score import rouge_scorer
from bert_score import score as bert_score

In [None]:
# File paths (hardcoded)
KB_FILE = '/Clearfeed_kb.json'
EVAL_FILE = '/clearfeed_qa_pairs.csv'

Update these paths to match the location of the files after you upload them.
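
Because the paths are hardcoded, a missing upload only surfaces later as an error inside `open`. The following is a minimal sketch of an early check; `require_file` is a hypothetical helper that is not part of the notebook, and it only assumes the standard library.

```python
from pathlib import Path


def require_file(path):
    """Return `path` as a Path, or raise a clear error if the file is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"Expected data file not found: {p}. Upload it or update the path."
        )
    return p
```

It could be called once per path, for example `require_file(KB_FILE)`, before the loading cell runs.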

In [None]:
# Step 1: Load and preprocess the JSON dataset
with open(KB_FILE, 'r') as f:
    data = json.load(f)

# Flatten JSON into a DataFrame
urls, titles, texts = [], [], []

for url, content in data.items():
    urls.append(url)
    titles.append(content['title'])
    texts.append(content['text'])

corpus_df = pd.DataFrame({'url': urls, 'title': titles, 'text': texts})

# Combine title and text for TF-IDF vectorization
corpus_df['content'] = corpus_df['title'] + " " + corpus_df['text']

### **Explanation of the Approach: TF-IDF Vectorization with Cosine Similarity**

1. **TF-IDF Vectorization**:
   - **What It Does**:
     TF-IDF (Term Frequency-Inverse Document Frequency) represents text data as numerical vectors, emphasizing the importance of words in a document relative to the entire corpus.
     - **Term Frequency (TF)**: Measures how frequently a term appears in a document.
     - **Inverse Document Frequency (IDF)**: Down-weights terms that appear in many documents, so rarer, more distinctive terms carry more weight.
   - **Purpose**: Converts textual content into numerical form suitable for similarity calculations.

2. **Cosine Similarity**:
   - **What It Does**:
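     Measures the cosine of the angle between two vectors (here, the query vector and each document vector). Because TF-IDF weights are non-negative, the values here range from 0 (no shared terms) to 1 (same direction); for general vectors the range is -1 to 1.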
   - **Purpose**: Determines how similar the query is to each document in the corpus.
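
The two ideas above can be sketched in plain Python on a toy corpus. This is an illustration under stated assumptions, not the notebook's implementation: the three documents are made up, and the weighting is the textbook `tf * ln(N / df)` form, whereas scikit-learn's `TfidfVectorizer` smooths the IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using raw counts and idf = ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] * idf[term] for term in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up documents for illustration only.
docs = [
    "reset account password",
    "change account email",
    "export billing report",
]
vocab, vectors = tfidf_vectors(docs)
sim_shared = cosine(vectors[0], vectors[1])  # both contain "account"
sim_none = cosine(vectors[0], vectors[2])    # no terms in common
```

Here `sim_shared` is small but positive, because the only shared term also appears in two of the three documents and so has a low IDF, while `sim_none` is exactly zero.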

---

### **How It's Used in the Code**:

- **Step 1**: `TfidfVectorizer` is initialized and fit on the `corpus_df['content']`, creating a matrix (`tfidf_matrix`) where each row represents a document's vector.
- **Step 2**: When a question is input by the user, it is transformed into a query vector using the same TF-IDF vectorizer.
- **Step 3**: Cosine similarity is computed between the query vector and all document vectors in the `tfidf_matrix`.
- **Step 4**: The documents are ranked based on similarity scores, and the top `k` URLs (default is 5) are retrieved and displayed.

This approach retrieves the most relevant URLs based on weighted term overlap between the question and each document. It does not capture synonyms or paraphrases that share no terms with a document.
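
The ranking in Step 4 can be sketched without NumPy. One design point worth noting: selecting the last `k` entries of a full sort returns `k` documents even when some of them have a similarity of zero. The sketch below adds a hypothetical `min_score` cutoff (an assumption, not something the notebook does) so that documents with no overlap are dropped.

```python
def top_k_indices(scores, k=5, min_score=0.0):
    """Indices of the k highest scores, best first, skipping scores <= min_score.

    `min_score` is a hypothetical cutoff added for illustration.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if scores[i] > min_score][:k]


# Made-up similarity scores for six documents.
scores = [0.10, 0.00, 0.42, 0.05, 0.00, 0.31]
best_three = top_k_indices(scores, k=3)
best_five = top_k_indices(scores, k=5)
```

With these scores, `best_three` is `[2, 5, 0]`, and `best_five` contains only four indices because two documents score zero.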

In [None]:
# Step 2: Build a TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus_df['content'])

def retrieve_top_k_urls(question, top_k=5):
    """
    Retrieve the `top_k` most similar URLs (default 5) for a given question.
    """
    query_vec = tfidf_vectorizer.transform([question])
    scores = cosine_similarity(query_vec, tfidf_matrix)
    top_indices = scores[0].argsort()[-top_k:][::-1]
    return corpus_df.iloc[top_indices][['url', 'title']].reset_index(drop=True)

# 1. Take a question input from the user
question = input("Enter your question: ")
top_results = retrieve_top_k_urls(question)

# Format and print the top URLs
print(f"\nTop {len(top_results)} URLs:")
for idx, row in top_results.iterrows():
    print(f"{idx + 1}) {row['url']}")

In [None]:
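# Configure the Google Generative AI SDK.
# Never commit an API key to the notebook: read it from the environment,
# or prompt for it when the variable is not set.
import os
from getpass import getpass

GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY") or getpass("Enter your Gemini API key: ")
genai.configure(api_key=GEMINI_API_KEY)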

# Load the JSON knowledge base (reuses KB_FILE defined above)
with open(KB_FILE, 'r') as f:
    knowledge_base = json.load(f)

def generate_answer_from_gemini(question, top_results):
    """
    Generate an answer using the Google Gemini API, with the retrieved top results
    from the JSON knowledge base as context.
    """
    # Prepare the context from the retrieved results
    context = "\n\n".join(
        f"{row['title']}:\n{knowledge_base[row['url']]['text']}" for _, row in top_results.iterrows()
    )

    # Prepare the prompt
    prompt = f"Using the following knowledge base, answer the question:\n\n{context}\n\nQuestion: {question}\n\nAnswer:"

    # Configure and start a chat session
    model = genai.GenerativeModel(
        model_name="gemini-1.5-flash",  # Use the correct model name
        generation_config={
            "temperature": 0.7,
            "top_p": 0.9,
            "top_k": 40,
            "max_output_tokens": 150,
            "response_mime_type": "text/plain"
        }
    )
    chat_session = model.start_chat(history=[])
    response = chat_session.send_message(prompt)

    # Return the response text
    return response.text if response else "No response from Gemini API"

# Generate the answer using only the top 5 results
generated_answer = generate_answer_from_gemini(question, top_results)
print("Generated Answer:")
print(generated_answer)

In [None]:
def evaluate_model(question):
    """
    Score the generated answer against the retrieved context using ROUGE-L.

    Reads the module-level `top_results` and `generated_answer`;
    `question` is currently unused.
    """
    # Initialize the ROUGE scorer
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)  # Using ROUGE-L for evaluation

    # Combine the content of the top results
    context = "\n\n".join(
        f"{row['title']}:\n{knowledge_base[row['url']]['text']}" for _, row in top_results.iterrows()
    )

    # Score the generated answer (prediction) against the combined content (target)
    score = scorer.score(context, generated_answer)

    return score['rougeL'].fmeasure  # ROUGE-L F1 score

# Evaluate the system and print the results
print("\nEvaluating the model's performance...")
rougeL_score = evaluate_model(question)
print("\nROUGE-L Score:", rougeL_score)

### **Explanation of the Approach: ROUGE-L Evaluation**

1. **ROUGE-L Metric**:
   - **What It Does**: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of overlap metrics. ROUGE-L scores the longest common subsequence (LCS) of words shared by the generated text and the reference text.
     - **Precision**: LCS length divided by the length of the generated text.
     - **Recall**: LCS length divided by the length of the reference text.
     - **F1-Score**: The harmonic mean of precision and recall.
   - **Purpose**: Measures how much of the reference's wording appears, in the same order, in the generated text. It does not assess fluency or correctness directly.

2. **Advantages**:
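   - Rewards words that appear in the same order without requiring them to be contiguous.
   - Needs no fixed n-gram length, so it can be applied to texts of varying length, including long summaries.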
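
The LCS-based definitions above can be sketched in plain Python. This is a simplified illustration, not the `rouge-score` package's implementation: it splits on whitespace and applies no stemming, and the example sentences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Return (precision, recall, f1) based on the LCS of whitespace tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up texts: a short candidate fully contained in a longer reference.
reference = "open settings then click integrations and select the workspace"
candidate = "click integrations"
precision, recall, f1 = rouge_l(reference, candidate)
```

In this example precision is 1.0 but recall is only 2/9, so F1 is 4/11 (about 0.36). The same effect applies whenever a short answer is scored against a much longer reference text.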

---

### **How It's Used in the Code**:

- **Step 1**: The `RougeScorer` is initialized with `rougeL` and configured to use stemming to normalize terms.  
- **Step 2**: The context is prepared by concatenating the titles and corresponding text of the top retrieved results.  
- **Step 3**: The `generated_answer` is compared with the combined `context` using ROUGE-L, calculating the LCS overlap.  
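- **Step 4**: The `F1-Score` from ROUGE-L is returned as the evaluation metric. Because the combined context is typically much longer than the answer (which is capped at 150 output tokens), recall, and therefore F1, tends to be low even when the answer is drawn entirely from the context.

This implementation uses ROUGE-L to measure how much of the generated answer's wording is drawn from the retrieved context. It is a proxy for grounding, not a comparison against a reference answer.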

In [None]:
def evaluate_model_with_factual_consistency(question):
    """
    Score the generated answer against the retrieved context using BERTScore,
    used here as a proxy for factual consistency.

    Reads the module-level `top_results` and `generated_answer`;
    `question` is currently unused.
    """
    # Combine the content of the top results
    context = "\n\n".join(
        f"{row['title']}:\n{knowledge_base[row['url']]['text']}" for _, row in top_results.iterrows()
    )

    # BERTScore with the generated answer as candidate and the combined content as reference
    P, R, F1 = bert_score([generated_answer], [context], lang="en")  # Ensure the language matches
    return F1.mean().item()  # Mean F1 over the single candidate/reference pair

# Evaluate the system and print the results
print("\nEvaluating the model's performance with factual consistency...")
factual_consistency_score = evaluate_model_with_factual_consistency(question)
print("\nFactual Consistency (BERTScore F1):", factual_consistency_score)

### **Explanation of the Approach: BERTScore Evaluation**

1. **BERTScore Metric**:
   - **What It Does**: BERTScore evaluates the semantic similarity between two pieces of text using contextual token embeddings from a pre-trained transformer model. Each token in one text is matched to its most similar token in the other (by cosine similarity), and the matches are aggregated into precision, recall, and F1 scores.
     - **Precision**: The average best-match similarity of each generated-answer token against the reference.
     - **Recall**: The average best-match similarity of each reference token against the generated answer.
     - **F1-Score**: The harmonic mean of precision and recall, indicating overall similarity.
   - **Purpose**: Unlike ROUGE, which counts overlapping words, BERTScore also rewards paraphrases. High semantic similarity suggests, but does not guarantee, that the generated content is factually correct.

2. **Advantages**:
   - Captures deeper semantic meaning by using contextual embeddings from BERT.
   - Handles synonymy and paraphrasing better than surface-level overlap metrics.
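
The greedy matching behind these scores can be sketched in plain Python. This is an illustration under stated assumptions: the two-dimensional "embeddings" are made up, whereas the real library obtains them from a transformer model, and the optional IDF weighting and baseline rescaling are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    """BERTScore-style (precision, recall, f1) from per-token embeddings."""
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(row) for row in sim) / len(cand_vecs)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_vecs)))
        for j in range(len(ref_vecs))
    ) / len(ref_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up token embeddings: the reference has one token the candidate lacks.
candidate_vecs = [[1.0, 0.0], [0.0, 1.0]]
reference_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
precision, recall, f1 = greedy_match_scores(candidate_vecs, reference_vecs)
```

Every candidate token has an exact match, so precision is 1.0. The third reference token has no good match, which pulls recall down to 2/3 and F1 to 0.8.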

---

### **How It's Used in the Code**:

- **Step 1**: The context is prepared by concatenating the titles and text of the top relevant URLs from the knowledge base.  
- **Step 2**: The `generated_answer` is compared with the context using BERTScore. It calculates the precision, recall, and F1 score for the semantic similarity between the two texts.
- **Step 3**: The mean F1 score is returned and used as a proxy for factual consistency, that is, how closely the generated answer aligns semantically with the retrieved context.

This approach uses BERTScore to assess how semantically aligned the model's generated answer is with the provided content. A high score indicates that the answer stays close to the retrieved content; it does not by itself verify that the response is accurate.