# Question Answering System
## Similarity-based Model Training + Conditional Answering + Simple RAG System

This notebook demonstrates:
1. **Similarity-based Question Answering**: Finding the most relevant documents using TF-IDF and cosine similarity
2. **Conditional Answering**: Returning an answer only when its similarity score exceeds a threshold
3. **Simple RAG System**: A retrieval-augmented pipeline that retrieves the top-k documents and extracts an answer from the best match
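
As a rough illustration of the first point, the sketch below scores two toy documents against a query using hand-rolled TF-IDF weights and cosine similarity. It is a simplified stand-in, not the scikit-learn code used in the cells that follow: the tokenizer, the toy documents, and all function names are assumptions made for this example, and the idf is computed over the documents plus the query for brevity.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Hypothetical, simplified tokenizer: lowercase alphabetic runs only
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(texts):
    """Return one {term: weight} dict per text, weight = tf * smoothed idf."""
    token_lists = [tokenize(t) for t in texts]
    n = len(token_lists)
    # Document frequency: number of texts each term appears in
    df = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy data, invented for this sketch
toy_docs = [
    "Deep learning uses neural networks",
    "Python is a programming language",
]
query = "neural networks"

vectors = tfidf_vectors(toy_docs + [query])
query_vec = vectors[-1]
scores = [cosine(query_vec, doc_vec) for doc_vec in vectors[:-1]]
best = max(range(len(scores)), key=scores.__getitem__)
print(f"Best document index: {best}, scores: {[round(s, 3) for s in scores]}")
```

The first document shares both query terms and so scores highest; the second shares none and scores zero.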

In [None]:
# Import required libraries (only those the notebook actually uses)
import string

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

print("Libraries imported successfully!")

In [None]:
# Knowledge Base: Documents for Question Answering
documents = [
    "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
    "Python is a popular programming language used for data science, web development, and artificial intelligence.",
    "Natural language processing involves teaching computers to understand and process human language.",
    "Deep learning uses neural networks with multiple layers to solve complex problems like image recognition.",
    "Data science combines statistics, programming, and domain expertise to extract insights from data.",
    "Computer vision enables machines to interpret and understand visual information from the world.",
    "Supervised learning uses labeled data to train models that can make predictions on new data.",
    "Unsupervised learning finds hidden patterns in data without using labeled examples.",
    "Reinforcement learning trains agents to make decisions by learning from rewards and penalties.",
    "Big data refers to large, complex datasets that require special tools and techniques to process."
]

# Sample Question-Answer pairs for training
qa_pairs = [
    {"question": "What is machine learning?", "answer": "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data."},
    {"question": "Which programming language is popular for AI?", "answer": "Python is a popular programming language used for data science, web development, and artificial intelligence."},
    {"question": "What does NLP stand for?", "answer": "Natural language processing involves teaching computers to understand and process human language."},
    {"question": "How does deep learning work?", "answer": "Deep learning uses neural networks with multiple layers to solve complex problems like image recognition."},
    {"question": "What is supervised learning?", "answer": "Supervised learning uses labeled data to train models that can make predictions on new data."}
]

print(f"Knowledge base created with {len(documents)} documents")
print(f"Training data contains {len(qa_pairs)} Q&A pairs")

In [None]:
# Text Preprocessing Functions
def preprocess_text(text):
    """Clean and preprocess text for better similarity matching"""
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

def extract_keywords(text):
    """Extract important keywords from text"""
    # Simple keyword extraction (remove common stop words)
    stop_words = {'is', 'are', 'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by'}
    words = preprocess_text(text).split()
    keywords = [word for word in words if word not in stop_words and len(word) > 2]
    return keywords

# Test preprocessing
sample_question = "What is machine learning?"
print(f"Original: {sample_question}")
print(f"Preprocessed: {preprocess_text(sample_question)}")
print(f"Keywords: {extract_keywords(sample_question)}")

In [None]:
# Similarity-based Question Answering Model
class SimilarityQA:
    def __init__(self, threshold=0.3):
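        # Drop English stop words so function words such as "what" and "is"
        # cannot produce spurious matches on their own
        self.vectorizer = TfidfVectorizer(stop_words='english')
        self.documents = []
        self.qa_pairs = []
        self.doc_vectors = None
        self.qa_vectors = None
        self.threshold = threshold
        
    def train(self, documents, qa_pairs):
        """Train the model with documents and Q&A pairs"""
        self.documents = documents
        self.qa_pairs = qa_pairs
        
        # Fit the vocabulary on documents and stored questions together, so that
        # terms found only in the questions (e.g. "NLP") are not silently dropped
        questions = [pair['question'] for pair in qa_pairs]
        self.vectorizer.fit(list(documents) + questions)
        
        # Create vectors for documents
        self.doc_vectors = self.vectorizer.transform(documents)
        
        # Create vectors for questions in Q&A pairs
        self.qa_vectors = self.vectorizer.transform(questions)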
        
        print(f"Model trained with {len(documents)} documents and {len(qa_pairs)} Q&A pairs")
    
    def find_best_answer(self, question):
        """Find the best answer using similarity matching"""
        # Vectorize the input question
        question_vector = self.vectorizer.transform([question])
        
        # Calculate similarity with the stored Q&A questions (direct question matching)
        qa_similarities = cosine_similarity(question_vector, self.qa_vectors)[0]
        best_qa_idx = np.argmax(qa_similarities)
        best_qa_score = qa_similarities[best_qa_idx]
        
        # Calculate similarity with documents (retrieval)
        doc_similarities = cosine_similarity(question_vector, self.doc_vectors)[0]
        best_doc_idx = np.argmax(doc_similarities)
        best_doc_score = doc_similarities[best_doc_idx]
        
        # Conditional answering based on threshold
        if best_qa_score > self.threshold:
            return {
                'answer': self.qa_pairs[best_qa_idx]['answer'],
                'confidence': best_qa_score,
                'source': 'QA_pairs',
                'method': 'exact_match'
            }
        elif best_doc_score > self.threshold:
            return {
                'answer': self.documents[best_doc_idx],
                'confidence': best_doc_score,
                'source': 'documents',
                'method': 'similarity_retrieval'
            }
        else:
            return {
                'answer': "Sorry, I couldn't find a relevant answer to your question.",
                'confidence': max(best_qa_score, best_doc_score),
                'source': 'none',
                'method': 'below_threshold'
            }

# Initialize and train the model
qa_model = SimilarityQA(threshold=0.2)
qa_model.train(documents, qa_pairs)

In [None]:
# Test Similarity-based Question Answering
test_questions = [
    "What is machine learning?",  # Exact match in Q&A pairs
    "Tell me about deep learning",  # Should find from documents
    "What is Python used for?",  # Similar to Q&A pair
    "How does computer vision work?",  # Should find from documents
    "What is quantum computing?"  # No relevant answer
]

print("=" * 60)
print("SIMILARITY-BASED QUESTION ANSWERING RESULTS")
print("=" * 60)

for i, question in enumerate(test_questions, 1):
    print(f"\n{i}. Question: {question}")
    result = qa_model.find_best_answer(question)
    print(f"   Answer: {result['answer']}")
    print(f"   Confidence: {result['confidence']:.3f}")
    print(f"   Source: {result['source']}")
    print(f"   Method: {result['method']}")
    print("-" * 50)

In [None]:
# Simple RAG (Retrieval-Augmented Generation) System
class SimpleRAG:
    def __init__(self, top_k=3):
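        # Drop English stop words so function words alone cannot make a document look relevant
        self.vectorizer = TfidfVectorizer(stop_words='english')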
        self.documents = []
        self.doc_vectors = None
        self.top_k = top_k
        
    def add_documents(self, documents):
        """Add documents to the knowledge base"""
        self.documents = documents
        self.doc_vectors = self.vectorizer.fit_transform(documents)
        print(f"RAG system loaded with {len(documents)} documents")
    
    def retrieve_documents(self, query):
        """Retrieve top-k most relevant documents"""
        query_vector = self.vectorizer.transform([query])
        similarities = cosine_similarity(query_vector, self.doc_vectors)[0]
        
        # Get top-k most similar documents
        top_indices = np.argsort(similarities)[::-1][:self.top_k]
        retrieved_docs = []
        
        for idx in top_indices:
            if similarities[idx] > 0:  # Only include relevant documents
                retrieved_docs.append({
                    'document': self.documents[idx],
                    'similarity': similarities[idx],
                    'index': idx
                })
        
        return retrieved_docs
    
    def generate_answer(self, query, retrieved_docs):
        """Generate answer based on retrieved documents"""
        if not retrieved_docs:
            return "No relevant documents found for your query."
        
        # Simple answer generation using the most relevant document
        best_doc = retrieved_docs[0]
        
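        # Extract candidate sentences by splitting on '.'; adequate for the
        # single-sentence documents here, but it would break on abbreviations or decimals
        sentences = best_doc['document'].split('.')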
        query_keywords = extract_keywords(query)
        
        # Find sentence with most keyword matches
        best_sentence = ""
        max_matches = 0
        
        for sentence in sentences:
            if sentence.strip():
                sentence_keywords = extract_keywords(sentence)
                matches = len(set(query_keywords) & set(sentence_keywords))
                if matches > max_matches:
                    max_matches = matches
                    best_sentence = sentence.strip()
        
        if best_sentence:
            return best_sentence + "."
        else:
            return best_doc['document']
    
    def answer_question(self, query):
        """Complete RAG pipeline: retrieve + generate"""
        # Step 1: Retrieve relevant documents
        retrieved_docs = self.retrieve_documents(query)
        
        # Step 2: Generate answer
        answer = self.generate_answer(query, retrieved_docs)
        
        return {
            'query': query,
            'answer': answer,
            'retrieved_documents': len(retrieved_docs),
            'top_similarity': retrieved_docs[0]['similarity'] if retrieved_docs else 0
        }

# Initialize RAG system
rag_system = SimpleRAG(top_k=3)
rag_system.add_documents(documents)

In [None]:
# Test RAG System
rag_test_questions = [
    "What is artificial intelligence?",
    "How does machine learning work?",
    "Tell me about neural networks",
    "What is the difference between supervised and unsupervised learning?",
    "How is Python used in data science?"
]

print("=" * 60)
print("RAG SYSTEM QUESTION ANSWERING RESULTS")
print("=" * 60)

for i, question in enumerate(rag_test_questions, 1):
    print(f"\n{i}. Question: {question}")
    result = rag_system.answer_question(question)
    print(f"   Answer: {result['answer']}")
    print(f"   Retrieved docs: {result['retrieved_documents']}")
    print(f"   Top similarity: {result['top_similarity']:.3f}")
    print("-" * 50)

In [None]:
# Hybrid Question Answering System (Similarity + RAG)
class HybridQA:
    def __init__(self, similarity_threshold=0.3, rag_top_k=3):
        self.similarity_qa = SimilarityQA(threshold=similarity_threshold)
        self.rag_system = SimpleRAG(top_k=rag_top_k)
        
    def train(self, documents, qa_pairs):
        """Train both systems"""
        self.similarity_qa.train(documents, qa_pairs)
        self.rag_system.add_documents(documents)
        
    def answer_question(self, question):
        """Hybrid approach: try similarity first, then RAG"""
        # Try similarity-based approach first
        similarity_result = self.similarity_qa.find_best_answer(question)
        
        # Use the direct result only above 0.4, a stricter bar than the model's own
        # answer threshold; weaker matches fall through to RAG
        if similarity_result['source'] in ['QA_pairs', 'documents'] and similarity_result['confidence'] > 0.4:
            return {
                'question': question,
                'answer': similarity_result['answer'],
                'method': 'similarity_based',
                'confidence': similarity_result['confidence'],
                'source': similarity_result['source']
            }
        
        # Otherwise, use RAG approach
        rag_result = self.rag_system.answer_question(question)
        
        return {
            'question': question,
            'answer': rag_result['answer'],
            'method': 'rag_based',
            'confidence': rag_result['top_similarity'],
            # Report 'none' when retrieval came back empty
            'source': 'retrieved_documents' if rag_result['retrieved_documents'] else 'none'
        }

# Initialize Hybrid QA system
hybrid_qa = HybridQA(similarity_threshold=0.2, rag_top_k=3)
hybrid_qa.train(documents, qa_pairs)

In [None]:
# Test Hybrid QA System
hybrid_test_questions = [
    "What is machine learning?",  # Should use similarity (exact match)
    "Explain deep learning concepts",  # Should use RAG
    "How does computer vision work?",  # Should use RAG
    "What programming languages are used for AI?",  # Should use similarity
]

print("=" * 70)
print("HYBRID QUESTION ANSWERING SYSTEM RESULTS")
print("=" * 70)

for i, question in enumerate(hybrid_test_questions, 1):
    print(f"\n{i}. Question: {question}")
    result = hybrid_qa.answer_question(question)
    print(f"   Answer: {result['answer']}")
    print(f"   Method: {result['method']}")
    print(f"   Confidence: {result['confidence']:.3f}")
    print(f"   Source: {result['source']}")
    print("-" * 60)

In [None]:
# Performance Evaluation and Comparison
def evaluate_qa_systems():
    """Compare all three QA approaches"""
    evaluation_questions = [
        "What is machine learning?",
        "Tell me about Python programming",
        "How does deep learning work?",
        "What is reinforcement learning?",
        "Explain big data concepts"
    ]
    
    results = []
    
    for question in evaluation_questions:
        # Test similarity-based
        sim_result = qa_model.find_best_answer(question)
        
        # Test RAG
        rag_result = rag_system.answer_question(question)
        
        # Test hybrid
        hybrid_result = hybrid_qa.answer_question(question)
        
        results.append({
            'question': question,
            'similarity_confidence': sim_result['confidence'],
            'rag_confidence': rag_result['top_similarity'],
            'hybrid_method': hybrid_result['method'],
            'hybrid_confidence': hybrid_result['confidence']
        })
    
    return results

# Run evaluation
evaluation_results = evaluate_qa_systems()

# Build the header first so the rules match the table width
# ('similarity_based' is 16 characters, so that column is 17 wide)
header = f"{'Question':<35} {'Similarity':<12} {'RAG':<12} {'Hybrid Method':<17} {'Hybrid Conf':<12}"
rule_width = len(header)

print("=" * rule_width)
print("PERFORMANCE COMPARISON OF QA SYSTEMS")
print("=" * rule_width)
print(header)
print("-" * rule_width)

for result in evaluation_results:
    question = result['question']
    # Truncate long questions so the columns stay aligned
    question_short = question[:30] + "..." if len(question) > 30 else question
    print(f"{question_short:<35} {result['similarity_confidence']:<12.3f} {result['rag_confidence']:<12.3f} {result['hybrid_method']:<17} {result['hybrid_confidence']:<12.3f}")

print("\n" + "=" * rule_width)

In [None]:
# Interactive Demo Function
def interactive_qa_demo():
    """Interactive function to test QA systems"""
    
    def ask_question(question, system='hybrid'):
        """Ask a question to the specified QA system"""
        print(f"\nQuestion: {question}")
        print("=" * 50)
        
        if system == 'similarity' or system == 'all':
            print("SIMILARITY-BASED ANSWER:")
            sim_result = qa_model.find_best_answer(question)
            print(f"Answer: {sim_result['answer']}")
            print(f"Confidence: {sim_result['confidence']:.3f} | Source: {sim_result['source']}")
            
        if system == 'rag' or system == 'all':
            print("\nRAG-BASED ANSWER:")
            rag_result = rag_system.answer_question(question)
            print(f"Answer: {rag_result['answer']}")
            print(f"Confidence: {rag_result['top_similarity']:.3f}")
            
        if system == 'hybrid' or system == 'all':
            print("\nHYBRID ANSWER:")
            hybrid_result = hybrid_qa.answer_question(question)
            print(f"Answer: {hybrid_result['answer']}")
            print(f"Method: {hybrid_result['method']} | Confidence: {hybrid_result['confidence']:.3f}")
    
    return ask_question

# Create demo function
ask_question = interactive_qa_demo()

# Demo examples
print("INTERACTIVE QA DEMO")
print("=" * 60)

# Test with different types of questions
ask_question("What is artificial intelligence?", "all")
ask_question("How does machine learning differ from traditional programming?", "hybrid")

## Summary and Key Insights

### Three Question Answering Approaches Implemented:

1. **Similarity-based QA Model**:
   - Uses TF-IDF vectorization and cosine similarity
   - Matches questions against stored Q&A pairs first, then against documents
   - Implements conditional answering: returns a fallback message when neither score exceeds the threshold
   - Best for: Exact or near-exact question matches

2. **Simple RAG System**:
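   - Retrieval-Augmented Generation approach, with sentence extraction standing in for the generation step
   - Retrieves the top-k documents that have non-zero similarity to the query
   - Answers by extracting the sentence from the top document that shares the most keywords with the query; no language model is involved
   - Best for: Questions phrased differently from the stored Q&A pairs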

3. **Hybrid QA System**:
   - Combines both similarity and RAG approaches
   - Uses the similarity-based answer when its confidence exceeds 0.4
   - Falls back to RAG for lower-confidence questions
   - Best for: Broader coverage than either approach alone
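
The routing between the two paths can be sketched as a small pure function. This is a minimal sketch that mirrors the decision order described above, assuming an answer threshold of 0.2 and a direct-answer bar of 0.4; the function name, parameter names, and the example scores are hypothetical and not part of the notebook's classes.

```python
def choose_method(qa_score, doc_score, answer_threshold=0.2, direct_threshold=0.4):
    """Return the path a hybrid system of this shape would take.

    qa_score:  best cosine similarity against stored questions
    doc_score: best cosine similarity against documents
    """
    # The similarity model prefers a Q&A match, then a document match
    if qa_score > answer_threshold:
        direct_confidence = qa_score
    elif doc_score > answer_threshold:
        direct_confidence = doc_score
    else:
        direct_confidence = None  # the similarity model abstained

    # Only a sufficiently confident direct answer is used as-is
    if direct_confidence is not None and direct_confidence > direct_threshold:
        return "similarity_based"
    return "rag_based"

# Hypothetical scores, chosen only to exercise each branch
print(choose_method(0.9, 0.1))  # strong Q&A match
print(choose_method(0.1, 0.5))  # no Q&A match, strong document match
print(choose_method(0.3, 0.9))  # weak Q&A match is picked first, so RAG takes over
print(choose_method(0.0, 0.0))  # nothing matched
```

The third call shows a consequence of checking Q&A pairs first: a weak Q&A match above the answer threshold hides a stronger document match from the direct path, and the question is routed to RAG instead.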

### Key Features:
- ✅ **Conditional Answering**: The similarity model returns a "couldn't find a relevant answer" message for low-confidence queries
- ✅ **Text Preprocessing**: Lowercasing and punctuation removal before keyword matching
- ✅ **Confidence Scoring**: Reports the cosine similarity behind each answer
- ✅ **Multiple Retrieval Methods**: Q&A-pair matching + document similarity search
- ✅ **Performance Evaluation**: Compares confidence scores across the three approaches
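
Conditional answering reduces to a threshold check on the best-scoring candidate. The following is a minimal sketch under that assumption; `answer_or_abstain` and the candidate list are hypothetical names and data for illustration, while the fallback message is the one used in this notebook.

```python
FALLBACK = "Sorry, I couldn't find a relevant answer to your question."

def answer_or_abstain(scored_answers, threshold=0.3):
    """Pick the best (answer, score) pair, or abstain if it is not confident enough.

    scored_answers: list of (answer_text, similarity_score) tuples
    Returns (answer_text, score).
    """
    if not scored_answers:
        return FALLBACK, 0.0
    answer, score = max(scored_answers, key=lambda pair: pair[1])
    # Strict comparison: a score equal to the threshold still abstains
    if score > threshold:
        return answer, score
    return FALLBACK, score

# Hypothetical candidates and scores
candidates = [
    ("Supervised learning uses labeled data.", 0.62),
    ("Big data refers to large datasets.", 0.05),
]
print(answer_or_abstain(candidates))
print(answer_or_abstain(candidates, threshold=0.7))
```

Raising the threshold trades coverage for precision: the second call abstains even though a candidate exists, because its score does not clear the stricter bar.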

### Real-world Applications:
- Customer support chatbots
- Educational Q&A systems
- Documentation search engines
- Knowledge base assistants