# ⬡ ScholarSynth | RAG Setup
**Search. Synthesize. Succeed.**

### 📋 Implementation Plan
- Environment setup and imports
- ArXiv API integration and paper retrieval
- Text chunking implementation for academic papers
- Qdrant vector database setup
- Paper indexing and semantic search
- Basic question-answering interface
- Vibe checking and initial evaluation


In [2]:
# Cell 1: Environment Setup and Imports
import os
import sys
import numpy as np
import pandas as pd
from typing import List, Dict, Any
from dotenv import load_dotenv
import warnings
warnings.filterwarnings('ignore')

# Load environment variables
load_dotenv()

# LangChain imports (split packages: langchain-openai, langchain-text-splitters, langchain-core)
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.documents import Document

# ArXiv API
import arxiv

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

print('✅ Environment setup complete!')


✅ Environment setup complete!


In [3]:
# Cell 2: ArXiv API Integration and Paper Retrieval
class ArXivPaperRetriever:
    """
    ArXiv paper retriever for academic research.
    Implements AIE8 Week 2 concepts: Data collection and API integration.
    """
    
    def __init__(self, max_results: int = 10):
        self.max_results = max_results
        
    def search_papers(self, query: str, max_results: int = None) -> List[Dict[str, Any]]:
        """
        Search for papers on ArXiv based on query.
        
        Args:
            query: Search query for papers
            max_results: Maximum number of papers to retrieve
            
        Returns:
            List of paper dictionaries with metadata
        """
        if max_results is None:
            max_results = self.max_results
            
        try:
            # Create ArXiv search
            search = arxiv.Search(
                query=query,
                max_results=max_results,
                sort_by=arxiv.SortCriterion.Relevance
            )
            
            papers = []
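            # Search.results() is deprecated in recent arxiv releases; iterate via a Client instead
            for result in arxiv.Client().results(search):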
                paper = {
                    'title': result.title,
                    'authors': [author.name for author in result.authors],
                    'abstract': result.summary,
                    'published': result.published,
                    'arxiv_id': result.entry_id.split('/')[-1],
                    'categories': result.categories,
                    'pdf_url': result.pdf_url,
                    'doi': result.doi
                }
                papers.append(paper)
                
            print(f'✅ Retrieved {len(papers)} papers for query: "{query}"')
            return papers
            
        except Exception as e:
            print(f'❌ Error retrieving papers: {e}')
            return []

# Initialize the retriever
arxiv_retriever = ArXivPaperRetriever(max_results=5)

# Test the retriever with a sample query
test_query = "machine learning transformer architecture"
sample_papers = arxiv_retriever.search_papers(test_query)

print(f"\n📚 Sample Papers Retrieved:")
for i, paper in enumerate(sample_papers[:2], 1):
    print(f"{i}. {paper['title']}")
    print(f"   Published: {paper['published'].strftime('%Y-%m-%d')} | Categories: {', '.join(paper['categories'][:2])}")


✅ Retrieved 5 papers for query: "machine learning transformer architecture"

📚 Sample Papers Retrieved:
1. Differentiable Neural Architecture Transformation for Reproducible Architecture Improvement
   Published: 2020-06-15 | Categories: cs.LG, cs.CV
2. Transfer NAS: Knowledge Transfer between Search Spaces with Transformer Agents
   Published: 2019-06-19 | Categories: cs.LG, stat.ML
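
The free-text query above returned Neural Architecture Search papers rather than papers about the transformer architecture itself. One possible refinement is to restrict terms to specific fields using the arXiv query prefixes `ti:` (title), `abs:` (abstract), and `cat:` (category). The sketch below only builds the query string; `build_arxiv_query` is a hypothetical helper, not part of the notebook or the `arxiv` package, and the resulting string would be passed as `query` to `search_papers`.

```python
from typing import List, Optional


def build_arxiv_query(title_terms: List[str],
                      abstract_terms: Optional[List[str]] = None,
                      category: Optional[str] = None) -> str:
    """Compose a fielded arXiv query string (hypothetical helper)."""
    # Each title term becomes a quoted ti: clause
    clauses = [f'ti:"{term}"' for term in title_terms]
    # Optional abstract terms become quoted abs: clauses
    clauses += [f'abs:"{term}"' for term in (abstract_terms or [])]
    # Optional category filter, e.g. cs.LG
    if category:
        clauses.append(f"cat:{category}")
    # Require every clause to match
    return " AND ".join(clauses)


query = build_arxiv_query(["transformer"], ["attention"], category="cs.LG")
print(query)  # ti:"transformer" AND abs:"attention" AND cat:cs.LG
```

Whether a fielded query improves relevance for a given topic is an empirical question; comparing the retrieved titles against the free-text results is a reasonable first check.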


In [4]:
# Cell 3: Text Chunking Implementation
class AcademicTextSplitter:
    """
    Text splitter optimized for academic papers.
    Implements AIE8 Week 2 concepts: Document chunking and preprocessing.
    """
    
    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.text_splitter = CharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separator='\n\n'
        )
    
    def split_paper(self, paper: Dict[str, Any]) -> List[Document]:
        """
        Split a paper into chunks for vector storage.
        
        Args:
            paper: Paper dictionary with title, abstract, etc.
            
        Returns:
            List of Document objects
        """
        # Combine title and abstract for chunking
        full_text = f"Title: {paper['title']}\n\nAbstract: {paper['abstract']}"
        
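        # Split into chunks. CharacterTextSplitter only splits on the separator ('\n\n'),
        # so an abstract with no blank lines stays in one chunk even if it exceeds chunk_size.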
        chunks = self.text_splitter.split_text(full_text)
        
        # Create Document objects
        documents = []
        for i, chunk in enumerate(chunks):
            doc = Document(
                page_content=chunk,
                metadata={
                    'title': paper['title'],
                    'authors': ', '.join(paper['authors'][:3]),
                    'published': paper['published'].strftime('%Y-%m-%d'),
                    'arxiv_id': paper['arxiv_id'],
                    'categories': ', '.join(paper['categories'][:2]),
                    'chunk_id': i,
                    'total_chunks': len(chunks)
                }
            )
            documents.append(doc)
        
        return documents
    
    def analyze_chunking(self, documents: List[Document]) -> Dict[str, Any]:
        """
        Analyze chunking statistics.
        
        Args:
            documents: List of Document objects
            
        Returns:
            Dictionary with chunking statistics
        """
        chunk_lengths = [len(doc.page_content) for doc in documents]
        
        stats = {
            'total_chunks': len(documents),
            'avg_chunk_length': np.mean(chunk_lengths),
            'min_chunk_length': np.min(chunk_lengths),
            'max_chunk_length': np.max(chunk_lengths),
            'std_chunk_length': np.std(chunk_lengths)
        }
        
        return stats

# Initialize the text splitter
text_splitter = AcademicTextSplitter(chunk_size=1000, chunk_overlap=200)

# Test chunking with sample papers
if sample_papers:
    all_documents = []
    for paper in sample_papers[:2]:  # Use first 2 papers for testing
        docs = text_splitter.split_paper(paper)
        all_documents.extend(docs)
    
    # Analyze chunking
    chunking_stats = text_splitter.analyze_chunking(all_documents)
    
    print(f"\n📊 Chunking Statistics:")
    print(f"   Total chunks: {chunking_stats['total_chunks']}")
    print(f"   Avg length: {chunking_stats['avg_chunk_length']:.0f} chars")
    print(f"   Range: {chunking_stats['min_chunk_length']:.0f} - {chunking_stats['max_chunk_length']:.0f} chars")
    
    print(f"\n📝 Sample Chunk Preview:")
    doc = all_documents[0]
    print(f"   {doc.page_content[:150]}...")
else:
    print("⚠️ No papers available for chunking")



📊 Chunking Statistics:
   Total chunks: 2
   Avg length: 830 chars
   Range: 815 - 844 chars

📝 Sample Chunk Preview:
   Title: Differentiable Neural Architecture Transformation for Reproducible Architecture Improvement

Abstract: Recently, Neural Architecture Search (NA...
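
The statistics above show one chunk per paper (2 chunks for 2 papers, each under 1000 characters), so `chunk_overlap` never came into play in this run. To illustrate what `chunk_size` and `chunk_overlap` mean on longer text, here is a minimal pure-Python sliding-window sketch. It is a simplified stand-in, not how `CharacterTextSplitter` works internally, and `sliding_window_chunks` is a hypothetical name.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1000,
                          chunk_overlap: int = 200) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


short_text = "a" * 830                      # roughly the size of one abstract above
long_text = "".join(str(i % 10) for i in range(2500))

print(len(sliding_window_chunks(short_text)))            # 1
print([len(c) for c in sliding_window_chunks(long_text)])  # [1000, 1000, 900]
```

With full-text papers instead of abstracts, the overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated embedding work.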


In [5]:
# Cell 4: Qdrant Vector Database Setup
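# Uses Qdrant as the vector database. This demo runs Qdrant in local in-memory mode
# (":memory:"); pointing the client at a Qdrant server is what makes the data persistent.
# Implements AIE8 Week 2-3 concepts: Production vector databases and embeddings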

try:
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct
    QDRANT_AVAILABLE = True
except ImportError:
    print("⚠️ Qdrant not installed. Run: pip install qdrant-client")
    print("   Falling back to in-memory vector store...")
    QDRANT_AVAILABLE = False

class AcademicVectorStore:
    """
    Production-ready vector store for academic papers using Qdrant.
    Falls back to in-memory storage if Qdrant is not available.
    """
    
    def __init__(self, collection_name: str = "academic_papers", use_qdrant: bool = True):
        self.collection_name = collection_name
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.use_qdrant = use_qdrant and QDRANT_AVAILABLE
        
        if self.use_qdrant:
            # Initialize Qdrant (in-memory for demo, can connect to server)
            self.qdrant_client = QdrantClient(":memory:")
            # For production server: QdrantClient(host="localhost", port=6333)
            self.document_map = {}  # Store metadata separately
            print("✅ Initialized Qdrant vector database (production-ready)")
        else:
            # Fallback to in-memory
            self.documents = []
            self.embeddings_cache = []
            print("✅ Initialized in-memory vector store (fallback mode)")
    
    def create_collection(self):
        """Create a new collection for academic papers."""
        if self.use_qdrant:
            try:
                # Create Qdrant collection with proper configuration
                self.qdrant_client.create_collection(
                    collection_name=self.collection_name,
                    vectors_config=VectorParams(
                        size=1536,  # OpenAI text-embedding-3-small dimension
                        distance=Distance.COSINE  # Cosine similarity for semantic search
                    )
                )
                print(f"✅ Created Qdrant collection (1536-dim, cosine similarity)")
                return True
            except Exception as e:
                print(f"❌ Error creating collection: {e}")
                return False
        else:
            print(f"✅ Created collection '{self.collection_name}' (in-memory)")
            return True
    
    def add_documents(self, documents: List[Document]) -> bool:
        """Add documents to the vector store."""
        if not documents:
            return False
            
        try:
            # Generate embeddings
            texts = [doc.page_content for doc in documents]
            embeddings = self.embeddings.embed_documents(texts)
            
            if self.use_qdrant:
                # Upload to Qdrant
                points = []
                for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
                    point_id = len(self.document_map)
                    points.append(
                        PointStruct(
                            id=point_id,
                            vector=embedding,
                            payload={
                                "text": doc.page_content,
                                "metadata": doc.metadata
                            }
                        )
                    )
                    self.document_map[point_id] = doc
                
                self.qdrant_client.upsert(
                    collection_name=self.collection_name,
                    points=points
                )
                print(f"✅ Added {len(documents)} documents to Qdrant")
            else:
                # In-memory fallback
                self.documents.extend(documents)
                self.embeddings_cache.extend(embeddings)
                print(f"✅ Added {len(documents)} documents to in-memory store")
            
            return True
            
        except Exception as e:
            print(f"❌ Error adding documents: {e}")
            return False
    
    def search_similar(self, query: str, k: int = 5) -> List[Dict[str, Any]]:
        """Search for similar documents."""
        try:
            # Generate query embedding
            query_embedding = self.embeddings.embed_query(query)
            
            if self.use_qdrant:
                # Search using Qdrant
                search_results = self.qdrant_client.search(
                    collection_name=self.collection_name,
                    query_vector=query_embedding,
                    limit=k
                )
                
                # Format results
                results = []
                for hit in search_results:
                    results.append({
                        'text': hit.payload['text'],
                        'score': hit.score,
                        'metadata': hit.payload['metadata']
                    })
                return results
                
            else:
                # In-memory fallback
                if not self.documents:
                    return []
                    
                similarities = []
                for i, doc_embedding in enumerate(self.embeddings_cache):
                    similarity = np.dot(query_embedding, doc_embedding) / (
                        np.linalg.norm(query_embedding) * np.linalg.norm(doc_embedding)
                    )
                    similarities.append((i, similarity))
                
                similarities.sort(key=lambda x: x[1], reverse=True)
                top_k = similarities[:k]
                
                results = []
                for idx, score in top_k:
                    doc = self.documents[idx]
                    results.append({
                        'text': doc.page_content,
                        'score': score,
                        'metadata': doc.metadata
                    })
                return results
            
        except Exception as e:
            print(f"❌ Error searching: {e}")
            return []
    
    def get_stats(self):
        """Get statistics about the vector store."""
        if self.use_qdrant:
            try:
                info = self.qdrant_client.get_collection(self.collection_name)
                return {
                    'total_documents': info.points_count,
                    'vector_size': info.config.params.vectors.size,
                    'distance_metric': info.config.params.vectors.distance.name,
                    'backend': 'Qdrant'
                }
            except Exception:
                # Avoid a bare except so KeyboardInterrupt/SystemExit still propagate
                return {'backend': 'Qdrant', 'status': 'Error'}
        else:
            return {
                'total_documents': len(self.documents),
                'vector_size': 1536,
                'distance_metric': 'COSINE',
                'backend': 'In-Memory'
            }

# Initialize vector store with Qdrant
vector_store = AcademicVectorStore(use_qdrant=True)

# Create collection and add documents
if vector_store.create_collection():
    if 'all_documents' in locals() and all_documents:
        vector_store.add_documents(all_documents)
        
        # Display statistics
        stats = vector_store.get_stats()
        print(f"\n📊 Vector Store: {stats['total_documents']} documents indexed ({stats['backend']})")
        
        # Test search
        test_query = "transformer architecture attention mechanism"
        search_results = vector_store.search_similar(test_query, k=2)
        
        print(f"\n🔍 Top Search Results:")
        for i, result in enumerate(search_results, 1):
            print(f"{i}. {result['metadata'].get('title', 'N/A')[:80]}... (Score: {result['score']:.3f})")
    else:
        print("⚠️ No documents available to add to vector store")


✅ Initialized Qdrant vector database (production-ready)
✅ Created Qdrant collection (1536-dim, cosine similarity)
✅ Added 2 documents to Qdrant

📊 Vector Store: 2 documents indexed (Qdrant)

🔍 Top Search Results:
1. Transfer NAS: Knowledge Transfer between Search Spaces with Transformer Agents... (Score: 0.406)
2. Differentiable Neural Architecture Transformation for Reproducible Architecture ... (Score: 0.393)
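
In `add_documents`, point IDs are taken from `len(self.document_map)`, so indexing the same paper twice stores it under two different IDs. Qdrant accepts UUIDs as point IDs in addition to unsigned integers, so one option is to derive a deterministic ID from the paper and chunk. The sketch below assumes the `arxiv_id` and `chunk_id` metadata fields created earlier; `make_point_id` and the placeholder ID `0000.00000v1` are hypothetical.

```python
import uuid


def make_point_id(arxiv_id: str, chunk_id: int) -> str:
    """Derive a stable UUID string from a paper ID and chunk index."""
    # uuid5 hashes the name within a namespace, so equal inputs give equal IDs
    name = f"arxiv:{arxiv_id}#chunk-{chunk_id}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))


first = make_point_id("0000.00000v1", 0)    # placeholder ID, not a real paper
again = make_point_id("0000.00000v1", 0)
other = make_point_id("0000.00000v1", 1)

print(first == again)  # True
print(first == other)  # False
```

Passing such an ID as `PointStruct(id=...)` would make `upsert` overwrite an existing chunk instead of adding a duplicate.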


In [6]:
# Cell 5: Basic Question-Answering Interface
class AcademicResearchAssistant:
    """
    Basic RAG-based research assistant.
    Implements AIE8 Week 3 concepts: End-to-end RAG pipeline.
    """
    
    def __init__(self, vector_store: AcademicVectorStore):
        self.vector_store = vector_store
        self.llm = ChatOpenAI(model="gpt-4", temperature=0.1)
        
    def ask_question(self, question: str, k: int = 3) -> Dict[str, Any]:
        """Answer a research question using RAG."""
        try:
            # Retrieve relevant documents
            relevant_docs = self.vector_store.search_similar(question, k=k)
            
            if not relevant_docs:
                return {
                    'answer': 'I could not find relevant information to answer your question.',
                    'sources': [],
                    'confidence': 0.0,
                    'num_sources': 0  # callers read this key, so always include it
                }
            
            # Prepare context for LLM
            context = '\n\n'.join([doc['text'] for doc in relevant_docs])
            
            # Create prompt
            prompt = f"""
            Based on the following academic research context, please answer the question.
            
            Context:
            {context}
            
            Question: {question}
            
            Please provide a comprehensive answer based on the research context. Include specific details and cite relevant information.
            """
            
            # Get answer from LLM
            response = self.llm.invoke(prompt)
            answer = response.content
            
            # Heuristic confidence: twice the mean similarity score, capped at 1.0.
            # It reflects retrieval similarity only, not whether the answer is correct.
            avg_score = np.mean([doc['score'] for doc in relevant_docs])
            confidence = min(avg_score * 2, 1.0)
            
            # Prepare sources
            sources = []
            for doc in relevant_docs:
                sources.append({
                    'title': doc['metadata'].get('title', 'Unknown'),
                    'authors': doc['metadata'].get('authors', 'Unknown'),
                    'published': doc['metadata'].get('published', 'Unknown'),
                    'arxiv_id': doc['metadata'].get('arxiv_id', 'Unknown'),
                    'relevance_score': doc['score']
                })
            
            return {
                'answer': answer,
                'sources': sources,
                'confidence': confidence,
                'num_sources': len(sources)
            }
            
        except Exception as e:
            print(f"❌ Error answering question: {e}")
            return {
                'answer': f"I encountered an error while processing your question: {e}",
                'sources': [],
                'confidence': 0.0,
                'num_sources': 0  # callers read this key, so always include it
            }

# Initialize the research assistant
# Check if vector store has data (works for both Qdrant and in-memory modes)
has_data = False
if hasattr(vector_store, 'use_qdrant') and vector_store.use_qdrant:
    # For Qdrant mode
    has_data = len(vector_store.document_map) > 0
elif hasattr(vector_store, 'documents'):
    # For in-memory mode
    has_data = len(vector_store.documents) > 0

if has_data:
    research_assistant = AcademicResearchAssistant(vector_store)
    
    # Test with sample questions
    test_questions = [
        "What is a transformer architecture?",
        "How does attention mechanism work?"
    ]
    
    print("🤖 Testing ScholarSynth Q&A:\n")
    
    for i, question in enumerate(test_questions, 1):
        print(f"Q{i}: {question}")
        
        result = research_assistant.ask_question(question)
        
        print(f"A: {result['answer'][:200]}...")
        print(f"Confidence: {result['confidence']:.2f} | Sources: {result['num_sources']}")
        
        if result['sources']:
            for j, source in enumerate(result['sources'][:2], 1):
                print(f"   [{j}] {source['title'][:60]}... (Score: {source['relevance_score']:.3f})")
        print()
else:
    print("❌ Vector store not available. Please run the previous cells first.")


🤖 Testing ScholarSynth Q&A:

Q1: What is a transformer architecture?
A: The transformer architecture is not explicitly defined in the provided research context. However, it can be inferred that it is a type of model or agent used in Neural Architecture Search (NAS) method...
Confidence: 0.93 | Sources: 2
   [1] Differentiable Neural Architecture Transformation for Reprod... (Score: 0.463)
   [2] Transfer NAS: Knowledge Transfer between Search Spaces with ... (Score: 0.463)

Q2: How does attention mechanism work?
A: The research context does not provide information on how the attention mechanism works....
Confidence: 0.42 | Sources: 2
   [1] Transfer NAS: Knowledge Transfer between Search Spaces with ... (Score: 0.212)
   [2] Differentiable Neural Architecture Transformation for Reprod... (Score: 0.210)
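
The confidence values printed above follow directly from the formula in `ask_question`: twice the mean similarity score, capped at 1.0. The short sketch below reproduces the arithmetic with the scores shown in the output; `heuristic_confidence` is a hypothetical standalone name for the same calculation.

```python
from typing import Sequence


def heuristic_confidence(scores: Sequence[float]) -> float:
    """Twice the mean similarity, capped at 1.0 (same formula as ask_question)."""
    mean_score = sum(scores) / len(scores)
    return min(mean_score * 2, 1.0)


print(round(heuristic_confidence([0.463, 0.463]), 2))  # 0.93 (Q1)
print(round(heuristic_confidence([0.212, 0.210]), 2))  # 0.42 (Q2)
print(heuristic_confidence([0.6, 0.7]))                # 1.0 (capped)
```

Note that Q1 scored 0.93 even though the answer states that the transformer architecture is not defined in the context. Any mean similarity of 0.5 or higher saturates at 1.0, so this number is better read as a retrieval-similarity indicator than as answer confidence.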



In [7]:
# Cell 6: Vibe Checking and Initial Evaluation
import time

class VibeChecker:
    """
    Vibe checker for evaluating the research assistant.
    Implements AIE8 Week 1 concepts: System evaluation and baseline establishment.
    """
    
    def __init__(self, research_assistant: AcademicResearchAssistant):
        self.research_assistant = research_assistant
        
    def run_vibe_check(self, test_questions: List[str]) -> Dict[str, Any]:
        """Run comprehensive vibe check on the research assistant."""
        results = {
            'total_questions': len(test_questions),
            'successful_answers': 0,
            'avg_confidence': 0.0,
            'avg_sources': 0.0,
            'response_times': [],
            'detailed_results': []
        }
        
        print("🔍 Running Vibe Check...")
        print(f"Testing {len(test_questions)} questions\n")
        
        for i, question in enumerate(test_questions, 1):
            start_time = time.time()
            result = self.research_assistant.ask_question(question)
            end_time = time.time()
            
            response_time = end_time - start_time
            results['response_times'].append(response_time)
            
            # Check if answer is meaningful
            is_meaningful = (
                len(result['answer']) > 50 and 
                'error' not in result['answer'].lower() and
                result['confidence'] > 0.1
            )
            
            if is_meaningful:
                results['successful_answers'] += 1
            
            results['avg_confidence'] += result['confidence']
            results['avg_sources'] += result['num_sources']
            
            detailed_result = {
                'question': question,
                'answer_length': len(result['answer']),
                'confidence': result['confidence'],
                'sources': result['num_sources'],
                'response_time': response_time,
                'is_meaningful': is_meaningful
            }
            results['detailed_results'].append(detailed_result)
            
            print(f"[{i}/{len(test_questions)}] {question[:50]}...")
            print(f"      {len(result['answer'])} chars | Confidence: {result['confidence']:.2f} | {response_time:.2f}s | {'✅' if is_meaningful else '❌'}")
        
        # Calculate averages (guard against an empty question list)
        n_questions = max(len(test_questions), 1)
        results['avg_confidence'] /= n_questions
        results['avg_sources'] /= n_questions
        results['avg_response_time'] = (
            float(np.mean(results['response_times'])) if results['response_times'] else 0.0
        )
        results['success_rate'] = results['successful_answers'] / n_questions
        
        return results
    
    def print_summary(self, results: Dict[str, Any]):
        """Print vibe check summary."""
        print("\n" + "="*60)
        print("🎯 VIBE CHECK SUMMARY")
        print("="*60)
        print(f"Success Rate: {results['success_rate']:.1%} ({results['successful_answers']}/{results['total_questions']})")
        print(f"Average Confidence: {results['avg_confidence']:.2f}")
        print(f"Average Response Time: {results['avg_response_time']:.2f}s")
        
        # Overall assessment
        if results['success_rate'] >= 0.8 and results['avg_confidence'] >= 0.5:
            print("\n🎉 EXCELLENT! System is performing well.")
        elif results['success_rate'] >= 0.6 and results['avg_confidence'] >= 0.3:
            print("\n✅ GOOD! System working but could improve.")
        else:
            print("\n⚠️ NEEDS IMPROVEMENT! System needs optimization.")
        
        print("="*60)

# Run vibe check if research assistant is available
if 'research_assistant' in locals() and research_assistant is not None:
    vibe_checker = VibeChecker(research_assistant)
    
    # Test questions for vibe check
    test_questions = [
        "What is machine learning?",
        "How do neural networks work?",
        "What are the applications of AI?",
        "What is deep learning?",
        "How does natural language processing work?"
    ]
    
    # Run the vibe check
    vibe_results = vibe_checker.run_vibe_check(test_questions)
    
    # Print summary
    vibe_checker.print_summary(vibe_results)
else:
    print("❌ Research assistant not available. Please run Cell 5 first.")


🔍 Running Vibe Check...
Testing 5 questions

[1/5] What is machine learning?...
      1039 chars | Confidence: 0.46 | 5.76s | ✅
[2/5] How do neural networks work?...
      78 chars | Confidence: 0.73 | 1.08s | ✅
[3/5] What are the applications of AI?...
      1515 chars | Confidence: 0.59 | 8.55s | ✅
[4/5] What is deep learning?...
      1655 chars | Confidence: 0.59 | 7.97s | ✅
[5/5] How does natural language processing work?...
      91 chars | Confidence: 0.59 | 2.47s | ✅

🎯 VIBE CHECK SUMMARY
Success Rate: 100.0% (5/5)
Average Confidence: 0.59
Average Response Time: 5.17s

🎉 EXCELLENT! System is performing well.
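
The 100% success rate deserves some caution: the `is_meaningful` test only checks length, the absence of the word "error", and a confidence above 0.1, so a short reply such as the 78-character answer to question 2 still counts as a success. A possibly stricter variant also rejects answers that say the context lacks the information. The marker phrases below are taken from answers printed earlier in this notebook; the function name and the marker list are assumptions and would need tuning.

```python
from typing import Sequence

# Phrases seen in earlier outputs that signal "the context did not help"
REFUSAL_MARKERS = (
    "does not provide",
    "not explicitly defined",
    "could not find",
)


def is_meaningful_strict(answer: str, confidence: float,
                         markers: Sequence[str] = REFUSAL_MARKERS,
                         min_length: int = 50,
                         min_confidence: float = 0.1) -> bool:
    """Length and confidence checks plus a refusal-phrase filter."""
    if len(answer) <= min_length or confidence <= min_confidence:
        return False
    lowered = answer.lower()
    # Reject answers that admit the retrieved context was insufficient
    return not any(marker in lowered for marker in markers)


refusal = ("The research context does not provide information on how "
           "the attention mechanism works.")
print(is_meaningful_strict(refusal, 0.42))  # False
```

Phrase matching is still a rough proxy; it catches explicit refusals but says nothing about whether a longer answer is grounded in the retrieved papers.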


### Achievements:
1. ✅ **Environment Setup** - All dependencies and imports configured
2. ✅ **ArXiv Integration** - Academic paper retrieval working
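3. ✅ **Text Chunking** - Character-based splitting (1000 chars, 200 overlap); each sampled title and abstract fit in a single chunk
4. ✅ **Qdrant Vector Database** - Qdrant collection running in local in-memory mode (`:memory:`) for this demo, with a NumPy fallback when `qdrant-client` is not installed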
5. ✅ **Semantic Search** - Cosine similarity search with OpenAI embeddings (text-embedding-3-small)
6. ✅ **Basic RAG** - Question-answering pipeline working
7. ✅ **Vibe Check** - Initial evaluation completed

