# 🎓 AI Study Assistant - Complete Tutorial

Welcome to the comprehensive tutorial for building and using the AI Study Assistant!

This notebook will guide you through:
- ✅ Environment setup and dependencies
- 🐳 Docker configuration
- 📚 Document processing and RAG pipeline
- 🤖 Model loading and inference
- 🚀 FastAPI deployment
- ⚡ Performance optimization
- 🧪 Testing and benchmarking

**Estimated Time:** 2-3 hours

**Prerequisites:**
- Python 3.11+
- Docker (optional)
- Basic knowledge of NLP and REST APIs

Let's get started! 🚀

## 1️⃣ Environment Setup and Dependencies

First, let's configure the Python path and confirm that the required libraries import correctly. The libraries must already be installed in your environment.

In [None]:
# Resolve the project root (assumes the notebook is started from the project root directory)
import os
import sys

project_root = os.path.abspath('.')
print(f"Project Root: {project_root}")

# Add project to Python path
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    
print("✓ Environment configured")

In [None]:
# Import core libraries
import torch
import transformers
from sentence_transformers import SentenceTransformer
import chromadb
import requests
import json
from pathlib import Path
from typing import List, Dict, Any
import time

print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print("✓ Core libraries imported")

## 2️⃣ Configuration and Setup

Load configuration from the project's Config class.

In [None]:
# Load configuration
from src.config import Config

config = Config()

print("Configuration loaded:")
print(f"  - Embedding Model: {config.EMBEDDING_MODEL_NAME}")
print(f"  - LLM Model: {config.LLM_MODEL_NAME}")
print(f"  - Summarizer Model: {config.SUMMARIZER_MODEL}")
print(f"  - Vector DB Path: {config.VECTOR_DB_PATH}")
print(f"  - Batch Size: {config.BATCH_SIZE}")
print(f"  - Max Length: {config.MAX_LENGTH}")
print("✓ Configuration ready")

## 3️⃣ Text Preprocessing

Let's test the text preprocessing pipeline.
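
For intuition about the first two outputs checked in the cell that follows (sentences and tokens), here is a minimal regex-only sketch. It is not the project's `TextPreprocessor`, whose internals are not shown in this notebook; the helper names `split_sentences` and `tokenize` are hypothetical, and lemmatization and POS tagging are deliberately left out.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on sentence-ending punctuation that is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence: str) -> list[str]:
    """Return lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


sample = (
    "Machine learning is a subset of artificial intelligence. "
    "Deep learning uses neural networks with multiple layers."
)
sentences = split_sentences(sample)
tokens = [tokenize(s) for s in sentences]

print(f"Sentences: {len(sentences)}")
print(f"First tokens: {tokens[0][:3]}")
```

A real preprocessor typically relies on an NLP library for sentence boundaries, since a plain regex mishandles abbreviations such as "e.g." or "Dr.".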

In [None]:
# Test text preprocessing
from src.preprocessing.text_preprocessor import TextPreprocessor

preprocessor = TextPreprocessor()

sample_text = """
Machine learning is a subset of artificial intelligence. 
It enables computers to learn from data without being explicitly programmed.
Deep learning uses neural networks with multiple layers.
"""

result = preprocessor.preprocess_full(sample_text)

print("Preprocessing Results:")
print(f"  - Sentences: {len(result['sentences'])}")
print(f"  - Tokens: {len(result['tokens'])}")
print(f"  - Lemmatized: {' '.join(result['lemmatized'][:10])}...")
print(f"  - POS Tags: {result['pos_tags'][:5]}")
print("✓ Text preprocessing works")

## 4️⃣ Keyword Extraction

Test the RAKE-based keyword extraction.
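
To make the RAKE idea concrete, the sketch below scores candidate phrases the way the algorithm is usually described: text is cut into phrases at stopwords and punctuation, each word gets a score of degree divided by frequency, and a phrase scores the sum of its words. This is an illustration only, not the project's `KeywordExtractor`; the tiny `STOPWORDS` set and the function names are assumptions made for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"is", "in", "and", "are", "of", "the", "a", "to", "used"}


def candidate_phrases(text: str) -> list[list[str]]:
    """Split text into runs of non-stopwords; punctuation also ends a run."""
    phrases = []
    for fragment in re.split(r"[.,;:!?\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(text: str) -> list[tuple[str, float]]:
    """Score each phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


text = "Python is widely used in data science and machine learning."
ranked = rake_scores(text)
for phrase, score in ranked:
    print(f"{phrase}: {score:.1f}")
```

Multi-word phrases outrank single words here because every word in a longer phrase has a higher degree, which is the main bias RAKE introduces.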

In [None]:
# Test keyword extraction
from src.training.keyword_extractor import KeywordExtractor

extractor = KeywordExtractor()

text = """
Python programming language is widely used in data science and machine learning.
Natural language processing and computer vision are key areas of artificial intelligence.
Deep learning models require significant computational resources and large datasets.
"""

keywords = extractor.extract_keywords(text, top_n=10)

print("Top Keywords:")
for keyword, score in keywords:
    print(f"  - {keyword}: {score:.3f}")
print("✓ Keyword extraction works")

## 5️⃣ TextRank Summarization

Test extractive summarization using TextRank.
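
As background, TextRank treats sentences as graph nodes, weights edges by sentence similarity, and ranks nodes with a PageRank-style iteration. The following is a small self-contained sketch of that idea using the word-overlap similarity from the original TextRank formulation. It is not the project's `TextRankSummarizer`; the function names, the damping factor of 0.85, and the fixed iteration count are assumptions for illustration.

```python
import math
import re


def _words(sentence: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a: str, b: str) -> float:
    """Shared words, normalized by the log of each sentence's length."""
    wa, wb = _words(a), _words(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank(sentences: list[str], damping: float = 0.85, iterations: int = 50) -> list[float]:
    """Run a weighted PageRank iteration over the sentence graph."""
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences: list[str], num_sentences: int = 2) -> list[str]:
    """Keep the top-ranked sentences, returned in their original order."""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "Machine learning builds models from data.",
    "Models learn patterns from training data.",
    "The weather is sunny today.",
]
summary = summarize(sentences, num_sentences=2)
print(" ".join(summary))
```

The unrelated third sentence shares no words with the others, so it receives only the baseline score and is dropped from the summary.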

In [None]:
# Test TextRank summarization
from src.training.textrank_summarizer import TextRankSummarizer

summarizer = TextRankSummarizer(num_sentences=2)

long_text = """
Machine learning is a method of data analysis that automates analytical model building.
It is a branch of artificial intelligence based on the idea that systems can learn from data.
Machine learning algorithms build a model based on sample data, known as training data.
The algorithms make predictions or decisions without being explicitly programmed to do so.
Machine learning algorithms are used in a wide variety of applications.
Email filtering and computer vision are examples where it is difficult to develop conventional algorithms.
Machine learning is closely related to computational statistics and mathematical optimization.
"""

summary = summarizer.summarize(long_text)

print("Original Length:", len(long_text))
print("Summary Length:", len(summary))
print(f"Compression: {len(summary)/len(long_text)*100:.1f}%")
print("\nSummary:")
print(summary)
print("✓ TextRank summarization works")

## 6️⃣ Vector Database with ChromaDB

Initialize and test ChromaDB for semantic search.
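
Semantic search boils down to embedding the query and the documents as vectors and ranking documents by vector similarity. The sketch below shows only that ranking step, in plain Python. It makes a strong simplifying assumption: word counts stand in for the dense embeddings a model such as a `SentenceTransformer` would produce, so it matches on shared words rather than meaning. The names `embed`, `cosine`, and `search` are hypothetical and are not part of ChromaDB or the project's `ChromaDBManager`.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, docs: list[str], n_results: int = 3) -> list[tuple[float, str]]:
    """Return the n_results documents most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:n_results]


documents = [
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological brains.",
    "Deep learning uses multiple layers in neural networks.",
    "Natural language processing deals with text and speech.",
]
hits = search("What is machine learning?", documents, n_results=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```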

In [None]:
# Initialize ChromaDB
from src.inference.chromadb_manager import ChromaDBManager

# Use a test database
db_manager = ChromaDBManager(persist_directory="./test_chroma_db")

# Add sample documents
documents = [
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological brains.",
    "Deep learning uses multiple layers in neural networks.",
    "Natural language processing deals with text and speech."
]

metadata = [
    {"source": "python_doc", "type": "definition"},
    {"source": "ml_doc", "type": "definition"},
    {"source": "nn_doc", "type": "explanation"},
    {"source": "dl_doc", "type": "explanation"},
    {"source": "nlp_doc", "type": "definition"}
]

ids = [f"doc_{i}" for i in range(len(documents))]

db_manager.add_documents(documents, metadata, ids)

# Check collection stats
stats = db_manager.get_collection_stats()
print(f"Collection Stats: {stats}")
print("✓ ChromaDB initialized with documents")

In [None]:
# Test semantic search
query = "What is machine learning?"
results = db_manager.query(query, n_results=3)

print(f"\nQuery: {query}")
print(f"\nTop {len(results)} Results:")
for i, result in enumerate(results, 1):
    print(f"\n{i}. Score: {result['score']:.3f}")
    print(f"   Text: {result['text']}")
    print(f"   Metadata: {result['metadata']}")
print("✓ Semantic search works")

## 7️⃣ RAG Pipeline

Build and test the complete RAG (Retrieval-Augmented Generation) pipeline.
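
Between retrieval and generation, a RAG pipeline has to turn the retrieved passages into a single prompt. The sketch below shows one plausible way to do that under a character budget. It is an assumption-laden illustration: `build_prompt`, its prompt template, and the `max_context_chars` parameter are hypothetical, the scores are made-up example values, and a higher `score` is assumed to mean a better match. How `LLMReader` actually formats its input is not shown in this notebook.

```python
def build_prompt(query: str, docs: list[dict], max_context_chars: int = 500) -> str:
    """Join the best-scoring passages (within a size budget) and append the question."""
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    context_parts, used = [], 0
    for doc in ranked:
        text = doc["text"]
        if used + len(text) > max_context_chars:
            break  # stop before the context exceeds the budget
        context_parts.append(text)
        used += len(text)
    context = "\n".join(context_parts)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Illustrative documents; the scores are invented for the example.
example_docs = [
    {"text": "Neural networks are inspired by biological brains.", "score": 0.61},
    {"text": "Deep learning uses multiple layers in neural networks.", "score": 0.82},
]
prompt = build_prompt("Explain deep learning", example_docs)
print(prompt)
```

Capping the context matters for small models such as GPT-2, whose input window is limited, so the least relevant passages are the ones dropped first.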

In [None]:
# Initialize RAG components
from src.inference.rag_retriever import RAGRetriever
from src.inference.llm_reader import LLMReader

# Create retriever with our vector store
retriever = RAGRetriever(vector_store_manager=db_manager)

# Retrieve relevant documents
query = "Explain deep learning"
retrieved_docs = retriever.retrieve_documents(query, top_k=3)

print(f"Query: {query}")
print(f"\nRetrieved {len(retrieved_docs)} documents:")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"\n{i}. Score: {doc['score']:.3f}")
    print(f"   Text: {doc['text'][:100]}...")
print("✓ RAG retrieval works")

In [None]:
# Initialize the LLM reader (this loads the GPT-2 model)
print("Loading LLM model (this may take a moment)...")
reader = LLMReader()

# Combine retrieved context
context = "\n".join([doc['text'] for doc in retrieved_docs])

# Generate answer
answer = reader.generate_answer(query, context, max_length=100)

print(f"\nQuery: {query}")
print(f"\nContext ({len(context)} chars):")
print(context)
print(f"\nGenerated Answer:")
print(answer)
print("\n✓ RAG generation works")

## 8️⃣ API Testing

Now let's test the FastAPI endpoints. Make sure the API is running first!

```bash
# In a terminal, run:
python -m uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
```

Or use Docker:
```bash
docker-compose up -d
```
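
A freshly started server, especially one that loads models at startup, may need a few seconds before it answers. A small polling helper can save you from re-running the health check by hand. The sketch below uses only the standard library; `wait_for_api` and its `probe` parameter are hypothetical names introduced for this example, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.request


def wait_for_api(url: str, attempts: int = 10, delay: float = 1.0, probe=None) -> bool:
    """Poll `url` until it answers with HTTP 200 or the attempts run out."""

    def default_probe(target: str) -> bool:
        try:
            with urllib.request.urlopen(target, timeout=5) as resp:
                return resp.status == 200
        except OSError:
            # Covers connection errors, timeouts, and non-2xx HTTP errors.
            return False

    probe = probe or default_probe
    for attempt in range(1, attempts + 1):
        if probe(url):
            return True
        if attempt < attempts:
            time.sleep(delay)
    return False
```

Calling `wait_for_api("http://localhost:8000/health")` before the cells that follow returns `True` as soon as the health endpoint responds, or `False` after the last attempt. The `probe` argument exists so the helper can be exercised without a running server.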

In [None]:
# Test API health check
BASE_URL = "http://localhost:8000"

try:
    response = requests.get(f"{BASE_URL}/health", timeout=5)
    if response.status_code == 200:
        print("✓ API is running!")
        print(f"Response: {response.json()}")
    else:
        print(f"⚠ API returned status {response.status_code}")
except requests.exceptions.RequestException as e:
    print("❌ API is not running. Please start it first:")
    print("   python -m uvicorn api.main:app --reload")
    print(f"   Error: {e}")

In [None]:
# Test keyword extraction endpoint
text = "Machine learning and artificial intelligence are transforming industries worldwide."
params = {"text": text, "top_n": 5}

try:
    # A timeout keeps the cell from hanging if the server stops responding
    response = requests.post(f"{BASE_URL}/extract-keywords", params=params, timeout=30)
    if response.status_code == 200:
        data = response.json()
        print("✓ Keyword extraction API works!")
        print(f"\nKeywords:")
        for keyword, score in data['keywords']:
            print(f"  - {keyword}: {score:.3f}")
    else:
        print(f"Error: {response.status_code}")
except requests.exceptions.RequestException as e:
    print(f"❌ Error: {e}")

In [None]:
# Test summarization endpoint
text = """
Machine learning is a method of data analysis that automates analytical model building.
It is a branch of artificial intelligence based on the idea that systems can learn from data.
The algorithms make predictions or decisions without being explicitly programmed to do so.
Machine learning algorithms are used in a wide variety of applications like email filtering.
"""

params = {
    "text": text,
    "summary_type": "extractive"
}

try:
    # A timeout keeps the cell from hanging if the server stops responding
    response = requests.post(f"{BASE_URL}/summarize", params=params, timeout=30)
    if response.status_code == 200:
        data = response.json()
        print("✓ Summarization API works!")
        print(f"\nOriginal: {len(text)} chars")
        print(f"Summary: {len(data['summary'])} chars")
        print(f"Type: {data['summary_type']}")
        print(f"\nSummary:\n{data['summary']}")
    else:
        print(f"Error: {response.status_code}")
except requests.exceptions.RequestException as e:
    print(f"❌ Error: {e}")

In [None]:
# Test chat endpoint (Q&A)
params = {
    "query": "What is machine learning?",
    "top_k": 3
}

try:
    # A timeout keeps the cell from hanging if the server stops responding
    response = requests.post(f"{BASE_URL}/chat", params=params, timeout=30)
    if response.status_code == 200:
        data = response.json()
        print("✓ Chat API works!")
        print(f"\nQuery: {params['query']}")
        print(f"\nAnswer:\n{data['answer']}")
        print(f"\nSources ({len(data['sources'])}):")
        for i, source in enumerate(data['sources'][:3], 1):
            print(f"\n{i}. {source['text'][:100]}...")
            print(f"   Metadata: {source['metadata']}")
    else:
        print(f"Error: {response.status_code}")
except requests.exceptions.RequestException as e:
    print(f"❌ Error: {e}")

In [None]:
# Test documents listing endpoint
try:
    # A timeout keeps the cell from hanging if the server stops responding
    response = requests.get(f"{BASE_URL}/documents", timeout=30)
    if response.status_code == 200:
        data = response.json()
        print("✓ Documents API works!")
        print(f"\nTotal Documents: {data['total_documents']}")
        print(f"\nFirst {min(5, len(data['documents']))} documents:")
        for i, doc in enumerate(data['documents'][:5], 1):
            print(f"\n{i}. ID: {doc['id']}")
            print(f"   Metadata: {doc.get('metadata', {})}")
    else:
        print(f"Error: {response.status_code}")
except requests.exceptions.RequestException as e:
    print(f"❌ Error: {e}")

## 9️⃣ Performance Profiling

Let's benchmark the performance of our system.
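
If you want to time your own functions alongside the project's profiler, a generic helper built on `time.perf_counter` is usually enough. The sketch below is independent of `src.performance_profiler`; the `benchmark` function and the shape of the dictionary it returns are assumptions chosen for this example.

```python
import statistics
import time


def benchmark(func, *args, repeats: int = 5, **kwargs) -> dict:
    """Call func repeatedly and report wall-clock timings in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "mean": statistics.mean(timings),
        "max": max(timings),
        "runs": repeats,
    }


# Example workload: sorting a reversed list of 10,000 integers.
stats = benchmark(sorted, list(range(10_000, 0, -1)), repeats=3)
print(f"mean {stats['mean'] * 1000:.2f} ms over {stats['runs']} runs")
```

Repeating the call and reporting the minimum as well as the mean helps separate the cost of the code from one-off noise such as cache warm-up.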

In [None]:
# Check system resources
from src.performance_profiler import system_resources

print("System Resources:")
resources = system_resources()
print("\n✓ System resources checked")

In [None]:
# Benchmark keyword extraction
from src.performance_profiler import benchmark_keyword_extraction

print("Benchmarking Keyword Extraction...")
results = benchmark_keyword_extraction()
print(f"\n✓ Benchmark complete: {results['time']:.3f}s")

In [None]:
# Benchmark summarization
from src.performance_profiler import benchmark_summarization

print("Benchmarking Summarization (this may take a moment)...")
results = benchmark_summarization()

print("\n✓ Benchmark Results:")
for model, metrics in results.items():
    print(f"\n{model.upper()}:")
    print(f"  Time: {metrics['time']:.3f}s")
    print(f"  Compression: {metrics['compression_ratio']:.1f}%")

## 🔟 Evaluation Metrics

Test the evaluation metrics for summarization and other tasks.
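
To see what a ROUGE score measures, it helps to compute the simplest variant by hand. ROUGE-1 counts the unigrams shared between a reference and a hypothesis, then reports recall (shared / reference length), precision (shared / hypothesis length), and their F1. The sketch below is a simplified illustration rather than the project's `EvaluationMetrics`: `rouge_1` is a hypothetical helper, and it skips the stemming that full ROUGE implementations often apply, so its numbers may differ from theirs.

```python
import re
from collections import Counter


def rouge_1(reference: str, hypothesis: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (no stemming)."""
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    hyp = Counter(re.findall(r"[a-z0-9]+", hypothesis.lower()))
    overlap = sum((ref & hyp).values())  # clipped count of shared unigrams
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(hyp.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "Machine learning is a subset of artificial intelligence that enables computers to learn."
hypothesis = "Machine learning allows computers to learn and is part of AI."

rouge = rouge_1(reference, hypothesis)
for name, value in rouge.items():
    print(f"{name}: {value:.3f}")
```

Note that "AI" and "artificial intelligence" share no unigrams, so the metric gives no credit for the paraphrase. That blind spot is a known limitation of overlap-based metrics.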

In [None]:
# Test evaluation metrics
from src.evaluation_metrics import EvaluationMetrics

metrics = EvaluationMetrics()

# Test ROUGE scores
reference = "Machine learning is a subset of artificial intelligence that enables computers to learn."
hypothesis = "Machine learning allows computers to learn and is part of AI."

rouge_scores = metrics.compute_rouge(reference, hypothesis)

print("ROUGE Scores:")
for metric, score in rouge_scores.items():
    print(f"  {metric}: {score:.3f}")

# Test BLEU score
bleu_score = metrics.compute_bleu(reference, hypothesis)
print(f"\nBLEU Score: {bleu_score:.3f}")

print("\n✓ Evaluation metrics work")

## 🎯 Summary and Next Steps

Congratulations! You've completed the AI Study Assistant tutorial! 🎉

### What you've learned:
- ✅ Text preprocessing and keyword extraction
- ✅ Extractive summarization with TextRank
- ✅ Vector database with ChromaDB
- ✅ RAG pipeline (retrieval + generation)
- ✅ FastAPI endpoints testing
- ✅ Performance profiling
- ✅ Evaluation metrics

### Next Steps:

1. **Try with your own data:**
   - Upload PDFs via `/upload` endpoint
   - Ask questions about your documents

2. **Fine-tune models:**
   - Train T5 with LoRA for better summarization
   - Fine-tune BERT for domain-specific NER

3. **Deploy to production:**
   - Use Docker Compose: `docker-compose up -d`
   - Set up monitoring with MLflow
   - Configure autoscaling

4. **Optimize performance:**
   - Enable GPU acceleration
   - Implement caching
   - Use quantized models

5. **Build a frontend:**
   - Create a web UI (React, Vue, etc.)
   - Add authentication
   - Implement file management

### Resources:
- 📚 README.md - Complete documentation
- 🚀 DEPLOYMENT_CHECKLIST.md - Deployment guide
- 📖 API Docs: http://localhost:8000/docs
- 🧪 Run tests: `pytest tests/ -v`
- 📊 Profile: `python src/performance_profiler.py`

Happy coding! 🚀