# 🎯 Stage 1: PDF Ingestion & Structure Detection
## Research Paper Simplifier - Backend Development

**Goal:** Upload PDF → Extract text → Detect sections → Understand structure with AI

**What we'll build:**
1. PDF text extraction
2. Rule-based section detection
3. AI-powered structure analysis with CrewAI
4. Metadata extraction

---

## 📦 Step 1: Imports and Setup

In [2]:
# Imports
import os
import json
import re
from pathlib import Path
from typing import Dict, List, Optional
from datetime import datetime

# PDF processing
import fitz  # PyMuPDF (required by PDFExtractor.extract_with_pymupdf below)
import pdfplumber

# AI and LLM
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Verify API key
if not os.getenv("OPENAI_API_KEY"):
    print("❌ OPENAI_API_KEY not found!")
    print("Please create a .env file with: OPENAI_API_KEY=your_key_here")
else:
    print("✅ Environment setup complete!")
    print(f"✅ OpenAI API Key loaded: {os.getenv('OPENAI_API_KEY')[:10]}...")

❌ OPENAI_API_KEY not found!
Please create a .env file with: OPENAI_API_KEY=your_key_here


## 📄 Step 2: PDF Text Extraction

In [None]:
class PDFExtractor:
    """Extract text and metadata from PDF files"""
    
    def __init__(self, pdf_path: str):
        self.pdf_path = pdf_path
        self.metadata = {}
        self.full_text = ""
        self.pages = []
        
    def extract_with_pymupdf(self) -> Dict:
        """Extract text using PyMuPDF (fast and reliable)"""
        print(f"📄 Extracting text from: {self.pdf_path}")
        
        doc = fitz.open(self.pdf_path)
        
        # Extract metadata; missing fields can come back empty, so fall back with `or`
        meta = doc.metadata or {}
        self.metadata = {
            "title": meta.get("title") or "Unknown",
            "author": meta.get("author") or "Unknown",
            "pages": doc.page_count,
            "created": meta.get("creationDate") or "Unknown",
        }
        
        # Extract text from each page
        for page_num, page in enumerate(doc, start=1):
            page_text = page.get_text()
            self.pages.append({
                "page_number": page_num,
                "text": page_text,
                "char_count": len(page_text)
            })
            self.full_text += page_text + "\n"
        
        doc.close()
        
        print(f"✅ Extracted {len(self.pages)} pages")
        print(f"✅ Total characters: {len(self.full_text):,}")
        
        return {
            "metadata": self.metadata,
            "full_text": self.full_text,
            "pages": self.pages
        }

print("✅ PDFExtractor class defined")
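
The `created` field above holds the raw PDF date string (for example `D:20240115093000+01'00'`) or the `"Unknown"` fallback. A minimal sketch of turning that string into a `datetime` follows. The helper name `parse_pdf_date` is hypothetical and not part of PyMuPDF, and the regex assumes the standard `D:YYYYMMDDHHmmSS` layout with an optional UTC offset.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# PDF date strings look like "D:20240115093000+01'00'".
# Everything after the four-digit year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(raw: str) -> Optional[datetime]:
    """Hypothetical helper: convert a PDF date string to a datetime, or None."""
    match = _PDF_DATE.match(raw.strip())
    if not match:
        return None  # e.g. the "Unknown" fallback value
    year, month, day, hour, minute, second, zulu, sign, tz_h, tz_m = match.groups()

    tzinfo = None
    if zulu:
        tzinfo = timezone.utc
    elif sign:
        offset = timedelta(hours=int(tz_h), minutes=int(tz_m or 0))
        tzinfo = timezone(offset if sign == "+" else -offset)

    try:
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        return None  # digits present but not a valid calendar date


print(parse_pdf_date("D:20240115093000+01'00'"))  # 2024-01-15 09:30:00+01:00
```

Returning `None` instead of raising keeps the extractor usable on PDFs whose producers write non-standard dates.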

## 🔍 Step 3: Section Detection (Rule-based)

In [None]:
class SectionDetector:
    """Detect common research paper sections using pattern matching"""
    
    # Common section headers in research papers
    SECTION_PATTERNS = {
        "abstract": [
            r"^abstract\s*$",
            r"^summary\s*$",
        ],
        "introduction": [
            r"^1\.?\s*introduction",
            r"^introduction\s*$",
            r"^i\.?\s*introduction",
        ],
        "related_work": [
            r"^2\.?\s*related work",
            r"^related work\s*$",
            r"^literature review",
            r"^background",
        ],
        "methodology": [
            r"^3\.?\s*method",
            r"^methodology\s*$",
            r"^approach\s*$",
            r"^model\s*$",
        ],
        "results": [
            r"^4\.?\s*results",
            r"^results\s*$",
            r"^experiments\s*$",
            r"^evaluation\s*$",
        ],
        "discussion": [
            r"^5\.?\s*discussion",
            r"^discussion\s*$",
            r"^analysis\s*$",
        ],
        "conclusion": [
            r"^6\.?\s*conclusion",
            r"^conclusion\s*$",
            r"^concluding remarks",
        ],
        "references": [
            r"^references\s*$",
            r"^bibliography\s*$",
        ],
    }
    
    def __init__(self, text: str):
        self.text = text
        self.sections = {}
        
    def detect_sections(self) -> Dict[str, str]:
        """Detect sections using pattern matching"""
        lines = self.text.split('\n')
        current_section = "unknown"
        section_content = {key: [] for key in self.SECTION_PATTERNS.keys()}
        section_content["unknown"] = []
        
        for line in lines:
            line_lower = line.strip().lower()
            
            # Check if line matches any section header
            matched = False
            for section_name, patterns in self.SECTION_PATTERNS.items():
                for pattern in patterns:
                    if re.match(pattern, line_lower, re.IGNORECASE):
                        current_section = section_name
                        matched = True
                        break
                if matched:
                    break
            
            # Add line to current section
            if not matched and line.strip():
                section_content[current_section].append(line)
        
        # Convert lists to strings
        self.sections = {
            section: '\n'.join(content).strip()
            for section, content in section_content.items()
            if content and '\n'.join(content).strip()
        }
        
        print(f"✅ Detected sections: {list(self.sections.keys())}")
        return self.sections

print("✅ SectionDetector class defined")
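
The patterns above tie each section to a fixed number (Introduction is `1`, Related Work is `2`, and so on), and a few of them, such as `^background`, match any line that merely starts with that word. One possible alternative, sketched here under assumptions, strips an optional numbering prefix first and then requires the whole remaining line to be a known title. The names `classify_header`, `_NUMBERING`, and `_TITLES` are hypothetical, and the title table is deliberately incomplete.

```python
import re
from typing import Optional

# Optional numbering prefix: "1", "1.", "2.3", or a Roman numeral such as "IV."
_NUMBERING = re.compile(r"^(?:\d+(?:\.\d+)*\.?|[ivx]+\.)\s+", re.IGNORECASE)

# Hypothetical title table; extend it with the synonyms from SECTION_PATTERNS.
_TITLES = {
    "abstract": "abstract",
    "introduction": "introduction",
    "related work": "related_work",
    "methods": "methodology",
    "methodology": "methodology",
    "results": "results",
    "discussion": "discussion",
    "conclusion": "conclusion",
    "conclusions": "conclusion",
    "references": "references",
}


def classify_header(line: str) -> Optional[str]:
    """Return the section key if the whole line is a known header, else None."""
    # Drop the numbering prefix (if any), then compare the full remaining text.
    title = _NUMBERING.sub("", line.strip(), count=1).lower()
    return _TITLES.get(title)


for sample in ["1. Introduction", "IV. Results", "Background noise was removed."]:
    print(repr(sample), "->", classify_header(sample))
```

Matching the full line trades recall for precision: a header like "Background and Related Work" is missed unless it is added to the table, but body sentences no longer switch the current section.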

## 🤖 Step 4: CrewAI Agents Setup

In [None]:
# Initialize LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",  # Faster and cheaper for testing
    temperature=0.1,  # Low temperature for consistent results
)

# Agent 1: Paper Structure Analyzer
paper_analyzer = Agent(
    role="Research Paper Structure Analyst",
    goal="Analyze the structure of academic papers and identify key sections accurately",
    backstory="""You are an expert at reading academic papers across all disciplines.
    You have a PhD-level understanding of paper structure and can identify sections
    even when they're not clearly labeled. You understand that papers might use
    different terminology (e.g., 'Methods' vs 'Methodology' vs 'Approach').""",
    llm=llm,
    verbose=True,
    allow_delegation=False,
)

# Agent 2: Content Extractor
content_extractor = Agent(
    role="Academic Content Extractor",
    goal="Extract and organize key information from research papers",
    backstory="""You are skilled at extracting metadata, key points, and essential
    information from academic papers. You can identify the research problem, 
    contributions, methodology, and results even when they're not explicitly stated.""",
    llm=llm,
    verbose=True,
    allow_delegation=False,
)

print("✅ CrewAI Agents created:")
print(f"  - {paper_analyzer.role}")
print(f"  - {content_extractor.role}")

## 📋 Step 5: Define Tasks

In [None]:
def create_structure_analysis_task(paper_text: str) -> Task:
    """Task for analyzing paper structure"""
    return Task(
        description=f"""Analyze this research paper and identify its structure.
        
        Paper text (first 3000 characters):
        ```
        {paper_text[:3000]}
        ```
        
        Your task:
        1. Identify all major sections (Abstract, Introduction, Methods, Results, etc.)
        2. Note any unconventional section names
        3. Estimate where each section begins and ends
        4. Identify if this is a standard research paper or a different format
        
        Return a JSON object with this structure:
        {{
            "paper_type": "research_article|review|conference_paper|preprint",
            "sections_found": ["section1", "section2", ...],
            "structure_quality": "high|medium|low",
            "notes": "any important observations"
        }}
        """,
        agent=paper_analyzer,
        expected_output="A JSON object describing the paper structure"
    )

def create_metadata_extraction_task(paper_text: str) -> Task:
    """Task for extracting paper metadata"""
    return Task(
        description=f"""Extract key metadata from this research paper.
        
        Paper text (first 2000 characters):
        ```
        {paper_text[:2000]}
        ```
        
        Your task:
        1. Extract the paper title
        2. Identify authors (if mentioned)
        3. Determine the research domain/field
        4. Identify the main research problem
        5. Note any key contributions mentioned in abstract
        
        Return a JSON object with this structure:
        {{
            "title": "paper title",
            "authors": ["author1", "author2"],
            "field": "research field",
            "problem": "main research problem",
            "contributions": ["contribution1", "contribution2"]
        }}
        """,
        agent=content_extractor,
        expected_output="A JSON object with paper metadata"
    )

print("✅ Task creation functions defined")
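
Both tasks ask the agent to return a JSON object, but an LLM reply often wraps the JSON in a code fence or surrounds it with prose. A minimal sketch of a tolerant parser follows, meant to be applied to the string form of a reply. `parse_agent_json` is a hypothetical helper, not a CrewAI function, and it simply takes the span from the first `{` to the last `}`.

```python
import json
import re
from typing import Any, Optional

# Build the fence marker programmatically so this snippet stays easy to embed.
_TICKS = "`" * 3
_FENCE = re.compile(
    _TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL | re.IGNORECASE
)


def parse_agent_json(raw: str) -> Optional[Any]:
    """Hypothetical helper: extract and decode a JSON object from an LLM reply."""
    # Prefer the contents of the first fenced block, if the reply has one.
    fenced = _FENCE.search(raw)
    candidate = fenced.group(1) if fenced else raw

    # Take everything from the first "{" to the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None  # malformed JSON: let the caller decide how to recover


reply = 'Sure, here it is: {"paper_type": "preprint", "structure_quality": "high"}'
print(parse_agent_json(reply))
```

Returning `None` on failure lets the pipeline keep the raw string as a fallback rather than crash on one badly formatted reply.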

## 🔄 Step 6: Complete Processing Pipeline

In [None]:
class PaperProcessor:
    """Complete pipeline for processing research papers"""
    
    def __init__(self, pdf_path: str):
        self.pdf_path = pdf_path
        self.extractor = PDFExtractor(pdf_path)
        self.extracted_data = None
        self.detected_sections = None
        self.ai_analysis = None
        
    def extract_text(self):
        """Step 1: Extract text from PDF"""
        print("\n" + "="*50)
        print("STEP 1: Text Extraction")
        print("="*50)
        self.extracted_data = self.extractor.extract_with_pymupdf()
        return self.extracted_data
    
    def detect_sections(self):
        """Step 2: Detect sections using rules"""
        print("\n" + "="*50)
        print("STEP 2: Section Detection (Rule-based)")
        print("="*50)
        detector = SectionDetector(self.extracted_data["full_text"])
        self.detected_sections = detector.detect_sections()
        
        print("\nSection Summary:")
        for section, content in self.detected_sections.items():
            word_count = len(content.split())
            print(f"  📄 {section.upper()}: {word_count} words")
        
        return self.detected_sections
    
    def analyze_with_ai(self):
        """Step 3: Analyze with CrewAI agents"""
        print("\n" + "="*50)
        print("STEP 3: AI Analysis with CrewAI")
        print("="*50)
        
        # Create tasks
        structure_task = create_structure_analysis_task(
            self.extracted_data["full_text"]
        )
        metadata_task = create_metadata_extraction_task(
            self.extracted_data["full_text"]
        )
        
        # Create crew
        crew = Crew(
            agents=[paper_analyzer, content_extractor],
            tasks=[structure_task, metadata_task],
            process=Process.sequential,
            verbose=True
        )
        
        # Execute
        print("\n🤖 Running AI agents...")
        result = crew.kickoff()
        
        self.ai_analysis = result
        return result
    
    def get_final_output(self) -> Dict:
        """Get complete processed output"""
        return {
            "pdf_path": self.pdf_path,
            "processed_at": datetime.now().isoformat(),
            "metadata": self.extracted_data["metadata"],
            "page_count": len(self.extracted_data["pages"]),
            "total_chars": len(self.extracted_data["full_text"]),
            "detected_sections": {
                section: {
                    "word_count": len(content.split()),
                    # Append an ellipsis only when the preview is actually truncated
                    "preview": content[:200] + ("..." if len(content) > 200 else "")
                }
                for section, content in self.detected_sections.items()
            },
            "ai_analysis": str(self.ai_analysis),
            "sections_full": self.detected_sections
        }
    
    def process(self):
        """Run complete pipeline"""
        self.extract_text()
        self.detect_sections()
        self.analyze_with_ai()
        return self.get_final_output()

print("✅ PaperProcessor class defined")

## 🧪 Step 7: Test with Your PDF

**IMPORTANT:** Update the `PDF_PATH` variable below with the path to your research paper PDF

In [None]:
# UPDATE THIS PATH TO YOUR PDF!
PDF_PATH = "sample_paper.pdf"  # Change this to your actual PDF path

# Check if file exists
if not os.path.exists(PDF_PATH):
    print(f"❌ PDF not found: {PDF_PATH}")
    print("""\nPlease:
    1. Download a research paper PDF
    2. Place it in the same folder as this notebook
    3. Update PDF_PATH variable above
    """)
else:
    print(f"✅ Found PDF: {PDF_PATH}")
    
    # Process the paper
    processor = PaperProcessor(PDF_PATH)
    
    # Run pipeline
    result = processor.process()
    
    print("\n" + "="*50)
    print("FINAL RESULT")
    print("="*50)
    print(f"\n📄 Paper: {result['metadata']['title']}")
    print(f"📊 Pages: {result['page_count']}")
    print(f"📝 Total characters: {result['total_chars']:,}")
    print(f"\n🔍 Detected sections: {list(result['detected_sections'].keys())}")

## 💾 Step 8: Save Results

In [None]:
def save_results(result: Dict, output_path: str):
    """Save processed results to JSON file"""
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
    print(f"✅ Results saved to: {output_path}")

# Save if we have results
if 'result' in locals():
    output_file = "stage1_output.json"
    save_results(result, output_file)
    
    # Show sample output
    print("\n📋 Sample Output:")
    print(json.dumps(result["detected_sections"], indent=2)[:500])
    print("\n[Output truncated... see full results in stage1_output.json]")
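
A later stage will need to read this file back. As a rough sketch, the loader below checks that the top-level keys written by `get_final_output()` are present before returning the data. `load_results` and `EXPECTED_KEYS` are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path
from typing import Dict

# Top-level keys produced by PaperProcessor.get_final_output() above.
EXPECTED_KEYS = {
    "pdf_path", "processed_at", "metadata", "page_count",
    "total_chars", "detected_sections", "ai_analysis", "sections_full",
}


def load_results(path: str) -> Dict:
    """Hypothetical helper: read a saved result and fail loudly on missing keys."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = EXPECTED_KEYS - set(data)
    if missing:
        raise KeyError(f"missing keys in {path}: {sorted(missing)}")
    return data
```

For example, `load_results("stage1_output.json")["sections_full"]` would return the full section texts saved by the cell above.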

## 📊 Step 9: Visualize Section Distribution (Optional)

In [None]:
import matplotlib.pyplot as plt

def visualize_sections(sections: Dict):
    """Create a simple bar chart of section sizes"""
    section_names = list(sections.keys())
    word_counts = [len(sections[s].split()) for s in section_names]
    
    plt.figure(figsize=(10, 6))
    plt.barh(section_names, word_counts, color='skyblue')
    plt.xlabel('Word Count')
    plt.ylabel('Section')
    plt.title('Research Paper Section Distribution')
    plt.tight_layout()
    plt.show()

# Visualize if we have results
if 'result' in locals() and result.get('sections_full'):
    visualize_sections(result['sections_full'])
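
Where matplotlib is not installed (for example in a headless backend environment), the same word-count distribution can be printed as text. This is only a sketch, and `text_bar_chart` is a hypothetical helper that is not used elsewhere in this notebook.

```python
from typing import Dict


def text_bar_chart(sections: Dict[str, str], width: int = 40) -> str:
    """Hypothetical fallback: render section word counts as a plain-text bar chart."""
    counts = {name: len(text.split()) for name, text in sections.items()}
    if not counts:
        return ""
    longest = max(counts.values()) or 1  # avoid dividing by zero
    label_width = max(len(name) for name in counts)

    lines = []
    for name, count in counts.items():
        # Scale each bar relative to the largest section.
        bar = "#" * round(width * count / longest)
        lines.append(f"{name.ljust(label_width)} | {bar} {count}")
    return "\n".join(lines)


print(text_bar_chart(
    {"abstract": "a b c d e", "introduction": "a b c d e f g h i j"},
    width=10,
))
```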

## 🎯 Summary

**What we accomplished:**
- ✅ Extracted text from PDF using PyMuPDF
- ✅ Detected sections using rule-based patterns
- ✅ Analyzed structure with CrewAI agents
- ✅ Extracted metadata (title, authors, field, problem, contributions) with an LLM agent
- ✅ Saved results to JSON

**Next steps (Stage 2):**
1. Implement smart text chunking
2. Generate embeddings
3. Store in vector database
4. Enable semantic search
