# Real Package Analysis with PyPevol

This notebook demonstrates how to analyze real PyPI packages using PyPevol. We'll walk through the complete process from fetching packages to generating detailed evolution reports.

## What You'll Learn:
- 📦 How to analyze real packages from PyPI
- 🔍 Understanding the complete analysis pipeline
- 📊 Interpreting real-world API evolution patterns
- 📈 Generating comprehensive reports
- ⚡ Performance tips for large packages

## Prerequisites:
- Basic understanding of PyPevol APIs (see `01_basic_api_usage.ipynb`)
- Internet connection for downloading packages from PyPI

## 1. Setup and Imports

First, let's import all the necessary components and set up our environment.

In [1]:
# Core PyPevol imports
from pypevol import PackageAnalyzer
from pypevol.models import APIType, AnalysisResult
from pypevol.fetcher import PyPIFetcher

# Standard library imports
import json
import time
from datetime import datetime
from pathlib import Path

# Optional: For progress tracking
try:
    from tqdm import tqdm
    HAS_TQDM = True
except ImportError:
    HAS_TQDM = False
    print("💡 Tip: Install tqdm for progress bars: pip install tqdm")

print("🚀 PyPevol Real Package Analysis")
print("=====================================")
print(f"Analysis started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

🚀 PyPevol Real Package Analysis
Analysis started at: 2025-08-05 02:42:20




## 2. Choose Your Target Package

Let's start with a well-known package that has clear API evolution. We'll use `requests` (the HTTP library) as our primary example, but you can easily substitute any PyPI package.

In [2]:
# Primary target package
TARGET_PACKAGE = "requests"

# Alternative packages you can try:
ALTERNATIVE_PACKAGES = [
    "requests",      # HTTP library - great for studying API stability
    "flask",         # Web framework - interesting microframework evolution  
    "numpy",         # Scientific computing - massive API surface
    "pandas",        # Data analysis - rapid API evolution
    "django",        # Web framework - mature, long evolution history
    "scikit-learn",  # Machine learning - API design patterns
    "tensorflow",    # ML framework - aggressive API changes
    "pytest",        # Testing framework - plugin architecture evolution
]

print(f"🎯 Target Package: {TARGET_PACKAGE}")
print(f"📋 Alternative packages to try: {', '.join(ALTERNATIVE_PACKAGES[:4])}...")
print(f"\n💡 To analyze a different package, change TARGET_PACKAGE above and re-run!")

🎯 Target Package: requests
📋 Alternative packages to try: requests, flask, numpy, pandas...

💡 To analyze a different package, change TARGET_PACKAGE above and re-run!


## 3. Initialize the Analyzer

Create our analyzer instance and configure it for the target package.

In [3]:
# Initialize the analyzer
analyzer = PackageAnalyzer()

# Create output directory for our analysis
output_dir = Path(f"analysis_output_{TARGET_PACKAGE}")
output_dir.mkdir(exist_ok=True)

print(f"📁 Analysis output directory: {output_dir.absolute()}")
print(f"🔧 Analyzer initialized and ready!")

# Show analyzer configuration (getattr falls back to the default shown if the attribute is missing)
print(f"\n⚙️  Analyzer Configuration:")
print(f"  Max versions to analyze: {getattr(analyzer, 'max_versions', 'unlimited')}")
print(f"  Include private APIs: {getattr(analyzer, 'include_private', True)}")
print(f"  Cache enabled: {getattr(analyzer, 'use_cache', True)}")

📁 Analysis output directory: /mnt/data/py-package-evol/demo/analysis_output_requests
🔧 Analyzer initialized and ready!

⚙️  Analyzer Configuration:
  Max versions to analyze: unlimited
  Include private APIs: True
  Cache enabled: True


## 4. Fetch Package Information

Before analysis, let's explore what versions are available and select which ones to analyze.

In [4]:
# Initialize fetcher to explore available versions
fetcher = PyPIFetcher()

print(f"🔍 Fetching information for '{TARGET_PACKAGE}'...")
start_time = time.time()

try:
    # Get package metadata - use fetcher's built-in version handling
    all_version_info = fetcher.get_all_versions(TARGET_PACKAGE)
    
    print(f"✅ Package found! ({time.time() - start_time:.2f}s)")
    
    if all_version_info:
        # Get package metadata for summary info
        package_info = fetcher.get_package_metadata(TARGET_PACKAGE)
        
        print(f"\n📦 Package Information:")
        print(f"  Name: {package_info.get('info', {}).get('name', TARGET_PACKAGE)}")
        print(f"  Author: {package_info.get('info', {}).get('author', 'Unknown')}")
        print(f"  Summary: {package_info.get('info', {}).get('summary', 'No description available')[:100]}...")
        print(f"  Latest version: {package_info.get('info', {}).get('version', 'Unknown')}")
        
        # Filter to stable versions with a substring heuristic. It excludes tags such as
        # 'rc' and 'dev', but not PEP 440 short forms like '1.0a1' or '2.0b2'.
        stable_version_info = [
            v for v in all_version_info
            if not any(tag in v.version.lower() for tag in ['rc', 'dev', 'alpha', 'beta', 'pre'])
            and '.' in v.version  # Must have at least one dot (skips single-number versions)
            and v.release_date is not None  # Must have a valid release date
        ]
        
        print(f"\n📊 Version Statistics:")
        print(f"  Total releases: {len(all_version_info)}")
        print(f"  Stable releases with dates: {len(stable_version_info)}")
        
        if stable_version_info:
            # Versions are already sorted by release date from fetcher.get_all_versions()
            sorted_versions = [v.version for v in stable_version_info]
            
            print(f"  Version range: {sorted_versions[0]} → {sorted_versions[-1]} (chronological)")
            
            # Show recent versions (last 10 chronologically)
            recent_versions = sorted_versions[-10:] if len(sorted_versions) > 10 else sorted_versions
            print(f"  Recent versions: {', '.join(recent_versions)}")
            
            # Show the date range for verification
            if len(stable_version_info) > 1:
                first_date = stable_version_info[0].release_date.strftime('%Y-%m-%d')
                last_date = stable_version_info[-1].release_date.strftime('%Y-%m-%d')
                print(f"  Date range: {first_date} → {last_date}")
        else:
            sorted_versions = []
            print("  ⚠️  No stable versions with valid release dates found")
    else:
        stable_version_info = []
        sorted_versions = []
        print("  ❌ No versions found")
    
except Exception as e:
    print(f"❌ Failed to fetch package info: {e}")
    print(f"💡 Make sure the package name is correct and you have internet access")
    stable_version_info = []
    sorted_versions = []

🔍 Fetching information for 'requests'...


✅ Package found! (1.93s)

📦 Package Information:
  Name: requests
  Author: Kenneth Reitz
  Summary: Python HTTP for Humans....
  Latest version: 2.32.4

📊 Version Statistics:
  Total releases: 151
  Stable releases with dates: 151
  Version range: 0.2.0 → 2.32.4 (chronological)
  Recent versions: 2.27.1, 2.28.0, 2.28.1, 2.28.2, 2.29.0, 2.30.0, 2.31.0, 2.32.2, 2.32.3, 2.32.4
  Date range: 2011-02-14 → 2025-06-09
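
The stable-version filter above matches substrings, so short PEP 440 spellings such as `1.0a1` pass through it. As a minimal sketch, assuming only the standard library, a regex-based check could look like the following. The helper name `is_pre_release` is hypothetical and not part of PyPevol. If the third-party `packaging` library is installed, `packaging.version.Version(v).is_prerelease` is the more complete option.

```python
import re

# Hypothetical helper (not part of PyPevol): a pre-release or dev marker that
# directly follows a digit, optionally separated by '.', '-' or '_'.
_PRE_MARKER = re.compile(
    r"\d[-_.]?(?:alpha|beta|preview|pre|rc|dev|a|b|c)(?![a-z])",
    re.IGNORECASE,
)

def is_pre_release(version: str) -> bool:
    """Return True if the version string looks like a pre-release or dev release."""
    public = version.split("+", 1)[0]  # drop any local version label
    return _PRE_MARKER.search(public) is not None

for candidate in ["2.32.4", "1.0a1", "2.0.0rc1", "1.0.dev3", "1.0.post1"]:
    print(candidate, is_pre_release(candidate))
```

This is an approximation of the PEP 440 grammar, not a full parser, so unusual version strings may still be misclassified.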


## 5. Select Versions for Analysis

For demonstration purposes, we'll analyze only a small subset of versions (here, the five most recent stable releases). In a real analysis, you might want to analyze all versions or focus on specific ranges.

In [5]:
# Strategy for selecting versions to analyze
def select_analysis_versions(versions, strategy="sample", max_versions=5):
    """Select which versions to analyze based on different strategies.
    
    Note: Versions should already be sorted chronologically by upload date.
    """
    if not versions:
        return []
    
    # Versions are already sorted chronologically, so we don't need to sort again
    chronological_versions = versions
    
    if strategy == "all":
        return chronological_versions[:max_versions]  # Limit to prevent overwhelming analysis
    
    elif strategy == "sample":
        # Intelligent sampling: first, last, and evenly distributed middle versions
        if len(chronological_versions) <= max_versions:
            return chronological_versions
        
        selected = [chronological_versions[0]]  # Always include first (chronologically)
        
        if max_versions > 2:
            # Add evenly spaced middle versions
            step = len(chronological_versions) // (max_versions - 1)
            for i in range(step, len(chronological_versions) - 1, step):
                if len(selected) < max_versions - 1:
                    selected.append(chronological_versions[i])
        
        selected.append(chronological_versions[-1])  # Always include last (most recent)
        return selected
    
    elif strategy == "recent":
        # Focus on recent versions (chronologically last)
        return chronological_versions[-max_versions:]
    
    elif strategy == "major":
        # Try to select major version releases (rough heuristic)
        # Keep chronological order but filter by major version
        major_versions = {}
        selected_major = []
        for v in chronological_versions:
            major = v.split('.')[0]
            if major not in major_versions:
                major_versions[major] = True
                selected_major.append(v)
                if len(selected_major) >= max_versions:
                    break
        return selected_major
    
    else:
        return chronological_versions[:max_versions]

# Configure analysis parameters
ANALYSIS_STRATEGY = "recent"  # Options: "sample", "recent", "major", "all"
MAX_VERSIONS = 5  # Limit for demonstration purposes

if sorted_versions:
    selected_versions = select_analysis_versions(
        sorted_versions,  # Use chronologically sorted versions
        strategy=ANALYSIS_STRATEGY, 
        max_versions=MAX_VERSIONS
    )
    
    print(f"🎯 Analysis Strategy: {ANALYSIS_STRATEGY}")
    print(f"📋 Selected versions for analysis (chronological): {selected_versions}")
    print(f"⏱️  Estimated analysis time: {len(selected_versions) * 2}-{len(selected_versions) * 5} minutes")
    
    # Show what each version represents
    if len(selected_versions) > 1:
        print(f"\n📈 Analysis Span:")
        print(f"  From: {selected_versions[0]} (earliest selected)")
        print(f"  To: {selected_versions[-1]} (most recent selected)")
        print(f"  Coverage: {len(selected_versions)}/{len(sorted_versions)} stable versions")
        
        # Show the chronological progression with dates
        print(f"\n🕒 Chronological progression:")
        selected_version_info = [v for v in stable_version_info if v.version in selected_versions]
        for i, version_info in enumerate(selected_version_info, 1):
            date_str = version_info.release_date.strftime('%Y-%m-%d') if version_info.release_date else 'Unknown'
            print(f"  {i}. {version_info.version} ({date_str})")
        
else:
    selected_versions = []
    print("❌ No versions available for analysis")

🎯 Analysis Strategy: recent
📋 Selected versions for analysis (chronological): ['2.30.0', '2.31.0', '2.32.2', '2.32.3', '2.32.4']
⏱️  Estimated analysis time: 10-25 minutes

📈 Analysis Span:
  From: 2.30.0 (earliest selected)
  To: 2.32.4 (most recent selected)
  Coverage: 5/151 stable versions

🕒 Chronological progression:
  1. 2.30.0 (2023-05-03)
  2. 2.31.0 (2023-05-22)
  3. 2.32.2 (2024-05-21)
  4. 2.32.3 (2024-05-29)
  5. 2.32.4 (2025-06-09)
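
The `"sample"` strategy above uses an integer step to space out the middle versions. As an alternative sketch (not the notebook's implementation), the indices can be spread over the whole list by rounding, which always keeps the first and last items and never returns more than `k` entries. The function name `sample_evenly` is hypothetical.

```python
def sample_evenly(items, k):
    """Pick up to k items spread across the list, keeping the first and last."""
    n = len(items)
    if k <= 0 or n == 0:
        return []
    if k == 1:
        return [items[-1]]  # with a single slot, keep the most recent item
    if n <= k:
        return list(items)
    # Spread k indices over [0, n - 1]; consecutive targets differ by more
    # than 1, so the rounded indices are always distinct.
    indices = [round(i * (n - 1) / (k - 1)) for i in range(k)]
    return [items[i] for i in indices]

versions = [f"1.{minor}.0" for minor in range(20)]  # toy data
print(sample_evenly(versions, 5))
```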


## 6. Run the Analysis

Now for the main event! Let's analyze the selected versions and track API evolution.

**⚠️ Important Note:** The PyPevol analyzer doesn't support analyzing an arbitrary list of specific versions directly. Instead, it analyzes the versions within a range (`from_version` to `to_version`) and can cap the total number with `max_versions`.

To avoid analyzing very old versions with incompatible syntax (like pre-2010 packages that use old Python syntax), we'll specify the version range from our earliest to latest selected version.

In [6]:
if not selected_versions:
    result = None  # Defined so that later cells can test it safely
    print("⚠️  No versions to analyze. Please check the package name and try again.")
else:
    print(f"🚀 Starting analysis of {TARGET_PACKAGE}...")
    print(f"📊 Analyzing {len(selected_versions)} versions: {', '.join(selected_versions)}")
    print(f"⏰ Started at: {datetime.now().strftime('%H:%M:%S')}")
    
    # Run the analysis
    analysis_start = time.time()
    
    # The main analysis call - specify version range to avoid analyzing old versions
    # This will analyze versions from the earliest selected to the latest selected
    result = analyzer.analyze_package(
        package_name=TARGET_PACKAGE,
        from_version=selected_versions[0],  # Start from the earliest selected version
        to_version=selected_versions[-1],   # End at the latest selected version
        max_versions=len(selected_versions) # Limit to our selected count
    )
    
    analysis_duration = time.time() - analysis_start
    
    print(f"\n✅ Analysis completed successfully!")
    print(f"⏱️  Total time: {analysis_duration:.1f} seconds ({analysis_duration/60:.1f} minutes)")
    if result is not None and result.versions:  # Avoid dividing by zero on an empty result
        print(f"📊 Analysis rate: {analysis_duration/len(result.versions):.1f} seconds per version")
    
    # Quick summary of results
    if result:
        summary = result.generate_summary()
        print(f"\n📈 Quick Results Summary:")
        print(f"  Package: {summary['package_name']}")
        print(f"  Versions analyzed: {summary['total_versions']}")
        print(f"  Total API changes: {summary['total_changes']}")
        print(f"  Unique APIs tracked: {summary['unique_apis']}")
        print(f"  Most common change: {max(summary['change_types'].items(), key=lambda x: x[1])[0] if summary['change_types'] else 'None'}")
        
        # Show which versions were actually analyzed
        actual_versions = [v.version for v in result.versions]
        print(f"\n📋 Actual versions analyzed: {', '.join(sorted(actual_versions))}")
        

🚀 Starting analysis of requests...
📊 Analyzing 5 versions: 2.30.0, 2.31.0, 2.32.2, 2.32.3, 2.32.4
⏰ Started at: 02:42:22



✅ Analysis completed successfully!
⏱️  Total time: 1.9 seconds (0.0 minutes)
📊 Analysis rate: 0.4 seconds per version

📈 Quick Results Summary:
  Package: requests
  Versions analyzed: 5
  Total API changes: 231
  Unique APIs tracked: 231
  Most common change: added

📋 Actual versions analyzed: 2.30.0, 2.31.0, 2.32.2, 2.32.3, 2.32.4
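
Note that `sorted()` on version strings compares them character by character, so `2.10.0` would sort before `2.9.0`. The versions analyzed here happen to sort correctly, but a numeric key is safer in general. The sketch below uses only the standard library. The `version_key` helper is hypothetical and ignores pre-release and post-release suffixes. For full PEP 440 ordering, `packaging.version.Version` is the usual choice.

```python
import re

def version_key(version: str):
    """Sort key that compares the leading numeric release segments as integers."""
    release = re.match(r"\d+(?:\.\d+)*", version)
    if release is None:
        return ()  # unparsable strings sort first
    return tuple(int(part) for part in release.group().split("."))

versions = ["2.9.0", "2.32.4", "2.10.0"]
print(sorted(versions))                   # plain string comparison
print(sorted(versions, key=version_key))  # numeric comparison
```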


## 7. Explore the Results

Let's dive deep into the analysis results and understand what we discovered about the package's API evolution.
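
One of the metrics computed in the next cell is an API stability rate: the share of tracked APIs that appear in every analyzed version. The idea can be illustrated on toy data with a `Counter`. The API names and versions below are made up for illustration and do not come from PyPevol.

```python
from collections import Counter

# Toy data: API names present in each analyzed version (illustrative only).
apis_by_version = {
    "1.0": {"get", "post", "old_helper"},
    "1.1": {"get", "post"},
    "1.2": {"get", "post", "new_helper"},
}

# Count how many versions each API appears in.
presence = Counter(name for apis in apis_by_version.values() for name in apis)

total = len(apis_by_version)
stable = sorted(name for name, n in presence.items() if n == total)
one_off = sorted(name for name, n in presence.items() if n == 1)
rate = len(stable) / len(presence) * 100 if presence else 0.0

print(f"Stable: {stable}")
print(f"Single-version: {one_off}")
print(f"Stability rate: {rate:.1f}%")
```

An API that shows up in only one version is not necessarily unstable. It may simply have been added in the latest release or removed after the first one.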

In [7]:
if result is None:
    print("⚠️  No results to explore. Please ensure the analysis completed successfully.")
else:
    print(f"🔍 Detailed Analysis Results for {result.package_name}")
    print("=" * 50)
    
    # 1. Version-by-version breakdown
    print(f"\n📊 Version Breakdown:")
    for version_info in sorted(result.versions, key=lambda v: v.version):
        version_apis = result.api_elements.get(version_info.version, [])
        version_changes = [c for c in result.changes if getattr(c, 'to_version', None) == version_info.version]
        
        print(f"\n  📦 Version {version_info.version}")
        print(f"     Released: {version_info.release_date}")
        print(f"     APIs: {len(version_apis)}")
        print(f"     Changes: {len(version_changes)}")
        
        # API type breakdown for this version
        api_types = {}
        for api in version_apis:
            api_types[api.type.value] = api_types.get(api.type.value, 0) + 1
        
        if api_types:
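            # Pluralize type names; names ending in 's' (such as 'class') take 'es'
            type_summary = ", ".join(
                f"{count} {type_}{'' if count == 1 else 'es' if type_.endswith('s') else 's'}"
                for type_, count in api_types.items()
            )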
            print(f"     Types: {type_summary}")
        
        # Show significant changes for this version
        if version_changes:
            change_types = {}
            for change in version_changes:
                change_types[change.change_type.value] = change_types.get(change.change_type.value, 0) + 1
            
            change_summary = ", ".join([f"{count} {type_}" for type_, count in change_types.items()])
            print(f"     Changes: {change_summary}")
            
            # Show a few example changes
            for change in version_changes[:3]:  # Show first 3
                print(f"       • {change.change_type.value}: {change.element.name}")
            
            if len(version_changes) > 3:
                print(f"       ... and {len(version_changes) - 3} more")
    
    # 2. Overall evolution patterns
    print(f"\n🔄 Evolution Patterns:")
    
    # Find most/least stable APIs
    api_version_counts = {}
    for version, apis in result.api_elements.items():
        for api in apis:
            signature = api.get_signature()
            api_version_counts[signature] = api_version_counts.get(signature, 0) + 1
    
    total_versions = len(result.versions)
    stable_apis = [api for api, count in api_version_counts.items() if count == total_versions]
    unstable_apis = [api for api, count in api_version_counts.items() if count == 1]
    
    print(f"  Stable APIs (all versions): {len(stable_apis)}")
    print(f"  Unstable APIs (1 version): {len(unstable_apis)}")
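    # Guard against an empty result before computing the percentage
    stability_rate = len(stable_apis) / len(api_version_counts) * 100 if api_version_counts else 0.0
    print(f"  API stability rate: {stability_rate:.1f}%")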
    
    # Show examples of stable APIs
    if stable_apis:
        print(f"\n  📌 Examples of stable APIs:")
        for api in stable_apis[:5]:
            print(f"    • {api}")
    
    # Show examples of volatile APIs
    if unstable_apis:
        print(f"\n  ⚡ Examples of version-specific APIs:")
        for api in unstable_apis[:5]:
            print(f"    • {api}")
    
    # 3. Breaking changes analysis
    breaking_changes = [c for c in result.changes if not getattr(c, 'is_backwards_compatible', True)]
    if breaking_changes:
        print(f"\n⚠️  Breaking Changes Analysis:")
        print(f"  Total breaking changes: {len(breaking_changes)}")
        
        # Group by version
        breaking_by_version = {}
        for change in breaking_changes:
            version = getattr(change, 'to_version', 'unknown')
            if version not in breaking_by_version:
                breaking_by_version[version] = []
            breaking_by_version[version].append(change)
        
        for version in sorted(breaking_by_version.keys()):
            changes = breaking_by_version[version]
            print(f"\n    v{version}: {len(changes)} breaking changes")
            for change in changes[:3]:  # Show first 3
                print(f"      • {change.change_type.value}: {change.element.name}")
                if hasattr(change, 'description') and change.description:
                    print(f"        └─ {change.description[:60]}...")
    else:
        print(f"\n✅ No breaking changes detected (or all changes are backwards compatible)")

🔍 Detailed Analysis Results for requests

📊 Version Breakdown:

  📦 Version 2.30.0
     Released: 2023-05-03 15:44:03.457714+00:00
     APIs: 231
     Changes: 228
     Types: 17 constants, 68 functions, 44 classes, 102 methods
     Changes: 228 added
       • added: NETRC_FILES
       • added: DEFAULT_CA_BUNDLE_PATH
       • added: DEFAULT_PORTS
       ... and 225 more

  📦 Version 2.31.0
     Released: 2023-05-22 15:12:42.313790+00:00
     APIs: 231
     Changes: 0
     Types: 17 constants, 68 functions, 44 classes, 102 methods

  📦 Version 2.32.2
     Released: 2024-05-21 18:51:29.562156+00:00
     APIs: 232
     Changes: 2
     Types: 17 constants, 68 functions, 44 classes, 103 methods
     Changes: 1 added, 1 deprecated
       • added: get_connection_with_tls_context
       • deprecated: get_connection

  📦 Version 2.32.3
     Released: 2024-05-29 15:37:47.027275+00:00
     APIs: 233
     Changes: 1
     Types: 17 constants, 68 functions, 44 classes, 104 methods
     Changes: 1 added
