<a href="https://colab.research.google.com/github/teutedrini/OntoAligner/blob/main/eda/neo4j_eda_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neo4j Property Analysis - Exploratory Data Analysis

This notebook demonstrates how to use the Neo4j Property Analyzer to perform exploratory data analysis on Neo4j graph databases.

## Features
- Analyze node properties to determine if they are categorical or unique
- Fast mode using Cypher aggregations for large graphs
- Standard mode with DataFrame analysis for detailed insights
- Generate HTML profiling reports
- Performance monitoring and tracking
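
The categorical-versus-unique distinction in the first feature can be illustrated with a small sketch. It assumes the decision rests on a uniqueness ratio (distinct values divided by non-null values). The function name `classify_property`, both thresholds, and the `MIXED` and `EMPTY` labels are hypothetical and are not taken from `neo4j_analyzer`. The dictionary keys mirror the ones this notebook reads later (`type`, `unique_count`, `total_count`, `null_count`, `uniqueness_ratio`).

```python
def classify_property(values, unique_threshold=0.95, categorical_threshold=0.05):
    """Classify one property from a list of its values (None means missing).

    Thresholds and type labels are illustrative assumptions, not the
    analyzer's actual rules. Values must be hashable.
    """
    non_null = [v for v in values if v is not None]
    null_count = len(values) - len(non_null)

    if not non_null:
        # Nothing to measure: every value is missing (or the list is empty).
        return {"type": "EMPTY", "unique_count": 0, "total_count": 0,
                "null_count": null_count, "uniqueness_ratio": 0.0}

    unique_count = len(set(non_null))
    ratio = unique_count / len(non_null)

    if ratio >= unique_threshold:
        kind = "UNIQUE"        # almost every value is distinct, e.g. an ID
    elif ratio <= categorical_threshold:
        kind = "CATEGORICAL"   # few distinct values repeated many times
    else:
        kind = "MIXED"         # neither clearly unique nor clearly categorical

    return {"type": kind, "unique_count": unique_count,
            "total_count": len(non_null),  # assumption: counts non-null values only
            "null_count": null_count, "uniqueness_ratio": ratio}


ids = classify_property(list(range(100)))
status = classify_property(["active", "inactive"] * 50)
print(ids["type"], status["type"])
```

A ratio-based rule is cheap to compute from three counts, which is why the same idea works whether the counts come from a DataFrame or from a database-side aggregation.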

## Setup and Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

# Import the Neo4j analyzer
from neo4j_analyzer import Neo4jPropertyAnalyzer, PerformanceMonitor
from neo4j_analyzer.report_generator import ReportGenerator
from neo4j_analyzer.results_saver import ResultsSaver

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Configuration

Update these settings to connect to your Neo4j database:

In [None]:
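# Neo4j Connection Settings
# Environment variables (names chosen for this notebook) override the defaults
# below, so credentials do not have to be edited into the notebook itself.
import os

NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://54.174.185.202")
NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")
NEO4J_PASSWORD = os.environ.get("NEO4J_PASSWORD", "receipt-compasses-annexs")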

# Analysis Settings
USE_FAST_MODE = True      # Set to True for very large graphs
SAMPLE_SIZE = 50000       # Sample size for standard mode (None for all)
FETCH_SIZE = 2000         # Batch size for data extraction
ENABLE_PERFORMANCE_TRACKING = True

## 1. Initialize Analyzer and Explore Database

In [None]:
# Initialize performance monitor
perf_monitor = PerformanceMonitor() if ENABLE_PERFORMANCE_TRACKING else None

# Initialize analyzer
analyzer = Neo4jPropertyAnalyzer(
    uri=NEO4J_URI,
    user=NEO4J_USER,
    password=NEO4J_PASSWORD,
    fetch_size=FETCH_SIZE,
    performance_monitor=perf_monitor
)

print("✓ Connected to Neo4j database")

In [None]:
# Get all node labels
labels = analyzer.get_node_labels()
print(f"Found {len(labels)} node labels:")
for label in labels:
    count = analyzer.get_node_count(label)
    print(f"  - {label}: {count:,} nodes")

## 2. Analyze Properties - Fast Mode

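Fast mode runs Cypher aggregations directly in the database, which makes it suitable for large graphs. When `USE_FAST_MODE` is `False`, the cell below falls back to standard mode and analyzes a sample of `SAMPLE_SIZE` nodes per label instead.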
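
To make the aggregation idea concrete, here is a minimal sketch of the kind of query a database-side summary could issue for a single property. It is an assumption about the approach, not the query that `get_property_summary_fast` actually runs. The helper name `build_property_stats_query` and the result aliases are hypothetical. In Cypher, `count(expr)` skips nulls and `count(DISTINCT expr)` counts distinct non-null values, so three counts come back in one row and no node data is transferred to the client.

```python
def build_property_stats_query(label: str, prop: str) -> str:
    """Build a Cypher query returning count statistics for one property.

    The label and property key are interpolated into the query text here,
    so they are backtick-quoted, with embedded backticks doubled.
    """
    safe_label = label.replace("`", "``")
    safe_prop = prop.replace("`", "``")
    return (
        f"MATCH (n:`{safe_label}`) "
        f"RETURN count(n) AS total_count, "
        f"count(n.`{safe_prop}`) AS non_null_count, "
        f"count(DISTINCT n.`{safe_prop}`) AS unique_count"
    )


print(build_property_stats_query("Person", "name"))
```

The null count then follows as `total_count - non_null_count`, and the uniqueness ratio as `unique_count / non_null_count`.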

In [None]:
# Store all results
all_results = {}

# Analyze each label
for label in labels:
    print(f"\n{'='*60}")
    print(f"Analyzing label: {label}")
    print(f"{'='*60}")
    
    if USE_FAST_MODE:
        summary = analyzer.get_property_summary_fast(label)
    else:
        summary = analyzer.get_property_summary(label, sample_size=SAMPLE_SIZE)
    
    all_results[label] = summary
    ReportGenerator.print_summary(summary, label)

## 3. Visualize Property Types Distribution

In [None]:
# Aggregate property types across all labels
property_type_counts = {}

for label, summary in all_results.items():
    for prop_name, prop_info in summary.items():
        prop_type = prop_info.get('type', 'UNKNOWN')
        if prop_type not in property_type_counts:
            property_type_counts[prop_type] = 0
        property_type_counts[prop_type] += 1

# Create visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart (dict views are converted to lists, which matplotlib accepts reliably)
type_names = list(property_type_counts.keys())
type_counts = list(property_type_counts.values())
ax1.pie(type_counts, labels=type_names,
        autopct='%1.1f%%', startangle=90)
ax1.set_title('Property Types Distribution')

# Bar chart
ax2.bar(type_names, type_counts)
ax2.set_xlabel('Property Type')
ax2.set_ylabel('Count')
ax2.set_title('Property Types Count')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print(f"\nTotal properties analyzed: {sum(property_type_counts.values())}")

## 4. Analyze Specific Label in Detail

In [None]:
# Select a label to analyze in detail (change this to your label of interest)
if labels:
    LABEL_TO_ANALYZE = labels[0]  # Analyze first label
    
    print(f"Detailed analysis for: {LABEL_TO_ANALYZE}")
    print("="*60)
    
    # Get detailed summary
    detailed_summary = all_results.get(LABEL_TO_ANALYZE, {})
    
    # Create DataFrame for better visualization
    properties_data = []
    for prop_name, prop_info in detailed_summary.items():
        properties_data.append({
            'Property': prop_name,
            'Type': prop_info.get('type', 'UNKNOWN'),
            'Unique Values': prop_info.get('unique_count', 0),
            'Total Values': prop_info.get('total_count', 0),
            'Null Count': prop_info.get('null_count', 0),
            'Uniqueness %': prop_info.get('uniqueness_ratio', 0) * 100
        })
    
    if properties_data:
        df_props = pd.DataFrame(properties_data)
        display(df_props.sort_values('Uniqueness %', ascending=False))
        
        # Visualize uniqueness ratio
        plt.figure(figsize=(12, 6))
        plt.barh(df_props['Property'], df_props['Uniqueness %'])
        plt.xlabel('Uniqueness Ratio (%)')
        plt.ylabel('Property')
        plt.title(f'Property Uniqueness for {LABEL_TO_ANALYZE}')
        plt.tight_layout()
        plt.show()

## 5. Extract Sample Data for Further Analysis

In [None]:
# Extract a sample of nodes to DataFrame for detailed analysis
if not USE_FAST_MODE and labels:
    sample_label = labels[0]
    print(f"Extracting sample data for: {sample_label}")
    
    # Extract to DataFrame
    df_sample = analyzer.extract_nodes_to_dataframe(
        label=sample_label,
        sample_size=1000  # Get 1000 sample nodes
    )
    
    print(f"\nExtracted {len(df_sample)} nodes")
    print(f"Columns: {list(df_sample.columns)}")
    
    # Display first few rows
    display(df_sample.head())
    
    # Basic statistics
    print("\nBasic Statistics:")
    display(df_sample.describe())

## 6. Generate HTML Profiling Report (Optional)

Generate a comprehensive HTML report using ydata-profiling (only works in standard mode).

In [None]:
# Uncomment to generate HTML report (requires standard mode)
# if not USE_FAST_MODE and labels:
#     report_label = labels[0]
#     output_file = f"{report_label}_profile_report.html"
#     
#     print(f"Generating HTML report for {report_label}...")
#     analyzer.analyze_properties(
#         label=report_label,
#         sample_size=5000,
#         output_html=output_file,
#         minimal=True
#     )
#     print(f"✓ Report saved to: {output_file}")

## 7. Performance Analysis

In [None]:
# Display performance metrics if tracking is enabled
if ENABLE_PERFORMANCE_TRACKING and perf_monitor:
    print("Performance Metrics:")
    print("="*60)
    
    metrics = perf_monitor.get_metrics()
    
    if metrics:
        # Create DataFrame for metrics
        metrics_data = []
        for metric in metrics:
            metrics_data.append({
                'Operation': metric.get('operation', 'Unknown'),
                'Duration (s)': metric.get('duration', 0),
                'Label': metric.get('metadata', {}).get('label', 'N/A')
            })
        
        df_metrics = pd.DataFrame(metrics_data)
        display(df_metrics)
        
        # Visualize performance
        if len(df_metrics) > 0:
            plt.figure(figsize=(12, 6))
            operation_times = df_metrics.groupby('Operation')['Duration (s)'].sum()
            operation_times.plot(kind='bar')
            plt.xlabel('Operation')
            plt.ylabel('Total Duration (seconds)')
            plt.title('Performance by Operation')
            plt.xticks(rotation=45, ha='right')
            plt.tight_layout()
            plt.show()

## 8. Save Results

In [None]:
# Save analysis results to JSON
if all_results:
    config = {
        "neo4j_uri": NEO4J_URI,
        "use_fast_mode": USE_FAST_MODE,
        "sample_size": SAMPLE_SIZE,
        "fetch_size": FETCH_SIZE,
    }
    
    results_file = ResultsSaver.save_analysis_results(all_results, config)
    print(f"✓ Results saved to: {results_file}")

# Save performance report
if ENABLE_PERFORMANCE_TRACKING and perf_monitor:
    perf_file = perf_monitor.save_report()
    print(f"✓ Performance report saved to: {perf_file}")

## 9. Summary Statistics

In [None]:
# Generate summary statistics
print("\n" + "="*60)
print("ANALYSIS SUMMARY")
print("="*60)

total_labels = len(all_results)
total_properties = sum(len(props) for props in all_results.values())

print(f"Total Labels Analyzed: {total_labels}")
print(f"Total Properties Analyzed: {total_properties}")
print("\nProperty Type Breakdown:")
for prop_type, count in sorted(property_type_counts.items()):
    percentage = (count / total_properties * 100) if total_properties > 0 else 0
    print(f"  {prop_type}: {count} ({percentage:.1f}%)")

print(f"\nAnalysis Mode: {'Fast Mode' if USE_FAST_MODE else 'Standard Mode'}")
if not USE_FAST_MODE:
    print(f"Sample Size: {SAMPLE_SIZE if SAMPLE_SIZE else 'All nodes'}")

## 10. Cleanup

In [None]:
# Close the analyzer connection
analyzer.close()
print("✓ Connection closed")

## Next Steps

1. **Explore specific properties**: Use the extracted DataFrames to perform deeper analysis on specific properties
2. **Generate HTML reports**: Uncomment the HTML report generation section for detailed profiling
3. **Compare modes**: Try both fast and standard modes to see the performance difference
4. **Custom analysis**: Use the `PropertyAnalyzer` class directly for custom property analysis
5. **Integration**: Integrate these insights into your data pipeline or ontology alignment workflow
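
For the mode comparison in step 3, a small timing helper is enough to get a first impression. The sketch below uses only the standard library. The names `time_call` and `compare_modes` are hypothetical helpers, not part of `neo4j_analyzer`, and a single wall-clock run is only a rough indicator because caching and network latency can skew it.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_modes(fast_fn, standard_fn, label):
    """Time two summary functions on the same label."""
    _, fast_seconds = time_call(fast_fn, label)
    _, standard_seconds = time_call(standard_fn, label)
    return {"label": label,
            "fast_s": fast_seconds,
            "standard_s": standard_seconds}


# In this notebook the two callables could be, for example:
#   fast_fn     = analyzer.get_property_summary_fast
#   standard_fn = lambda lbl: analyzer.get_property_summary(lbl, sample_size=SAMPLE_SIZE)
# Stand-in functions are used here so the sketch runs without a database.
timings = compare_modes(lambda lbl: {}, lambda lbl: {}, "ExampleLabel")
print(timings["label"])
```

Run the comparison before calling `analyzer.close()`, and repeat it a few times if the numbers matter.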

## Additional Resources

- See `examples.py` for more usage patterns
- Check `PERFORMANCE_GUIDE.md` for optimization tips
- Review `REFACTORING_GUIDE.md` for architecture details