# QDET Connectors Module Tutorial

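This notebook walks through the main tools in the QDET connectors module, which is imported in code as `qudet.connectors`. The module provides data loading, transformation, serialization, and streaming utilities for quantum computing workflows. Several classes imported below (`QuantumSQLLoader`, `QuantumParquetLoader`, `DataCacher`, `BatchAggregator`, `DataMetadataTracker`, `DataConnectorFactory`, and `DataFormatConverter`) are not demonstrated in this notebook.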

## 1. Import Required Libraries

Import necessary libraries including pandas, numpy, and all tools from the connectors module.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import tempfile
import os
warnings.filterwarnings('ignore')

# Import connectors module tools
from qudet.connectors import (
    QuantumDataLoader,
    QuantumSQLLoader,
    QuantumParquetLoader,
    QuantumSerializer,
    StreamingDataBuffer,
    DataStreamIterator,
    DataValidator,
    DataCacher,
    BatchAggregator,
    DataTransformer,
    DataMetadataTracker,
    DataQualityChecker,
    DataProfiler,
    DataConnectorFactory,
    DataBatchProcessor,
    DataFormatConverter,
    DataSplitter,
    DataSampler
)

# Set random seed for reproducibility
np.random.seed(42)

print("✓ All libraries and connectors tools imported successfully!")

## 2. Load and Explore the Iris Dataset

Load the iris.csv dataset and examine its structure for use in our connector demonstrations.

In [None]:
# Load the Iris dataset
iris_df = pd.read_csv('../qudet/datasets/iris.csv')

print("Dataset Shape:", iris_df.shape)
print("\nFirst 5 rows:")
print(iris_df.head())

print("\nDataset Info:")
print(iris_df.info())

print("\nBasic Statistics:")
print(iris_df.describe())

# Separate features from labels (assumes the label is the last column)
X = iris_df.iloc[:, :-1]
y = iris_df.iloc[:, -1]

print(f"\nFeatures shape: {X.shape}")
print(f"Target distribution:\n{y.value_counts()}")

## 3. Quantum Data Loader

**Description**: Batches classical data and encodes each batch into ready-to-run quantum circuits. It plays a role similar to PyTorch's `DataLoader`, but targets quantum backends and encodes the data into quantum states automatically.

**Use Case**: Stream iris flower measurements through quantum circuit encoding in batches.

In [None]:
# Initialize Quantum Data Loader with batch size of 16
loader = QuantumDataLoader(X, batch_size=16, encoder_type='angle')

print("Quantum Data Loader Results:")
print(f"Total samples: {len(X)}")
print(f"Batch size: {loader.batch_size}")
print(f"Number of batches: {len(loader)}")
print(f"Feature dimensions: {loader.n_features}")
print(f"Encoder type: {loader.encoder_type}")

# Iterate through batches
batch_count = 0
circuit_count = 0
for batch_data, circuits in loader:
    batch_count += 1
    circuit_count += len(circuits)
    if batch_count == 1:
        print(f"\nFirst batch shape: {batch_data.shape}")
        print(f"Circuits generated for first batch: {len(circuits)}")

print(f"\nTotal batches processed: {batch_count}")
print(f"Total circuits generated: {circuit_count}")
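
The batching and encoding idea can be sketched without `qudet`. This is a minimal standard-library illustration, not the loader's actual implementation: `iter_batches` and `angle_encode` are hypothetical helpers, and the linear mapping of each feature onto an angle in [0, π] is only one common convention for angle encoding.

```python
import math

def iter_batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def angle_encode(row, lo, hi):
    """Map each feature linearly from [lo, hi] onto a rotation angle in [0, pi]."""
    return [math.pi * (x - lo) / (hi - lo) for x in row]

# Hypothetical stand-in data: 150 rows of 4 features, the same shape as the Iris features.
rows = [[float(i % 7), 1.0, 2.0, 3.0] for i in range(150)]
batches = list(iter_batches(rows, batch_size=16))
angles = angle_encode([0.0, 3.0, 6.0], lo=0.0, hi=6.0)

print(len(batches), len(batches[-1]))  # 10 batches, the last one holding 6 rows
```

With 150 samples and a batch size of 16, the last batch is shorter than the others, which is why the number of batches is the ceiling of 150 / 16.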

## 4. Quantum Serializer

**Description**: Handles saving and loading quantum circuits in standardized formats (QASM, JSON, Pickle). Enables persistence of quantum computations for reproducibility and pipeline recovery.

**Use Case**: Save generated quantum circuits for later use or archival.

In [None]:
# Generate a small batch of circuits to serialize
loader_serialize = QuantumDataLoader(X.iloc[:5], batch_size=5, encoder_type='angle')

circuits_to_save = []
for batch_data, circuits in loader_serialize:
    circuits_to_save.extend(circuits)
    break

# Create temporary directory for demo
temp_dir = tempfile.mkdtemp()
circuits_file = os.path.join(temp_dir, 'iris_circuits.json')

# Save circuits
QuantumSerializer.save_circuits(circuits_to_save, circuits_file)

print("Quantum Serializer Results:")
print(f"Circuits saved: {len(circuits_to_save)}")
print(f"Output file: {circuits_file}")
print(f"File exists: {os.path.exists(circuits_file)}")

# Load circuits back
loaded_circuits = QuantumSerializer.load_circuits(circuits_file)

print(f"\nCircuits loaded: {len(loaded_circuits)}")
print(f"First circuit qubits: {loaded_circuits[0].num_qubits}")
print("✓ Serialization/Deserialization successful")

# Remove the temporary directory created for this demo
import shutil
shutil.rmtree(temp_dir)
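
The save/load round trip can be illustrated with plain JSON. The circuit description below is a hypothetical dictionary layout chosen for the example; the on-disk schema used by `QuantumSerializer` may differ. The sketch also uses `tempfile.TemporaryDirectory`, which removes the directory automatically when the block exits.

```python
import json
import os
import tempfile

# Hypothetical plain-dict description of a circuit (not qudet's real schema).
circuits = [
    {"num_qubits": 4,
     "gates": [{"name": "ry", "qubit": q, "angle": 0.1 * q} for q in range(4)]}
    for _ in range(5)
]

# The directory and its contents are deleted when the with-block exits.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "circuits.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(circuits, fh)
    with open(path, encoding="utf-8") as fh:
        restored = json.load(fh)

print(restored == circuits)
```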

## 5. Data Transformer

**Description**: Transforms data through normalization, scaling, and standardization. Provides a fit/transform interface so that the same preprocessing parameters can be applied consistently to classical data before quantum encoding.

**Use Case**: Normalize iris features to a consistent range before quantum encoding.

In [None]:
# Demonstrate different data transformations
transformations = ['normalize', 'scale', 'standardize']
X_values = X.values

print("Data Transformer Results:")
print("-" * 60)

for trans_type in transformations:
    # Create and fit transformer
    transformer = DataTransformer(transformation_type=trans_type)
    transformer.fit(X_values)
    
    # Transform data
    X_transformed = transformer.transform(X_values)
    
    print(f"\n{trans_type.upper()}:")
    print(f"  Original - min: {X_values.min():.4f}, max: {X_values.max():.4f}")
    print(f"  Transformed - min: {X_transformed.min():.4f}, max: {X_transformed.max():.4f}")
    print(f"  Mean: {X_transformed.mean():.4f}, Std: {X_transformed.std():.4f}")

print("\n" + "-" * 60)
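
To make the fit/transform pattern concrete, here is a minimal sketch with a hypothetical `ColumnScaler` class. It is not `DataTransformer` itself, and the two modes shown (min-max rescaling and z-score standardization) are common definitions that may not match what `qudet` means by `'normalize'`, `'scale'`, and `'standardize'`.

```python
import statistics

class ColumnScaler:
    """Fit/transform sketch: 'minmax' rescales each column to [0, 1], 'zscore' to mean 0 and std 1."""

    def __init__(self, kind):
        if kind not in ("minmax", "zscore"):
            raise ValueError(f"unknown kind: {kind}")
        self.kind = kind
        self.params = None

    def fit(self, rows):
        # Learn one (offset, scale) pair per column.
        columns = list(zip(*rows))
        if self.kind == "minmax":
            self.params = [(min(c), max(c) - min(c)) for c in columns]
        else:
            self.params = [(statistics.fmean(c), statistics.pstdev(c)) for c in columns]
        return self

    def transform(self, rows):
        # Apply (x - offset) / scale; constant columns map to 0.0 to avoid dividing by zero.
        return [
            [(x - off) / scale if scale else 0.0 for x, (off, scale) in zip(row, self.params)]
            for row in rows
        ]

train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = ColumnScaler("minmax").fit(train).transform(train)
print(scaled)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Keeping `fit` separate from `transform` means the parameters learned on training data can be reused unchanged on held-out data.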

## 6. Streaming Data Buffer

**Description**: Buffers streaming data with sliding window capabilities. Maintains a fixed-size buffer of recent data and provides statistics on buffer utilization and data characteristics.

**Use Case**: Monitor real-time iris measurement streams with sliding windows.

In [None]:
# Create and populate streaming data buffer
buffer = StreamingDataBuffer(buffer_size=50, window_size=10)

# Add iris data in batches (simulating streaming)
for i in range(0, len(X), 20):
    batch = X.iloc[i:i+20].values
    buffer.add_batch(batch, batch_id=f"batch_{i//20}")

print("Streaming Data Buffer Results:")
print(f"Buffer size limit: {buffer.buffer_size}")
print(f"Current buffer occupancy: {len(buffer.buffer)}")
print(f"Window size: {buffer.window_size}")

# Get buffer statistics
stats = buffer.get_statistics()
print("\nBuffer Statistics:")
print(f"  Utilization: {stats['utilization']*100:.1f}%")
print(f"  Mean vector shape: {stats['mean'].shape}")
print(f"  Mean values (per feature): {np.round(stats['mean'], 3)}")
print(f"  Std values (per feature): {np.round(stats['std'], 3)}")

# Get sliding window
window_data = buffer.get_sliding_window()
print(f"\nSliding window data shape: {window_data.shape}")
print(f"✓ Streaming buffer operational")
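
A fixed-size buffer with a sliding window can be sketched with `collections.deque`. The `RollingBuffer` class below is a hypothetical stand-in that assumes the oldest rows are evicted once the buffer is full; whether `StreamingDataBuffer` evicts in the same way is an assumption.

```python
from collections import deque

class RollingBuffer:
    """Keep only the most recent buffer_size rows and expose the newest window_size of them."""

    def __init__(self, buffer_size, window_size):
        self.rows = deque(maxlen=buffer_size)  # oldest rows are dropped automatically
        self.buffer_size = buffer_size
        self.window_size = window_size

    def add_batch(self, batch):
        self.rows.extend(batch)

    def utilization(self):
        return len(self.rows) / self.buffer_size

    def sliding_window(self):
        return list(self.rows)[-self.window_size:]

buf = RollingBuffer(buffer_size=50, window_size=10)
for start in range(0, 150, 20):
    # Integers stand in for data rows so that eviction is easy to follow.
    buf.add_batch(range(start, min(start + 20, 150)))

print(len(buf.rows), buf.utilization(), buf.sliding_window()[0])  # 50 1.0 140
```

Under this eviction rule, streaming 150 rows into a 50-row buffer leaves only rows 100 to 149, so any statistics describe the most recent third of the data rather than the full dataset.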

## 7. Data Stream Iterator

**Description**: Provides an iterator interface for streaming data from various sources. Supports batching, shuffling, and sequential access to data for efficient processing.

**Use Case**: Iterate through iris data in shuffled batches for quantum encoding.

In [None]:
# Create data stream iterator with shuffling
stream_iter = DataStreamIterator(X, batch_size=20, shuffle=True)

print("Data Stream Iterator Results:")
print(f"Total samples: {len(X)}")
print(f"Batch size: {stream_iter.batch_size}")
print(f"Shuffling enabled: {stream_iter.shuffle}")
print(f"Number of batches: {len(stream_iter)}")

# Process first few batches
batch_count = 0
for batch in stream_iter:
    batch_count += 1
    if batch_count == 1:
        print(f"\nFirst batch shape: {batch.shape}")
        print(f"First batch sample count: {len(batch)}")
    if batch_count >= 3:
        break

print(f"\nProcessed batches: {batch_count}")
print(f"✓ Stream iterator operational")
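
The shuffled batching behaviour can be sketched as follows. `ShuffledBatches` is a hypothetical class written for this illustration; the `seed` parameter in particular is an assumption and is not part of the `DataStreamIterator` call shown above.

```python
import math
import random

class ShuffledBatches:
    """Iterate over rows in batches, optionally in a freshly shuffled order on each pass."""

    def __init__(self, rows, batch_size, shuffle=True, seed=None):
        self.rows = rows
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __len__(self):
        # The last batch may be partial, hence the ceiling.
        return math.ceil(len(self.rows) / self.batch_size)

    def __iter__(self):
        order = list(range(len(self.rows)))
        if self.shuffle:
            self.rng.shuffle(order)
        for start in range(0, len(order), self.batch_size):
            yield [self.rows[i] for i in order[start:start + self.batch_size]]

stream = ShuffledBatches(list(range(150)), batch_size=20, seed=42)
seen = [x for batch in stream for x in batch]
print(len(stream), sorted(seen) == list(range(150)))  # 8 True
```

Shuffling a list of indices rather than the data itself leaves the source untouched and guarantees that each row appears exactly once per pass.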

## 8. Data Validator

**Description**: Validates data quality and consistency. Checks for missing values, outliers, and data integrity issues before quantum processing.

**Use Case**: Ensure iris dataset quality before quantum encoding.

In [None]:
# Create and validate iris data
validator = DataValidator(strategy='strict')

# Check original data
is_valid = validator.validate(X)

print("Data Validator Results:")
print(f"Validation strategy: {validator.strategy}")
print(f"Original data valid: {is_valid}")

# Get validation report
report = validator.get_report()
print(f"\nValidation Report:")
print(f"  Missing values: {report.get('missing_values', 0)}")
print(f"  Data shape: {X.shape}")
print(f"  Data types: {X.dtypes.unique()}")

# Create a copy with one missing value to test the validator.
# A distinct name avoids confusion with the X_test split created in section 12.
X_with_nan = X.copy()
X_with_nan.iloc[0, 0] = np.nan

# Validate data with missing value
is_valid_missing = validator.validate(X_with_nan)
print(f"\nData with missing value valid: {is_valid_missing}")
print("✓ Data validation complete")
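
A missing-value check of the kind described above might look like the following sketch. The helpers `count_missing` and `validate`, and the `max_missing` threshold, are hypothetical; a threshold of 0 is used here as a plausible reading of a "strict" strategy, not as a statement of how `DataValidator` behaves.

```python
import math

def count_missing(rows):
    """Count entries that are None or NaN."""
    return sum(
        1
        for row in rows
        for x in row
        if x is None or (isinstance(x, float) and math.isnan(x))
    )

def validate(rows, max_missing=0):
    """Return (is_valid, report); the data is valid if at most max_missing entries are missing."""
    missing = count_missing(rows)
    report = {"missing_values": missing, "n_rows": len(rows)}
    return missing <= max_missing, report

clean = [[5.1, 3.5], [4.9, 3.0]]
dirty = [[float("nan"), 3.5], [4.9, 3.0]]
print(validate(clean)[0], validate(dirty))
```

Note that NaN never compares equal to itself, so the check uses `math.isnan` rather than `==`.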

## 9. Data Quality Checker

**Description**: Performs comprehensive data quality checks including schema validation, statistical anomaly detection, and consistency verification.

**Use Case**: Perform quality assurance on iris measurements before quantum processing.

In [None]:
# Initialize Data Quality Checker
quality_checker = DataQualityChecker()

# Check iris data quality
quality_score = quality_checker.check(X)

print("Data Quality Checker Results:")
print(f"Overall quality score: {quality_score:.2%}")

# Get detailed quality report
quality_report = quality_checker.get_report()
print(f"\nQuality Report:")
print(f"  Completeness: {quality_report.get('completeness', 1.0):.1%}")
print(f"  Validity: {quality_report.get('validity', 1.0):.1%}")
print(f"  Consistency: {quality_report.get('consistency', 1.0):.1%}")
print(f"  Accuracy: {quality_report.get('accuracy', 1.0):.1%}")

# Check for outliers
outlier_count = quality_checker.detect_outliers(X)
print(f"\nOutliers detected: {outlier_count}")
print(f"✓ Quality check complete")
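
Outlier detection can be done in several ways, and the method used by `detect_outliers` is not stated here. As one possibility, the sketch below applies Tukey's interquartile-range rule with a hypothetical `iqr_outliers` helper.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Made-up measurements with one obviously extreme value.
widths = [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]
print(iqr_outliers(widths))  # [100]
```

Because quartiles are barely affected by a single extreme value, this rule is less sensitive to the outliers it is trying to find than a mean-and-standard-deviation threshold would be.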

## 10. Data Profiler

**Description**: Generates comprehensive statistical profiles of datasets including distributions, correlations, and metadata. Enables understanding of data characteristics before quantum processing.

**Use Case**: Profile iris features to understand their statistical properties.

In [None]:
# Create Data Profiler
profiler = DataProfiler()

# Profile iris data
profile = profiler.profile(X)

print("Data Profiler Results:")
print(f"Profile computed for {len(X)} samples with {X.shape[1]} features")

print(f"\nProfile Summary:")
print(f"  Shape: {profile['shape']}")
print(f"  Memory usage: {profile.get('memory_usage', 'N/A')}")
print(f"  Number of columns: {profile['n_columns']}")
print(f"  Number of rows: {profile['n_rows']}")

print(f"\nFeature Statistics:")
feature_stats = profile.get('statistics', {})
if feature_stats:
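    # Show the first two features only. The loop variable is named feature_summary
    # so that it does not overwrite the buffer `stats` dictionary from section 6.
    for feature, feature_summary in list(feature_stats.items())[:2]:
        print(f"\n  {feature}:")
        print(f"    Mean: {feature_summary.get('mean', 'N/A')}")
        print(f"    Std: {feature_summary.get('std', 'N/A')}")
        print(f"    Min: {feature_summary.get('min', 'N/A')}")
        print(f"    Max: {feature_summary.get('max', 'N/A')}")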

print(f"\n✓ Data profiling complete")
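
The kind of profile read above can be approximated with the standard library. The `profile` function below is a hypothetical sketch that mirrors the keys used in this section (`shape`, `n_rows`, `n_columns`, `statistics`); the actual structure returned by `DataProfiler` may contain more or different fields. A small `pearson` helper illustrates the correlation part of the description.

```python
import math
import statistics

def profile(rows, names):
    """Summarise a table given as a list of rows plus column names."""
    columns = list(zip(*rows))
    return {
        "n_rows": len(rows),
        "n_columns": len(names),
        "shape": (len(rows), len(names)),
        "statistics": {
            name: {"mean": statistics.fmean(col), "std": statistics.stdev(col),
                   "min": min(col), "max": max(col)}
            for name, col in zip(names, columns)
        },
    }

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / norm

rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
summary = profile(rows, ["a", "b"])
r = pearson([row[0] for row in rows], [row[1] for row in rows])
print(summary["shape"], summary["statistics"]["a"]["mean"], round(r, 6))
```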

## 11. Data Batch Processor

**Description**: Processes data in batches with custom operations. Applies transformations to data batches and collects results for further processing or analysis.

**Use Case**: Apply custom transformations to iris data batches.

In [None]:
# Create batch processor
batch_processor = DataBatchProcessor(batch_size=30)

# Define a custom operation (e.g., compute mean and std for each batch)
def compute_stats(batch):
    return {
        'mean': np.mean(batch, axis=0),
        'std': np.std(batch, axis=0),
        'shape': batch.shape
    }

# Process batches
results = batch_processor.process_batches(X, compute_stats)

print("Data Batch Processor Results:")
print(f"Total batches processed: {len(results)}")
print(f"Batch size: {batch_processor.batch_size}")

for idx, result in enumerate(results):
    print(f"\nBatch {idx}:")
    print(f"  Shape: {result['shape']}")
    print(f"  Mean: {np.round(result['mean'], 3)}")
    print(f"  Std: {np.round(result['std'], 3)}")

print(f"\n✓ Batch processing complete")
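
One detail worth keeping in mind when collecting per-batch results is that batches can differ in size. The sketch below uses a hypothetical `process_batches` function (not the `DataBatchProcessor` implementation) to show that averaging per-batch means is only correct when each mean is weighted by its batch size.

```python
def process_batches(rows, fn, batch_size):
    """Apply fn to each consecutive batch and collect the results in order."""
    return [fn(rows[i:i + batch_size]) for i in range(0, len(rows), batch_size)]

def batch_mean(batch):
    return {"mean": sum(batch) / len(batch), "n": len(batch)}

values = list(range(10))
results = process_batches(values, batch_mean, batch_size=4)  # batch sizes 4, 4, 2

naive = sum(r["mean"] for r in results) / len(results)
weighted = sum(r["mean"] * r["n"] for r in results) / sum(r["n"] for r in results)
print(naive, weighted)  # the weighted value, 4.5, equals the true mean of 0..9
```

In the notebook cell above, 150 rows split evenly into five batches of 30, so the issue does not arise there, but it would with a batch size that does not divide the dataset evenly.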

## 12. Data Splitter

**Description**: Splits datasets into train, validation, and test subsets with various strategies (random, stratified, temporal). Ensures proper data partitioning for model development and evaluation.

**Use Case**: Split iris data into training and test sets.

In [None]:
# Create data splitter
splitter = DataSplitter(test_size=0.2, random_state=42)

# Split iris features
X_train, X_test = splitter.split(X)

print("Data Splitter Results:")
print(f"Total samples: {len(X)}")
print(f"Test size ratio: 0.2")

print(f"\nTrain/Test Split:")
print(f"  Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"  Test samples: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")

# Stratified split (maintain class distribution)
y_values = y.values
splitter_stratified = DataSplitter(test_size=0.2, stratify=y_values, random_state=42)
X_train_strat, X_test_strat = splitter_stratified.split(X)

print(f"\nStratified Split (maintaining species distribution):")
print(f"  Training samples: {len(X_train_strat)}")
print(f"  Test samples: {len(X_test_strat)}")
print(f"✓ Data splitting complete")
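
Stratification can be sketched by splitting each class separately and then merging the pieces. The `stratified_split` function below is a hypothetical standard-library illustration of that idea and makes no claim about how `DataSplitter` implements it.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) where each class sends about test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Same class layout as Iris: three species with 50 samples each.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx), Counter(labels[i] for i in test_idx))
```

Counting the labels in each subset, as in the last line, is a quick way to confirm that a split really preserved the class distribution.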

## 13. Data Sampler

**Description**: Samples subsets from datasets using various strategies (random, weighted, stratified). Enables working with smaller representative samples for prototyping and testing.

**Use Case**: Sample iris data for quick testing and prototyping.

In [None]:
# Create data sampler
sampler = DataSampler(sample_size=30, method='random', random_state=42)

# Sample from iris data
X_sampled = sampler.sample(X)

print("Data Sampler Results:")
print(f"Original dataset size: {len(X)}")
print(f"Sample size: {len(X_sampled)}")
print(f"Sampling method: {sampler.method}")
print(f"Sample ratio: {len(X_sampled)/len(X)*100:.1f}%")

# Stratified sampling (balanced sampling from each species)
sampler_stratified = DataSampler(sample_size=30, method='stratified', 
                                  labels=y_values, random_state=42)
X_sampled_strat = sampler_stratified.sample(X)

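print("\nStratified Sample Results:")
print(f"Stratified sample size: {len(X_sampled_strat)}")
print(f"Sampling method: {sampler_stratified.method}")
# Class balance is not verified here; compare the label counts of the
# sampled rows against y to confirm it.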

print(f"\n✓ Data sampling complete")
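
Whether a sample is balanced is easy to check by counting labels. The sketch below draws an equal number of rows from each class with a hypothetical `balanced_sample` helper and then verifies the result with `Counter`. Equal allocation is an assumption about what "stratified" means in `DataSampler`; for Iris, where every species has 50 rows, it coincides with proportional allocation.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(labels, sample_size, seed):
    """Pick sample_size // n_classes row indices from each class, without replacement."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    per_class = sample_size // len(by_class)
    picked = []
    for indices in by_class.values():
        # Never ask for more rows than the class contains.
        picked.extend(rng.sample(indices, min(per_class, len(indices))))
    return sorted(picked)

# Deliberately imbalanced made-up labels to show the effect.
labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
picked = balanced_sample(labels, sample_size=30, seed=42)
counts = Counter(labels[i] for i in picked)
print(counts)
```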

## Summary

This tutorial demonstrated the following tools from the QDET connectors module:

1. **QuantumDataLoader** - Batches classical data into ready-to-run quantum circuits
2. **QuantumSerializer** - Saves and loads quantum circuits in QASM, JSON, and Pickle formats (only JSON is shown here)
3. **DataTransformer** - Normalizes, scales, and standardizes data
4. **StreamingDataBuffer** - Buffers streaming data with sliding window support
5. **DataStreamIterator** - Iterates through data with batching and shuffling
6. **DataValidator** - Validates data quality and consistency
7. **DataQualityChecker** - Performs comprehensive quality assurance checks
8. **DataProfiler** - Generates statistical profiles of datasets
9. **DataBatchProcessor** - Processes data batches with custom operations
10. **DataSplitter** - Splits data into train/test/validation sets
11. **DataSampler** - Samples subsets using various strategies

`QuantumSQLLoader` (streams data directly from SQL databases to quantum circuits) and `QuantumParquetLoader` (high-performance loading of Apache Parquet files) were imported but not demonstrated, as were `DataCacher`, `BatchAggregator`, `DataMetadataTracker`, `DataConnectorFactory`, and `DataFormatConverter`.

Together, the demonstrated tools cover loading, validating, transforming, and preparing classical data for quantum computing workflows.