# QDET Transforms Module Tutorial

This notebook demonstrates all available tools in the QDET transforms module. The transforms module provides data preprocessing, dimensionality reduction, feature engineering, and normalization techniques optimized for quantum machine learning pipelines.

## 1. Import Required Libraries

Import pandas, NumPy, matplotlib, seaborn, and the transform tools from `qudet.transforms`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import sys
import os
sys.path.append(os.path.abspath('..'))

# Import transforms module tools
from qudet.transforms import (
    QuantumNormalizer,
    RangeNormalizer,
    DecimalScaler,
    LogTransformer,
    PowerTransformer,
    FeatureScaler,
    FeatureSelector,
    OutlierRemover,
    DataBalancer,
    CategoricalEncoder,
    TargetEncoder,
    FrequencyEncoder,
    BinningEncoder,
    QuantumPCA,
    RandomProjector,
    StreamingHasher,
    QuantumImputer,
    CoresetReducer,
    AutoReducer
)

# Set random seed for reproducibility
np.random.seed(42)

print("✓ All libraries and transform tools imported successfully!")

## 2. Load and Explore the Iris Dataset

Load the iris.csv dataset and prepare it for transformation demonstrations.

In [None]:
# Load the Iris dataset
iris_df = pd.read_csv('../qudet/datasets/iris.csv')

# Separate features and labels
X = iris_df.iloc[:, :-1]
y = iris_df.iloc[:, -1]

print("Dataset Summary:")
print(f"Shape: {iris_df.shape}")
print(f"\nFirst 5 rows:")
print(iris_df.head())

print(f"\nFeature Statistics:")
print(X.describe())

print(f"\nTarget Distribution:")
print(y.value_counts())

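# Store for use in later demonstrations
X_values = X.values
y_values = y.values
feature_names = X.columns.tolist()

# Later cells refer to X_train / y_train. No train/test split is made in this
# notebook, so alias the full arrays (an assumption; swap in a real split if needed).
X_train, y_train = X_values, y_values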

## 3. Quantum Normalizer

**Description**: Normalizes features to quantum-compatible scales (L2 norm, probability, amplitude, angle). Essential for quantum algorithms that require specific input ranges.

**Use Case**: Normalize iris features for quantum state preparation.
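
To make the L2 and probability options concrete, here is a minimal standard-library sketch of what these two normalizations compute on a single sample. It is not the `QuantumNormalizer` implementation; the helper names are hypothetical, and whether the library normalizes per row or per column is an assumption (per row here).

```python
import math

def l2_normalize(row):
    """Scale a row so its Euclidean (L2) norm is 1 (amplitude-style encoding)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)

def probability_normalize(row):
    """Scale a non-negative row so its entries sum to 1."""
    total = sum(row)
    return [v / total for v in row] if total > 0 else list(row)

sample = [5.1, 3.5, 1.4, 0.2]  # one Iris-like measurement row
amplitudes = l2_normalize(sample)
probabilities = probability_normalize(sample)

print(sum(a * a for a in amplitudes))  # squared amplitudes sum to 1 (up to rounding)
print(sum(probabilities))              # probabilities sum to 1 (up to rounding)
```

With L2 normalization the squared entries can be read as measurement probabilities, which is why unit-norm vectors matter for state preparation.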

In [None]:
# Test different normalization methods
print("Quantum Normalizer Results:")
print("=" * 70)

methods = ['l2', 'l1', 'probability']
normalizers = {}

for method in methods:
    normalizer = QuantumNormalizer(method=method)
    normalizer.fit(X_values)
    X_normalized = normalizer.transform(X_values)
    normalizers[method] = X_normalized
    
    print(f"\n{method.upper()} Normalization:")
    print(f"  Original range: [{X_values.min():.4f}, {X_values.max():.4f}]")
    print(f"  Normalized range: [{X_normalized.min():.4f}, {X_normalized.max():.4f}]")
    print(f"  L2 norms of first 3 rows: {np.linalg.norm(X_normalized[0:3], axis=1)}")

print("\n" + "=" * 70)

## 4. Feature Scaler

**Description**: Scales and normalizes features using multiple strategies (standard, min-max, robust, quantum). Prepares data for quantum algorithms with specific input requirements.

**Use Case**: Scale iris features to consistent ranges.
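
The three strategies differ only in which statistics they subtract and divide by. The sketch below applies the textbook formulas to one column using the standard library; it illustrates the idea and is not `FeatureScaler`'s code. The function names are hypothetical, and a non-constant column is assumed.

```python
import statistics

def standard_scale(col):
    """(x - mean) / std, using the population standard deviation."""
    mean, std = statistics.fmean(col), statistics.pstdev(col)
    return [(v - mean) / std for v in col]

def minmax_scale(col, low=0.0, high=1.0):
    """Map the column linearly onto [low, high]."""
    lo, hi = min(col), max(col)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in col]

def robust_scale(col):
    """(x - median) / IQR, which is less sensitive to extreme values."""
    q1, median, q3 = statistics.quantiles(col, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in col]

column = [4.9, 5.1, 5.8, 6.3, 7.0]
standardized = standard_scale(column)
minmaxed = minmax_scale(column)
robust = robust_scale(column)

print([round(v, 3) for v in standardized])
print([round(v, 3) for v in minmaxed])
print([round(v, 3) for v in robust])
```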

In [None]:
# Test different scaling methods
print("Feature Scaler Results:")
print("=" * 70)

scaling_methods = ['standard', 'minmax', 'robust']

for method in scaling_methods:
    scaler = FeatureScaler(method=method, feature_range=(0, 1))
    scaler.fit(X_values)
    X_scaled = scaler.transform(X_values)
    
    print(f"\n{method.upper()} Scaling:")
    print(f"  Range: [{X_scaled.min():.4f}, {X_scaled.max():.4f}]")
    print(f"  Mean: {X_scaled.mean():.4f}, Std: {X_scaled.std():.4f}")
    print(f"  First sample (scaled): {np.round(X_scaled[0], 3)}")

print("\n" + "=" * 70)
print("✓ Feature scaling complete")

## 5. Feature Selector

**Description**: Selects the most relevant features for quantum algorithms using statistical methods. Reduces feature dimensionality while preserving predictive power.

**Use Case**: Select important iris features for quantum classification.
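
Assuming `method='f_classif'` refers to the one-way ANOVA F-test (as the identically named scikit-learn function does), the score of a feature is the ratio of between-class to within-class variance. A minimal sketch with toy data and a hypothetical helper name:

```python
from collections import defaultdict

def anova_f_score(values, labels):
    """One-way ANOVA F statistic of a single feature against class labels."""
    groups = defaultdict(list)
    for v, label in zip(values, labels):
        groups[label].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

labels = ["a", "a", "a", "b", "b", "b"]
features = {
    "separating": [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],   # class means differ clearly
    "overlapping": [1.0, 3.0, 2.0, 1.1, 2.9, 2.1],  # class means nearly equal
}
scores = {name: anova_f_score(vals, labels) for name, vals in features.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Selecting the top `k` features then amounts to sorting the features by this score.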

In [None]:
# Feature selection
selector = FeatureSelector(k=2, method='f_classif')
selector.fit(X_values, y_values)
X_selected = selector.transform(X_values)

print("Feature Selector Results:")
print(f"Original features: {len(feature_names)}")
print(f"Selected features: {X_selected.shape[1]}")

# Get feature importance scores
importance = selector.get_feature_importance()
selected_indices = selector.get_selected_features()

print(f"\nSelected feature indices: {selected_indices}")
print(f"Selected feature names: {[feature_names[i] for i in selected_indices]}")

print(f"\nFeature importance scores:")
for idx, score in enumerate(importance):
    print(f"  {feature_names[idx]}: {score:.4f}")

print(f"\nDimensionality reduction: {len(feature_names)} → {X_selected.shape[1]} features")

## 6. Outlier Remover

**Description**: Detects and removes outliers from datasets using statistical methods (IQR, Z-score, isolation forest). Essential for data quality before quantum processing.

**Use Case**: Clean iris data by removing statistical outliers.
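
The IQR rule flags a value as an outlier when it lies more than `k` interquartile ranges outside the first or third quartile. The sketch below assumes that `threshold=1.5` plays the role of `k` and that a row is dropped if any of its features is flagged; both are assumptions about `OutlierRemover`, and the helper names are hypothetical.

```python
import statistics

def iqr_bounds(col, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(col, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outlier_rows(rows, k=1.5):
    """Keep rows whose every feature lies within its column's fences."""
    bounds = [iqr_bounds(col, k) for col in zip(*rows)]
    return [r for r in rows if all(lo <= v <= hi for v, (lo, hi) in zip(r, bounds))]

rows = [[5.0, 3.4], [5.1, 3.5], [4.9, 3.0], [5.2, 3.6], [5.0, 3.3], [9.9, 3.2]]
cleaned = remove_outlier_rows(rows)
print(len(rows), "->", len(cleaned))
```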

In [None]:
from qudet.transforms.feature_engineering import OutlierRemover

# Create outlier remover with IQR method
outlier_remover = OutlierRemover(method='iqr', threshold=1.5)

# Fit and transform data
X_cleaned = outlier_remover.fit_transform(X_train)

# Display outlier statistics
print(f"Original dataset shape: {X_train.shape}")
print(f"Cleaned dataset shape: {X_cleaned.shape}")
print(f"Number of outliers removed: {X_train.shape[0] - X_cleaned.shape[0]}")
print(f"Percentage of data removed: {(X_train.shape[0] - X_cleaned.shape[0]) / X_train.shape[0] * 100:.2f}%")

# Show outlier detection with Z-score method
outlier_remover_zscore = OutlierRemover(method='zscore', threshold=3)
X_zscore_cleaned = outlier_remover_zscore.fit_transform(X_train)
print(f"\nZ-score method: Removed {X_train.shape[0] - X_zscore_cleaned.shape[0]} outliers")

## 7. Data Balancer

**Description**: Balances class distribution in imbalanced datasets using techniques like oversampling, undersampling, and SMOTE. Ensures fair quantum circuit training on all classes.

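**Use Case**: Demonstrate the balancing API on the Iris data. The standard Iris dataset is already balanced (50 samples per class), so the class counts are expected to stay unchanged here; the effect only shows on imbalanced data.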
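
Because balanced data hides the effect, the following sketch shows random oversampling on a deliberately imbalanced toy set. It is a plain standard-library illustration with a hypothetical function name, not `DataBalancer`'s code. SMOTE differs in that it interpolates new points between minority neighbours instead of duplicating rows.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

X_toy = [[5.1], [4.9], [5.0], [5.2], [6.9], [7.1]]
y_toy = ["setosa", "setosa", "setosa", "setosa", "virginica", "virginica"]

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(Counter(y_toy), "->", Counter(y_bal))
```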

In [None]:
from qudet.transforms.feature_engineering import DataBalancer
from collections import Counter

# Original class distribution
original_counts = Counter(y_train)
print("Original class distribution:")
print(f"  {original_counts}")

# Balance dataset using SMOTE
balancer = DataBalancer(strategy='smote', random_state=42)
X_balanced, y_balanced = balancer.fit_transform(X_train, y_train)

balanced_counts = Counter(y_balanced)
print(f"\nBalanced class distribution:")
print(f"  {balanced_counts}")
print(f"\nOriginal dataset size: {len(y_train)}")
print(f"Balanced dataset size: {len(y_balanced)}")

# Try random oversampling
balancer_oversample = DataBalancer(strategy='oversample', random_state=42)
X_over, y_over = balancer_oversample.fit_transform(X_train, y_train)
over_counts = Counter(y_over)
print(f"\nOversampling result: {over_counts}")

## 8. Categorical Encoder

**Description**: Encodes categorical features using label, one-hot, ordinal, or binary encoding. Prepares categorical data for quantum circuits.

**Use Case**: Encode iris species labels using different encoding schemes.
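
As a reference for what the label and one-hot outputs look like, here is a minimal sketch that assigns integer codes in sorted category order. The ordering rule and the helper names are assumptions; `CategoricalEncoder` may order categories differently.

```python
def fit_label_mapping(values):
    """Assign integer codes to categories in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}

def label_encode(values, mapping):
    return [mapping[v] for v in values]

def one_hot_encode(values, mapping):
    """One column per category, with a single 1 marking the row's category."""
    width = len(mapping)
    return [[1 if mapping[v] == j else 0 for j in range(width)] for v in values]

species = ["Setosa", "Versicolor", "Virginica", "Setosa", "Versicolor"]
mapping = fit_label_mapping(species)
labels = label_encode(species, mapping)   # [0, 1, 2, 0, 1]
one_hot = one_hot_encode(species, mapping)

print(labels)
print(one_hot[:2])
```

Label codes impose an artificial order on the categories, while one-hot vectors avoid that at the cost of one extra column per category.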

In [None]:
from qudet.transforms.encoding import CategoricalEncoder
import numpy as np

# Create categorical data (iris species)
categories = np.array(['Setosa', 'Versicolor', 'Virginica', 'Setosa', 'Versicolor'])

# Label encoding
label_encoder = CategoricalEncoder(encoding='label')
encoded_label = label_encoder.fit_transform(categories.reshape(-1, 1))
print("Label Encoding:")
print(f"  Original: {categories}")
print(f"  Encoded: {encoded_label.flatten()}")

# One-hot encoding
onehot_encoder = CategoricalEncoder(encoding='onehot')
encoded_onehot = onehot_encoder.fit_transform(categories.reshape(-1, 1))
print(f"\nOne-Hot Encoding:")
print(f"  Shape: {encoded_onehot.shape}")
print(f"  Sample:\n{encoded_onehot[:2]}")

# Ordinal encoding
ordinal_encoder = CategoricalEncoder(encoding='ordinal')
encoded_ordinal = ordinal_encoder.fit_transform(categories.reshape(-1, 1))
print(f"\nOrdinal Encoding:")
print(f"  Encoded: {encoded_ordinal.flatten()}")

## 9. Target Encoder

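**Description**: Encodes categorical features based on target variable statistics: each category is replaced by a statistic of the target (typically its mean) over the rows in that category. Useful when a category's relationship to the target carries predictive information.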

**Use Case**: Encode iris species using target-based statistics.
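
In its plainest form, target encoding replaces each category with the mean target value observed for that category. The sketch below shows only this unsmoothed variant on toy data with hypothetical names; whether `TargetEncoder` adds smoothing or cross-fitting is not stated in this notebook.

```python
from collections import defaultdict

def fit_target_means(categories, targets):
    """Mean target value per category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

categories = ["A", "B", "A", "C", "B", "A"]
targets = [1, 0, 0, 1, 1, 1]  # toy binary target

means = fit_target_means(categories, targets)
encoded = [means[cat] for cat in categories]
print(means)
print(encoded)
```

The means should be fitted on training rows only; fitting them on the rows being evaluated leaks the target into the features.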

In [None]:
from qudet.transforms.encoding import TargetEncoder

# Create categorical feature and target
categories = pd.DataFrame({'species': ['Setosa', 'Versicolor', 'Virginica', 'Setosa', 'Versicolor']})
target = np.array([0, 1, 2, 0, 1])

# Target encoding
target_encoder = TargetEncoder()
encoded_target = target_encoder.fit_transform(categories, target)
print("Target Encoding:")
print(f"  Original categories: {categories['species'].values}")
print(f"  Target values: {target}")
print(f"  Encoded values:\n{encoded_target}")

## 10. Frequency Encoder

**Description**: Encodes categorical features by how often each category occurs in the dataset. The example below expects relative frequencies (count divided by the number of samples) rather than raw counts.

**Use Case**: Frequency-based encoding of iris species.
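
Both conventions (raw counts and relative frequencies) are easy to state with `collections.Counter`. This sketch uses a hypothetical function name and is independent of `FrequencyEncoder`; which convention the library uses is not confirmed here.

```python
from collections import Counter

def frequency_encode(values, normalize=True):
    """Replace each category by its count, or by count / total when normalize=True."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total if normalize else counts[v] for v in values]

species = ["Setosa", "Versicolor", "Virginica", "Setosa", "Versicolor", "Setosa"]
relative = frequency_encode(species)
raw = frequency_encode(species, normalize=False)

print([round(v, 3) for v in relative])
print(raw)
```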

In [None]:
from qudet.transforms.encoding import FrequencyEncoder

# Create categorical feature with different frequencies
categories = np.array(['Setosa', 'Versicolor', 'Virginica', 'Setosa', 'Versicolor', 'Setosa']).reshape(-1, 1)

# Frequency encoding
freq_encoder = FrequencyEncoder()
encoded_freq = freq_encoder.fit_transform(categories)
print("Frequency Encoding:")
print(f"  Original: {categories.flatten()}")
print(f"  Frequency encoded:\n{encoded_freq.flatten()}")
print("  (Expected for relative frequencies: Setosa 3/6 = 0.5, Versicolor 2/6 = 0.333, Virginica 1/6 = 0.167)")

## 11. Binning Encoder

**Description**: Bins continuous features into discrete categories or encodes categorical features into bins. Reduces the number of distinct values per feature (the number of features is unchanged) and creates interpretable intervals.

**Use Case**: Bin continuous iris features into discrete categories.
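
With a quantile strategy, the bin edges are chosen so that each bin receives roughly the same number of samples. A minimal sketch for one column, using hypothetical helper names and the standard library's inclusive quantile method (the library's exact quantile convention is an assumption):

```python
import bisect
import statistics

def fit_quantile_edges(col, n_bins=3):
    """Interior cut points that split the column into n_bins equal-frequency bins."""
    return statistics.quantiles(col, n=n_bins, method="inclusive")

def assign_bins(col, edges):
    """Bin index = number of cut points less than or equal to the value."""
    return [bisect.bisect_right(edges, v) for v in col]

petal_lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9]
edges = fit_quantile_edges(petal_lengths)
bins = assign_bins(petal_lengths, edges)

print([round(e, 3) for e in edges])
print(bins)
```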

In [None]:
from qudet.transforms.encoding import BinningEncoder

# Bin continuous features into discrete categories
binning_encoder = BinningEncoder(n_bins=3, strategy='quantile')
X_binned = binning_encoder.fit_transform(X_train)
print("Binning Encoder:")
print(f"  Original feature ranges:")
print(f"    Min: {X_train.min(axis=0)}")
print(f"    Max: {X_train.max(axis=0)}")
print(f"\n  Binned data (first 5 samples):")
print(f"    Original:\n{X_train[:5]}")
print(f"    Binned:\n{X_binned[:5]}")
print(f"\n  Unique bin values per feature: {[len(np.unique(X_binned[:, i])) for i in range(X_binned.shape[1])]}")

## 12. Quantum PCA

**Description**: Performs quantum-inspired Principal Component Analysis using kernel methods. Reduces dimensionality while capturing nonlinear structure through the chosen kernel.

**Use Case**: Reduce 4D iris features to 2D using quantum PCA.
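
For intuition, classical kernel PCA with an RBF kernel can be sketched in three steps: build the kernel matrix, center it, and take its leading eigenvectors. The code below extracts only the first component by power iteration. It is a classical stand-in with hypothetical names and an assumed `gamma`; how `QuantumPCA` evaluates its kernel is not covered here.

```python
import math

def rbf_kernel_matrix(X, gamma=1.0):
    """K[i][j] = exp(-gamma * ||x_i - x_j||^2)."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))
             for xj in X] for xi in X]

def center_kernel(K):
    """Center the kernel matrix, i.e. center the data in feature space."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    total_mean = sum(row_means) / n
    return [[K[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]

def first_kernel_component(Kc, iters=200):
    """Scores on the leading component, found by power iteration."""
    n = len(Kc)
    v = [float(i + 1) for i in range(n)]  # fixed, non-constant start vector
    eigenvalue = 0.0
    for _ in range(iters):
        w = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = math.sqrt(sum(x * x for x in w))
        v = [x / eigenvalue for x in w]
    # Training-point scores are sqrt(eigenvalue) times the unit eigenvector.
    return [math.sqrt(eigenvalue) * x for x in v]

X_toy = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]  # two well-separated groups
scores = first_kernel_component(center_kernel(rbf_kernel_matrix(X_toy)))
print([round(s, 3) for s in scores])
```

On this toy data the two groups end up on opposite sides of zero along the first component.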

In [None]:
from qudet.transforms.pca import QuantumPCA

# Apply quantum PCA
qpca = QuantumPCA(n_components=2, kernel='rbf')
X_qpca = qpca.fit_transform(X_train)
print("Quantum PCA Results:")
print(f"  Original shape: {X_train.shape}")
print(f"  Reduced shape: {X_qpca.shape}")
print(f"  Explained variance ratio: {getattr(qpca, 'explained_variance_ratio_', 'N/A')}")
print(f"\n  First 5 samples (original vs PCA):")
print(f"    Original:\n{X_train[:5]}")
print(f"    Quantum PCA:\n{X_qpca[:5]}")

## 13. Random Projector

**Description**: Uses Gaussian random projection for fast dimensionality reduction. Approximately preserves pairwise distances between points (in the sense of the Johnson-Lindenstrauss lemma) while reducing the number of dimensions.

**Use Case**: Project iris data to 2D using random projection.
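
A Gaussian random projection multiplies the data by a matrix whose entries are drawn independently from a normal distribution with variance `1 / n_components`. The sketch below shows this with the standard library; the function names are hypothetical and it is not `RandomProjector`'s code.

```python
import math
import random

def gaussian_projection_matrix(n_features, n_components, seed=42):
    """Entries drawn i.i.d. from N(0, 1 / n_components)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, scale) for _ in range(n_components)]
            for _ in range(n_features)]

def project(X, R):
    """Matrix product X @ R written out with plain lists."""
    return [[sum(x * R[i][j] for i, x in enumerate(row)) for j in range(len(R[0]))]
            for row in X]

X_toy = [[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4], [6.3, 3.3, 6.0, 2.5]]
R = gaussian_projection_matrix(n_features=4, n_components=2)
X_proj = project(X_toy, R)

print(len(X_proj), "rows x", len(X_proj[0]), "columns")
```

The distance-preservation guarantee is useful mainly when projecting from very high dimensions; for a 4D to 2D projection like the one in this notebook, the distortion can be large.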

In [None]:
from qudet.transforms.projections import RandomProjector

# Apply random projection
random_proj = RandomProjector(n_components=2, random_state=42)
X_rp = random_proj.fit_transform(X_train)
print("Random Projection Results:")
print(f"  Original shape: {X_train.shape}")
print(f"  Projected shape: {X_rp.shape}")
print(f"  Johnson-Lindenstrauss eps (distortion tolerance): {getattr(random_proj, 'eps', 'N/A')}")
print(f"\n  First 5 projected samples:")
print(X_rp[:5])

## 14. Streaming Hasher

**Description**: Implements the hashing trick for feature vectorization. Maps arbitrary features to fixed-width vectors using hash functions. Useful for streaming data.

**Use Case**: Hash categorical features to fixed-width vectors.
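
The hashing trick needs no fitted vocabulary: each `column=value` token is hashed, and the hash modulo the output width picks the slot to increment. A minimal sketch using `hashlib.md5`, with a hypothetical function name and token format (both assumptions, not `StreamingHasher`'s internals):

```python
import hashlib

def hash_row(row, n_features=8):
    """Map a dict of column -> category to a fixed-width count vector."""
    vec = [0] * n_features
    for column, value in row.items():
        token = f"{column}={value}".encode("utf-8")
        digest = hashlib.md5(token).digest()
        # Use the first 8 bytes of the digest as an integer bucket index.
        index = int.from_bytes(digest[:8], "big") % n_features
        vec[index] += 1
    return vec

row = {"species": "Setosa", "region": "North"}
vec = hash_row(row)
print(vec)
```

Different tokens can collide in the same slot; a larger `n_features` makes collisions less likely at the cost of wider vectors.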

In [None]:
from qudet.transforms.sketching import StreamingHasher

# Create categorical data
categories = pd.DataFrame({
    'species': ['Setosa', 'Versicolor', 'Virginica', 'Setosa'],
    'region': ['North', 'South', 'East', 'West']
})

# Hash categorical features
hasher = StreamingHasher(n_features=8, hash_func='md5')
X_hashed = hasher.fit_transform(categories)
print("Streaming Hasher Results:")
print(f"  Input shape: {categories.shape}")
print(f"  Output shape (fixed-width): {X_hashed.shape}")
print(f"  Hashed features (first 2 samples):")
print(X_hashed[:2])

## 15. Quantum Imputer

**Description**: Handles missing values using quantum similarity metrics. Imputes missing values by finding quantum-closest neighbors and interpolating their values.

**Use Case**: Impute missing iris values using quantum similarity.
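
Neighbour-based imputation can be illustrated with ordinary Euclidean distance standing in for the quantum similarity metric. The sketch below fills each missing entry with the mean of that feature over the nearest complete rows, where distance is measured on the observed features only. The function name and the use of Euclidean distance are assumptions, not `QuantumImputer`'s method.

```python
import math

def knn_impute(X, n_neighbors=2):
    """Fill NaNs with the mean of that feature over the nearest complete rows."""
    complete = [r for r in X if not any(math.isnan(v) for v in r)]
    filled = []
    for row in X:
        missing = [j for j, v in enumerate(row) if math.isnan(v)]
        if not missing:
            filled.append(list(row))
            continue
        observed = [j for j in range(len(row)) if j not in missing]

        def distance(other):
            # Compare only on features that are present in the incomplete row.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbors = sorted(complete, key=distance)[:n_neighbors]
        new_row = list(row)
        for j in missing:
            new_row[j] = sum(nb[j] for nb in neighbors) / len(neighbors)
        filled.append(new_row)
    return filled

X_toy = [[5.0, 1.4], [5.2, 1.6], [6.9, 5.0], [7.1, 5.2], [5.1, float("nan")]]
imputed = knn_impute(X_toy)
print(imputed[4])
```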

In [None]:
from qudet.transforms.imputation import QuantumImputer

# Create data with missing values
X_with_missing = X_train.copy()
X_with_missing[0, 0] = np.nan
X_with_missing[5, 2] = np.nan
X_with_missing[10, 1] = np.nan

print("Quantum Imputer Results:")
print(f"  Missing values before imputation: {np.isnan(X_with_missing).sum()}")

# Impute using quantum similarity
imputer = QuantumImputer(method='quantum_similarity', n_neighbors=5)
X_imputed = imputer.fit_transform(X_with_missing)

print(f"  Missing values after imputation: {np.isnan(X_imputed).sum()}")
print(f"\n  Imputed values:")
print(f"    Position [0, 0]: {X_imputed[0, 0]:.4f}")
print(f"    Position [5, 2]: {X_imputed[5, 2]:.4f}")
print(f"    Position [10, 1]: {X_imputed[10, 1]:.4f}")

## 16. Coreset Reducer

**Description**: Reduces dataset size using coreset algorithms. Selects representative points that preserve dataset statistics while reducing computational complexity.

**Use Case**: Reduce iris dataset to a representative coreset.
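
One simple way to pick representative points is greedy farthest-point (k-center) selection: start from one point and repeatedly add the point farthest from everything chosen so far. This is a different technique from the `method='kmeans'` option used in the cell below and is shown only to illustrate the coreset idea; the function name is hypothetical.

```python
import math

def farthest_point_coreset(X, size):
    """Return indices of `size` points chosen by greedy farthest-point selection."""
    selected = [0]  # start from the first point
    min_dist = [math.dist(x, X[0]) for x in X]
    while len(selected) < size:
        # Pick the point with the largest distance to its nearest selected point.
        nxt = max(range(len(X)), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist = [min(d, math.dist(x, X[nxt])) for d, x in zip(min_dist, X)]
    return selected

X_toy = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # group near the origin
         [5.0, 5.0], [5.1, 5.0],               # group near (5, 5)
         [10.0, 0.0], [10.0, 0.1]]             # group near (10, 0)
selected = farthest_point_coreset(X_toy, size=3)
print(sorted(selected))
```

On this toy data the three selected points come from three different groups, so the coreset covers the data's spread with far fewer rows.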

In [None]:
from qudet.transforms.coresets import CoresetReducer

# Create coreset reducer
coreset = CoresetReducer(coreset_size=30, method='kmeans')
X_coreset = coreset.fit_transform(X_train)

print("Coreset Reducer Results:")
print(f"  Original dataset size: {X_train.shape[0]}")
print(f"  Coreset size: {X_coreset.shape[0]}")
print(f"  Compression ratio: {X_train.shape[0] / X_coreset.shape[0]:.2f}x")
print(f"\n  Coreset statistics:")
print(f"    Mean original: {X_train.mean(axis=0)}")
print(f"    Mean coreset: {X_coreset.mean(axis=0)}")
print(f"    Std original: {X_train.std(axis=0)}")
print(f"    Std coreset: {X_coreset.std(axis=0)}")

## 17. Auto Reducer

**Description**: Automatically selects the best dimensionality reduction technique based on data characteristics. Compares multiple methods and chooses the best-scoring one.

**Use Case**: Automatically reduce iris dimensions using the best-fitting method.

In [None]:
from qudet.transforms.auto import AutoReducer

# Auto select best dimensionality reduction
auto_reducer = AutoReducer(n_components=2, methods=['pca', 'random_projection'])
X_auto = auto_reducer.fit_transform(X_train)

print("Auto Reducer Results:")
print(f"  Original shape: {X_train.shape}")
print(f"  Reduced shape: {X_auto.shape}")
print(f"  Selected method: {getattr(auto_reducer, 'best_method_', 'N/A')}")
print(f"  Method scores: {getattr(auto_reducer, 'scores_', 'N/A')}")
print(f"\n  First 5 auto-reduced samples:")
print(X_auto[:5])

## Summary

This notebook demonstrated the comprehensive feature transformation capabilities of QDET's Transforms module:

### Data Quality & Engineering
- **Quantum Normalizer**: Normalizes data to quantum-compatible ranges
- **Feature Scaler**: Applies various scaling methods (standard, minmax, robust)
- **Feature Selector**: Identifies and selects most important features
- **Outlier Remover**: Detects and removes statistical outliers
- **Data Balancer**: Handles class imbalance through oversampling/undersampling

### Categorical Encoding
- **Categorical Encoder**: Maps categorical → numerical (label, onehot, ordinal)
- **Target Encoder**: Encodes based on target variable statistics
- **Frequency Encoder**: Encodes by feature occurrence frequency
- **Binning Encoder**: Creates discrete intervals from continuous features

### Dimensionality Reduction
- **Quantum PCA**: Kernel-based principal component analysis
- **Random Projector**: Gaussian random projection for fast reduction
- **Coreset Reducer**: Selects representative points for efficiency
- **Auto Reducer**: Automatically chooses optimal reduction method

### Specialized Techniques
- **Streaming Hasher**: Fixed-width feature hashing for streaming data
- **Quantum Imputer**: Quantum-similarity based missing value imputation

### Key Insights
1. **Modularity**: Each transformer can be used independently or in pipelines
2. **Compatibility**: All tools work with numpy arrays and pandas DataFrames
3. **Quantum-Ready**: Features normalized for quantum circuit processing
4. **Efficiency**: Scalable methods like CoresetReducer for large datasets
5. **Flexibility**: Multiple encoding/reduction strategies for different use cases

### Next Steps
- Combine multiple transformers in pipelines for complete data preprocessing
- Use with QDET analytics tools (classifiers, encoders) for quantum ML workflows
- Experiment with different hyperparameters based on your specific data
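
As a starting point for the first item, chaining transformers only requires that each step exposes `fit_transform`, as the cells in this notebook do. The sketch below uses two hypothetical stand-in classes (`MinMaxStep`, `L2Step`) and a hypothetical `run_pipeline` helper rather than the QDET classes, so it runs without the library.

```python
import math

class MinMaxStep:
    """Hypothetical stand-in: rescales each column to [0, 1]."""
    def fit_transform(self, X):
        cols = list(zip(*X))
        # (minimum, span) per column; a zero span falls back to 1.0.
        stats = [(min(c), (max(c) - min(c)) or 1.0) for c in cols]
        return [[(v - lo) / span for v, (lo, span) in zip(row, stats)] for row in X]

class L2Step:
    """Hypothetical stand-in: scales each row to unit Euclidean norm."""
    def fit_transform(self, X):
        out = []
        for row in X:
            norm = math.sqrt(sum(v * v for v in row)) or 1.0
            out.append([v / norm for v in row])
        return out

def run_pipeline(steps, X):
    """Feed the output of each step into the next."""
    for step in steps:
        X = step.fit_transform(X)
    return X

X_toy = [[5.1, 3.5], [4.9, 3.2], [7.0, 3.0]]
X_ready = run_pipeline([MinMaxStep(), L2Step()], X_toy)
print([[round(v, 3) for v in row] for row in X_ready])
```

For train/test workflows, the steps should be fitted on training data and then applied to test data with `transform`, which this minimal helper does not separate.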