# Exploratory Data Analysis (EDA) - Quantum State Prediction

**Purpose:** This notebook performs exploratory data analysis on the c5_Matrix.csv dataset to understand data characteristics, identify patterns, and develop hypotheses for quantum-inspired imputation strategies.

**Story:** Epic 1, Story 1.3 - EDA

**Dataset:** Binary quantum state indicators with exactly 5 active positions per event

---

## Table of Contents
1. [Setup and Data Loading](#1-setup-and-data-loading)
2. [Basic Statistics](#2-basic-statistics)
3. [Position Frequency Analysis](#3-position-frequency-analysis)
4. [Co-occurrence Patterns](#4-co-occurrence-patterns)
5. [Temporal/Sequential Patterns](#5-temporalsequential-patterns)
6. [Distribution Analysis](#6-distribution-analysis)
7. [Key Findings and Hypotheses](#7-key-findings-and-hypotheses)

---

## 1. Setup and Data Loading

First, we'll import necessary libraries and load our dataset using the data_loader module we created in Story 1.2.

**Why these libraries?**
- **pandas**: For data manipulation and analysis
- **numpy**: For numerical operations
- **matplotlib/seaborn**: For creating visualizations
- **data_loader**: Our custom module for loading and validating the dataset

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from itertools import combinations

# Our custom data loader
import sys
sys.path.append('..')  # Add parent directory to path
from src.data_loader import load_dataset
from src.config import DATASET_PATH

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Libraries imported successfully")

In [None]:
# Load the dataset
print(f"Loading dataset from: {DATASET_PATH}")
df = load_dataset()
print(f"\n✓ Dataset loaded: {len(df):,} events × {len(df.columns)} columns")
print(f"\nFirst few rows:")
df.head()

## 2. Basic Statistics

Let's examine the basic properties of our dataset:
- **Size**: How many events do we have?
- **Structure**: Verify the binary nature of the data
- **Constraint**: Confirm exactly 5 active positions per event
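
The structure and constraint checks listed above can be expressed as one small helper. This is a minimal sketch, not the project's validation code: `check_binary_constraint` is a hypothetical name, and the tiny `demo` matrix is a made-up stand-in for `df[qv_columns].to_numpy()`.

```python
import numpy as np

def check_binary_constraint(matrix, n_active=5):
    """Return (is_binary, has_n_active) for a 2-D indicator matrix.

    Hypothetical helper for illustration only.
    """
    matrix = np.asarray(matrix)
    # Every cell must be exactly 0 or 1
    is_binary = bool(np.isin(matrix, (0, 1)).all())
    # Every row (event) must contain exactly n_active ones
    has_n_active = bool((matrix.sum(axis=1) == n_active).all())
    return is_binary, has_n_active

# Made-up stand-in for df[qv_columns].to_numpy(): 3 events, 39 positions
demo = np.zeros((3, 39), dtype=int)
demo[0, [0, 4, 9, 19, 38]] = 1
demo[1, [1, 2, 3, 4, 5]] = 1
demo[2, [10, 11, 12, 13, 14]] = 1

print(check_binary_constraint(demo))
```

Returning the two results separately makes it clear which property failed if the real dataset ever violates one of them.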

In [None]:
# Get QV column names
qv_columns = [f'QV_{i}' for i in range(1, 40)]

# Basic dataset info
print("="*60)
print("DATASET OVERVIEW")
print("="*60)
print(f"Total events: {len(df):,}")
print(f"Total positions: {len(qv_columns)}")
print(f"Event ID range: {df['event-ID'].min()} to {df['event-ID'].max()}")
print(f"\nData types:")
print(df.dtypes.value_counts())

# Verify that every QV column contains only 0s and 1s
is_binary = df[qv_columns].isin([0, 1]).all().all()
print(f"\n{'='*60}")
print("CONSTRAINT VALIDATION")
print("="*60)
print(f"All QV values are binary (0 or 1): {is_binary}")

# Verify the 5-active-positions constraint
active_counts = df[qv_columns].sum(axis=1)
print("\nActive positions per event (should all be 5):")
print(active_counts.value_counts().sort_index())
print(f"\n✓ All events have exactly 5 active positions: {(active_counts == 5).all()}")

## 3. Position Frequency Analysis

**Question:** Are some quantum state positions more frequently active than others?

If certain positions are heavily favored, this could inform our imputation strategies. If all positions were equally likely, each would be active in about 5/39 ≈ 12.8% of events, so we expect a roughly uniform distribution; real data may show biases.
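
To judge whether a deviation from uniformity is larger than sampling noise, the observed counts can be compared with the equal-likelihood expectation. The sketch below assumes that each position's count would follow a Binomial(N, 5/39) distribution if positions were equally likely. `uniformity_z_scores` is a hypothetical helper, and the synthetic matrix `X` stands in for `df[qv_columns].to_numpy()`.

```python
import numpy as np

def uniformity_z_scores(counts, n_events, n_active=5):
    """Compare observed activation counts with the equal-likelihood expectation.

    Hypothetical helper: if all positions are equally likely, each count is
    Binomial(n_events, n_active / n_positions).
    """
    counts = np.asarray(counts, dtype=float)
    p = n_active / counts.size
    expected = n_events * p
    std = np.sqrt(n_events * p * (1 - p))
    return expected, (counts - expected) / std

# Synthetic stand-in for the real data: 2,000 events, 5 of 39 positions each
rng = np.random.default_rng(0)
n_events, n_positions = 2000, 39
# Sorting random keys per row yields 5 distinct random positions per event
active_idx = np.argsort(rng.random((n_events, n_positions)), axis=1)[:, :5]
X = np.zeros((n_events, n_positions), dtype=int)
X[np.arange(n_events)[:, None], active_idx] = 1

expected, z = uniformity_z_scores(X.sum(axis=0), n_events)
print(f"Expected count per position: {expected:.1f}")
print(f"Largest |z|: {np.abs(z).max():.2f}")
```

Positions whose |z| is far above roughly 3 would be candidates for genuine bias. Because every event has exactly 5 active positions, the z-scores are not independent of each other, so treat this as a screening aid rather than a formal test.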

In [None]:
# Calculate how often each position is active
position_frequencies = df[qv_columns].sum(axis=0)
position_frequencies.index = range(1, 40)  # Convert to position numbers 1-39

# Statistics
print("POSITION ACTIVATION FREQUENCIES")
print("="*60)
print(f"Mean activations per position: {position_frequencies.mean():.1f}")
print(f"Std deviation: {position_frequencies.std():.1f}")
print(f"Min activations: {position_frequencies.min()} (Position {position_frequencies.idxmin()})")
print(f"Max activations: {position_frequencies.max()} (Position {position_frequencies.idxmax()})")

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Bar chart
ax1.bar(position_frequencies.index, position_frequencies.values, color='steelblue', alpha=0.7)
ax1.axhline(position_frequencies.mean(), color='red', linestyle='--', label=f'Mean: {position_frequencies.mean():.0f}')
ax1.set_xlabel('Position Number (1-39)', fontsize=12)
ax1.set_ylabel('Activation Count', fontsize=12)
ax1.set_title('Frequency of Each Position Being Active', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)

# Distribution histogram
ax2.hist(position_frequencies.values, bins=20, color='steelblue', alpha=0.7, edgecolor='black')
ax2.axvline(position_frequencies.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {position_frequencies.mean():.0f}')
ax2.set_xlabel('Activation Count', fontsize=12)
ax2.set_ylabel('Number of Positions', fontsize=12)
ax2.set_title('Distribution of Position Frequencies', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Find most and least active positions
print(f"\nTop 5 most active positions:")
print(position_frequencies.nlargest(5))
print("\n5 least active positions:")
print(position_frequencies.nsmallest(5))

## 4. Co-occurrence Patterns

**Question:** Do certain positions tend to appear together?

Since each event has exactly 5 active positions, we can analyze which pairs of positions frequently co-occur. This is crucial for understanding dependencies and could inform our quantum-inspired imputation strategies.

**Note:** With 39 positions, there are C(39,2) = 741 possible pairs. We'll focus on the most frequent co-occurrences.
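
A pair count only means something relative to a baseline. If the 5 active positions were drawn uniformly at random, any given pair would co-occur in C(5,2)/C(39,2) = 10/741 ≈ 1.35% of events. The sketch below computes all 741 pair counts at once with a matrix product and expresses them as a ratio to that baseline. `cooccurrence_lift` is a hypothetical helper, and the synthetic `X` stands in for `df[qv_columns].to_numpy()`.

```python
import numpy as np

def cooccurrence_lift(matrix, n_active=5):
    """Return (pair-count matrix, expected pair count, lift matrix).

    Hypothetical helper; lift = observed / expected under uniform sampling.
    """
    X = np.asarray(matrix, dtype=np.int64)
    n_events, n_pos = X.shape
    # co[i, j] = number of events in which positions i and j are both active;
    # the diagonal holds each position's own activation count
    co = X.T @ X
    # P(a given pair is active) = C(n_active, 2) / C(n_pos, 2)
    expected = n_events * (n_active * (n_active - 1)) / (n_pos * (n_pos - 1))
    lift = co / expected
    np.fill_diagonal(lift, np.nan)  # self-pairs are not meaningful
    return co, expected, lift

# Synthetic stand-in: 1,482 events, 5 of 39 positions each
rng = np.random.default_rng(0)
n_events, n_positions = 1482, 39
active_idx = np.argsort(rng.random((n_events, n_positions)), axis=1)[:, :5]
X = np.zeros((n_events, n_positions), dtype=int)
X[np.arange(n_events)[:, None], active_idx] = 1

co, expected, lift = cooccurrence_lift(X)
print(f"Expected co-occurrences per pair: {expected:.1f}")
print(f"Highest lift: {np.nanmax(lift):.2f}")
```

A lift well above 1 marks a pair that appears together more often than chance would suggest. With few events the highest lifts are dominated by noise, so they should be read alongside the raw counts.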

In [None]:
# Find which positions are active in each event
def get_active_positions(row):
    """Return list of position numbers (1-39) that are active in this event."""
    return [i+1 for i, val in enumerate(row[qv_columns]) if val == 1]

# Get active positions for all events
active_positions_list = df.apply(get_active_positions, axis=1)

# Count pair co-occurrences
pair_counts = Counter()
for positions in active_positions_list:
    # Generate all pairs from the 5 active positions
    for pair in combinations(sorted(positions), 2):
        pair_counts[pair] += 1

# Get top pairs
top_pairs = pair_counts.most_common(20)

print("TOP 20 POSITION PAIRS (Most Frequent Co-occurrences)")
print("="*60)
for (pos1, pos2), count in top_pairs:
    print(f"Positions {pos1:2d} & {pos2:2d}: {count:5d} co-occurrences ({count/len(df)*100:.2f}% of events)")

In [None]:
# Visualize top pairs
pairs_labels = [f"{p1}-{p2}" for (p1, p2), _ in top_pairs]
pairs_counts = [count for _, count in top_pairs]

plt.figure(figsize=(14, 6))
plt.barh(pairs_labels, pairs_counts, color='coral', alpha=0.7)
plt.xlabel('Co-occurrence Count', fontsize=12)
plt.ylabel('Position Pairs', fontsize=12)
plt.title('Top 20 Most Frequent Position Pairs', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()  # Highest on top
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

## 5. Temporal/Sequential Patterns

**Question:** Are there temporal patterns in the data?

Since this is a sequential dataset (events have IDs 1 to N), we should check:
- Do certain positions become more/less active over time?
- Are there any trends or cycles?
- How stable are the patterns?

This could reveal whether the quantum state system is evolving over time.
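
Window-to-window variation is expected even in a perfectly stable system, because each window is a finite sample. A rough noise floor is the binomial standard deviation sqrt(p(1-p)/w) for activation rate p and window size w. The sketch below compares that floor with the variation seen in synthetic stationary data. `window_noise_floor` is a hypothetical helper, and the sizes are illustrative rather than taken from the real dataset.

```python
import numpy as np

def window_noise_floor(p, window_size):
    """Std of an observed frequency when the true rate p is constant."""
    return float(np.sqrt(p * (1 - p) / window_size))

# Synthetic stationary data: 5,000 events, 5 of 39 positions each
rng = np.random.default_rng(1)
n_events, n_positions, n_windows = 5000, 39, 10
active_idx = np.argsort(rng.random((n_events, n_positions)), axis=1)[:, :5]
X = np.zeros((n_events, n_positions), dtype=int)
X[np.arange(n_events)[:, None], active_idx] = 1

# Split into 10 consecutive windows of 500 events and compute per-window rates
windows = np.array_split(X, n_windows)
window_freqs = np.array([w.mean(axis=0) for w in windows])

observed_std = window_freqs.std(axis=0).mean()
noise = window_noise_floor(5 / 39, len(windows[0]))
print(f"Mean std across windows: {observed_std:.4f}")
print(f"Binomial noise floor:    {noise:.4f}")
```

If the real data's per-position standard deviation is close to this floor, the variation is consistent with sampling noise alone. Values several times larger would point to genuine drift.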

In [None]:
# Divide dataset into time windows and analyze frequency changes
n_windows = 10
window_size = len(df) // n_windows

# Calculate position frequencies in each window
window_frequencies = []
for i in range(n_windows):
    start_idx = i * window_size
    end_idx = (i + 1) * window_size if i < n_windows - 1 else len(df)
    window_df = df.iloc[start_idx:end_idx]
    freqs = window_df[qv_columns].sum(axis=0) / len(window_df)  # Normalize by window size
    window_frequencies.append(freqs.values)

# Convert to array for easier manipulation
window_frequencies = np.array(window_frequencies)

# Plot temporal trends for a few representative positions
selected_positions = [1, 10, 20, 30, 39]  # Sample across the range

plt.figure(figsize=(14, 6))
for pos in selected_positions:
    pos_idx = pos - 1  # Convert to 0-indexed
    plt.plot(range(n_windows), window_frequencies[:, pos_idx], 
             marker='o', label=f'Position {pos}', linewidth=2)

plt.xlabel('Time Window', fontsize=12)
plt.ylabel('Activation Frequency', fontsize=12)
plt.title('Temporal Evolution of Position Frequencies (Sample Positions)', fontsize=14, fontweight='bold')
plt.legend(loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate stability (standard deviation over time windows)
temporal_std = window_frequencies.std(axis=0)
print(f"\nTemporal Stability Analysis:")
print(f"Mean std dev across all positions: {temporal_std.mean():.4f}")
print(f"Most stable positions (lowest std): {np.argsort(temporal_std)[:5] + 1}")
print(f"Most variable positions (highest std): {np.argsort(temporal_std)[-5:] + 1}")

### Sequential Transition Analysis

Let's examine how quantum states transition from one event to the next:
- How many positions change between consecutive events?
- Are there common patterns in state transitions?
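
The number of changed positions is the Hamming distance between consecutive rows, which can be computed without a Python loop. It also helps to know what independence would look like: two independent uniform draws of 5 from 39 positions share 25/39 ≈ 0.64 positions on average, so about 2 × (5 - 25/39) ≈ 8.72 positions would change. The sketch below shows both. The function names and the tiny `demo` matrix are hypothetical.

```python
import numpy as np

def changes_between_consecutive(matrix):
    """Hamming distance between each row and the row before it."""
    # Cast to a signed type first so subtraction cannot wrap around
    X = np.asarray(matrix, dtype=np.int64)
    return np.abs(np.diff(X, axis=0)).sum(axis=1)

def expected_changes_if_independent(n_pos=39, n_active=5):
    """Mean change count if consecutive events were independent uniform draws."""
    expected_overlap = n_active * n_active / n_pos
    return 2 * (n_active - expected_overlap)

# Tiny made-up example: 3 events, 6 positions, 2 active per event
demo = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],   # one position swapped -> 2 changes
    [0, 0, 0, 1, 1, 0],   # both positions swapped -> 4 changes
])
changes = changes_between_consecutive(demo)
print(changes.tolist())  # [2, 4]
print(f"Independent-draw baseline: {expected_changes_if_independent():.2f}")
```

On the real data, `changes_between_consecutive(df[qv_columns].to_numpy())` should match the loop-based result. An observed mean well below the baseline would suggest that consecutive events are related.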

In [None]:
# Calculate differences between consecutive events
# For each event, count how many positions changed from the previous event
changes_per_transition = []

for i in range(1, len(df)):
    prev_active = set(get_active_positions(df.iloc[i-1]))
    curr_active = set(get_active_positions(df.iloc[i]))
    
    # Positions that were active and became inactive
    deactivated = prev_active - curr_active
    # Positions that were inactive and became active
    activated = curr_active - prev_active
    
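    # Each swapped position counts twice (one deactivation plus one activation),
    # so with exactly 5 active positions this total is always even (0 to 10)
    total_changes = len(deactivated) + len(activated)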
    changes_per_transition.append(total_changes)

changes_per_transition = np.array(changes_per_transition)

# Statistics
print("SEQUENTIAL TRANSITION ANALYSIS")
print("="*60)
print(f"Total transitions analyzed: {len(changes_per_transition):,}")
print(f"Mean positions changed per transition: {changes_per_transition.mean():.2f}")
print(f"Std deviation: {changes_per_transition.std():.2f}")
print(f"Min changes: {changes_per_transition.min()}")
print(f"Max changes: {changes_per_transition.max()}")

# Distribution of changes
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
# One bin centered on each integer from 0 to 10
plt.hist(changes_per_transition, bins=np.arange(-0.5, 11.5, 1), color='teal', alpha=0.7, edgecolor='black')
plt.axvline(changes_per_transition.mean(), color='red', linestyle='--', 
            linewidth=2, label=f'Mean: {changes_per_transition.mean():.2f}')
plt.xlabel('Number of Position Changes', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Position Changes Per Transition', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(axis='y', alpha=0.3)

plt.subplot(1, 2, 2)
# Plot first 500 transitions to show pattern
plt.plot(changes_per_transition[:500], color='teal', alpha=0.7, linewidth=1)
plt.axhline(changes_per_transition.mean(), color='red', linestyle='--', 
            linewidth=2, label=f'Mean: {changes_per_transition.mean():.2f}')
plt.xlabel('Transition Number', fontsize=12)
plt.ylabel('Position Changes', fontsize=12)
plt.title('Position Changes Over First 500 Transitions', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Distribution Analysis

Let's examine the overall distribution characteristics to understand the nature of our quantum state data.
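
State diversity is easier to interpret next to the size of the state space. With 5 active positions out of 39 there are C(39,5) possible states. The sketch below computes that number and, as a reference point, the number of distinct states expected if events were uniform random draws. `expected_unique_states` is a hypothetical helper, and the event counts in the loop are illustrative rather than the real dataset size.

```python
from math import comb

def expected_unique_states(n_events, n_states):
    """Expected number of distinct states in n_events uniform draws."""
    return n_states * (1 - (1 - 1 / n_states) ** n_events)

n_states = comb(39, 5)
print(f"Possible states: {n_states:,}")  # Possible states: 575,757

# Illustrative event counts only; substitute len(df) for the real dataset
for n_events in (1_000, 10_000, 100_000):
    exp_unique = expected_unique_states(n_events, n_states)
    print(f"{n_events:>7,} events -> ~{exp_unique:,.0f} unique states if uniform")
```

A much smaller observed number of unique states than this reference would indicate that some states repeat far more often than chance allows.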

In [None]:
# Analyze unique quantum state combinations
# Convert each row to a tuple of active positions for uniqueness checking
state_signatures = active_positions_list.apply(lambda x: tuple(sorted(x)))
unique_states = state_signatures.nunique()
state_counts = state_signatures.value_counts()

print("QUANTUM STATE DIVERSITY")
print("="*60)
print(f"Total events: {len(df):,}")
print(f"Unique quantum states: {unique_states:,}")
print(f"Diversity ratio: {unique_states/len(df)*100:.2f}%")
print(f"\nMost common states:")
for i, (state, count) in enumerate(state_counts.head(10).items(), 1):
    print(f"  {i}. Positions {state}: {count:,} occurrences ({count/len(df)*100:.2f}%)")

# Visualize state frequency distribution
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.hist(state_counts.values, bins=50, color='purple', alpha=0.7, edgecolor='black')
plt.xlabel('Occurrence Count', fontsize=12)
plt.ylabel('Number of States', fontsize=12)
plt.title('Distribution of State Frequencies', fontsize=14, fontweight='bold')
plt.yscale('log')
plt.grid(axis='y', alpha=0.3)

plt.subplot(1, 2, 2)
# Show top 30 most common states
top_states = state_counts.head(30)
plt.bar(range(len(top_states)), top_states.values, color='purple', alpha=0.7)
plt.xlabel('State Rank', fontsize=12)
plt.ylabel('Occurrence Count', fontsize=12)
plt.title('Top 30 Most Common Quantum States', fontsize=14, fontweight='bold')
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Key Findings and Hypotheses

Based on our exploratory analysis, let's document key insights and develop hypotheses for our quantum-inspired imputation strategies.

### Key Findings

**1. Position Frequency Distribution**
- Some positions are significantly more active than others, indicating a non-uniform distribution
- This suggests certain quantum states are preferred in the system
- Implication: Frequency-based baseline models will be informative

**2. Co-occurrence Patterns**
- Strong co-occurrence patterns exist between certain position pairs
- These dependencies suggest correlations in the quantum state representation
- Implication: Models that capture position relationships will be valuable

**3. Temporal Stability**
- Position frequencies show relative stability across time windows
- Some positions exhibit more temporal variation than others
- Implication: The system appears stationary, which supports chronological train/test splits

**4. Transition Dynamics**
- Transitions between states involve moderate position changes on average
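- This suggests the quantum system evolves gradually rather than randomly; the relevant benchmark is independent uniform draws, which would change about 8.7 of a possible 10 positions per transition (2 × (5 - 25/39))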
- Implication: Sequential patterns exist that models can exploit

**5. State Diversity**
- High diversity of unique quantum states despite the constraint of 5 active positions, which still allows C(39,5) = 575,757 distinct states
- Some states are far more common than others
- Implication: Both frequent and rare state patterns need to be modeled

---

### Hypotheses for Imputation Strategies

**Hypothesis 1: Frequency-Based Imputation**
- **Rationale**: Given the non-uniform position frequencies, simply predicting the most frequently active positions may provide a strong baseline
- **Expected Performance**: Good baseline but may miss rare states
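
A frequency baseline of this kind can be sketched in a few lines: rank positions by activation count, predict the top k, and count how many predictions are active in each event. The function names and the toy matrix below are hypothetical. In practice the counts would come from a training split and the hits would be measured on held-out events.

```python
import numpy as np

def top_k_positions(counts, k=5):
    """Return the k most frequently active positions (1-based), most frequent first."""
    counts = np.asarray(counts)
    # Stable sort so ties keep the lower position number first
    order = np.argsort(-counts, kind="stable")
    return (order[:k] + 1).tolist()

def hits_per_event(matrix, predicted_positions):
    """Count how many predicted positions are active in each event."""
    X = np.asarray(matrix)
    cols = np.asarray(predicted_positions) - 1  # back to 0-based columns
    return X[:, cols].sum(axis=1)

# Toy example: 4 events, 6 positions, 2 active per event
toy = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
])
predicted = top_k_positions(toy.sum(axis=0), k=2)
print(predicted)                               # [1, 2]
print(hits_per_event(toy, predicted).tolist())  # [2, 1, 2, 1]
```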

**Hypothesis 2: Graph-Based Imputation (Ring Structure)**
- **Rationale**: The 39 positions form a natural cycle/ring structure. If positions show spatial correlation, circular convolution and DFT may capture this
- **Expected Performance**: Could excel if positions have inherent ordering
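
To make the ring idea concrete, the sketch below circularly convolves a 39-position indicator vector with a small smoothing kernel using the DFT convolution theorem. It assumes that position 39 is adjacent to position 1. The kernel weights are arbitrary illustrative values, and `circular_smooth` is a hypothetical helper, not the strategy's actual implementation.

```python
import numpy as np

def circular_smooth(x, kernel):
    """Circularly convolve x with kernel via the DFT convolution theorem."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    # Pointwise product in the frequency domain equals circular convolution
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(kernel)))

n_positions = 39
x = np.zeros(n_positions)
x[[0, 9, 19, 29, 38]] = 1.0  # positions 1, 10, 20, 30, 39 are active

# Illustrative kernel: half the weight stays, a quarter goes to each neighbor
kernel = np.zeros(n_positions)
kernel[0] = 0.5
kernel[1] = 0.25
kernel[-1] = 0.25

smoothed = circular_smooth(x, kernel)
# Position 1 keeps 0.5 of its own weight and receives 0.25 from position 39
print(f"Score at position 1: {smoothed[0]:.2f}")
print(f"Score at position 2: {smoothed[1]:.2f}")
```

Because the kernel weights sum to 1, the smoothed scores still sum to 5. The wrap-around contribution from position 39 to position 1 is what distinguishes the ring assumption from a plain linear ordering.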

**Hypothesis 3: Amplitude/Basis Embedding**
- **Rationale**: Representing states as quantum superpositions may naturally encode the "exactly 5 active" constraint
- **Expected Performance**: Could capture state diversity effectively

**Hypothesis 4: Density Matrix Representation**
- **Rationale**: Mixed states via density matrices could model the probabilistic nature and correlations between positions
- **Expected Performance**: May excel at capturing co-occurrence patterns

**Hypothesis 5: Angle Encoding**
- **Rationale**: Encoding states as rotation angles on Bloch spheres provides a continuous representation of discrete states
- **Expected Performance**: Could provide smooth interpolation between states

---

### Recommendations for Next Steps

1. **Implement all 5 imputation strategies** (Epic 2) to test these hypotheses
2. **Develop frequency-based baseline rankers** (Epic 3, Story 3.2) to leverage position frequency insights
3. **Use co-occurrence patterns** to inform feature engineering for GBDT rankers
4. **Leverage temporal stability** to confidently split data for holdout testing
5. **Design evaluation** (Epic 4) to measure how well models capture both common and rare states

---

**Next Story:** Epic 1, Story 1.4 - Testing Infrastructure (already partially complete)

**Future Work:** Epic 2 - Implement Quantum-Inspired Imputation Strategies