# Statistical Correlation Features

gwexpy provides methods for computing statistical correlations between `TimeSeries` objects, which is useful for noise hunting and for investigating nonlinear coupling.

Supported methods:
- **Pearson (PCC)**: Linear correlation.
- **Kendall (Ktau)**: Rank correlation (robust to outliers, non-parametric).
- **MIC**: Maximal Information Coefficient (detects nonlinear as well as linear relationships; requires `minepy`).
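
To make the difference between the first two measures concrete, here is a minimal NumPy-only sketch. It is not gwexpy's implementation: the helpers `pearson` and `kendall_tau_a` are hypothetical names, and the Kendall variant shown is the simple tau-a without tie correction, which may differ from what `ktau` uses internally.

```python
import numpy as np


def pearson(x, y):
    """Pearson r: covariance normalised by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs. O(n^2), no tie correction."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Sign of every pairwise difference; the product is +1 for a
    # concordant pair and -1 for a discordant pair.
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    # Each unordered pair is counted twice, so divide by n * (n - 1).
    return float(np.sum(sx * sy) / (n * (n - 1)))


x = np.arange(1.0, 11.0)
y_monotonic = x**3        # nonlinear but strictly increasing
y_outlier = x.copy()
y_outlier[-1] = 1000.0    # one extreme value; the rank order is unchanged

print(f"monotonic: pearson={pearson(x, y_monotonic):.3f}, tau={kendall_tau_a(x, y_monotonic):.3f}")
print(f"outlier:   pearson={pearson(x, y_outlier):.3f}, tau={kendall_tau_a(x, y_outlier):.3f}")
```

Because Kendall's tau depends only on the ordering of the samples, it stays at 1 in both cases, while Pearson drops once the relationship is no longer a straight line.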

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from gwexpy.plot import PairPlot, Plot
from gwexpy.timeseries import TimeSeries, TimeSeriesMatrix

## Pairwise Correlation

In [None]:
# Create dummy data
t = np.linspace(0, 10, 1000)

# Linear relationship
ts_a = TimeSeries(t, dt=0.01, name="A")
ts_b = TimeSeries(t * 2 + np.random.normal(0, 1, 1000), dt=0.01, name="B_Linear")

# Nonlinear relationship (sine wave)
ts_c = TimeSeries(np.sin(t) * 10, dt=0.01, name="C_Sine")

# Random noise
ts_d = TimeSeries(np.random.normal(0, 1, 1000), dt=0.01, name="D_Noise")

Plot(ts_a, ts_b, ts_c, ts_d, separate=True, sharex=True);

In [None]:
# Visualization
pair = PairPlot([ts_a, ts_b, ts_c, ts_d], corner=True)
pair.show()

In [None]:
print("Correlation A vs B (linear):")
print(f"  Pearson: {ts_a.pcc(ts_b):.3f}")
print(f"  Kendall: {ts_a.ktau(ts_b):.3f}")
try:
    print(f"  MIC:     {ts_a.mic(ts_b):.3f}")
except ImportError:
    print("  MIC:     (minepy not available)")

Correlation A vs B (linear):
  Pearson: 0.986
  Kendall: 0.893
  MIC:     (minepy not available)


In [None]:
print("Correlation A vs C (nonlinear sine wave):")
print(f"  Pearson: {ts_a.pcc(ts_c):.3f} (linear correlation cannot capture structure)")
print(f"  Kendall: {ts_a.ktau(ts_c):.3f}")
try:
    print(f"  MIC:     {ts_a.mic(ts_c):.3f} (captures nonlinear dependency)")
except ImportError:
    print("  MIC:     (minepy not available)")

Correlation A vs C (nonlinear sine wave):
  Pearson: -0.071 (linear correlation cannot capture structure)
  Kendall: -0.053
  MIC:     (minepy not available)


## Correlation Vector (Noise Hunting)

When investigating noise sources, we often want to check correlations between a target channel (e.g., DARM) and hundreds of auxiliary channels.
`TimeSeriesMatrix.correlation_vector` computes the correlation of every channel in the matrix with the target and returns the channels as a ranked table.

In [None]:
# Create a Matrix with many auxiliary channels
n_channels = 20
data = np.random.randn(n_channels, 1, 1000)
names = [f"AUX-{i:02d}" for i in range(n_channels)]

# Inject signals into AUX-05 and AUX-12
target_signal = np.sin(np.linspace(0, 20, 1000))
data[5, 0, :] += target_signal * 5  # Strong coupling
data[12, 0, :] += target_signal**2 * 5  # Nonlinear coupling

matrix = TimeSeriesMatrix(data, dt=0.01, channel_names=names)

# Target channel
target = TimeSeries(
    target_signal + np.random.normal(0, 0.1, 1000), dt=0.01, name="TARGET"
)

In [None]:
# Compute correlation vector
# Use 'mic' to capture both linear and nonlinear (slower but powerful)
# Use 'pearson' for speed

try:
    print("Computing MIC vector (Top 5)...")
    df_mic = matrix.correlation_vector(target, method="mic", nproc=2)
    print(df_mic.head(5))
except ImportError:
    print("Skipping MIC example because minepy is not installed.")

Computing MIC vector (Top 5)...

   row  col channel  score
0    0    0  AUX-00    NaN
1    1    0  AUX-01    NaN
2    2    0  AUX-02    NaN
3    3    0  AUX-03    NaN
4    4    0  AUX-04    NaN


In [None]:
print("Computing Pearson vector (Top 5)...")
df_pcc = matrix.correlation_vector(target, method="pearson", nproc=1)
print(df_pcc.head(5))

Computing Pearson vector (Top 5)...
   row  col channel     score
0    5    0  AUX-05  0.953698
1   10    0  AUX-10  0.082508
2    6    0  AUX-06 -0.064250
3    8    0  AUX-08  0.056624
4   18    0  AUX-18  0.056503


In [None]:
# Visualize ranking (Top 10)
df_plot = df_pcc.head(10).iloc[::-1]  # Reverse so the top-ranked channel is drawn at the top of the bar chart

plt.figure(figsize=(10, 6))
plt.barh(df_plot["channel"], np.abs(df_plot["score"]), color="skyblue")
plt.xlabel("Absolute Correlation Score (Pearson)")
plt.title("Top 10 Correlated Channels")
plt.grid(axis="x", linestyle="--", alpha=0.7)
plt.show()
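
The idea behind a correlation vector can be sketched in plain NumPy. This is an illustration only, not the gwexpy implementation: `pearson_ranking` is a hypothetical helper, and sorting by absolute score is an assumption based on the ordering seen in the Pearson output of this notebook.

```python
import numpy as np


def pearson_ranking(aux, target, names):
    """Rank the rows of `aux` (n_channels, n_samples) by |Pearson r| against `target`."""
    aux = np.asarray(aux, dtype=float)
    target = np.asarray(target, dtype=float)
    # Remove the mean of every channel and of the target.
    aux_c = aux - aux.mean(axis=1, keepdims=True)
    tgt_c = target - target.mean()
    # One matrix-vector product gives all the covariances at once.
    scores = aux_c @ tgt_c / (np.linalg.norm(aux_c, axis=1) * np.linalg.norm(tgt_c))
    # Assumed ordering: strongest absolute correlation first.
    order = np.argsort(-np.abs(scores))
    return [(names[i], float(scores[i])) for i in order]


rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 20, 1000))
aux = rng.standard_normal((4, 1000))
aux[2] += 5 * signal  # inject the signal into one channel

ranking = pearson_ranking(aux, signal, ["AUX-00", "AUX-01", "AUX-02", "AUX-03"])
for name, score in ranking:
    print(f"{name}: {score:+.3f}")
```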

# Advanced Statistics

Beyond correlation, gwexpy provides statistics for examining the shape of a data distribution and for testing nonlinear dependency and causality between time series.

- **Skewness**: Asymmetry of the distribution.
- **Kurtosis**: Heaviness of distribution tails (presence of outliers).
- **Distance Correlation (dCor)**: Measure of nonlinear dependency.
- **Granger Causality**: Causality between time series (contribution to prediction).

## Detecting Non-Gaussian Noise (Skewness / Kurtosis)

In [None]:
# Generate Gaussian noise and non-Gaussian noise (exponential distribution)
np.random.seed(42)
gauss_noise = TimeSeries(np.random.normal(0, 1, 1000), dt=0.01, name="Gaussian")
exp_noise = TimeSeries(
    np.random.exponential(1, 1000) - 1, dt=0.01, name="Non-Gaussian"
)  # Centered

# Plot
plot = exp_noise.plot(label="Non-Gaussian")
plot.gca().plot(gauss_noise, label="Gaussian", alpha=0.7)
plot.gca().legend()
plot.show()

# Calculate statistics
print(
    f"Gaussian:     Skewness={gauss_noise.skewness():.3f}, Kurtosis={gauss_noise.kurtosis():.3f}"
)
print(
    f"Non-Gaussian: Skewness={exp_noise.skewness():.3f}, Kurtosis={exp_noise.kurtosis():.3f}"
)
print("Note: Gaussian distributions have Skewness~0, Kurtosis~0 (Fisher definition).")
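
For reference, both statistics are standardized moments and can be written directly in NumPy. The sketch below uses the biased (population) estimators and the Fisher definition of kurtosis; whether gwexpy applies a bias correction is not stated here, so small numerical differences from `skewness()` and `kurtosis()` are possible. The helper names are hypothetical.

```python
import numpy as np


def _standardize(x):
    """Subtract the mean and divide by the (population) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def skewness(x):
    """Third standardized moment: 0 for any symmetric distribution."""
    return float(np.mean(_standardize(x) ** 3))


def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (Fisher definition): 0 for a Gaussian."""
    return float(np.mean(_standardize(x) ** 4) - 3.0)


rng = np.random.default_rng(42)
gauss = rng.normal(0, 1, 100_000)
expo = rng.exponential(1, 100_000)  # theoretical skewness 2, excess kurtosis 6

print(f"Gaussian:    skew={skewness(gauss):.3f}, kurt={excess_kurtosis(gauss):.3f}")
print(f"Exponential: skew={skewness(expo):.3f}, kurt={excess_kurtosis(expo):.3f}")
```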

## Detecting Nonlinear Dependency (Distance Correlation)

Let's examine the relationship between the sine wave data (`ts_c`) and the linear ramp (`ts_a`) using dCor. Distance correlation is zero only when the two series are statistically independent, so it can detect dependencies that Pearson correlation misses.

In [None]:
try:
    dcor_val = ts_a.distance_correlation(ts_c)
    print(f"Distance Correlation (A vs C): {dcor_val:.3f}")
    print(f"Pearson Correlation  (A vs C): {ts_a.pcc(ts_c):.3f}")
except ImportError:
    print("dcor package is not installed. Install it with: pip install dcor")

Distance Correlation (A vs C): 0.381
Pearson Correlation  (A vs C): -0.071
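
The definition of distance correlation is compact enough to sketch in NumPy: build the pairwise distance matrix of each series, double-center it, and normalize the resulting distance covariance. This is the biased sample estimator written for illustration, with a hypothetical function name; it is not the `dcor` package's implementation, and it needs O(n^2) memory.

```python
import numpy as np


def distance_correlation(x, y):
    """Biased sample distance correlation of two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute distances within each series.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double centering: subtract row and column means, add back the grand mean.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = np.mean(A * B)  # squared distance covariance
    denom = np.sqrt(np.mean(A * A) * np.mean(B * B))
    if denom == 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


x = np.linspace(-1, 1, 201)
y = x**2  # symmetric parabola: fully dependent on x, yet linearly uncorrelated

print(f"Pearson: {np.corrcoef(x, y)[0, 1]:.3f}")
print(f"dCor:    {distance_correlation(x, y):.3f}")
```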


## Estimating Causality (Granger Causality)

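The Granger causality test checks whether past values of one time series help predict future values of another. A small p-value means that including the candidate cause's history significantly improves the prediction; it indicates predictive value, not a physical causal mechanism.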

In [None]:
# Generate data with causal relationship (X -> Y)
np.random.seed(0)
n = 200
x_val = np.random.randn(n)
y_val = np.zeros(n)
# Y depends on the value of X one step before
for i in range(1, n):
    y_val[i] = 0.5 * y_val[i - 1] + 0.8 * x_val[i - 1] + 0.1 * np.random.randn()

ts_x = TimeSeries(x_val, dt=1, name="Cause (X)")
ts_y = TimeSeries(y_val, dt=1, name="Effect (Y)")

try:
    # Does X cause Y? (Does X help predict Y?) -> p-value should be small
    p_xy = ts_y.granger_causality(ts_x, maxlag=5)

    # Does Y cause X? -> p-value should be large
    p_yx = ts_x.granger_causality(ts_y, maxlag=5)

    print(
        f"Granger Causality X -> Y (p-value): {p_xy:.4f} {'(Significant)' if p_xy < 0.05 else ''}"
    )
    print(
        f"Granger Causality Y -> X (p-value): {p_yx:.4f} {'(Significant)' if p_yx < 0.05 else ''}"
    )
except ImportError:
    print(
        "statsmodels package is not installed. Install it with: pip install statsmodels"
    )



Granger Causality X -> Y (p-value): 0.0000 (Significant)
Granger Causality Y -> X (p-value): 0.0907 
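
The test can be sketched as a comparison of two least-squares fits: one predicting the effect from its own past only, and one that also includes the past of the candidate cause. The NumPy-only sketch below returns the F statistic of that comparison; the p-value would follow from the F distribution with `(maxlag, dof)` degrees of freedom. The function names are hypothetical, and this is not the statsmodels routine that the import error message above refers to.

```python
import numpy as np


def lagged(series, maxlag):
    """Columns series[t-1], ..., series[t-maxlag] for t = maxlag .. n-1."""
    n = len(series)
    return np.column_stack([series[maxlag - k : n - k] for k in range(1, maxlag + 1)])


def granger_f_stat(effect, cause, maxlag):
    """F statistic for 'past values of `cause` improve the prediction of `effect`'."""
    effect = np.asarray(effect, dtype=float)
    cause = np.asarray(cause, dtype=float)
    y = effect[maxlag:]
    const = np.ones((len(y), 1))
    restricted = np.hstack([const, lagged(effect, maxlag)])          # own past only
    unrestricted = np.hstack([restricted, lagged(cause, maxlag)])    # plus cause's past

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(y) - unrestricted.shape[1]
    return ((rss_r - rss_u) / maxlag) / (rss_u / dof)


# Same generative model as the notebook cell above: Y depends on X one step earlier.
rng = np.random.default_rng(0)
n = 200
x = rng.standard_normal(n)
y = np.zeros(n)
for i in range(1, n):
    y[i] = 0.5 * y[i - 1] + 0.8 * x[i - 1] + 0.1 * rng.standard_normal()

f_xy = granger_f_stat(y, x, maxlag=2)  # does X help predict Y?
f_yx = granger_f_stat(x, y, maxlag=2)  # does Y help predict X?
print(f"F (X -> Y): {f_xy:.1f}")
print(f"F (Y -> X): {f_yx:.1f}")
```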
