### **Organize PSD Data**

- [ ]  Ensure all power spectral density (PSD) data is saved per **subject-session** in a consistent format:
    - One file or object per subject-session
    - Contains:
        - EEG **channels**
        - **Frequencies**
        - Power per **epoch** or pre-averaged by condition
    - Associated **metadata**:
        - Subject ID
        - Session ID 
        - Dataset name
        - Cognitive **state label** (OT, MW)
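
One way to hold the fields listed above is a small container with shape checks. This is a minimal sketch, not the layout used by `eeg_analyzer`: the class name `SessionPSD`, its field names, and the `(n_epochs, n_channels, n_freqs)` array order are all assumptions made for illustration.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SessionPSD:
    """Hypothetical container for the PSD of one subject-session."""
    subject_id: str
    session_id: str
    dataset: str
    channels: list       # channel names, length n_channels
    freqs: np.ndarray    # frequencies in Hz, shape (n_freqs,)
    power: np.ndarray    # assumed shape (n_epochs, n_channels, n_freqs)
    states: list         # one state label per epoch, e.g. "OT" or "MW"

    def __post_init__(self):
        # Fail early if the metadata does not match the power array.
        n_epochs, n_channels, n_freqs = self.power.shape
        if n_channels != len(self.channels):
            raise ValueError("power and channels disagree on n_channels")
        if n_freqs != len(self.freqs):
            raise ValueError("power and freqs disagree on n_freqs")
        if n_epochs != len(self.states):
            raise ValueError("power and states disagree on n_epochs")

    @property
    def subject_session(self):
        # Same "<subject>_<session>" key style as the examples in this notebook.
        return f"{self.subject_id}_{self.session_id}"
```

Validating shapes at construction time should surface a mismatched file before it reaches the modeling steps.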

In [1]:
from eeg_analyzer.dataset import Dataset
from utils.config import DATASETS

dataset_config = DATASETS['jin2019']

dataset = Dataset(dataset_config)
dataset.load_subjects()

### **Extract Alpha Power**

- [ ]  For each subject-session:
    - Select **8–12 Hz** frequency band
    - **Sum power across frequencies** per channel and epoch
- [ ]  Result per epoch:

```python
{
    "subject_session": "001_1",
    "subject_id": "001",
    "session_id": "1",
    "channel": "Pz",
    "task": "SART",
    "state": "MW",
    "alpha_power": 3.45,
    ...
}
```
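
The band-selection and summation step can be sketched with NumPy as follows. The helper name `band_power`, the inclusive band edges, and the assumption that frequency is the last array axis are illustrative choices, not the behavior of `Dataset.to_long_band_power_list`.

```python
import numpy as np


def band_power(power, freqs, band=(8.0, 12.0)):
    """Sum PSD values over the bins inside `band` (edges inclusive, an assumption).

    power: array whose last axis is frequency, e.g. (n_epochs, n_channels, n_freqs)
    freqs: 1-D array of bin frequencies in Hz, same length as the last axis
    """
    freqs = np.asarray(freqs, dtype=float)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    if not mask.any():
        raise ValueError(f"no frequency bins inside {band}")
    return np.asarray(power)[..., mask].sum(axis=-1)


# Toy input: 5 epochs, 3 channels, 1 Hz bins from 1 to 40 Hz, unit power everywhere.
freqs = np.arange(1, 41)
power = np.ones((5, 3, 40))
alpha = band_power(power, freqs)  # shape (5, 3); each entry sums 5 bins (8..12 Hz)
```

A plain sum scales with the number of bins in the band, so values are only comparable across recordings that share the same frequency resolution.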

### **Assign Numerical Condition Labels**

For modeling the *ordinal direction* of alpha power:

| State | Code |
| --- | --- |
| OT | 0 |
| MW | 1 |
| MED | 2 |

- [ ]  For **Jin** and **Touryan**, use only codes `0` and `1` (OT, MW)
- [ ]  For **Braboszcz**, use codes `1` and `2` (MW, MED)
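
The mapping in the table can be applied with a plain dictionary lookup. This is a small sketch; `STATE_CODES` and `encode_states` are names introduced here, and raising on unknown labels is a design choice rather than something the notebook prescribes.

```python
import pandas as pd

# Codes taken from the table above.
STATE_CODES = {"OT": 0, "MW": 1, "MED": 2}


def encode_states(labels):
    """Map state labels to ordinal codes, failing loudly on unknown labels."""
    labels = pd.Series(labels)
    codes = labels.map(STATE_CODES)
    if codes.isna().any():
        unknown = sorted(labels[codes.isna()].astype(str).unique())
        raise ValueError(f"unknown state labels: {unknown}")
    return codes.astype(int)


codes = encode_states(["OT", "MW", "MW", "MED"])
```

With codes `1` and `2` only, the model intercept is the predicted value at `state = 0`, which lies outside the observed conditions, so the `state` slope is the quantity to interpret there.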

In [2]:
# Get a list of all epochs across all subject-session pairs; each epoch is a dict with keys such as:
# 'subject_session', 'subject_id', 'session_id', 'channel', 'task', 'state', 'band_power'

epochs = dataset.to_long_band_power_list(freq_band=(8,12))  # Alpha band

estimated_length = dataset.estimate_long_band_power_length()

### **Create Long-Form DataFrame per Dataset**

For each dataset, prepare:

| subject_session | subject_id | group | channel/ROI | state | alpha_power |
| --- | --- | --- | --- | --- | --- |
| 001_1 | 001 | NaN | Pz | 0 | 3.24 |
| 001_1 | 001 | NaN | Pz | 1 | 3.88 |
| 060_1 | 060 | vip | Pz | 1 | 5.12 |
| 060_1 | 060 | vip | Pz | 2 | 6.00 |

In [3]:
# create a dataframe with the epochs
import pandas as pd
df_full = pd.DataFrame(epochs)
df_full.head()

Unnamed: 0,subject_session,subject_id,session_id,group,epoch_idx,channel,cortical_region,hemisphere,task,state,band_power,is_bad
0,1_2,1,2,,0,Fp1,prefrontal,left,vs,1,5.478316,False
1,1_2,1,2,,0,AF7,prefrontal,left,vs,1,4.105724,False
2,1_2,1,2,,0,AF3,prefrontal,left,vs,1,4.055729,False
3,1_2,1,2,,0,F1,frontal,left,vs,1,3.185635,False
4,1_2,1,2,,0,F3,frontal,left,vs,1,3.334554,False


## **Report the state balance**

In [13]:
state_counts = df_full['state'].value_counts()
state_counts = state_counts / state_counts.sum()
print(f"State balance: {state_counts.to_dict()}")

State balance: {1: 0.6506716873600651, 0: 0.34932831263993486}


## **Make grouped DataFrames per channel**

* [ ] Split the data into a list of DataFrames, one per channel.
* [ ] Add a column with the z-score computed within each state.
* [ ] Remove all epochs with a z-score > 4.

In [4]:
# Create a list of dataframes for each channel (keep the channel column)
df_channels = []
for channel in df_full['channel'].unique():
    df_channel = df_full[df_full['channel'] == channel].reset_index(drop=True)
    df_channels.append(df_channel)

# create a new column with z-scores within each state
# def z_score_within_state(df):
#     grouped = df.groupby(['subject_session', 'state'])
#     df['z_score'] = grouped['band_power'].transform(lambda x: (x - x.mean()) / x.std())
#     return df

# # apply the z-score function to each channel dataframe
# df_channels_z = []
# for df_channel in df_channels:
#     df_channel_z = z_score_within_state(df_channel)
#     df_channels_z.append(df_channel_z)

# # Remove all rows where z_score > 3
# for i in range(len(df_channels_z)):
#     df_channels_z[i] = df_channels_z[i][df_channels_z[i]['z_score'] <= 3]

# print(df_channels_z[0].head())
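
A compact, runnable variant of the outlier filter is sketched below. It assumes the `subject_session`, `state`, and `band_power` columns shown earlier; the function name `drop_outlier_epochs` is hypothetical. The cutoff is left as a parameter because the checklist mentions 4 while the commented-out code uses 3, and the absolute value is used so that extreme low values are removed as well, which is an assumption about the intent.

```python
import pandas as pd


def drop_outlier_epochs(df, threshold=3.0, value_col="band_power"):
    """Z-score within (subject_session, state) and drop rows with |z| > threshold."""
    grouped = df.groupby(["subject_session", "state"])[value_col]
    z = (df[value_col] - grouped.transform("mean")) / grouped.transform("std")
    out = df.assign(z_score=z)
    # Rows with an undefined z-score (single-epoch groups) are dropped too,
    # because NaN never satisfies the comparison.
    return out[out["z_score"].abs() <= threshold].reset_index(drop=True)


# Toy data: 20 ordinary epochs and one extreme epoch in the same group.
toy = pd.DataFrame({
    "subject_session": ["001_1"] * 21,
    "state": [0] * 21,
    "band_power": [1.0] * 20 + [100.0],
})
clean = drop_outlier_epochs(toy, threshold=3.0)
```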


### **Fit Mixed Effects Model per Channel**

Do this **separately for each dataset** and for each channel (or ROI):

### ✅ Model formula:

```python
alpha_power ~ state + (state | subject_session)
```

This:

- Estimates the effect of `state` (e.g., OT → MW or MW → MED)
- Models variability across subject-sessions in both baseline (intercept) and sensitivity (slope)
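
Written out, the formula corresponds to the following model, with $i$ indexing epochs and $j$ indexing subject-sessions. The symbols ($\beta$, $u$, $\Sigma$, $\sigma^2$) are standard mixed-model notation introduced here for illustration:

$$
\text{alpha\_power}_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\,\text{state}_{ij} + \varepsilon_{ij},
\qquad
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim \mathcal{N}(\mathbf{0}, \Sigma),
\qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

Here $\beta_1$ is the fixed effect of `state`, $u_{0j}$ and $u_{1j}$ are the random intercept and slope of subject-session $j$, and $\Sigma$ holds their variances and covariance.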

In [11]:
# Fit mixed effects model per channel
import statsmodels.formula.api as smf
from statsmodels.tools.sm_exceptions import ConvergenceWarning

results_per_channel = {}

for df_channel in df_channels:
    channel_name = df_channel['channel'].iloc[0] if 'channel' in df_channel else None
    # Fit the mixed effects model: alpha_power ~ state + (state | subject_session)
    # Use 'band_power' as the dependent variable
    # 'state' as fixed effect, random intercept and slope for 'subject_session'
    model = smf.mixedlm(
        "band_power ~ state",
        df_channel,
        groups="subject_session",
        re_formula="~state"
    )
    result = model.fit(method="lbfgs")
    results_per_channel[channel_name] = result

# Print summary for all channels
for channel in results_per_channel:
    print(f"Results for channel: {channel}")
    print(results_per_channel[channel].summary())



Results for channel: Fp1
                 Mixed Linear Model Regression Results
Model:                  MixedLM     Dependent Variable:     band_power 
No. Observations:       19652       Method:                 REML       
No. Groups:             58          Scale:                  312.2876   
Min. group size:        106         Log-Likelihood:         -84534.3671
Max. group size:        443         Converged:              Yes        
Mean group size:        338.8                                          
-----------------------------------------------------------------------
                             Coef.  Std.Err.   z    P>|z| [0.025 0.975]
-----------------------------------------------------------------------
Intercept                    24.314    3.032  8.020 0.000 18.372 30.255
state                        -1.775    0.541 -3.278 0.001 -2.836 -0.714
subject_session Var         529.682    5.653                           
subject_session x state Cov -38.187    0.762            

## **Z-score normalize within subject-session**
Z-score band power within each subject-session across epochs, then fit new models whose coefficients are easier to interpret.
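
The toy example below illustrates one consequence of this normalization: after z-scoring within each group, every group has mean 0 and standard deviation 1. The data are synthetic and the session labels are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two synthetic subject-sessions with very different raw power levels.
toy = pd.DataFrame({
    "subject_session": np.repeat(["001_1", "002_1"], 50),
    "band_power": np.concatenate([rng.normal(5, 1, 50), rng.normal(20, 4, 50)]),
})

# Z-score within each subject-session (sample standard deviation, ddof=1).
grouped = toy.groupby("subject_session")["band_power"]
toy["z_score"] = (toy["band_power"] - grouped.transform("mean")) / grouped.transform("std")

# Per-group mean and standard deviation of the z-scores.
summary = toy.groupby("subject_session")["z_score"].agg(["mean", "std"])
```

Because the group means are removed, little between-group variance is left for a random intercept to absorb, so a `subject_session Var` estimate near zero is to be expected for the z-scored models.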

In [8]:
# Add a column with band power z-scored within each subject-session
def z_score_within_session(df):
    df = df.copy()  # avoid mutating the per-channel frames in df_channels
    grouped = df.groupby('subject_session')
    df['z_score'] = grouped['band_power'].transform(lambda x: (x - x.mean()) / x.std())
    return df

# apply the z-score function to each channel dataframe
df_channels_z = []
for df_channel in df_channels:
    df_channel_z = z_score_within_session(df_channel)
    df_channels_z.append(df_channel_z)

In [9]:
# Fit mixed effects model per channel with z-score instead of band_power

results_per_channel_z = {}

for df_channel in df_channels_z:
    channel_name = df_channel['channel'].iloc[0] if 'channel' in df_channel else None
    # Fit the mixed effects model: z_score ~ state + (state | subject_session)
    # Use 'z_score' as the dependent variable
    # 'state' as fixed effect, random intercept and slope for 'subject_session'
    model = smf.mixedlm(
        "z_score ~ state",
        df_channel,
        groups="subject_session",
        re_formula="~state"
    )
    result = model.fit(method="lbfgs")
    results_per_channel_z[channel_name] = result

# Print summary for all channels
for channel in results_per_channel_z:
    print(f"Results for channel: {channel}")
    print(results_per_channel_z[channel].summary())



Results for channel: Fp1
                Mixed Linear Model Regression Results
Model:                 MixedLM     Dependent Variable:     z_score    
No. Observations:      19652       Method:                 REML       
No. Groups:            58          Scale:                  0.9874     
Min. group size:       106         Log-Likelihood:         -27901.6066
Max. group size:       443         Converged:              Yes        
Mean group size:       338.8                                          
----------------------------------------------------------------------
                            Coef.  Std.Err.   z    P>|z| [0.025 0.975]
----------------------------------------------------------------------
Intercept                    0.072    0.038  1.914 0.056 -0.002  0.146
state                       -0.091    0.034 -2.693 0.007 -0.158 -0.025
subject_session Var          0.072    0.002                           
subject_session x state Cov  0.021                                   

In [12]:
# Fit mixed effects model per channel with z-score instead of band_power,
# this time with a random intercept only

results_per_channel_z = {}

for df_channel in df_channels_z:
    channel_name = df_channel['channel'].iloc[0] if 'channel' in df_channel else None
    # Fit the mixed effects model: z_score ~ state + (1 | subject_session)
    # Use 'z_score' as the dependent variable
    # 'state' as fixed effect, random intercept only for 'subject_session'
    model = smf.mixedlm(
        "z_score ~ state",
        df_channel,
        groups="subject_session",
        re_formula="~1"
    )
    result = model.fit(method="lbfgs")
    results_per_channel_z[channel_name] = result

# Print summary for all channels
for channel in results_per_channel_z:
    print(f"Results for channel: {channel}")
    print(results_per_channel_z[channel].summary())

  sdf[0:self.k_fe, 1] = np.sqrt(np.diag(self.cov_params()[0:self.k_fe]))


Results for channel: Fp1
                  Mixed Linear Model Regression Results
Model:                    MixedLM       Dependent Variable:       z_score
No. Observations:         19652         Method:                   REML   
No. Groups:               58            Scale:                    0.9954 
Min. group size:          106           Log-Likelihood:           inf    
Max. group size:          443           Converged:                Yes    
Mean group size:          338.8                                          
-------------------------------------------------------------------------
                    Coef.   Std.Err.    z    P>|z|    [0.025     0.975]  
-------------------------------------------------------------------------
Intercept            0.029 482460.915  0.000 1.000 -945605.988 945606.047
state               -0.095      0.016 -5.832 0.000      -0.127     -0.063
subject_session Var  0.000                                               

Results for channel: AF7
     

### **Store & Summarize Results**

For each model/channel:

- [ ]  Store:
    - Fixed effect estimate for `state`
    - p-value
    - Test statistic (optional; the `MixedLM` summary reports *z*)
    - Number of subjects included
- [ ]  Save results to `.csv` or `.json`
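
Collecting the stored quantities into one table could look like the sketch below. It assumes each fitted result exposes `params`, `bse`, `tvalues`, and `pvalues` indexed by term name, plus `converged` and `model.n_groups`, which is my understanding of statsmodels `MixedLM` results fitted through the formula API. The function name `summarize_results` and the output file name are hypothetical.

```python
import pandas as pd


def summarize_results(results_per_channel, term="state"):
    """Build one row per channel from a dict of fitted mixed-model results."""
    rows = []
    for channel, res in results_per_channel.items():
        rows.append({
            "channel": channel,
            "coef": float(res.params[term]),      # fixed-effect estimate
            "std_err": float(res.bse[term]),
            "z": float(res.tvalues[term]),        # reported as z in the summary
            "p_value": float(res.pvalues[term]),
            "n_groups": int(res.model.n_groups),  # subject-sessions in the fit
            "converged": bool(res.converged),
        })
    return pd.DataFrame(rows)


# Usage with the dictionary built earlier in this notebook:
# summary = summarize_results(results_per_channel_z)
# summary.to_csv("mixedlm_state_effects.csv", index=False)  # hypothetical file name
```

Keeping `converged` in the table makes it easy to exclude or flag channels whose fit is unreliable before plotting.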

### **Visualize Results**

- [ ]  Plot per-channel **bar plots** or **line plots** of alpha power by condition
- [ ]  Plot **topoplots** (scalp maps) of:
    - `state` slope per channel
    - p-values (FDR-corrected or raw, with masking)
    - t-values
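
Before building scalp maps, a per-channel bar plot of the `state` slope with approximate 95% intervals gives a quick overview. This is a sketch using only matplotlib; the function name `plot_state_slopes` is hypothetical, the numbers in the usage lines are made-up placeholders rather than results, and the interval uses the normal approximation `1.96 * SE`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt


def plot_state_slopes(channels, coefs, std_errs, ax=None):
    """Bar plot of the per-channel `state` slope with approximate 95% intervals."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 3))
    half_widths = [1.96 * se for se in std_errs]  # normal-approximation interval
    ax.bar(channels, coefs, yerr=half_widths, capsize=3)
    ax.axhline(0, color="black", linewidth=0.8)   # reference line at zero effect
    ax.set_xlabel("Channel")
    ax.set_ylabel("Fixed effect of state")
    return ax


# Placeholder values for illustration only (not model output).
ax = plot_state_slopes(["Fp1", "AF7", "AF3"], [-0.10, -0.05, 0.02], [0.03, 0.04, 0.03])
# ax.figure.savefig("state_slopes.png", dpi=150)  # hypothetical output path
```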

In [None]:
import matplotlib.pyplot as plt
from utils.config import PLOTS_PATH
import os



### **Correct for Multiple Comparisons**

- [ ]  Across channels (e.g., for scalp maps), correct p-values using:
    - **False Discovery Rate (FDR)**
    - Or **cluster-based permutation test** if possible
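
The FDR option can be sketched with the Benjamini-Hochberg step-up procedure in plain NumPy, as below. The function name `fdr_bh` and the example p-values are illustrative. In practice, `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` performs the same correction.

```python
import numpy as np


def fdr_bh(p_values, alpha=0.05):
    """Benjamini-Hochberg FDR correction.

    Returns (reject, adjusted), both in the original order of `p_values`.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: each adjusted value is the minimum over larger ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted <= alpha, adjusted


# Illustrative p-values for four channels (not model output).
reject, adjusted = fdr_bh([0.01, 0.04, 0.03, 0.005])
# adjusted is [0.02, 0.04, 0.04, 0.02], so all four pass at alpha = 0.05
```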

### **Filter Low-Effect Subjects**

- [ ]  Compute within-subject effect sizes (e.g., Cohen’s *d*) between states
- [ ]  Re-run the model excluding sessions with **d < 0.5**
- [ ]  Compare results to check robustness
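
A per-session effect size can be sketched as below, using Cohen's *d* with a pooled standard deviation. The function names are hypothetical, the column names follow the long-form DataFrame above, and sessions with fewer than two epochs in either state are skipped, which is an assumption. The sign of *d* follows the direction `high_state` minus `low_state`.

```python
import numpy as np
import pandas as pd


def cohens_d(a, b):
    """Cohen's d for two independent samples (mean of b minus mean of a, pooled SD)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = ((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)) / (a.size + b.size - 2)
    return (b.mean() - a.mean()) / np.sqrt(pooled_var)


def session_effect_sizes(df, low_state, high_state, value_col="band_power"):
    """Cohen's d between two states, computed separately per subject-session."""
    out = {}
    for session, group in df.groupby("subject_session"):
        a = group.loc[group["state"] == low_state, value_col]
        b = group.loc[group["state"] == high_state, value_col]
        if len(a) > 1 and len(b) > 1:  # need at least 2 epochs per state
            out[session] = cohens_d(a, b)
    return pd.Series(out, name="cohens_d", dtype=float)


# Toy data: one session where state 1 is two pooled SDs above state 0.
toy = pd.DataFrame({
    "subject_session": ["001_1"] * 6,
    "state": [0, 0, 0, 1, 1, 1],
    "band_power": [1.0, 2.0, 3.0, 3.0, 4.0, 5.0],
})
d = session_effect_sizes(toy, low_state=0, high_state=1)
keep = d[d >= 0.5].index  # sessions retained under the d < 0.5 exclusion rule
```

Whether to threshold on *d* or on its absolute value depends on whether sessions with a strong effect in the opposite direction should be kept; the checklist as written thresholds on *d* itself.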