In [2]:
# Import pandas for data handling
import pandas as pd

# Import statistical tools
from statsmodels.formula.api import ols
import statsmodels.api as sm

# System and warning configuration
import warnings
import os
warnings.filterwarnings('ignore')  # Suppress all warnings
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

### Load and Prepare Dataset

In this step, we load the raw behavioral dataset from a .parquet file into a DataFrame. We also standardize Sex-related columns (GR_Sex and Sex) by converting all values to lowercase, ensuring consistent formatting for downstream analysis.

In [4]:
# Define the path to the raw data file (Parquet format)
file_path = "raw/ambitus_0_15_log_24_07_2025.parquet"
# Load the data into a DataFrame
df = pd.read_parquet(file_path)
# Standardize Sex-related columns by converting text to lowercase
df['GR_Sex'] = df['GR_Sex'].str.lower()
df['Sex'] = df['Sex'].str.lower()

In [5]:
# Drop rows with missing values in GR_Sex or Year
df = df.dropna(subset=['GR_Sex', 'Year'])
# Standardize Sex values to lowercase
df['Sex'] = df['Sex'].str.lower()
# Store the lowercase GR_Sex values in a new Group_Sex column
df['Group_Sex'] = df['GR_Sex'].str.lower()

### Factorial ANOVA – Group × Sex × Year Effects

To evaluate whether behavioral outcomes differ across experimental groups, sexes, and years, we performed a factorial ANOVA on six key behavioral features. The model includes:
- Main effects: Group, Sex, Year
- Two-way interactions: Group × Year and Sex × Year
  
This analysis identifies which behavioral measures differ significantly across groups, sexes, or years. Each model is fit by ordinary least squares and tested with Type II sums of squares; the per-feature results are concatenated into a single ANOVA table (final_anova_df) for further inspection and formatting.

In [7]:

# Drop rows where grouping variables are missing
df = df.dropna(subset=['GR_Sex', 'Year'])
# Normalize Sex information (lowercase for consistency)
df['Sex'] = df['Sex'].str.lower()
df['Group_Sex'] = df['GR_Sex'].str.lower()

# Define variables to test and corresponding feature names in the dataset
anova_targets = {
    'LOCO_TOT': 'Locomotion (Loco_TOT)',
    'LOCO_BEF': 'Locomotion frequency (LOCO_BEF)',
    'EXPL_TOT': 'Exploration (Expl_TOT)',
    'Expl_E_I_BEF_Nr': 'Exploration frequency (Expl_BEF)',
    'L_C': 'Learning capacity (L_C)',
    'E_E': 'Effective exploration ratio (E_E)'
}

# Initialize a list to collect ANOVA results
anova_results = []

# Fit a factorial ANOVA for each target variable:
# three main effects (Group, Sex, Year) plus the Group x Year and Sex x Year interactions
for feature, description in anova_targets.items():
    formula = f"{feature} ~ C(Group) + C(Sex) + C(Year) + C(Group):C(Year) + C(Sex):C(Year)"
    model = ols(formula, data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    # Annotate results with feature description and variable name
    anova_table["Feature"] = description
    anova_table["Variable"] = anova_table.index
    anova_results.append(anova_table.reset_index(drop=True))

# Combine results into a single DataFrame
final_anova_df = pd.concat(anova_results, ignore_index=True)

### ANOVA Summary Table (Formatted for Readability)

The raw factorial ANOVA results are reformatted into a clean summary table. Only the main effects and relevant two-way interactions are retained:
- Gr: Experimental Group
- Sex: Biological Sex
- Year: Testing Year
- Gr/Y: Group × Year interaction
- Sex/Y: Sex × Year interaction

Each cell shows the F-statistic along with the associated p-value, helping to quickly identify which variables are significantly influenced by group, sex, year, or their interactions. This format is well-suited for publication or supplementary tables.
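To flag significant effects programmatically rather than by eye, one option is to pivot the numeric p-values and compare them with a threshold. The sketch below is illustrative only: the `long_rows` frame is a hypothetical stand-in for the filtered ANOVA results, its p-values are the rounded Sex and Sex/Y values reported in this section's summary table, and the cutoff `alpha = 0.05` is an assumption, not something the notebook specifies.

```python
import pandas as pd

# Hypothetical long-format rows; p-values are the rounded ones from the summary table
long_rows = pd.DataFrame({
    "Feature": ["E_E", "E_E", "EXPL_TOT", "EXPL_TOT", "L_C", "L_C"],
    "Effect": ["Sex", "Sex/Y", "Sex", "Sex/Y", "Sex", "Sex/Y"],
    "PR(>F)": [0.0547, 0.0063, 0.0425, 0.0018, 0.0943, 0.0545],
})

alpha = 0.05  # assumed significance threshold

# One row per feature, one column per effect, numeric p-values in the cells
p_wide = long_rows.pivot(index="Feature", columns="Effect", values="PR(>F)")

# Boolean table: True where the effect falls below the threshold
significant = p_wide < alpha
print(significant)
```

Keeping the p-values numeric until the last step makes this kind of comparison possible; the formatted `F(p)` strings are meant for display only.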

In [9]:
# Create a copy of the original ANOVA results to preserve the source
df_anova = final_anova_df.copy()

# Define a mapping from raw ANOVA variable terms to simplified labels
label_map = {
    'C(Group)': 'Gr',
    'C(Sex)': 'Sex',
    'C(Year)': 'Year',
    'C(Group):C(Year)': 'Gr/Y',
    'C(Sex):C(Year)': 'Sex/Y'
}

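# Filter the DataFrame to retain only the terms of interest (this drops the Residual rows);
# .copy() avoids pandas' SettingWithCopyWarning on the column assignments below
df_anova = df_anova[df_anova['Variable'].isin(label_map.keys())].copy()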

# Create a new column for simplified effect labels
df_anova['Effect'] = df_anova['Variable'].map(label_map)

# Format the F and p-values into a readable string;
# p-values below 0.0001 are reported as "p < 0.0001", all others as "p = ..."
df_anova['F(p)'] = df_anova.apply(
    lambda row: (
        f"{row['F']:.2f} (p < 0.0001)" if row['PR(>F)'] < 1e-4
        else f"{row['F']:.2f} (p = {row['PR(>F)']:.4f})"
    ),
    axis=1,
)

# Pivot the table so each effect is a column and each feature is a row
table_formatted = df_anova.pivot(index='Feature', columns='Effect', values='F(p)')

# Reorder columns to match desired output
ordered_cols = ['Gr', 'Sex', 'Year', 'Gr/Y', 'Sex/Y']
table_formatted = table_formatted.reindex(columns=ordered_cols)
display(table_formatted)

| Feature | Gr | Sex | Year | Gr/Y | Sex/Y |
|---|---|---|---|---|---|
| Effective exploration ratio (E_E) | 102.58 (p < 0.0001) | 3.69 (p = 0.0547) | 19.32 (p < 0.0001) | 12.56 (p < 0.0001) | 3.00 (p = 0.0063) |
| Exploration (Expl_TOT) | 293.82 (p < 0.0001) | 4.12 (p = 0.0425) | 41.56 (p < 0.0001) | 21.92 (p < 0.0001) | 3.52 (p = 0.0018) |
| Exploration frequency (Expl_BEF) | 67.93 (p < 0.0001) | 25.22 (p < 0.0001) | 34.10 (p < 0.0001) | 12.19 (p < 0.0001) | 2.76 (p = 0.0112) |
| Learning capacity (L_C) | 261.45 (p < 0.0001) | 2.80 (p = 0.0943) | 28.40 (p < 0.0001) | 11.74 (p < 0.0001) | 2.06 (p = 0.0545) |
| Locomotion (Loco_TOT) | 147.16 (p < 0.0001) | 94.49 (p < 0.0001) | 14.02 (p < 0.0001) | 12.63 (p < 0.0001) | 1.70 (p = 0.1163) |
| Locomotion frequency (LOCO_BEF) | 37.25 (p < 0.0001) | 44.34 (p < 0.0001) | 22.36 (p < 0.0001) | 2.96 (p = 0.0069) | 5.35 (p < 0.0001) |


### Experiment with ANOVA Models

-> Playground cell <-

This cell allows you to experiment with different ANOVA models by changing:
- target_variable: The behavioral feature to analyze (e.g. L_C, EXPL_TOT, E_E, etc.)
- formula: The structure of the model, including main effects and interactions

Suggestions:
- Try adding/removing predictors like C(Season) or C(Paradigm)
- Test interaction terms, e.g. C(Group):C(Sex)
- Replace the outcome variable with another metric from the dataset

Some combinations may result in collinearity or missing data. Use dropna() as needed:
df_clean = df.dropna(subset=[target_variable, 'Group', 'Sex', 'Year'])


In [11]:
# Example: Customize ANOVA feature and formula here

target_variable = 'LOCO_TOT'  # You can change this to any numeric behavioral feature
formula = f"{target_variable} ~ C(Group) + C(Sex) + C(Year) + C(Group):C(Year)"

# Fit the model using ordinary least squares (OLS)
model = ols(formula, data=df).fit()

# Compute the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)
display(anova_table)


|  | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(Group) | 12314.133755 | 1.0 | 145.685555 | 4.080636e-33 |
| C(Sex) | 7980.842955 | 1.0 | 94.419434 | 3.886037e-22 |
| C(Year) | 7104.719036 | 6.0 | 14.009037 | 6.971234e-16 |
| C(Group):C(Year) | 6176.450400 | 6.0 | 12.178683 | 1.193428e-13 |
| Residual | 452464.609990 | 5353.0 |  |  |
