# Z Test for Means of image_index 0, 1, and 2

This notebook runs pairwise two-sample Z tests to check whether the mean `agents` value differs significantly between `image_index` 0, 1, and 2 across the six run CSV files.

In [2]:
! pip install scipy

In [3]:
# Import required libraries
import pandas as pd
import numpy as np
from scipy.stats import norm

## 1. Load the CSV file

We load the six per-run CSV files, tag each with a `run_id`, and concatenate them into a single DataFrame.

In [16]:
# Load each run's CSV file and tag its rows with the run number
run_files = [
    'results_stage2/Difaggregation_data_20250613_103611_run1.csv',
    'results_stage2/Difaggregation_data_20250613_103826_run2.csv',
    'results_stage2/Difaggregation_data_20250613_103927_run3.csv',
    'results_stage2/Difaggregation_data_20250613_104015_run4.csv',
    'results_stage2/Difaggregation_data_20250613_104220_run5.csv',
    'results_stage2/Difaggregation_data_20250613_104617_run6.csv',
]

frames = []
for run_id, path in enumerate(run_files, start=1):
    run_df = pd.read_csv(path)
    run_df["run_id"] = run_id
    frames.append(run_df)

# Stack all runs into one DataFrame with a fresh index
df = pd.concat(frames, ignore_index=True)
df.head()

Unnamed: 0,frame,image_index,agents,agents_smoothed,run_id
0,0,0,87,87.0,1
1,1,0,87,87.0,1
2,2,0,87,87.0,1
3,3,0,87,87.0,1
4,4,0,87,87.0,1


In [18]:
# Pivot so each (run_id, frame) pair has one row and each image_index becomes a column
df_pivot = df.pivot_table(index=['run_id', 'frame'], columns='image_index', values='agents', aggfunc='mean')

# Optional: rename the columns for clarity
df_pivot.columns = ['zone0', 'zone1', 'zone2']

# Drop rows with any NaN (in case one frame is missing data for a zone)
df_pivot.dropna(inplace=True)

# Each row now holds all three zones for one frame of one run
print(df_pivot.head())


              zone0  zone1  zone2
run_id frame                     
1      0       87.0   11.0    2.0
       1       87.0   11.0    2.0
       2       87.0   11.0    2.0
       3       87.0   11.0    2.0
       4       87.0   11.0    2.0


## 2. Compare per-run means with a Z test

We average each zone within every run, then run a two-sample Z test on the six per-run means of `zone1` and `zone2`.

In [20]:
# Group by run and calculate mean proportion for each zone per run
zone_means = df_pivot.groupby('run_id')[['zone0', 'zone1', 'zone2']].mean().reset_index()

# Two-sample Z test on the per-run means (norm and np are imported above)
def z_test_from_samples(x, y):
    mean1, std1, n1 = x.mean(), x.std(ddof=1), len(x)
    mean2, std2, n2 = y.mean(), y.std(ddof=1), len(y)
    # Standard error of the difference between the two means
    se = np.sqrt(std1**2 / n1 + std2**2 / n2)
    z = (mean1 - mean2) / se
    # Two-sided p-value; norm.sf(x) equals 1 - norm.cdf(x) but keeps precision in the tail
    p = 2 * norm.sf(abs(z))
    return z, p

# Compare zone1 vs zone2
z, p = z_test_from_samples(zone_means['zone1'], zone_means['zone2'])
print(f"Z-test (zone1 vs zone2 mean proportion across runs): z = {z:.4f}, p = {p:.4e}")


Z-test (zone1 vs zone2 mean proportion across runs): z = 3.9675, p = 7.2626e-05
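
With only six runs per zone, the normal approximation behind a Z test is a strong assumption, because the standard deviations are themselves estimated from six values. One possible cross-check is Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`), which uses the same statistic but takes the p-value from a t distribution. The sketch below runs on `toy_means`, a hypothetical stand-in for `zone_means` with made-up numbers, so its output says nothing about the real data.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm, ttest_ind

# Hypothetical per-run means standing in for `zone_means` (six runs, made-up values)
toy_means = pd.DataFrame({
    "run_id": [1, 2, 3, 4, 5, 6],
    "zone1": [60.0, 72.0, 55.0, 68.0, 70.0, 63.0],
    "zone2": [3.0, 2.5, 4.0, 3.5, 2.0, 3.2],
})

x, y = toy_means["zone1"], toy_means["zone2"]

# Welch's t-test: the statistic is (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2),
# the same quantity as z above, but the p-value comes from a t distribution
t_stat, p_t = ttest_ind(x, y, equal_var=False)

# p-value the normal approximation would give for the same statistic
p_z = 2 * norm.sf(abs(t_stat))

print(f"Welch t-test:         t = {t_stat:.4f}, p = {p_t:.4e}")
print(f"Normal approximation: z = {t_stat:.4f}, p = {p_z:.4e}")
```

Because the t distribution has heavier tails than the normal, the Welch p-value is never smaller than the Z-test p-value for the same statistic. The gap is largest when the number of runs is small.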


## 3. Calculate means and standard deviations

We will calculate the mean and standard deviation for each group.

In [None]:
# Group by image_index
groups = df.groupby("image_index")

# Calculate mean, std, and n for each group using 'agents'
stats = {}
for idx, group in groups:
    stats[idx] = {
        'mean': group['agents'].mean(),
        'std': group['agents'].std(ddof=1),
        'n': len(group)
    }
    print(f"image_index {idx}: mean = {stats[idx]['mean']:.4f}, std = {stats[idx]['std']:.4f}, n = {stats[idx]['n']}")


image_index 0: mean = 33.8339, std = 33.1023, n = 58584
image_index 1: mean = 65.3500, std = 34.7518, n = 60001
image_index 2: mean = 3.0776, std = 2.0497, n = 31490
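
The same summary can be computed without an explicit loop. A minimal sketch using `groupby(...).agg(...)` follows; `toy` is a hypothetical frame with made-up values, and in the notebook it would be `df`.

```python
import pandas as pd

# Made-up stand-in for `df`, with the two columns the summary uses
toy = pd.DataFrame({
    "image_index": [0, 0, 0, 1, 1, 1, 2, 2],
    "agents": [87, 85, 80, 11, 13, 18, 2, 3],
})

# One row per image_index. pandas' "std" uses ddof=1 by default, matching
# group['agents'].std(ddof=1) in the loop above. "count" counts non-null
# values, which equals len(group) when no 'agents' values are missing.
summary = toy.groupby("image_index")["agents"].agg(["mean", "std", "count"])
print(summary)
```

A row can then be read with, for example, `summary.loc[0, "mean"]`, in place of `stats[0]['mean']`.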


## 4. Perform Z tests between the groups

We will perform pairwise Z tests between the means of the groups.
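
For two groups with sample means $\bar{x}_1, \bar{x}_2$, sample standard deviations $s_1, s_2$, and sizes $n_1, n_2$, the statistic and two-sided p-value computed by `z_test` are:

$$
z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
p = 2\,\bigl(1 - \Phi(|z|)\bigr)
$$

where $\Phi$ is the standard normal cumulative distribution function.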

In [12]:
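def z_test(mean1, std1, n1, mean2, std2, n2):
    # Standard error of the difference between the two means
    se = np.sqrt(std1**2 / n1 + std2**2 / n2)
    z = (mean1 - mean2) / se
    # Two-sided p-value; norm.sf(x) equals 1 - norm.cdf(x) but keeps precision in the tail
    p = 2 * norm.sf(abs(z))
    return z, p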

pairs = [(0, 1), (0, 2), (1, 2)]
for i, j in pairs:
    z, p = z_test(stats[i]['mean'], stats[i]['std'], stats[i]['n'],
                  stats[j]['mean'], stats[j]['std'], stats[j]['n'])
    print(f"Z test between image_index {i} and {j}: z = {z:.4f}, p = {p:.4e}")

Z test between image_index 0 and 1: z = -159.9329, p = 0.0000e+00
Z test between image_index 0 and 2: z = 224.0892, p = 0.0000e+00
Z test between image_index 1 and 2: z = 437.4842, p = 0.0000e+00
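
The `p = 0.0000e+00` values are floating-point underflow rather than exact zeros: for |z| in the hundreds, the tail probability is smaller than the smallest representable double. One way to keep the information is to work with the logarithm of the p-value through `scipy.stats.norm.logsf`. The sketch below assumes that approach; the helper name `log10_two_sided_p` is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def log10_two_sided_p(z):
    """Return log10 of the two-sided normal p-value without underflow."""
    # log(2 * sf(|z|)) = log(2) + logsf(|z|); divide by log(10) to change base
    return (np.log(2) + norm.logsf(abs(z))) / np.log(10)

# z values copied from the pairwise output above
pairwise_z = [("0 vs 1", -159.9329), ("0 vs 2", 224.0892), ("1 vs 2", 437.4842)]
for label, z in pairwise_z:
    print(f"image_index {label}: log10(p) = {log10_two_sided_p(z):.1f}")
```

A more negative `log10(p)` means a smaller p-value, so the pairs can still be ranked even though all three print as zero in ordinary floating point.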


## 5. Interpretation of Results

For each pair, a p-value below 0.05 (p < 0.05) is taken as evidence that the two means differ significantly. The `p = 0.0000e+00` values reported above are floating-point underflow for very large |z|, not exact zeros.

In [13]:
# Interpretation
for i, j in pairs:
    z, p = z_test(stats[i]['mean'], stats[i]['std'], stats[i]['n'],
                  stats[j]['mean'], stats[j]['std'], stats[j]['n'])
    if p < 0.05:
        print(f"The means of image_index {i} and {j} are significantly different (p = {p:.4e}).")
    else:
        print(f"No significant difference between means of image_index {i} and {j} (p = {p:.4e}).")

The means of image_index 0 and 1 are significantly different (p = 0.0000e+00).
The means of image_index 0 and 2 are significantly different (p = 0.0000e+00).
The means of image_index 1 and 2 are significantly different (p = 0.0000e+00).
