# Exploratory Analysis

This notebook analyzes the per-mutant test results.

In [None]:
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('../data/results_by_mutant_filtered_none_removed.csv')


In [None]:
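# Summary statistics for every column, including non-numeric ones
display(df.describe(include='all'))
# Drop the error columns; the analysis below uses only the fail columns
df.drop(columns=[' Human-Error', ' Branch-Error', ' Default-Error'], inplace=True)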

In [None]:
df_grouped_system = df.drop(columns=['Mutant', ' Mutation Operator']).groupby(' System').sum(numeric_only=True)
display(df_grouped_system)

In [None]:
df_grouped_operator = df.drop(columns=['Mutant', ' System']).groupby(' Mutation Operator').sum(numeric_only=True)
display(df_grouped_operator)

In [None]:
# Set style for better visualization
plt.style.use('default')
sns.set_theme(style="whitegrid")  # Apply seaborn styling

# 1. Boxplots for System-level comparison
plt.figure(figsize=(10, 6))
df_melted_system = df_grouped_system[[' Human-Fail', ' Branch-Fail', ' Default-Fail']].melt()
sns.boxplot(x='variable', y='value', data=df_melted_system)
plt.title('Distribution of Fail Counts by Test Type (System Level)')
plt.xlabel('Test Type')
plt.ylabel('Number of Fails')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# 2. Violin plots for Operator-level comparison
plt.figure(figsize=(10, 6))
df_melted_operator = df_grouped_operator[[' Human-Fail', ' Branch-Fail', ' Default-Fail']].melt()
sns.violinplot(x='variable', y='value', data=df_melted_operator)
plt.title('Distribution of Fail Counts by Test Type (Operator Level)')
plt.xlabel('Test Type')
plt.ylabel('Number of Fails')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# 3. Bar plot showing means with error bars
plt.figure(figsize=(10, 6))
means = df_grouped_system[[' Human-Fail', ' Branch-Fail', ' Default-Fail']].mean()
sems = df_grouped_system[[' Human-Fail', ' Branch-Fail', ' Default-Fail']].sem()
means.plot(kind='bar', yerr=sems, capsize=5)
plt.title('Mean Fail Counts by Test Type with Standard Error')
plt.xlabel('Test Type')
plt.ylabel('Mean Number of Fails')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# 4. Heatmap of correlations
plt.figure(figsize=(8, 6))
correlation_matrix = df_grouped_system[[' Human-Fail', ' Branch-Fail', ' Default-Fail']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation between Test Types')
plt.tight_layout()
plt.show()

# 5. Line plot of fail counts per system
system_names = df_grouped_system.index
plt.figure(figsize=(10, 6))
plt.plot(system_names, df_grouped_system[' Human-Fail'], 'o-', label='Human')
plt.plot(system_names, df_grouped_system[' Branch-Fail'], 's-', label='Branch')
plt.plot(system_names, df_grouped_system[' Default-Fail'], '^-', label='Default')
plt.title('Fail Counts Across Systems')
plt.xlabel('System')
plt.ylabel('Number of Fails')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# 6. Line plot of fail counts per mutation operator
operator_names = df_grouped_operator.index
plt.figure(figsize=(10, 6))
plt.plot(operator_names, df_grouped_operator[' Human-Fail'], 'o-', label='Human')
plt.plot(operator_names, df_grouped_operator[' Branch-Fail'], 's-', label='Branch')
plt.plot(operator_names, df_grouped_operator[' Default-Fail'], '^-', label='Default')
plt.title('Fail Counts Across Operators')
plt.xlabel('Operator')
plt.ylabel('Number of Fails')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Group By Operator

In [None]:
# Test for normality using Shapiro-Wilk test for each fail type in df_grouped_operator
for col in [' Human-Fail', ' Branch-Fail', ' Default-Fail']:
    stat, p = st.shapiro(df_grouped_operator[col])
    print(f"{col.strip()}: Shapiro-Wilk stat={stat:.3f}, p-value={p:.3f}")

### Interpreting Shapiro-Wilk Test Results
The Shapiro-Wilk test evaluates whether data is normally distributed. For each test:

- Null hypothesis (H0): The data follows a normal distribution
- Alternative hypothesis (H1): The data does not follow a normal distribution
- Significance level (α): Typically 0.05
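
The decision rule above can be sketched on synthetic data. This is a minimal illustration, not part of the original analysis: `near_normal` and `skewed` are made-up samples built from normal quantiles, and `ALPHA` is an assumed name for the 0.05 threshold.

```python
import numpy as np
import scipy.stats as st

ALPHA = 0.05  # assumed significance level, matching the text above

# Hypothetical samples (not the notebook's data): evenly spaced normal
# quantiles look almost perfectly normal; exponentiating them gives a
# strongly right-skewed sample.
quantiles = st.norm.ppf(np.linspace(0.01, 0.99, 50))
samples = {"near_normal": quantiles, "skewed": np.exp(quantiles)}

decisions = {}
for name, values in samples.items():
    stat, p = st.shapiro(values)
    # Reject H0 (normality) only when the p-value falls below ALPHA
    decisions[name] = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"{name}: W={stat:.3f}, p={p:.3g} -> {decisions[name]}")
```

Note that "fail to reject H0" only means the test found no evidence against normality; it does not confirm that the data is normal.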

### Results Analysis

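#### Human-Fail

- Statistic: 0.661
- p-value: < 0.001 (printed as 0.000 after rounding), below 0.05
- Interpretation: Reject H0; the data is not normally distributed

#### Branch-Fail

- Statistic: 0.609
- p-value: < 0.001 (printed as 0.000 after rounding), below 0.05
- Interpretation: Reject H0; the data is not normally distributed

#### Default-Fail

- Statistic: 0.619
- p-value: < 0.001 (printed as 0.000 after rounding), below 0.05
- Interpretation: Reject H0; the data is not normally distributed

Because normality is rejected for all three columns, the operator-level comparisons use the non-parametric Wilcoxon signed-rank test.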

In [None]:
stat_branch, p_branch = st.wilcoxon(df_grouped_operator[' Human-Fail'], df_grouped_operator[' Branch-Fail'])
stat_default, p_default = st.wilcoxon(df_grouped_operator[' Human-Fail'], df_grouped_operator[' Default-Fail'])

print(f"Wilcoxon Human-Fail vs Branch-Fail: stat={stat_branch:.3f}, p-value={p_branch:.3g}")
print(f"Wilcoxon Human-Fail vs Default-Fail: stat={stat_default:.3f}, p-value={p_default:.3g}")

### Interpreting Wilcoxon Test Results
The Wilcoxon signed-rank test compares paired samples. Its null hypothesis is that the paired differences are distributed symmetrically around zero. The results show:

#### Human-Fail vs Branch-Fail

- Test statistic = 0.000
- p-value = 3.56e-08 (far below the 0.05 significance level)

This extremely low p-value is strong evidence against the null hypothesis. We conclude that there is a statistically significant difference between Human-Fail and Branch-Fail across mutation operators.

#### Human-Fail vs Default-Fail

- Test statistic = 61.000
- p-value = 2.71e-06 (far below the 0.05 significance level)

This very low p-value is also strong evidence against the null hypothesis. We conclude that there is a statistically significant difference between Human-Fail and Default-Fail across mutation operators.
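
A test statistic of 0, as reported for Human-Fail vs Branch-Fail, has a concrete meaning: for the default two-sided test, `scipy.stats.wilcoxon` reports the smaller of the two signed-rank sums, so 0 means every non-zero paired difference points the same way. The sketch below illustrates this with made-up numbers (`a` and `b` are hypothetical, not the notebook's data).

```python
import numpy as np
import scipy.stats as st

# Hypothetical paired counts: `a` exceeds `b` in every pair
a = np.array([10, 12, 15, 20, 25, 30])
b = np.array([8, 9, 11, 15, 19, 23])

diff = a - b                           # 2, 3, 4, 5, 6, 7 (no zeros, no ties)
ranks = st.rankdata(np.abs(diff))      # rank the absolute differences
rank_sum_pos = ranks[diff > 0].sum()   # rank sum of the positive differences
rank_sum_neg = ranks[diff < 0].sum()   # rank sum of the negative differences

result = st.wilcoxon(a, b)             # two-sided by default
print(rank_sum_pos, rank_sum_neg, result.statistic)
```

The statistic alone does not say which side is larger, so the sign of the differences (or the group sums shown earlier) is what identifies the direction of the effect.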

# Group By System

In [None]:
# Test for normality using Shapiro-Wilk test for each fail type in df_grouped_system
for col in [' Human-Fail', ' Branch-Fail', ' Default-Fail']:
    stat, p = st.shapiro(df_grouped_system[col])
    print(f"{col.strip()}: Shapiro-Wilk stat={stat:.3f}, p-value={p:.3f}")

### Interpreting Shapiro-Wilk Test Results
The Shapiro-Wilk test evaluates whether data is normally distributed. For each test:

- Null hypothesis (H0): The data follows a normal distribution
- Alternative hypothesis (H1): The data does not follow a normal distribution
- Significance level (α): Typically 0.05

### Results Analysis

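#### Human-Fail

- Statistic: 0.849
- p-value: 0.190 > 0.05
- Interpretation: Fail to reject H0; there is no evidence against normality

#### Branch-Fail

- Statistic: 0.845
- p-value: 0.179 > 0.05
- Interpretation: Fail to reject H0; there is no evidence against normality

#### Default-Fail

- Statistic: 0.912
- p-value: 0.481 > 0.05
- Interpretation: Fail to reject H0; there is no evidence against normality

Failing to reject H0 does not prove that the data is normal, and with a small number of systems the test has little power to detect departures from normality. With that caveat, the system-level comparisons use the paired t-test.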
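
Strictly, a paired t-test assumes that the pairwise differences are approximately normal, not each column on its own. A hedged sketch of that complementary check follows; `paired_normality` is a hypothetical helper and the numbers are made up, so the output says nothing about the notebook's data.

```python
import numpy as np
import scipy.stats as st

def paired_normality(a, b, alpha=0.05):
    """Shapiro-Wilk on the paired differences a - b (hypothetical helper)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    stat, p = st.shapiro(diff)
    # True means "no evidence against normality", not proof of normality
    return stat, p, bool(p >= alpha)

# Made-up paired values for five units
human = [120, 95, 143, 110, 88]
other = [101, 80, 118, 96, 75]
stat, p, looks_normal = paired_normality(human, other)
print(f"W={stat:.3f}, p={p:.3f}, no evidence against normality: {looks_normal}")
```

In this notebook the same helper could be called with two columns of `df_grouped_system`, for example `' Human-Fail'` and `' Branch-Fail'`.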

In [None]:

# Paired t-test between Human-Fail and Branch-Fail in df_grouped_system
t_stat_branch, p_value_branch = st.ttest_rel(df_grouped_system[' Human-Fail'], df_grouped_system[' Branch-Fail'])

# Paired t-test between Human-Fail and Default-Fail in df_grouped_system
t_stat_default, p_value_default = st.ttest_rel(df_grouped_system[' Human-Fail'], df_grouped_system[' Default-Fail'])

print(f"Paired t-test (Human-Fail vs Branch-Fail): t-stat={t_stat_branch:.3f}, p-value={p_value_branch:.3f}")
print(f"Paired t-test (Human-Fail vs Default-Fail): t-stat={t_stat_default:.3f}, p-value={p_value_default:.3f}")

### Interpreting Paired T-Test Results
#### Human-Fail vs Branch-Fail

- t-statistic = 3.799
- p-value = 0.019 (less than significance level 0.05)

Interpretation: There is a statistically significant difference between Human-Fail and Branch-Fail at the system level, so we reject the null hypothesis at the 5% significance level. The positive t-statistic means that Human-Fail is higher than Branch-Fail on average.

#### Human-Fail vs Default-Fail

- t-statistic = 2.727
- p-value = 0.053 (slightly above significance level 0.05)

Interpretation: The difference between Human-Fail and Default-Fail is not statistically significant at α = 0.05, so we fail to reject the null hypothesis. The p-value is close to the threshold and would be significant at α = 0.10, so the result is best read as inconclusive rather than as evidence that the two are equal.
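
A paired t-test is equivalent to a one-sample t-test on the pairwise differences, which also makes it straightforward to report a confidence interval for the mean difference alongside the p-value. The sketch below shows both under stated assumptions: the arrays `a` and `b` are made-up values, not the notebook's data.

```python
import numpy as np
import scipy.stats as st

# Made-up paired values for five units (not the notebook's data)
a = np.array([120.0, 95.0, 143.0, 110.0, 88.0])
b = np.array([101.0, 80.0, 118.0, 96.0, 75.0])
diff = a - b

paired = st.ttest_rel(a, b)
one_sample = st.ttest_1samp(diff, 0.0)  # same test, stated on the differences

# 95% confidence interval for the mean difference, with n - 1 degrees of freedom
n = diff.size
sem = diff.std(ddof=1) / np.sqrt(n)
t_crit = st.t.ppf(0.975, df=n - 1)
ci_low = diff.mean() - t_crit * sem
ci_high = diff.mean() + t_crit * sem

print(f"t={paired.statistic:.3f}, p={paired.pvalue:.4f}")
print(f"mean difference={diff.mean():.1f}, 95% CI=({ci_low:.1f}, {ci_high:.1f})")
```

The interval excludes 0 exactly when the two-sided p-value is below 0.05, and its width shows how precisely the difference is estimated, which is useful when a p-value sits near the threshold.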

# Overall Conclusion

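#### By Operator Grouping:

- Both Wilcoxon signed-rank tests showed highly significant differences (p < 0.001)
- Strong evidence that human-written and machine-generated tests perform differently when results are grouped by mutation operator

This suggests that mutation operators may be handled differently by human-written and machine-generated tests.

#### By System Grouping:

- The paired t-tests gave mixed results: Human-Fail vs Branch-Fail was significant (p = 0.019), while Human-Fail vs Default-Fail was not (p = 0.053)
- Weaker evidence of performance differences when looking at whole systems, which is expected in part because the small number of systems limits statistical power
- The system-level results therefore do not establish that the test types are equally effective; they show a difference for Branch and are inconclusive for Default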
