# Notebook 4: Results Evaluation


## Hypothesis Testing: Sentiment Protection and Gender

In this notebook, we estimate the statistical significance of our results.  
Our null hypothesis (H₀) is that there is no difference in the distribution of sentiment protection between female and male authors:

**H₀**: The presence of sentiment protection (coded as 0 or 1) is distributed evenly across women’s and men’s subcorpora.  

Drawing on insights from feminist literary criticism, we propose that female authors may be more likely to demonstrate sentiment protection. Therefore, our alternative hypothesis (H₁) is:

**H₁**: Sentiment protection (0 or 1) is present more frequently in the subcorpus of female authors than in that of male authors.  

We will conduct standard p-value-based tests to evaluate the statistical significance of this hypothesis and then offer a critical reflection on why such tests may have limited interpretive value in this context.


In [4]:
import numpy as np
import pandas as pd
from scipy.stats import permutation_test
from scipy.stats import chi2_contingency
from scipy.stats import mannwhitneyu


In [5]:
dataset = pd.read_excel(r"Genbaku-Bungaku-SA\Tables\results.xlsx", index_col=False)

In [6]:
dataset.head()

Unnamed: 0.1,Unnamed: 0,Title,Gender,Length in characters,Overall Sentiment,Direct Speech Sentiment,Author's Speech Sentiment,Difference Value,Exceeds?
0,0,有吉 佐和子 - 祈禱,F,30102,0.007671,0.06503,-0.03553,0.10056,Yes
1,1,林 京子 - ギヤマン ビードロ,F,112867,-0.011454,-0.054455,-0.010385,-0.044071,No
2,2,林 京子 - 二人の墓標,F,22251,-0.110104,-0.139089,-0.098741,-0.040348,No
3,3,林 京子 - 同期会,F,16390,0.016105,0.142857,0.017386,0.125471,Yes
4,4,林 京子 - 昭和二十年の夏,F,18826,-0.013255,0.5,-0.01509,0.51509,Yes


In [7]:
desc_stats = dataset.describe()

numeric_data = dataset.select_dtypes(include='number')

# Coefficient of variation (CV = std / |mean|); the absolute value keeps CV positive when the mean is negative
cv = numeric_data.std() / numeric_data.mean().abs()
cv.name = 'cv'  

desc_stats = pd.concat([desc_stats, pd.DataFrame([cv])])
desc_stats

Unnamed: 0.1,Unnamed: 0,Length in characters,Overall Sentiment,Direct Speech Sentiment,Author's Speech Sentiment,Difference Value
count,117.0,117.0,117.0,117.0,117.0,117.0
mean,58.0,34931.581197,-0.081698,-0.040173,-0.086433,0.046261
std,33.919021,40523.865998,0.107093,0.223469,0.111709,0.234867
min,0.0,482.0,-0.673947,-1.0,-0.673947,-0.999475
25%,29.0,7517.0,-0.136357,-0.095238,-0.137054,-0.01583
50%,58.0,18826.0,-0.065855,-0.006903,-0.076533,0.040967
75%,87.0,44581.0,-0.005081,0.022131,-0.014907,0.104355
max,116.0,194666.0,0.160452,1.0,0.160452,1.076533
cv,0.584811,1.160093,1.31083,5.562698,1.292426,5.077042


Initially, we observe that sentiment varies widely across the texts. Overall sentiment and author's speech sentiment already show high variability (cv, the coefficient of variation, is about 1.3 for both), while direct speech sentiment and, consequently, the sentiment difference are far more variable (cv above 5). Because cv exceeds 1 for every tested variable, even small shifts in the data can strongly affect the averages and the p-values of traditional statistical significance tests.
As such high CVs also suggest that the distributions are far from normal, we do not use the t-test.
Instead, we will conduct the following tests:
1. Mann-Whitney U test for comparing the two gender groups, complemented by a permutation test.
2. Chi-squared test for comparing the proportions of texts by female and male authors that show sentiment protection.
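
As a quick sanity check on the coefficient of variation discussed above, the following minimal sketch recomputes it from the standard deviations and means reported in the `describe()` output. The helper name `coefficient_of_variation` is our own illustrative choice and is not part of the original notebook; small discrepancies from the table are expected because the inputs are already rounded.

```python
def coefficient_of_variation(std, mean):
    """Return std / |mean|, the CV definition used in this notebook."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std / abs(mean)


# (std, mean) pairs copied from the describe() output above
reported = {
    "Overall Sentiment": (0.107093, -0.081698),
    "Direct Speech Sentiment": (0.223469, -0.040173),
    "Author's Speech Sentiment": (0.111709, -0.086433),
    "Difference Value": (0.234867, 0.046261),
}

cvs = {name: coefficient_of_variation(s, m) for name, (s, m) in reported.items()}

for name, cv in cvs.items():
    print(f"{name}: CV = {cv:.2f}")
```

Because the means are close to zero, the denominator is small and the CV becomes large; this is one reason to read CV values for near-zero-mean sentiment scores with some caution.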

## Mann-Whitney U Test

In [13]:

female = dataset[dataset["Gender"] == "F"]
male = dataset[dataset["Gender"] == "M"]

columns_to_test = [
    "Direct Speech Sentiment",
    "Author's Speech Sentiment",
    "Difference Value"
]

results = []
for col in columns_to_test:
    u_stat, p_val = mannwhitneyu(female[col], male[col], alternative='two-sided')
    results.append({
        "Variable": col,
        "U-statistic": round(u_stat, 3),
        "p-value": round(p_val, 4)
    })

results_df = pd.DataFrame(results)
results_df.to_csv("mann_whitney_results.csv", index=False)
results_df

Unnamed: 0,Variable,U-statistic,p-value
0,Direct Speech Sentiment,1397.0,0.3587
1,Author's Speech Sentiment,1476.0,0.6415
2,Difference Value,1460.0,0.5775


It is important to note that the Mann-Whitney U test may not accurately detect differences in means or medians, especially when distributions differ in shape or variability, not just central tendency (Divine et al. 2018). A widely recommended alternative is the permutation test, which replaces the Mann-Whitney U test in such contexts.

The permutation test is a non-parametric statistical method used to assess whether two or more groups differ significantly, without making strong assumptions about the underlying data distribution. It is particularly suitable when the distribution is unknown or when working with small sample sizes.

This method is more flexible, as it does not assume normality and evaluates significance directly by randomly shuffling group labels across the observed dataset. This makes it more appropriate for smaller datasets like the one used in this study.

However, the permutation test relies on the assumption of exchangeability under the null hypothesis: the idea that gender labels can be randomly reassigned without affecting the outcome if no true difference exists between groups. In this study, while gender is a fixed and meaningful category, the null hypothesis asserts that it has no statistical effect on sentiment. Therefore, under the null, the exchangeability assumption is justifiable. Nonetheless, it is important to recognize the interpretive weight of gender in literary contexts and to consider this when drawing conclusions from statistically randomized procedures.

The interpretation and validity of the tests used here therefore remain open to further discussion.
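
To make the label-shuffling idea concrete, here is a minimal hand-written sketch of a two-sided permutation test on the difference in means. It is an illustration only, not the procedure used in this notebook (which relies on `scipy.stats.permutation_test`); the function name `permutation_pvalue` and the toy values are hypothetical.

```python
import numpy as np


def permutation_pvalue(x, y, n_resamples=9999, seed=0):
    """Two-sided permutation test for a difference in means (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = x.mean() - y.mean()

    # Under the null hypothesis the group labels are exchangeable,
    # so we pool the values and reassign them to groups at random.
    pooled = np.concatenate([x, y])
    n_x = len(x)
    extreme = 0
    for _ in range(n_resamples):
        shuffled = rng.permutation(pooled)
        diff = shuffled[:n_x].mean() - shuffled[n_x:].mean()
        if abs(diff) >= abs(observed):
            extreme += 1

    # Count the observed labelling as one of the permutations,
    # which keeps the p-value strictly above zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Hypothetical toy data, not taken from the corpus
group_a = [0.10, -0.05, 0.20, 0.00, 0.15]
group_b = [-0.10, -0.20, 0.05, -0.15, -0.05]

observed, p_value = permutation_pvalue(group_a, group_b)
print(f"observed difference = {observed:.3f}, p = {p_value:.4f}")
```

The p-value is simply the share of shuffled labellings that produce a difference at least as large in magnitude as the observed one.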

## Permutation Test

In [12]:
def difference_in_means(x, y, axis):
    # Test statistic: mean of the first sample minus mean of the second (female - male below)
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

sentiment_measures = [
    "Direct Speech Sentiment",
    "Author's Speech Sentiment",
    "Difference Value"
]

results_list = []

for measure in sentiment_measures:
    female_data = dataset[dataset["Gender"] == "F"][measure].dropna().values
    male_data = dataset[dataset["Gender"] == "M"][measure].dropna().values
    
    if len(female_data) > 0 and len(male_data) > 0:

        res = permutation_test((female_data, male_data), difference_in_means,
                               permutation_type='independent', vectorized=True,
                               n_resamples=9999, alternative='two-sided')
        
        results_list.append({
            "Variable": measure,
            "Observed Difference": round(res.statistic, 4),
            "p-value": round(res.pvalue, 4)
        })

results_df = pd.DataFrame(results_list)
results_df.to_csv("permutation_test_results.csv", index=False)
results_df

Unnamed: 0,Variable,Observed Difference,p-value
0,Direct Speech Sentiment,-0.0452,0.3024
1,Author's Speech Sentiment,0.0035,0.894
2,Difference Value,-0.0487,0.296


## Chi-Squared Test

In [11]:
# Code sentiment protection as a boolean: True where "Exceeds?" is "Yes"
dataset['Protection'] = dataset['Exceeds?'].str.lower() == 'yes'

contingency_table = pd.crosstab(dataset['Gender'], dataset['Protection'])

chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)

expected_df = pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns)

summary = pd.DataFrame({
    "Chi² Statistic": [round(chi2_stat, 4)],
    "Degrees of Freedom": [dof],
    "p-value": [round(p_val, 4)]
})

summary.to_csv("chi_squared_summary.csv", index=False)

summary


Unnamed: 0,Chi² Statistic,Degrees of Freedom,p-value
0,2.6599,1,0.1029


The Chi-squared test is a statistical method used to determine whether there is a significant association between two categorical variables. It compares the observed frequencies in a contingency table to the expected frequencies that would occur if the variables were independent. If the observed and expected values differ significantly, the test suggests that the variables are likely not independent.
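
The comparison of observed and expected frequencies can be sketched as follows. The 2×2 counts are hypothetical and chosen only for easy arithmetic; they are not the counts from our corpus. The sketch also shows one detail worth keeping in mind: for 2×2 tables, `chi2_contingency` applies Yates' continuity correction by default (as in the cell above), so a manual calculation matches SciPy only when `correction=False` is passed.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = two groups, columns = protection absent / present
observed = np.array([[20, 30],
                     [30, 20]])

# Expected count per cell under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-squared statistic: sum of (O - E)^2 / E over all cells
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# SciPy without the continuity correction reproduces the manual value (4.0 here)
chi2_plain, p_plain, dof, expected = chi2_contingency(observed, correction=False)

# SciPy's default for 2x2 tables applies Yates' correction, giving a smaller statistic (3.24 here)
chi2_yates, p_yates, _, _ = chi2_contingency(observed)

print(f"manual: {chi2_manual:.2f}, uncorrected: {chi2_plain:.2f}, Yates: {chi2_yates:.2f}")
```

The corrected statistic is more conservative, which yields a somewhat larger p-value than the uncorrected version.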

## What do these evaluations tell us?

When interpreting the statistical data from the Mann-Whitney and permutation tests, we noticed that traditional tools for testing the statistical significance of hypotheses present some challenges in this case:

1. **High Variability in Small Datasets**\
Even within a single genre or gender group, and even for basic variables such as the individual sentiment metrics, sentiment scores exhibit very high variability. In small datasets (e.g., ours with N=117), this makes such tests more of an abstract calculation than a meaningful assessment of hypothesis significance. In effect, any result may hold only within the specific observed dataset and cannot be generalized.

2. **Interpreting Sentiment Differences**\
For assessing differences in sentiment, a more appropriate statistical approach is needed—one that accurately captures deviations from the mean. Estimating a "mean of means" is generally considered poor practice; instead, the sentiment mean should be calculated directly by summing all sentiment scores for each evaluated chunk. However, in this type of analysis, we do not observe raw sentiment but rather a transformed or averaged value that reflects a new qualitative dimension. This metric can reflect whether the author is "self-centered," expressing their thoughts directly and using characters as conduits, or whether they adopt a more "emancipated" approach by letting characters speak independently. The metric does not judge but simply operationalizes whether a writer isolates negative emotions in their own narration or distributes them among the characters.

3. **Interpretation of P-Values**\
P-value interpretation also requires caution. We found that the Mann-Whitney and permutation tests have limited power to establish statistical significance given our data. Unsurprisingly, given the high coefficient of variation and the small sample size, the p-values for the sentiment difference were 0.5775 (Mann-Whitney) and 0.296 (permutation test). The chi-squared test yielded a lower value of 0.1029. While this is closer to the conventional threshold, it is important to remember that the commonly used cutoffs of 0.05 or 0.01 (often applied in fields like medicine) are conventions rather than absolute truths. A p-value of 0.1029 means that, if the null hypothesis were true (i.e., no actual association between gender and sentiment protection), there would be a 10.29% probability of observing a difference between male and female authors at least as extreme as the one observed.

However, when working with literary texts, which inherently demonstrate individuality even in sentiment variations, should 10.29% be considered significantly worse than 5%? Perhaps not. In such cases, critical interpretation and domain expertise (even if it involves some degree of subjective or "voluntaristic" judgment) should complement, if not override, rigid statistical conventions that originate in other domains of human knowledge.
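
The "mean of means" concern raised in point 2 can be illustrated with a small sketch. The per-chunk scores and text names below are hypothetical; the point is only that averaging per-text means weights a short text as heavily as a long one, whereas pooling all chunk scores does not.

```python
from statistics import mean

# Hypothetical per-chunk sentiment scores for two texts of unequal length
texts = {
    "short_text": [0.5, 0.3],
    "long_text": [-0.1, -0.2, -0.1, -0.2, -0.1, -0.1],
}

# Mean of means: every text counts equally, regardless of its number of chunks
mean_of_means = mean(mean(scores) for scores in texts.values())

# Pooled mean: every chunk counts equally
all_scores = [score for scores in texts.values() for score in scores]
pooled_mean = mean(all_scores)

print(f"mean of means = {mean_of_means:.3f}, pooled mean = {pooled_mean:.3f}")
```

In this toy case the mean of means is clearly positive while the pooled mean is essentially zero, so the two summaries can support different readings of the same scores.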

## References
Divine, George W., H. James Norton, Anna E. Barón, and Elizabeth Juarez-Colunga. 2018. “The Wilcoxon–Mann–Whitney Procedure Fails as a Test of Medians.” The American Statistician 72 (3): 278–86. https://doi.org/10.1080/00031305.2017.1305291.
