# Mann-Whitney Test for Term Distinctiveness

This is a sample implementation of the Mann-Whitney test (also known as the Wilcoxon rank-sum test) for term distinctiveness in a corpus of documents. The test is used to determine whether the distributions of term frequencies in two sets of documents differ significantly. It is a useful alternative to the z-test.

### References

- Jefrey Lijffijt, Terttu Nevalainen, Tanja Säily, et al.,
“Significance testing of word frequencies in corpora”,
_Digital Scholarship in the Humanities_ 31, no. 2 (2016):
375-397.
- Andrew Piper and Eva Portelance, [“How Cultural
Capital Works: Prizewinning Novels, Bestsellers, and
the Time of Reading”](http://post45.org/2016/05/how-cultural-capital-works-prizewinning-novels-bestsellers-and-the-time-of-reading/), _Post45_ (May 10, 2016).

In the next cell, we will define a small corpus of documents and divide them into two groups for comparison.

In [None]:
# Import necessary libraries
import pandas as pd
import spacy
from lexos.dtm import DTM

# Load the small English model from spaCy
nlp = spacy.load("en_core_web_sm")

# Sample text documents for testing
texts = [
    "This is a sample text for testing.", # odd
    "Here is another example of a text to analyze.", # even
    "This text is different from the others.", # odd
    "Yet another sample text for comparison.", # even
    "This text is similar to the first one.", # odd
    "A completely different text for the analysis.", # even
]

# Process the sample texts with spaCy to create documents
docs = [nlp(text) for text in texts]
labels = [f"Doc{i + 1}" for i in range(len(docs))]
even_docs = ["Doc2", "Doc4", "Doc6"]
odd_docs = ["Doc1", "Doc3", "Doc5"]

# Create a Document-Term Matrix (DTM) using the sample documents
# Limit to terms occurring in at least 2 documents
dtm = DTM()
dtm(docs=docs, labels=labels, min_df=2)

# Convert the DTM to a DataFrame and separate even and odd documents
df = dtm.to_df(transpose=True)
even_df = df[df.index.isin(even_docs)]
odd_df = df[df.index.isin(odd_docs)]
even_df

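# Transpose so that terms are rows, sort the terms alphabetically,
# and add each term's mean count across the even documents
x_sorted = even_df.T.sort_index(ascending=True)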
x_sorted["Mean"] = x_sorted.mean(axis=1)
x_sorted

In the next cell, we import the `MannWhitney` class from the `topwords` module and use it to calculate the Mann-Whitney U statistic and its p-value. The class takes two DataFrames as input. The first (`target`) is the data grouping for which the results of the test will be reported. The second (`comparison`) is the data grouping to which the first will be compared. The class produces a ranking statistic and a p-value for each term. The highest-ranked terms are the ones most distinctive of the first data grouping. An additional option (`add_freq=True`) adds statistics showing the relative frequency of the terms in each grouping.
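To make the ranking statistic concrete, here is a minimal pure-Python sketch of how a U statistic can be computed for a single term. It is not the Lexos implementation: the helper names `average_ranks` and `mann_whitney_u` and the per-document counts are hypothetical and chosen only for illustration. The sketch pools the counts from both groups, ranks them (tied values share the mean of their ranks), and subtracts the smallest possible rank sum from the target group's rank sum.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(target, comparison):
    """U statistic for the target group (hypothetical helper, not the Lexos API)."""
    pooled = list(target) + list(comparison)
    ranks = average_ranks(pooled)
    n1 = len(target)
    rank_sum = sum(ranks[:n1])
    # Subtract the minimum rank sum the target group could have
    return rank_sum - n1 * (n1 + 1) / 2


# Hypothetical counts of one term in three target and three comparison documents
target_counts = [2, 1, 3]
comparison_counts = [0, 1, 0]
print(mann_whitney_u(target_counts, comparison_counts))  # 8.5
```

U ranges from 0 to the product of the two group sizes (here 3 × 3 = 9), so a value of 8.5 means the term's counts in the target group outrank those in the comparison group in almost every pairing.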

You can access the result as a dictionary with `MannWhitney.result`, but we will use the `to_df()` method to view a DataFrame. The output will show the terms ranked by their distinctiveness, along with their U statistic and p-value.

The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed, assuming that the two samples come from the same distribution. A small p-value (typically less than 0.05) indicates that the observed difference between the two samples is statistically significant, so we reject the hypothesis that the two samples come from the same distribution.
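The definition of the p-value can be illustrated with a small exact calculation. The sketch below is an assumption-laden illustration, not the method Lexos uses: the helper names `u_statistic` and `exact_p_value` and the sample counts are hypothetical. It enumerates every way of relabelling the pooled observations into two groups of the original sizes and reports the share of relabellings whose U statistic is at least as far from its expected value as the observed one.

```python
from itertools import combinations


def u_statistic(x, y):
    """Count the pairs in which x outranks y; ties count one half."""
    return sum((a > b) + 0.5 * (a == b) for a in x for b in y)


def exact_p_value(x, y):
    """Two-sided exact p-value by enumerating all relabellings (hypothetical helper)."""
    pooled = list(x) + list(y)
    n1, n2 = len(x), len(y)
    center = n1 * n2 / 2  # expected U if both groups share one distribution
    observed = abs(u_statistic(x, y) - center)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        group_x = [pooled[i] for i in idx]
        group_y = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count relabellings at least as extreme as the observed split
        if abs(u_statistic(group_x, group_y) - center) >= observed:
            extreme += 1
    return extreme / total


# Hypothetical counts: every target value exceeds every comparison value
print(exact_p_value([4, 5, 6], [1, 2, 3]))
```

With three observations per group there are only 20 possible relabellings, so even this perfectly separated example yields a two-sided exact p-value of 2/20 = 0.1. A corpus as small as the sample one above is therefore useful for demonstrating the workflow, but under an exact test it cannot reach the conventional 0.05 threshold.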

By default, the table displays the average frequency of terms in the control group along with the increase in frequency in the comparison group. This provides us with another view of how important the word is to the sample and its relative over- or under-usage in comparison to the other sample.

In [None]:
from lexos.topwords.mann_whitney import MannWhitney

mw = MannWhitney(target=even_df, comparison=odd_df, add_freq=True)
mw.to_df()