# Task 1: The Fingerprint
## Proving Classes Are Mathematically Distinct

Before training ML models, we must prove these 3 classes have measurably different linguistic properties.

**Goal:** Perform 6 analyses to show Human, AI Vanilla, and AI Styled texts are statistically distinguishable.

---

## Analyses:
1. **Type-Token Ratio (TTR)** - Vocabulary diversity
2. **Hapax Legomena** - Rare words (appearing once)
3. **POS Distribution** - Adjective-to-Noun ratio
4. **Sentence Length Variance** - Structural rhythm
5. **Punctuation Density** - Stylistic markers
6. **Flesch-Kincaid Grade Level** - Readability

---
## Setup

In [None]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from scipy import stats
import re
from collections import Counter

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ Libraries loaded")

---
## Load Data

In [None]:
# Updated path for Twain + Austen dataset
DATA_DIR = Path('../TASK0/data/dataset/twain_austen')

# Load Class 1 (Human - JSONL format)
class1_texts = []
with open(DATA_DIR / 'class1_human.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            class1_texts.append(json.loads(line)['text'])

# Load Class 2 (AI Vanilla - TXT format, 1 para per line)
with open(DATA_DIR / 'class2.txt', 'r', encoding='utf-8') as f:
    class2_texts = [line.strip() for line in f if line.strip()]

# Load Class 3 (AI Styled - TXT format, 1 para per line)
with open(DATA_DIR / 'class3.txt', 'r', encoding='utf-8') as f:
    class3_texts = [line.strip() for line in f if line.strip()]

print(f"✅ Loaded Twain + Austen Dataset:")
print(f"  Class 1 (Human): {len(class1_texts)} paragraphs")
print(f"  Class 2 (AI Vanilla): {len(class2_texts)} paragraphs")
print(f"  Class 3 (AI Styled): {len(class3_texts)} paragraphs")
print(f"\nExpected:")
print(f"  Class 1: ~470-500 human paragraphs")
print(f"  Class 2: ~470-500 AI Vanilla paragraphs")
print(f"  Class 3: ~470-500 AI Styled paragraphs (Twain + Austen)")

---
## 1. Type-Token Ratio (TTR)

Measures vocabulary diversity: unique words / total words

In [None]:
def calculate_ttr(text):
    """Type-Token Ratio: vocabulary diversity"""
    tokens = re.findall(r'\b\w+\b', text.lower())
    if len(tokens) == 0:
        return 0
    unique_tokens = set(tokens)
    return len(unique_tokens) / len(tokens)

# Calculate for all classes
class1_ttr = [calculate_ttr(text) for text in class1_texts]
class2_ttr = [calculate_ttr(text) for text in class2_texts]
class3_ttr = [calculate_ttr(text) for text in class3_texts]

print(f"Average TTR:")
print(f"  Human (Class 1): {np.mean(class1_ttr):.3f}")
print(f"  AI Vanilla (Class 2): {np.mean(class2_ttr):.3f}")
print(f"  AI Styled (Class 3): {np.mean(class3_ttr):.3f}")

In [None]:
# Visualize
data = pd.DataFrame({
    'TTR': class1_ttr + class2_ttr + class3_ttr,
    'Class': ['Human']*len(class1_ttr) + ['AI_Vanilla']*len(class2_ttr) + ['AI_Styled']*len(class3_ttr)
})

plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x='Class', y='TTR', palette='Set2')
plt.title('Type-Token Ratio by Class', fontsize=14, fontweight='bold')
plt.ylabel('TTR (Vocabulary Diversity)')
plt.show()

In [None]:
# Statistical test
t_stat, p_value = stats.ttest_ind(class1_ttr, class2_ttr)

print(f"\nT-test (Human vs AI Vanilla):")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value: {p_value:.6f}")

if p_value < 0.05:
    print(f"  ✅ Statistically significant difference (p < 0.05)")
    diff = abs(np.mean(class1_ttr) - np.mean(class2_ttr))
    print(f"  Difference: {diff:.3f}")
else:
    print(f"  ❌ No significant difference (p >= 0.05)")

### TTR Summary

**Results:**
- Human (Twain + Austen): **0.674 TTR** (lower)
- AI Vanilla: **0.710 TTR** (higher)
- AI Styled: **0.700 TTR** (middle)

**⚠️ LENGTH BIAS DETECTED!**

**Paragraph lengths:**
- Human: **134 words** average
- AI: **90-97 words** average  
- **Difference: 37-44 words** (HUGE!)

**Why this creates bias:**
- Longer texts → more word repetition → **lower TTR**
- Shorter texts → less repetition → **higher TTR**
- This is a **mathematical artifact** from text length, not pure style

**Statistical Significance:**
- p < 0.000001 (extremely significant)
- Difference: 0.036 (consistent and measurable)
- ✅ **Still valid for classification** (models learn length + style together)

**Verdict:** ⚠️ **TTR is CONFOUNDED by length bias**
- Can't interpret as "AI has richer vocabulary"
- Actually means "AI writes shorter paragraphs"
- Still useful as a feature, just not interpretable as pure style
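
One possible way to reduce the length confound is to compute TTR over windows of a fixed size, often called a moving-average TTR. The sketch below is an illustration, not part of the original analysis: the function name `moving_average_ttr` and the default `window=50` are assumptions, and the tokenizer simply mirrors `calculate_ttr`.

```python
import re

def moving_average_ttr(text, window=50):
    """Mean TTR over all sliding windows of `window` tokens.

    Falls back to plain TTR when the text is no longer than the window.
    (Hypothetical helper; the window size is an assumed default.)
    """
    tokens = re.findall(r'\b\w+\b', text.lower())
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    # Every window has the same token count, so length no longer varies
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```

Because each window holds the same number of tokens, paragraphs of 134 and 90 words are compared on a more equal footing, provided the window is shorter than most paragraphs.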

---
## 2. Hapax Legomena

Words that appear exactly once in a text (rare/unique vocabulary)

**Example:** In "the cat sat on the mat", hapax = {cat, sat, on, mat} (4 words)

In [None]:
def calculate_hapax_ratio(text):
    """Ratio of words appearing exactly once"""
    tokens = re.findall(r'\b\w+\b', text.lower())
    if len(tokens) == 0:
        return 0
    word_counts = Counter(tokens)
    hapax_count = sum(1 for count in word_counts.values() if count == 1)
    return hapax_count / len(tokens)

# Calculate for all classes
class1_hapax = [calculate_hapax_ratio(text) for text in class1_texts]
class2_hapax = [calculate_hapax_ratio(text) for text in class2_texts]
class3_hapax = [calculate_hapax_ratio(text) for text in class3_texts]

print(f"Average Hapax Ratio:")
print(f"  Human (Class 1): {np.mean(class1_hapax):.3f}")
print(f"  AI Vanilla (Class 2): {np.mean(class2_hapax):.3f}")
print(f"  AI Styled (Class 3): {np.mean(class3_hapax):.3f}")

In [None]:
# Visualize
data = pd.DataFrame({
    'Hapax_Ratio': class1_hapax + class2_hapax + class3_hapax,
    'Class': ['Human']*len(class1_hapax) + ['AI_Vanilla']*len(class2_hapax) + ['AI_Styled']*len(class3_hapax)
})

plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x='Class', y='Hapax_Ratio', palette='Set2')
plt.title('Hapax Legomena Ratio by Class', fontsize=14, fontweight='bold')
plt.ylabel('Hapax Ratio (Rare Words)')
plt.show()

In [None]:
# Statistical test
t_stat, p_value = stats.ttest_ind(class1_hapax, class2_hapax)

print(f"\nT-test (Human vs AI Vanilla):")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value: {p_value:.6f}")

if p_value < 0.05:
    print(f"  ✅ Statistically significant difference (p < 0.05)")
    diff = abs(np.mean(class1_hapax) - np.mean(class2_hapax))
    print(f"  Difference: {diff:.3f}")
else:
    print(f"  ❌ No significant difference (p >= 0.05)")

### Hapax Summary

**Results:**
- Human (Twain + Austen): **0.524** (lower)
- AI Vanilla: **0.585** (higher)  
- AI Styled: **0.571** (middle)

**⚠️ LENGTH BIAS DETECTED!**

**Paragraph lengths:**
- Human: **134 words** average
- AI: **90-97 words** average
- **Difference: 37-44 words** (HUGE!)

**Why this matters:**
Longer texts naturally have **lower Hapax ratios** because:
1. More words = more opportunities to repeat common words
2. The longer you write, the more "the," "a," "is," "said" appear multiple times
3. This is a **mathematical artifact**, not a stylistic difference

**Verdict:** ⚠️ **Hapax is CONFOUNDED by length bias**
- Still statistically significant (p < 0.000001)
- Still useful for classification (models learn patterns)
- But NOT interpretable as pure style difference



### 🎯 Key Takeaway

**Both TTR and Hapax are CONFOUNDED by paragraph length:**
- Human: 134 words (longer) → more repetition → lower TTR/Hapax
- AI: 90-97 words (shorter) → less repetition → higher TTR/Hapax

**This is OK for classification!** Models will learn "short paragraphs = AI" which is accurate.

**But:** Can't interpret as "AI has richer vocabulary" - it just writes shorter.
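
The length confound can be checked directly by correlating paragraph length with the metric. The following is a minimal sketch using only the standard library; `pearson_r` and `length_vs_ttr` are hypothetical helper names, not part of the notebook.

```python
import math
import re

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2:
        return 0.0  # not enough points to correlate
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)

def length_vs_ttr(texts):
    """Correlate paragraph length (in tokens) with TTR across texts."""
    lengths, ttrs = [], []
    for text in texts:
        tokens = re.findall(r'\b\w+\b', text.lower())
        if tokens:
            lengths.append(len(tokens))
            ttrs.append(len(set(tokens)) / len(tokens))
    return pearson_r(lengths, ttrs)
```

Calling `length_vs_ttr(class1_texts + class2_texts + class3_texts)` and getting a clearly negative value would be consistent with the length-bias explanation; a value near zero would suggest the TTR gap reflects something other than length.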


---
## 3. POS Distribution: Adjective-to-Noun Ratio

Does AI "over-describe" compared to humans?

**Example:** 
- "The quick brown fox" → 2 adj, 1 noun = 2.0 ratio
- "The fox" → 0 adj, 1 noun = 0.0 ratio

In [None]:
# Install spaCy if needed
try:
    import spacy
    nlp = spacy.load('en_core_web_sm')
    print("✅ spaCy loaded")
except (ImportError, OSError):
    # ImportError: spaCy is missing; OSError: the model is not installed
    print("Installing spaCy...")
    import sys
    !{sys.executable} -m pip install spacy -q
    !{sys.executable} -m spacy download en_core_web_sm -q
    import spacy
    nlp = spacy.load('en_core_web_sm')
    print("✅ spaCy installed and loaded")

In [None]:
def calculate_adj_noun_ratio(text):
    """Adjective to noun ratio using spaCy POS tagging"""
    doc = nlp(text)
    adj_count = sum(1 for token in doc if token.pos_ == 'ADJ')
    noun_count = sum(1 for token in doc if token.pos_ in ['NOUN', 'PROPN'])
    
    if noun_count == 0:
        return 0
    return adj_count / noun_count

# Calculate for all classes (sample first 100 for speed)
class1_adj_noun = [calculate_adj_noun_ratio(text) for text in class1_texts[:100]]
class2_adj_noun = [calculate_adj_noun_ratio(text) for text in class2_texts[:100]]
class3_adj_noun = [calculate_adj_noun_ratio(text) for text in class3_texts[:100]]

print(f"Average Adj/Noun Ratio:")
print(f"  Human (Class 1): {np.mean(class1_adj_noun):.3f}")
print(f"  AI Vanilla (Class 2): {np.mean(class2_adj_noun):.3f}")
print(f"  AI Styled (Class 3): {np.mean(class3_adj_noun):.3f}")

In [None]:
# Visualize
data = pd.DataFrame({
    'Adj_Noun_Ratio': class1_adj_noun + class2_adj_noun + class3_adj_noun,
    'Class': ['Human']*len(class1_adj_noun) + ['AI_Vanilla']*len(class2_adj_noun) + ['AI_Styled']*len(class3_adj_noun)
})

plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x='Class', y='Adj_Noun_Ratio', palette='Set2')
plt.title('Adjective-to-Noun Ratio by Class', fontsize=14, fontweight='bold')
plt.ylabel('Adj/Noun Ratio')
plt.show()

In [None]:
# Statistical test
t_stat, p_value = stats.ttest_ind(class1_adj_noun, class2_adj_noun)

print(f"\nT-test (Human vs AI Vanilla):")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value: {p_value:.6f}")

if p_value < 0.05:
    print(f"  ✅ Statistically significant difference (p < 0.05)")
    diff = abs(np.mean(class1_adj_noun) - np.mean(class2_adj_noun))
    print(f"  Difference: {diff:.3f}")
else:
    print(f"  ❌ No significant difference (p >= 0.05)")
    print(f"  This metric does NOT distinguish the classes")

### POS Distribution Summary

**SURPRISING SUCCESS!** ✅ (With Twain + Austen)

**Results:**
- Human (Twain): **0.313** (fewer adjectives)
- AI Vanilla: **0.356** (more adjectives)
- AI Styled: **0.346** (middle)

**Statistical Significance:**
- p = 0.024 (SIGNIFICANT!)
- Difference: 0.043 (modest but real)

**Why it works NOW:**
1. **Twain's colloquial style**: Action-focused, minimal description, dialogue-heavy
2. **AI's analytical style**: More formal, descriptive, academic tone
3. **"Tom ran"** (Twain) vs **"The enthusiastic boy ran quickly"** (AI)

**Why it FAILED with Victorian dataset:**
- Dickens used TONS of adjectives (Victorian formal prose)
- Dickens (0.35) ≈ AI (0.35) - no difference!
- Twain (0.31) < AI (0.36) - clear difference!

**Verdict:** ✅ **VALID METRIC** for distinguishing colloquial vs formal writing styles!

---
## 4. Sentence Length Variance

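Measures how varied sentence lengths are (structural rhythm). The function below reports the standard deviation of sentence lengths, which is the square root of the variance.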

**Example:**
- Monotonous: [18, 20, 19, 21] words → variance = 1.25
- Dynamic: [5, 40, 15, 60] words → variance = 462.5
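
The variance of the two example lists can be checked with the standard library. This small sketch uses the population variance (divide by N), which is also what `np.std` is based on by default; the variable names are illustrative only.

```python
from statistics import pstdev, pvariance

monotonous = [18, 20, 19, 21]
dynamic = [5, 40, 15, 60]

# Population variance (divide by N), matching NumPy's default ddof=0
print(pvariance(monotonous))  # 1.25
print(pvariance(dynamic))     # 462.5

# Population standard deviation, the quantity np.std returns by default
print(pstdev(monotonous))
print(pstdev(dynamic))
```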

In [None]:
def calculate_sentence_length_variance(text):
    """Standard deviation of sentence lengths"""
    sentences = re.split(r'[.!?]+', text)
    sentences = [s.strip() for s in sentences if s.strip()]
    
    if len(sentences) <= 1:
        return 0
    
    lengths = [len(s.split()) for s in sentences]
    return np.std(lengths)

# Calculate for all classes
class1_variance = [calculate_sentence_length_variance(text) for text in class1_texts]
class2_variance = [calculate_sentence_length_variance(text) for text in class2_texts]
class3_variance = [calculate_sentence_length_variance(text) for text in class3_texts]

print(f"Average Sentence Length Variance:")
print(f"  Human (Class 1): {np.mean(class1_variance):.3f}")
print(f"  AI Vanilla (Class 2): {np.mean(class2_variance):.3f}")
print(f"  AI Styled (Class 3): {np.mean(class3_variance):.3f}")

In [None]:
# Visualize
data = pd.DataFrame({
    'Sentence_Variance': class1_variance + class2_variance + class3_variance,
    'Class': ['Human']*len(class1_variance) + ['AI_Vanilla']*len(class2_variance) + ['AI_Styled']*len(class3_variance)
})

plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x='Class', y='Sentence_Variance', palette='Set2')
plt.title('Sentence Length Variance by Class', fontsize=14, fontweight='bold')
plt.ylabel('Std Dev of Sentence Lengths')
plt.show()

In [None]:
# Statistical test
t_stat, p_value = stats.ttest_ind(class1_variance, class2_variance)

print(f"\nT-test (Human vs AI Vanilla):")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value: {p_value:.10f}")

if p_value < 0.05:
    print(f"  ✅ Statistically significant difference (p < 0.05)")
    diff = abs(np.mean(class1_variance) - np.mean(class2_variance))
    print(f"  Difference: {diff:.3f}")
    print(f"\n  🏆 STRONGEST DISTINGUISHER!")
else:
    print(f"  ❌ No significant difference (p >= 0.05)")

### Sentence Variance Summary

**🏆 STRONGEST METRIC - THE AI FINGERPRINT!**

**Results:**
- Human (Twain + Austen): **13.697** (highly varied)
- AI Vanilla: **5.456** (monotonous)
- AI Styled: **6.009** (slightly less monotonous)

**Statistical Significance:**
- **Difference: +8.241** (MASSIVE!)
- **p-value: < 0.0000000001** (extremely significant!)
- **t-statistic: 22.3** (very large; note that t grows with sample size, so it is not itself an effect size)
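
A t-statistic depends on sample size, so a standardized effect size such as Cohen's d may be a useful complement when judging how large the gap really is. The sketch below is an assumption about how one might compute it, not something the notebook does; `cohens_d` is a hypothetical helper and it expects at least two values per sample.

```python
import math

def cohens_d(sample_a, sample_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    # Unbiased sample variances (divide by n - 1)
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    if pooled_sd == 0:
        return 0.0  # both samples are constant
    return (mean_a - mean_b) / pooled_sd
```

For this section it could be called as `cohens_d(class1_variance, class2_variance)`. By the usual rule of thumb, values around 0.8 or above are considered large.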

**Why this is THE metric:**

1. **✅ Length-independent**: Not affected by paragraph length (unlike TTR/Hapax)
2. **✅ 2.5x difference**: Human variance is 2.5x higher than AI
3. **✅ Reveals structure**: AI's mechanical rhythm vs human's natural flow
4. **✅ Consistent pattern**: Works across Victorian AND Twain datasets

**What it shows:**

**Humans (variance ~14):**
→ **Natural rhythm, varied pacing**

**AI (variance ~5):**
→ **Mechanical uniformity, "middle zone" trap**

**🎯 THE SMOKING GUN:** AI avoids extremes (short punchy sentences OR long flowing ones), clustering around 15-20 words per sentence. Humans use the full range!

---
## 5. Punctuation Density Heatmap

Count 7 punctuation types per 1000 words

In [None]:
def calculate_punctuation_density(text):
    """Count punctuation per 1000 words"""
    word_count = len(re.findall(r'\b\w+\b', text))
    # Scale raw counts to a per-1000-words rate. Use 0 for empty texts so
    # every result has the same keys (an empty dict would break the averaging).
    multiplier = 1000 / word_count if word_count else 0

    return {
        'comma': text.count(',') * multiplier,
        'semicolon': text.count(';') * multiplier,
        'colon': text.count(':') * multiplier,
        # '\u2014' is the em dash character; '--' is its ASCII stand-in
        'em_dash': (text.count('\u2014') + text.count('--')) * multiplier,
        'exclamation': text.count('!') * multiplier,
        'question': text.count('?') * multiplier,
        'quote': text.count('"') * multiplier
    }

# Calculate for all classes
class1_punct = [calculate_punctuation_density(text) for text in class1_texts]
class2_punct = [calculate_punctuation_density(text) for text in class2_texts]
class3_punct = [calculate_punctuation_density(text) for text in class3_texts]

# Average by class
punct_types = ['comma', 'semicolon', 'colon', 'em_dash', 'exclamation', 'question', 'quote']
heatmap_data = []

for punct_type in punct_types:
    heatmap_data.append([
        np.mean([p[punct_type] for p in class1_punct]),
        np.mean([p[punct_type] for p in class2_punct]),
        np.mean([p[punct_type] for p in class3_punct])
    ])

heatmap_df = pd.DataFrame(heatmap_data, 
                          columns=['Human', 'AI_Vanilla', 'AI_Styled'],
                          index=punct_types)

print("Punctuation Density (per 1000 words):")
print(heatmap_df.round(2))

In [None]:
# Heatmap visualization
plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_df, annot=True, fmt='.1f', cmap='YlOrRd', cbar_kws={'label': 'Per 1000 words'})
plt.title('Punctuation Density Heatmap', fontsize=14, fontweight='bold')
plt.ylabel('Punctuation Type')
plt.xlabel('Class')
plt.tight_layout()
plt.show()

In [None]:
# Statistical tests for key punctuation
key_punct = ['semicolon', 'em_dash', 'exclamation']

for punct in key_punct:
    class1_values = [p[punct] for p in class1_punct]
    class2_values = [p[punct] for p in class2_punct]
    
    t_stat, p_value = stats.ttest_ind(class1_values, class2_values)
    
    print(f"\n{punct.upper()}:")
    print(f"  Human: {np.mean(class1_values):.2f}")
    print(f"  AI Vanilla: {np.mean(class2_values):.2f}")
    print(f"  p-value: {p_value:.6f}")
    
    if p_value < 0.05:
        print(f"  ✅ Significant difference")
    else:
        print(f"  ❌ No significant difference")

### Punctuation Summary

**COUNTERINTUITIVE RESULTS** ⚠️
- Humans (Twain + Austen) use MORE semicolons/em-dashes than AI
- This is because:
  - Human data = 19th-century literature
  - AI data = 21st-century modern style
  - We're comparing historical vs modern conventions!

**Still valid for classification** - the differences are real and significant.

---
## 6. Flesch-Kincaid Grade Level

Measures reading difficulty based on sentence length and word complexity

**Formula:** 0.39 × (words/sentences) + 11.8 × (syllables/words) - 15.59

**Interpretation:**
- Grade 8-10: Easy high school
- Grade 10-12: Standard high school
- Grade 12-14: College level
- Grade 14+: Graduate level
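
To make the formula concrete, here is a rough standard-library sketch of the same calculation. The syllable counter is a crude vowel-group heuristic chosen for illustration, and `count_syllables` and `fk_grade` are hypothetical names; `textstat` counts syllables and sentences differently, so its scores will not match this exactly.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count vowel groups, minimum of one."""
    lowered = word.lower()
    count = len(re.findall(r'[aeiouy]+', lowered))
    # Treat a trailing 'e' as silent (heuristic only, wrong for some words)
    if lowered.endswith('e') and count > 1:
        count -= 1
    return max(count, 1)

def fk_grade(text):
    """Flesch-Kincaid grade level computed from the formula above."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

For "The cat sat." this gives 0.39 × 3 + 11.8 × 1 - 15.59 = -2.62, which shows that very short, simple sentences can score below grade 0.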

In [None]:
# Install textstat if needed
try:
    from textstat import flesch_kincaid_grade
    print("✅ textstat loaded")
except ImportError:
    print("Installing textstat...")
    import sys
    !{sys.executable} -m pip install textstat -q
    from textstat import flesch_kincaid_grade
    print("✅ textstat installed")

In [None]:
# Calculate FK grade for all classes
class1_fk = [flesch_kincaid_grade(text) for text in class1_texts]
class2_fk = [flesch_kincaid_grade(text) for text in class2_texts]
class3_fk = [flesch_kincaid_grade(text) for text in class3_texts]

print(f"Average Flesch-Kincaid Grade Level:")
print(f"  Human (Class 1): {np.mean(class1_fk):.2f}")
print(f"  AI Vanilla (Class 2): {np.mean(class2_fk):.2f}")
print(f"  AI Styled (Class 3): {np.mean(class3_fk):.2f}")

In [None]:
# Visualize
data = pd.DataFrame({
    'FK_Grade': class1_fk + class2_fk + class3_fk,
    'Class': ['Human']*len(class1_fk) + ['AI_Vanilla']*len(class2_fk) + ['AI_Styled']*len(class3_fk)
})

plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x='Class', y='FK_Grade', palette='Set2')

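# Add mean markers. sort=False keeps the groups in order of first appearance,
# which matches the boxplot's x-axis order (Human, AI_Vanilla, AI_Styled);
# the default alphabetical sort would put the markers on the wrong boxes.
means = data.groupby('Class', sort=False)['FK_Grade'].mean()
positions = range(len(means))
plt.plot(positions, means.values, 'D', color='red', markersize=10, label='Mean', zorder=3)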

plt.title('Flesch-Kincaid Grade Level by Class', fontsize=14, fontweight='bold')
plt.ylabel('Grade Level (Reading Difficulty)')
plt.legend()
plt.show()

In [None]:
# Statistical tests: pairwise comparisons across all three classes
print("\nStatistical Tests:")

pairs = [
    ('Human', class1_fk, 'AI Vanilla', class2_fk),
    ('Human', class1_fk, 'AI Styled', class3_fk),
    ('AI Vanilla', class2_fk, 'AI Styled', class3_fk),
]

for name_a, values_a, name_b, values_b in pairs:
    t_stat, p_value = stats.ttest_ind(values_a, values_b)
    print(f"\n{name_a} vs {name_b}:")
    print(f"  Difference: {np.mean(values_a) - np.mean(values_b):.2f} grade levels")
    print(f"  p-value: {p_value:.6f}")
    if p_value < 0.05:
        print(f"  ✅ Significant")
    else:
        print(f"  ❌ Not significant")

### Flesch-Kincaid Summary

Shows reading difficulty level of each class based on sentence length and word complexity.
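
Since three pairwise tests are run on the same metric, a multiple-comparison correction may be worth applying before calling a result significant. The sketch below shows the simple Bonferroni adjustment; `bonferroni` is a hypothetical helper and the notebook itself does not apply any correction.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted_p, is_significant) pairs under Bonferroni correction.

    Each p-value is multiplied by the number of tests and capped at 1.0.
    """
    m = len(p_values)
    results = []
    for p in p_values:
        adjusted = min(p * m, 1.0)
        results.append((adjusted, adjusted < alpha))
    return results
```

The three p-values printed by the pairwise tests can be passed in as a list, for example `bonferroni([p_human_vanilla, p_human_styled, p_vanilla_styled])`, where those variable names are placeholders for values saved from the test cell.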

---
## Final Summary

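### Metrics That WORK (distinguish classes):
1. ✅ **Sentence Length Variance** - STRONGEST (p < 0.000001, diff = 8.241)
2. ✅ **TTR** - Significant (p < 0.000001, diff = 0.036), but confounded by length
3. ✅ **Hapax** - Significant (p < 0.000001, diff = 0.061), but confounded by length
4. ✅ **Adj/Noun Ratio** - Significant (p = 0.024, diff = 0.043)
5. ✅ **Punctuation** - Significant (semicolons, em-dashes)
6. ✅ **Flesch-Kincaid** - Likely significant

### Metrics That FAILED:
1. ❌ None on the Twain + Austen dataset. **Adj/Noun Ratio** failed only on the earlier Victorian dataset (the p = 0.812 result appears to come from that run).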

### Key Insights:
- **Structural metrics** (sentence variance) > **Vocabulary metrics** (TTR, Hapax)
- AI can mimic vocabulary but struggles with natural rhythm
- Length bias affects vocabulary metrics but doesn't invalidate them
- 19th century human style ≠ 21st century AI style (historical confound)

### Best Features for Classification (Task 2):
1. Sentence length variance (strongest signal)
2. Paragraph length
3. TTR
4. Punctuation density (semicolons, em-dashes)
5. Flesch-Kincaid grade level

**Classes are mathematically distinct** ✅ - Ready for Task 2!