# Shakespeare Genre Analysis: TF-IDF, Pearson Correlation, Syntactic Complexity
**IDS 570: Text as Data - Data Exploration Assignment**

---

## Research Question
Do Shakespeare's genre categories (tragedy, comedy, history) reflect measurable linguistic patterns?

## Corpus
~20 Shakespeare plays:
- 7-8 Tragedies
- 7-8 Comedies  
- 4-5 Histories

## Methods
1. **TF-IDF** (lexical distinctiveness)
2. **Pearson Correlation** (text similarity)
3. **Syntactic Complexity** (structural analysis)

---
## Setup & Imports

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import re
from collections import Counter

# NLP libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

# Syntactic parsing
import spacy

# Settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

print("✓ All libraries imported successfully!")

---
## STEP 0: Load and Normalize Data

### Define Your Corpus
**IMPORTANT**: Update the file paths below with YOUR actual play filenames!

In [None]:
# Define your corpus structure
# UPDATE THESE WITH YOUR ACTUAL FILENAMES!

corpus_info = {
    'tragedies': [
        'hamlet.txt',
        'macbeth.txt',
        'othello.txt',
        'king_lear.txt',
        'romeo_and_juliet.txt',
        'julius_caesar.txt',
        'antony_and_cleopatra.txt',
        # Add more as needed
    ],
    'comedies': [
        'much_ado_about_nothing.txt',
        'twelfth_night.txt',
        'as_you_like_it.txt',
        'midsummer_nights_dream.txt',
        'merchant_of_venice.txt',
        'taming_of_the_shrew.txt',
        'comedy_of_errors.txt',
        # Add more as needed
    ],
    'histories': [
        'henry_v.txt',
        'richard_iii.txt',
        'henry_iv_part1.txt',
        'richard_ii.txt',
        # Add more as needed
    ]
}

# Path to your plays folder
DATA_PATH = Path('shakespeare_plays/')  # UPDATE THIS PATH!

print(f"Corpus structure defined:")
print(f"  Tragedies: {len(corpus_info['tragedies'])}")
print(f"  Comedies: {len(corpus_info['comedies'])}")
print(f"  Histories: {len(corpus_info['histories'])}")
print(f"  Total: {sum(len(v) for v in corpus_info.values())} plays")

### Load Texts

In [None]:
def load_play(filepath):
    """Load a single play text file"""
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            return f.read()
    except UnicodeDecodeError:
        # Try different encoding if utf-8 fails
        with open(filepath, 'r', encoding='latin-1') as f:
            return f.read()

# Load all plays
plays = {}
play_genres = {}

for genre, filenames in corpus_info.items():
    for filename in filenames:
        filepath = DATA_PATH / filename
        # Note: .title() also changes Roman numerals,
        # e.g. 'henry_iv_part1.txt' becomes 'Henry Iv Part1'
        play_name = filename.replace('.txt', '').replace('_', ' ').title()
        
        try:
            plays[play_name] = load_play(filepath)
            play_genres[play_name] = genre
            print(f"✓ Loaded: {play_name} ({genre})")
        except FileNotFoundError:
            print(f"✗ File not found: {filepath}")

print(f"\n✓ Successfully loaded {len(plays)} plays")

### Text Normalization

**REQUIRED**: Fix long S character  
**YOUR CHOICE**: Other normalizations

In [None]:
def normalize_text(text):
    """
    Normalize Early Modern English text
    
    REQUIRED:
    - Fix long S (ſ → s)
    
    OPTIONAL (your choice):
    - u/v normalization
    - i/j normalization
    - Remove punctuation
    - Lowercase
    """
    # REQUIRED: Fix long S
    text = text.replace('ſ', 's')
    
    # OPTIONAL: Add other normalizations here if desired
    # Uncomment the ones you want to use:
    
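    # CAUTION: the two replacements below are naive. They rewrite EVERY 'u' or 'i'
    # in the text, whereas historical u/v and i/j spelling depends on position
    # in the word. Prefer a word-level rule if you enable them.
    # text = text.replace('u', 'v')  # u/v normalization (naive)
    # text = text.replace('i', 'j')  # i/j normalization (naive)
    # text = text.lower()  # lowercase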
    
    return text

# Apply normalization to all plays
plays_normalized = {name: normalize_text(text) for name, text in plays.items()}

print("✓ Text normalization complete")
print("\nNormalization choices:")
print("  - Fixed long S character (ſ → s): YES (required)")
print("  - Other normalizations: [UPDATE THIS BASED ON YOUR CHOICES]")
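
The commented-out `text.replace('u', 'v')` line above would change every `u` in a play, which is rarely what is wanted. A minimal sketch of a positional alternative follows. The function name `normalize_uv` and both regex rules are illustrative assumptions, not part of the assignment, and they will miss or mis-handle some words.

```python
import re

def normalize_uv(text):
    """Heuristic u/v modernization (illustrative, not exhaustive)."""
    # Word-initial 'v' before a consonant usually stood for a vowel: "vpon" -> "upon"
    text = re.sub(r"\bv(?=[b-df-hj-np-tv-z])", "u", text)
    text = re.sub(r"\bV(?=[b-df-hj-np-tv-z])", "U", text)
    # Lowercase 'u' between two lowercase vowels usually stood for a consonant:
    # "loue" -> "love"
    text = re.sub(r"(?<=[aeiou])u(?=[aeiou])", "v", text)
    return text

print(normalize_uv("I haue loued thee vnto death, vpon my life"))
```

If you adopt a rule like this, spot-check a sample of changed words before trusting it, and record the choice in the normalization summary printed above.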

### Create DataFrame for Analysis

In [None]:
# Create a dataframe with all plays
df_plays = pd.DataFrame({
    'play': list(plays_normalized.keys()),
    'text': list(plays_normalized.values()),
    'genre': [play_genres[name] for name in plays_normalized.keys()]
})

# Add word counts
df_plays['word_count'] = df_plays['text'].apply(lambda x: len(x.split()))

print(df_plays[['play', 'genre', 'word_count']].to_string())
print(f"\nGenre distribution:")
print(df_plays['genre'].value_counts())

---
## STEP 1: TF-IDF Analysis

### Calculate TF-IDF

In [None]:
# Create TF-IDF vectorizer
# You can adjust these parameters:
tfidf = TfidfVectorizer(
    max_features=5000,        # Keep top 5000 features
    min_df=2,                 # Word must appear in at least 2 documents
    max_df=0.8,               # Word can't appear in more than 80% of documents
    stop_words='english',     # Remove common English stop words
    lowercase=True,
    token_pattern=r'\b[a-zA-Z]{3,}\b'  # Only words with 3+ letters
)

# Fit and transform
tfidf_matrix = tfidf.fit_transform(df_plays['text'])
feature_names = tfidf.get_feature_names_out()

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"({tfidf_matrix.shape[0]} documents × {tfidf_matrix.shape[1]} features)")
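
To make the scores easier to interpret, the sketch below recomputes TF-IDF by hand using the formula scikit-learn applies with its default settings (`smooth_idf=True`, `norm='l2'`): raw term count times `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The whitespace tokenizer, the helper name `tfidf_by_hand`, and the toy documents are assumptions for illustration. The sketch does not apply the stop-word, `min_df`, `max_df`, or `token_pattern` filters used in the cell above.

```python
import math
from collections import Counter

def tfidf_by_hand(docs):
    """Return one {term: weight} dict per document (toy whitespace tokenizer)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize so every document vector has length 1
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

toy_docs = ["blood blood crown night", "love jest night", "crown war night"]
toy_vectors = tfidf_by_hand(toy_docs)
for vec in toy_vectors:
    print({t: round(w, 3) for t, w in sorted(vec.items())})
```

In the first toy document, `blood` outranks `crown`, which outranks `night`, because the weight grows with in-document frequency and shrinks as more documents contain the term. Keep in mind that `min_df=2` in the cell above removes any term that occurs in only one play, so a word unique to a single play can never appear among its top terms.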

### Extract Top TF-IDF Terms per Play

In [None]:
def get_top_tfidf_terms(doc_index, n=15):
    """Get top N TF-IDF terms for a document"""
    tfidf_scores = tfidf_matrix[doc_index].toarray()[0]
    top_indices = tfidf_scores.argsort()[-n:][::-1]
    top_terms = [(feature_names[i], tfidf_scores[i]) for i in top_indices]
    return top_terms

# Get top terms for each play
top_terms_per_play = {}
for idx, row in df_plays.iterrows():
    top_terms_per_play[row['play']] = get_top_tfidf_terms(idx, n=15)

# Display results
print("=" * 80)
print("TOP 15 TF-IDF TERMS PER PLAY")
print("=" * 80)
for play, terms in top_terms_per_play.items():
    genre = df_plays[df_plays['play'] == play]['genre'].values[0]
    print(f"\n{play} ({genre.upper()})")
    print("-" * 60)
    terms_str = ", ".join([f"{term}({score:.3f})" for term, score in terms])
    print(terms_str)

### Create TF-IDF Summary Table

In [None]:
# Create a clean table for your report
tfidf_summary = []
for play, terms in top_terms_per_play.items():
    genre = df_plays[df_plays['play'] == play]['genre'].values[0]
    top_words = ", ".join([term for term, score in terms[:10]])
    tfidf_summary.append({
        'Play': play,
        'Genre': genre,
        'Top 10 Distinctive Terms': top_words
    })

df_tfidf_summary = pd.DataFrame(tfidf_summary)
print(df_tfidf_summary.to_string(index=False))

# Save to CSV
df_tfidf_summary.to_csv('tfidf_results.csv', index=False)
print("\n✓ Saved to tfidf_results.csv")

### TF-IDF Interpretation

**Answer at least 2 of these questions:**

1. **Do some documents share distinctive vocabulary?**
   - [Your answer here]

2. **Are distinctive terms topical, rhetorical, or technical?**
   - [Your answer here]

3. **Are there documents whose distinctiveness seems driven by noise or formatting?**
   - [Your answer here]
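
For question 3, one quick check is whether a top term mostly occurs as a speaker label. The sketch below measures the share of a term's occurrences that are written in ALL CAPS in the raw text. It assumes your play files print speaker labels in capitals, which depends on the edition, and the helper name `uppercase_share` is hypothetical.

```python
import re

def uppercase_share(term, raw_text):
    """Fraction of a term's occurrences that are written in ALL CAPS."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
    matches = pattern.findall(raw_text)
    if not matches:
        return 0.0
    return sum(m.isupper() for m in matches) / len(matches)

# Made-up snippet in an assumed "SPEAKER. line" layout
sample = (
    "HAMLET. To be, or not to be.\n"
    "HORATIO. My lord Hamlet, hail.\n"
    "HAMLET. I am glad."
)
print(uppercase_share("hamlet", sample))  # 2 of the 3 occurrences are all caps
print(uppercase_share("lord", sample))
```

Run it on the un-lowercased play text, since the vectorizer lowercases tokens. A high share suggests the term's TF-IDF score reflects formatting rather than dialogue.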

---
## STEP 2: Pearson Correlation Analysis

### Calculate Pairwise Correlations

In [None]:
# Convert TF-IDF matrix to dense array
tfidf_dense = tfidf_matrix.toarray()

# Calculate Pearson correlation matrix
correlation_matrix = np.corrcoef(tfidf_dense)

# Create DataFrame with play names
df_corr = pd.DataFrame(
    correlation_matrix,
    index=df_plays['play'],
    columns=df_plays['play']
)

# Round for readability
df_corr = df_corr.round(3)

print("Correlation matrix shape:", df_corr.shape)
print("\nFirst 5x5 subset:")
print(df_corr.iloc[:5, :5])
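
Pearson correlation is the cosine similarity of the two vectors after each has been mean-centered. The sketch below shows the consequence for sparse, non-negative TF-IDF rows: two rows that share no vocabulary have cosine similarity 0 but a negative Pearson value. The helper names and the two toy rows are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(u, v):
    """Pearson r, computed as the cosine of the mean-centered vectors."""
    mu_u, mu_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu_u for a in u], [b - mu_v for b in v])

# Two hypothetical TF-IDF rows with no terms in common
row_a = [0.6, 0.8, 0.0, 0.0, 0.0, 0.0]
row_b = [0.0, 0.0, 0.0, 0.0, 0.8, 0.6]
print("cosine :", round(cosine(row_a, row_b), 3))
print("pearson:", round(pearson(row_a, row_b), 3))
```

This is worth remembering when reading the correlation matrix: a negative value does not mean two plays use "opposite" vocabulary, only that their high-weight terms tend not to coincide.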

### Create Similarity Heatmap

In [None]:
# Create heatmap
plt.figure(figsize=(14, 12))

# Create heatmap with better color scheme
sns.heatmap(
    df_corr,
    cmap='RdBu_r',           # Red-Blue diverging colormap
    center=0,                # Center at 0
    vmin=-1, vmax=1,         # Correlation range
    square=True,             # Square cells
    linewidths=0.5,          # Grid lines
    cbar_kws={"shrink": 0.8, "label": "Pearson Correlation"},
    annot=False              # Set to True if you want numbers in cells
)

plt.title('Pearson Correlation Between Shakespeare Plays', fontsize=16, pad=20)
plt.xlabel('')
plt.ylabel('')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()

# Save figure
plt.savefig('pearson_heatmap.png', dpi=300, bbox_inches='tight')
print("✓ Saved heatmap to pearson_heatmap.png")

plt.show()

### Find Most and Least Similar Pairs

In [None]:
# Get upper triangle (to avoid duplicates and diagonal)
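mask = np.triu(np.ones(df_corr.shape, dtype=bool), k=1)
corr_values = df_corr.where(mask)

# Convert to long format.
# The index and the columns of df_corr are both named 'play', so a plain
# reset_index() would fail with a duplicate-name error. Rename the stacked
# levels first. dropna() removes the masked cells on pandas versions where
# stack() keeps NaN values.
corr_long = (
    corr_values.stack()
    .dropna()
    .rename_axis(['Play_1', 'Play_2'])
    .reset_index(name='Correlation')
)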

# Sort by correlation
corr_sorted = corr_long.sort_values('Correlation', ascending=False)

# Most similar pairs
print("=" * 80)
print("TOP 5 MOST SIMILAR PLAY PAIRS")
print("=" * 80)
for idx, row in corr_sorted.head(5).iterrows():
    genre1 = df_plays[df_plays['play'] == row['Play_1']]['genre'].values[0]
    genre2 = df_plays[df_plays['play'] == row['Play_2']]['genre'].values[0]
    print(f"{row['Play_1']} ({genre1}) <--> {row['Play_2']} ({genre2})")
    print(f"  Correlation: {row['Correlation']:.3f}")
    print()

# Least similar pairs
print("=" * 80)
print("TOP 5 LEAST SIMILAR PLAY PAIRS")
print("=" * 80)
for idx, row in corr_sorted.tail(5).iterrows():
    genre1 = df_plays[df_plays['play'] == row['Play_1']]['genre'].values[0]
    genre2 = df_plays[df_plays['play'] == row['Play_2']]['genre'].values[0]
    print(f"{row['Play_1']} ({genre1}) <--> {row['Play_2']} ({genre2})")
    print(f"  Correlation: {row['Correlation']:.3f}")
    print()

### Pearson Correlation Interpretation

**Required answers:**

1. **Two most similar document pairs:**
   - [Your answer here]

2. **Two least similar document pairs:**
   - [Your answer here]

3. **What questions would you ask from this corpus after seeing these patterns?**
   - [Your answer here]
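
One way to move from individual pairs to the research question is to compare the average correlation of same-genre pairs with that of cross-genre pairs. The sketch below does this on made-up numbers. The function name `genre_cohesion` and its input layout (a dict keyed by play pairs plus a play-to-genre dict) are assumptions, and the toy values are not results from the corpus.

```python
from itertools import combinations
from statistics import mean

def genre_cohesion(corr, genres):
    """Mean correlation for same-genre pairs vs. cross-genre pairs.

    corr:   {(play_a, play_b): r} with one entry per unordered pair
    genres: {play: genre}
    """
    within, between = [], []
    for a, b in combinations(sorted(genres), 2):
        r = corr.get((a, b), corr.get((b, a)))
        if r is None:
            continue  # pair not present in the correlation table
        (within if genres[a] == genres[b] else between).append(r)
    return mean(within), mean(between)

# Made-up numbers for illustration only
toy_genres = {"A": "tragedy", "B": "tragedy", "C": "comedy", "D": "comedy"}
toy_corr = {
    ("A", "B"): 0.40, ("C", "D"): 0.30,
    ("A", "C"): 0.10, ("A", "D"): 0.05,
    ("B", "C"): 0.15, ("B", "D"): 0.10,
}
within, between = genre_cohesion(toy_corr, toy_genres)
print(f"within-genre mean r = {within:.2f}, between-genre mean r = {between:.2f}")
```

With the notebook's own objects, the two inputs could be built from the long-format pair table and `play_genres`. A within-genre mean that is clearly higher than the between-genre mean would be one piece of evidence that genre tracks vocabulary, though with roughly 20 plays the gap should be read cautiously.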

---
## STEP 3: Syntactic Complexity Analysis

### Select Two Plays for Comparison

**Based on your TF-IDF and Pearson results, choose 2 plays to compare**

In [None]:
# UPDATE THESE WITH YOUR CHOSEN PLAYS!
play1_name = "Hamlet"  # Replace with actual play name
play2_name = "Twelfth Night"  # Replace with actual play name

# Get texts
play1_text = df_plays[df_plays['play'] == play1_name]['text'].values[0]
play2_text = df_plays[df_plays['play'] == play2_name]['text'].values[0]

print(f"Selected plays for syntactic analysis:")
print(f"  Play 1: {play1_name}")
print(f"  Play 2: {play2_name}")
print(f"\nReason for selection: [WRITE YOUR REASONING HERE]")

### Load spaCy Model for Syntactic Parsing

In [None]:
# Load spaCy English model
# If not installed, run: python -m spacy download en_core_web_sm
try:
    nlp = spacy.load("en_core_web_sm")
    print("✓ spaCy model loaded successfully")
except OSError:
    # spacy.load raises OSError when the model package is not installed
    print("Please install the spaCy model:")
    print("  python -m spacy download en_core_web_sm")
    raise

# For large texts, increase max_length
nlp.max_length = 2000000  # Increase as needed

### Define Syntactic Complexity Functions

In [None]:
def calculate_syntactic_complexity(text, play_name):
    """
    Calculate syntactic complexity measures for a text
    
    Returns:
    - Mean Length of Sentence (MLS)
    - Clauses per Sentence (C/S)
    - Dependent Clauses per Sentence
    - Coordination per Sentence
    - Complex Nominals per Sentence
    """
    print(f"Parsing {play_name}... (this may take a few minutes)")
    
    # Parse text. Only the first 500k characters are parsed;
    # anything beyond that is silently dropped.
    doc = nlp(text[:500000])
    
    # Initialize counters
    sentences = list(doc.sents)
    n_sentences = len(sentences)
    
    total_words = 0
    total_clauses = 0
    total_dependent_clauses = 0
    total_coordination = 0
    total_complex_nominals = 0
    
    for sent in sentences:
        # Word count
        words = [token for token in sent if not token.is_punct]
        total_words += len(words)
        
        # Clause count (simplified: count verbs)
        verbs = [token for token in sent if token.pos_ == "VERB"]
        total_clauses += len(verbs)
        
        # Dependent clauses (advcl, acl, ccomp, xcomp)
        dep_clauses = [token for token in sent if token.dep_ in ['advcl', 'acl', 'ccomp', 'xcomp']]
        total_dependent_clauses += len(dep_clauses)
        
        # Coordination (conj)
        coord = [token for token in sent if token.dep_ == 'conj']
        total_coordination += len(coord)
        
        # Complex nominals (noun with modifiers)
        nouns_with_mods = [token for token in sent if token.pos_ == 'NOUN' and 
                          any(child.dep_ in ['amod', 'nmod', 'acl', 'relcl'] for child in token.children)]
        total_complex_nominals += len(nouns_with_mods)
    
    # Calculate measures
    results = {
        'play': play_name,
        'sentences': n_sentences,
        'MLS': total_words / n_sentences if n_sentences > 0 else 0,
        'C/S': total_clauses / n_sentences if n_sentences > 0 else 0,
        'DC/S': total_dependent_clauses / n_sentences if n_sentences > 0 else 0,
        'Coord/S': total_coordination / n_sentences if n_sentences > 0 else 0,
        'CN/S': total_complex_nominals / n_sentences if n_sentences > 0 else 0
    }
    
    print(f"✓ Parsing complete for {play_name}")
    return results, doc

print("✓ Syntactic complexity functions defined")
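
The function above parses only `text[:500000]`, which can cut a long text mid-sentence. If any of your files exceed that limit, one option is to split the text at blank lines and parse the pieces separately. The helper below is a sketch under the assumption that paragraphs or speeches are separated by blank lines; `chunk_text` and `max_chars` are hypothetical names.

```python
def chunk_text(text, max_chars=100_000):
    """Split text at blank lines into chunks of roughly max_chars or fewer.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for para in text.split("\n\n"):
        # +2 accounts for the separator that is re-inserted when joining
        if current and size + len(para) + 2 > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para) + 2
    if current:
        chunks.append("\n\n".join(current))
    return chunks

toy_text = "\n\n".join(["x" * 40] * 5)
toy_chunks = chunk_text(toy_text, max_chars=100)
print([len(c) for c in toy_chunks])
```

The resulting chunks could then be passed to `nlp.pipe(chunks)` and the per-sentence counts summed across the returned docs, so that no part of the play is dropped.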

### Calculate Complexity for Both Plays

**WARNING**: This may take 5-10 minutes per play!

In [None]:
# Calculate for Play 1
results1, doc1 = calculate_syntactic_complexity(play1_text, play1_name)

# Calculate for Play 2
results2, doc2 = calculate_syntactic_complexity(play2_text, play2_name)

print("\n✓ Syntactic analysis complete for both plays!")

### Create Summary Table

In [None]:
# Create comparison table
df_syntax = pd.DataFrame([results1, results2])
df_syntax = df_syntax.set_index('play')

# Format for display
df_syntax_display = df_syntax.copy()
for col in ['MLS', 'C/S', 'DC/S', 'Coord/S', 'CN/S']:
    df_syntax_display[col] = df_syntax_display[col].round(2)

print("=" * 80)
print("SYNTACTIC COMPLEXITY COMPARISON")
print("=" * 80)
print(df_syntax_display.to_string())
print("\nMeasures:")
print("  MLS = Mean Length of Sentence")
print("  C/S = Clauses per Sentence")
print("  DC/S = Dependent Clauses per Sentence")
print("  Coord/S = Coordination per Sentence")
print("  CN/S = Complex Nominals per Sentence")

# Save to CSV
df_syntax_display.to_csv('syntactic_complexity.csv')
print("\n✓ Saved to syntactic_complexity.csv")

### Visualize Syntactic Differences

In [None]:
# Create bar chart comparison
fig, ax = plt.subplots(figsize=(12, 6))

measures = ['MLS', 'C/S', 'DC/S', 'Coord/S', 'CN/S']
x = np.arange(len(measures))
width = 0.35

play1_values = [results1[m] for m in measures]
play2_values = [results2[m] for m in measures]

ax.bar(x - width/2, play1_values, width, label=play1_name, alpha=0.8)
ax.bar(x + width/2, play2_values, width, label=play2_name, alpha=0.8)

ax.set_xlabel('Measure', fontsize=12)
ax.set_ylabel('Value', fontsize=12)
ax.set_title('Syntactic Complexity Comparison', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(measures)
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('syntactic_comparison.png', dpi=300, bbox_inches='tight')
print("✓ Saved chart to syntactic_comparison.png")
plt.show()

### Extract Example Sentences

In [None]:
def find_example_sentence(doc, play_name, criterion='longest'):
    """
    Find an example sentence based on criterion
    
    criterion: 'longest', 'most_complex', 'most_coordination'
    """
    sentences = list(doc.sents)
    
    if criterion == 'longest':
        # Find longest sentence
        sent = max(sentences, key=lambda s: len([t for t in s if not t.is_punct]))
    elif criterion == 'most_complex':
        # Find sentence with most dependent clauses
        sent = max(sentences, key=lambda s: len([t for t in s if t.dep_ in ['advcl', 'acl', 'ccomp', 'xcomp']]))
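    elif criterion == 'most_coordination':
        # Find sentence with most coordination
        sent = max(sentences, key=lambda s: len([t for t in s if t.dep_ == 'conj']))
    else:
        # Fail clearly instead of hitting an undefined variable below
        raise ValueError(f"Unknown criterion: {criterion!r}")
    
    return sent.text.strip()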

# Get example sentences
print("=" * 80)
print("EXAMPLE SENTENCES")
print("=" * 80)

print(f"\n{play1_name} - Longest Sentence:")
print("-" * 60)
example1 = find_example_sentence(doc1, play1_name, 'longest')
print(example1[:500] + "..." if len(example1) > 500 else example1)

print(f"\n{play2_name} - Longest Sentence:")
print("-" * 60)
example2 = find_example_sentence(doc2, play2_name, 'longest')
print(example2[:500] + "..." if len(example2) > 500 else example2)

print(f"\n{play1_name} - Most Complex (dependent clauses):")
print("-" * 60)
example3 = find_example_sentence(doc1, play1_name, 'most_complex')
print(example3[:500] + "..." if len(example3) > 500 else example3)

print(f"\n{play2_name} - Most Complex (dependent clauses):")
print("-" * 60)
example4 = find_example_sentence(doc2, play2_name, 'most_complex')
print(example4[:500] + "..." if len(example4) > 500 else example4)

### Syntactic Complexity Interpretation

**Required answers:**

1. **How do the two texts differ in syntactic complexity?**
   - [Your answer here]

2. **Do these differences align with or complicate your earlier lexical findings?**
   - [Your answer here]

3. **What kinds of rhetorical or stylistic practices might these syntactic patterns reflect?**
   - [Your answer here]
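
Before interpreting a gap in a measure such as MLS, it can help to ask how stable that gap is. The sketch below computes a percentile bootstrap interval for the difference in mean sentence length between two lists of per-sentence word counts. The function name, the resampling settings, and the toy counts are assumptions for illustration. The method treats sentences as independent draws, which is only an approximation for dramatic text.

```python
import random
from statistics import mean

def bootstrap_mean_diff(lengths_a, lengths_b, n_boot=2000, seed=0):
    """95% percentile bootstrap interval for mean(lengths_a) - mean(lengths_b)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each play's sentences with replacement
        sample_a = rng.choices(lengths_a, k=len(lengths_a))
        sample_b = rng.choices(lengths_b, k=len(lengths_b))
        diffs.append(mean(sample_a) - mean(sample_b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Hypothetical per-sentence word counts, not taken from any play
toy_a = [12, 30, 8, 25, 40, 18, 22, 35, 15, 28]
toy_b = [10, 14, 9, 20, 12, 16, 11, 18, 13, 15]
low, high = bootstrap_mean_diff(toy_a, toy_b)
observed = mean(toy_a) - mean(toy_b)
print(f"observed diff = {observed:.1f}, 95% interval = ({low:.1f}, {high:.1f})")
```

In the notebook, the two input lists could be built from the parsed docs, for example `[len([t for t in s if not t.is_punct]) for s in doc1.sents]`. An interval that excludes 0 suggests the difference is not just sampling noise, though sentence segmentation errors on verse can still distort both lists.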

---
## STEP 4: SYNTHESIS

### Triangulating Evidence Across All Three Approaches

**This is the most important section!**

Articulate one central analytical question/hypothesis that draws on:
- TF-IDF evidence
- Pearson similarity/distance evidence  
- Syntactic complexity evidence

---

### Central Question/Hypothesis:

[WRITE YOUR CENTRAL QUESTION HERE]

Example: "Do Shakespeare's genre categories (tragedy, comedy, history) reflect measurable linguistic patterns, or are they primarily thematic/plot-based distinctions?"

---

### Evidence from TF-IDF:

[SUMMARIZE KEY FINDINGS FROM TF-IDF]

Example:
- Tragedies share vocabulary related to death, blood, revenge, fate
- Comedies share vocabulary related to love, marriage, jest, disguise
- Character names dominate distinctive terms

---

### Evidence from Pearson Correlation:

[SUMMARIZE KEY FINDINGS FROM CORRELATION ANALYSIS]

Example:
- Strong within-genre clustering (tragedies correlate with tragedies)
- Clear separation between tragedy and comedy
- BUT some plays bridge categories (e.g., Merchant of Venice)
- Histories form distinct cluster

---

### Evidence from Syntactic Complexity:

[SUMMARIZE KEY FINDINGS FROM SYNTAX ANALYSIS]

Example:
- Hamlet shows higher complexity: longer sentences, more subordination
- Twelfth Night shows simpler structures: coordination, shorter sentences
- Syntactic differences align with lexical differences

---

### Conclusion:

[YOUR SYNTHESIS AND CONCLUSIONS]

Example:
"Shakespeare's genres are computationally distinguishable across multiple dimensions. Tragedies and comedies differ not only in vocabulary (what they discuss) but in syntax (how they construct arguments). However, problem plays and late romances resist clean classification, suggesting genre exists on a spectrum rather than as rigid categories. This analysis demonstrates that computational methods can illuminate both the coherence of traditional genre labels and their limitations."

---
## Summary of Outputs

### Files Generated:
1. `tfidf_results.csv` - Top TF-IDF terms per play
2. `pearson_heatmap.png` - Correlation heatmap
3. `syntactic_complexity.csv` - Complexity measures
4. `syntactic_comparison.png` - Bar chart comparison

### For Your Report:
- Include all visualizations
- Reference tables in your analysis
- Answer all interpretive questions
- Write synthesis section

### For Canvas Submission:
- Post this notebook to GitHub
- Include link in Discussion Board post
- Upload all PNG files
- Write your report with interpretations

---
## Notes & Reflections

[Use this space for any additional notes, observations, or ideas that emerged during your analysis]