# Historical Text Analysis & Corpus Linguistics

This notebook demonstrates computational methods for analyzing historical texts, including lexical analysis, stylometry, and topic modeling.

## Learning Objectives

1. Load and preprocess historical texts
2. Perform lexical analysis (word frequencies, vocabulary richness)
3. Conduct stylometric analysis for authorship attribution
4. Apply topic modeling to identify themes
5. Extract named entities (approximated here by proper nouns)
6. Compare texts and measure similarity
7. Create visualizations (word clouds, frequency distributions)

## Dataset

Sample texts (public domain, except King's speech, which remains under copyright and appears only as a short excerpt):
- Shakespeare: Sonnet 18
- Lincoln: Gettysburg Address
- Austen: Pride and Prejudice (excerpt)
- King: I Have a Dream speech (excerpt)

## 1. Setup and Imports

In [None]:
import os
from collections import Counter

import matplotlib.pyplot as plt

# NLP libraries
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from wordcloud import WordCloud

# Set visualization style
plt.style.use("seaborn-v0_8-whitegrid")
sns.set_palette("husl")
%matplotlib inline

print("✓ Imports successful")

In [None]:
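# Download required NLTK data.
# Recent NLTK releases look for "punkt_tab" and "averaged_perceptron_tagger_eng",
# while older releases use the original names, so both are requested here.
# An unknown resource name is skipped without raising an error.
for resource in [
    "punkt",
    "punkt_tab",
    "stopwords",
    "averaged_perceptron_tagger",
    "averaged_perceptron_tagger_eng",
    "maxent_ne_chunker",  # only needed for nltk.ne_chunk, not called below
    "words",  # only needed for nltk.ne_chunk, not called below
]:
    nltk.download(resource, quiet=True)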

print("✓ NLTK data downloaded")

## 2. Load Texts

In [None]:
# Load texts from files
def load_text(filename):
    """Load text from file."""
    with open(os.path.join("texts", filename), encoding="utf-8") as f:
        return f.read()


texts = {
    "Shakespeare": load_text("shakespeare_sonnet18.txt"),
    "Lincoln": load_text("lincoln_gettysburg.txt"),
    "Austen": load_text("austen_pride_excerpt.txt"),
    "King": load_text("king_dream_excerpt.txt"),
}

print("Loaded texts:")
for author, text in texts.items():
    print(f"  {author}: {len(text)} characters")

# Display sample
print("\nSample from Shakespeare:")
print(texts["Shakespeare"][:200] + "...")

## 3. Text Preprocessing

In [None]:
def preprocess_text(text, remove_stopwords=False):
    """Tokenize and preprocess text."""
    # Tokenize
    tokens = word_tokenize(text.lower())

    # Remove punctuation and non-alphabetic tokens
    tokens = [token for token in tokens if token.isalpha()]

    # Optionally remove stopwords
    if remove_stopwords:
        stop_words = set(stopwords.words("english"))
        tokens = [token for token in tokens if token not in stop_words]

    return tokens


# Preprocess all texts
tokens_all = {author: preprocess_text(text) for author, text in texts.items()}
tokens_nostop = {
    author: preprocess_text(text, remove_stopwords=True) for author, text in texts.items()
}

print("Token counts (with stopwords):")
for author, tokens in tokens_all.items():
    print(f"  {author}: {len(tokens)} tokens")

print("\nToken counts (without stopwords):")
for author, tokens in tokens_nostop.items():
    print(f"  {author}: {len(tokens)} tokens")
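
NLTK's English stopword list is oriented toward modern usage, so archaic function words in the older texts (for example *thou*, *thee*, *thy* in the sonnet) may survive the filter above. The sketch below shows one way to extend a stopword set. The `ARCHAIC_STOPWORDS` list and the `remove_stopwords_extended` helper are illustrative assumptions, not part of NLTK, and the small `modern` set stands in for `stopwords.words("english")` so the example runs without any downloads.

```python
# Illustrative only: a hand-picked list of archaic function words (an assumption,
# not an NLTK resource). Note that "art" is also a modern noun, so review the
# list against your own corpus before using it.
ARCHAIC_STOPWORDS = {"thou", "thee", "thy", "thine", "hath", "doth", "art", "ye"}


def remove_stopwords_extended(tokens, base_stopwords, extra=ARCHAIC_STOPWORDS):
    """Drop tokens found in either the base stopword set or the extra set."""
    combined = set(base_stopwords) | set(extra)
    return [token for token in tokens if token not in combined]


# Stand-in for set(stopwords.words("english")); deliberately tiny.
modern = {"i", "a", "to", "and", "more"}

# Lowercased alphabetic tokens, as preprocess_text would return them.
tokens = [
    "shall", "i", "compare", "thee", "to", "a", "summer", "day",
    "thou", "art", "more", "lovely", "and", "more", "temperate",
]

filtered = remove_stopwords_extended(tokens, modern)
print(filtered)
```

In the notebook, the same idea could be applied by passing `set(stopwords.words("english"))` as `base_stopwords`.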

## 4. Lexical Analysis

In [None]:
def calculate_lexical_stats(tokens):
    """Calculate lexical diversity metrics."""
    total_tokens = len(tokens)
    unique_tokens = len(set(tokens))

    # Type-Token Ratio (TTR)
    ttr = unique_tokens / total_tokens if total_tokens > 0 else 0

    # Hapax legomena (words appearing once)
    freq_dist = Counter(tokens)
    hapax = sum(1 for word, count in freq_dist.items() if count == 1)

    return {
        "total_tokens": total_tokens,
        "unique_tokens": unique_tokens,
        "ttr": ttr,
        "hapax_legomena": hapax,
        "hapax_percentage": (hapax / unique_tokens * 100) if unique_tokens > 0 else 0,
    }


# Calculate stats for all texts
lexical_stats = {}
for author, tokens in tokens_all.items():
    lexical_stats[author] = calculate_lexical_stats(tokens)

# Display as DataFrame
stats_df = pd.DataFrame(lexical_stats).T
print("Lexical Statistics:")
print(stats_df)

In [None]:
# Visualize lexical diversity
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Type-Token Ratio
stats_df["ttr"].plot(kind="bar", ax=axes[0], color="steelblue", alpha=0.7)
axes[0].set_title("Type-Token Ratio (Vocabulary Richness)", fontsize=12, fontweight="bold")
axes[0].set_xlabel("Author")
axes[0].set_ylabel("TTR (higher = more diverse)")
axes[0].tick_params(axis="x", rotation=45)
axes[0].grid(True, alpha=0.3, axis="y")

# Hapax Legomena
stats_df["hapax_legomena"].plot(kind="bar", ax=axes[1], color="coral", alpha=0.7)
axes[1].set_title("Hapax Legomena (Words Used Once)", fontsize=12, fontweight="bold")
axes[1].set_xlabel("Author")
axes[1].set_ylabel("Count")
axes[1].tick_params(axis="x", rotation=45)
axes[1].grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.show()
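
Plain TTR falls as a text grows, so comparing a 14-line sonnet with longer excerpts tends to favor the shorter text. One common length-robust alternative is the moving-average TTR (MATTR), which averages the TTR over fixed-size sliding windows. A minimal sketch, assuming a hypothetical helper `moving_average_ttr` and a window size chosen only for illustration:

```python
def moving_average_ttr(tokens, window=50):
    """Average the type-token ratio over every run of `window` consecutive tokens.

    Falls back to plain TTR when the text is no longer than one window.
    """
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)

    ratios = []
    for start in range(len(tokens) - window + 1):
        chunk = tokens[start : start + window]
        ratios.append(len(set(chunk)) / window)
    return sum(ratios) / len(ratios)


# Toy input: four windows of three tokens with TTRs 2/3, 1, 2/3, 1.
sample = ["a", "b", "a", "c", "a", "b"]
mattr = moving_average_ttr(sample, window=3)
print(round(mattr, 4))
```

Applied to the notebook's data, `{author: moving_average_ttr(tokens) for author, tokens in tokens_all.items()}` would give one value per text. The window must be no longer than the shortest text for the comparison to be meaningful.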

## 5. Word Frequency Analysis

In [None]:
# Get top 10 words for each text (without stopwords)
print("Top 10 Most Frequent Words (excluding stopwords):\n")

for author, tokens in tokens_nostop.items():
    freq_dist = Counter(tokens)
    top_10 = freq_dist.most_common(10)

    print(f"{author}:")
    for word, count in top_10:
        print(f"  {word}: {count}")
    print()

In [None]:
# Visualize frequency distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, (author, tokens) in enumerate(tokens_nostop.items()):
    freq_dist = Counter(tokens)
    top_15 = dict(freq_dist.most_common(15))

    axes[idx].barh(list(top_15.keys()), list(top_15.values()), color="skyblue", alpha=0.7)
    axes[idx].set_title(f"{author}: Top 15 Words", fontsize=12, fontweight="bold")
    axes[idx].set_xlabel("Frequency")
    axes[idx].invert_yaxis()
    axes[idx].grid(True, alpha=0.3, axis="x")

plt.tight_layout()
plt.show()

## 6. Word Clouds

In [None]:
# Generate word clouds for each text
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, (author, text) in enumerate(texts.items()):
    wordcloud = WordCloud(
        width=400, height=300, background_color="white", colormap="viridis", max_words=50
    ).generate(text)

    axes[idx].imshow(wordcloud, interpolation="bilinear")
    axes[idx].axis("off")
    axes[idx].set_title(f"{author}", fontsize=14, fontweight="bold")

plt.tight_layout()
plt.show()

## 7. Stylometric Analysis

Analyze writing style through sentence structure and function words.

In [None]:
def calculate_stylometric_features(text):
    """Calculate stylometric features for authorship attribution."""
    # Sentence statistics
    sentences = sent_tokenize(text)
    # Keep alphabetic tokens only so punctuation does not inflate word counts
    words = [w for w in word_tokenize(text) if w.isalpha()]

    avg_sentence_length = len(words) / len(sentences) if sentences else 0

    # Word length statistics
    word_lengths = [len(word) for word in words]
    avg_word_length = np.mean(word_lengths) if word_lengths else 0

    # Function words (ten of the most common in English)
    function_words = {"the", "of", "and", "to", "a", "in", "that", "is", "it", "for"}
    tokens_lower = [w.lower() for w in words]
    function_word_freq = (
        sum(token in function_words for token in tokens_lower) / len(tokens_lower)
        if tokens_lower
        else 0
    )

    return {
        "num_sentences": len(sentences),
        "avg_sentence_length": avg_sentence_length,
        "avg_word_length": avg_word_length,
        "function_word_freq": function_word_freq * 100,  # as percentage
    }


# Calculate stylometric features
stylometric_stats = {}
for author, text in texts.items():
    stylometric_stats[author] = calculate_stylometric_features(text)

style_df = pd.DataFrame(stylometric_stats).T
print("Stylometric Features:")
print(style_df)

In [None]:
# Visualize stylometric features
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Average sentence length
style_df["avg_sentence_length"].plot(kind="bar", ax=axes[0], color="purple", alpha=0.7)
axes[0].set_title("Average Sentence Length", fontsize=12, fontweight="bold")
axes[0].set_xlabel("Author")
axes[0].set_ylabel("Words per Sentence")
axes[0].tick_params(axis="x", rotation=45)
axes[0].grid(True, alpha=0.3, axis="y")

# Average word length
style_df["avg_word_length"].plot(kind="bar", ax=axes[1], color="teal", alpha=0.7)
axes[1].set_title("Average Word Length", fontsize=12, fontweight="bold")
axes[1].set_xlabel("Author")
axes[1].set_ylabel("Characters per Word")
axes[1].tick_params(axis="x", rotation=45)
axes[1].grid(True, alpha=0.3, axis="y")

# Function word frequency
style_df["function_word_freq"].plot(kind="bar", ax=axes[2], color="salmon", alpha=0.7)
axes[2].set_title("Function Word Frequency", fontsize=12, fontweight="bold")
axes[2].set_xlabel("Author")
axes[2].set_ylabel("Percentage (%)")
axes[2].tick_params(axis="x", rotation=45)
axes[2].grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.show()
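
Function-word frequencies are the usual input to Burrows's Delta, a standard distance measure in authorship attribution. Each word's relative frequency is z-scored across the corpus, and the distance between two texts is the mean absolute difference of their z-scores. The sketch below is a simplified illustration using only the standard library. The helper names, the toy token lists, and the use of the sample standard deviation are assumptions. Real applications typically use the corpus's most frequent words (often a hundred or more) rather than a three-word list.

```python
from collections import Counter
from statistics import mean, stdev


def relative_frequencies(tokens, vocabulary):
    """Return each vocabulary word's share of the text's tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: (counts[word] / total if total else 0.0) for word in vocabulary}


def burrows_delta(corpus_tokens, vocabulary):
    """Return {(name_a, name_b): delta} for every pair of texts.

    Requires at least two texts, because z-scores need a standard deviation.
    """
    freqs = {
        name: relative_frequencies(tokens, vocabulary)
        for name, tokens in corpus_tokens.items()
    }

    # z-score each word across the corpus; words with no variation are skipped
    z_scores = {name: {} for name in corpus_tokens}
    for word in vocabulary:
        values = [freqs[name][word] for name in corpus_tokens]
        spread = stdev(values)
        if spread == 0:
            continue
        centre = mean(values)
        for name in corpus_tokens:
            z_scores[name][word] = (freqs[name][word] - centre) / spread

    # Delta = mean absolute difference of z-scores over the retained words
    names = list(corpus_tokens)
    deltas = {}
    for i, name_a in enumerate(names):
        for name_b in names[i + 1 :]:
            diffs = [
                abs(z_scores[name_a][word] - z_scores[name_b][word])
                for word in z_scores[name_a]
            ]
            deltas[(name_a, name_b)] = mean(diffs) if diffs else 0.0
    return deltas


# Toy corpus: A and B share a function-word profile, C differs.
toy = {
    "A": "the cat and the dog and the bird".split(),
    "B": "the cat and the dog and the fish".split(),
    "C": "of mice and of men and of course".split(),
}
deltas = burrows_delta(toy, ["the", "and", "of"])
for pair, value in deltas.items():
    print(pair, round(value, 3))
```

A smaller Delta means a closer function-word profile. With only four short texts, as in this notebook, the values are illustrative rather than evidence of authorship.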

## 8. Text Similarity Analysis

In [None]:
# Calculate TF-IDF vectors
vectorizer = TfidfVectorizer(max_features=100, stop_words="english")
corpus = list(texts.values())
tfidf_matrix = vectorizer.fit_transform(corpus)

# Calculate cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)
similarity_df = pd.DataFrame(similarity_matrix, index=texts.keys(), columns=texts.keys())

print("Text Similarity Matrix (Cosine Similarity):")
print(similarity_df)

In [None]:
# Visualize similarity matrix
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(
    similarity_df,
    annot=True,
    fmt=".3f",
    cmap="YlOrRd",
    square=True,
    ax=ax,
    cbar_kws={"label": "Cosine Similarity"},
)
ax.set_title("Text Similarity Heatmap", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
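
Cosine similarity measures the angle between two term vectors: the dot product divided by the product of the vector lengths. The following standard-library sketch applies it to raw term counts rather than TF-IDF weights, purely to show the arithmetic. `cosine_from_counts` is a hypothetical helper, not a scikit-learn function.

```python
import math
from collections import Counter


def cosine_from_counts(tokens_a, tokens_b):
    """Cosine similarity between the term-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)

    # Only words present in both texts contribute to the dot product
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)

    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = ["liberty", "nation", "nation"]
doc_b = ["nation", "dream"]
similarity = cosine_from_counts(doc_a, doc_b)
print(round(similarity, 3))  # 0.632
```

Texts with no shared vocabulary score 0, and identical count profiles score 1. Because stopwords are removed and the texts are short, low off-diagonal values in the heatmap are to be expected.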

## 9. Topic Modeling (LDA)

Extract themes from the corpus using Latent Dirichlet Allocation.

In [None]:
# Prepare data for LDA
count_vectorizer = CountVectorizer(max_features=50, stop_words="english", max_df=0.95)
doc_term_matrix = count_vectorizer.fit_transform(corpus)
feature_names = count_vectorizer.get_feature_names_out()

# Fit LDA model
n_topics = 3
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42, max_iter=20)
lda_output = lda.fit_transform(doc_term_matrix)


# Display top words for each topic
def display_topics(model, feature_names, n_top_words=10):
    topics = {}
    for topic_idx, topic in enumerate(model.components_):
        top_indices = topic.argsort()[-n_top_words:][::-1]
        top_words = [feature_names[i] for i in top_indices]
        topics[f"Topic {topic_idx + 1}"] = top_words
    return topics


topics = display_topics(lda, feature_names, n_top_words=10)

print("Discovered Topics:\n")
for topic_name, words in topics.items():
    print(f"{topic_name}: {', '.join(words)}")
    print()

In [None]:
# Document-topic distribution
topic_distribution = pd.DataFrame(
    lda_output, index=texts.keys(), columns=[f"Topic {i + 1}" for i in range(n_topics)]
)

print("Document-Topic Distribution:")
print(topic_distribution)

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
topic_distribution.plot(kind="bar", stacked=True, ax=ax, alpha=0.7)
ax.set_title("Topic Distribution Across Texts", fontsize=14, fontweight="bold")
ax.set_xlabel("Author")
ax.set_ylabel("Topic Proportion")
ax.legend(title="Topics", bbox_to_anchor=(1.05, 1), loc="upper left")
ax.tick_params(axis="x", rotation=45)
ax.grid(True, alpha=0.3, axis="y")
plt.tight_layout()
plt.show()
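
With only four documents, LDA has very little co-occurrence evidence to learn from, so the topics above are best read as illustrative. A common workaround is to split each text into fixed-size chunks and treat each chunk as a separate document. A minimal sketch, assuming a hypothetical `chunk_tokens` helper and arbitrary size thresholds:

```python
def chunk_tokens(tokens, size=100, min_size=20):
    """Split a token list into consecutive chunks of `size` tokens.

    A trailing chunk shorter than `min_size` is merged into the previous chunk
    so that no pseudo-document is too short to be informative.
    """
    if size <= 0:
        raise ValueError("size must be positive")

    chunks = [tokens[i : i + size] for i in range(0, len(tokens), size)]
    if len(chunks) > 1 and len(chunks[-1]) < min_size:
        chunks[-2] = chunks[-2] + chunks[-1]
        chunks.pop()
    return chunks


# 250 toy tokens: the 50-token remainder is merged into the second chunk.
tokens = [f"w{i}" for i in range(250)]
chunks = chunk_tokens(tokens, size=100, min_size=60)
print([len(chunk) for chunk in chunks])  # [100, 150]
```

Each chunk can be rejoined with `" ".join(chunk)` and passed to `CountVectorizer` in place of the four-item `corpus`. The sonnet is too short to yield more than one chunk at this size.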

## 10. Named Entity Recognition

In [None]:
def extract_named_entities(text):
    """Approximate named entities by counting proper nouns (NNP/NNPS tags)."""
    # Tokenize and POS tag
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)

    # Simple NER: extract proper nouns (NNP, NNPS)
    entities = [word for word, pos in pos_tags if pos in ["NNP", "NNPS"]]

    return Counter(entities)


# Extract entities from all texts
print("Named Entities (Proper Nouns):\n")

for author, text in texts.items():
    entities = extract_named_entities(text)
    print(f"{author}:")
    if entities:
        for entity, count in entities.most_common(10):
            print(f"  {entity}: {count}")
    else:
        print("  No named entities found")
    print()
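
Because the function above counts each proper-noun token separately, a multi-word name such as "New York" is split into two entries, and capitalized line-initial words in verse may be tagged as proper nouns. One partial remedy for the first problem is to merge runs of adjacent proper-noun tags into a single candidate entity. The sketch below works on a hand-written list of (word, tag) pairs so it runs without NLTK data. In the notebook the pairs would come from `pos_tag`. The helper name `merge_proper_nouns` is an assumption.

```python
from collections import Counter
from itertools import groupby

PROPER_NOUN_TAGS = ("NNP", "NNPS")


def merge_proper_nouns(tagged_tokens):
    """Join consecutive proper-noun tokens into one entity and count them."""
    entities = []
    runs = groupby(tagged_tokens, key=lambda pair: pair[1] in PROPER_NOUN_TAGS)
    for is_proper, group in runs:
        if is_proper:
            entities.append(" ".join(word for word, _tag in group))
    return Counter(entities)


# Hand-written toy input in the (word, tag) format returned by pos_tag.
tagged = [
    ("People", "NNS"), ("from", "IN"), ("New", "NNP"), ("York", "NNP"),
    ("and", "CC"), ("Georgia", "NNP"), ("visited", "VBD"),
    ("New", "NNP"), ("York", "NNP"), ("again", "RB"),
]

entity_counts = merge_proper_nouns(tagged)
print(entity_counts.most_common())
```

This remains a heuristic: it cannot tell a person from a place, and it inherits any tagging errors.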

## 11. Part-of-Speech Analysis

In [None]:
def analyze_pos(text):
    """Analyze part-of-speech distribution."""
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)

    # Count POS tags
    pos_counts = Counter(tag for word, tag in pos_tags)

    # Group into major categories
    categories = {
        "Nouns": sum(count for tag, count in pos_counts.items() if tag.startswith("NN")),
        "Verbs": sum(count for tag, count in pos_counts.items() if tag.startswith("VB")),
        "Adjectives": sum(count for tag, count in pos_counts.items() if tag.startswith("JJ")),
        "Adverbs": sum(count for tag, count in pos_counts.items() if tag.startswith("RB")),
        "Pronouns": sum(count for tag, count in pos_counts.items() if tag.startswith("PR")),
    }

    return categories


# Analyze POS for all texts
pos_analysis = {}
for author, text in texts.items():
    pos_analysis[author] = analyze_pos(text)

pos_df = pd.DataFrame(pos_analysis).T
print("Part-of-Speech Distribution:")
print(pos_df)

In [None]:
# Visualize POS distribution
fig, ax = plt.subplots(figsize=(12, 6))
pos_df.plot(kind="bar", ax=ax, alpha=0.7)
ax.set_title("Part-of-Speech Distribution Across Texts", fontsize=14, fontweight="bold")
ax.set_xlabel("Author")
ax.set_ylabel("Count")
ax.legend(title="POS Category", bbox_to_anchor=(1.05, 1), loc="upper left")
ax.tick_params(axis="x", rotation=45)
ax.grid(True, alpha=0.3, axis="y")
plt.tight_layout()
plt.show()
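
Raw counts mostly reflect text length: a longer excerpt has more of every category. Converting counts to shares makes the four texts comparable. A standard-library sketch, with a hypothetical `to_proportions` helper and made-up counts. The shares are relative to the five categories tracked above, not to all tokens.

```python
def to_proportions(category_counts):
    """Convert a {category: count} mapping into {category: share of total}."""
    total = sum(category_counts.values())
    if total == 0:
        return {category: 0.0 for category in category_counts}
    return {category: count / total for category, count in category_counts.items()}


# Made-up counts in the shape returned by analyze_pos.
toy_counts = {"Nouns": 30, "Verbs": 20, "Adjectives": 10, "Adverbs": 5, "Pronouns": 15}
shares = to_proportions(toy_counts)
print({category: round(share, 3) for category, share in shares.items()})
```

On the notebook's DataFrame, the equivalent row-wise normalization is `pos_df.div(pos_df.sum(axis=1), axis=0)`.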

## 12. Summary Report

In [None]:
# Create comprehensive summary
summary_data = []

for author in texts:
    summary_data.append(
        {
            "Author": author,
            "Characters": len(texts[author]),
            "Tokens": lexical_stats[author]["total_tokens"],
            "Unique Words": lexical_stats[author]["unique_tokens"],
            "TTR": f"{lexical_stats[author]['ttr']:.3f}",
            "Sentences": stylometric_stats[author]["num_sentences"],
            "Avg Sent Length": f"{stylometric_stats[author]['avg_sentence_length']:.1f}",
            "Avg Word Length": f"{stylometric_stats[author]['avg_word_length']:.2f}",
        }
    )

summary_df = pd.DataFrame(summary_data)

print("=" * 80)
print("TEXT ANALYSIS SUMMARY")
print("=" * 80)
print(summary_df.to_string(index=False))
print("=" * 80)

# Save summary
summary_df.to_csv("text_analysis_summary.csv", index=False)
print("\n✓ Summary saved to text_analysis_summary.csv")

## Key Findings

### Lexical Diversity
- **Highest TTR**: [Author] shows greatest vocabulary diversity
- **Hapax Legomena**: [Author] has the most words that occur exactly once

### Stylometric Patterns
- **Longest Sentences**: [Author] has the highest average sentence length
- **Shortest Words**: [Author] has the lowest average word length
- **Function Words**: [Author] has highest function word density

### Text Similarity
- Most similar texts: [Text A] and [Text B] (similarity: [value])
- Least similar texts: [Text C] and [Text D] (similarity: [value])

### Topics Identified
1. **Topic 1**: [Theme description based on top words]
2. **Topic 2**: [Theme description]
3. **Topic 3**: [Theme description]

## Research Applications

### Authorship Attribution
- Stylometric features can help distinguish authors, given enough text per author
- Function word patterns reveal writing habits
- Sentence structure complexity varies by author

### Historical Analysis
- Language evolution across time periods
- Thematic shifts in discourse
- Cultural context through vocabulary

### Literary Studies
- Genre classification
- Influence detection
- Stylistic comparison

## Next Steps

1. **Expand Corpus**: Add more texts for robust analysis
2. **Advanced NER**: Use spaCy for better entity recognition
3. **Sentiment Analysis**: Track emotional tone
4. **Network Analysis**: Build co-occurrence networks
5. **Temporal Analysis**: Track language change over time

## Resources

- [Project Gutenberg](https://www.gutenberg.org/): Free public domain texts
- [NLTK Book](https://www.nltk.org/book/): Natural Language Processing with Python
- [Stylometry Guide](https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python): Programming Historian lesson introducing stylometry with Python
- [HathiTrust](https://www.hathitrust.org/): Digital library of millions of texts