# Text Mining Introduction: Historical Document Analysis

Learn digital humanities fundamentals through 19th-century text analysis.

## Dataset

`sample_texts.csv` contains 20 historical documents (1850-1900) about industrialization:
- Topics: labor, economy, technology, society
- Metadata: title, author, and publication year for each document
- Period: the Industrial Revolution

## Methods
- Text preprocessing
- Word frequency analysis
- TF-IDF scoring
- Temporal trends
- Topic identification

In [None]:
import re
import warnings
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer

warnings.filterwarnings("ignore")

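# Download required NLTK data
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
# Recent NLTK releases load the word tokenizer from "punkt_tab" instead of "punkt"
nltk.download("punkt_tab", quiet=True)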
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

plt.style.use("seaborn-v0_8-darkgrid")
%matplotlib inline

print("✓ Setup complete")

## 1. Load and Explore Data

In [None]:
# Load documents
df = pd.read_csv("sample_texts.csv")

print(f"Number of documents: {len(df)}")
print(f"Time period: {df['year'].min()} - {df['year'].max()}")
print(f"\nAuthors: {df['author'].nunique()}")
print(f"Average text length: {df['text'].str.len().mean():.0f} characters")

df.head()

## 2. Text Preprocessing

In [None]:
# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words("english"))


def preprocess_text(text):
    """Lowercase, strip punctuation, tokenize, and drop stop words and short tokens."""
    # Lowercase
    text = text.lower()
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    # Tokenize
    tokens = word_tokenize(text)
    # Drop stop words and tokens of three characters or fewer
    return [w for w in tokens if w not in STOP_WORDS and len(w) > 3]


# Apply preprocessing
df["tokens"] = df["text"].apply(preprocess_text)
df["word_count"] = df["tokens"].apply(len)

print("Example preprocessing:")
print(f"Original: {df['text'].iloc[0][:100]}...")
print(f"Tokens: {df['tokens'].iloc[0][:15]}")
print(f"\nAverage words per document: {df['word_count'].mean():.1f}")

## 3. Word Frequency Analysis

In [None]:
# Flatten the per-document token lists into one list of words
all_words = [word for tokens in df["tokens"] for word in tokens]

word_freq = Counter(all_words)
most_common = word_freq.most_common(20)

# Plot most frequent words
words = [w[0] for w in most_common]
counts = [w[1] for w in most_common]

fig, ax = plt.subplots(figsize=(12, 6))
bars = ax.barh(words, counts, color="steelblue", alpha=0.7, edgecolor="black")
ax.set_xlabel("Frequency", fontsize=12)
ax.set_title("Top 20 Most Frequent Words", fontsize=14, fontweight="bold")
ax.invert_yaxis()
ax.grid(True, alpha=0.3, axis="x")
plt.tight_layout()
plt.show()

print("\nTop 20 Most Common Words:")
for word, count in most_common:
    print(f"  {word:.<25} {count}")

## 4. TF-IDF Analysis

In [None]:
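# Calculate TF-IDF on the raw text. TfidfVectorizer applies its own
# tokenization and stop-word list, independent of preprocess_text in Section 2.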
tfidf_vectorizer = TfidfVectorizer(max_features=50, stop_words="english", min_df=2)
tfidf_matrix = tfidf_vectorizer.fit_transform(df["text"])
feature_names = tfidf_vectorizer.get_feature_names_out()

# Get top terms for each document
print("Top TF-IDF Terms by Document:\n")
for idx in [0, 5, 10, 15]:  # Sample documents
    doc_tfidf = tfidf_matrix[idx].toarray()[0]
    top_indices = doc_tfidf.argsort()[-5:][::-1]
    top_terms = [(feature_names[i], doc_tfidf[i]) for i in top_indices]

    print(f"{df['title'].iloc[idx]} ({df['year'].iloc[idx]}):")
    for term, score in top_terms:
        print(f"  {term}: {score:.3f}")
    print()

# Overall most important terms
mean_tfidf = tfidf_matrix.mean(axis=0).A1
top_terms_overall = sorted(zip(feature_names, mean_tfidf), key=lambda x: x[1], reverse=True)[:15]

# Plot
terms = [t[0] for t in top_terms_overall]
scores = [t[1] for t in top_terms_overall]

fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(terms, scores, color="green", alpha=0.7, edgecolor="black")
ax.set_xlabel("Average TF-IDF Score", fontsize=12)
ax.set_title("Top Terms by Average TF-IDF Score", fontsize=14, fontweight="bold")
ax.invert_yaxis()
ax.grid(True, alpha=0.3, axis="x")
plt.tight_layout()
plt.show()

print("TF-IDF identifies words that are distinctive to documents,")
print("not just frequent overall.")

## 5. Temporal Analysis

In [None]:
# Track key terms over time
key_terms = ["workers", "industrial", "society", "economic", "technology"]

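# Count occurrences by year. Note that str.count matches substrings, so
# "industrial" also counts "industrialization", and the raw counts are not
# normalized for how much text each year contains.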
term_by_year = {term: [] for term in key_terms}
years = sorted(df["year"].unique())

for year in years:
    year_docs = df[df["year"] == year]
    year_text = " ".join(year_docs["text"].values).lower()

    for term in key_terms:
        count = year_text.count(term)
        term_by_year[term].append(count)

# Plot trends
fig, ax = plt.subplots(figsize=(12, 6))

for term in key_terms:
    ax.plot(years, term_by_year[term], marker="o", linewidth=2, label=term.capitalize())

ax.set_xlabel("Year", fontsize=12)
ax.set_ylabel("Frequency", fontsize=12)
ax.set_title("Key Terms Over Time (1850-1900)", fontsize=14, fontweight="bold")
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Temporal analysis reveals how vocabulary changes over time.")
print("Different themes gain or lose prominence across decades.")

## 6. Document Similarity

In [None]:
# Calculate document similarity using TF-IDF
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tfidf_matrix)

# Create DataFrame with document titles
sim_df = pd.DataFrame(similarity_matrix, index=df["title"], columns=df["title"])

# Plot heatmap (sample)
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(
    sim_df.iloc[:10, :10],
    annot=True,
    fmt=".2f",
    cmap="YlOrRd",
    square=True,
    ax=ax,
    cbar_kws={"label": "Cosine Similarity"},
)
ax.set_title("Document Similarity Matrix (First 10 Documents)", fontsize=14, fontweight="bold")
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Find most similar document pairs
print("\nMost Similar Document Pairs:")
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        if similarity_matrix[i, j] > 0.6:  # Threshold
            print(
                f"  {df['title'].iloc[i]} <-> {df['title'].iloc[j]}: {similarity_matrix[i, j]:.3f}"
            )

## 7. Topic Themes

In [None]:
# Simple theme identification: group documents by their single
# highest-scoring TF-IDF term (a heuristic, not a fitted topic model)

topic_groups = {}
# enumerate yields row positions, which line up with tfidf_matrix rows
# even if df does not have a default 0..n-1 index
for pos, title in enumerate(df["title"]):
    doc_tfidf = tfidf_matrix[pos].toarray()[0]
    top_term = feature_names[doc_tfidf.argmax()]
    topic_groups.setdefault(top_term, []).append(title)

print("Document Groupings by Dominant Theme:\n")
for theme, docs in sorted(topic_groups.items(), key=lambda x: len(x[1]), reverse=True)[:5]:
    print(f"Theme: {theme.upper()}")
    for doc in docs:
        print(f"  - {doc}")
    print()

# Count documents per decade
df["decade"] = (df["year"] // 10) * 10
decade_counts = df["decade"].value_counts().sort_index()

print("\nDocuments by Decade:")
for decade, count in decade_counts.items():
    print(f"  {decade}s: {count} documents")

## 8. Summary Statistics

In [None]:
summary = {
    "Total Documents": len(df),
    "Time Span": f"{df['year'].min()}-{df['year'].max()}",
    "Unique Authors": df["author"].nunique(),
    # Total tokens left after preprocessing, summed over all documents
    "Total Words": int(df["word_count"].sum()),
    "Unique Words": len(word_freq),
    "Avg Words/Doc": f"{df['word_count'].mean():.1f}",
    "Most Common Word": most_common[0][0],
    "Top TF-IDF Term": top_terms_overall[0][0],
}

print("=" * 60)
print("TEXT MINING ANALYSIS SUMMARY")
print("=" * 60)
for key, value in summary.items():
    print(f"{key:.<40} {value}")
print("=" * 60)

print("\n✓ Analysis complete!")
print("\nKey Findings:")
print("  1. Industrial and economic themes dominate the corpus")
print("  2. Vocabulary shifts over the 50-year period")
print("  3. Documents cluster by thematic similarity")
print("  4. TF-IDF reveals distinctive terminology")

## Key Concepts Learned

### Text Preprocessing
- **Tokenization**: Splitting text into words
- **Stop words**: Common words removed (the, and, is)
- **Normalization**: Lowercasing, punctuation removal
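
The three steps above can be sketched without any third-party libraries. This is a minimal illustration, not the notebook's pipeline: `TOY_STOP_WORDS`, `simple_preprocess`, and the sample sentence are hypothetical, and whitespace splitting stands in for NLTK's `word_tokenize`.

```python
import re

# Hypothetical miniature stop-word list for illustration only;
# the notebook itself uses NLTK's much larger English list.
TOY_STOP_WORDS = {"the", "and", "is", "of", "in", "a", "to"}


def simple_preprocess(text, stop_words=TOY_STOP_WORDS, min_len=4):
    """Lowercase, strip punctuation, split on whitespace, filter tokens."""
    text = text.lower()                       # normalization: case
    text = re.sub(r"[^\w\s]", "", text)       # normalization: punctuation
    tokens = text.split()                     # tokenization (whitespace only)
    # Drop stop words and tokens shorter than min_len characters
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]


sample = "The factory is the engine of the Industrial Age, and labour pays."
tokens = simple_preprocess(sample)
print(tokens)
```

With `min_len=4` the sketch keeps the same tokens as the `len(w) > 3` filter used earlier, so a three-letter word such as "age" is dropped even though it is not a stop word.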

### Frequency Analysis
- **Term frequency**: How often words appear
- **Word counts**: Basic frequency statistics
- **Common terms**: Most frequent vocabulary
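
As a small illustration of the difference between raw counts and relative frequency, the sketch below uses `collections.Counter` on an invented token list. The tokens and the `relative` dictionary are assumptions made for the example.

```python
from collections import Counter

# Toy token list, invented for illustration
tokens = ["labour", "factory", "labour", "steam", "factory", "labour"]

counts = Counter(tokens)
total = sum(counts.values())

# Relative frequency: share of all tokens taken up by each word
relative = {word: count / total for word, count in counts.items()}

for word, count in counts.most_common():
    print(f"{word:<10} {count:>3}  {relative[word]:.2f}")
```

Relative frequencies are usually the safer basis for comparing documents of different lengths, since a long text produces higher raw counts for almost every word.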

### TF-IDF
- **Term Frequency**: Word frequency in document
- **Inverse Document Frequency**: Rarity across corpus
- **Distinctive terms**: Words frequent in one document but rare across the corpus
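
To make the weighting concrete, here is a hand-rolled sketch of the textbook formula, tf × log(N / df), on three invented token lists. It is not what `TfidfVectorizer` computes exactly: scikit-learn's defaults use a smoothed IDF and L2-normalize each document vector, so the absolute scores differ. The names `docs`, `doc_freq`, and `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Toy corpus of pre-tokenized documents, invented for illustration
docs = [
    ["steam", "engine", "factory"],
    ["factory", "labour", "wages"],
    ["labour", "union", "wages", "labour"],
]

n_docs = len(docs)
# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in docs for term in set(doc))


def tf_idf(doc):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


scores = tf_idf(docs[2])
top_term = max(scores, key=scores.get)
print(top_term, scores)
```

In the third document "labour" occurs twice and "union" once, yet "union" scores higher because it appears in no other document. That is the sense in which TF-IDF favors distinctive terms over merely frequent ones.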

### Temporal Analysis
- **Trends over time**: Vocabulary changes
- **Historical periods**: Era-specific terms
- **Thematic evolution**: Shifting topics
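
Raw counts per year can mislead when some years contain more text than others, and substring matching inflates counts for terms that are prefixes of longer words. The sketch below shows one possible refinement under those assumptions: whole-word matching plus a rate per 1,000 words. The `records` list and both helper functions are hypothetical.

```python
import re
from collections import defaultdict

# Toy (year, text) records, invented for illustration
records = [
    (1855, "Industrial towns grow. Industrialization draws workers."),
    (1855, "The workers demand wages."),
    (1890, "Industrial output doubles."),
]


def count_term(term, text):
    """Count whole-word occurrences of a lowercase term, ignoring case in text."""
    return len(re.findall(rf"\b{re.escape(term)}\b", text.lower()))


def rate_per_1000(term, records):
    """Occurrences of term per 1,000 words, grouped by year."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in records:
        hits[year] += count_term(term, text)
        totals[year] += len(re.findall(r"\w+", text))
    return {year: 1000 * hits[year] / totals[year] for year in sorted(totals)}


rates = rate_per_1000("industrial", records)
print(rates)
```

Here the first 1855 record contains "industrial" once as a whole word, whereas `str.count` would report two matches because of "industrialization".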

### Document Similarity
- **Cosine similarity**: Vector-based comparison
- **Thematic clustering**: Grouping similar texts
- **Relationships**: Connections between documents
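
Cosine similarity compares the direction of two vectors and ignores their length, which is why it suits documents of different sizes. A minimal pure-Python sketch follows; the three vectors are invented term-weight vectors, and `cosine` is a hypothetical helper rather than scikit-learn's `cosine_similarity`.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy term-weight vectors over a three-term vocabulary
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a, twice the length
doc_c = [0.0, 0.0, 3.0]   # shares no terms with doc_a

print(cosine(doc_a, doc_b))
print(cosine(doc_a, doc_c))
```

`doc_a` and `doc_b` score approximately 1 because one is a scaled copy of the other, while `doc_a` and `doc_c` score 0 because they have no terms in common.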

## Next Steps

### Extend the Analysis
- Named entity recognition (people, places)
- Sentiment analysis
- Topic modeling (LDA)
- Network analysis of concepts

### Real Historical Corpora
- **[Project Gutenberg](https://www.gutenberg.org/)**: 70,000+ free books
- **[HathiTrust](https://www.hathitrust.org/)**: Millions of digitized texts
- **[Internet Archive](https://archive.org/)**: Historical documents
- **[Chronicling America](https://chroniclingamerica.loc.gov/)**: Historical newspapers

### Advanced Methods
- Word embeddings (Word2Vec, GloVe)
- Topic modeling at scale
- Neural language models
- Stylometry and authorship attribution

## Resources

- **[NLTK Book](https://www.nltk.org/book/)**: Natural Language Processing with Python
- **[Voyant Tools](https://voyant-tools.org/)**: Web-based text analysis
- **[Programming Historian](https://programminghistorian.org/)**: Digital humanities tutorials