# Historical Text Corpus Analysis with AWS

This notebook demonstrates a complete digital humanities workflow using AWS services:
- Download historical texts from Project Gutenberg
- Upload corpus to S3 with metadata organization
- Process texts with Lambda (NLP analysis)
- Query results from DynamoDB
- Visualize linguistic features and comparative analysis

**Prerequisites:**
- AWS credentials configured
- S3 bucket created
- Lambda function deployed
- DynamoDB table created

See `setup_guide.md` for AWS setup instructions.

## 1. Setup and Configuration

In [None]:
# Import libraries
import json  # needed by the manual Lambda invocation cell in Section 3
import sys
from pathlib import Path

import boto3
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Add scripts directory to path
sys.path.append("../scripts")

# Configuration
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

print("✓ Libraries imported successfully")

In [None]:
# AWS Configuration
# Replace with your bucket name and region
BUCKET_NAME = "text-corpus-your-id"  # Change this!
DYNAMODB_TABLE = "TextAnalysis"
LAMBDA_FUNCTION = "process-text-document"
AWS_REGION = "us-east-1"

# Initialize AWS clients
s3 = boto3.client("s3", region_name=AWS_REGION)
dynamodb = boto3.resource("dynamodb", region_name=AWS_REGION)
lambda_client = boto3.client("lambda", region_name=AWS_REGION)

# Verify access
try:
    s3.head_bucket(Bucket=BUCKET_NAME)
    print(f"✓ Connected to S3 bucket: {BUCKET_NAME}")
except Exception as e:
    print(f"✗ Error connecting to S3: {e}")
    print("  Please update BUCKET_NAME with your actual bucket name")

## 2. Upload Corpus to S3

We'll use the upload script to download texts from Project Gutenberg and upload them to S3.
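
As a rough sketch of the "metadata organization" step, the snippet below shows one way an uploader could derive an S3 key and a user-metadata dictionary from a corpus entry. The `raw/<author>/<title>.txt` layout mirrors the sample key used later in this notebook. The helper names `slugify` and `build_s3_object` are hypothetical, the sample tuple is illustrative, and the actual logic in `upload_to_s3.py` may differ.

```python
import re


def slugify(text: str) -> str:
    """Lowercase the text and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_s3_object(entry: tuple) -> dict:
    """Map a (gutenberg_id, author, title, period, genre) tuple to a key and metadata.

    Hypothetical helper: the key uses the author's surname as a folder.
    """
    gid, author, title, period, genre = entry
    surname = slugify(author.split()[-1])
    return {
        "key": f"raw/{surname}/{slugify(title)}.txt",
        # S3 user metadata values must be strings
        "metadata": {
            "author": author,
            "title": title,
            "period": period,
            "genre": genre,
            "gutenberg-id": str(gid),
        },
    }


# Illustrative entry in the same field order as SAMPLE_CORPUS
obj = build_s3_object((1342, "Jane Austen", "Pride and Prejudice", "Romantic", "Novel"))
print(obj["key"])
```

Keeping the author in the key prefix makes it possible to list one author's texts with a single prefix query.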

In [None]:
# Import upload script
from upload_to_s3 import SAMPLE_CORPUS, TextCorpusUploader

# Initialize uploader
uploader = TextCorpusUploader(bucket_name=BUCKET_NAME, local_dir="./corpus")

print(f"Sample corpus contains {len(SAMPLE_CORPUS)} texts:")
for _gid, author, title, period, _genre in SAMPLE_CORPUS[:5]:
    print(f"  - {author}: {title} ({period})")
print("  ...")

In [None]:
# Upload corpus (this will take 5-10 minutes)
# Uncomment to run (skip if already uploaded)

# stats = uploader.upload_corpus(SAMPLE_CORPUS)
# print(f"\nUpload complete!")
# print(f"  Uploaded: {stats['uploaded']} texts")
# print(f"  Failed: {stats['failed']} texts")

In [None]:
# Verify uploads
texts = uploader.list_uploaded_texts()
print(f"Found {len(texts)} texts in S3")

# Display sample
if texts:
    df_texts = pd.DataFrame(
        [
            {
                "key": t["key"],
                "author": t["metadata"].get("author", "Unknown"),
                "title": t["metadata"].get("title", "Unknown"),
                "period": t["metadata"].get("period", "Unknown"),
                "size_kb": t["size"] / 1024,
            }
            for t in texts
        ]
    )
    print("\nSample texts:")
    print(df_texts.head(10))

## 3. Process Texts with Lambda

Lambda processes texts automatically if an S3 trigger is configured.
Otherwise, we can invoke Lambda manually for each text.
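
For context, a handler that receives S3 notifications typically pulls the bucket name and object key out of each record. The sketch below is an illustration only: `extract_s3_objects` is a hypothetical helper, not the deployed function's code. It accounts for the fact that S3 URL-encodes object keys in event notifications, so a key containing spaces arrives with `+` characters.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list:
    """Return (bucket, key) pairs for every record in an S3-style event."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Decode the URL-encoded key ("+" becomes a space, %XX becomes a character)
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


# Made-up event with an encoded key, as S3 would deliver it
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "text-corpus-your-id"},
                "object": {"key": "raw/austen/pride+and+prejudice.txt"},
            }
        }
    ]
}
print(extract_s3_objects(sample_event))
```

A hand-built test event with a plain key such as `raw/austen/pride-and-prejudice.txt` passes through the decoding step unchanged.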

In [None]:
# Check if S3 trigger is configured
try:
    notification = s3.get_bucket_notification_configuration(Bucket=BUCKET_NAME)
    if "LambdaFunctionConfigurations" in notification:
        print("✓ S3 trigger configured - texts will be processed automatically")
    else:
        print("⚠ S3 trigger not configured - manual processing required")
except Exception as e:
    print(f"Could not check S3 notification: {e}")

In [None]:
# Manually invoke Lambda for a sample text (for testing)
# Uncomment to process a specific text

# sample_key = 'raw/austen/pride-and-prejudice.txt'
# event = {
#     'Records': [{
#         's3': {
#             'bucket': {'name': BUCKET_NAME},
#             'object': {'key': sample_key}
#         }
#     }]
# }

# response = lambda_client.invoke(
#     FunctionName=LAMBDA_FUNCTION,
#     InvocationType='RequestResponse',
#     Payload=json.dumps(event)
# )

# result = json.loads(response['Payload'].read())
# print(f"Lambda response: {result}")

## 4. Query Results from DynamoDB

Retrieve processed results and load into pandas for analysis.
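
One detail worth knowing when loading DynamoDB items into pandas: the boto3 resource interface returns numeric attributes as `decimal.Decimal`, which pandas stores in `object` columns rather than numeric ones. A minimal sketch of a converter follows. `decimals_to_native` is a hypothetical helper, the sample item is made up, and `TextAnalysisQuerier` may already perform an equivalent conversion.

```python
from decimal import Decimal


def decimals_to_native(value):
    """Recursively convert Decimal values to int (if whole) or float."""
    if isinstance(value, Decimal):
        return int(value) if value == value.to_integral_value() else float(value)
    if isinstance(value, dict):
        return {k: decimals_to_native(v) for k, v in value.items()}
    if isinstance(value, list):
        return [decimals_to_native(v) for v in value]
    return value


# Made-up item shaped like a DynamoDB record
item = {
    "title": "Example Text",
    "word_count": Decimal("1200"),
    "vocabulary_richness": Decimal("0.0625"),
}
clean = decimals_to_native(item)
print(clean)
```

Applying the converter to each item before calling `pd.DataFrame(...)` should give numeric columns that work directly with `mean()` and plotting.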

In [None]:
# Import query script
from query_results import TextAnalysisQuerier

# Initialize querier
querier = TextAnalysisQuerier(table_name=DYNAMODB_TABLE)

# Get all documents
documents = querier.get_all_documents()
print(f"Retrieved {len(documents)} analyzed documents from DynamoDB")

if not documents:
    print("⚠ No documents found. Wait for Lambda processing to complete.")
    print(f"  Check CloudWatch logs: aws logs tail /aws/lambda/{LAMBDA_FUNCTION}")

In [None]:
# Convert to DataFrame
if documents:
    df = pd.DataFrame(documents)

    # Display columns
    print("Available columns:")
    print(df.columns.tolist())

    # Display sample
    print("\nSample documents:")
    display_cols = [
        "author",
        "title",
        "period",
        "word_count",
        "vocabulary_richness",
        "avg_sentence_length",
    ]
    display_cols = [col for col in display_cols if col in df.columns]
    print(df[display_cols].head(10))

In [None]:
# Get corpus statistics
stats = querier.get_corpus_statistics()

print("\n" + "=" * 80)
print("CORPUS STATISTICS")
print("=" * 80)
print(f"Total Documents:              {stats.get('total_documents', 0):,}")
print(f"Total Words:                  {stats.get('total_words', 0):,}")
print(f"Average Vocabulary Richness:  {stats.get('avg_vocabulary_richness', 0):.4f}")
print(f"Average Sentence Length:      {stats.get('avg_sentence_length', 0):.2f} words")
print(f"Unique Authors:               {stats.get('authors', 0)}")
print(f"Literary Periods:             {stats.get('periods', 0)}")
print("=" * 80)

## 5. Comparative Analysis

Compare linguistic features across authors and periods.
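
The two headline features plotted in this section can be sketched in a few lines. The plots label vocabulary richness as a type-token ratio, that is, unique words divided by total words. The tokenization and sentence-splitting rules below are simplifying assumptions, and the Lambda function may compute these values differently.

```python
import re


def linguistic_features(text: str) -> dict:
    """Compute simple corpus features from raw text (illustrative rules only)."""
    # Assumption: words are runs of letters/apostrophes, compared case-insensitively
    tokens = re.findall(r"[a-z']+", text.lower())
    # Assumption: sentences end at ".", "!" or "?"
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_count = len(tokens)
    unique_words = len(set(tokens))
    return {
        "word_count": word_count,
        "unique_words": unique_words,
        "vocabulary_richness": unique_words / word_count if word_count else 0.0,
        "avg_sentence_length": word_count / len(sentences) if sentences else 0.0,
    }


features = linguistic_features("The cat sat. The cat ran away!")
print(features)
```

For this sample there are 7 tokens, 5 unique words, and 2 sentences, so the type-token ratio is 5/7 and the average sentence length is 3.5 words.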

In [None]:
# Vocabulary Richness by Author
if documents:
    fig, ax = plt.subplots(figsize=(12, 6))

    author_richness = (
        df.groupby("author")["vocabulary_richness"].mean().sort_values(ascending=False)
    )

    author_richness.plot(kind="bar", ax=ax, color="steelblue")
    ax.set_title("Vocabulary Richness by Author", fontsize=16, fontweight="bold")
    ax.set_xlabel("Author", fontsize=12)
    ax.set_ylabel("Average Type-Token Ratio", fontsize=12)
    ax.grid(axis="y", alpha=0.3)
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

    print("\nVocabulary Richness Rankings:")
    for i, (author, richness) in enumerate(author_richness.items(), 1):
        print(f"{i}. {author}: {richness:.4f}")

In [None]:
# Sentence Length by Period
if documents:
    fig, ax = plt.subplots(figsize=(10, 6))

    period_sentence = (
        df.groupby("period")["avg_sentence_length"].mean().sort_values(ascending=False)
    )

    period_sentence.plot(kind="bar", ax=ax, color="coral")
    ax.set_title("Average Sentence Length by Literary Period", fontsize=16, fontweight="bold")
    ax.set_xlabel("Period", fontsize=12)
    ax.set_ylabel("Average Sentence Length (words)", fontsize=12)
    ax.grid(axis="y", alpha=0.3)
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

    print("\nSentence Length by Period:")
    for period, length in period_sentence.items():
        print(f"{period}: {length:.2f} words")

In [None]:
# Word Count Distribution
if documents:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

    # Histogram
    df["word_count"].hist(bins=20, ax=ax1, color="skyblue", edgecolor="black")
    ax1.set_title("Distribution of Word Counts", fontsize=14, fontweight="bold")
    ax1.set_xlabel("Word Count", fontsize=12)
    ax1.set_ylabel("Number of Documents", fontsize=12)
    ax1.grid(axis="y", alpha=0.3)

    # Box plot by author
    df.boxplot(column="word_count", by="author", ax=ax2)
    ax2.set_title("Word Count by Author", fontsize=14, fontweight="bold")
    ax2.set_xlabel("Author", fontsize=12)
    ax2.set_ylabel("Word Count", fontsize=12)
    plt.suptitle("")  # Remove default title
    plt.xticks(rotation=45, ha="right")

    plt.tight_layout()
    plt.show()

In [None]:
# Vocabulary Richness vs Word Count Scatter
if documents:
    fig, ax = plt.subplots(figsize=(12, 8))

    # Color by author
    for author in df["author"].unique():
        author_df = df[df["author"] == author]
        ax.scatter(
            author_df["word_count"],
            author_df["vocabulary_richness"],
            label=author,
            s=100,
            alpha=0.6,
        )

    ax.set_title("Vocabulary Richness vs Document Length", fontsize=16, fontweight="bold")
    ax.set_xlabel("Word Count", fontsize=12)
    ax.set_ylabel("Vocabulary Richness (Type-Token Ratio)", fontsize=12)
    ax.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
    ax.grid(alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Correlation
    corr = df["word_count"].corr(df["vocabulary_richness"])
    print(f"\nCorrelation between word count and vocabulary richness: {corr:.3f}")
    print("Note: Negative correlation is typical - longer texts tend to have lower TTR")
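
To see why that negative correlation arises, consider a toy text drawn from a fixed vocabulary: once every word type has appeared, the numerator of the type-token ratio stops growing while the denominator keeps increasing. The 50-word vocabulary below is synthetic and purely illustrative.

```python
def type_token_ratio(tokens: list) -> float:
    """Unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Synthetic "text": 50 distinct word types repeated in a cycle
vocabulary = [f"word{i}" for i in range(50)]
tokens = [vocabulary[i % 50] for i in range(500)]

# TTR measured on progressively longer prefixes of the same text
for length in (50, 100, 500):
    print(length, type_token_ratio(tokens[:length]))
```

The ratio falls from 1.0 at 50 tokens to 0.5 at 100 and 0.1 at 500, which is why TTR comparisons between documents of very different lengths should be read with caution.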

## 6. Detailed Analysis by Author

Deep dive into specific authors' linguistic patterns.

In [None]:
# Query specific author
author_name = "Jane Austen"  # Change to analyze different author

author_docs = querier.query_by_author(author_name)
print(f"Found {len(author_docs)} documents by {author_name}")

if author_docs:
    author_df = pd.DataFrame(author_docs)
    print("\nDocuments:")
    display_cols = ["title", "word_count", "vocabulary_richness", "avg_sentence_length"]
    display_cols = [col for col in display_cols if col in author_df.columns]
    print(author_df[display_cols].to_string(index=False))

In [None]:
# Compare Austen vs Dickens
if documents:
    austen = querier.query_by_author("Jane Austen")
    dickens = querier.query_by_author("Charles Dickens")

    if austen and dickens:
        austen_df = pd.DataFrame(austen)
        dickens_df = pd.DataFrame(dickens)

        fig, axes = plt.subplots(2, 2, figsize=(14, 10))

        # Vocabulary Richness
        data = [austen_df["vocabulary_richness"], dickens_df["vocabulary_richness"]]
        axes[0, 0].boxplot(data, labels=["Austen", "Dickens"])
        axes[0, 0].set_title("Vocabulary Richness", fontweight="bold")
        axes[0, 0].set_ylabel("Type-Token Ratio")
        axes[0, 0].grid(axis="y", alpha=0.3)

        # Sentence Length
        data = [austen_df["avg_sentence_length"], dickens_df["avg_sentence_length"]]
        axes[0, 1].boxplot(data, labels=["Austen", "Dickens"])
        axes[0, 1].set_title("Sentence Length", fontweight="bold")
        axes[0, 1].set_ylabel("Average Words per Sentence")
        axes[0, 1].grid(axis="y", alpha=0.3)

        # Word Count
        data = [austen_df["word_count"], dickens_df["word_count"]]
        axes[1, 0].boxplot(data, labels=["Austen", "Dickens"])
        axes[1, 0].set_title("Document Length", fontweight="bold")
        axes[1, 0].set_ylabel("Total Words")
        axes[1, 0].grid(axis="y", alpha=0.3)

        # Unique Words
        data = [austen_df["unique_words"], dickens_df["unique_words"]]
        axes[1, 1].boxplot(data, labels=["Austen", "Dickens"])
        axes[1, 1].set_title("Vocabulary Size", fontweight="bold")
        axes[1, 1].set_ylabel("Unique Words")
        axes[1, 1].grid(axis="y", alpha=0.3)

        plt.suptitle(
            "Jane Austen vs Charles Dickens: Linguistic Comparison", fontsize=16, fontweight="bold"
        )
        plt.tight_layout()
        plt.show()

## 7. Period-Based Analysis

Compare Romantic vs Victorian literature.

In [None]:
# Query by period
romantic_docs = querier.query_by_period("Romantic")
victorian_docs = querier.query_by_period("Victorian")

print(f"Romantic period: {len(romantic_docs)} documents")
print(f"Victorian period: {len(victorian_docs)} documents")

if romantic_docs and victorian_docs:
    romantic_df = pd.DataFrame(romantic_docs)
    victorian_df = pd.DataFrame(victorian_docs)

    print("\nRomantic Period Statistics:")
    print(f"  Avg Vocabulary Richness: {romantic_df['vocabulary_richness'].mean():.4f}")
    print(f"  Avg Sentence Length: {romantic_df['avg_sentence_length'].mean():.2f}")
    print(f"  Avg Word Count: {romantic_df['word_count'].mean():.0f}")

    print("\nVictorian Period Statistics:")
    print(f"  Avg Vocabulary Richness: {victorian_df['vocabulary_richness'].mean():.4f}")
    print(f"  Avg Sentence Length: {victorian_df['avg_sentence_length'].mean():.2f}")
    print(f"  Avg Word Count: {victorian_df['word_count'].mean():.0f}")

In [None]:
# Visualize period comparison
if romantic_docs and victorian_docs:
    fig, axes = plt.subplots(1, 3, figsize=(16, 5))

    # Vocabulary Richness
    data = [romantic_df["vocabulary_richness"], victorian_df["vocabulary_richness"]]
    bp1 = axes[0].boxplot(data, labels=["Romantic", "Victorian"], patch_artist=True)
    for patch, color in zip(bp1["boxes"], ["lightblue", "lightcoral"]):
        patch.set_facecolor(color)
    axes[0].set_title("Vocabulary Richness", fontsize=14, fontweight="bold")
    axes[0].set_ylabel("Type-Token Ratio", fontsize=11)
    axes[0].grid(axis="y", alpha=0.3)

    # Sentence Length
    data = [romantic_df["avg_sentence_length"], victorian_df["avg_sentence_length"]]
    bp2 = axes[1].boxplot(data, labels=["Romantic", "Victorian"], patch_artist=True)
    for patch, color in zip(bp2["boxes"], ["lightblue", "lightcoral"]):
        patch.set_facecolor(color)
    axes[1].set_title("Sentence Length", fontsize=14, fontweight="bold")
    axes[1].set_ylabel("Words per Sentence", fontsize=11)
    axes[1].grid(axis="y", alpha=0.3)

    # Document Length
    data = [romantic_df["word_count"], victorian_df["word_count"]]
    bp3 = axes[2].boxplot(data, labels=["Romantic", "Victorian"], patch_artist=True)
    for patch, color in zip(bp3["boxes"], ["lightblue", "lightcoral"]):
        patch.set_facecolor(color)
    axes[2].set_title("Document Length", fontsize=14, fontweight="bold")
    axes[2].set_ylabel("Total Words", fontsize=11)
    axes[2].grid(axis="y", alpha=0.3)

    plt.suptitle(
        "Romantic vs Victorian Literature: Linguistic Features", fontsize=16, fontweight="bold"
    )
    plt.tight_layout()
    plt.show()

## 8. Export Results

Save analysis results for further research or publication.

In [None]:
# Export to CSV
if documents:
    output_file = "../results/corpus_analysis.csv"
    Path("../results").mkdir(parents=True, exist_ok=True)

    df.to_csv(output_file, index=False)
    print(f"✓ Exported {len(df)} documents to {output_file}")

    # Export summary statistics
    summary_file = "../results/summary_statistics.txt"
    with open(summary_file, "w", encoding="utf-8") as f:
        f.write("CORPUS SUMMARY STATISTICS\n")
        f.write("=" * 80 + "\n\n")
        f.write(f"Total Documents: {len(df)}\n")
        f.write(f"Total Words: {df['word_count'].sum():,}\n")
        f.write(f"Avg Vocabulary Richness: {df['vocabulary_richness'].mean():.4f}\n")
        f.write(f"Avg Sentence Length: {df['avg_sentence_length'].mean():.2f}\n\n")

        f.write("By Author:\n")
        f.write("-" * 80 + "\n")
        for author in sorted(df["author"].unique()):
            author_data = df[df["author"] == author]
            f.write(f"\n{author}:\n")
            f.write(f"  Documents: {len(author_data)}\n")
            f.write(f"  Avg Vocabulary Richness: {author_data['vocabulary_richness'].mean():.4f}\n")
            f.write(f"  Avg Sentence Length: {author_data['avg_sentence_length'].mean():.2f}\n")

    print(f"✓ Exported summary statistics to {summary_file}")

## 9. Next Steps

### Further Analysis
- Download detailed JSON results from S3 for word frequency analysis
- Examine named entities (characters, places) from processed texts
- Create word clouds for individual authors or periods
- Perform topic modeling across the corpus
- Build character co-occurrence networks

### Expand Corpus
- Add more authors and texts
- Include different genres (poetry, drama)
- Analyze translations
- Compare different time periods

### Advanced AWS Features
- Use Athena for complex SQL queries
- Set up automated processing with S3 triggers
- Create dashboards with QuickSight
- Move to Tier 3 for production infrastructure

### Cleanup
When finished, follow `cleanup_guide.md` to delete AWS resources and avoid charges.

In [None]:
# Summary
print("\n" + "=" * 80)
print("ANALYSIS COMPLETE")
print("=" * 80)
n_docs = len(documents) if documents else 0
n_authors = len(df["author"].unique()) if documents else 0
print(f"\nProcessed {n_docs} documents from {n_authors} authors")
print("\nKey findings:")
if documents:
    top_author = df.groupby("author")["vocabulary_richness"].mean().idxmax()
    top_richness = df.groupby("author")["vocabulary_richness"].mean().max()
    print(f"  - Highest vocabulary richness: {top_author} ({top_richness:.4f})")

    longest_author = df.groupby("author")["avg_sentence_length"].mean().idxmax()
    longest_sentences = df.groupby("author")["avg_sentence_length"].mean().max()
    print(f"  - Longest sentences: {longest_author} ({longest_sentences:.2f} words)")

    most_words = df.groupby("author")["word_count"].sum().idxmax()
    total_words = df.groupby("author")["word_count"].sum().max()
    print(f"  - Most prolific: {most_words} ({total_words:,} words)")

print("\nResults exported to ../results/")
print("\nRemember to follow cleanup_guide.md when finished to delete AWS resources!")
print("=" * 80)