# Corpus Linguistics Analysis with AWS

This notebook demonstrates cloud-based corpus linguistics analysis using:
- **S3** for corpus storage
- **Lambda** for automated linguistic annotation
- **DynamoDB** for storing and querying linguistic features

**What you'll do:**
1. Download or prepare a multilingual corpus
2. Upload corpus to S3
3. Trigger Lambda for linguistic analysis
4. Query and visualize results
5. Perform cross-linguistic comparisons

## Setup and Configuration

In [None]:
# Import required libraries
from decimal import Decimal

import boto3
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import seaborn as sns
from wordcloud import WordCloud

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

print("Libraries imported successfully!")

In [None]:
# AWS Configuration
# Replace with your actual bucket name and region
BUCKET_NAME = "linguistic-corpus-YOUR_ID"  # TODO: Update this
DYNAMODB_TABLE = "LinguisticAnalysis"
REGION = "us-east-1"

# Initialize AWS clients
s3_client = boto3.client("s3", region_name=REGION)
dynamodb = boto3.resource("dynamodb", region_name=REGION)
table = dynamodb.Table(DYNAMODB_TABLE)

# Note: creating clients does not contact AWS; errors surface on first use
print(f"Configured S3 bucket: {BUCKET_NAME}")
print(f"Configured DynamoDB table: {DYNAMODB_TABLE}")

## Step 1: Prepare Sample Corpus

We'll create a small multilingual corpus for demonstration.

In [None]:
# Create sample corpus locally
import os
from pathlib import Path

corpus_dir = Path("./sample_corpus")
corpus_dir.mkdir(exist_ok=True)

# Sample texts in different languages and genres
samples = {
    "english/academic": """
    Corpus linguistics employs computational methods to analyze large collections of texts.
    The methodology relies on frequency analysis, collocation detection, and concordance generation.
    Researchers use statistical measures to identify significant linguistic patterns and variations.
    Modern corpus analysis tools enable processing of millions of words in multiple languages.
    Lexical diversity metrics provide insights into vocabulary richness and text complexity.
    """,
    "english/news": """
    Scientists announced a breakthrough in natural language processing technology yesterday.
    The new system can analyze text in dozens of languages simultaneously.
    Experts believe this development will revolutionize translation and language learning.
    Several universities have already requested access to the research data.
    The findings were published in a leading computational linguistics journal.
    """,
    "english/fiction": """
    Emma walked through the library, running her fingers along the old books.
    The scent of aged paper and leather filled the quiet room.
    She found the volume she was looking for on the highest shelf.
    Its pages contained stories from centuries past, waiting to be discovered.
    Outside, the rain began to fall softly against the windows.
    """,
    "spanish/academic": """
    La lingüística de corpus estudia el lenguaje mediante análisis de grandes colecciones de textos.
    Los métodos cuantitativos permiten identificar patrones frecuentes en el uso lingüístico.
    Las colocaciones revelan combinaciones de palabras estadísticamente significativas.
    Los investigadores emplean software especializado para procesar millones de palabras.
    La diversidad léxica indica la riqueza vocabular de un texto o corpus.
    """,
    "spanish/news": """
    El gobierno anunció nuevas políticas educativas para mejorar la enseñanza de idiomas.
    Los expertos consideran que el aprendizaje de lenguas extranjeras es fundamental.
    Las escuelas recibirán recursos adicionales para programas multilingües.
    Los estudiantes podrán elegir entre varios idiomas según sus intereses.
    La comunidad educativa ha recibido positivamente las reformas propuestas.
    """,
    "french/academic": """
    La linguistique de corpus analyse de grandes quantités de données textuelles.
    Les chercheurs utilisent des méthodes statistiques pour identifier des tendances linguistiques.
    Les collocations montrent des associations récurrentes entre les mots.
    L'analyse automatique permet de traiter des millions de mots rapidement.
    La diversité lexicale mesure la richesse du vocabulaire utilisé dans un texte.
    """,
}

# Write sample files
for path, content in samples.items():
    file_path = corpus_dir / f"{path}.txt"
    file_path.parent.mkdir(parents=True, exist_ok=True)

    with open(file_path, "w", encoding="utf-8") as f:
        f.write(content.strip())

print(f"Created {len(samples)} sample texts in {corpus_dir}")
print("\nCorpus structure:")
for root, _dirs, files in os.walk(corpus_dir):
    level = root.replace(str(corpus_dir), "").count(os.sep)
    indent = " " * 2 * level
    print(f"{indent}{os.path.basename(root)}/")
    subindent = " " * 2 * (level + 1)
    for file in files:
        print(f"{subindent}{file}")

## Step 2: Upload Corpus to S3

Upload the corpus files to S3, organizing by language and genre.
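
As a rough sketch of the key layout used here, the helpers below map a local file to an S3 key and recover language and genre from that key. The names `to_s3_key` and `parse_s3_key` are hypothetical, and the assumption that language and genre can be read back from the `raw/<language>/<genre>.txt` layout is ours, not something the Lambda function is confirmed to do.

```python
from pathlib import Path, PurePosixPath


def to_s3_key(local_path, corpus_root, prefix="raw"):
    """Map a local file path to an S3 key that always uses forward slashes."""
    relative = Path(local_path).relative_to(corpus_root)
    return f"{prefix}/{relative.as_posix()}"


def parse_s3_key(s3_key):
    """Split a key shaped like <prefix>/<language>/<genre>.txt (assumed layout)."""
    parts = PurePosixPath(s3_key).parts
    if len(parts) != 3:
        raise ValueError(f"Unexpected key layout: {s3_key}")
    _prefix, language, filename = parts
    return {"language": language, "genre": PurePosixPath(filename).stem}


key = to_s3_key("sample_corpus/english/news.txt", "sample_corpus")
print(key)
print(parse_s3_key(key))
```

Building keys with `as_posix()` keeps the separators consistent even when the notebook runs on Windows, where local paths use backslashes.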

In [None]:
# Upload corpus files to S3

uploaded_files = []

for root, _dirs, files in os.walk(corpus_dir):
    for file in files:
        if file.endswith(".txt"):
            local_path = os.path.join(root, file)

            # Create S3 key maintaining structure. S3 keys use "/" as the
            # separator, so normalise Windows-style backslashes.
            relative_path = Path(local_path).relative_to(corpus_dir).as_posix()
            s3_key = f"raw/{relative_path}"

            # Upload file
            s3_client.upload_file(local_path, BUCKET_NAME, s3_key)
            uploaded_files.append(s3_key)
            print(f"Uploaded: {s3_key}")

print(f"\nTotal files uploaded: {len(uploaded_files)}")

## Step 3: Trigger Lambda Processing

If you configured an S3 event trigger on the bucket, Lambda processes each file automatically as it is uploaded.
Otherwise, you can invoke the Lambda function manually for each file.
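
For a manual invocation, the payload only needs to mimic the part of an S3 event the function reads. The sketch below builds that minimal structure; `build_s3_event` is a hypothetical helper, and it assumes the handler only looks at the bucket name and object key.

```python
import json


def build_s3_event(bucket, key):
    """Build a minimal S3-style event with one record (illustrative subset)."""
    return {
        "Records": [
            {
                "s3": {
                    "bucket": {"name": bucket},
                    "object": {"key": key},
                }
            }
        ]
    }


# Serialise the event the way lambda_client.invoke(Payload=...) expects it
payload = json.dumps(build_s3_event("linguistic-corpus-YOUR_ID", "raw/english/news.txt"))
print(payload)
```

Real S3 notifications URL-encode the object key, so a handler that decodes keys with `urllib.parse.unquote_plus` will treat a literal `+` in a hand-built key as a space.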

In [None]:
# Wait for Lambda to process files (if using S3 triggers)
import time

print("Waiting for Lambda to process files...")
time.sleep(30)  # Wait 30 seconds for processing
print("Processing should be complete. Let's check DynamoDB...")

In [None]:
# Alternative: Manually invoke Lambda for each file
import json  # required by json.dumps in the block below

lambda_client = boto3.client("lambda", region_name=REGION)

# Uncomment to manually invoke Lambda
# for s3_key in uploaded_files:
#     event = {
#         'Records': [{
#             's3': {
#                 'bucket': {'name': BUCKET_NAME},
#                 'object': {'key': s3_key}
#             }
#         }]
#     }
#
#     response = lambda_client.invoke(
#         FunctionName='analyze-linguistic-corpus',
#         InvocationType='Event',  # Async
#         Payload=json.dumps(event)
#     )
#     print(f"Invoked Lambda for {s3_key}")

## Step 4: Query Results from DynamoDB

Retrieve linguistic analysis results.
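
A scan returns at most one page of results per call, so the results have to be collected page by page. The sketch below wraps that loop in a generator; `scan_all` is a hypothetical helper, and `FakeTable` is only a stand-in so the example runs without AWS credentials.

```python
def scan_all(table, **scan_kwargs):
    """Yield every item from a table, following LastEvaluatedKey across pages."""
    while True:
        response = table.scan(**scan_kwargs)
        yield from response.get("Items", [])
        last_key = response.get("LastEvaluatedKey")
        if last_key is None:
            break
        scan_kwargs["ExclusiveStartKey"] = last_key


class FakeTable:
    """Stand-in for a boto3 Table that serves pre-built pages (illustration only)."""

    def __init__(self, pages):
        self._pages = pages

    def scan(self, **kwargs):
        index = kwargs.get("ExclusiveStartKey", {}).get("page", 0)
        response = {"Items": self._pages[index]}
        if index + 1 < len(self._pages):
            response["LastEvaluatedKey"] = {"page": index + 1}
        return response


fake = FakeTable([[{"text_id": "a"}], [{"text_id": "b"}]])
all_items = list(scan_all(fake))
print(len(all_items))
```

With a real table, extra keyword arguments are passed straight through to `scan`, so a `FilterExpression` built with `boto3.dynamodb.conditions.Attr` could be supplied the same way.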

In [None]:
# Helper function to convert Decimal to float
def decimal_to_float(obj):
    """Convert DynamoDB Decimal types to float."""
    if isinstance(obj, Decimal):
        return float(obj)
    elif isinstance(obj, dict):
        return {k: decimal_to_float(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [decimal_to_float(v) for v in obj]
    return obj


# Scan all results
response = table.scan()
items = response["Items"]

# Handle pagination
while "LastEvaluatedKey" in response:
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])

# Convert Decimals
items = [decimal_to_float(item) for item in items]

print(f"Retrieved {len(items)} texts from DynamoDB")

# Convert to DataFrame (always define df so later cells can check df.empty)
df = pd.json_normalize(items) if items else pd.DataFrame()
if not df.empty:
    print("\nDataFrame columns:")
    print(df.columns.tolist())
else:
    print("\nNo items found in DynamoDB. Make sure Lambda has processed the files.")

In [None]:
# Display basic statistics
if not df.empty:
    summary_cols = [
        "text_id",
        "language",
        "genre",
        "word_count",
        "sentence_count",
        "unique_words",
        "lexical_diversity.ttr",
    ]
    available_cols = [col for col in summary_cols if col in df.columns]
    print(df[available_cols].to_string())
else:
    print("No data to display")

## Step 5: Visualize Word Frequencies

Create word clouds for each language/genre combination.
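
The plotting cell expects `top_words` to be a list of dictionaries with `word` and `freq` keys. As a minimal sketch of how such a list could be produced, the hypothetical `top_words` function below counts lowercase word tokens; the tokenization rule is an assumption and may differ from what the Lambda function does.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words as [{"word": ..., "freq": ...}, ...]."""
    tokens = re.findall(r"\w+", text.lower())
    return [{"word": word, "freq": freq} for word, freq in Counter(tokens).most_common(n)]


print(top_words("the cat and the hat", n=1))
```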

In [None]:
# Generate word clouds
if not df.empty and "top_words" in df.columns:
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()

    for idx, row in df.iterrows():
        if idx >= 6:  # Limit to 6 plots
            break

        # Extract word frequencies
        top_words = row.get("top_words", [])
        if top_words and isinstance(top_words, list):
            word_freq = {item["word"]: item["freq"] for item in top_words if isinstance(item, dict)}

            if word_freq:
                # Generate word cloud
                wordcloud = WordCloud(
                    width=400, height=300, background_color="white"
                ).generate_from_frequencies(word_freq)

                # Plot
                axes[idx].imshow(wordcloud, interpolation="bilinear")
                axes[idx].axis("off")
                title = f"{row.get('language', 'Unknown')} - {row.get('genre', 'Unknown')}"
                axes[idx].set_title(title, fontsize=12, fontweight="bold")

    # Hide subplots that received no word cloud (unused slots or skipped rows)
    for ax in axes:
        if not ax.images:
            ax.axis("off")

    plt.tight_layout()
    plt.savefig("wordclouds.png", dpi=150, bbox_inches="tight")
    plt.show()
    print("Word clouds saved to wordclouds.png")
else:
    print("No word frequency data available for visualization")

## Step 6: Lexical Diversity Comparison

Compare lexical diversity across languages and genres.
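
For reference, the type-token ratio (TTR) is the number of distinct words divided by the total number of words, and the moving-average variant (MATTR) averages TTR over fixed-size sliding windows to reduce the effect of text length. The sketch below is a minimal local version; the function names, the tokenizer, and the default window size of 50 are assumptions and may not match the Lambda implementation.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (simple assumption)."""
    return re.findall(r"\w+", text.lower())


def ttr(tokens):
    """Type-token ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mattr(tokens, window=50):
    """Mean TTR over all sliding windows; falls back to plain TTR for short texts."""
    if len(tokens) <= window:
        return ttr(tokens)
    scores = [ttr(tokens[i : i + window]) for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)


tokens = tokenize("the cat saw the cat")
print(ttr(tokens))
print(mattr(tokens, window=4))
```

Because TTR tends to fall as texts get longer, comparisons across texts of very different lengths are usually safer with a windowed measure such as MATTR.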

In [None]:
# Plot lexical diversity metrics
if not df.empty and "lexical_diversity.ttr" in df.columns:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # By language
    df.boxplot(column="lexical_diversity.ttr", by="language", ax=axes[0])
    axes[0].set_title("Lexical Diversity (TTR) by Language")
    axes[0].set_xlabel("Language")
    axes[0].set_ylabel("Type-Token Ratio")
    axes[0].get_figure().suptitle("")  # Remove default title

    # By genre
    df.boxplot(column="lexical_diversity.ttr", by="genre", ax=axes[1])
    axes[1].set_title("Lexical Diversity (TTR) by Genre")
    axes[1].set_xlabel("Genre")
    axes[1].set_ylabel("Type-Token Ratio")
    axes[1].get_figure().suptitle("")  # Remove default title

    plt.tight_layout()
    plt.savefig("lexical_diversity.png", dpi=150, bbox_inches="tight")
    plt.show()
    print("Lexical diversity plots saved to lexical_diversity.png")
else:
    print("No lexical diversity data available")

## Step 7: POS Distribution Analysis

Analyze part-of-speech distributions across texts.
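
Raw tag counts are not comparable across texts of different lengths, so the analysis below converts them to percentages first. A minimal sketch of that normalization for a single text, using a hypothetical `pos_percentages` helper and made-up tag counts:

```python
def pos_percentages(pos_counts):
    """Convert a {tag: count} mapping into {tag: percentage of all tags}."""
    total = sum(pos_counts.values())
    if total == 0:
        return {tag: 0.0 for tag in pos_counts}
    return {tag: 100 * count / total for tag, count in pos_counts.items()}


# Illustrative counts only, not output from the Lambda function
print(pos_percentages({"NOUN": 2, "VERB": 1, "DET": 1}))
```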

In [None]:
# Extract POS distributions
if not df.empty and "pos_distribution" in df.columns:
    pos_data = []

    for _, row in df.iterrows():
        pos_dist = row.get("pos_distribution", {})
        if isinstance(pos_dist, dict):
            for pos, count in pos_dist.items():
                pos_data.append(
                    {
                        "text_id": row.get("text_id", ""),
                        "language": row.get("language", ""),
                        "genre": row.get("genre", ""),
                        "pos": pos,
                        "count": count,
                    }
                )

    if pos_data:
        pos_df = pd.DataFrame(pos_data)

        # Calculate each tag's share of its text's total POS count
        totals = pos_df.groupby("text_id")["count"].transform("sum")
        pos_df["percentage"] = pos_df["count"] / totals * 100

        # Plot POS distribution by genre
        pivot = pos_df.pivot_table(
            index="pos", columns="genre", values="percentage", aggfunc="mean"
        )

        pivot.plot(kind="bar", figsize=(12, 6))
        plt.title("Average POS Distribution by Genre", fontsize=14, fontweight="bold")
        plt.xlabel("Part of Speech")
        plt.ylabel("Percentage")
        plt.legend(title="Genre")
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.savefig("pos_distribution.png", dpi=150, bbox_inches="tight")
        plt.show()
        print("POS distribution plot saved to pos_distribution.png")
    else:
        print("No POS data to visualize")
else:
    print("No POS distribution data available")

## Step 8: Collocation Network Visualization

Visualize top collocations as a network graph.
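
The edge weights in the network are pointwise mutual information (PMI) scores, which compare how often two words occur together with how often they would co-occur by chance. The sketch below shows one common way to estimate PMI for adjacent word pairs; `bigram_pmi` is a hypothetical helper, and the exact estimator used by the Lambda function is not known here.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=1):
    """Return {"w1 w2": PMI} for adjacent pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[f"{w1} {w2}"] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


scores = bigram_pmi("a b a b".split())
print(scores)
```

PMI is known to favour rare pairs, which is why a minimum count threshold such as `min_count` is often applied before ranking collocations.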

In [None]:
# Create collocation network for English academic texts
if not df.empty and "collocations.bigrams" in df.columns:
    # Filter for English academic texts
    english_academic = df[(df["language"] == "english") & (df["genre"] == "academic")]

    if not english_academic.empty:
        G = nx.Graph()

        for _, row in english_academic.iterrows():
            bigrams = row.get("collocations.bigrams", [])
            if isinstance(bigrams, list):
                for bigram_data in bigrams[:5]:  # Top 5 collocations
                    if isinstance(bigram_data, dict):
                        bigram = bigram_data.get("bigram", "")
                        pmi = bigram_data.get("pmi", 0)

                        if " " in bigram:
                            word1, word2 = bigram.split(" ", 1)
                            G.add_edge(word1, word2, weight=pmi)

        if G.number_of_nodes() > 0:
            # Draw network
            plt.figure(figsize=(12, 8))
            pos = nx.spring_layout(G, k=2, iterations=50)

            # Draw nodes
            nx.draw_networkx_nodes(G, pos, node_size=1000, node_color="lightblue", alpha=0.7)

            # Draw edges with varying thickness (PMI can be zero or negative,
            # so clamp the width to a small positive minimum)
            edges = G.edges()
            weights = [G[u][v]["weight"] for u, v in edges]
            nx.draw_networkx_edges(G, pos, width=[max(w / 2, 0.5) for w in weights], alpha=0.5)

            # Draw labels
            nx.draw_networkx_labels(G, pos, font_size=10, font_weight="bold")

            plt.title("Collocation Network (English Academic)", fontsize=14, fontweight="bold")
            plt.axis("off")
            plt.tight_layout()
            plt.savefig("collocation_network.png", dpi=150, bbox_inches="tight")
            plt.show()
            print("Collocation network saved to collocation_network.png")
        else:
            print("No collocation data to visualize")
    else:
        print("No English academic texts found")
else:
    print("No collocation data available")

## Step 9: Syntactic Complexity Analysis

Compare syntactic complexity across languages and genres.
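
Average sentence length is the simplest of these measures: the number of words divided by the number of sentences. A minimal sketch follows, assuming sentences end with `.`, `!` or `?` followed by whitespace; the hypothetical `avg_sentence_length` function will mis-split abbreviations, and the Lambda function may segment sentences differently.

```python
import re


def avg_sentence_length(text):
    """Mean number of word tokens per sentence, using a naive sentence splitter."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    word_counts = [len(re.findall(r"\w+", sentence)) for sentence in sentences]
    return sum(word_counts) / len(sentences)


print(avg_sentence_length("Emma walked home. It rained."))
```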

In [None]:
# Plot syntactic complexity
if not df.empty and "syntactic_complexity.avg_sentence_length" in df.columns:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Average sentence length by language
    df.groupby("language")["syntactic_complexity.avg_sentence_length"].mean().plot(
        kind="bar", ax=axes[0], color="steelblue"
    )
    axes[0].set_title("Average Sentence Length by Language")
    axes[0].set_xlabel("Language")
    axes[0].set_ylabel("Words per Sentence")
    axes[0].tick_params(axis="x", rotation=45)

    # Average sentence length by genre
    df.groupby("genre")["syntactic_complexity.avg_sentence_length"].mean().plot(
        kind="bar", ax=axes[1], color="coral"
    )
    axes[1].set_title("Average Sentence Length by Genre")
    axes[1].set_xlabel("Genre")
    axes[1].set_ylabel("Words per Sentence")
    axes[1].tick_params(axis="x", rotation=45)

    plt.tight_layout()
    plt.savefig("syntactic_complexity.png", dpi=150, bbox_inches="tight")
    plt.show()
    print("Syntactic complexity plots saved to syntactic_complexity.png")
else:
    print("No syntactic complexity data available")

## Step 10: Export Results

Export analysis results to CSV for further analysis.

In [None]:
# Export summary to CSV
if not df.empty:
    export_cols = [
        "text_id",
        "language",
        "genre",
        "word_count",
        "sentence_count",
        "unique_words",
        "avg_word_length",
        "lexical_diversity.ttr",
        "lexical_diversity.mattr",
        "syntactic_complexity.avg_sentence_length",
    ]

    available_export_cols = [col for col in export_cols if col in df.columns]
    df[available_export_cols].to_csv("corpus_analysis_results.csv", index=False)
    print("Results exported to corpus_analysis_results.csv")
    print(f"Columns: {', '.join(available_export_cols)}")
else:
    print("No data to export")

## Summary

In this notebook, you've:
1. Created a multilingual corpus
2. Uploaded it to S3
3. Processed texts with Lambda for linguistic annotation
4. Queried results from DynamoDB
5. Visualized word frequencies, lexical diversity, POS distributions, and collocations
6. Compared syntactic complexity across languages and genres
7. Exported results for further analysis

**Next Steps:**
- Add more texts to your corpus
- Experiment with different languages
- Perform diachronic analysis (language change over time)
- Compare register differences (formal vs. informal)
- Move to Tier 3 for production-scale corpus analysis