# The Lexos Corpus Module

This tutorial walks you through the corpus module: managing document collections, generating corpus statistics, and integrating with other Lexos modules.

<a id="setup"></a>
## Setup

First, let's import everything you'll need and set up the environment:

In [None]:
# Core imports
import tempfile
import time
from pathlib import Path
import matplotlib.pyplot as plt

# Lexos imports
from lexos.corpus import Corpus, Record
from lexos.corpus.corpus_stats import CorpusStats
from lexos.io import Loader  # For UTF-8 compliant file loading

# Suppress warnings for cleaner output
# warnings.filterwarnings("ignore", message=".*Expected `str` but got `UUID`.*")
# warnings.filterwarnings("ignore", category=UserWarning, module="pydantic")


## Creating and Managing a Corpus

A corpus is a collection of documents that you want to analyze together. Let's create one:

In [None]:
# CHANGE THIS: Replace with your project name and directory
corpus_name = "My Research Corpus"  # Your project name
corpus_directory = tempfile.mkdtemp()  # Or use a permanent directory like "./my_corpus"

# Create the corpus
corpus = Corpus(
    name=corpus_name,
    corpus_dir=corpus_directory
)

print(f"📚 Created corpus: {corpus.name}")
print(f"📁 Storage location: {corpus.corpus_dir}")
print(f"📊 Current state:")
print(f"   - Documents: {corpus.num_docs}")
print(f"   - Active documents: {corpus.num_active_docs}")
print(f"   - Total tokens: {corpus.num_tokens}")

## Adding Documents

Now let's add your documents to the corpus. You can add documents from text strings, files, or any other source. Note that we created the corpus in the previous cell. If you run multiple cells below (or the same cells multiple times), new records will be added to the existing corpus.
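If you expect to re-run cells, one way to avoid accumulating duplicate records is to check the existing names before adding. The sketch below is a hypothetical helper, not part of Lexos: `add_if_new` and the `_StubCorpus` stand-in are illustrative names, and the helper assumes only the `corpus.records`, `record.name`, and `corpus.add(content=..., name=...)` usage shown in this tutorial.

```python
from types import SimpleNamespace


def add_if_new(corpus, name, content, **kwargs):
    """Add a record only if no existing record already uses `name`.

    Hypothetical helper: it relies only on `corpus.records`, `record.name`,
    and `corpus.add(...)` as they are used elsewhere in this tutorial.
    """
    existing_names = {record.name for record in corpus.records.values()}
    if name in existing_names:
        return False
    corpus.add(content=content, name=name, **kwargs)
    return True


class _StubCorpus:
    """Stand-in exposing the same `records` / `add` surface, for demonstration only."""

    def __init__(self):
        self.records = {}

    def add(self, content, name, **kwargs):
        # Use a simple running integer as the record id
        self.records[len(self.records)] = SimpleNamespace(name=name, content=content)


stub = _StubCorpus()
first = add_if_new(stub, "sample_news", "Local authorities announced today ...")
second = add_if_new(stub, "sample_news", "Local authorities announced today ...")
print(first, second, len(stub.records))
```

With a real corpus you would pass your `Corpus` instance instead of the stub.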

### Option A: Add From Text Strings

The cell below provides a dictionary of sample texts with text names as the keys and the text strings as the values. You can replace them with your own texts.

In [None]:
sample_texts = {
    "sample_literature": "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort.",
    "sample_academic": "The rapid advancement of artificial intelligence has profound implications for various sectors of society. Machine learning algorithms are increasingly being deployed in healthcare, finance, and education, raising important questions about accountability, transparency, and fairness in automated decision-making systems.",
    "sample_news": "Local authorities announced today that the new community center will open next month. The facility will feature a library, gymnasium, and meeting spaces for local organizations. Residents have been eagerly anticipating the opening since construction began two years ago.",
    "sample_dialogue": "\"How are you today?\" she asked with a smile. \"I'm doing well, thank you,\" he replied. \"The weather is beautiful, isn't it?\" \"Yes, perfect for a walk in the park,\" she agreed enthusiastically.",
    "sample_technical": "The implementation requires instantiation of the primary data structure followed by iterative processing of the input parameters. Error handling mechanisms must be incorporated to ensure robust operation under various edge case conditions and unexpected input variations."
}

# Add sample texts to corpus
for record_name, text_content in sample_texts.items():
    corpus.add(
        content=text_content,
        name=record_name,
        is_active=True,
        metadata={'type': 'sample', 'category': record_name.split('_')[1]}
    )

print(f"✓ Added {len(sample_texts)} records")

print(f"\n📊 Corpus Status:")
print(f"   - Total records: {corpus.num_docs}")
print(f"   - Active records: {corpus.num_active_docs}")

### Option B: Add From Text Files

To add records from files, add the paths to your files to the `file_paths` list and then run the cell below.

In [None]:
# Replace these example file paths with your actual files
file_paths = [
    # "path/to/your/record1.txt",
    # "path/to/your/record2.txt",
    # "path/to/your/record3.txt",
]

# Create a loader instance and load the files
loader = Loader()
loader.load(file_paths)

# Create a record from each loaded file
for i, doc in enumerate(loader.texts):
    record_name = Path(file_paths[i]).stem  # Use filename without extension
    # Take the category from the part after the first underscore, if there is one
    name_parts = record_name.split('_')
    category = name_parts[1] if len(name_parts) > 1 else 'unknown'
    corpus.add(
        content=doc,
        name=record_name,
        is_active=True,
        metadata={'type': 'sample', 'category': category, 'source_file': file_paths[i]}
    )
    print(f"✓ Added: {record_name}")

print(f"✓ Added {len(loader.texts)} records")

print(f"\n📊 Corpus Status:")
print(f"   - Total records: {corpus.num_docs}")
print(f"   - Active records: {corpus.num_active_docs}")

### Option C: Add Processed spaCy Documents

In the example below, we create spaCy `Doc` versions of our sample texts and add them to the corpus.

In [None]:
# Import the Lexos Tokenizer
from lexos.tokenizer import Tokenizer

sample_texts = {
    "sample_literature": "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort.",
    "sample_academic": "The rapid advancement of artificial intelligence has profound implications for various sectors of society. Machine learning algorithms are increasingly being deployed in healthcare, finance, and education, raising important questions about accountability, transparency, and fairness in automated decision-making systems.",
    "sample_news": "Local authorities announced today that the new community center will open next month. The facility will feature a library, gymnasium, and meeting spaces for local organizations. Residents have been eagerly anticipating the opening since construction began two years ago.",
    "sample_dialogue": "\"How are you today?\" she asked with a smile. \"I'm doing well, thank you,\" he replied. \"The weather is beautiful, isn't it?\" \"Yes, perfect for a walk in the park,\" she agreed enthusiastically.",
    "sample_technical": "The implementation requires instantiation of the primary data structure followed by iterative processing of the input parameters. Error handling mechanisms must be incorporated to ensure robust operation under various edge case conditions and unexpected input variations."
}

# Convert the texts to spaCy Docs
tokenizer = Tokenizer(model="en_core_web_sm")
names = list(sample_texts.keys())
docs = [tokenizer.make_doc(text) for text in sample_texts.values()]
sample_docs = dict(zip(names, docs))

# Add docs to corpus
for record_name, doc in sample_docs.items():
    corpus.add(
        content=doc,
        name=record_name,
        is_active=True,
        model="en_core_web_sm",
        metadata={"type": "sample", "category": record_name.split("_")[1]}
    )

# Get the number of parsed docs in the corpus
num_parsed_docs = sum([1 for r in corpus.records.values() if r.is_parsed])

print(f"✓ Added {len(sample_docs)} records")

print(f"\n📊 Corpus Status:")
print(f"   - Total records: {corpus.num_docs}")
print(f"   - Active records: {corpus.num_active_docs}")
print(f"   - Parsed docs in corpus: {num_parsed_docs}")


## Working with Your Corpus

When you add a record to a corpus, it is stored in a dict, which you can access with `corpus.records`. The keys are the records' ids and the values are the `Record` objects. It is often easier to iterate over the corpus itself, which yields just the `Record` objects:

In [None]:
for record in corpus:
    print(f"- {record.name}: {record.preview}")

To remove a record, use `remove()`, passing it the `id` or `name` of the record. Be careful when using names; if there are duplicate record names, all records with the name you supply will be deleted.
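Before removing by name, it can help to check whether that name is shared by several records. The following is a minimal sketch, not a Lexos function: `duplicate_names` is a hypothetical helper that only assumes each record has a `name` attribute, and the demo records are stand-ins.

```python
from collections import Counter
from types import SimpleNamespace


def duplicate_names(records):
    """Return the record names that occur more than once, with their counts."""
    counts = Counter(record.name for record in records)
    return {name: n for name, n in counts.items() if n > 1}


# Stand-in records for demonstration. With a real corpus, pass `corpus` itself,
# since iterating over it yields `Record` objects.
demo_records = [
    SimpleNamespace(name="sample_news"),
    SimpleNamespace(name="sample_dialogue"),
    SimpleNamespace(name="sample_news"),
]
dupes = duplicate_names(demo_records)
print(dupes)
```

If a name appears in the result, removing by id rather than by name avoids deleting more records than intended.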

Use the `set()` method to set properties of a record based on its id, for instance `corpus.set("3eb16839-ab0d-4f5b-ae85-39561d687a6e", is_active=False)`.

### Finding Records

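To find individual records, use the `get()` method. If there is a single match, the method returns a `Record` object; if there are multiple matches, it returns a list of `Record` objects. The cell below wraps a single result in a list so that the loop works in either case.

In [None]:
result = corpus.get(name="sample_literature")
# get() returns a single Record or a list of Records, so normalize to a list
records = result if isinstance(result, list) else [result]
for record in records:
    # print(record.id, record.name)
    print(repr(record))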

To search for records by metadata, use `filter_records()`:

In [None]:
result = corpus.filter_records(category="technical")
for record in result:
    print(record)

Note that `filter_records()` has some limitations, especially when filtering on multiple categories. Depending on your needs, you may have to run a broad search and then narrow down the results programmatically. For more complex queries, you can use a database; see the separate tutorial on using SQLite.
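Narrowing results programmatically can be as simple as a list comprehension over the records. The sketch below is an illustration under assumptions: `filter_by_categories` is a hypothetical helper, the demo records are stand-ins, and it assumes each record exposes its metadata as a dict on `record.meta`, as the statistics cells later in this tutorial do.

```python
from types import SimpleNamespace


def filter_by_categories(records, categories, key="category"):
    """Keep records whose metadata value for `key` is one of `categories`."""
    wanted = set(categories)
    return [record for record in records if record.meta.get(key) in wanted]


# Stand-in records for demonstration; with a real corpus, pass `corpus` itself.
demo_records = [
    SimpleNamespace(name="sample_news", meta={"type": "sample", "category": "news"}),
    SimpleNamespace(name="sample_dialogue", meta={"type": "sample", "category": "dialogue"}),
    SimpleNamespace(name="sample_technical", meta={"type": "sample", "category": "technical"}),
]
matches = filter_by_categories(demo_records, ["news", "technical"])
print([record.name for record in matches])
```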

## Displaying Your Corpus in a Table

You can display your corpus as a pandas DataFrame with the `to_df()` method. This method has an `exclude` parameter which takes a list of columns to hide (by default, `["content", "terms", "tokens"]`). If you wish to exclude metadata fields with the same name as model fields, you can use the prefix "metadata_" to avoid conflicts.

In [None]:
corpus.to_df(exclude=["content", "terms", "tokens", "preview"]).head()

## Saving Your Corpus to Disk

To save your entire corpus to a zip file, use the `save()` method:

```python
corpus.save(path="path/to/your/zip/file")
```

You can load a zipped corpus back into a `Corpus` instance with the `load()` method:

```python
corpus.load(path="path/to/your/zip/file")
```

If you supply a `corpus_dir`, the new unzipped corpus files will be saved to that directory; otherwise, the current directory will be used. If your records have a language model, set `cache=True` so that it will only be loaded once.

## Generating Corpus Statistics

You can get a quick list of the term counts in your corpus with the `term_counts()` method. The cell below demonstrates the available keyword parameters.


In [None]:
print("10 Most common terms in the corpus:")
print(corpus.term_counts())

print()
print("5 Least common terms in the corpus:")
print(corpus.term_counts(n=5, most_common=False))
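
To make the idea behind these counts concrete, here is a minimal sketch of counting terms with the standard library. It is not the Lexos implementation: the function below is a stand-alone illustration that reuses the parameter names `n` and `most_common` from the cell above, and the real method's tokenization and ordering of tied terms may differ.

```python
from collections import Counter


def term_counts(tokens, n=10, most_common=True):
    """Return the n most (or least) frequent terms as (term, count) pairs."""
    ranked = Counter(tokens).most_common()  # sorted by descending count
    if most_common:
        return ranked[:n]
    return ranked[::-1][:n]  # reverse the ranking so the rarest terms come first


tokens = "the cat sat on the mat the end".split()
top = term_counts(tokens, n=1)
rare = term_counts(tokens, n=2, most_common=False)
print(top)
print(rare)
```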

The Corpus module also lets you generate a range of more advanced statistics about your data. These can help you:

- **Understand the quality of your corpus:** Is your corpus large enough for meaningful analysis? Are your documents similar in length, or do you have problematic outliers? Do you have enough vocabulary diversity for reliable results?
- **Compare different types of texts:** Are there meaningful differences between your document groups? Which documents are most similar to or different from each other? Do certain authors, genres, or time periods show distinct patterns?
- **Examine distinct linguistic patterns:** How rich and diverse is the vocabulary in your corpus? Are there distinctive terms or linguistic features? Are there unusual language patterns worth investigating?
- **Decide how to prepare data for advanced analysis:** Is your data suitable for the analysis techniques you want to use? How should you handle outliers and unusual documents? What quality thresholds should you set for your analysis?

A best practice is to begin by assessing the quality of your corpus.

## Examining Corpus Quality

In the cell below, we use the `get_stats()` method to generate a `CorpusStats` instance. We then print various properties of the `CorpusStats` class, along with statistical properties of the `Corpus` class.

In [None]:
# Generate comprehensive corpus statistics
print("📊 BASIC CORPUS ANALYSIS")
print("=" * 50)

# Get statistics (this may take a moment for large corpora)
stats = corpus.get_stats(active_only=True)

print(f"✓ Analysis complete for {len(stats.docs)} records")
print(f"✓ Working with {len(stats.dtm.sorted_terms_list)} unique terms")

# === CORPUS SIZE AND SCOPE ===
print(f"\n📈 CORPUS SIZE AND SCOPE")
print("=" * 30)
print(f"📚 Total records in your corpus: {corpus.num_docs}")
print(f"📝 Active records being analyzed: {corpus.num_active_docs}")
print(f"🔤 Total unique words/terms: {corpus.num_terms}")
print(f"📊 Total word tokens: {corpus.num_tokens}")

# === BASIC LENGTH STATISTICS ===
print(f"\n📏 RECORD LENGTH OVERVIEW")
print("=" * 30)
print(f"📊 Average record length: {stats.mean:.0f} words")
print(f"📉 Shortest record: {stats.doc_stats_df['total_tokens'].min()} words")
print(f"📈 Longest record: {stats.doc_stats_df['total_tokens'].max()} words")
print(f"📐 Length variation: {stats.standard_deviation:.0f} words (standard deviation)")

# Interpret the length variation
cv = stats.standard_deviation / stats.mean if stats.mean > 0 else 0
if cv < 0.15:
    length_interpretation = "Very similar lengths - good for comparison studies"
elif cv < 0.3:
    length_interpretation = "Moderately similar lengths - typical for most research"
else:
    length_interpretation = "Highly variable lengths - check for outliers"

print(f"💡 Interpretation: {length_interpretation}")

# === CORPUS QUALITY ASSESSMENT ===
print(f"\n🎯 CORPUS QUALITY ASSESSMENT")
print("=" * 30)

quality_metrics = stats.corpus_quality_metrics

# Size adequacy
size_adequacy = quality_metrics['corpus_size_metrics']['size_adequacy']
size_icon = "✅" if size_adequacy in ['adequate', 'large'] else "⚠️" if size_adequacy == 'small' else "❌"
print(f"{size_icon} Corpus size: {size_adequacy.upper()}")

if size_adequacy == 'very_small':
    print(f"   📝 Recommendation: Add more records for reliable analysis")
    print(f"   🎯 Minimum suggested: {quality_metrics['corpus_size_metrics']['recommended_min_docs']} records")
elif size_adequacy == 'small':
    print(f"   📝 Note: Results will be more reliable with additional records")
else:
    print(f"   📝 Great! Your corpus size is suitable for analysis")

# Length balance
length_balance = quality_metrics['document_length_balance']['classification']
balance_icon = "✅" if length_balance == 'balanced' else "⚠️" if length_balance == 'slightly_imbalanced' else "❌"
print(f"{balance_icon} Record length balance: {length_balance.upper()}")

if length_balance == 'highly_imbalanced':
    print(f"   📝 Recommendation: Check for very long/short records that might skew results")
elif length_balance == 'slightly_imbalanced':
    print(f"   📝 Note: Some length variation is normal and usually fine")
else:
    print(f"   📝 Great! Your records have consistent lengths")

# Vocabulary richness
vocab_richness = quality_metrics['vocabulary_richness']['sampling_adequacy']
vocab_icon = "✅" if vocab_richness == 'sufficient' else "⚠️" if vocab_richness == 'marginal' else "❌"
print(f"{vocab_icon} Vocabulary diversity: {vocab_richness.upper()}")

if vocab_richness == 'insufficient':
    print(f"   📝 Recommendation: Consider adding more diverse texts")
elif vocab_richness == 'marginal':
    print(f"   📝 Note: Vocabulary diversity is adequate but could be improved")
else:
    print(f"   📝 Great! Your corpus has rich vocabulary diversity")

print(f"\n💡 Overall Assessment:")
if all(metric in ['adequate', 'large', 'balanced', 'sufficient']
       for metric in [size_adequacy, length_balance, vocab_richness]):
    print(f"🎉 Excellent! Your corpus is well-suited for text analysis.")
else:
    print(f"📝 Your corpus is usable, but consider the recommendations above for optimal results.")

# === OUTLIER DETECTION (SIMPLIFIED) ===
print(f"\n🔍 UNUSUAL RECORDS (OUTLIER DETECTION)")
print("=" * 40)

outliers = stats.iqr_outliers
if outliers:
    print(f"⚠️ Found {len(outliers)} records with unusual lengths:")
    for record_id, record_name in outliers[:5]:  # Show first 5 outliers
        record = corpus.get(id=record_id)
        length = record.num_tokens() if record.is_parsed else len(record.content.split())
        print(f"   📄 {record_name}: {length} words")

    if len(outliers) > 5:
        print(f"   ... and {len(outliers) - 5} more")

    print(f"\n💡 What this means:")
    print(f"   • These records are much longer or shorter than typical")
    print(f"   • This is often normal (prefaces, abstracts, etc.)")
    print(f"   • Consider if these should be included in your analysis")
else:
    print(f"✅ No unusual record lengths detected")
    print(f"💡 All records have similar lengths - great for comparative analysis!")

print(f"\n" + "=" * 50)
print(f"🎯 NEXT STEPS:")
print(f"• If all assessments look good → proceed to advanced analysis")
print(f"• If you see warnings → consider adding more/different records")
print(f"• If you have outliers → decide whether to include or exclude them")
print(f"=" * 50)
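
Two of the checks in the cell above are easy to reproduce by hand: the coefficient of variation (standard deviation divided by the mean) and IQR-based outlier detection. The sketch below illustrates both with the standard library. It assumes the conventional rule of flagging values more than 1.5 × IQR beyond the quartiles and uses the sample standard deviation; the exact quartile method and thresholds behind `iqr_outliers` in Lexos may differ.

```python
import statistics


def coefficient_of_variation(lengths):
    """Sample standard deviation divided by the mean (0.0 if the mean is 0)."""
    mean = statistics.mean(lengths)
    return statistics.stdev(lengths) / mean if mean > 0 else 0.0


def iqr_outliers(lengths, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(lengths, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value for value in lengths if value < low or value > high]


# Hypothetical record lengths in tokens: seven similar records and one long one
lengths = [52, 48, 50, 47, 51, 49, 53, 400]
cv = coefficient_of_variation(lengths)
flagged = iqr_outliers(lengths)
print(f"CV = {cv:.2f}, outliers = {flagged}")
```

A single very long record is enough to push the coefficient of variation well past the 0.3 threshold used above, which is why the outlier check is worth reading alongside it.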

### Statistics for Comparative Text Analysis 

*Perfect for: Literary studies, genre analysis, authorship studies, comparing time periods*

In [None]:
print("📚 COMPARATIVE TEXT ANALYSIS")
print("=" * 50)
print("Perfect for: Comparing authors, genres, time periods, or any text groups")
print()

# === ANALYZING TEXT GROUPS ===
print("🔍 ANALYZING YOUR TEXT GROUPS")
print("=" * 35)

# Check if we have metadata categories for comparison
metadata_categories = {}
for record in corpus.records.values():
    if record.is_active and record.meta.get('category'):
        category = record.meta.get('category', 'unknown')
        if category not in metadata_categories:
            metadata_categories[category] = {'records': [], 'lengths': []}

        metadata_categories[category]['records'].append(record.name)
        if record.is_parsed:
            metadata_categories[category]['lengths'].append(record.num_tokens())
        else:
            metadata_categories[category]['lengths'].append(len(record.content.split()))

if len(metadata_categories) > 1:
    print(f"📊 Found {len(metadata_categories)} text groups in your corpus:")
    for category, data in metadata_categories.items():
        avg_length = sum(data['lengths']) / len(data['lengths'])
        print(f"   📖 {category.title()}: {len(data['records'])} records, avg {avg_length:.0f} words")

    # === STATISTICAL COMPARISON ===
    print(f"\n🧮 STATISTICAL COMPARISON BETWEEN GROUPS")
    print("=" * 45)

    # Get two largest groups for comparison
    sorted_groups = sorted(metadata_categories.items(), key=lambda x: len(x[1]['records']), reverse=True)
    if len(sorted_groups) >= 2:
        group1_name, group1_data = sorted_groups[0]
        group2_name, group2_data = sorted_groups[1]

        print(f"🆚 Comparing: {group1_name.title()} vs {group2_name.title()}")

        try:
            comparison = stats.compare_groups(
                group1_labels=group1_data['records'],
                group2_labels=group2_data['records'],
                metric="total_tokens",
                test_type="mann_whitney"
            )

            # Explain the comparison in accessible terms
            print(f"\n📊 Statistical Results:")
            print(f"   {group1_name.title()} average: {comparison['group1_mean']:.0f} words")
            print(f"   {group2_name.title()} average: {comparison['group2_mean']:.0f} words")

            # Interpret the p-value
            if comparison['p_value'] < 0.001:
                significance = "Very strong evidence"
            elif comparison['p_value'] < 0.01:
                significance = "Strong evidence"
            elif comparison['p_value'] < 0.05:
                significance = "Moderate evidence"
            else:
                significance = "No clear evidence"

            print(f"\n💡 What this means:")
            print(f"   • {significance} of a real difference between groups")
            if comparison['is_significant']:
                print(f"   • The length difference is statistically meaningful")
                print(f"   • Effect size: {comparison['effect_size_interpretation']} difference")
            else:
                print(f"   • The groups are similar in length")
                print(f"   • Any difference could be due to chance")

            print(f"\n📈 Research Implications:")
            if comparison['is_significant']:
                print(f"   ✅ Groups show distinct patterns - good for comparative analysis")
                print(f"   🔬 Consider investigating what causes this difference")
            else:
                print(f"   ✅ Groups are comparable - good for controlled comparisons")
                print(f"   🔬 Focus on content/style rather than length differences")

        except Exception as e:
            print(f"❌ Could not compare groups: {e}")

    # === OUTLIER ANALYSIS FOR GROUPS ===
    print(f"\n🎯 OUTLIERS WITHIN EACH GROUP")
    print("=" * 35)

    outliers = stats.iqr_outliers
    if outliers:
        print(f"Records that don't fit typical patterns:")

        # Group outliers by category
        outliers_by_group = {}
        for record_id, record_name in outliers:
            record = corpus.get(id=record_id)
            group = record.meta.get('category', 'unknown')
            if group not in outliers_by_group:
                outliers_by_group[group] = []
            outliers_by_group[group].append(record_name)

        for group, group_outliers in outliers_by_group.items():
            print(f"   📖 {group.title()}: {', '.join(group_outliers)}")

        print(f"\n💡 Research Value of Outliers:")
        print(f"   • May represent unique styles or content")
        print(f"   • Could indicate errors in categorization")
        print(f"   • Might be your most interesting findings!")
    else:
        print(f"✅ No outliers found - all texts fit their group patterns well")

else:
    print(f"❓ No text groups found in your metadata.")
    print(f"📝 To enable group comparison:")
    print(f"   • Add metadata categories like 'genre', 'author', 'period'")
    print(f"   • See the metadata section above for examples")

print(f"\n" + "=" * 50)
print(f"🎯 NEXT STEPS FOR COMPARATIVE ANALYSIS:")
print(f"• Use clustering tools to find hidden similarities")
print(f"• Apply topic modeling to discover thematic differences")
print(f"• Try stylometric analysis for authorship studies")
print(f"=" * 50)
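
The cell above requests a Mann-Whitney test, which compares two groups by rank rather than by mean and so does not assume normally distributed lengths. As a rough illustration of what the test statistic measures, the sketch below computes the U statistics directly from their pairwise definition. It is not the code behind `compare_groups`, it does not compute a p-value, and the group lengths are made up; for real work, a library routine such as `scipy.stats.mannwhitneyu` handles the p-value and tie corrections.

```python
def mann_whitney_u(group1, group2):
    """Return (U1, U2): pairwise 'wins' for each group, with ties counting one half."""
    u1 = 0.0
    for x in group1:
        for y in group2:
            if x > y:
                u1 += 1.0
            elif x == y:
                u1 += 0.5
    # Every pair is a win for one side or a tie, so the two statistics sum to n1 * n2
    u2 = len(group1) * len(group2) - u1
    return u1, u2


# Hypothetical record lengths (in tokens) for two groups
literature_lengths = [52, 61, 58]
news_lengths = [40, 45, 59]
u1, u2 = mann_whitney_u(literature_lengths, news_lengths)
print(u1, u2)
```

When the two U values are close to each other, neither group tends to have longer records; the further apart they are, the stronger the tendency.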

### 🗣️ Statistics for Linguistic and Vocabulary Analysis

*Perfect for: Language studies, vocabulary complexity, linguistic patterns*
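
The cell below reports several vocabulary-richness metrics. As a rough guide to how such metrics are conventionally defined, the sketch below computes them for a single list of tokens, where N is the number of tokens and V the number of distinct types: TTR = V/N, root TTR = V/√N, corrected TTR = V/√(2N), and log TTR = log V / log N. The hapax ratio here is the share of types that occur exactly once, which is an assumption; some definitions divide by tokens instead, and the Lexos properties may differ in this and other details.

```python
import math
from collections import Counter


def lexical_diversity(tokens):
    """Compute common type-token measures for one list of tokens."""
    n_tokens = len(tokens)
    if n_tokens < 2:
        raise ValueError("Need at least two tokens (log TTR divides by log N).")
    counts = Counter(tokens)
    n_types = len(counts)
    hapaxes = sum(1 for count in counts.values() if count == 1)
    return {
        "ttr": n_types / n_tokens,
        "hapax_ratio": hapaxes / n_types,  # assumption: hapaxes per type
        "rttr": n_types / math.sqrt(n_tokens),
        "cttr": n_types / math.sqrt(2 * n_tokens),
        "log_ttr": math.log(n_types) / math.log(n_tokens),
    }


tokens = "in a hole in the ground there lived a hobbit".split()
metrics = lexical_diversity(tokens)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Plain TTR tends to fall as texts get longer, because common words repeat; the root, corrected, and log variants are attempts to reduce that dependence on length.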

In [None]:
print("🗣️ LINGUISTIC AND VOCABULARY ANALYSIS")
print("=" * 50)
print("Perfect for: Language complexity, vocabulary richness, linguistic patterns")
print()

# === VOCABULARY DIVERSITY ANALYSIS ===
print("📚 VOCABULARY DIVERSITY AND RICHNESS")
print("=" * 40)

diversity_stats = stats.text_diversity_stats

print(f"📊 Vocabulary Richness Metrics:")
print(f"   🎯 Type-Token Ratio (TTR): {diversity_stats['mean_ttr']:.3f}")
print(f"   📈 Corpus-wide TTR: {diversity_stats['corpus_ttr']:.3f}")
print(f"   🆕 Rare words ratio: {diversity_stats['mean_hapax_ratio']:.3f}")

print(f"\n💡 What these numbers mean:")

# Interpret TTR
ttr = diversity_stats['mean_ttr']
if ttr > 0.7:
    ttr_interpretation = "Very high vocabulary diversity"
    ttr_meaning = "Texts use many different words - sophisticated vocabulary"
elif ttr > 0.5:
    ttr_interpretation = "Good vocabulary diversity"
    ttr_meaning = "Balanced use of vocabulary - typical for quality writing"
elif ttr > 0.3:
    ttr_interpretation = "Moderate vocabulary diversity"
    ttr_meaning = "Some repetition but still varied vocabulary"
else:
    ttr_interpretation = "Lower vocabulary diversity"
    ttr_meaning = "More repetitive language - may indicate specialized/technical content"

print(f"   📝 TTR Assessment: {ttr_interpretation}")
print(f"      {ttr_meaning}")

# Interpret hapax ratio
hapax_ratio = diversity_stats['mean_hapax_ratio']
if hapax_ratio > 0.6:
    hapax_interpretation = "High use of unique words"
    hapax_meaning = "Many words appear only once - rich, varied vocabulary"
elif hapax_ratio > 0.4:
    hapax_interpretation = "Moderate use of unique words"
    hapax_meaning = "Balanced vocabulary with good variety"
else:
    hapax_interpretation = "Lower use of unique words"
    hapax_meaning = "More repetitive vocabulary - may be formulaic or specialized"

print(f"   📝 Rare Words Assessment: {hapax_interpretation}")
print(f"      {hapax_meaning}")

# === LINGUISTIC PATTERN ANALYSIS ===
print(f"\n🔍 LINGUISTIC PATTERN ANALYSIS")
print("=" * 35)

# Zipf's Law Analysis
zipf_analysis = stats.zipf_analysis
print(f"📈 Language Pattern Analysis (Zipf's Law):")

if zipf_analysis['follows_zipf']:
    print(f"   ✅ Your texts follow expected language patterns")
    print(f"   📝 This suggests natural, well-formed language")
    print(f"   🎯 Good fit score: {zipf_analysis['zipf_goodness_of_fit']}")
else:
    print(f"   ⚠️ Your texts show unusual language patterns")
    print(f"   📝 This might indicate:")
    print(f"      • Highly technical or specialized vocabulary")
    print(f"      • Mixed languages or corrupted text")
    print(f"      • Very small corpus size")
    print(f"   🎯 Fit score: {zipf_analysis['zipf_goodness_of_fit']}")

print(f"   📊 Technical details: R² = {zipf_analysis['r_squared']:.3f}")

# Advanced diversity metrics
advanced_diversity = stats.advanced_lexical_diversity
print(f"\n📊 Advanced Vocabulary Metrics:")
print(f"   🔄 CTTR (Corrected TTR): {advanced_diversity['mean_cttr']:.2f}")
print(f"   📏 RTTR (Root TTR): {advanced_diversity['mean_rttr']:.2f}")
print(f"   📈 Log TTR: {advanced_diversity['mean_log_ttr']:.3f}")

print(f"\n💡 Advanced Metrics Interpretation:")
cttr = advanced_diversity['mean_cttr']
if cttr > 5:
    print(f"   ✅ Very high vocabulary sophistication")
elif cttr > 3:
    print(f"   ✅ Good vocabulary sophistication")
elif cttr > 2:
    print(f"   📝 Moderate vocabulary sophistication")
else:
    print(f"   📝 Lower vocabulary sophistication")

# === VOCABULARY DISTRIBUTION ===
print(f"\n📋 VOCABULARY DISTRIBUTION PATTERNS")
print("=" * 40)

print(f"📊 Most Common Words in Your Corpus:")
try:
    top_terms = corpus.term_counts(n=10, most_common=True)
    for i, (term, count) in enumerate(top_terms, 1):
        percentage = (count / corpus.num_tokens * 100) if corpus.num_tokens > 0 else 0
        print(f"   {i:2d}. '{term}': {count} times ({percentage:.1f}%)")

    print(f"\n💡 Vocabulary Insights:")
    # Check if top words are function words (common pattern)
    function_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'is', 'are', 'was', 'were', 'be', 'been', 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might', 'can', 'shall', 'must'}
    top_words = [term.lower() for term, _ in top_terms[:5]]
    function_word_count = sum(1 for word in top_words if word in function_words)

    if function_word_count >= 3:
        print(f"   ✅ Natural language patterns - function words dominate")
        print(f"   📝 This indicates well-formed, natural text")
    else:
        print(f"   🤔 Unusual word frequency patterns detected")
        print(f"   📝 May indicate technical content or data processing needed")

except Exception:  # Avoid a bare except, which would also swallow KeyboardInterrupt
    print(f"   (Could not analyze term frequencies - may need processed records)")

print(f"\n" + "=" * 50)
print(f"🎯 NEXT STEPS FOR LINGUISTIC ANALYSIS:")
print(f"• Use n-gram analysis to study phrase patterns")
print(f"• Apply part-of-speech analysis for grammatical patterns")
print(f"• Try collocation analysis for word association patterns")
print(f"=" * 50)
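
Zipf's law says that a term's frequency is roughly inversely proportional to its frequency rank, so a plot of log frequency against log rank is close to a straight line with slope near -1. The sketch below shows one common way to test this: an ordinary least-squares fit on the log-log values, reporting the slope and R². It illustrates the idea only; the fitting method and the thresholds behind `zipf_analysis` in Lexos are not shown in this tutorial and may differ.

```python
import math
from collections import Counter


def zipf_fit(tokens):
    """Fit log(frequency) = a + slope * log(rank); return (slope, r_squared)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("Need at least two distinct terms to fit a line.")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # R² is undefined when all frequencies are equal; report 0.0 in that case
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, r_squared


# Artificial tokens whose frequencies are exactly 60 / rank
tokens = ["w1"] * 60 + ["w2"] * 30 + ["w3"] * 20 + ["w4"] * 15 + ["w5"] * 12
slope, r_squared = zipf_fit(tokens)
print(f"slope = {slope:.3f}, R² = {r_squared:.3f}")
```

Real corpora never fit this perfectly, and very small corpora such as the five sample texts give few points to fit, so treat the result as a rough indicator.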

### Statistics for Advanced Analysis Preparation

*Perfect for: Machine learning, topic modeling, clustering, classification projects*

In [None]:
print("📊 ADVANCED ANALYSIS PREPARATION")
print("=" * 50)
print("Perfect for: Preparing data for machine learning, topic modeling, clustering")
print()

# === DATA QUALITY ASSESSMENT FOR ML ===
print("🎯 DATA QUALITY FOR ADVANCED ANALYSIS")
print("=" * 45)

quality_metrics = stats.corpus_quality_metrics

print(f"📊 Suitability Assessment:")

# Sample size adequacy
size_metrics = quality_metrics['corpus_size_metrics']
size_adequacy = size_metrics['size_adequacy']

if size_adequacy == 'very_small':
    size_recommendation = "❌ Too small for reliable ML - add more records"
    ml_ready = False
elif size_adequacy == 'small':
    size_recommendation = "⚠️ Marginal for ML - results may be unstable"
    ml_ready = False
elif size_adequacy == 'adequate':
    size_recommendation = "✅ Good size for most analysis techniques"
    ml_ready = True
else:  # large
    size_recommendation = "✅ Excellent size for advanced analysis"
    ml_ready = True

print(f"   📈 Corpus size: {size_recommendation}")

# Vocabulary richness for ML
vocab_richness = quality_metrics['vocabulary_richness']
sampling_adequacy = vocab_richness['sampling_adequacy']

if sampling_adequacy == 'insufficient':
    vocab_recommendation = "❌ Insufficient vocabulary diversity for ML"
    vocab_ready = False
elif sampling_adequacy == 'marginal':
    vocab_recommendation = "⚠️ Limited vocabulary diversity - may affect results"
    vocab_ready = False
else:  # sufficient
    vocab_recommendation = "✅ Good vocabulary diversity for analysis"
    vocab_ready = True

print(f"   📚 Vocabulary diversity: {vocab_recommendation}")

# Distribution analysis for statistical assumptions
print(f"\n📈 STATISTICAL DISTRIBUTION ANALYSIS")
print("=" * 40)

dist_stats = stats.distribution_stats

print(f"📊 Data Distribution Properties:")
print(f"   📐 Symmetry (skewness): {dist_stats['skewness']:.3f}")

if abs(dist_stats['skewness']) < 0.5:
    skewness_interp = "Symmetric distribution - good for most methods"
elif abs(dist_stats['skewness']) < 1.0:
    skewness_interp = "Slightly skewed - usually fine for analysis"
else:
    skewness_interp = "Highly skewed - may need data transformation"

print(f"      💡 {skewness_interp}")

print(f"   📊 Normality test: {'Normally distributed' if dist_stats['is_normal'] else 'Non-normal distribution'}")
if not dist_stats['is_normal']:
    print(f"      💡 Non-parametric methods recommended")
else:
    print(f"      💡 Both parametric and non-parametric methods suitable")

print(f"   📏 Variability: {dist_stats['coefficient_of_variation']:.3f}")
if dist_stats['coefficient_of_variation'] < 0.2:
    var_interp = "Low variability - consistent record lengths"
elif dist_stats['coefficient_of_variation'] < 0.5:
    var_interp = "Moderate variability - typical for text data"
else:
    var_interp = "High variability - consider length normalization"

print(f"      💡 {var_interp}")

# === PREPROCESSING RECOMMENDATIONS ===
print(f"\nüîß PREPROCESSING RECOMMENDATIONS")
print("=" * 35)

print(f"üìã Based on your corpus analysis:")

outliers = stats.iqr_outliers
if outliers:
    print(f"   ‚ö†Ô∏è Outlier handling: {len(outliers)} unusual records detected")
    print(f"      ‚Ä¢ Consider removing or treating outliers separately")
    print(f"      ‚Ä¢ May improve model performance and interpretability")
else:
    print(f"   ‚úÖ Outlier handling: No problematic outliers detected")

if dist_stats['coefficient_of_variation'] > 0.5:
    print(f"   üìè Length normalization: Recommended due to high variability")
    print(f"      ‚Ä¢ Consider document-length normalization")
    print(f"      ‚Ä¢ May improve clustering and classification results")
else:
    print(f"   ‚úÖ Length normalization: Not required - consistent lengths")

if vocab_richness['vocabulary_saturation'] < 1.5:
    print(f"   üìö Vocabulary filtering: Consider removing very rare words")
    print(f"      ‚Ä¢ Set minimum frequency thresholds")
    print(f"      ‚Ä¢ May improve model stability")
else:
    print(f"   ‚úÖ Vocabulary filtering: Good vocabulary coverage")

# === ML READINESS ASSESSMENT ===
print(f"\n🤖 MACHINE LEARNING READINESS")
print("=" * 35)

readiness_score = 0
max_score = 4

if ml_ready:
    readiness_score += 1
    print(f"   ✅ Sample size: Adequate")
else:
    print(f"   ❌ Sample size: Insufficient")

if vocab_ready:
    readiness_score += 1
    print(f"   ✅ Vocabulary diversity: Good")
else:
    print(f"   ❌ Vocabulary diversity: Limited")

if len(outliers) <= len(stats.docs) * 0.1:  # At most 10% outliers
    readiness_score += 1
    print(f"   ✅ Data quality: Good (few outliers)")
else:
    print(f"   ⚠️ Data quality: Consider outlier treatment")

if dist_stats['coefficient_of_variation'] < 0.8:  # Reasonable variability
    readiness_score += 1
    print(f"   ✅ Consistency: Good variability levels")
else:
    print(f"   ⚠️ Consistency: High variability detected")

print(f"\n🎯 Overall ML Readiness: {readiness_score}/{max_score}")

if readiness_score >= 3:
    print(f"   🎉 Your corpus is ready for advanced analysis!")
    print(f"   🚀 Recommended next steps:")
    print(f"      • Topic modeling with LDA or similar")
    print(f"      • Document clustering analysis")
    print(f"      • Classification experiments")
elif readiness_score >= 2:
    print(f"   📝 Your corpus needs minor improvements")
    print(f"   🔧 Address the issues above, then proceed")
else:
    print(f"   ⚠️ Your corpus needs significant preparation")
    print(f"   📚 Consider adding more diverse records")

print(f"\n" + "=" * 50)
print(f"🎯 NEXT STEPS FOR ADVANCED ANALYSIS:")
print(f"• Export your corpus for external ML tools")
print(f"• Use Lexos clustering and topic modeling modules")
print(f"• Set up document classification experiments")
print(f"=" * 50)
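
The variability thresholds used above are based on the coefficient of variation, which is the standard deviation of the record lengths divided by their mean. As a rough sketch of how that figure can be reproduced outside Lexos, the snippet below computes it with the standard library. The token counts are made up, the helper names are hypothetical, and the use of the population standard deviation is an assumption; `CorpusStats` may compute it differently.

```python
import statistics


def coefficient_of_variation(lengths):
    """Return std / mean for a sequence of record lengths (population std assumed)."""
    mean = statistics.fmean(lengths)
    if mean == 0:
        raise ValueError("mean length is zero; coefficient of variation is undefined")
    return statistics.pstdev(lengths) / mean


def interpret_variability(cv):
    """Map a coefficient of variation to the labels used in the cell above."""
    if cv < 0.2:
        return "Low variability - consistent record lengths"
    if cv < 0.5:
        return "Moderate variability - typical for text data"
    return "High variability - consider length normalization"


# Hypothetical token counts for five records
token_counts = [120, 135, 128, 410, 142]
cv = coefficient_of_variation(token_counts)
print(f"Coefficient of variation: {cv:.3f}")
print(interpret_variability(cv))
```

A single long record is enough to push this small sample into the "high variability" band, which is why the outlier check and the variability check are worth reading together.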

## Visualization and Reporting

When you call `Corpus.get_stats()` to produce a `CorpusStats` instance, that instance has a `doc_stats_df` property, which returns a pandas DataFrame containing all the generated statistics.

In [None]:
stats.doc_stats_df.head()
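
If you want to experiment with this kind of table before loading your own corpus, the sketch below builds a small stand-in DataFrame and pulls out a few views of it. The `total_tokens` and `vocabulary_density` column names are the ones used later in this tutorial; the record names and values are invented, and the real `doc_stats_df` may contain additional columns.

```python
import pandas as pd

# Hypothetical stand-in for stats.doc_stats_df, indexed by record name
doc_stats_demo = pd.DataFrame(
    {
        "total_tokens": [120, 135, 410],
        "vocabulary_density": [62.5, 58.1, 41.3],
    },
    index=["doc_a", "doc_b", "doc_c"],
)

# Select a single column as a Series
lengths = doc_stats_demo["total_tokens"]

# Sort records from longest to shortest
longest_first = doc_stats_demo.sort_values("total_tokens", ascending=False)

# Summary statistics for every numeric column
summary = doc_stats_demo.describe()

print(longest_first)
print(summary.loc[["mean", "min", "max"]])
```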

Data extracted from this table can be fed into a variety of plotting tools to generate graphs. The following examples use the Python `matplotlib` package, but they are meant only as a starting point.

In [None]:
print("📊 CREATING VISUALIZATIONS")
print("=" * 40)

# Set up matplotlib for better plots
plt.style.use('default')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 10

# Create a visualization of document statistics
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle(f'Corpus Analysis: {corpus.name}', fontsize=16, fontweight='bold')

# 1. Document length distribution
doc_lengths = stats.doc_stats_df['total_tokens']

axes[0, 0].hist(doc_lengths, bins=min(10, len(doc_lengths)), alpha=0.7, color='skyblue', edgecolor='black')
axes[0, 0].axvline(stats.mean, color='red', linestyle='--', label=f'Mean: {stats.mean:.1f}')
axes[0, 0].set_title('Document Length Distribution')
axes[0, 0].set_xlabel('Number of Tokens')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Vocabulary density distribution
vocab_density = stats.doc_stats_df['vocabulary_density']
axes[0, 1].hist(vocab_density, bins=min(10, len(vocab_density)), alpha=0.7, color='lightgreen', edgecolor='black')
axes[0, 1].axvline(vocab_density.mean(), color='red', linestyle='--', label=f'Mean: {vocab_density.mean():.1f}%')
axes[0, 1].set_title('Vocabulary Density Distribution')
axes[0, 1].set_xlabel('Vocabulary Density (%)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Document comparison (length vs vocabulary)
doc_stats_plot = stats.doc_stats_df
scatter = axes[1, 0].scatter(doc_stats_plot['total_tokens'], doc_stats_plot['vocabulary_density'],
                           alpha=0.7, s=60, c='purple')
axes[1, 0].set_title('Document Length vs Vocabulary Density')
axes[1, 0].set_xlabel('Total Tokens')
axes[1, 0].set_ylabel('Vocabulary Density (%)')
axes[1, 0].grid(True, alpha=0.3)

# Label each point with its document name (if not too many)
if len(doc_stats_plot) <= 20:
    for idx, row in doc_stats_plot.iterrows():
        axes[1, 0].annotate(idx, (row['total_tokens'], row['vocabulary_density']),
                          xytext=(5, 5), textcoords='offset points', fontsize=8, alpha=0.7)

# 4. Outlier identification plot
doc_lengths_list = doc_stats_plot['total_tokens'].tolist()
# Assumes each outlier is a tuple whose second element is the document name
outlier_names = {o[1] for o in outliers}
colors = ['red' if idx in outlier_names else 'blue' for idx in doc_stats_plot.index]

axes[1, 1].scatter(range(len(doc_lengths_list)), doc_lengths_list, c=colors, alpha=0.7, s=60)
axes[1, 1].axhline(stats.iqr_bounds[0], color='orange', linestyle='--', alpha=0.7, label='IQR Lower Bound')
axes[1, 1].axhline(stats.iqr_bounds[1], color='orange', linestyle='--', alpha=0.7, label='IQR Upper Bound')
axes[1, 1].axhline(stats.mean, color='green', linestyle='-', alpha=0.7, label='Mean')
axes[1, 1].set_title('Outlier Detection (Red = Outliers)')
axes[1, 1].set_xlabel('Document Index')
axes[1, 1].set_ylabel('Token Count')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✓ Visualization complete!")
print("\n💡 Interpretation Guide:")
print("   📊 Top Left: Shows how document lengths are distributed")
print("   📈 Top Right: Shows vocabulary richness patterns")
print("   🔍 Bottom Left: Reveals relationships between length and vocabulary")
print("   🎯 Bottom Right: Identifies unusual documents (outliers in red)")
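
The bottom-right panel relies on interquartile-range (IQR) bounds. A common convention, often called Tukey's fences, flags values more than 1.5 × IQR below the first quartile or above the third. The sketch below applies that convention to invented token counts using the standard library. Whether `stats.iqr_bounds` uses the same multiplier and quartile method is an assumption, so treat the numbers as illustrative only.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) bounds using the k * IQR convention (k=1.5 assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical (record name, token count) pairs
records = [("doc_a", 120), ("doc_b", 135), ("doc_c", 128), ("doc_d", 410), ("doc_e", 142)]
counts = [n for _, n in records]

lower, upper = iqr_fences(counts)
flagged = [(name, n) for name, n in records if n < lower or n > upper]

print(f"Bounds: {lower:.1f} to {upper:.1f}")
print(f"Flagged: {flagged}")
```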

## Integration with Other Modules

The `export_statistical_fingerprint()` method returns a dictionary of basic statistics about your corpus, which is useful for passing this information to other Lexos modules or external tools. Because the state of a corpus can change, each exported dictionary is assigned a unique ID corresponding to the state of the corpus at the time of export.


In [None]:
# Export statistical fingerprint for other modules
fingerprint = corpus.export_statistical_fingerprint()

print("📊 Statistical Fingerprint Generated:")
print(f"   Corpus: {fingerprint['corpus_metadata']['name']}")
print(f"   Documents: {fingerprint['corpus_metadata']['num_docs']}")
print(f"   Active docs: {fingerprint['corpus_metadata']['num_active_docs']}")
print(f"   Total tokens: {fingerprint['corpus_metadata']['num_tokens']}")
print(f"   Fingerprint ID: {fingerprint['corpus_metadata']['corpus_fingerprint']}")
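
One way to picture a state-dependent ID is as a hash over the properties that describe the corpus. The sketch below is not how Lexos computes `corpus_fingerprint`; it is a hypothetical illustration, using only `hashlib` and `json`, of why such an ID changes whenever the corpus does. The function name and the state keys are invented for the example.

```python
import hashlib
import json


def state_fingerprint(state):
    """Hash a dict of corpus properties into a short, repeatable hex ID."""
    # sort_keys makes the serialization independent of key order
    canonical = json.dumps(state, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical corpus states before and after adding one record
before = {"name": "My Research Corpus", "num_docs": 5, "num_tokens": 935}
after = {"name": "My Research Corpus", "num_docs": 6, "num_tokens": 1100}

# The same state as `before`, with keys listed in a different order
reordered = {"num_tokens": 935, "num_docs": 5, "name": "My Research Corpus"}

print(state_fingerprint(before))
print(state_fingerprint(after))
print(state_fingerprint(before) == state_fingerprint(reordered))
```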


The rich statistics you can generate with the `Corpus` class are not the only ones you can work with. You may have performed your own analysis using another Lexos module or some other means. The `Corpus` class provides a number of methods for managing statistics from external sources. The `import_analysis_results()` method ingests a dictionary of arbitrary statistical data, which can then be accessed by calling `get_analysis_results()`. The cell below demonstrates the use of these methods with some sample statistics you might have generated with another module.

Note that, by default, you cannot import analysis results more than once. If you need to do so, pass `overwrite=True` to replace the current contents. In the cell below, this parameter is set in case you run the cell more than once.

In [None]:
print("🔗 INTER-MODULE INTEGRATION")
print("=" * 40)

print(f"\n🧪 Simulating Analysis Module Integration:")

# Simulate how other modules would store results
# Example 1: Classification results
classification_results = {
    'algorithm': 'simulated_classifier',
    'accuracy': 0.85,
    'predictions': {
        doc_name: {
            'predicted_class': 'positive' if 'literature' in doc_name or 'academic' in doc_name else 'neutral',
            'confidence': 0.75 + (hash(doc_name) % 20) / 100  # Simulated confidence
        }
        for doc_name in corpus.names.keys()
    },
    'model_parameters': {
        'features': 'bag_of_words',
        'training_size': 1000
    }
}

# Store classification results in corpus
corpus.import_analysis_results(
    module_name='text_classification',
    results_data=classification_results,
    version='1.0.0',
    overwrite=True
)

print(f"   ✓ Stored classification results")

# Example 2: Topic modeling results
topic_modeling_results = {
    'algorithm': 'LDA',
    'n_topics': 3,
    'coherence_score': 0.62,
    'topics': {
        'topic_0': ['academic', 'research', 'analysis', 'data'],
        'topic_1': ['literature', 'story', 'character', 'narrative'],
        'topic_2': ['news', 'community', 'local', 'center']
    },
    'document_topics': {
        doc_name: {
            'dominant_topic': hash(doc_name) % 3,
            'topic_distribution': [
                0.3 + (hash(doc_name + 'a') % 40) / 100,
                0.3 + (hash(doc_name + 'b') % 40) / 100,
                0.3 + (hash(doc_name + 'c') % 40) / 100
            ]
        }
        for doc_name in corpus.names.keys()
    }
}

corpus.import_analysis_results(
    module_name='topic_modeling',
    results_data=topic_modeling_results,
    version='2.1.0',
    overwrite=True
)

print(f"   ✓ Stored topic modeling results")

# Show stored results
print(f"\n📋 Analysis Results Summary:")
all_results = corpus.get_analysis_results()
for module_name, data in all_results.items():
    print(f"   📊 {module_name}:")
    print(f"      Version: {data['version']}")
    print(f"      Timestamp: {data['timestamp']}")
    print(f"      Corpus state: {len(data['corpus_state'])} tracked properties")
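
The simulated values in the cell above rely on Python's built-in `hash()`, which is salted per process for strings by default, so the numbers change between sessions. If you would like placeholder data that is repeatable and whose topic weights sum to 1, something like the following sketch may help. The helper names are hypothetical, and the values are still fake; they only stand in for real model output.

```python
import hashlib


def stable_int(text):
    """Return an integer derived from text that is identical across Python sessions."""
    return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)


def simulated_topics(doc_name, n_topics=3):
    """Build a fake, normalized topic distribution and its dominant topic."""
    raw = [1 + stable_int(f"{doc_name}:{i}") % 40 for i in range(n_topics)]
    total = sum(raw)
    distribution = [value / total for value in raw]
    dominant = max(range(n_topics), key=lambda i: distribution[i])
    return {"dominant_topic": dominant, "topic_distribution": distribution}


# Hypothetical record names
for name in ["academic_paper", "literature_sample"]:
    result = simulated_topics(name)
    weights = ", ".join(f"{w:.2f}" for w in result["topic_distribution"])
    print(f"{name}: topic {result['dominant_topic']} [{weights}]")
```

Here the dominant topic is taken from the distribution itself, so the two fields cannot disagree.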


Because the corpus can change, it is important to check whether the analysis results are still valid for the current state of the corpus. For this, use `validate_analysis_compatibility()`.

In [None]:
print(f"\n🔍 Corpus State Validation:")
# Check if analysis results are still valid
for module_name in all_results.keys():
    compatibility = corpus.validate_analysis_compatibility(module_name)
    status = "✓ Valid" if compatibility['compatible'] else "⚠️ Outdated"
    print(f"   {module_name}: {status}")
    if not compatibility['compatible']:
        print(f"      Reason: {compatibility['reason']}")
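
Conceptually, a check like this compares the corpus state recorded when the results were imported with the state as it is now. The sketch below illustrates that idea with plain dictionaries. The function `check_compatibility` and the state keys are hypothetical; only the `compatible` and `reason` keys mirror what the cell above reads from the Lexos return value.

```python
def check_compatibility(stored_state, current_state):
    """Compare two state dicts and report which tracked properties changed."""
    changed = sorted(
        key
        for key in stored_state.keys() | current_state.keys()
        if stored_state.get(key) != current_state.get(key)
    )
    if not changed:
        return {"compatible": True, "reason": ""}
    return {"compatible": False, "reason": f"Changed properties: {', '.join(changed)}"}


# Hypothetical state captured at import time, and the state after adding a record
stored = {"num_docs": 5, "num_tokens": 935}
current = {"num_docs": 6, "num_tokens": 1100}

print(check_compatibility(stored, stored))
print(check_compatibility(stored, current))
```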

## Generating Reports

In the cells above, you generated a variety of reports about your corpus by hand-coding the outputs in the notebook cell. Whilst this gives you a lot of flexibility to change the output, it would be much simpler to generate these reports with a single function call. For this purpose, the Corpus module provides the `create_corpus_analysis_report()` function, which generates comprehensive reports about the contents of your corpus in a single line of code. By default, it returns the output in Markdown format. However, if you supply an `output_dir`, it will create the directory (if it does not exist) and save the output there, along with CSV and JSON files containing your corpus statistics. If you set `html=True`, the report output will be saved in HTML format for viewing in a web browser.

In [None]:
from lexos.corpus.corpus_analysis_report import create_corpus_analysis_report

report = create_corpus_analysis_report(corpus)
print(report)
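
Because the default return value is a Markdown string, you can also write it to disk yourself. The sketch below uses a placeholder string in place of a real report and a hypothetical `save_report()` helper. In most cases, passing `output_dir` to the Lexos function is likely the simpler route.

```python
import tempfile
from pathlib import Path


def save_report(markdown_text, out_dir, filename="corpus_report.md"):
    """Write a Markdown report to out_dir, creating the directory if needed."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    report_file = out_path / filename
    report_file.write_text(markdown_text, encoding="utf-8")
    return report_file


# Placeholder text standing in for the string returned by the report function
demo_report = "# Corpus Analysis Report\n\nDocuments: 5\n"
saved = save_report(demo_report, Path(tempfile.mkdtemp()) / "reports")
print(f"Saved report to {saved}")
```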