# Text Data and Exploratory Analysis

## Overview

This notebook introduces the concept of **corpora** (collections of text documents) as the fundamental infrastructure for NLP work. We explore what corpora are, different types of corpora, and the importance of **Exploratory Data Analysis (EDA)** for text data. Understanding your corpus through EDA is crucial before preprocessing, as it helps identify data quality issues, vocabulary characteristics, and preprocessing needs that will guide your entire NLP pipeline.


## Objectives

- Explain the significance of corpora as the fundamental infrastructure for training language models, conducting statistical analysis, and benchmarking NLP systems
- Understand what exploratory data analysis (EDA) is and why it's crucial before preprocessing
- Learn what to assess in EDA for text data: data quality, vocabulary characteristics, and preprocessing needs

## Outline

1. **What is a Corpus?** - Definition and examples of text corpora
2. **Types of Corpora** - News, social media, literature, scientific corpora
3. **Working with Corpora** - Practical examples of corpus structure
4. **Real-World Example Corpora** - Famous corpora used in NLP research
5. **Exploratory Data Analysis (EDA) for Text** - Understanding your data before preprocessing
6. **EDA Components**:
   - Data quality assessment (class distribution, text length, language detection, duplicates)
   - Vocabulary characteristics (vocabulary size, word frequency patterns, class-specific patterns)
   - Preprocessing needs identification (noise patterns, normalization needs)


## What is a Corpus?

In NLP, a **corpus** (Latin for "body") is a collection of text documents. Each document in the corpus represents a single text unit (like a book, article, tweet, or sentence), and the collection of all these documents forms the corpus.

In practice, the corpus is the raw dataset you work with. Whether it is a folder of PDF invoices, a scrape of Wikipedia, or a database of tweets, that collection is your corpus.

![Corpora - Corpus - Documents](../../assets/corpora.png)

## Examples of Corpora

- **News corpus**: A collection of news articles from various sources
- **Social media corpus**: A collection of tweets, posts, or comments
- **Literature corpus**: A collection of novels, poems, or plays
- **Scientific corpus**: A collection of research papers or abstracts

Let's see some examples of working with corpora:

In [None]:
# Mixed corpus (different types of documents)
mixed_corpus = {
    "emails": [
        "Subject: Meeting tomorrow at 3 PM",
        "Subject: Project update required"
    ],
    "tweets": [
        "Just finished reading an amazing book! #reading",
        "Beautiful sunset today 🌅"
    ],
    "articles": [
        "Scientists discover new species in the Amazon rainforest...",
        "Technology advances reshape modern education..."
    ]
}

for doc_type, docs in mixed_corpus.items():
    print(f"\n{doc_type.upper()} ({len(docs)} documents):")
    for doc in docs:
        # Append an ellipsis only when the document was actually truncated
        print(f"  - {doc[:60]}{'...' if len(doc) > 60 else ''}")

Notice how each type of document has a different **structure**, **writing tone**, and **vocabulary**.

![A document parsed into an abstract syntax tree](https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/AST_Document.svg/1280px-AST_Document.svg.png)

### Real-World Example Corpora

Let's explore some famous corpora used in NLP research and applications. Understanding these corpora helps us appreciate the scale and diversity of text data used in NLP.


In [None]:
# Comparison: Corpus Sizes and Characteristics
# This table shows the scale differences between different types of corpora

# Real-world corpus statistics (approximate values)
corpus_comparison = {
    'Corpus Name': [
        'Brown Corpus',
        'Reuters-21578',
        'Wikipedia (English)',
        'Common Crawl',
        'Arabic Wikipedia',
        'Twitter (daily)',
        'Google Books'
    ],
    'Documents': [
        '500',
        '21,578',
        '6 million',
        'Billions',
        '1.2 million',
        '500 million',
        '25 million books'
    ],
    'Words (Approx)': [
        '1 million',
        '3 million',
        '6 billion',
        'Trillions',
        '500 million',
        'Billions',
        'Hundreds of billions'
    ],
    'Source': [
        'Brown University, 1961',
        'Reuters News, 1987',
        'Wikimedia Foundation',
        'Web crawls',
        'Wikimedia Foundation',
        'Twitter API',
        'Google Books Project'
    ],
    'Primary Use': [
        'Historical linguistics, benchmarks',
        'Text classification',
        'Language models, embeddings',
        'Large-scale training',
        'Arabic NLP',
        'Sentiment, trends',
        'Historical language analysis'
    ]
}

In [None]:
import pandas as pd

df = pd.DataFrame(corpus_comparison)
df

- Note: These values are approximate, and actual sizes vary by version and snapshot date.
- Large corpora such as Common Crawl and social media streams grow continuously.

## Why Corpora Matter

Corpora serve as the foundation for:

1. **Training language models**
2. **Building domain-specific NLP applications**
3. **Statistical analysis of language patterns**
4. **Linguistic research and analysis**

---

## Exploratory Data Analysis (EDA) for Text Data

**Exploratory Data Analysis (EDA)** is the process of understanding your data *before* you start preprocessing or modeling. For text data, EDA helps you:

- **Identify data quality issues** early (wrong language, duplicates, outliers)
- **Understand vocabulary characteristics** (size, frequency patterns, class-specific words)
- **Determine preprocessing needs** (what noise to remove, what to normalize)
- **Make informed decisions** about your NLP pipeline

> **Analogy**: EDA is like inspecting ingredients before cooking. You check if vegetables are fresh, if you have the right quantities, and if anything needs cleaning. Similarly, EDA helps you understand your text data before "cooking" (preprocessing and modeling).

### Why EDA Matters for Text

**Without EDA**, you might:
- Preprocess incorrectly (e.g., remove important punctuation)
- Miss data quality issues (duplicates, wrong language)
- Choose the wrong preprocessing steps (e.g., stemming when lemmatization is better)
- Build models on biased or poor-quality data

**With EDA**, you:
- Make informed preprocessing decisions
- Catch data issues early
- Understand what your model will learn from
- Validate that preprocessing improved your data

### EDA Components for Text Data

#### 1. Data Quality Assessment

**Class Distribution**: Are categories balanced or imbalanced?
- **Balanced**: Equal number of examples per class (e.g., 50% positive, 50% negative)
- **Imbalanced**: One class dominates (e.g., 90% positive, 10% negative)
- **Why it matters**: Imbalanced data can bias models toward the majority class
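
As a minimal sketch of the balance check described above, the snippet below computes each class's share and a majority-to-minority ratio. The `labels` list is hypothetical illustration data, and the ratio is just one convenient summary, not a standard threshold.

```python
from collections import Counter

# Hypothetical labels for illustration; replace with your dataset's label column.
labels = ["positive"] * 9 + ["negative"] * 1

counts = Counter(labels)
total = sum(counts.values())

# Share of each class in the dataset
proportions = {label: count / total for label, count in counts.items()}

# Ratio of the largest class to the smallest class (1.0 means perfectly balanced)
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in proportions.items():
    print(f"{label}: {counts[label]} examples ({share:.0%})")
print(f"Imbalance ratio (majority / minority): {imbalance_ratio:.1f}")
```

A ratio far above 1 suggests checking per-class metrics rather than overall accuracy alone.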

**Text Length Distribution**: How long are your texts?
- Very short texts (1-5 words) might be incomplete or noisy
- Very long texts (1000+ words) might be concatenated or need chunking
- Most texts should fall within a range that is plausible for the source (a tweet and a news article have very different typical lengths)
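
One way to check this, sketched here with hypothetical texts, is to count words per document and flag anything outside fixed bounds. The `MIN_WORDS` and `MAX_WORDS` values mirror the rough guidance above and are assumptions to tune per dataset.

```python
import statistics

# Hypothetical documents for illustration.
texts = [
    "ok",
    "Great product, works as described and arrived on time.",
    "Not what I expected. The box was damaged and support never replied.",
    "bad",
]

# Whitespace splitting is only a rough approximation of word count.
word_counts = [len(text.split()) for text in texts]

print("Word counts:", word_counts)
print("Median length:", statistics.median(word_counts))

# Assumed thresholds; adjust them to what is plausible for your source.
MIN_WORDS, MAX_WORDS = 5, 1000
too_short = [t for t, n in zip(texts, word_counts) if n <= MIN_WORDS]
too_long = [t for t, n in zip(texts, word_counts) if n >= MAX_WORDS]

print("Possibly incomplete:", too_short)
print("Possibly concatenated:", too_long)
```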

**Language Detection**: Is the text in the expected language?
- Mixed languages need special handling
- Wrong language indicates data collection issues
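
The sketch below is a script-level heuristic rather than true language detection: it measures which writing system the letters of a document belong to, using Unicode character names from the standard library. It cannot tell English from French, for example, so a dedicated language-identification tool is still needed for that. The helper names `script_shares` and `is_expected_script` and the 0.8 threshold are illustrative assumptions.

```python
import unicodedata


def script_shares(text: str) -> dict:
    """Return the share of alphabetic characters belonging to each script."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            # Unicode character names start with the script name,
            # e.g. "LATIN SMALL LETTER A" or "ARABIC LETTER MEEM".
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[script] = counts.get(script, 0) + 1
    total = sum(counts.values())
    return {script: n / total for script, n in counts.items()} if total else {}


def is_expected_script(text: str, expected: str = "LATIN", threshold: float = 0.8) -> bool:
    """True when at least `threshold` of the letters belong to `expected`."""
    return script_shares(text).get(expected, 0.0) >= threshold


docs = ["I love this product!", "ممتاز رائع", "Great product ممتاز"]
for doc in docs:
    print(f"{doc!r}: mostly Latin script = {is_expected_script(doc)}")
```

Documents that fail the check are candidates for manual review: they may be in another language or mix several.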

**Duplicate Detection**: Are there duplicate documents?
- Duplicates can bias training (model sees same example multiple times)
- Need to identify and remove duplicates before training
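
Exact string comparison misses duplicates that differ only in case or spacing. The sketch below, using hypothetical texts and an assumed helper `normalize_for_dedup`, compares documents on a lightly normalized key while keeping the first original version. It does not catch near-duplicates with different wording.

```python
import re


def normalize_for_dedup(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


texts = [
    "I love this product! It is amazing.",
    "This is terrible. Worst purchase ever.",
    "I love this product!  It is amazing. ",  # same text, extra whitespace
    "i love this product! it is amazing.",    # same text, different case
]

seen = set()
unique_texts = []
duplicate_indices = []
for i, text in enumerate(texts):
    key = normalize_for_dedup(text)
    if key in seen:
        duplicate_indices.append(i)
    else:
        seen.add(key)
        unique_texts.append(text)

print(f"Duplicates at positions: {duplicate_indices}")
print(f"Unique documents kept: {len(unique_texts)}")
```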

#### 2. Vocabulary Characteristics

**Vocabulary Size**: How many unique words?
- Small vocabulary (< 1000 words): Limited diversity, might need more data
- Large vocabulary (> 100,000 words): High diversity, might need dimensionality reduction
- Affects vectorization complexity and model performance
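
Measured vocabulary size depends heavily on how you tokenize. The sketch below compares a plain whitespace split with a simple regex tokenizer on two hypothetical sentences. The `tokenize` helper is an illustrative assumption, not a recommended production tokenizer (for instance, it splits "don't" into two tokens).

```python
import re


def tokenize(text: str) -> list:
    # \w+ matches runs of letters, digits, and underscores (Unicode-aware in Python 3)
    return re.findall(r"\w+", text.lower())


texts = ["I love this product! It is amazing.", "Bad product. Do not buy."]

# Naive approach: "product!" and "product." count as two different words
naive_vocabulary = set(" ".join(texts).split())

# Regex approach: lowercase and drop punctuation before counting
tokens = [tok for text in texts for tok in tokenize(text)]
vocabulary = set(tokens)

# Type-token ratio: unique words divided by total words
type_token_ratio = len(vocabulary) / len(tokens)

print(f"Naive vocabulary size: {len(naive_vocabulary)}")
print(f"Normalized vocabulary size: {len(vocabulary)}")
print(f"Type-token ratio: {type_token_ratio:.2f}")
```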

**Word Frequency Patterns**: Which words are most common?
- Very common words (appear in >80% of documents) might be stop words
- Rare words (appear in <2 documents) might be typos or noise
- Helps identify stop words to remove
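
The thresholds above refer to document frequency (how many documents contain a word), not raw counts. A minimal sketch on hypothetical documents:

```python
import re
from collections import Counter

docs = [
    "the battery is great",
    "the screen is great",
    "the battery died quickly",
    "the delivery was late",
    "the screeen cracked",  # "screeen" is a deliberate typo
]

doc_freq = Counter()
for doc in docs:
    # set() ensures each word is counted once per document
    doc_freq.update(set(re.findall(r"\w+", doc.lower())))

n_docs = len(docs)
common = sorted(w for w, df in doc_freq.items() if df / n_docs > 0.8)
rare = sorted(w for w, df in doc_freq.items() if df < 2)

print("Stop word candidates (>80% of docs):", common)
print("Rare words (<2 docs):", rare)
```

In a corpus this small, most legitimate words are also rare, so the rare list is a starting point for inspection rather than a list of confirmed typos.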

**Class-Specific Patterns**: Do certain words appear more in certain classes?
- Words that appear mostly in one class are good features for classification
- Helps with feature selection and model interpretation
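
A simple way to surface class-specific words, sketched below on hypothetical labeled documents, is to compute what fraction of a word's occurrences fall in each class. The cutoffs (at least 2 occurrences, at least 90% in one class) are arbitrary assumptions for this toy example.

```python
import re
from collections import Counter, defaultdict

labeled_docs = [
    ("I love this product, it is amazing", "positive"),
    ("amazing quality, love it", "positive"),
    ("terrible product, worst purchase", "negative"),
    ("worst quality, it is terrible", "negative"),
]

# Word counts per class
class_counts = defaultdict(Counter)
for text, label in labeled_docs:
    class_counts[label].update(re.findall(r"\w+", text.lower()))

# Word counts across all classes
total_counts = Counter()
for counter in class_counts.values():
    total_counts.update(counter)

# Keep words seen at least twice whose occurrences are concentrated in one class
distinctive = {
    label: sorted(
        word
        for word, count in counter.items()
        if total_counts[word] >= 2 and count / total_counts[word] >= 0.9
    )
    for label, counter in class_counts.items()
}

for label, words in distinctive.items():
    print(f"{label}: {words}")
```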

#### 3. Preprocessing Needs Identification

**Noise Patterns**: What needs to be removed?
- URLs, emails, hashtags, mentions (@user)
- HTML tags, special characters
- Identified through EDA by examining sample texts
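
Alongside reading sample texts, it can help to count how often each noise type occurs. The regular expressions below are deliberately simple approximations (assumptions for illustration, not complete URL or email validators), and the texts are hypothetical.

```python
import re

NOISE_PATTERNS = {
    "url": re.compile(r"https?://\S+|www\.\S+"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    # (?<!\w) avoids matching the "@domain" part of an email address
    "mention": re.compile(r"(?<!\w)@\w+"),
    "hashtag": re.compile(r"(?<!\w)#\w+"),
    "html_tag": re.compile(r"<[^>]+>"),
}

texts = [
    "Check this out https://example.com/deal #sale @shop_official",
    "<p>Contact us at help@example.com</p>",
    "Plain sentence with no noise.",
]

# Total number of matches of each pattern across the corpus
noise_counts = {
    name: sum(len(pattern.findall(text)) for text in texts)
    for name, pattern in NOISE_PATTERNS.items()
}

for name, count in noise_counts.items():
    print(f"{name}: {count}")
```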

**Normalization Needs**: What needs to be standardized?
- Case variations (UPPERCASE, lowercase, Title Case)
- Contractions ("don't" vs "do not")
- Elongations ("loooove" vs "love")
- Diacritics (in Arabic: "الْكِتَاب" vs "الكتاب")
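
The normalization steps above can be sketched as a single function. Everything here is an illustrative assumption: the tiny `CONTRACTIONS` table, the choice to collapse elongations to a single character (which turns "coool" into "col" rather than "cool"), and the diacritic range U+064B to U+0652, which covers the common Arabic short-vowel marks.

```python
import re

# Tiny hypothetical contraction table; real pipelines need a fuller list
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

ELONGATION = re.compile(r"(\w)\1{2,}")              # same character 3+ times in a row
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan through sukun


def normalize(text: str) -> str:
    text = text.lower()                              # case variations
    for short, full in CONTRACTIONS.items():         # contractions
        text = text.replace(short, full)
    text = ELONGATION.sub(r"\1", text)               # "loooove" -> "love"
    text = ARABIC_DIACRITICS.sub("", text)           # strip Arabic diacritics
    return text


diacritized = "الْكِتَاب"
print(normalize("I DON'T loooove it"))
print(normalize(diacritized))
```

Whether each step is appropriate depends on the task: case and elongation can carry sentiment, so check their frequency during EDA before removing them.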

**Outliers**: Unusually long or short texts
- Very long texts might be concatenated (need splitting)
- Very short texts might be incomplete (need filtering)
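
When no natural length limits are known, a data-driven rule can flag outliers. The sketch below applies the common 1.5 × IQR rule to hypothetical word counts. The rule is a convention assumed here for illustration, not something this notebook prescribes.

```python
import statistics

# Hypothetical word counts per document
lengths = [12, 15, 14, 13, 16, 15, 14, 1, 240]

# Quartiles of the length distribution
q1, _, q3 = statistics.quantiles(lengths, n=4)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [n for n in lengths if n < lower or n > upper]

print(f"Acceptable range: {lower} to {upper} words")
print(f"Outlier lengths: {outliers}")
```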

### Example: EDA Workflow

Let's see a practical example of EDA on a sample dataset:

In [None]:
# Example: EDA on a sample text dataset
import pandas as pd
import matplotlib.pyplot as plt

# Load sample data (this would be your actual dataset)
sample_data = {
    'text': [
        'I love this product! It is amazing.',
        'This is terrible. Worst purchase ever.',
        'I love this product! It is amazing.',  # Duplicate
        'ممتاز رائع',  # Arabic text
        'The quick brown fox jumps over the lazy dog.',
        'Bad product. Do not buy.',
    ],
    'label': ['positive', 'negative', 'positive', 'positive', 'neutral', 'negative']
}

df = pd.DataFrame(sample_data)

# 1. Class Distribution
print("Class Distribution:")
print(df['label'].value_counts())
# Standard deviation of the per-class counts: 0 means perfectly balanced
print(f"\nStd. dev. of class counts: {df['label'].value_counts().std():.2f} (lower is more balanced)")

# 2. Text Length (in characters; use .str.split().str.len() for word counts)
df['text_length'] = df['text'].str.len()
print("\nText Length Statistics:")
print(df['text_length'].describe())

# 3. Duplicate Detection
duplicates = df.duplicated(subset=['text'])
print(f"\nDuplicates found: {duplicates.sum()}")

# 4. Vocabulary Size (naive whitespace split: case and punctuation are not normalized)
all_words = ' '.join(df['text']).split()
vocab_size = len(set(all_words))
print(f"\nVocabulary Size: {vocab_size} unique words")

# 5. Most Common Words
from collections import Counter

word_counts = Counter(all_words)
print("\nTop 5 Most Common Words:")
for word, count in word_counts.most_common(5):
    print(f"  '{word}': {count} times")

**Output Analysis:**
- Class distribution shows if data is balanced
- Text length helps identify outliers
- Duplicates need to be removed
- Vocabulary size affects vectorization
- Common words help identify stop words

> **Note**: In the lab (Session 7), you'll perform comprehensive EDA on the Arabic 100k Reviews dataset, including class distribution analysis, text length histograms, vocabulary analysis, and word frequency patterns by class.

### EDA Best Practices

1. **Always do EDA before preprocessing** - Understand your raw data first
2. **Visualize distributions** - Histograms, bar charts help identify patterns
3. **Examine samples** - Look at actual text examples, not just statistics
4. **Compare before and after** - Validate that preprocessing improved your data
5. **Document findings** - Keep notes on what you discovered and decisions you made

### The EDA → Preprocessing Connection

EDA findings directly inform preprocessing decisions:

| EDA Finding | Preprocessing Action |
|------------|---------------------|
| Many URLs found | Remove URLs with regex |
| Mixed case (UPPERCASE, lowercase) | Normalize to lowercase |
| Many duplicates | Remove duplicate documents |
| Very long texts | Consider text splitting or truncation |
| Very short texts | Filter out incomplete texts |
| Common words in >80% of docs | Add to stop word list |
| Rare words in <2 docs | Consider min_df filtering |
| Mixed languages | Use language-specific preprocessing |
| Many elongations ("loooove") | Normalize elongations |
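
A few rows of the table can be sketched end to end as follows. The texts, the `clean` and `vocab_size` helpers, and the two-word minimum are hypothetical choices for illustration. Comparing vocabulary size before and after is one simple way to confirm that the cleaning had the intended effect.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def clean(text: str) -> str:
    text = URL_RE.sub("", text)               # "Many URLs found" -> remove URLs
    text = text.lower()                       # "Mixed case" -> normalize to lowercase
    return re.sub(r"\s+", " ", text).strip()  # tidy leftover whitespace


def vocab_size(texts) -> int:
    return len({tok for t in texts for tok in re.findall(r"\w+", t)})


raw_texts = [
    "GREAT phone, see https://example.com/review",
    "great phone, see https://example.com/review",
    "Great PHONE",
    "ok",
]

cleaned = []
seen = set()
for text in raw_texts:
    c = clean(text)
    if len(c.split()) < 2:  # "Very short texts" -> filter (assumed threshold)
        continue
    if c in seen:           # "Many duplicates" -> remove duplicate documents
        continue
    seen.add(c)
    cleaned.append(c)

print(f"Documents: {len(raw_texts)} -> {len(cleaned)}")
print(f"Vocabulary size: {vocab_size(raw_texts)} -> {vocab_size(cleaned)}")
```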

> **Remember**: EDA is not a one-time activity. You should do EDA:
> - **Before preprocessing** (to understand raw data)
> - **After preprocessing** (to validate improvements)
> - **After vectorization** (to understand feature space)

## Key Takeaways

- A **corpus** is a collection of text documents that serves as the fundamental infrastructure for NLP work, including training language models, statistical analysis, and benchmarking.
- Different types of corpora exist (news, social media, literature, scientific) with varying structures, writing tones, and vocabularies.
- Understanding your corpus structure and characteristics is essential for building effective NLP pipelines and avoiding biases in your models.