## Corpus & Normalization: Understanding Text Data
A corpus is a large collection of text used in NLP (Natural Language Processing). Understanding word distributions, vocabulary size, and normalization techniques is crucial for text analysis.

---

### **1. Word Types & Vocabulary Size (`|V|`)**
**Word Types** refer to unique words in a text, while **Vocabulary Size** is the count of these unique words.
- **Example:** In "hello world hello", the word types are `{hello, world}`, so `|V| = 2`.

### **2. Word Instances (`N`)**
**Word Instances** (also called tokens) are all the running words in a text, counted with repetitions.
- **Example:** In "hello world hello", the total word instances `N = 3`.
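The two counts above can be checked with a few lines of Python. This is only a minimal sketch: it assumes that splitting on whitespace is an acceptable tokenizer, which holds for this toy sentence but not for real text with punctuation.

```python
# Toy sentence from the examples above.
text = "hello world hello"

# Assumption: whitespace splitting is good enough as a tokenizer here.
tokens = text.split()

# Word instances (N): every running word, repetitions included.
n_instances = len(tokens)   # 3

# Word types: the set of distinct words; |V| is the size of that set.
types = set(tokens)
vocab_size = len(types)     # 2

print("N =", n_instances)
print("|V| =", vocab_size)
print("types =", sorted(types))
```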

### **3. Herdan’s Law**
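- **Concept:** As a corpus gets larger, the number of word types keeps growing, but much more slowly than the number of word instances. The same relationship is also known as Heaps' Law.
- **Example:** A book might contain 100,000 word instances but only 20,000 word types (illustrative numbers).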

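### **4. Heaps’ Law (`|V| = kN^β`)**
- **Formula:** Vocabulary size `|V|` grows as a power of the number of word instances `N`, where `k` and `β` are positive constants and `0 < β < 1`. Because `β` is below 1, the growth is sublinear: the vocabulary grows more slowly than the text itself. This is the same relationship that Herdan’s Law describes.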
- **Example:** If `k=10` and `β=0.5`, then for `N=10000`, the vocabulary size is `|V| = 10 * (10000)^0.5 = 1000`.
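The arithmetic in the example can be reproduced with a small helper. This is a sketch under the example's own values (`k=10`, `β=0.5`); the function name `heaps_vocabulary_size` is made up for illustration, and real values of `k` and `β` have to be fitted to a specific corpus.

```python
def heaps_vocabulary_size(n_instances: int, k: float, beta: float) -> float:
    """Estimate vocabulary size |V| = k * N**beta."""
    return k * n_instances ** beta


# Values taken from the example above; they are illustrative, not fitted.
estimate = heaps_vocabulary_size(10_000, k=10, beta=0.5)
print(estimate)

# Sublinear growth: 100 times more text gives only 10 times more types
# when beta is 0.5.
for n in (10_000, 1_000_000):
    print(n, heaps_vocabulary_size(n, k=10, beta=0.5))
```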

### **5. Lemma & Text Normalization**
- A **lemma** is the base or dictionary form of a word. Mapping each word to its lemma (lemmatization) is a common text normalization step.
- **Example:** *run, runs, running* → **run** (lemma).
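To make the idea concrete, here is a toy lemmatizer built on a hand-written lookup table. Both `LEMMA_TABLE` and `lemmatize` are hypothetical names invented for this sketch; real lemmatizers rely on a full dictionary and morphological analysis rather than a short table like this one.

```python
# Hypothetical, hand-written table covering only the example words.
LEMMA_TABLE = {
    "run": "run",
    "runs": "run",
    "running": "run",
}


def lemmatize(word: str) -> str:
    """Return the lemma for a word, or the lowercased word if it is unknown."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


words = ["run", "runs", "Running"]
lemmas = [lemmatize(w) for w in words]
print(lemmas)  # all three map to "run"
```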

### **6. Word Forms**
- **Word forms** are the inflected forms a word takes to mark tense, number, or other grammatical features.
- **Example:** *eat, ate, eaten* (different forms of "eat").

### **7. Code Switching**
**Code switching** is the use of more than one language within a single sentence or conversation.
- **Example:** "Naan enna solrenaa, I am a good boy." (Tamil and English in one sentence)

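### **8. Data Sheet / Data Statement**
A data sheet (or data statement) documents how a corpus came to be, so that users can judge what it is suitable for. It typically answers the following questions: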
- **Motivation:** Why was the corpus collected, by whom, and who funded it?
- **Situation:** When and where was the text written/spoken? Was it a conversation, edited text, or social media communication?
- **Language Variety:** What language (dialect/region) is the corpus in?
- **Speaker Demographics:** Age, gender, and background of the text authors.
- **Collection Process:** How was the data collected? Was consent obtained? How was it pre-processed?
- **Annotation Process:** What annotations exist? Who annotated it? How were they trained?
- **Distribution:** Are there copyright or intellectual property restrictions?
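One way to keep such a statement next to the corpus is a small record type with one field per item in the list above. This is only a sketch: the class `DataStatement`, the helper `missing_fields`, and the sample values are all invented for illustration and do not follow any official schema.

```python
from dataclasses import dataclass, asdict


@dataclass
class DataStatement:
    """Hypothetical record with one field per item in the list above."""
    motivation: str
    situation: str
    language_variety: str
    speaker_demographics: str
    collection_process: str
    annotation_process: str
    distribution: str


def missing_fields(statement: DataStatement) -> list[str]:
    """Return the names of fields that are still empty."""
    return [name for name, value in asdict(statement).items() if not value.strip()]


# Made-up values for illustration; two fields are left blank on purpose.
example = DataStatement(
    motivation="Classroom exercise on word counts",
    situation="Edited written text",
    language_variety="English",
    speaker_demographics="Unknown",
    collection_process="Downloaded as a single plain-text file",
    annotation_process="",
    distribution="",
)
print(missing_fields(example))
```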

---

## Useful UNIX Commands for Text Processing
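**Instructions:** Make sure your corpus file is named `sh.txt` and sits in the directory where you run these commands. The leading `!` runs a shell command from a notebook cell; omit it when typing the command directly in a terminal.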

```
# Extract words and list them one per line
!tr -sc 'A-Za-z' '\n' < sh.txt
```

```
# Replace each run of non-letters with a tab, then convert the text to uppercase
!tr -sc 'A-Za-z' '\t' < sh.txt | tr a-z A-Z
```

```
# List each distinct word with its count (case-sensitive, in sorted order)
!tr -sc 'A-Za-z' '\n' < sh.txt | sort | uniq -c
```

```
# Convert words to lowercase, count them, and list them from most to least frequent
!tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r
```
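For readers without a UNIX shell, the last pipeline can be approximated in plain Python. This is a rough sketch, not an exact replacement: the function name `word_frequencies` is made up, the sample string stands in for the contents of `sh.txt`, and words with equal counts may come out in a different order than `sort -n -r` would give.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count lowercased words, where a word is a run of ASCII letters."""
    # Same word definition as tr -sc 'A-Za-z' '\n': anything else is a separator.
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(word.lower() for word in words)


# Stand-in text; for a real corpus, read the file contents instead.
sample = "Hello world, hello corpus!"
counts = word_frequencies(sample)

# Most frequent words first, like the final sort -n -r step.
for word, count in counts.most_common():
    print(count, word)
```

To run it on the corpus file, replace `sample` with the result of `open("sh.txt", encoding="utf-8").read()`, assuming the file is UTF-8 encoded.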

### **Conclusion**
Understanding corpus statistics, text normalization, and UNIX commands for text processing is essential for NLP. These concepts help in language modeling, text mining, and AI applications. Which concept do you find most interesting? Let’s discuss! 🚀