## Advanced Regex & Corpora: Beyond the Basics
Regular expressions (regex) go well beyond basic search and replace. Substitution, back-references, and look-ahead assertions let a pattern rewrite text, refer back to what it has already matched, and check the surrounding context without consuming it. Understanding corpora and utterances is just as essential for working with speech and text data.

---

### **1. Substitution (`s/REGEX/PATTERN/`)**
Replaces every match of a pattern with a replacement string. In Python, `re.sub(pattern, repl, string)` plays the role of `s/REGEX/PATTERN/`.

In [None]:
import re
text = "My number is 123-456-7890"
result = re.sub(r'\d', '#', text)  # replace every digit with '#'
print(result)  # Output: My number is ###-###-####
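
`re.sub` also accepts an optional `count` argument, and `re.subn` reports how many replacements were made. A minimal sketch, reusing the phone-number string from the cell above (the variable names `first_three`, `masked`, and `n_subs` are illustrative only):

```python
import re

text = "My number is 123-456-7890"

# Replace only the first three digits.
first_three = re.sub(r'\d', '#', text, count=3)
print(first_three)  # My number is ###-456-7890

# re.subn returns the new string and the number of substitutions made.
masked, n_subs = re.subn(r'\d', '#', text)
print(masked, n_subs)  # My number is ###-###-#### 10
```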

### **2. Numbered Register: Back Reference (`(.*)(.+)\1\2`)**
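A back-reference such as `\1` matches the exact text captured by the corresponding parenthesized group, so a pattern can require that the same text appears again.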

In [None]:
text = "hello hello world"
pattern = r'\b(\w+) \1\b'
print(re.search(pattern, text))  # Matches "hello hello"
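
To illustrate two numbered registers at once, the sketch below uses a pattern in which `\1` and `\2` must each repeat what the first and second groups captured. It also shows that a back-reference can appear in the replacement string of `re.sub`. The example sentences and variable names are made up for illustration:

```python
import re

# Two numbered registers: \1 and \2 must repeat what groups 1 and 2 captured.
pattern = r'the (.*)er they (.*), the \1er we \2'

consistent = "the faster they ran, the faster we ran"
inconsistent = "the faster they ran, the faster we ate"

matches_consistent = bool(re.search(pattern, consistent))
matches_inconsistent = bool(re.search(pattern, inconsistent))
print(matches_consistent)    # True
print(matches_inconsistent)  # False

# Back-references also work in the replacement string of re.sub.
bracketed = re.sub(r'(\d+)', r'<\1>', "the 35 boxes")
print(bracketed)  # the <35> boxes
```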

### **3. Non-Capturing Group (`(?:pattern)`)**
Groups part of a pattern (for alternation or quantifiers) without capturing it, so the group does not take up a numbered register.

In [None]:
text = "hello hello world"
pattern = r'(?:hello) world'
print(re.search(pattern, text))  # Matches "hello world"
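
The effect on register numbering is easiest to see by comparing `groups()` for a capturing and a non-capturing version of the same pattern. A small sketch with illustrative sentences:

```python
import re

text = "some cats like some cats"

# Both groups capture, so "cats" lives in register 2.
capturing = re.search(r'(some|a few) (people|cats) like some \2', text)
print(capturing.groups())  # ('some', 'cats')

# (?:...) groups without capturing, so "cats" is now register 1.
non_capturing = re.search(r'(?:some|a few) (people|cats) like some \1', text)
print(non_capturing.groups())  # ('cats',)

# \1 must repeat the noun, not "some", so this sentence does not match.
no_match = re.search(r'(?:some|a few) (people|cats) like some \1',
                     "some cats like some some")
print(no_match)  # None
```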

### **4. Positive Look-ahead (`PATTERN1(?=PATTERN2)`)**
Asserts that `PATTERN1` is immediately followed by `PATTERN2`, without including `PATTERN2` in the match.

In [None]:
text = "I love apple pie"
pattern = r'apple(?= pie)'
print(re.search(pattern, text))  # Matches "apple"
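
Because a look-ahead is zero-width, the text it checks is not part of the match. The sketch below makes that visible through the match span and through a substitution that leaves " pie" untouched (the replacement word `cherry` is just an illustration):

```python
import re

text = "I love apple pie"

m = re.search(r'apple(?= pie)', text)
print(m.group())  # apple
print(m.span())   # (7, 12)

# " pie" is not consumed, so only "apple" is replaced.
swapped = re.sub(r'apple(?= pie)', 'cherry', text)
print(swapped)  # I love cherry pie

# No match when "apple" is followed by something else.
other = re.search(r'apple(?= pie)', "I love apple juice")
print(other)  # None
```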

### **5. Negative Look-ahead (`PATTERN1(?!PATTERN2)`)**
Matches `PATTERN1` only where it is NOT immediately followed by `PATTERN2`.

In [None]:
text = "I love apple juice"
pattern = r'apple(?! pie)'
print(re.search(pattern, text))  # Matches "apple"
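
Running the same pattern against both sentences shows the contrast. The second half of this sketch puts the negative look-ahead at the start of a line to accept a first word only if it does not begin with "Volcano". The test strings are illustrative:

```python
import re

pattern = r'apple(?! pie)'

juice = re.search(pattern, "I love apple juice")
pie = re.search(pattern, "I love apple pie")
print(juice.group())  # apple
print(pie)            # None

# First word of a line, unless the line begins with "Volcano".
first_word = r'^(?!Volcano)[A-Za-z]+'
accepted = re.search(first_word, "Mountains are tall")
rejected = re.search(first_word, "Volcanoes erupt")
print(accepted.group())  # Mountains
print(rejected)          # None
```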

### **Corpora vs Utterance**
**Corpus (plural: corpora):** A large, computer-readable collection of text or transcribed speech used in NLP.
- Example: A dataset of thousands of transcribed conversations.

**Utterance:** A single spoken phrase or sentence in a conversation.
- Example: "Hey, how are you doing?"
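
In code, one simple way to picture the relationship is a corpus as a list whose elements are utterances. This is only a toy sketch: real corpora are far larger and usually carry metadata such as speaker and timing. Whitespace splitting stands in for proper tokenization here.

```python
# A toy corpus: each element is one utterance.
corpus = [
    "Hey, how are you doing?",
    "Uh, I want to, um, order a pizza.",
]

num_utterances = len(corpus)
# Whitespace splitting is a crude stand-in for real tokenization.
num_tokens = sum(len(utterance.split()) for utterance in corpus)

print(num_utterances)  # 2
print(num_tokens)      # 13
```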

### **Speech Disfluencies & NLP**
Speech often includes **disfluencies** (interruptions or hesitations). NLP models need to handle these correctly.

**Types of Disfluencies:**
- **Fillers (Filled Pauses):** Words like "uh," "um"
- **Fragments:** Broken-off words like "Hope-Hopefully"

### **How Speech Transcription Systems Handle Disfluencies**
- NLP models often remove fillers and correct fragments to improve readability.

**Example:**
- Raw: "Uh, I want to, um, order a pizza."
- Processed: "I want to order a pizza."
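
The cleanup in the example can be approximated with the regex features covered earlier. This is a rough sketch, not how any particular transcription system works: the filler list (`uh`, `um`) and the fragment heuristic are assumptions, and the names `FILLER`, `FRAGMENT`, and `clean_utterance` are hypothetical.

```python
import re

# Assumed filler list; real systems use larger, language-specific sets.
# Surrounding commas and spaces are consumed along with the filler.
FILLER = re.compile(r',?\s*\b(?:uh|um)\b,?\s*', flags=re.IGNORECASE)

# Heuristic: a fragment is "prefix-" directly followed by a longer word
# that starts with the same prefix (back-reference inside a look-ahead).
FRAGMENT = re.compile(r'\b(\w+)-\s*(?=\1\w)')


def clean_utterance(utterance: str) -> str:
    """Remove filled pauses and broken-off word fragments."""
    without_fillers = FILLER.sub(' ', utterance)
    without_fragments = FRAGMENT.sub('', without_fillers)
    return without_fragments.strip()


print(clean_utterance("Uh, I want to, um, order a pizza."))
# I want to order a pizza.
print(clean_utterance("Hope-Hopefully it works"))
# Hopefully it works
```

The fragment rule is deliberately narrow and can still misfire on legitimate hyphenated words, so it should be checked against real transcripts before use.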

### **Conclusion**
Regex and corpora are crucial for working with text and speech data. Advanced Regex concepts help refine pattern matching, while understanding utterances and disfluencies improves speech transcription. Have you used any of these techniques in your projects? Let’s discuss! 🚀