In [2]:
from pathlib import Path
import re

## 📚 Corpus Assembly: Merging Text Files for Tokenization

This step combines multiple source documents into a single text file, preparing a unified corpus for downstream tokenization and analysis.

### **Inputs**
- **Directory:** `texts/` containing `.txt` files (UTF-8 encoded).
- **Pattern:** All files matching `*.txt` are included; subfolders are ignored.

### **Process Overview**
1. **Enumerate Files:** Use `Path('./texts').glob('*.txt')` to list all text files in the directory.
2. **Read & Concatenate:** For each file, read its contents as UTF-8 and append to a single output file.
3. **Separation:** Append an end-of-sequence token (`<EOS>`) after each file so that document boundaries stay visible in the merged corpus.
4. **Output:** Write the combined result to `all_text.txt` in the project root.

### **Why This Step?**
- **Consistency:** Ensures all data is in one place for easier processing and reproducibility.
- **Efficiency:** Downstream scripts (tokenizers, analyzers) can operate on a single file, simplifying I/O.
- **Flexibility:** Easy to add or remove source files by updating the `texts/` folder.

### **Validation & Tips**
- Check the size of `all_text.txt` to confirm all data was written.
- Open the file and inspect the start/end of each document for encoding or separator issues.
- For reproducible order, use `sorted(Path('./texts').glob('*.txt'))`.
- To add metadata, consider writing the filename or a header before each document.
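
The tips above can be combined into one small helper. This is a minimal sketch, not the notebook's own code: the function name `merge_corpus`, its parameters, and the `### <filename>` header format are all hypothetical choices made for illustration.

```python
from pathlib import Path

def merge_corpus(src_dir, out_path, eos="<EOS>", with_header=False):
    """Concatenate *.txt files from src_dir into out_path (hypothetical helper)."""
    # sorted() makes the concatenation order reproducible across runs.
    files = sorted(Path(src_dir).glob("*.txt"))
    with open(out_path, "w", encoding="utf-8") as outfile:
        for file in files:
            if with_header:
                # Illustrative metadata line; the header format is an assumption.
                outfile.write(f"### {file.name}\n")
            # Each document is followed by the end-of-sequence marker.
            outfile.write(file.read_text(encoding="utf-8") + eos)
    # Return the file names in the order they were written, for logging.
    return [file.name for file in files]
```

Calling `merge_corpus('./texts', 'all_text.txt')` would write the same content as the cell in this section, but in sorted order. If headers are enabled, they become part of the corpus and will be tokenized like any other text.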

### **Next Steps**
- Use `all_text.txt` as input for tokenizer training, vocabulary building, or text analysis.
- Optionally, preprocess the text (e.g., normalization, lowercasing) before tokenization if required by your workflow.

In [3]:
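# Note: glob() does not guarantee an order; wrap it in sorted() for a reproducible corpus.
files = Path('./texts').glob('*.txt')
with open('all_text.txt', 'w', encoding='utf-8') as outfile:
    for file in files:
        # glob() already yields Path objects; append each document plus the separator.
        outfile.write(file.read_text(encoding='utf-8') + '<EOS>')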



## 📏 Corpus Character Count: Quick Data Integrity Check

After merging your text files into `all_text.txt`, it's important to verify that the corpus was assembled correctly and contains the expected amount of data. This cell performs a simple but effective validation by reading the entire file and reporting the total number of characters.

### **Purpose**
- **Sanity Check:** Confirms that `all_text.txt` is not empty and that the concatenation process worked as intended.
- **Data Integrity:** Helps detect issues such as missing files, encoding errors, or incomplete writes.
- **Baseline Metric:** Provides a reference point for future preprocessing or modifications.

### **What This Cell Does**
1. Opens `all_text.txt` in read mode with UTF-8 encoding.
2. Reads the entire file content into a string variable (`raw_text`).
3. Prints the total number of characters in the corpus.

### **How to Use the Output**
- **Expected Value:** The count should be nonzero and roughly equal to the combined length of the source files plus five characters per file for the `<EOS>` separators. If it's unexpectedly small, check your `texts/` directory for missing or empty files.
- **Troubleshooting:** If you encounter encoding errors, ensure all input files are UTF-8 encoded.
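- **Scaling:** For very large corpora, consider reading the file in chunks or using the file size in bytes (`os.stat('all_text.txt').st_size`) instead. The byte size is only an upper bound on the character count, because UTF-8 encodes non-ASCII characters in more than one byte.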
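
The chunked approach mentioned under **Scaling** could look like the following. This is a minimal sketch; the function name `count_characters` and the default chunk size are assumptions, not part of the notebook.

```python
def count_characters(path, chunk_size=1 << 20):
    """Count decoded characters without loading the whole file into memory."""
    total = 0
    with open(path, "r", encoding="utf-8") as f:
        # In text mode, read(n) returns at most n characters (not bytes),
        # and an empty string once the end of the file is reached.
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total
```

For a file that fits in memory, `count_characters('all_text.txt')` should report the same number as `len(raw_text)`. The `:=` operator requires Python 3.8 or later.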

### **Next Steps**
- Use `raw_text` as input for tokenization, vocabulary extraction, or further text analysis.
- Optionally, perform additional validation (e.g., line count, previewing the start/end of the file) to further ensure data quality.
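
The additional validation suggested above can be collected in one place. This is a minimal sketch under stated assumptions: `summarize_corpus` is a hypothetical helper, and the document count is only meaningful if no source text contains the literal string `<EOS>`.

```python
def summarize_corpus(raw_text, eos="<EOS>", preview=40):
    """Return quick integrity indicators for a merged corpus string."""
    return {
        "characters": len(raw_text),
        "lines": raw_text.count("\n") + 1 if raw_text else 0,
        # One marker is written per source file, so this should match the
        # number of files (assuming no source text contains the marker itself).
        "documents": raw_text.count(eos),
        # Short previews for eyeballing encoding or separator problems.
        "head": raw_text[:preview],
        "tail": raw_text[-preview:],
    }
```

Comparing `summarize_corpus(raw_text)["documents"]` with the number of files in `texts/` is a quick way to spot a file that was skipped.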

In [4]:
with open('all_text.txt', 'r', encoding='utf-8') as input_text:
    raw_text = input_text.read()
print(f"Total number of characters in the raw text: {len(raw_text)}")


Total number of characters in the raw text: 330569


## 🧮 Tokenization and Vocabulary Construction

This cell performs the core tokenization step and builds a vocabulary from your assembled corpus. It splits the raw text into tokens, counts them, and constructs a mapping from each unique token to a unique integer index.

### **What This Cell Does**
1. **Tokenization:**  
   - Uses a regular expression with `re.split` to break the text into tokens.
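   - The pattern splits on common punctuation marks (`[,.!?():;_'"]`), double dashes (`--`), and whitespace (`\s`).
   - Because the pattern is wrapped in a capturing group, `re.split` keeps the delimiters: punctuation and whitespace characters appear as separate tokens alongside words. Adjacent delimiters (for example `, `) also produce an empty-string token between them.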

2. **Token Count:**  
   - Prints the total number of tokens generated from the corpus.
   - Useful for understanding the granularity and size of your tokenized data.

3. **Unique Tokens:**  
   - Converts the token list to a set to extract all unique tokens.
   - Sorts them for reproducibility and prints the total count.
   - This count is the vocabulary size before `<UNK>` is added, a key metric for language modeling and analysis.

4. **Vocabulary Mapping:**  
   - Creates a dictionary (`vocab`) mapping each unique token to a unique integer index.
   - This mapping is essential for converting text into numerical form for machine learning models.
   - Adds a special `<UNK>` token at the end of the vocabulary (its index is the number of unique tokens) to handle unknown tokens.
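
The steps above are easier to follow on a short string than on the full corpus. The sketch below applies the same pattern to a made-up sample sentence (the sample is an assumption; the regex and the vocabulary construction mirror the cell in this section).

```python
import re

sample = "Hello, world!"

# The capturing group keeps the delimiters in the result.
tokens = re.split(r'([,.!?():;_\'"]|--|\s)', sample)
print(tokens)
# ['Hello', ',', '', ' ', 'world', '!', '']

# Unique tokens, sorted so that the index assignment is deterministic.
all_tokens = sorted(set(tokens))
vocab = {token: index for index, token in enumerate(all_tokens)}
vocab['<UNK>'] = len(vocab)
print(vocab)
# {'': 0, ' ': 1, '!': 2, ',': 3, 'Hello': 4, 'world': 5, '<UNK>': 6}

# Joining the tokens without a separator reproduces the input exactly.
assert ''.join(tokens) == sample
```

Note the empty strings: one between `,` and the following space, and one after the final `!`. They are ordinary entries in the token list and in the vocabulary.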

### **Tips & Customization**
- **Regex Tuning:** Adjust the regular expression to better fit your language or domain (e.g., handle contractions, special symbols, or multi-word expressions).
- **Whitespace Handling:** The current pattern keeps whitespace characters, and the empty strings between adjacent delimiters, as tokens. If you want to ignore or merge them, filter the token list or modify the regex accordingly.
- **Vocabulary Filtering:** For large corpora, consider filtering out rare tokens or applying additional normalization (e.g., lowercasing, stemming).
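
Two of these customizations, dropping whitespace tokens and filtering rare tokens, can be sketched as follows. The function `build_vocab` and its parameters `min_count` and `keep_whitespace` are hypothetical names introduced for illustration; the notebook itself keeps every token.

```python
import re
from collections import Counter

def build_vocab(text, min_count=1, keep_whitespace=True):
    """Build a token-to-index mapping with optional filtering (illustrative)."""
    tokens = re.split(r'([,.!?():;_\'"]|--|\s)', text)
    if not keep_whitespace:
        # strip() is empty for both whitespace tokens and empty strings.
        tokens = [t for t in tokens if t.strip()]
    counts = Counter(tokens)
    # Keep only tokens that occur at least min_count times.
    kept = sorted(t for t, c in counts.items() if c >= min_count)
    vocab = {t: i for i, t in enumerate(kept)}
    # Filtered-out tokens will be encoded as <UNK>.
    vocab['<UNK>'] = len(vocab)
    return vocab
```

One trade-off to keep in mind: once whitespace tokens are dropped, joining decoded tokens no longer reproduces the original spacing, so the decoder would need its own rule for inserting spaces.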

### **Next Steps**
- Use the `tokenized_text` list for further processing, such as sequence modeling or n-gram analysis.
- The `vocab` dictionary can be saved and reused for encoding new text or training models.
- Analyze token frequency distributions or visualize the most common tokens for insights into your corpus.
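
As a starting point for these next steps, the sketch below counts the most frequent tokens and saves the vocabulary as JSON. The helper names (`top_tokens`, `save_vocab`, `load_vocab`) and the choice of JSON as the storage format are assumptions, not something the notebook prescribes.

```python
import json
from collections import Counter

def top_tokens(tokens, n=10):
    """Return the n most frequent tokens, ignoring whitespace and empty strings."""
    counts = Counter(t for t in tokens if t.strip())
    return counts.most_common(n)

def save_vocab(vocab, path):
    """Write the token-to-index mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII tokens readable in the file.
        json.dump(vocab, f, ensure_ascii=False)

def load_vocab(path):
    """Read a token-to-index mapping written by save_vocab."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

With the notebook's variables, `top_tokens(tokenized_text)` would list the most common words and punctuation marks, and `save_vocab(vocab, 'vocab.json')` would persist the mapping for later encoding runs.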

In [5]:
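# The capturing group makes re.split keep the delimiters as tokens.
tokenized_text = re.split(r'([,.!?():;_\'"]|--|\s)', raw_text)
print(f"Total number of tokens: {len(tokenized_text)}")

# Sort the unique tokens so that index assignment is deterministic.
all_tokens = sorted(set(tokenized_text))
print(f"Total number of unique tokens: {len(all_tokens)}")

vocab = {token: index for index, token in enumerate(all_tokens)}
# Reserve the last index for tokens that are not in the vocabulary.
vocab['<UNK>'] = len(vocab)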





Total number of tokens: 143975
Total number of unique tokens: 7357


## 🧩 MiniTokenizer Class: Encoding and Decoding Text

This cell defines the `MiniTokenizer` class, which provides simple methods to convert text into sequences of token IDs (encoding) and to reconstruct text from token IDs (decoding) using the vocabulary built in previous steps.

### **Class Overview**
- **Initialization (`__init__`):**
  - Takes a `vocab` dictionary mapping tokens to unique integer indices.
  - Builds an `inverse_vocab` dictionary for reverse lookup (index to token), enabling decoding.

- **Encoding (`encode` method):**
  - Splits input text into tokens using the same regular expression as before (`re.split(r'([,.!?():;_\'"]|--|\s)', text)`).
  - Converts each token into its corresponding integer ID using the vocabulary.
  - Substitutes the ID of the special `<UNK>` token whenever a token is not in the vocabulary.
  - Returns a list of token IDs representing the input text.

- **Decoding (`decode` method):**
  - Converts a list of token IDs back into tokens using `inverse_vocab`.
  - Concatenates the tokens without a separator; whitespace is already part of the token stream, so in-vocabulary text is reproduced exactly.
  - Returns the reconstructed text.
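
The encode/decode behavior described above can be checked in isolation. The following is a self-contained sketch that uses plain functions and a tiny hand-built vocabulary instead of the class, so the function name `roundtrip` and the toy vocabulary are assumptions made for illustration.

```python
import re

PATTERN = r'([,.!?():;_\'"]|--|\s)'

def roundtrip(text, vocab):
    """Encode text to IDs and decode it again, as encode/decode would."""
    unk_id = vocab['<UNK>']
    inverse_vocab = {index: token for token, index in vocab.items()}
    # Encoding: unknown tokens fall back to the <UNK> ID.
    ids = [vocab.get(token, unk_id) for token in re.split(PATTERN, text)]
    # Decoding: concatenate tokens without a separator.
    return ''.join(inverse_vocab[i] for i in ids)

# Toy vocabulary built from a single sentence.
toy_tokens = sorted(set(re.split(PATTERN, "Hello, world!")))
toy_vocab = {token: index for index, token in enumerate(toy_tokens)}
toy_vocab['<UNK>'] = len(toy_vocab)

print(roundtrip("Hello, world!", toy_vocab))  # Hello, world!
print(roundtrip("Hello, there!", toy_vocab))  # Hello, <UNK>!
```

Text made up entirely of known tokens survives the round trip unchanged, while each unknown word is replaced by the literal string `<UNK>`.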

### **Usage Example**
```python
tokenizer = MiniTokenizer(vocab)
ids = tokenizer.encode("Hello, world!")
print(ids)  # Encoded token IDs
print(tokenizer.decode(ids))  # Decoded text; out-of-vocabulary words appear as <UNK>
```

In [10]:
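class MiniTokenizer:
    """Regex-based tokenizer backed by a fixed token-to-ID vocabulary."""

    def __init__(self, vocab):
        self.vocab = vocab
        # Reverse mapping (ID -> token) for decoding.
        self.inverse_vocab = {index: token for token, index in vocab.items()}

    def encode(self, text):
        # Same split pattern that was used to build the vocabulary.
        tokens = re.split(r'([,.!?():;_\'"]|--|\s)', text)
        unk_id = self.vocab['<UNK>']
        # Tokens missing from the vocabulary map to the <UNK> ID.
        return [self.vocab.get(token, unk_id) for token in tokens]

    def decode(self, token_ids):
        # Whitespace is already part of the token stream, so join without a separator.
        return ''.join(self.inverse_vocab[token_id] for token_id in token_ids)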

In [14]:
tokenizer = MiniTokenizer(vocab)
ids = tokenizer.encode("Hello, world! This is a test of how well the tokenizer works.")
print(ids)  # Encoded token IDs
print(tokenizer.decode(ids))  # Decoded text; out-of-vocabulary words appear as <UNK>

[7357, 10, 0, 3, 7182, 4, 0, 3, 1335, 3, 4267, 3, 1473, 3, 6587, 3, 4955, 3, 3929, 3, 7092, 3, 6602, 3, 7357, 3, 7181, 12, 0]
<UNK>, world! This is a test of how well the <UNK> works.
