# Tokenization with Labels for Supervised Learning

This notebook demonstrates five ways to attach labels to tokenized text so the result can be used for supervised learning.

For detailed documentation, see **TOKENIZATION_LABELS_GUIDE.md**.

In [None]:
# Install required packages (the %pip magic installs into the notebook's active kernel)
%pip install transformers datasets tokenizers torch

In [None]:
# Import the tokenization utility
from tokenization_with_labels import TokenizationWithLabels, create_example_data_files
import json

In [None]:
# Create example data files to demonstrate different methods
create_example_data_files()
print("Example data files created!")

## Method 1: Labels in Separate File

Use this when you have:
- Text data in one file (e.g., quran_data.txt)
- Labels in another file (e.g., labels.txt)
- Both files have the same number of lines
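
Because this method relies on the two files having the same number of lines, it can help to check that before tokenizing. The sketch below uses only the standard library. `check_line_counts` is a hypothetical helper, not part of `tokenization_with_labels`, and the temporary files stand in for your own data. Whether the utility itself skips blank lines is not documented here, so the helper's choice to ignore them is an assumption.

```python
import os
import tempfile


def check_line_counts(text_path, labels_path):
    """Hypothetical helper: return (n_text, n_labels), counting non-empty lines."""
    def count(path):
        with open(path, encoding="utf-8") as f:
            # Assumption: blank lines are not data rows
            return sum(1 for line in f if line.strip())
    return count(text_path), count(labels_path)


# Stand-in files for illustration only
tmp_dir = tempfile.mkdtemp()
text_path = os.path.join(tmp_dir, "text.txt")
labels_path = os.path.join(tmp_dir, "labels.txt")
with open(text_path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\nthird line\n")
with open(labels_path, "w", encoding="utf-8") as f:
    f.write("a\nb\nc\n")

n_text, n_labels = check_line_counts(text_path, labels_path)
if n_text != n_labels:
    raise ValueError(f"{n_text} text lines but {n_labels} labels")
print(f"Line counts match: {n_text}")
```

Running a check like this first turns a silent misalignment into an immediate error.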

In [None]:
# Method 1: Tokenize with labels from separate files
tokenizer = TokenizationWithLabels()

tokenized_data = tokenizer.tokenize_with_labels_from_separate_file(
    'examples/quran_data_example.txt',
    'examples/labels_example.txt'
)

print(f"Tokenized {len(tokenized_data)} items")
print("\nFirst example:")
print(json.dumps(tokenized_data[0], ensure_ascii=False, indent=2))

## Method 2: Embedded Labels in CSV

Use this when you have a CSV file with columns for both text and labels.
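
To make the idea of a text column and a label column concrete, the sketch below reads a small in-memory CSV with the standard `csv` module and extracts the two columns by name. The column names `verse` and `category` are the ones used in this notebook's example. The rows are placeholder data, and the sketch does not show how `TokenizationWithLabels` reads the file internally.

```python
import csv
import io

# Placeholder CSV content; a real file would be opened with
# open(path, newline="", encoding="utf-8") instead of io.StringIO
csv_text = (
    "verse,category\n"
    "sample text one,praise\n"
    "sample text two,attribute\n"
)

reader = csv.DictReader(io.StringIO(csv_text))
# Each row is a dict keyed by the header names
pairs = [(row["verse"], row["category"]) for row in reader]

for text, label in pairs:
    print(f"{label}: {text}")
```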

In [None]:
# Method 2: Tokenize with embedded labels from CSV
tokenizer = TokenizationWithLabels()

tokenized_data = tokenizer.tokenize_with_embedded_labels(
    'examples/quran_with_labels_example.csv',
    text_column='verse',
    label_column='category'
)

print(f"Tokenized {len(tokenized_data)} items")
print("\nFirst example:")
print(json.dumps(tokenized_data[0], ensure_ascii=False, indent=2))

## Method 3: Embedded Labels in JSON

Use this when you have a JSON file with fields for both text and labels.
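
The exact JSON layout the utility expects is not shown in this notebook. One common layout, assumed for this sketch, is an array of objects that each carry a text field and a label field. The snippet checks that every record has both fields before pairing them. The field names follow this notebook's example, and the records are placeholder data.

```python
import json

# Assumed layout: a JSON array of records, one per text
json_text = """
[
  {"verse": "sample text one", "category": "praise"},
  {"verse": "sample text two", "category": "attribute"}
]
"""

records = json.loads(json_text)

# Indices of records that lack either field
missing = [
    i for i, record in enumerate(records)
    if "verse" not in record or "category" not in record
]
if missing:
    raise KeyError(f"Records missing a field: {missing}")

pairs = [(record["verse"], record["category"]) for record in records]
print(pairs)
```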

In [None]:
# Method 3: Tokenize with embedded labels from JSON
tokenizer = TokenizationWithLabels()

tokenized_data = tokenizer.tokenize_with_embedded_labels(
    'examples/quran_with_labels_example.json',
    text_column='verse',
    label_column='category',
    file_format='json'
)

print(f"Tokenized {len(tokenized_data)} items")
print("\nFirst example:")
print(json.dumps(tokenized_data[0], ensure_ascii=False, indent=2))

## Method 4: Label Mapping

Use this when you have a dictionary mapping zero-based line indices to labels.
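
How the utility treats lines that have no entry in the mapping is not specified here. The sketch below illustrates the general idea in plain Python, assuming zero-based indices and a hypothetical `"unlabeled"` default for unmapped lines. The record keys `text` and `label` are also illustrative.

```python
# Placeholder lines; in practice these would be read from the text file
lines = ["line zero", "line one", "line two", "line three"]

# Index 3 is deliberately left out of the mapping
label_mapping = {0: "basmala", 1: "praise", 2: "attribute"}

labeled = [
    # dict.get returns the fallback when the index has no label
    {"text": text, "label": label_mapping.get(i, "unlabeled")}
    for i, text in enumerate(lines)
]

for item in labeled:
    print(item)
```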

In [None]:
# Method 4: Tokenize with label mapping
tokenizer = TokenizationWithLabels()

# Define label mapping (line index to label)
label_mapping = {
    0: 'basmala',
    1: 'praise',
    2: 'attribute'
}

tokenized_data = tokenizer.tokenize_with_label_mapping(
    'examples/quran_data_example.txt',
    label_mapping
)

print(f"Tokenized {len(tokenized_data)} items")
print("\nFirst example:")
print(json.dumps(tokenized_data[0], ensure_ascii=False, indent=2))

## Method 5: Pattern-Based Labels

Use this when labels can be derived from the text itself using rules or patterns.
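
A labeling rule is ordinary Python, so it can be tried on a few strings before it is run over a whole file. The sketch below shows one possible way to organize rules: an ordered table of regular expressions in which the first match wins. `RULES` and `label_from_rules` are hypothetical names, and the patterns are placeholders rather than rules from this project.

```python
import re

# Ordered (pattern, label) pairs; earlier rules take priority
RULES = [
    (re.compile(r"^intro\b"), "opening"),
    (re.compile(r"\bthanks\b"), "praise"),
]


def label_from_rules(text, default="other"):
    """Return the label of the first matching rule, or the default."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return default


samples = ["intro to the chapter", "many thanks to all", "something else"]
labels = [label_from_rules(sample) for sample in samples]
print(labels)
```

Keeping rules in a list makes their priority explicit, which matters when a text could match more than one pattern.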

In [None]:
# Method 5: Tokenize with pattern-based labels
tokenizer = TokenizationWithLabels()

# Define a function to derive labels from text
def get_label_from_text(text):
    if 'بِسْمِ اللَّهِ' in text:
        return 'basmala'
    elif 'الْحَمْدُ' in text:
        return 'praise'
    elif len(text) > 50:
        return 'long_verse'
    else:
        return 'short_verse'

tokenized_data = tokenizer.tokenize_with_pattern_based_labels(
    'examples/quran_data_example.txt',
    get_label_from_text
)

print(f"Tokenized {len(tokenized_data)} items")
print("\nFirst example:")
print(json.dumps(tokenized_data[0], ensure_ascii=False, indent=2))

## Using with Hugging Face Transformers

You can pass a pre-trained tokenizer from the `transformers` library to `TokenizationWithLabels` through its `tokenizer` argument.

In [None]:
# Tokenization with labels using a Hugging Face tokenizer

# Imports (json is repeated so this cell runs on its own)
import json
from transformers import AutoTokenizer
from tokenization_with_labels import TokenizationWithLabels

# Load a pre-trained Arabic tokenizer
# Options: 'aubmindlab/bert-base-arabertv2', 'asafaya/bert-base-arabic', etc.
hf_tokenizer = AutoTokenizer.from_pretrained('aubmindlab/bert-base-arabertv2')

# Create tokenizer with the Hugging Face tokenizer
tokenizer = TokenizationWithLabels(tokenizer=hf_tokenizer)

# Example: Use any method to add labels
# Replace with your actual data files
tokenized_data = tokenizer.tokenize_with_labels_from_separate_file(
    'examples/quran_data_example.txt',  # Your quran_data file
    'examples/labels_example.txt'       # Your labels file
)

# Display results
print(f"Tokenized {len(tokenized_data)} items with labels")
print("\nFirst example with transformer tokens:")
print(json.dumps(tokenized_data[0], ensure_ascii=False, indent=2))

# Save the tokenized data
tokenizer.save_tokenized_data(
    tokenized_data,
    'tokenized_quran_with_labels.json',
    format='json'
)

print("\nTokenized data saved to tokenized_quran_with_labels.json")

## Saving Tokenized Data

Save the labeled, tokenized data as JSON or CSV for later use.
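
When saving Arabic text as JSON, it is worth confirming that the data survives a round trip unchanged. The sketch below uses the standard `json` module directly instead of the utility's own save method. The record fields `text`, `tokens`, and `label` are assumptions made for illustration and may differ from what `TokenizationWithLabels` actually produces.

```python
import json
import os
import tempfile

# Hypothetical record shape, for illustration only
records = [
    {"text": "بِسْمِ اللَّهِ", "tokens": ["بِسْمِ", "اللَّهِ"], "label": "basmala"}
]

path = os.path.join(tempfile.mkdtemp(), "roundtrip.json")

# ensure_ascii=False keeps Arabic characters readable in the file
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print("Round trip OK" if loaded == records else "Round trip changed the data")
```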

In [None]:
# Save tokenized data with labels
tokenizer = TokenizationWithLabels()

# Assuming you have tokenized_data from one of the methods above
tokenizer.save_tokenized_data(
    tokenized_data,
    'output_tokenized_data.json',
    format='json'
)

# Or save as CSV
tokenizer.save_tokenized_data(
    tokenized_data,
    'output_tokenized_data.csv',
    format='csv'
)

print("Data saved successfully!")

## Next Steps

1. **Identify your label source**: Determine where your labels are or how they should be created
2. **Prepare your data**: Organize your quran_data and labels according to one of the methods above
3. **Choose the appropriate method**: Select the method that best fits your data organization
4. **Tokenize with labels**: Run the tokenization code with your actual data files
5. **Verify the results**: Check that labels are correctly associated with each text
6. **Train your model**: Use the tokenized data with labels for supervised learning
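
Steps 5 and 6 can be sketched as follows: count how often each label occurs to spot typos or heavy imbalance, then map the string labels to integer ids, which most training code expects. The items below are placeholders, and the `label` key is an assumption about the shape of the tokenized records.

```python
from collections import Counter

# Placeholder items; real items would come from one of the methods above
items = [
    {"label": "praise"},
    {"label": "basmala"},
    {"label": "praise"},
]

# Step 5: inspect the label distribution
counts = Counter(item["label"] for item in items)
print(counts)

# Step 6: assign a stable integer id to each label (sorted for reproducibility)
label2id = {label: i for i, label in enumerate(sorted(counts))}
id2label = {i: label for label, i in label2id.items()}
label_ids = [label2id[item["label"]] for item in items]
print(label2id, label_ids)
```

Sorting the label names before numbering them keeps the ids identical across runs, so a saved model and its label map stay in sync.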

For more details, refer to **TOKENIZATION_LABELS_GUIDE.md**.