
# SciTeX String Processing Utilities

This notebook demonstrates the SciTeX `str` module: utilities for string processing, formatting, and text manipulation.

## Features Covered

### Text Processing
* String cleaning and path sanitization
* Text search and replacement
* String parsing and extraction
* Space normalization

### Formatting and Display
* Colored text output
* Debug printing utilities
* Block text formatting
* Readable byte formatting

### Scientific Text
* LaTeX formatting and fallbacks
* Scientific notation
* Mathematical text formatting
* Plot text optimization

### Security and Privacy
* API key masking
* ANSI escape code removal
* Safe text handling

In [1]:
# Detect notebook name for output directory
import os
from pathlib import Path

# Get notebook name (for papermill compatibility)
notebook_name = "04_scitex_str"
if 'PAPERMILL_NOTEBOOK_NAME' in os.environ:
    notebook_name = Path(os.environ['PAPERMILL_NOTEBOOK_NAME']).stem


In [2]:
import sys
sys.path.insert(0, '../src')
import scitex
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import re
import os

# Set up example data directory
data_dir = Path("./str_examples")
data_dir.mkdir(exist_ok=True)


## Part 1: Basic String Processing

### 1.1 String Cleaning and Sanitization

In [3]:
# Path cleaning examples
problematic_paths = [
    "/home/user/data with spaces/file.txt",
    "C:\\Users\\Name\\Documents\\file.txt",
    "~/data/file with special chars!@#.txt",
    "./data//double//slashes///file.txt",
    "data/./current/./directory/file.txt"
]


for path in problematic_paths:
    try:
        cleaned = scitex.str.clean_path(path)
        print(f"{path!r} -> {cleaned!r}")
    except Exception as e:
        # Report the failure instead of silently ignoring it
        print(f"clean_path failed for {path!r}: {e}")

# String decapitalization
test_strings = [
    "Hello World",
    "MACHINE LEARNING",
    "DataScience",
    "python_programming",
    "AI-Research"
]


for text in test_strings:
    decapitalized = scitex.str.decapitalize(text)
    print(f"{text!r} -> {decapitalized!r}")

### 1.2 Space Normalization and Text Cleanup

In [4]:
# Space normalization
messy_texts = [
    "This    has     multiple   spaces",
    "\t\nTabs and newlines\t\n everywhere\t\n",
    "   Leading and trailing spaces   ",
    "Mixed\t\n\r\n   whitespace   characters",
    "Normal text with single spaces"
]


for text in messy_texts:
    normalized = scitex.str.squeeze_spaces(text)
    print(f"{text!r} -> {normalized!r}")

# ANSI escape code removal
colored_texts = [
    "\033[31mRed text\033[0m",
    "\033[1;32mBold green text\033[0m",
    "\033[4;34mUnderlined blue text\033[0m",
    "Normal text without ANSI codes",
    "\033[91mBright red\033[0m mixed with \033[92mgreen\033[0m"
]


for text in colored_texts:
    clean_text = scitex.str.remove_ansi(text)
    print(f"{text!r} -> {clean_text!r}")
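
For readers without SciTeX installed, the two cleanup steps above can be approximated with the standard `re` module. This is a minimal sketch under stated assumptions, not the SciTeX implementation: `collapse_whitespace` and `strip_ansi` are hypothetical names, and `squeeze_spaces()` may treat tabs and newlines differently.

```python
import re

# Hypothetical stand-ins for illustration only; not the SciTeX functions.
# Matches CSI sequences such as "\033[31m" or "\033[1;32m".
ANSI_PATTERN = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def strip_ansi(text: str) -> str:
    """Remove ANSI CSI escape sequences (for example color codes)."""
    return ANSI_PATTERN.sub("", text)


print(repr(collapse_whitespace("This    has     multiple   spaces")))
print(repr(strip_ansi("\033[1;32mBold green text\033[0m")))
```

Compiling the ANSI pattern once keeps the helper cheap when it is applied to many log lines.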

## Part 2: Text Search and Manipulation

### 2.1 Search and Grep Functionality


In [5]:
# Create sample text for searching
sample_text = """
This is a sample text for demonstrating search functionality.
The text contains multiple lines and various patterns.
We have numbers like 123, 456, and 789.
Email addresses: john@example.com, jane.doe@university.edu
Phone numbers: (555) 123-4567, 555-987-6543
URLs: https://www.example.com, http://test.org
Some special characters: !@#$%^&*()
And finally, this text ends here.
"""

# Write to file for grep demonstration
test_file = data_dir / "sample_text.txt"
with open(test_file, 'w') as f:
    f.write(sample_text)


# Search for patterns in text
search_patterns = [
    "sample",
    "[0-9]+",  # Numbers
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",  # Email (raw string avoids invalid escape warnings)
    r"https?://[^\s]+",  # URLs
    r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}",  # Phone numbers
]

for pattern in search_patterns:
    try:
        results = scitex.str.search(sample_text, pattern)
        if results:
            print(f"{pattern!r}: {results}")
        else:
            print(f"{pattern!r}: no matches")
    except Exception as e:
        print(f"search failed for {pattern!r}: {e}")

# Grep in file
try:
    grep_results = scitex.str.grep(str(test_file), "numbers")
    if grep_results:
        print(f"grep results: {grep_results}")
    else:
        print("grep: no matches")
except Exception as e:
    print(f"grep failed: {e}")
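
As a point of comparison, the same kinds of patterns can be checked with the standard library's `re.findall`. This is a sketch on a shortened, made-up input line, not a description of how `scitex.str.search` works. One assumption differs from the cell above: the URL pattern here also excludes commas, because `[^\s]+` would capture the trailing comma in text such as `https://www.example.com, http://test.org`.

```python
import re

# Shortened sample line, invented for this illustration.
text = "Email: john@example.com, phone: (555) 123-4567, site: https://www.example.com"

patterns = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "url": r"https?://[^\s,]+",  # excludes commas, unlike the notebook pattern
    "phone": r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}",
}

# Collect every non-overlapping match for each named pattern.
matches = {name: re.findall(regex, text) for name, regex in patterns.items()}

for name, found in matches.items():
    print(f"{name}: {found}")
```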

### 2.2 Text Replacement and Parsing

In [None]:
# Text replacement examples
replacement_examples = [
    ("Hello World", "World", "Python"),
    ("The quick brown fox", "brown", "red"),
    ("Machine Learning AI", "AI", "Artificial Intelligence"),
    ("Data Science 2024", "2024", "2025"),
    ("test@example.com", "@example.com", "@newdomain.org")
]


for original, old, new in replacement_examples:
    try:
        replaced = scitex.str.replace(original, old, new)
        print(f"{original!r} -> {replaced!r}")
    except Exception as e:
        print(f"replace failed for {original!r}: {e}")

# Text parsing
parse_examples = [
    "name=John age=30 city=NYC",
    "temperature=25.5 humidity=60% pressure=1013.25",
    "model=LinearRegression accuracy=0.95 loss=0.05",
    "date=2024-01-01 time=12:30:00 timezone=UTC"
]


for text in parse_examples:
    try:
        parsed = scitex.str.parse(text)
        print(f"{text!r} -> {parsed}")
    except Exception as e:
        print(f"parse failed for {text!r}: {e}")
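
The `key=value` strings above have a simple enough shape that a plain-Python parser illustrates the idea. This sketch assumes whitespace-separated tokens and keeps every value as a string; `parse_key_values` is a hypothetical helper, and `scitex.str.parse` may expect different arguments or return a different structure.

```python
def parse_key_values(text: str) -> dict[str, str]:
    """Split whitespace-separated 'key=value' tokens into a dict of strings."""
    # split("=", 1) keeps any further "=" characters inside the value.
    pairs = (token.split("=", 1) for token in text.split() if "=" in token)
    return {key: value for key, value in pairs}


print(parse_key_values("model=LinearRegression accuracy=0.95 loss=0.05"))
```

Values are left as strings on purpose: converting `0.95` to a float or `60%` to a number is a separate decision that depends on the data.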


## Part 3: Colored Text and Debug Output

### 3.1 Colored Text Output

In [None]:
# Colored text examples
colors = ['red', 'green', 'blue', 'yellow', 'magenta', 'cyan', 'white']
text_styles = ['normal', 'bold', 'underline']


# Basic colors
for color in colors:
    try:
        colored = scitex.str.color_text(f"This is {color} text", color)
        print(colored)
    except Exception as e:
        print(f"color_text failed for {color!r}: {e}")

# Using shorthand ct function
try:
    print(scitex.str.ct("Shorthand colored text", "green"))
except Exception as e:
    print(f"ct failed: {e}")

# Demonstration of different message types
messages = [
    ("Operation completed successfully", "green"),
    ("Warning: Low disk space", "yellow"),
    ("Error: File not found", "red"),
    ("Info: Processing data", "blue"),
    ("Debug: Variable x = 42", "magenta")
]

for message, color in messages:
    try:
        colored_msg = scitex.str.color_text(message, color)
        print(colored_msg)
    except Exception as e:
        print(f"color_text failed for {message!r}: {e}")
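
Terminal colors of this kind are usually produced by wrapping text in ANSI SGR escape codes. The sketch below shows that mechanism with a hypothetical `colorize` helper; it is an assumption about the general technique, not the body of `color_text()` or `ct()`.

```python
# Standard ANSI SGR foreground color codes.
ANSI_COLORS = {
    "red": 31,
    "green": 32,
    "yellow": 33,
    "blue": 34,
    "magenta": 35,
    "cyan": 36,
    "white": 37,
}


def colorize(text: str, color: str) -> str:
    """Wrap text in the escape code for `color`, then reset attributes."""
    code = ANSI_COLORS[color]  # raises KeyError for unknown color names
    return f"\033[{code}m{text}\033[0m"


print(colorize("Operation completed successfully", "green"))
print(colorize("Error: File not found", "red"))
```

The trailing `\033[0m` reset matters: without it, the color leaks into whatever is printed next.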


### 3.2 Debug Printing and Block Formatting

In [None]:
import pandas as pd
import numpy as np
# Debug printing examples
debug_data = {
    'variables': {'x': 42, 'y': 3.14, 'name': 'test'},
    'array': np.array([1, 2, 3, 4, 5]),
    'dataframe': pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}),
    'list': [1, 2, 3, 4, 5],
    'nested': {'level1': {'level2': {'value': 100}}}
}


for name, data in debug_data.items():
    try:
        scitex.str.print_debug(data, name)
    except Exception as e:
        print(f"print_debug failed for {name!r}: {e}")

# Block text formatting
block_texts = [
    "This is a simple message",
    "This is a longer message that demonstrates block formatting capabilities",
    "Multi-line\nblock text\nexample",
    "Important: This is a critical message that needs attention!"
]


for text in block_texts:
    try:
        scitex.str.printc(text)  # Print colored/formatted block
    except Exception as e:
        print(f"printc failed for {text!r}: {e}")


## Part 4: Scientific Text and LaTeX Formatting

### 4.1 LaTeX Style Formatting

In [None]:
# LaTeX style formatting examples
scientific_texts = [
    "alpha",
    "beta",
    "gamma",
    "theta",
    "lambda",
    "mu",
    "sigma",
    "phi",
    "x_hat",
    "y_bar",
    "z_prime"
]


for text in scientific_texts:
    try:
        latex_formatted = scitex.str.to_latex_style(text)
        print(f"{text} -> {latex_formatted}")
    except Exception as e:
        print(f"to_latex_style failed for {text!r}: {e}")

# Safe LaTeX formatting with fallback
for text in scientific_texts:
    try:
        safe_latex = scitex.str.safe_to_latex_style(text)
        print(f"{text} -> {safe_latex}")
    except Exception as e:
        print(f"safe_to_latex_style failed for {text!r}: {e}")

# Hat notation in LaTeX
hat_examples = ['x', 'y', 'z', 'theta', 'phi', 'mu']

for var in hat_examples:
    try:
        hat_formatted = scitex.str.add_hat_in_latex_style(var)
        print(f"{var} -> {hat_formatted}")
    except Exception as e:
        print(f"add_hat_in_latex_style failed for {var!r}: {e}")


### 4.2 Scientific Text and Plot Formatting

In [None]:
# Scientific text formatting for plots
plot_labels = [
    ("Temperature (C)", "°C"),
    ("Pressure (Pa)", "Pa"),
    ("Voltage (V)", "V"),
    ("Current (A)", "A"),
    ("Frequency (Hz)", "Hz"),
    ("Energy (J)", "J"),
    ("Power (W)", "W")
]


for label, unit in plot_labels:
    try:
        formatted = scitex.str.scientific_text(label)
        print(f"{label!r} -> {formatted!r}")
    except Exception as e:
        print(f"scientific_text failed for {label!r}: {e}")

# Plot text formatting

plot_texts = [
    "x-axis label",
    "y-axis label",
    "Main Title",
    "Subplot Title",
    "Legend Entry"
]

for text in plot_texts:
    try:
        formatted = scitex.str.format_plot_text(text)
        print(f"{text!r} -> {formatted!r}")
    except Exception as e:
        print(f"format_plot_text failed for {text!r}: {e}")

# Axis labels and titles

axis_examples = [
    ("time", "seconds"),
    ("amplitude", "volts"),
    ("frequency", "hertz"),
    ("temperature", "celsius"),
    ("pressure", "pascals")
]

for var, unit in axis_examples:
    try:
        axis_label = scitex.str.axis_label(var, unit)
        title_formatted = scitex.str.title(f"{var} vs time")
        print(f"{axis_label} | {title_formatted}")
    except Exception as e:
        print(f"axis formatting failed for {var!r}: {e}")


### 4.3 Digit Factoring and Smart Formatting

In [None]:
import numpy as np
# Digit factoring for better readability
large_numbers = [
    [1000, 2000, 3000, 4000, 5000],
    [1500000, 2500000, 3500000, 4500000],
    [0.001, 0.002, 0.003, 0.004, 0.005],
    [12345, 23456, 34567, 45678, 56789],
    [1.2e6, 2.3e6, 3.4e6, 4.5e6, 5.6e6]
]


for numbers in large_numbers:
    try:
        factored = scitex.str.factor_out_digits(numbers)
        print(f"{numbers} -> {factored}")
    except Exception as e:
        print(f"factor_out_digits failed for {numbers}: {e}")

# Smart tick formatting
tick_examples = [
    np.array([0, 1000, 2000, 3000, 4000, 5000]),
    np.array([0.001, 0.002, 0.003, 0.004, 0.005]),
    np.array([1e6, 2e6, 3e6, 4e6, 5e6]),
    np.array([0.0001, 0.0002, 0.0003, 0.0004, 0.0005])
]


for ticks in tick_examples:
    try:
        formatted = scitex.str.smart_tick_formatter(ticks)
        print(f"{ticks} -> {formatted}")
    except Exception as e:
        print(f"smart_tick_formatter failed for {ticks}: {e}")
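
The idea behind factoring out digits is to pull a common power of ten out of a list so the remaining numbers are short enough for tick labels. The sketch below is one way to do that, assuming the exponent is rounded down to a multiple of three (engineering notation); `factor_out_power_of_ten` is a hypothetical helper, and the rule used by `factor_out_digits()` may differ.

```python
import math


def factor_out_power_of_ten(values):
    """Return (scaled_values, exponent), with exponent a multiple of 3."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        return list(values), 0
    exponent = math.floor(math.log10(largest))
    exponent -= exponent % 3  # round down to a multiple of 3
    scale = 10.0 ** exponent
    return [v / scale for v in values], exponent


scaled, exponent = factor_out_power_of_ten([1500000, 2500000, 3500000, 4500000])
print(f"{scaled} x 10^{exponent}")
```

An axis could then show the scaled values as ticks and carry the factor once in the label, for example `Count (x 10^6)`.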


## Part 5: Security and Privacy Features

### 5.1 API Key Masking

In [None]:
# API key masking examples
sensitive_data = [
    "API_KEY=sk-1234567890abcdef1234567890abcdef",
    "SECRET_TOKEN=ghp_1234567890abcdef1234567890abcdef123456",
    "DATABASE_URL=postgresql://user:password@localhost:5432/db",
    "OPENAI_API_KEY=sk-proj-abcdef1234567890abcdef1234567890abcdef",
    "AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE",
    "This is normal text without sensitive data"
]


for data in sensitive_data:
    try:
        masked = scitex.str.mask_api(data)
        print(masked)
    except Exception as e:
        # Do not echo the input here: it may contain the secret
        print(f"mask_api failed: {e}")

# Demonstration of different API key formats
api_formats = [
    "sk-1234567890abcdef",  # OpenAI style
    "ghp_1234567890abcdef",  # GitHub style
    "xoxb-1234567890",  # Slack style
    "ya29.1234567890",  # Google style
    "EAACEdEose0cBA1234567890"  # Facebook style
]


for api_key in api_formats:
    try:
        masked = scitex.str.mask_api(api_key)
        print(masked)
    except Exception as e:
        # Do not echo the input here: it may contain the secret
        print(f"mask_api failed: {e}")
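
Masking usually keeps a few leading and trailing characters so a key stays recognizable while the middle is hidden. The sketch below illustrates that with a regex over three token prefixes taken from the examples above. `TOKEN_PATTERN`, `mask_token`, and `mask_secrets` are hypothetical, and the prefixes and output format recognized by `mask_api()` may be different.

```python
import re

# Hypothetical pattern: a known prefix followed by at least 8 token characters.
TOKEN_PATTERN = re.compile(r"\b(?:sk-|ghp_|xoxb-)[A-Za-z0-9_-]{8,}")


def mask_token(match: re.Match) -> str:
    """Keep the first and last four characters and star out the rest."""
    token = match.group(0)
    return token[:4] + "*" * (len(token) - 8) + token[-4:]


def mask_secrets(text: str) -> str:
    """Mask every token in `text` that matches TOKEN_PATTERN."""
    return TOKEN_PATTERN.sub(mask_token, text)


print(mask_secrets("API_KEY=sk-1234567890abcdef"))
print(mask_secrets("This is normal text without sensitive data"))
```

A prefix list like this only catches formats it knows about, so it complements rather than replaces keeping secrets out of logs in the first place.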


## Part 6: Utility Functions

### 6.1 Readable Bytes and File Sizes

In [None]:
# Readable byte formatting
byte_sizes = [
    1024,  # 1 KiB
    1024**2,  # 1 MiB
    1024**3,  # 1 GiB
    1024**4,  # 1 TiB
    1500000,  # 1.5 MB decimal (about 1.43 MiB)
    2500000000,  # 2.5 GB decimal (about 2.33 GiB)
    512,  # 512 bytes
    1023,  # Just under 1 KiB
    1048576 + 524288  # 1.5 MiB
]


for size in byte_sizes:
    try:
        readable = scitex.str.readable_bytes(size)
        print(f"{size:>15,d} bytes -> {readable}")
    except Exception as e:
        print(f"readable_bytes failed for {size}: {e}")

# File size examples with actual files

# Create test files of different sizes
test_files = [
    ("small.txt", "Small file content"),
    ("medium.txt", "Medium file content\n" * 1000),
    ("large.txt", "Large file content with lots of text\n" * 10000)
]

for filename, content in test_files:
    filepath = data_dir / filename
    with open(filepath, 'w') as f:
        f.write(content)

    # Report the on-disk size in human-readable form
    file_size = filepath.stat().st_size
    readable_size = scitex.str.readable_bytes(file_size)
    print(f"{filename}: {readable_size}")
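
The conversion itself is a loop that divides by 1024 until the value fits the current unit. The sketch below assumes binary units (KiB, MiB, and so on) and one decimal place; `human_bytes` is a hypothetical helper, and the units and rounding of `readable_bytes()` may differ.

```python
def human_bytes(size: float) -> str:
    """Format a byte count using binary units (1 KiB = 1024 B)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit that fits, or at the largest unit available.
        if size < 1024 or unit == "TiB":
            if unit == "B":
                return f"{size:.0f} {unit}"
            return f"{size:.1f} {unit}"
        size /= 1024
    raise AssertionError("unreachable: the TiB branch always returns")


for n in (512, 1024, 1_500_000, 1024**3):
    print(n, "->", human_bytes(n))
```

With binary units, 1,500,000 bytes is about 1.4 MiB rather than 1.5, which is why the unit convention should be stated wherever sizes are reported.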

## Part 7: LaTeX Fallback System

### 7.1 LaTeX Capability Detection and Fallbacks

In [None]:
# LaTeX capability detection

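try:
    latex_available = scitex.str.check_latex_capability()
    latex_status = scitex.str.get_latex_status()
    fallback_mode = scitex.str.get_fallback_mode()
    print(f"LaTeX available: {latex_available}")
    print(f"LaTeX status: {latex_status}")
    print(f"Fallback mode: {fallback_mode}")
except Exception as e:
    print(f"LaTeX capability check failed: {e}")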

# LaTeX fallback examples
latex_expressions = [
    r"$\alpha + \beta = \gamma$",
    r"$\frac{x^2}{y^2} = z$",
    r"$\sum_{i=1}^{n} x_i$",
    r"$\int_{0}^{\infty} e^{-x} dx$",
    r"$\sqrt{\frac{a}{b}}$"
]


for expr in latex_expressions:
    try:
        # Try safe rendering
        safe_rendered = scitex.str.safe_latex_render(expr)

        # Try conversion to unicode
        unicode_version = scitex.str.latex_to_unicode(expr)

        # Try conversion to mathtext
        mathtext_version = scitex.str.latex_to_mathtext(expr)

        print(f"{expr} -> safe: {safe_rendered} | unicode: {unicode_version} | mathtext: {mathtext_version}")
    except Exception as e:
        print(f"LaTeX fallback failed for {expr!r}: {e}")

# Fallback mode management

try:
    # Enable fallback
    scitex.str.enable_latex_fallback()

    # Set fallback mode
    scitex.str.set_fallback_mode('unicode')

    # Test with fallback
    test_expr = r"$\alpha + \beta$"
    result = scitex.str.safe_latex_render(test_expr)
    print(f"{test_expr} -> {result}")
except Exception as e:
    print(f"Fallback mode management failed: {e}")
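
A Unicode fallback can be as simple as a lookup table that replaces known LaTeX commands and leaves everything else untouched. The sketch below covers only a handful of Greek letters; `GREEK` and `greek_to_unicode` are hypothetical, and `latex_to_unicode()` presumably handles far more than this.

```python
import re

# Small illustrative table; a real fallback would need many more entries.
GREEK = {
    "alpha": "α",
    "beta": "β",
    "gamma": "γ",
    "theta": "θ",
    "lambda": "λ",
    "mu": "μ",
    "sigma": "σ",
    "phi": "φ",
}


def greek_to_unicode(expr: str) -> str:
    """Strip $...$ delimiters and replace known Greek commands with Unicode."""
    body = expr.strip().strip("$")
    # Unknown commands (for example \sqrt) are returned unchanged.
    return re.sub(r"\\([A-Za-z]+)", lambda m: GREEK.get(m.group(1), m.group(0)), body)


print(greek_to_unicode(r"$\alpha + \beta = \gamma$"))
```

Leaving unknown commands in place keeps the output readable and makes gaps in the table easy to spot.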


## Part 8: Practical Applications

### 8.1 Scientific Data Processing Pipeline

In [None]:
# Create a comprehensive text processing pipeline
class TextProcessor:
    def __init__(self):
        self.processing_log = []
    
    def log_step(self, step, input_text, output_text):
        self.processing_log.append({
            'step': step,
            'input': input_text,
            'output': output_text,
            'input_length': len(input_text),
            'output_length': len(output_text)
        })
    
    def process_scientific_text(self, text):
        """Process scientific text through multiple cleaning steps."""
        original_text = text
        
        # Step 1: Normalize spaces
        text = scitex.str.squeeze_spaces(text)
        self.log_step("Space normalization", original_text, text)
        
        # Step 2: Remove ANSI codes
        text = scitex.str.remove_ansi(text)
        self.log_step("ANSI removal", self.processing_log[-1]['output'], text)
        
        # Step 3: Format for LaTeX
        text = scitex.str.safe_to_latex_style(text)
        self.log_step("LaTeX formatting", self.processing_log[-1]['output'], text)
        
        # Step 4: Mask sensitive data
        text = scitex.str.mask_api(text)
        self.log_step("API masking", self.processing_log[-1]['output'], text)
        
        return text
    
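    def print_processing_log(self):
        """Print the processing log with colored output."""
        for i, entry in enumerate(self.processing_log, 1):
            header = scitex.str.ct(f"Step {i}: {entry['step']}", "blue")
            print(header)
            print(f"  input  ({entry['input_length']} chars): {entry['input']!r}")
            print(f"  output ({entry['output_length']} chars): {entry['output']!r}")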

# Test the pipeline
processor = TextProcessor()

test_scientific_texts = [
    "\033[31mTemperature    measurements\033[0m   showed   alpha = 0.05   significance with API_KEY=sk-1234567890abcdef",
    "\t\nPressure   data\t\ncontains    beta   coefficients   SECRET_TOKEN=ghp_abcdef1234567890\n",
    "The   gamma   distribution   parameters   were   DATABASE_URL=postgresql://user:pass@host:5432/db"
]


for i, text in enumerate(test_scientific_texts, 1):
    processed = processor.process_scientific_text(text)
    print(f"Text {i}: {processed!r}")

# Show processing log
processor.print_processing_log()

### 8.2 Report Generation with Formatted Text

In [None]:
# Generate a formatted scientific report
def generate_scientific_report(experiment_data):
    """Generate a formatted scientific report."""
    
    # Report header
    report = []
    report.append(scitex.str.ct("=" * 60, "blue"))
    report.append(scitex.str.ct("SCIENTIFIC EXPERIMENT REPORT", "blue"))
    report.append(scitex.str.ct("=" * 60, "blue"))
    report.append("")
    
    # Experiment info
    report.append(scitex.str.ct("EXPERIMENT INFORMATION", "green"))
    report.append("-" * 30)
    report.append(f"Name: {experiment_data['name']}")
    report.append(f"Date: {experiment_data['date']}")
    report.append(f"Researcher: {experiment_data['researcher']}")
    report.append("")
    
    # Parameters
    report.append(scitex.str.ct("PARAMETERS", "green"))
    report.append("-" * 15)
    for param, value in experiment_data['parameters'].items():
        formatted_param = scitex.str.to_latex_style(param)
        report.append(f"{formatted_param}: {value}")
    report.append("")
    
    # Results
    report.append(scitex.str.ct("RESULTS", "green"))
    report.append("-" * 10)
    for metric, value in experiment_data['results'].items():
        if isinstance(value, float):
            formatted_value = f"{value:.4f}"
        else:
            formatted_value = str(value)
        report.append(f"{metric}: {formatted_value}")
    report.append("")
    
    # File sizes
    if 'file_sizes' in experiment_data:
        report.append(scitex.str.ct("FILE SIZES", "green"))
        report.append("-" * 12)
        for filename, size in experiment_data['file_sizes'].items():
            readable_size = scitex.str.readable_bytes(size)
            report.append(f"{filename}: {readable_size}")
        report.append("")
    
    # Status
    status = experiment_data.get('status', 'unknown')
    if status == 'success':
        status_line = scitex.str.ct(f"Status: {status.upper()}", "green")
    elif status == 'warning':
        status_line = scitex.str.ct(f"Status: {status.upper()}", "yellow")
    else:
        status_line = scitex.str.ct(f"Status: {status.upper()}", "red")
    
    report.append(status_line)
    report.append("")
    report.append(scitex.str.ct("=" * 60, "blue"))
    
    return "\n".join(report)

# Sample experiment data
experiment_data = {
    'name': 'Neural Network Performance Analysis',
    'date': '2024-01-15',
    'researcher': 'Dr. Jane Smith',
    'parameters': {
        'alpha': 0.001,
        'beta': 0.9,
        'gamma': 0.999,
        'lambda': 0.01,
        'epochs': 100,
        'batch_size': 32
    },
    'results': {
        'accuracy': 0.9542,
        'precision': 0.9123,
        'recall': 0.8876,
        'f1_score': 0.8998,
        'training_time': '2h 45m'
    },
    'file_sizes': {
        'model.pkl': 15728640,  # 15 MB
        'training_data.csv': 104857600,  # 100 MB
        'results.json': 2048,  # 2 KB
        'logs.txt': 524288  # 512 KB
    },
    'status': 'success'
}

# Generate and print the report
report = generate_scientific_report(experiment_data)
print(report)

# Save report to file
report_file = data_dir / "experiment_report.txt"
with open(report_file, 'w') as f:
    # Remove color codes for file output
    clean_report = scitex.str.remove_ansi(report)
    f.write(clean_report)


## Summary and Best Practices

This tutorial demonstrated the string processing capabilities of the SciTeX `str` module:

### Key Features Covered:
1. **Text Cleaning**: `clean_path()`, `squeeze_spaces()`, `remove_ansi()`
2. **Search and Replace**: `search()`, `grep()`, `replace()`
3. **Colored Output**: `color_text()`, `ct()` for enhanced readability
4. **Debug Tools**: `print_debug()`, `printc()` for development
5. **LaTeX Support**: `to_latex_style()`, `safe_latex_render()` with fallbacks
6. **Scientific Formatting**: `scientific_text()`, `format_plot_text()`
7. **Security**: `mask_api()` for sensitive data protection
8. **Utility Functions**: `readable_bytes()`, `factor_out_digits()`
9. **Smart Formatting**: `smart_tick_formatter()`, `axis_label()`
10. **LaTeX Fallback System**: Robust handling of LaTeX unavailability

### Best Practices:
- Use **text cleaning** functions before processing scientific data
- Apply **API masking** to protect sensitive information
- Use **colored output** to make success, warning, and error messages easy to tell apart
- Implement **LaTeX fallbacks** for robust scientific text rendering
- Use **smart formatting** for better plot readability
- Apply **debug tools** during development
- Use **readable byte formatting** for file size reporting
- Implement **comprehensive text processing pipelines** for consistent results

In [None]:
import shutil

cleanup = False  # Set to True to remove example files

if cleanup:
    shutil.rmtree(data_dir)
else:
    # Report how much disk space the example files occupy
    total_size = sum(f.stat().st_size for f in data_dir.rglob('*') if f.is_file())
    print(f"Example files kept in {data_dir} ({scitex.str.readable_bytes(total_size)})")