# Turkish Text Preprocessing and Cleaning for Setur Complaints Analysis

This notebook walks through preprocessing and cleaning steps tailored to Turkish text. We'll process Setur customer complaints to prepare them for sentiment analysis and topic modeling.

## Key Preprocessing Steps:
1. **Data Loading and Exploration**
2. **Text Normalization** - Lowercasing and Unicode normalization
3. **Cleaning** - Remove punctuation, URLs, HTML tags, emojis
4. **Turkish Character Preservation** - Handle Turkish-specific characters (ç, ğ, ı, ö, ş, ü)
5. **Stopword Removal** - Remove Turkish function words
6. **Optional Lemmatization** - Reduce words to root forms (important for agglutinative Turkish)
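
Step 2 hides a Turkish-specific pitfall: Python's built-in `str.lower()` uses the default Unicode case mapping, which pairs `I` with `i`, whereas Turkish pairs `I` with `ı` and `İ` with `i`. The sketch below illustrates the difference. `turkish_lower` is a hypothetical helper name used only for this illustration.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish dotted/dotless i pairing."""
    # Map the two problematic capitals first, then let str.lower() handle the rest.
    return text.replace("İ", "i").replace("I", "ı").lower()

# Default lowercasing turns dotless capital I into dotted i, and turns İ
# into 'i' followed by a combining dot (two code points instead of one).
print("IŞIK".lower())         # default Unicode mapping
print(turkish_lower("IŞIK"))  # Turkish mapping
print(len("İ".lower()))       # code points produced by default lowercasing of İ
```

Mapping the capitals before calling `lower()` keeps the helper dependency-free; it assumes the input really is Turkish, since English words containing `I` would also receive a dotless `ı`.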

## Dataset Overview:
- **Source**: Setur tourism company complaints from sikayetvar.com
- **Text Fields**: complaint titles, full complaint text, company responses
- **Language**: Turkish with potential slang and informal writing

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import json
import re
import warnings
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Set

# Advanced Turkish NLP libraries
try:
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize, sent_tokenize
    print("NLTK imported successfully")
except ImportError:
    print("NLTK not available - install with: pip install nltk")
    nltk = None

try:
    import zeyrek
    print("Zeyrek (Turkish morphological analyzer) imported successfully")
except ImportError:
    print("Zeyrek not available - install with: pip install zeyrek")
    zeyrek = None

try:
    import spacy
    print("spaCy imported successfully")
except ImportError:
    print("spaCy not available - install with: pip install spacy")
    spacy = None

try:
    from TurkishStemmer import TurkishStemmer
    print("TurkishStemmer imported successfully")
except ImportError:
    print("TurkishStemmer not available - install with: pip install TurkishStemmer")
    TurkishStemmer = None

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)
warnings.filterwarnings('ignore')

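# Try to set a Turkish locale. This only affects locale-aware functions
# (e.g. locale.strxfrm for sorting); str.lower() and the re module ignore it.
import locale
try:
    locale.setlocale(locale.LC_ALL, 'tr_TR.UTF-8')
except locale.Error:
    print("Turkish locale not available, using default")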

print("Libraries imported successfully!")

## 1. Data Loading and Initial Exploration

In [None]:
# Load the data
with open('setur_complaints.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
    
# Convert to DataFrame
df = pd.DataFrame(data)

# Display basic information
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nData types:")
print(df.dtypes)

# Check for text fields that contain actual content
text_fields = ['title', 'full_complaint', 'company_response']
for field in text_fields:
    if field in df.columns:
        non_empty = df[field].notna().sum()
        print(f"\n{field}: {non_empty} non-empty entries")
        if non_empty > 0:
            avg_length = df[field].dropna().str.len().mean()
            print(f"  Average length: {avg_length:.1f} characters")

In [None]:
# Display sample complaints to understand the text content
print("=== SAMPLE COMPLAINT TITLES ===")
for i, title in enumerate(df['title'].dropna().head(5)):
    print(f"{i+1}. {title}")

print("\n=== SAMPLE FULL COMPLAINTS ===")
for i, complaint in enumerate(df['full_complaint'].dropna().head(3)):
    print(f"{i+1}. {complaint[:200]}...")
    print("-" * 80)

print("\n=== SAMPLE COMPANY RESPONSES ===")
for i, response in enumerate(df['company_response'].dropna().head(3)):
    print(f"{i+1}. {response[:200]}...")
    print("-" * 80)

## 2. Turkish Language Characteristics Analysis

Before preprocessing, let's analyze the Turkish language characteristics in our dataset.
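
As a small illustration of the suffix analysis, the sketch below counts plural-looking words with a regular expression. The sample sentence is hypothetical and not taken from the dataset.

```python
import re

# Hypothetical sample sentence, not taken from the dataset.
sample = "Oteller ve odalar temizdi, müşteriler memnundu ama 100 dolar fazla ödedik."

# \w is Unicode-aware for str patterns in Python 3, so ü and ş count as letters.
plural_like = re.findall(r"\b\w+(?:ler|lar)\b", sample, flags=re.IGNORECASE)
print(plural_like)
```

Note that `dolar` is matched even though it is not a plural form. Suffix counts obtained this way are a rough signal of agglutination, not a morphological analysis.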

In [None]:
# Analyze Turkish character usage
def analyze_turkish_chars(text_series):
    """Analyze Turkish-specific character usage in text series"""
    turkish_chars = {'ç', 'ğ', 'ı', 'ö', 'ş', 'ü', 'Ç', 'Ğ', 'İ', 'Ö', 'Ş', 'Ü'}
    
    all_text = ' '.join(text_series.dropna().astype(str))
    char_counts = Counter(all_text)
    
    turkish_char_counts = {char: char_counts.get(char, 0) for char in turkish_chars}
    
    return turkish_char_counts, len(all_text)

# Analyze character distribution in complaints
if 'full_complaint' in df.columns:
    turkish_chars, total_chars = analyze_turkish_chars(df['full_complaint'])
    
    print("Turkish Character Distribution in Complaints:")
    for char, count in sorted(turkish_chars.items(), key=lambda x: x[1], reverse=True):
        percentage = (count / total_chars) * 100
        print(f"  {char}: {count:,} ({percentage:.3f}%)")

# Check for common Turkish patterns
print("\n=== Common Turkish Patterns Analysis ===")
sample_text = ' '.join(df['full_complaint'].dropna().head(100).astype(str))

# Check for common Turkish suffixes
turkish_suffixes = ['ler', 'lar', 'den', 'dan', 'nin', 'nın', 'nun', 'nün', 
                   'ken', 'iken', 'mış', 'miş', 'muş', 'müş']

print("Common Turkish suffix patterns found:")
for suffix in turkish_suffixes:
    pattern = r'\w+' + suffix + r'\b'
    matches = len(re.findall(pattern, sample_text, re.IGNORECASE))
    if matches > 0:
        print(f"  Words ending with '{suffix}': {matches}")

## 3. Text Preprocessing Functions

Now let's create comprehensive preprocessing functions specifically designed for Turkish text.
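
Unicode normalization matters because the same Turkish letter can be stored either as one precomposed code point or as a base letter plus a combining mark. A minimal sketch, using escape sequences so the two encodings are unambiguous:

```python
import unicodedata

# The same visible word stored two ways: precomposed letters versus
# base letters followed by combining marks.
composed = "m\u00fc\u015fteri"          # ü and ş as single code points
decomposed = "mu\u0308s\u0327teri"      # u + diaeresis, s + cedilla

nfc = unicodedata.normalize("NFC", decomposed)

print(composed == decomposed)   # string comparison works on code points, not glyphs
print(nfc == composed)          # NFC recombines the marks
print(len(decomposed), len(nfc))
```

Without this step, stopword lookups and frequency counts can treat visually identical words as different tokens.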

In [None]:
# Define Turkish stopwords
TURKISH_STOPWORDS = {
    # Articles and determiners
    'bir', 'bu', 'şu', 'o', 'her', 'hiç', 'bazı', 'bütün', 'tüm',
    
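    # Conjunctions ('ya da' is covered by the single tokens 'ya' and 'da' below,
    # because stopwords are matched against individual tokens)
    've', 'veya', 'ama', 'fakat', 'ancak', 'lakin', 'çünkü', 'eğer',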
    
    # Postpositions, clitics, and detached case suffixes
    'ile', 'için', 'den', 'dan', 'de', 'da', 'te', 'ta', 'ye', 'ya',
    'nin', 'nın', 'nun', 'nün', 'in', 'ın', 'un', 'ün',
    
    # Pronouns
    'ben', 'sen', 'o', 'biz', 'siz', 'onlar', 'beni', 'seni', 'onu',
    'bizi', 'sizi', 'onları', 'benim', 'senin', 'onun', 'bizim', 'sizin',
    
    # Question particles
    'mi', 'mı', 'mu', 'mü',
    
    # Common adverbs
    'çok', 'az', 'daha', 'en', 'pek', 'oldukça', 'gayet',
    'hep', 'zaman', 'ara', 'sıra',
    
    # Common verbs (stems)
    'ol', 'olmak', 'et', 'etmek', 'yap', 'yapmak', 'ver', 'vermek',
    'al', 'almak', 'gel', 'gelmek', 'git', 'gitmek', 'var', 'yok',
    
    # Others
    'ki', 'gibi', 'kadar', 'diye', 'bile', 'sadece', 'yalnız',
    'hem', 'de', 'da', 'ta', 'te'
}

print(f"Turkish stopwords loaded: {len(TURKISH_STOPWORDS)} words")
print(f"Sample stopwords: {list(TURKISH_STOPWORDS)[:10]}")

In [None]:
# Initialize advanced Turkish NLP libraries
def initialize_nlp_libraries():
    """Initialize and download required NLP resources"""
    
    # Initialize NLTK
    if nltk:
        try:
            # Download required NLTK data
            nltk.download('punkt', quiet=True)
            nltk.download('stopwords', quiet=True)
            print("✓ NLTK resources downloaded")
        except Exception as e:
            print(f"Warning: Could not download NLTK data - {e}")
    
    # Initialize Zeyrek (Turkish morphological analyzer)
    if zeyrek:
        try:
            global turkish_analyzer
            turkish_analyzer = zeyrek.MorphAnalyzer()
            print("✓ Zeyrek Turkish analyzer initialized")
        except Exception as e:
            print(f"Warning: Could not initialize Zeyrek - {e}")
            turkish_analyzer = None
    else:
        turkish_analyzer = None
    
    # Initialize Turkish Stemmer
    if TurkishStemmer:
        try:
            global turkish_stemmer
            turkish_stemmer = TurkishStemmer()
            print("✓ Turkish Stemmer initialized")
        except Exception as e:
            print(f"Warning: Could not initialize Turkish Stemmer - {e}")
            turkish_stemmer = None
    else:
        turkish_stemmer = None
    
    # Initialize spaCy Turkish model (if available)
    if spacy:
        try:
            # Try to load Turkish model (needs to be installed separately)
            global nlp_turkish
            nlp_turkish = spacy.load("tr_core_news_sm")
            print("✓ spaCy Turkish model loaded")
        except OSError:
            print("Warning: spaCy Turkish model not found")
            print("Install with: python -m spacy download tr_core_news_sm")
            nlp_turkish = None
        except Exception as e:
            print(f"Warning: Could not initialize spaCy - {e}")
            nlp_turkish = None
    else:
        nlp_turkish = None
    
    return {
        'nltk_available': nltk is not None,
        'zeyrek_available': turkish_analyzer is not None,
        'stemmer_available': turkish_stemmer is not None,
        'spacy_available': nlp_turkish is not None
    }

# Initialize libraries
print("Initializing advanced Turkish NLP libraries...")
library_status = initialize_nlp_libraries()
print(f"\nLibrary status: {library_status}")

In [None]:
def normalize_turkish_text(text: str) -> str:
    """
    Normalize Turkish text by handling Unicode variations and character issues.
    
    Args:
        text (str): Input Turkish text
        
    Returns:
        str: Normalized text
    """
    if pd.isna(text) or not isinstance(text, str):
        return ""
    
    # Normalize Unicode (important for Turkish characters)
    import unicodedata
    text = unicodedata.normalize('NFC', text)
    
    # Repair common mojibake (UTF-8 bytes mis-decoded as Windows-1252), then map
    # the two capitals whose Turkish lowercase differs from the default mapping
    replacements = {
        'Ã§': 'ç', 'Ã¶': 'ö', 'Ã¼': 'ü', 'Ä±': 'ı', 'Ä°': 'i',
        'ÅŸ': 'ş', 'Åž': 'ş', 'ÄŸ': 'ğ', 'Äž': 'ğ',
        'İ': 'i',  # Turkish capital İ to lowercase i
        'I': 'ı',  # Latin capital I to Turkish dotless ı
    }
    
    for old, new in replacements.items():
        text = text.replace(old, new)
    
    # Convert to lowercase. str.lower() is not Turkish-aware by itself;
    # it is safe here only because İ and I were already mapped above.
    text = text.lower()
    
    return text

def clean_turkish_text(text: str) -> str:
    """
    Clean Turkish text by removing unwanted characters while preserving Turkish letters.
    
    Args:
        text (str): Input text
        
    Returns:
        str: Cleaned text
    """
    if pd.isna(text) or not isinstance(text, str):
        return ""
    
    # Remove URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    text = re.sub(r'www\.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
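    # Remove phone numbers (Turkish format). The +90 form is handled first so the
    # prefix is not left behind; \b cannot be used before '+', which is not a word character.
    text = re.sub(r'(?<!\w)\+90[\s-]?[0-9]{3}[\s-]?[0-9]{3}[\s-]?[0-9]{2}[\s-]?[0-9]{2}\b', '', text)
    text = re.sub(r'\b0?[5-9][0-9]{2}[\s-]?[0-9]{3}[\s-]?[0-9]{2}[\s-]?[0-9]{2}\b', '', text)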
    
    # Remove standalone numbers; digits embedded in words are kept
    text = re.sub(r'\b\d+(?:\.\d+)+\b', '', text)  # Decimals and dot-grouped amounts (60.000, 1.250.000)
    text = re.sub(r'\b\d{4,}\b', '', text)     # Remove long numbers (IDs, prices, etc.)
    
    # Remove emojis and special Unicode characters (but keep Turkish chars)
    # Keep Turkish characters: a-zA-ZçğıöşüÇĞİÖŞÜ
    text = re.sub(r'[^\w\sçğıöşüÇĞİÖŞÜ.,!?;:()\[\]"\'-]', ' ', text)
    
    # Remove excessive punctuation
    text = re.sub(r'[.,!?;:]{2,}', '.', text)
    text = re.sub(r'["\'-]{2,}', '', text)
    
    # Remove bullet points and list markers
    text = re.sub(r'[•◦▪▫–—]', '', text)
    
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

# Test the functions
test_text = "Çok güzel bir ürün! Kesinlikle tavsiye ediyorum. 👍 www.example.com 0532-123-45-67"
print("Original:", test_text)
print("Normalized:", normalize_turkish_text(test_text))
print("Cleaned:", clean_turkish_text(normalize_turkish_text(test_text)))

In [None]:
def remove_punctuation_keep_meaning(text: str) -> str:
    """
    Remove punctuation while trying to preserve sentence boundaries and meaning.
    
    Args:
        text (str): Input text
        
    Returns:
        str: Text with punctuation removed
    """
    if pd.isna(text) or not isinstance(text, str):
        return ""
    
    # Replace sentence-ending punctuation with special markers first
    text = re.sub(r'[.!?]+\s+', ' SENTENCE_END ', text)
    text = re.sub(r'[.!?]+$', ' SENTENCE_END', text)
    
    # Remove remaining punctuation (but keep Turkish chars and whitespace)
    text = re.sub(r'[^\w\sçğıöşüÇĞİÖŞÜ]', ' ', text)
    
    # Clean up whitespace
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

def tokenize_turkish(text: str) -> List[str]:
    """
    Tokenize Turkish text into words.
    
    Args:
        text (str): Input text
        
    Returns:
        List[str]: List of tokens
    """
    if pd.isna(text) or not isinstance(text, str):
        return []
    
    # Split on whitespace and drop the SENTENCE_END markers inserted by
    # remove_punctuation_keep_meaning so they are not counted as words
    tokens = [token for token in text.split() if token != 'SENTENCE_END']
    
    # Filter out very short tokens (likely not meaningful)
    tokens = [token for token in tokens if len(token) > 1]
    
    return tokens

def remove_stopwords(tokens: List[str], custom_stopwords: Set[str] = None) -> List[str]:
    """
    Remove Turkish stopwords from token list.
    
    Args:
        tokens (List[str]): List of tokens
        custom_stopwords (Set[str]): Additional stopwords to remove
        
    Returns:
        List[str]: Filtered tokens
    """
    if not tokens:
        return []
    
    # Use a local name that does not shadow the `stopwords` corpus imported from NLTK
    stopword_set = TURKISH_STOPWORDS.copy()
    if custom_stopwords:
        stopword_set.update(custom_stopwords)
    
    # Remove stopwords (case-insensitive)
    filtered_tokens = [token for token in tokens 
                      if token.lower() not in stopword_set]
    
    return filtered_tokens

# Test tokenization and stopword removal
test_text = "çok güzel bir ürün kesinlikle tavsiye ediyorum"
print("Test text:", test_text)
tokens = tokenize_turkish(test_text)
print("Tokens:", tokens)
filtered = remove_stopwords(tokens)
print("After stopword removal:", filtered)

## 4. Complete Preprocessing Pipeline

Now let's create a complete preprocessing pipeline that combines all the steps.
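
The order of the steps matters: normalization and lowercasing have to come before stopword removal, otherwise capitalized stopwords slip through. The following is a condensed, self-contained sketch of that ordering. `mini_pipeline` and `DEMO_STOPWORDS` are hypothetical names, and the cleaning rules are deliberately simpler than the notebook's own functions.

```python
import re
import unicodedata
from typing import List

# Tiny illustrative stopword list; the notebook's TURKISH_STOPWORDS is much larger.
DEMO_STOPWORDS = {"bir", "çok", "ve", "bu"}

def mini_pipeline(text: str) -> List[str]:
    """Normalize, clean, tokenize, and filter a single text."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("İ", "i").replace("I", "ı").lower()  # Turkish-aware lowercasing
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)       # drop URLs
    text = re.sub(r"[^\w\s]", " ", text)                     # drop punctuation and emoji
    tokens = [t for t in text.split() if len(t) > 1]         # whitespace tokenization
    return [t for t in tokens if t not in DEMO_STOPWORDS]    # stopword removal

print(mini_pipeline("Çok güzel bir ürün! Kesinlikle tavsiye ediyorum. 👍 www.example.com"))
```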

In [None]:
def preprocess_turkish_text(text: str, 
                           remove_stopwords_flag: bool = True,
                           min_token_length: int = 2,
                           custom_stopwords: Set[str] = None) -> Dict[str, object]:
    """
    Complete preprocessing pipeline for Turkish text.
    
    Args:
        text (str): Input text
        remove_stopwords_flag (bool): Whether to remove stopwords
        min_token_length (int): Minimum token length to keep
        custom_stopwords (Set[str]): Additional stopwords
        
    Returns:
        Dict containing original, cleaned text, tokens, and statistics
    """
    
    result = {
        'original_text': text,
        'original_length': len(text) if isinstance(text, str) else 0,  # NaN is truthy, so test the type
        'cleaned_text': '',
        'tokens': [],
        'filtered_tokens': [],
        'token_count': 0,
        'filtered_token_count': 0,
        'avg_token_length': 0,
        'turkish_char_count': 0
    }
    
    if pd.isna(text) or not isinstance(text, str) or not text.strip():
        return result
    
    # Step 1: Normalize text
    normalized = normalize_turkish_text(text)
    
    # Step 2: Clean text
    cleaned = clean_turkish_text(normalized)
    
    # Step 3: Remove punctuation
    no_punct = remove_punctuation_keep_meaning(cleaned)
    
    result['cleaned_text'] = no_punct
    
    # Step 4: Tokenize
    tokens = tokenize_turkish(no_punct)
    
    # Filter by minimum length
    tokens = [token for token in tokens if len(token) >= min_token_length]
    
    result['tokens'] = tokens
    result['token_count'] = len(tokens)
    
    # Step 5: Remove stopwords if requested
    if remove_stopwords_flag and tokens:
        filtered_tokens = remove_stopwords(tokens, custom_stopwords)
        result['filtered_tokens'] = filtered_tokens
        result['filtered_token_count'] = len(filtered_tokens)
    else:
        result['filtered_tokens'] = tokens
        result['filtered_token_count'] = len(tokens)
    
    # Calculate statistics
    if result['filtered_tokens']:
        result['avg_token_length'] = sum(len(token) for token in result['filtered_tokens']) / len(result['filtered_tokens'])
    
    # Count Turkish characters
    turkish_chars = {'ç', 'ğ', 'ı', 'ö', 'ş', 'ü'}
    result['turkish_char_count'] = sum(1 for char in no_punct.lower() if char in turkish_chars)
    
    return result

# Test the complete pipeline
test_texts = [
    "Çok güzel bir ürün! Kesinlikle tavsiye ediyorum. 👍",
    "Setur'dan rezervasyon yapmayın, yaptırmayın. Şu an yoldayım, mağdur ettiler.",
    "60.000 TL ödeyip bu muameleyi görmek kabul edilemez! www.example.com"
]

print("=== PREPROCESSING PIPELINE TEST ===")
for i, text in enumerate(test_texts, 1):
    print(f"\nTest {i}:")
    print(f"Original: {text}")
    
    result = preprocess_turkish_text(text)
    print(f"Cleaned: {result['cleaned_text']}")
    print(f"Tokens: {result['tokens']}")
    print(f"Filtered: {result['filtered_tokens']}")
    print(f"Stats: {result['token_count']} → {result['filtered_token_count']} tokens")

## 5. Apply Preprocessing to Dataset

Now let's apply our preprocessing pipeline to the actual Setur complaints dataset.
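
The per-column summary reported during processing comes down to one ratio: the share of tokens removed by filtering. Below is a worked example on hypothetical, already tokenized records (not dataset rows):

```python
from typing import Dict, List

# Hypothetical records holding tokens before and after stopword removal.
records: List[Dict[str, List[str]]] = [
    {"tokens": ["çok", "kötü", "bir", "otel"], "filtered": ["kötü", "otel"]},
    {"tokens": ["iade", "için", "aradım"], "filtered": ["iade", "aradım"]},
]

total_before = sum(len(r["tokens"]) for r in records)
total_after = sum(len(r["filtered"]) for r in records)

# Guard against an empty dataset to avoid dividing by zero.
reduction_pct = (1 - total_after / total_before) * 100 if total_before else 0.0
print(f"{total_before} -> {total_after} tokens ({reduction_pct:.1f}% reduction)")
```

A very high reduction on short fields such as titles may indicate that the stopword list is too aggressive for that field.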

In [None]:
# Create a function to process the entire dataset
def process_complaint_dataset(df: pd.DataFrame, text_columns: List[str]) -> pd.DataFrame:
    """
    Process all text columns in the complaint dataset.
    
    Args:
        df (pd.DataFrame): Input dataframe
        text_columns (List[str]): List of text column names to process
        
    Returns:
        pd.DataFrame: Dataframe with processed text columns
    """
    
    processed_df = df.copy()
    
    for col in text_columns:
        if col not in df.columns:
            print(f"Warning: Column '{col}' not found in dataset")
            continue
            
        print(f"\nProcessing column: {col}")
        
        # Create new column names
        cleaned_col = f"{col}_cleaned"
        tokens_col = f"{col}_tokens"
        filtered_col = f"{col}_filtered"
        
        # Initialize new columns
        processed_df[cleaned_col] = ''
        processed_df[tokens_col] = None
        processed_df[filtered_col] = None
        
        # Process each text entry
        non_empty_count = 0
        total_original_tokens = 0
        total_filtered_tokens = 0
        
        for idx, text in df[col].items():
            if pd.notna(text) and isinstance(text, str) and text.strip():
                result = preprocess_turkish_text(text)
                
                processed_df.at[idx, cleaned_col] = result['cleaned_text']
                processed_df.at[idx, tokens_col] = result['tokens']
                processed_df.at[idx, filtered_col] = result['filtered_tokens']
                
                non_empty_count += 1
                total_original_tokens += result['token_count']
                total_filtered_tokens += result['filtered_token_count']
        
        # Print statistics
        print(f"  Processed {non_empty_count} non-empty texts")
        if non_empty_count > 0:
            print(f"  Average tokens per text: {total_original_tokens/non_empty_count:.1f} → {total_filtered_tokens/non_empty_count:.1f}")
            reduction_pct = (1 - total_filtered_tokens/total_original_tokens) * 100 if total_original_tokens > 0 else 0
            print(f"  Token reduction: {reduction_pct:.1f}%")
    
    return processed_df

# Apply preprocessing to our dataset
text_columns_to_process = ['title', 'full_complaint', 'company_response']

print("Starting dataset preprocessing...")
processed_df = process_complaint_dataset(df, text_columns_to_process)

print(f"\nProcessing complete! Dataset shape: {processed_df.shape}")
print(f"New columns added: {[col for col in processed_df.columns if col not in df.columns]}")

In [None]:
# Display sample processed results
print("=== SAMPLE PROCESSED RESULTS ===")

# Show examples for each text field
for col in ['title', 'full_complaint', 'company_response']:
    if col in df.columns:
        # Find a non-empty example
        sample_idx = df[col].dropna().index[0] if not df[col].dropna().empty else None
        
        if sample_idx is not None:
            print(f"\n--- {col.upper()} EXAMPLE ---")
            print(f"Original: {df.loc[sample_idx, col][:150]}...")
            print(f"Cleaned:  {processed_df.loc[sample_idx, f'{col}_cleaned'][:150]}...")
            print(f"Tokens:   {processed_df.loc[sample_idx, f'{col}_tokens'][:10]}...")
            print(f"Filtered: {processed_df.loc[sample_idx, f'{col}_filtered'][:10]}...")

# Create an enhanced function to process the entire dataset with advanced libraries
def process_complaint_dataset_advanced(df: pd.DataFrame, text_columns: List[str], 
                                     processing_method: str = 'auto') -> pd.DataFrame:
    """
    Process all text columns in the complaint dataset using advanced Turkish NLP libraries.
    
    Args:
        df (pd.DataFrame): Input dataframe
        text_columns (List[str]): List of text column names to process
        processing_method (str): 'auto', 'basic', 'zeyrek', 'stemmer', 'spacy', or 'all'
        
    Returns:
        pd.DataFrame: Dataframe with processed text columns
    """
    
    processed_df = df.copy()
    
    # Determine the best available method
    if processing_method == 'auto':
        if turkish_analyzer:
            processing_method = 'zeyrek'
            print("Using Zeyrek lemmatization (best available)")
        elif turkish_stemmer:
            processing_method = 'stemmer'
            print("Using Turkish Stemmer (Zeyrek not available)")
        elif nlp_turkish:
            processing_method = 'spacy'
            print("Using spaCy (Zeyrek and Stemmer not available)")
        else:
            processing_method = 'basic'
            print("Using basic preprocessing (advanced libraries not available)")
    
    for col in text_columns:
        if col not in df.columns:
            print(f"Warning: Column '{col}' not found in dataset")
            continue
            
        print(f"\nProcessing column: {col} with method: {processing_method}")
        
        # Create new column names
        cleaned_col = f"{col}_cleaned"
        tokens_col = f"{col}_tokens"
        filtered_col = f"{col}_filtered"
        
        # Advanced columns
        if processing_method in ['zeyrek', 'all']:
            lemmas_col = f"{col}_lemmas"
            processed_df[lemmas_col] = None
        
        if processing_method in ['stemmer', 'all']:
            stems_col = f"{col}_stems"
            processed_df[stems_col] = None
        
        if processing_method in ['spacy', 'all']:
            entities_col = f"{col}_entities"
            pos_col = f"{col}_pos_tags"
            processed_df[entities_col] = None
            processed_df[pos_col] = None
        
        # Initialize new columns
        processed_df[cleaned_col] = ''
        processed_df[tokens_col] = None
        processed_df[filtered_col] = None
        
        # Process each text entry
        non_empty_count = 0
        total_original_tokens = 0
        total_filtered_tokens = 0
        
        for idx, text in df[col].items():
            if pd.notna(text) and isinstance(text, str) and text.strip():
                
                if processing_method == 'basic':
                    result = preprocess_turkish_text(text)
                    processed_df.at[idx, cleaned_col] = result['cleaned_text']
                    processed_df.at[idx, tokens_col] = result['tokens']
                    processed_df.at[idx, filtered_col] = result['filtered_tokens']
                    total_original_tokens += result['token_count']
                    total_filtered_tokens += result['filtered_token_count']
                
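                else:
                    # Use advanced processing. advanced_turkish_preprocess is not defined
                    # above this cell; the cell that defines it must be run first.
                    result = advanced_turkish_preprocess(text, processing_method)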
                    processed_df.at[idx, cleaned_col] = result['cleaned']
                    processed_df.at[idx, tokens_col] = result['tokens']
                    
                    # Store every analysis this method produced and pick the filtered
                    # tokens by priority: lemmas, then stems, then spaCy lemmas.
                    # With 'all', each advanced column is filled, not just one.
                    best_tokens = None
                    if processing_method in ('zeyrek', 'all') and result.get('lemmas'):
                        processed_df.at[idx, lemmas_col] = result['lemmas']
                        best_tokens = remove_stopwords(result['lemmas'])
                    if processing_method in ('stemmer', 'all') and result.get('stems'):
                        processed_df.at[idx, stems_col] = result['stems']
                        if best_tokens is None:
                            best_tokens = remove_stopwords(result['stems'])
                    spacy_analysis = result.get('spacy_analysis') or {}
                    if processing_method in ('spacy', 'all') and spacy_analysis.get('lemmas'):
                        processed_df.at[idx, entities_col] = spacy_analysis.get('entities', [])
                        processed_df.at[idx, pos_col] = spacy_analysis.get('pos_tags', [])
                        if best_tokens is None:
                            best_tokens = remove_stopwords(spacy_analysis['lemmas'])
                    if best_tokens is None:
                        best_tokens = remove_stopwords(result['tokens'])
                    
                    processed_df.at[idx, filtered_col] = best_tokens
                    total_original_tokens += len(result['tokens'])
                    total_filtered_tokens += len(best_tokens)
                
                non_empty_count += 1
        
        # Print statistics
        print(f"  Processed {non_empty_count} non-empty texts")
        if non_empty_count > 0:
            print(f"  Average tokens per text: {total_original_tokens/non_empty_count:.1f} → {total_filtered_tokens/non_empty_count:.1f}")
            reduction_pct = (1 - total_filtered_tokens/total_original_tokens) * 100 if total_original_tokens > 0 else 0
            print(f"  Token reduction: {reduction_pct:.1f}%")
    
    return processed_df

# Apply advanced preprocessing to our dataset
text_columns_to_process = ['title', 'full_complaint', 'company_response']

print("Starting advanced dataset preprocessing...")
print(f"Available libraries: {library_status}")

# Let user choose processing method or use auto
processing_method = 'auto'  # Change this to 'zeyrek', 'stemmer', 'spacy', or 'basic' if desired

processed_df = process_complaint_dataset_advanced(df, text_columns_to_process, processing_method)

print(f"\nAdvanced processing complete! Dataset shape: {processed_df.shape}")
print(f"New columns added: {[col for col in processed_df.columns if col not in df.columns]}")

## 6. Text Analysis and Statistics

Let's analyze the preprocessed text to understand the characteristics of our cleaned dataset.
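
Two of the statistics used in this section are token frequencies and vocabulary richness (unique tokens divided by total tokens, also known as the type-token ratio). A minimal sketch on hypothetical tokens:

```python
from collections import Counter

# Hypothetical filtered tokens pooled from several complaints.
tokens = ["iade", "otel", "iade", "rezervasyon", "otel", "iade"]

counts = Counter(tokens)
# Unique tokens divided by total tokens; 0.0 for an empty list.
richness = len(counts) / len(tokens) if tokens else 0.0

print(counts.most_common(2))
print(round(richness, 3))
```

The ratio tends to fall as the amount of text grows, so comparing it between a short field (titles) and a long one (full complaints) should be done with care.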

In [None]:
# Calculate comprehensive text statistics
def calculate_text_statistics(df: pd.DataFrame, text_columns: List[str]) -> Dict:
    """
    Calculate comprehensive statistics for processed text columns.
    
    Args:
        df (pd.DataFrame): Processed dataframe
        text_columns (List[str]): Original text column names
        
    Returns:
        Dict: Statistics for each column
    """
    
    stats = {}
    
    for col in text_columns:
        if col not in df.columns:
            continue
            
        col_stats = {
            'original': {
                'total_texts': df[col].notna().sum(),
                'avg_length': df[col].dropna().str.len().mean() if not df[col].dropna().empty else 0,
                'max_length': df[col].dropna().str.len().max() if not df[col].dropna().empty else 0,
                'min_length': df[col].dropna().str.len().min() if not df[col].dropna().empty else 0
            },
            'processed': {}
        }
        
        # Calculate stats for processed versions
        filtered_col = f"{col}_filtered"
        if filtered_col in df.columns:
            # Get all filtered tokens
            all_filtered_tokens = []
            for tokens in df[filtered_col].dropna():
                if isinstance(tokens, list):
                    all_filtered_tokens.extend(tokens)
            
            col_stats['processed'] = {
                'total_tokens': len(all_filtered_tokens),
                'unique_tokens': len(set(all_filtered_tokens)),
                'avg_tokens_per_text': len(all_filtered_tokens) / col_stats['original']['total_texts'] if col_stats['original']['total_texts'] > 0 else 0,
                'vocabulary_richness': len(set(all_filtered_tokens)) / len(all_filtered_tokens) if all_filtered_tokens else 0
            }
            
            # Most common tokens
            token_counts = Counter(all_filtered_tokens)
            col_stats['processed']['most_common_tokens'] = token_counts.most_common(20)
        
        stats[col] = col_stats
    
    return stats

# Calculate and display statistics
text_stats = calculate_text_statistics(processed_df, text_columns_to_process)

print("=== TEXT PROCESSING STATISTICS ===")
for col, stats in text_stats.items():
    print(f"\n{col.upper()}:")
    
    orig = stats['original']
    print(f"  Original texts: {orig['total_texts']}")
    print(f"  Avg length: {orig['avg_length']:.1f} chars (range: {orig['min_length']}-{orig['max_length']})")
    
    if 'processed' in stats and stats['processed']:
        proc = stats['processed']
        print(f"  Total tokens: {proc['total_tokens']:,}")
        print(f"  Unique tokens: {proc['unique_tokens']:,}")
        print(f"  Avg tokens/text: {proc['avg_tokens_per_text']:.1f}")
        print(f"  Vocabulary richness: {proc['vocabulary_richness']:.3f}")
        
        if proc['most_common_tokens']:
            print(f"  Top 10 tokens: {[token for token, count in proc['most_common_tokens'][:10]]}")

In [None]:
# Create visualizations of the text statistics
plt.style.use('default')
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Turkish Text Preprocessing Analysis - Setur Complaints', fontsize=16, fontweight='bold')

# 1. Text length distribution (before preprocessing)
ax1 = axes[0, 0]
lengths = []
for col in ['title', 'full_complaint', 'company_response']:
    if col in df.columns:
        col_lengths = df[col].dropna().str.len()
        if not col_lengths.empty:
            lengths.extend(col_lengths.tolist())

if lengths:
    ax1.hist(lengths, bins=50, alpha=0.7, edgecolor='black')
    ax1.set_title('Original Text Length Distribution')
    ax1.set_xlabel('Characters')
    ax1.set_ylabel('Frequency')
    ax1.axvline(np.mean(lengths), color='red', linestyle='--', label=f'Mean: {np.mean(lengths):.0f}')
    ax1.legend()

# 2. Token count distribution (after preprocessing)
ax2 = axes[0, 1]
token_counts = []
for col in ['title_filtered', 'full_complaint_filtered', 'company_response_filtered']:
    if col in processed_df.columns:
        for tokens in processed_df[col].dropna():
            if isinstance(tokens, list):
                token_counts.append(len(tokens))

if token_counts:
    ax2.hist(token_counts, bins=30, alpha=0.7, color='green', edgecolor='black')
    ax2.set_title('Processed Token Count Distribution')
    ax2.set_xlabel('Number of Tokens')
    ax2.set_ylabel('Frequency')
    ax2.axvline(np.mean(token_counts), color='red', linestyle='--', label=f'Mean: {np.mean(token_counts):.1f}')
    ax2.legend()

# 3. Most common words across all complaints
ax3 = axes[1, 0]
all_tokens = []
for col in ['title_filtered', 'full_complaint_filtered', 'company_response_filtered']:
    if col in processed_df.columns:
        for tokens in processed_df[col].dropna():
            if isinstance(tokens, list):
                all_tokens.extend(tokens)

if all_tokens:
    word_freq = Counter(all_tokens)
    top_words = word_freq.most_common(15)
    words, counts = zip(*top_words)
    
    ax3.barh(range(len(words)), counts, color='skyblue', edgecolor='black')
    ax3.set_yticks(range(len(words)))
    ax3.set_yticklabels(words)
    ax3.set_title('Top 15 Most Common Words (After Preprocessing)')
    ax3.set_xlabel('Frequency')
    ax3.invert_yaxis()

# 4. Turkish character usage
ax4 = axes[1, 1]
turkish_chars = {'ç': 0, 'ğ': 0, 'ı': 0, 'ö': 0, 'ş': 0, 'ü': 0}
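# Guard against a missing column so the figure still renders
if 'full_complaint_cleaned' in processed_df.columns:
    all_text = ' '.join(processed_df['full_complaint_cleaned'].dropna().astype(str))
else:
    all_text = ''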

for char in turkish_chars:
    turkish_chars[char] = all_text.count(char)

if any(turkish_chars.values()):
    chars, counts = zip(*turkish_chars.items())
    ax4.bar(chars, counts, color='orange', edgecolor='black')
    ax4.set_title('Turkish Character Frequency (After Cleaning)')
    ax4.set_xlabel('Turkish Characters')
    ax4.set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# Print summary statistics (guarded so empty inputs do not raise or print NaN)
print("\n=== OVERALL PREPROCESSING SUMMARY ===")
print(f"Total unique words in vocabulary: {len(set(all_tokens)):,}")
print(f"Total word instances: {len(all_tokens):,}")
if token_counts:
    print(f"Average text length (tokens): {np.mean(token_counts):.1f}")
if all_tokens:
    # Type-token ratio: unique words divided by total word instances
    print(f"Vocabulary richness: {len(set(all_tokens))/len(all_tokens):.3f}")

## 7. Advanced Turkish Text Processing

For more sophisticated analysis, we can add lemmatization and advanced Turkish NLP features.
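
One detail to watch before any dictionary lookup: Python's built-in `str.lower()` follows the default Unicode case mapping, so `I` becomes `i` rather than the Turkish dotless `ı`, and `İ` becomes `i` followed by a combining dot. The sketch below shows one way to handle these two letters before lowercasing. `turkish_lower` is a hypothetical helper name used for illustration only; it is not part of this notebook's pipeline.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules for I/ı and İ/i."""
    # Map the two Turkish-specific capitals first, then let str.lower()
    # handle the rest (Ç, Ğ, Ö, Ş, Ü already lowercase correctly).
    return text.replace('İ', 'i').replace('I', 'ı').lower()


for sample in ['ISPARTA', 'İSTANBUL', 'ŞİKAYET']:
    print(sample, '->', turkish_lower(sample))
```

Note that this applies Turkish rules to every word, so an English word such as `INTERNET` would also get a dotless `ı`. Whether that is acceptable depends on how much mixed-language text the complaints contain.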

In [None]:
# Turkish text normalization for common internet slang and abbreviations
TURKISH_SLANG_DICT = {
    # Common abbreviations and slang
    'gzl': 'güzel',
    'bişi': 'bir şey',  # colloquial spelling of 'bir şey' (something)
    'nslsn': 'nasılsın',
    'tmm': 'tamam',
    'sğl': 'sağol',
    'tşk': 'teşekkür',
    'krdsm': 'kardeşim',
    'mslm': 'müslüman',
    
    # Missing diacritics (common in informal text)
    # Words that are already spelled correctly need no entry here
    'tesekkur': 'teşekkür',
    'cok': 'çok',
    'guzel': 'güzel',
    'musteri': 'müşteri',
    'hosgeldin': 'hoşgeldin',
    
    # Common misspellings
    'deil': 'değil',
    'bisi': 'bir şey',
    'nasi': 'nasıl',
    'olcak': 'olacak',
    'gidcek': 'gidecek'
}

def normalize_slang_and_typos(text: str) -> str:
    """
    Normalize Turkish slang and common typos.
    
    Args:
        text (str): Input text
        
    Returns:
        str: Normalized text
    """
    if pd.isna(text) or not isinstance(text, str):
        return ""
    
    words = text.split()
    normalized_words = []
    
    for word in words:
        # Check if word (lowercase) is in slang dictionary
        lower_word = word.lower()
        if lower_word in TURKISH_SLANG_DICT:
            normalized_words.append(TURKISH_SLANG_DICT[lower_word])
        else:
            normalized_words.append(word)
    
    return ' '.join(normalized_words)

# Simple Turkish stemming (basic suffix removal)
def simple_turkish_stem(word: str) -> str:
    """
    Simple Turkish stemming by removing common suffixes.
    This is a basic implementation - for production use, consider using 
    specialized Turkish NLP libraries like Zeyrek or TurkishStemmer.
    
    Args:
        word (str): Input word
        
    Returns:
        str: Stemmed word
    """
    if len(word) < 4:  # Don't stem very short words
        return word
    
    # Common Turkish suffixes (simplified)
    suffixes = [
        # Plural
        'ler', 'lar',
        # Genitive
        'nin', 'nın', 'nun', 'nün',
        # Accusative
        'yi', 'yı', 'yu', 'yü',
        # Locative
        'de', 'da', 'te', 'ta',
        # Ablative
        'den', 'dan', 'ten', 'tan',
        # Reported (evidential) past tense
        'mış', 'miş', 'muş', 'müş',
        # Progressive
        'yor',
        # Adjective suffixes
        'li', 'lı', 'lu', 'lü',
        'siz', 'sız', 'suz', 'süz'
    ]
    
    word_lower = word.lower()
    
    # Try to remove suffixes (longest first)
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word_lower.endswith(suffix) and len(word_lower) > len(suffix) + 2:
            return word_lower[:-len(suffix)]
    
    return word_lower

# Test advanced normalization
test_texts_advanced = [
    "cok gzl bir urun tsk ederim",
    "müşteriler memnun kalmış",
    "otellerde kalitesiz hizmet"
]

print("=== ADVANCED TURKISH PROCESSING TEST ===")
for text in test_texts_advanced:
    print(f"\nOriginal: {text}")
    
    # Apply slang normalization
    normalized = normalize_slang_and_typos(text)
    print(f"Slang normalized: {normalized}")
    
    # Apply basic stemming
    words = normalized.split()
    stemmed_words = [simple_turkish_stem(word) for word in words]
    print(f"Stemmed: {' '.join(stemmed_words)}")
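
`simple_turkish_stem` removes at most one suffix, so a stacked form such as `otellerde` (otel + ler + de) only loses its last suffix. The sketch below shows one possible extension that strips suffixes repeatedly. The function name `strip_suffixes_iteratively`, the `min_stem_len` and `max_rounds` parameters, and the reduced suffix list are assumptions made for illustration, not part of the notebook's pipeline.

```python
from typing import List

# Small illustrative subset of the suffix list used above
DEMO_SUFFIXES: List[str] = ['ler', 'lar', 'den', 'dan', 'ten', 'tan',
                            'de', 'da', 'te', 'ta']


def strip_suffixes_iteratively(word: str,
                               suffixes: List[str] = DEMO_SUFFIXES,
                               min_stem_len: int = 3,
                               max_rounds: int = 3) -> str:
    """Remove up to max_rounds suffixes, trying longer suffixes first."""
    stem = word.lower()
    ordered = sorted(suffixes, key=len, reverse=True)
    for _ in range(max_rounds):
        for suffix in ordered:
            # Only strip if enough of the word is left afterwards
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_len:
                stem = stem[:-len(suffix)]
                break
        else:
            break  # no suffix matched in this round, so stop
    return stem


for demo_word in ['otellerde', 'müşteriler', 'kalite']:
    print(demo_word, '->', strip_suffixes_iteratively(demo_word))
```

The last example is a deliberate warning: `kalite` is a root word, yet it ends in `te` and is cut to `kali`. Rule-based stripping over-stems in cases like this, which is the main reason to prefer a morphological analyzer when one is available.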

In [None]:
# Comprehensive comparison of preprocessing methods
def compare_preprocessing_methods(text: str) -> pd.DataFrame:
    """
    Compare different Turkish preprocessing methods side by side.
    
    Args:
        text (str): Input text
        
    Returns:
        pd.DataFrame: Comparison results
    """
    
    methods_results = []
    
    # 1. Basic preprocessing (our original method)
    basic_result = preprocess_turkish_text(text)
    methods_results.append({
        'Method': 'Basic (Custom)',
        'Tokens': len(basic_result['filtered_tokens']),
        'Sample_Words': ' '.join(basic_result['filtered_tokens'][:5]),
        'Available': True
    })
    
    # 2. NLTK preprocessing
    if nltk:
        nltk_tokens = nltk_tokenize_turkish(clean_turkish_text(normalize_turkish_text(text)))
        nltk_stops = get_nltk_turkish_stopwords()
        if nltk_stops:
            nltk_filtered = [t for t in nltk_tokens if t.lower() not in nltk_stops]
        else:
            nltk_filtered = remove_stopwords(nltk_tokens)
        
        methods_results.append({
            'Method': 'NLTK',
            'Tokens': len(nltk_filtered),
            'Sample_Words': ' '.join(nltk_filtered[:5]),
            'Available': True
        })
    else:
        methods_results.append({
            'Method': 'NLTK',
            'Tokens': 0,
            'Sample_Words': 'Not available',
            'Available': False
        })
    
    # 3. Zeyrek lemmatization
    if turkish_analyzer:
        zeyrek_result = advanced_turkish_preprocess(text, 'zeyrek')
        lemma_filtered = remove_stopwords(zeyrek_result['lemmas'])
        
        methods_results.append({
            'Method': 'Zeyrek (Lemmas)',
            'Tokens': len(lemma_filtered),
            'Sample_Words': ' '.join(lemma_filtered[:5]),
            'Available': True
        })
    else:
        methods_results.append({
            'Method': 'Zeyrek (Lemmas)',
            'Tokens': 0,
            'Sample_Words': 'Not available',
            'Available': False
        })
    
    # 4. Turkish Stemmer
    if turkish_stemmer:
        stemmer_result = advanced_turkish_preprocess(text, 'stemmer')
        stems_filtered = remove_stopwords(stemmer_result['stems'])
        
        methods_results.append({
            'Method': 'Turkish Stemmer',
            'Tokens': len(stems_filtered),
            'Sample_Words': ' '.join(stems_filtered[:5]),
            'Available': True
        })
    else:
        methods_results.append({
            'Method': 'Turkish Stemmer',
            'Tokens': 0,
            'Sample_Words': 'Not available',
            'Available': False
        })
    
    # 5. spaCy processing
    if nlp_turkish:
        spacy_result = advanced_turkish_preprocess(text, 'spacy')
        spacy_lemmas = spacy_result['spacy_analysis'].get('lemmas', [])
        spacy_filtered = remove_stopwords(spacy_lemmas)
        
        methods_results.append({
            'Method': 'spaCy',
            'Tokens': len(spacy_filtered),
            'Sample_Words': ' '.join(spacy_filtered[:5]),
            'Available': True
        })
    else:
        methods_results.append({
            'Method': 'spaCy',
            'Tokens': 0,
            'Sample_Words': 'Not available',
            'Available': False
        })
    
    return pd.DataFrame(methods_results)

# Test comparison on sample complaints
print("\n=== PREPROCESSING METHODS COMPARISON ===")
sample_complaint = "Müşteri hizmetlerinden çok memnun kalmadık, otellerdeki hizmet kalitesi beklentilerimizin altındaydı ve rezervasyonumuzla ilgili sorunlar yaşadık."

print(f"Sample text: {sample_complaint}")
print("\nComparison of different preprocessing methods:")
comparison_df = compare_preprocessing_methods(sample_complaint)
print(comparison_df.to_string(index=False))

# Performance benchmarking
def benchmark_preprocessing_methods(texts: List[str], num_runs: int = 3) -> pd.DataFrame:
    """
    Benchmark the performance of different preprocessing methods.
    
    Args:
        texts (List[str]): List of texts to process
        num_runs (int): Number of runs for averaging
        
    Returns:
        pd.DataFrame: Benchmark results
    """
    import time
    
    results = []
    
    for method_name, method_func in [
        ('Basic', lambda t: preprocess_turkish_text(t)),
        ('NLTK', lambda t: nltk_tokenize_turkish(clean_turkish_text(normalize_turkish_text(t))) if nltk else []),
        ('Zeyrek', lambda t: advanced_turkish_preprocess(t, 'zeyrek') if turkish_analyzer else {}),
        ('Stemmer', lambda t: advanced_turkish_preprocess(t, 'stemmer') if turkish_stemmer else {}),
        ('spaCy', lambda t: advanced_turkish_preprocess(t, 'spacy') if nlp_turkish else {})
    ]:
        
        times = []
        for _ in range(num_runs):
            # perf_counter() is a monotonic, high-resolution clock suited to timing
            start_time = time.perf_counter()
            for text in texts:
                try:
                    method_func(text)
                except Exception:
                    pass  # Skip texts that a method fails on
            end_time = time.perf_counter()
            times.append(end_time - start_time)
        
        avg_time = np.mean(times)
        results.append({
            'Method': method_name,
            'Avg_Time_Seconds': round(avg_time, 4),
            'Texts_Per_Second': round(len(texts) / avg_time, 2) if avg_time > 0 else 0
        })
    
    return pd.DataFrame(results)

# Run benchmark on a sample of complaints
if 'df' in locals() and not df.empty:
    sample_texts = df['full_complaint'].dropna().head(10).tolist()
    if sample_texts:
        print("\n=== PERFORMANCE BENCHMARK ===")
        print(f"Benchmarking on {len(sample_texts)} complaint texts...")
        benchmark_results = benchmark_preprocessing_methods(sample_texts)
        print(benchmark_results.to_string(index=False))
else:
    print("\nDataset not loaded yet - benchmark will run after data loading.")
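
Averaging a few runs can be skewed by one slow run (for example, the first call that loads a model). A minimal sketch of an alternative is shown below: it reports the best and median run time for a single callable. `time_method` and the keys of its result are hypothetical names chosen for this example; `str.split` stands in for a real preprocessing function.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def time_method(func: Callable[[str], object],
                texts: Iterable[str],
                num_runs: int = 3) -> Dict[str, float]:
    """Time func over all texts num_runs times; report best and median."""
    texts = list(texts)
    run_times = []
    for _ in range(num_runs):
        start = time.perf_counter()
        for text in texts:
            func(text)
        run_times.append(time.perf_counter() - start)

    best = min(run_times)
    return {
        'best_seconds': best,
        'median_seconds': statistics.median(run_times),
        'texts_per_second': len(texts) / best if best > 0 else float('inf'),
    }


demo_texts = ["otel çok güzeldi", "rezervasyon iptal edildi"] * 100
timing = time_method(str.split, demo_texts)
print(timing)
```

Methods whose library is missing should be left out of such a comparison instead of being timed as no-ops, since an empty call will always look fastest.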

## 8. Save Processed Data

Finally, let's save our preprocessed data for further analysis.
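
A point to keep in mind when saving: CSV has no list type, so token-list columns such as `title_filtered` are written as plain strings and come back as strings when the file is reloaded. One option is to encode each list as JSON before writing and decode it after reading. The sketch below illustrates the idea with the standard library only; the two sample rows are made up for the example.

```python
import csv
import io
import json

# Made-up rows standing in for processed complaints
rows = [
    {'id': 1, 'title_filtered': ['otel', 'hizmet', 'kötü']},
    {'id': 2, 'title_filtered': ['rezervasyon', 'iptal']},
]

# Write: encode each token list as a JSON string (Turkish letters kept as-is)
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['id', 'title_filtered'])
writer.writeheader()
for row in rows:
    writer.writerow({
        'id': row['id'],
        'title_filtered': json.dumps(row['title_filtered'], ensure_ascii=False),
    })

# Read: decode the JSON string back into a Python list
buffer.seek(0)
restored = [
    {'id': int(record['id']),
     'title_filtered': json.loads(record['title_filtered'])}
    for record in csv.DictReader(buffer)
]
print(restored)
```

With pandas the same idea would be to apply `json.dumps` to each list column before calling `to_csv`, and `json.loads` after `read_csv`.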

# Advanced Turkish Text Processing with Professional Libraries

# 1. NLTK-based tokenization and stopword removal
def nltk_tokenize_turkish(text: str) -> List[str]:
    """
    Tokenize Turkish text using NLTK.
    
    Args:
        text (str): Input text
        
    Returns:
        List[str]: List of tokens
    """
    if not nltk or pd.isna(text) or not isinstance(text, str):
        return []
    
    try:
        # Use NLTK's word tokenizer
        tokens = word_tokenize(text, language='turkish')
        # Filter out punctuation and short tokens
        tokens = [token for token in tokens if token.isalnum() and len(token) > 1]
        return tokens
    except Exception:
        # Fallback to simple tokenization
        return text.split()

def get_nltk_turkish_stopwords() -> Set[str]:
    """
    Get Turkish stopwords from NLTK if available.
    
    Returns:
        Set[str]: Set of Turkish stopwords
    """
    if not nltk:
        return set()
    
    try:
        # Get Turkish stopwords from NLTK
        turkish_stops = set(stopwords.words('turkish'))
        print(f"NLTK Turkish stopwords: {len(turkish_stops)} words")
        return turkish_stops
    except Exception:
        print("NLTK Turkish stopwords not available")
        return set()

# 2. Zeyrek-based morphological analysis
def zeyrek_lemmatize(word: str) -> str:
    """
    Lemmatize Turkish word using Zeyrek morphological analyzer.
    
    Args:
        word (str): Input word
        
    Returns:
        str: Lemmatized word
    """
    if not turkish_analyzer or not word:
        return word
    
    try:
        # lemmatize() returns a list of (word, [candidate lemmas]) pairs
        results = turkish_analyzer.lemmatize(word)
        if results and results[0][1]:
            # Take the first candidate lemma for the word
            return results[0][1][0]
        return word
    except Exception:
        return word

def zeyrek_analyze_morphology(word: str) -> List[str]:
    """
    Get detailed morphological analysis using Zeyrek.
    
    Args:
        word (str): Input word
        
    Returns:
        List[str]: Morphological analyses
    """
    if not turkish_analyzer or not word:
        return []
    
    try:
        analyses = turkish_analyzer.analyze(word)
        return [str(analysis) for analysis in analyses[:3]]  # Top 3 analyses
    except Exception:
        return []

# 3. Professional Turkish Stemmer
def professional_turkish_stem(word: str) -> str:
    """
    Stem Turkish word using professional Turkish stemmer.
    
    Args:
        word (str): Input word
        
    Returns:
        str: Stemmed word
    """
    if not turkish_stemmer or not word:
        return word
    
    try:
        return turkish_stemmer.stem(word)
    except Exception:
        return word

# 4. spaCy-based processing
from typing import Any  # 'any' is a built-in function, not a type; use typing.Any

def spacy_process_turkish(text: str) -> Dict[str, Any]:
    """
    Process Turkish text using spaCy for NER, POS tagging, etc.
    
    Args:
        text (str): Input text
        
    Returns:
        Dict: Processing results
    """
    # Same keys as the successful result, so callers can index safely
    empty_result = {'tokens': [], 'lemmas': [], 'pos_tags': [], 'entities': [], 'noun_phrases': []}
    
    if not nlp_turkish or not text:
        return empty_result
    
    try:
        doc = nlp_turkish(text)
        
        return {
            'tokens': [token.text for token in doc],
            'lemmas': [token.lemma_ for token in doc],
            'pos_tags': [(token.text, token.pos_, token.tag_) for token in doc],
            'entities': [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents],
            'noun_phrases': [chunk.text for chunk in doc.noun_chunks]
        }
    except Exception:
        return empty_result

# Combined advanced preprocessing function
def advanced_turkish_preprocess(text: str, method: str = 'zeyrek') -> Dict[str, Any]:
    """
    Advanced Turkish text preprocessing using professional libraries.
    
    Args:
        text (str): Input text
        method (str): Processing method ('zeyrek', 'stemmer', 'spacy', 'all')
        
    Returns:
        Dict: Comprehensive processing results
    """
    if pd.isna(text) or not isinstance(text, str):
        return {'original': text, 'processed': '', 'tokens': [], 'method': method}
    
    # Basic cleaning first
    cleaned = clean_turkish_text(normalize_turkish_text(text))
    
    result = {
        'original': text,
        'cleaned': cleaned,
        'method': method,
        'tokens': [],
        'lemmas': [],
        'stems': [],
        'spacy_analysis': {},
        'morphology': []
    }
    
    # Tokenize using NLTK if available
    if nltk:
        tokens = nltk_tokenize_turkish(cleaned)
    else:
        tokens = tokenize_turkish(cleaned)
    
    result['tokens'] = tokens
    
    # Apply different processing methods
    if method in ['zeyrek', 'all'] and turkish_analyzer:
        # Zeyrek lemmatization
        lemmas = [zeyrek_lemmatize(token) for token in tokens]
        result['lemmas'] = lemmas
        
        # Sample morphological analysis for first few words
        result['morphology'] = [
            (token, zeyrek_analyze_morphology(token)) 
            for token in tokens[:5]  # Analyze first 5 tokens
        ]
    
    if method in ['stemmer', 'all'] and turkish_stemmer:
        # Professional stemming
        stems = [professional_turkish_stem(token) for token in tokens]
        result['stems'] = stems
    
    if method in ['spacy', 'all'] and nlp_turkish:
        # spaCy analysis
        result['spacy_analysis'] = spacy_process_turkish(cleaned)
    
    return result

# Test the advanced processing
test_texts_advanced = [
    "Müşteri hizmetleri kalitesiz, otellerde konaklamak memnuniyet vermiyor.",
    "Rezervasyonumuz iptal edildi, paramızı geri alamadık.",
    "Çok güzel bir tatil geçirdik, kesinlikle tavsiye ederim."
]

print("=== ADVANCED TURKISH NLP PROCESSING TEST ===")
for i, text in enumerate(test_texts_advanced, 1):
    print(f"\n--- Test {i} ---")
    print(f"Original: {text}")
    
    # Test different methods
    for method in ['zeyrek', 'stemmer', 'spacy']:
        if (method == 'zeyrek' and turkish_analyzer) or \
           (method == 'stemmer' and turkish_stemmer) or \
           (method == 'spacy' and nlp_turkish):
            
            result = advanced_turkish_preprocess(text, method)
            print(f"\n{method.upper()} processing:")
            print(f"  Tokens: {result['tokens'][:8]}...")  # Show first 8 tokens
            
            if method == 'zeyrek' and result['lemmas']:
                print(f"  Lemmas: {result['lemmas'][:8]}...")
                if result['morphology']:
                    print(f"  Sample morphology: {result['morphology'][0]}")
            
            elif method == 'stemmer' and result['stems']:
                print(f"  Stems: {result['stems'][:8]}...")
            
            elif method == 'spacy' and result['spacy_analysis']:
                spacy_res = result['spacy_analysis']
                if spacy_res['entities']:
                    print(f"  Entities: {spacy_res['entities']}")
                if spacy_res['pos_tags']:
                    print(f"  POS tags: {spacy_res['pos_tags'][:5]}...")
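
When reading lemmatizer output, the return shape matters. Zeyrek's `lemmatize` is documented as returning a list of `(word, [candidate lemmas])` pairs. Assuming that shape, a small helper can pick the first candidate and fall back to the original token when nothing is found. `first_lemma` is a hypothetical name, and the sample data below is hand-written in that shape; it is not real analyzer output.

```python
from typing import List, Sequence, Tuple

LemmaOutput = Sequence[Tuple[str, List[str]]]


def first_lemma(lemmatize_output: LemmaOutput, fallback: str) -> str:
    """Return the first candidate lemma, or fallback if none is available."""
    if not lemmatize_output:
        return fallback
    _word, candidates = lemmatize_output[0]
    return candidates[0] if candidates else fallback


# Hand-written stand-ins for analyzer output (assumed shape)
print(first_lemma([('benim', ['ben'])], 'benim'))
print(first_lemma([('xyzq', [])], 'xyzq'))
print(first_lemma([], 'otel'))
```

Taking the first candidate is a simplification: an ambiguous word can have several valid lemmas, and choosing between them properly needs sentence context.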

In [None]:
# Create a summary of the preprocessing results
def create_preprocessing_summary(df: pd.DataFrame) -> Dict:
    """
    Create a summary of preprocessing results.
    
    Args:
        df (pd.DataFrame): Processed dataframe
        
    Returns:
        Dict: Summary statistics
    """
    
    summary = {
        'dataset_info': {
            'total_records': len(df),
            'processing_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')
        },
        'text_fields': {}
    }
    
    # Analyze each text field
    for col in ['title', 'full_complaint', 'company_response']:
        if col in df.columns:
            filtered_col = f"{col}_filtered"
            
            # Count non-empty texts
            non_empty = df[col].notna().sum()
            
            # Get all tokens for this field
            all_tokens = []
            if filtered_col in df.columns:
                for tokens in df[filtered_col].dropna():
                    if isinstance(tokens, list):
                        all_tokens.extend(tokens)
            
            summary['text_fields'][col] = {
                'non_empty_texts': int(non_empty),
                'total_tokens': len(all_tokens),
                'unique_tokens': len(set(all_tokens)),
                'avg_tokens_per_text': len(all_tokens) / non_empty if non_empty > 0 else 0,
                'top_10_words': Counter(all_tokens).most_common(10) if all_tokens else []
            }
    
    return summary

# Create preprocessing summary
preprocessing_summary = create_preprocessing_summary(processed_df)

# Display summary
print("=== PREPROCESSING SUMMARY ===")
print(f"Dataset: {preprocessing_summary['dataset_info']['total_records']} records")
print(f"Processed on: {preprocessing_summary['dataset_info']['processing_date']}")

for field, stats in preprocessing_summary['text_fields'].items():
    print(f"\n{field.upper()}:")
    print(f"  Non-empty texts: {stats['non_empty_texts']}")
    print(f"  Total tokens: {stats['total_tokens']:,}")
    print(f"  Unique tokens: {stats['unique_tokens']:,}")
    print(f"  Avg tokens/text: {stats['avg_tokens_per_text']:.1f}")
    if stats['top_10_words']:
        top_words = [word for word, count in stats['top_10_words']]
        print(f"  Top words: {top_words}")

# Save processed dataset to CSV
output_file = 'setur_complaints_processed.csv'
processed_df.to_csv(output_file, index=False, encoding='utf-8')
print(f"\nProcessed dataset saved to: {output_file}")

# Save preprocessing summary to JSON
summary_file = 'preprocessing_summary.json'
with open(summary_file, 'w', encoding='utf-8') as f:
    json.dump(preprocessing_summary, f, ensure_ascii=False, indent=2)
print(f"Preprocessing summary saved to: {summary_file}")

# Create a clean dataset with only essential columns for analysis
analysis_columns = [
    'id', 'title', 'time', 'supported', 'rating',
    'title_cleaned', 'title_filtered',
    'full_complaint_cleaned', 'full_complaint_filtered',
    'company_response_cleaned', 'company_response_filtered'
]

# Keep only existing columns
analysis_columns = [col for col in analysis_columns if col in processed_df.columns]
analysis_df = processed_df[analysis_columns].copy()

# Save clean dataset for analysis
analysis_file = 'setur_complaints_for_analysis.csv'
analysis_df.to_csv(analysis_file, index=False, encoding='utf-8')
print(f"Clean dataset for analysis saved to: {analysis_file}")

print("\n=== PREPROCESSING COMPLETE ===")
print("Files created:")
print(f"  1. {output_file} - Full processed dataset")
print(f"  2. {analysis_file} - Clean dataset for analysis")
print(f"  3. {summary_file} - Preprocessing summary")

print("\nNext steps:")
print("  - Use the processed tokens for sentiment analysis")
print("  - Apply topic modeling (LDA, NMF) on cleaned text")
print("  - Perform keyword extraction and entity recognition")
print("  - Analyze complaint patterns and trends")

# Create an enhanced summary of the preprocessing results
def create_advanced_preprocessing_summary(df: pd.DataFrame) -> Dict:
    """
    Create a comprehensive summary of advanced preprocessing results.
    
    Args:
        df (pd.DataFrame): Processed dataframe
        
    Returns:
        Dict: Summary statistics
    """
    
    summary = {
        'dataset_info': {
            'total_records': len(df),
            'processing_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'),
            'libraries_used': library_status
        },
        'text_fields': {},
        'advanced_features': {}
    }
    
    # Analyze each text field
    for col in ['title', 'full_complaint', 'company_response']:
        if col in df.columns:
            filtered_col = f"{col}_filtered"
            lemmas_col = f"{col}_lemmas"
            stems_col = f"{col}_stems"
            entities_col = f"{col}_entities"
            
            # Count non-empty texts
            non_empty = df[col].notna().sum()
            
            # Get all tokens for this field
            all_tokens = []
            all_lemmas = []
            all_stems = []
            all_entities = []
            
            if filtered_col in df.columns:
                for tokens in df[filtered_col].dropna():
                    if isinstance(tokens, list):
                        all_tokens.extend(tokens)
            
            if lemmas_col in df.columns:
                for lemmas in df[lemmas_col].dropna():
                    if isinstance(lemmas, list):
                        all_lemmas.extend(lemmas)
            
            if stems_col in df.columns:
                for stems in df[stems_col].dropna():
                    if isinstance(stems, list):
                        all_stems.extend(stems)
            
            if entities_col in df.columns:
                for entities in df[entities_col].dropna():
                    if isinstance(entities, list):
                        all_entities.extend([ent[0] for ent in entities if isinstance(ent, tuple)])  # Extract entity text
            
            field_summary = {
                'non_empty_texts': int(non_empty),
                'total_tokens': len(all_tokens),
                'unique_tokens': len(set(all_tokens)),
                'avg_tokens_per_text': len(all_tokens) / non_empty if non_empty > 0 else 0,
                'top_10_words': Counter(all_tokens).most_common(10) if all_tokens else []
            }
            
            # Add advanced features if available
            if all_lemmas:
                field_summary['total_lemmas'] = len(all_lemmas)
                field_summary['unique_lemmas'] = len(set(all_lemmas))
                field_summary['top_10_lemmas'] = Counter(all_lemmas).most_common(10)
            
            if all_stems:
                field_summary['total_stems'] = len(all_stems)
                field_summary['unique_stems'] = len(set(all_stems))
                field_summary['top_10_stems'] = Counter(all_stems).most_common(10)
            
            if all_entities:
                field_summary['total_entities'] = len(all_entities)
                field_summary['unique_entities'] = len(set(all_entities))
                field_summary['top_entities'] = Counter(all_entities).most_common(5)
            
            summary['text_fields'][col] = field_summary
    
    # Overall vocabulary comparison, computed from each field's full unique counts
    # rather than from the top-10 lists (which always differ by at most 10 words).
    # Counts are summed per field, so a word shared by two fields is counted twice.
    if summary['text_fields']:
        field_summaries = list(summary['text_fields'].values())
        lemma_fields = [f for f in field_summaries if 'unique_lemmas' in f]
        stem_fields = [f for f in field_summaries if 'unique_stems' in f]
        
        summary['advanced_features'] = {
            'vocabulary_reduction': {
                'tokens_vs_lemmas': sum(f['unique_tokens'] - f['unique_lemmas'] for f in lemma_fields),
                'tokens_vs_stems': sum(f['unique_tokens'] - f['unique_stems'] for f in stem_fields)
            },
            'processing_efficiency': {
                'lemmatization_available': bool(lemma_fields),
                'stemming_available': bool(stem_fields),
                'entities_extracted': any('total_entities' in f for f in field_summaries)
            }
        }
    
    return summary

# Create enhanced preprocessing summary
preprocessing_summary = create_advanced_preprocessing_summary(processed_df)

# Display enhanced summary
print("=== ADVANCED PREPROCESSING SUMMARY ===")
print(f"Dataset: {preprocessing_summary['dataset_info']['total_records']} records")
print(f"Processed on: {preprocessing_summary['dataset_info']['processing_date']}")
print(f"Libraries used: {preprocessing_summary['dataset_info']['libraries_used']}")

for field, stats in preprocessing_summary['text_fields'].items():
    print(f"\n{field.upper()}:")
    print(f"  Non-empty texts: {stats['non_empty_texts']}")
    print(f"  Total tokens: {stats['total_tokens']:,}")
    print(f"  Unique tokens: {stats['unique_tokens']:,}")
    print(f"  Avg tokens/text: {stats['avg_tokens_per_text']:.1f}")
    
    if 'total_lemmas' in stats:
        print(f"  Total lemmas: {stats['total_lemmas']:,}")
        print(f"  Unique lemmas: {stats['unique_lemmas']:,}")
        reduction = stats['unique_tokens'] - stats['unique_lemmas']
        print(f"  Vocabulary reduction (lemmas): {reduction} words ({reduction/stats['unique_tokens']*100:.1f}%)")
    
    if 'total_stems' in stats:
        print(f"  Total stems: {stats['total_stems']:,}")
        print(f"  Unique stems: {stats['unique_stems']:,}")
        reduction = stats['unique_tokens'] - stats['unique_stems']
        print(f"  Vocabulary reduction (stems): {reduction} words ({reduction/stats['unique_tokens']*100:.1f}%)")
    
    if 'total_entities' in stats:
        print(f"  Named entities found: {stats['total_entities']}")
        print(f"  Unique entities: {stats['unique_entities']}")
        if stats['top_entities']:
            print(f"  Top entities: {[ent for ent, count in stats['top_entities']]}")
    
    if stats['top_10_words']:
        top_words = [word for word, count in stats['top_10_words'][:5]]
        print(f"  Top 5 words: {top_words}")

# Advanced features summary
if 'advanced_features' in preprocessing_summary:
    adv_features = preprocessing_summary['advanced_features']
    print(f"\nADVANCED PROCESSING FEATURES:")
    print(f"  Vocabulary reduction efficiency: {adv_features['vocabulary_reduction']}")
    print(f"  Processing capabilities: {adv_features['processing_efficiency']}")

# Save enhanced processed dataset
output_file = 'setur_complaints_advanced_processed.csv'
processed_df.to_csv(output_file, index=False, encoding='utf-8')
print(f"\nAdvanced processed dataset saved to: {output_file}")

# Save enhanced preprocessing summary
summary_file = 'advanced_preprocessing_summary.json'
with open(summary_file, 'w', encoding='utf-8') as f:
    json.dump(preprocessing_summary, f, ensure_ascii=False, indent=2)
print(f"Advanced preprocessing summary saved to: {summary_file}")

# Create analysis-ready dataset with best available features
analysis_columns = ['id', 'title', 'time', 'supported', 'rating']

# Add the best available processed columns
for col in ['title', 'full_complaint', 'company_response']:
    if col in processed_df.columns:
        analysis_columns.extend([
            f"{col}_cleaned",
            f"{col}_filtered"
        ])
        
        # Add advanced features if available
        if f"{col}_lemmas" in processed_df.columns:
            analysis_columns.append(f"{col}_lemmas")
        if f"{col}_stems" in processed_df.columns:
            analysis_columns.append(f"{col}_stems")
        if f"{col}_entities" in processed_df.columns:
            analysis_columns.append(f"{col}_entities")
        if f"{col}_pos_tags" in processed_df.columns:
            analysis_columns.append(f"{col}_pos_tags")

# Keep only existing columns
analysis_columns = [col for col in analysis_columns if col in processed_df.columns]
analysis_df = processed_df[analysis_columns].copy()

# Save enhanced analysis dataset
analysis_file = 'setur_complaints_for_advanced_analysis.csv'
analysis_df.to_csv(analysis_file, index=False, encoding='utf-8')
print(f"Enhanced analysis dataset saved to: {analysis_file}")

print("\n=== ADVANCED PREPROCESSING COMPLETE ===")
print("Files created:")
print(f"  1. {output_file} - Full advanced processed dataset")
print(f"  2. {analysis_file} - Enhanced dataset for analysis")
print(f"  3. {summary_file} - Advanced preprocessing summary")

print("\nAdvanced capabilities now available:")
if library_status['zeyrek_available']:
    print("  ✓ Morphological analysis and lemmatization (Zeyrek)")
if library_status['stemmer_available']:
    print("  ✓ Professional Turkish stemming")
if library_status['spacy_available']:
    print("  ✓ Named Entity Recognition and POS tagging (spaCy)")
if library_status['nltk_available']:
    print("  ✓ Advanced tokenization and linguistic features (NLTK)")

print("\nNext steps for advanced analysis:")
print("  - Use lemmatized text for improved topic modeling")
print("  - Apply sentiment analysis on stemmed/lemmatized tokens")
print("  - Extract insights from named entities and POS patterns")
print("  - Compare performance across different preprocessing methods")
print("  - Implement advanced Turkish-specific ML models")
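
The "vocabulary reduction" figures reported in the summary compare the number of unique surface tokens with the number of unique lemmas or stems. A minimal sketch of that calculation is shown below; `vocabulary_reduction` and its result keys are hypothetical names, and the token lists are made up for the example.

```python
from typing import Dict, List


def vocabulary_reduction(tokens: List[str],
                         normalized_tokens: List[str]) -> Dict[str, float]:
    """Compare vocabulary size before and after stemming or lemmatization."""
    unique_before = len(set(tokens))
    unique_after = len(set(normalized_tokens))
    reduction = unique_before - unique_after
    # Guard against an empty vocabulary to avoid division by zero
    percent = reduction / unique_before * 100 if unique_before else 0.0
    return {
        'unique_before': unique_before,
        'unique_after': unique_after,
        'reduction': reduction,
        'reduction_percent': percent,
    }


surface_forms = ['otel', 'oteller', 'otelde', 'hizmet']
lemma_forms = ['otel', 'otel', 'otel', 'hizmet']
print(vocabulary_reduction(surface_forms, lemma_forms))
```

For an agglutinative language like Turkish, a large reduction is expected, because many surface forms share a single root.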

## 9. Installation Guide and Troubleshooting

To use all the advanced Turkish NLP features, you may need to install additional libraries and language models.
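
Before installing anything, it can help to see which packages are already importable. The sketch below uses `importlib.util.find_spec` from the standard library, which looks a module up without importing it. `is_installed` is a hypothetical helper name, and the module names are the import names used at the top of this notebook.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ['nltk', 'zeyrek', 'spacy', 'TurkishStemmer']:
    status = 'available' if is_installed(name) else 'not installed'
    print(f"{name}: {status}")
```

This only confirms that the package is present. Language data such as NLTK corpora or a spaCy pipeline is downloaded separately and needs its own check.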

In [None]:
# Installation and Setup Guide for Advanced Turkish NLP

def print_installation_guide():
    """
    Print comprehensive installation instructions for all Turkish NLP libraries.
    """
    
    print("=== ADVANCED TURKISH NLP INSTALLATION GUIDE ===")
    print("\n1. BASIC REQUIREMENTS:")
    print("   pip install nltk zeyrek spacy TurkishStemmer")
    
    print("\n2. NLTK SETUP:")
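    print("   After installing NLTK, download the tokenizer and stopword data")
    print("   (newer NLTK releases load 'punkt_tab' instead of 'punkt'):")
    print("   python -c \"import nltk; nltk.download('punkt'); nltk.download('punkt_tab'); nltk.download('stopwords')\"")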
    
    print("\n3. SPACY TURKISH MODEL:")
    print("   Install the Turkish language model for spaCy:")
    print("   python -m spacy download tr_core_news_sm")
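    print("   Note: This requires an internet connection. If spaCy reports no compatible")
    print("   package for this name, install a community Turkish pipeline instead and")
    print("   pass its name to spacy.load() in this notebook")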
    
    print("\n4. ZEYREK:")
    print("   Zeyrek should work out of the box after installation")
    print("   It provides morphological analysis for Turkish")
    
    print("\n5. TURKISH STEMMER:")
    print("   Turkish stemming via the TurkishStemmer package")
    print("   Works immediately after pip install")
    
    print("\n6. VERIFICATION:")
    print("   Run the library status check in this notebook to verify all installations")
    
    print("\n7. TROUBLESHOOTING:")
    print("   - If spaCy model fails: Check internet connection and try again")
    print("   - If Zeyrek fails: Try pip install --upgrade zeyrek")
    print("   - If NLTK data fails: Pass download_dir=<writable folder> to nltk.download()")
    print("     and set the NLTK_DATA environment variable to that folder")
    print("   - For Windows users: Some libraries may require Visual C++ Build Tools")
    
    print("\n8. ALTERNATIVE MINIMAL SETUP:")
    print("   If you have issues with advanced libraries, the notebook will")
    print("   fall back to basic preprocessing which requires no additional setup.")

def check_and_install_requirements():
    """
    Check which libraries are available and provide installation hints.
    """
    
    print("=== LIBRARY STATUS CHECK ===")
    
    # Check each library
    checks = {
        'NLTK': nltk is not None,
        'Zeyrek': zeyrek is not None,
        'spaCy': spacy is not None,
        'Turkish Stemmer': TurkishStemmer is not None
    }
    
    for lib_name, available in checks.items():
        status = "✓ Available" if available else "✗ Not installed"
        print(f"  {lib_name}: {status}")
    
    # Additional spaCy model check
    if spacy:
        try:
            # Use the module-level spacy import; re-importing it here would make
            # `spacy` a local name and break the availability check above
            spacy.load("tr_core_news_sm")
            print("  spaCy Turkish Model: ✓ Available")
        except OSError:
            print("  spaCy Turkish Model: ✗ Not installed (run: python -m spacy download tr_core_news_sm)")
    
    # NLTK data check
    if nltk:
        try:
            from nltk.corpus import stopwords
            stopwords.words('turkish')
            print("  NLTK Turkish Data: ✓ Available")
        except LookupError:
            print("  NLTK Turkish Data: ✗ Not downloaded (run: nltk.download('stopwords'))")
    
    print("\nRECOMMENDATIONS:")
    missing_libs = [lib for lib, available in checks.items() if not available]
    
    if not missing_libs:
        print("  🎉 All libraries are available! You can use all advanced features.")
    else:
        print(f"  📦 Missing libraries: {', '.join(missing_libs)}")
        print("  🔧 Run the installation commands above to enable all features")
        print("  📝 The notebook will use basic preprocessing for missing libraries")

# Run the status check
check_and_install_requirements()

print("\n" + "="*60)
print_installation_guide()

# Create a setup script for easy installation
setup_script = '''#!/bin/bash
# Setup script for Advanced Turkish NLP
# Run this script to install all required libraries and data

echo "Installing Turkish NLP libraries..."
pip install nltk zeyrek spacy TurkishStemmer

echo "Downloading NLTK data..."
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab'); nltk.download('stopwords')"

echo "Installing spaCy Turkish model..."
python -m spacy download tr_core_news_sm

echo "Setup complete! Run the notebook to verify installation."
'''

# Save setup script
with open('setup_turkish_nlp.sh', 'w', encoding='utf-8') as f:
    f.write(setup_script)

print("\n📝 Setup script saved as 'setup_turkish_nlp.sh'")
print("   On Unix/Mac: chmod +x setup_turkish_nlp.sh && ./setup_turkish_nlp.sh")
print("   On Windows: Run each pip/python command individually in cmd")

## 10. Usage Examples and Best Practices

Here are examples of how to use the advanced preprocessing pipeline for different analysis tasks.

In [None]:
# Usage Examples and Best Practices for Turkish Text Analysis

def demonstrate_advanced_usage():
    """
    Demonstrate different use cases for the advanced Turkish preprocessing pipeline.
    """
    
    print("=== ADVANCED TURKISH NLP USAGE EXAMPLES ===")
    
    # Sample complaints for demonstration
    sample_complaints = [
        "Müşteri hizmetleri kalitesiz, personel ilgisiz davranıyor.",
        "Otelde konaklamaktan çok memnun kaldık, güzel bir deneyimdi.",
        "Rezervasyon iptal edildi, paramızı geri alamadık, çok sinirli oldum."
    ]
    
    for i, complaint in enumerate(sample_complaints, 1):
        print(f"\n--- Example {i}: {complaint[:50]}... ---")
        
        # Show different processing approaches
        methods = []
        if turkish_analyzer:
            methods.append(('Lemmatization', 'zeyrek'))
        if turkish_stemmer:
            methods.append(('Stemming', 'stemmer'))
        if nlp_turkish:
            methods.append(('spaCy NLP', 'spacy'))
        
        if not methods:
            methods = [('Basic', 'basic')]
        
        for method_name, method_code in methods:
            result = advanced_turkish_preprocess(complaint, method_code)
            
            print(f"\n{method_name} processing:")
            print(f"  Cleaned: {result['cleaned'][:60]}...")
            
            if method_code == 'zeyrek' and result['lemmas']:
                filtered_lemmas = remove_stopwords(result['lemmas'])
                print(f"  Key lemmas: {filtered_lemmas[:5]}")
                if result['morphology']:
                    word, analyses = result['morphology'][0]
                    print(f"  Morphology example: '{word}' -> {analyses[:1]}")
            
            elif method_code == 'stemmer' and result['stems']:
                filtered_stems = remove_stopwords(result['stems'])
                print(f"  Key stems: {filtered_stems[:5]}")
            
            elif method_code == 'spacy' and result['spacy_analysis']:
                spacy_data = result['spacy_analysis']
                if spacy_data['entities']:
                    print(f"  Entities: {spacy_data['entities']}")
                if spacy_data['pos_tags']:
                    pos_info = [(word, pos) for word, pos, tag in spacy_data['pos_tags'][:3]]
                    print(f"  POS tags: {pos_info}")
            
            else:
                filtered_basic = remove_stopwords(result['tokens'])
                print(f"  Key tokens: {filtered_basic[:5]}")

def sentiment_analysis_example():
    """
    Example of how to use processed text for sentiment analysis.
    """
    
    print("\n=== SENTIMENT ANALYSIS EXAMPLE ===")
    
    # Simple sentiment keywords for Turkish
    positive_words = {'güzel', 'memnun', 'harika', 'başarılı', 'kaliteli', 'temiz', 
                     'profesyonel', 'hızlı', 'etkili', 'tavsiye'}
    negative_words = {'kötü', 'berbat', 'kalitesiz', 'yavaş', 'kirli', 'ilgisiz', 
                     'sinirli', 'mağdur', 'problem', 'şikayet', 'iptal'}
    
    def simple_sentiment_score(tokens):
        if not tokens:
            return 0
        
        positive_count = sum(1 for token in tokens if token.lower() in positive_words)
        negative_count = sum(1 for token in tokens if token.lower() in negative_words)
        
        total_words = len(tokens)
        sentiment_score = (positive_count - negative_count) / total_words if total_words > 0 else 0
        
        return sentiment_score
    
    # Test sentiment analysis on sample complaints
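    # processed_df is a notebook-level variable, so look it up in globals();
    # locals() inside a function never contains it
    if 'processed_df' in globals() and not processed_df.empty: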
        print("\nSentiment analysis on first 5 complaints:")
        
        for idx in processed_df.index[:5]:
            complaint_title = processed_df.loc[idx, 'title']
            
            # Use the best available processed tokens
            if 'full_complaint_lemmas' in processed_df.columns:
                tokens = processed_df.loc[idx, 'full_complaint_lemmas']
                method = 'lemmas'
            elif 'full_complaint_stems' in processed_df.columns:
                tokens = processed_df.loc[idx, 'full_complaint_stems']
                method = 'stems'
            else:
                tokens = processed_df.loc[idx, 'full_complaint_filtered']
                method = 'tokens'
            
            if isinstance(tokens, list) and tokens:
                filtered_tokens = remove_stopwords(tokens)
                sentiment = simple_sentiment_score(filtered_tokens)
                
                sentiment_label = "Positive" if sentiment > 0.05 else "Negative" if sentiment < -0.05 else "Neutral"
                
                # str() guards against missing (NaN) titles, which cannot be sliced
                print(f"\n{idx+1}. {str(complaint_title)[:40]}...")
                print(f"   Method: {method}, Score: {sentiment:.3f}, Label: {sentiment_label}")
                print(f"   Key words: {filtered_tokens[:5]}")
    else:
        print("No processed data available yet.")

def topic_modeling_preparation_example():
    """
    Example of how to prepare processed text for topic modeling.
    """
    
    print("\n=== TOPIC MODELING PREPARATION EXAMPLE ===")
    
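    # processed_df is a notebook-level variable, so look it up in globals();
    # locals() inside a function never contains it
    if 'processed_df' in globals() and not processed_df.empty: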
        # Collect all processed documents
        documents = []
        
        for idx in processed_df.index[:20]:  # Use first 20 for example
            # Combine title and complaint for richer context
            title_tokens = processed_df.loc[idx, 'title_filtered'] or []
            complaint_tokens = processed_df.loc[idx, 'full_complaint_filtered'] or []
            
            if isinstance(title_tokens, list) and isinstance(complaint_tokens, list):
                combined_tokens = title_tokens + complaint_tokens
                # Filter out very short documents
                if len(combined_tokens) >= 3:
                    documents.append(' '.join(combined_tokens))
        
        print(f"Prepared {len(documents)} documents for topic modeling")
        print(f"Sample document: {documents[0][:100]}..." if documents else "No documents available")
        
        # Basic word frequency analysis
        if documents:
            all_words = []
            for doc in documents:
                all_words.extend(doc.split())
            
            word_freq = Counter(all_words)
            print(f"\nTotal vocabulary: {len(set(all_words))} unique words")
            print(f"Most common words: {word_freq.most_common(10)}")
            
            # Vocabulary richness
            richness = len(set(all_words)) / len(all_words)
            print(f"Vocabulary richness: {richness:.3f}")
    else:
        print("No processed data available yet.")

def best_practices_summary():
    """
    Provide best practice recommendations for Turkish text analysis.
    """
    
    print("\n=== BEST PRACTICES FOR TURKISH TEXT ANALYSIS ===")
    
    practices = {
        "1. Preprocessing Strategy": [
            "Use lemmatization (Zeyrek) for semantic analysis and topic modeling",
            "Use stemming for word frequency analysis and search applications",
            "Use basic tokenization for quick exploratory analysis",
            "Always preserve Turkish characters during cleaning"
        ],
        "2. Library Selection": [
            "Zeyrek: Best for morphological analysis and lemmatization",
            "Turkish Stemmer: Fast and reliable for stemming",
            "spaCy: Excellent for named entity recognition and POS tagging",
            "NLTK: Good for additional linguistic features and tokenization"
        ],
        "3. Performance Considerations": [
            "Zeyrek: Slower but most accurate for Turkish morphology",
            "Stemmer: Fastest for large-scale processing",
            "spaCy: Medium speed, rich features",
            "Basic preprocessing: Fastest, good for initial exploration"
        ],
        "4. Analysis Applications": [
            "Sentiment Analysis: Use lemmas or stems + Turkish sentiment lexicon",
            "Topic Modeling: Lemmatized text works best (reduces sparsity)",
            "Keyword Extraction: Stems are often sufficient",
            "Named Entity Recognition: Use spaCy for best results"
        ],
        "5. Quality Assurance": [
            "Always inspect processed output with sample texts",
            "Compare vocabulary reduction across methods",
            "Test on domain-specific text (tourism complaints)",
            "Validate results with native Turkish speakers when possible"
        ]
    }
    
    for category, tips in practices.items():
        print(f"\n{category}:")
        for tip in tips:
            print(f"  • {tip}")
    
    print("\n=== RECOMMENDED WORKFLOW ===")
    workflow = [
        "1. Start with basic preprocessing for quick exploration",
        "2. Use Zeyrek lemmatization for semantic analysis",
        "3. Apply spaCy for entity extraction and linguistic features",
        "4. Compare results across methods to choose the best for your task",
        "5. Optimize preprocessing pipeline based on downstream performance"
    ]
    
    for step in workflow:
        print(f"  {step}")

# Run all examples
demonstrate_advanced_usage()
sentiment_analysis_example()
topic_modeling_preparation_example()
best_practices_summary()

print("\n" + "="*60)
print("🎉 ADVANCED TURKISH TEXT PREPROCESSING COMPLETE!")
print("\nYour Turkish complaint analysis pipeline is now ready with:")
print("  ✓ Morphological analysis (when Zeyrek is installed)")
print("  ✓ Advanced tokenization and normalization")
print("  ✓ Multiple preprocessing strategies")
print("  ✓ Comprehensive evaluation tools")
print("  ✓ Ready-to-use datasets for ML/NLP tasks")
print("\nHappy analyzing! 🇹🇷📊")

## Summary

This notebook provides a Turkish text preprocessing pipeline designed for customer complaint analysis. It integrates several Turkish NLP libraries and offers several levels of processing, from basic tokenization to full morphological analysis:

### Key Features:
- **Multi-library support**: NLTK, Zeyrek, spaCy, TurkishStemmer
- **Flexible processing**: Choose between basic tokenization, stemming, lemmatization, or full spaCy analysis
- **Turkish language optimization**: Handles agglutinative morphology and Turkish-specific characters (ç, ğ, ı, ö, ş, ü)
- **Performance benchmarking**: Compare different methods for your specific use case
- **Ready-to-use outputs**: Preprocessed datasets ready for sentiment analysis, topic modeling, etc.
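
One casing detail behind the Turkish character handling is worth illustrating. Python's built-in `str.lower()` follows language-neutral Unicode rules, so it turns `I` into `i` (Turkish expects `ı`) and turns `İ` into `i` followed by a combining dot. The sketch below shows one common workaround. The function name `turkish_lower` is an assumption for illustration, and whether the notebook's own normalization step already does this is not shown in this section.

```python
import unicodedata


def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish rules for dotted and dotless I."""
    # NFC composes "I" + combining dot (U+0307) into the single character U+0130
    text = unicodedata.normalize("NFC", text)
    # Map the two special capitals by hand before the generic lower():
    # "I" -> dotless "ı" (U+0131), dotted "İ" (U+0130) -> plain "i"
    text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.lower()


print(turkish_lower("IŞIK"))           # -> ışık
print(turkish_lower("\u0130STANBUL"))  # -> istanbul
```

This mapping is only appropriate for text known to be Turkish. Applied to English words it would, for example, turn `ID` into `ıd`.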

### Files Generated:
1. `setur_complaints_advanced_processed.csv` - Full processed dataset with all features
2. `setur_complaints_for_advanced_analysis.csv` - Clean dataset optimized for ML/NLP
3. `advanced_preprocessing_summary.json` - Detailed processing statistics
4. `setup_turkish_nlp.sh` - Installation script for all dependencies
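
When these CSV files are read back, columns that held Python lists (tokens, lemmas, stems) come back as plain strings such as `"['otel', 'iptal']"`, because CSV has no list type. A minimal sketch of a converter follows; the name `parse_token_list` is a hypothetical helper, not part of the notebook.

```python
import ast


def parse_token_list(cell):
    """Convert a CSV cell such as "['otel', 'iptal']" back into a list of strings."""
    if not isinstance(cell, str) or not cell.strip():
        return []  # missing or empty cell
    try:
        # literal_eval only accepts Python literals, so it is safer than eval()
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # free text or otherwise not a Python literal
    return list(value) if isinstance(value, (list, tuple)) else []


print(parse_token_list("['otel', 'rezervasyon', 'iptal']"))
print(parse_token_list(""))
```

With pandas, a function like this can be passed through the `converters` argument of `pd.read_csv`, for example `converters={'full_complaint_filtered': parse_token_list}`, assuming that column was written to the file.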

### Next Steps:
- Run sentiment analysis on lemmatized text
- Apply topic modeling (LDA/NMF) using processed tokens
- Extract business insights from named entities
- Build Turkish-specific ML models for complaint classification
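
As a starting point for the topic modeling step, the token lists usually need to be turned into a pruned vocabulary and per-document term counts. The sketch below does this with the standard library only. The function names, the `min_df` / `max_df_ratio` thresholds, and the sample documents are illustrative assumptions; a library vectorizer would normally replace this in practice.

```python
from collections import Counter


def build_vocabulary(token_docs, min_df=2, max_df_ratio=0.9):
    """Keep terms found in at least min_df documents and at most max_df_ratio of them."""
    doc_freq = Counter()
    for tokens in token_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(token_docs)
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df_ratio
    )


def to_count_vectors(token_docs, vocabulary):
    """Turn each tokenized document into term counts aligned with vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vectors = []
    for tokens in token_docs:
        row = [0] * len(vocabulary)
        for token in tokens:
            if token in index:
                row[index[token]] += 1
        vectors.append(row)
    return vectors


# Hypothetical lemmatized complaints, for illustration only
docs = [
    ["otel", "oda", "kirli", "otel"],
    ["rezervasyon", "iptal", "para", "iade"],
    ["otel", "rezervasyon", "iptal"],
    ["oda", "kirli", "personel"],
]
vocab = build_vocabulary(docs, min_df=2, max_df_ratio=0.9)
print(vocab)
print(to_count_vectors(docs, vocab))
```

Dropping terms that appear in only one document reduces sparsity, which matters for an agglutinative language where unlemmatized surface forms rarely repeat.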

The pipeline automatically falls back to simpler methods if advanced libraries are not available, so it still runs when the optional NLP libraries are missing and uses the richer analyses when they are installed.
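
The fallback idea can be sketched as an ordered list of normalizers that is tried from richest to simplest. This is an illustration of the pattern under stated assumptions, not the notebook's actual implementation: `normalize_tokens` and `toy_stemmer` are hypothetical names, and the toy stemmer stands in for a real library call.

```python
def normalize_tokens(tokens, normalizers):
    """Apply the first normalizer that works; fall back to the raw tokens.

    normalizers is an ordered list of (label, callable-or-None) pairs,
    where None means the corresponding library could not be imported.
    """
    for label, func in normalizers:
        if func is None:
            continue  # library unavailable, try the next method
        try:
            return [func(token) for token in tokens], label
        except Exception:
            continue  # method failed at runtime, try the next one
    return list(tokens), "tokens"


def toy_stemmer(token):
    """Hypothetical stand-in for a real stemmer: strips a plural suffix."""
    for suffix in ("ler", "lar"):
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


tokens = ["oteller", "odalar", "personel"]
# Lemmatizer missing, stemmer present: stems are used
print(normalize_tokens(tokens, [("lemmas", None), ("stems", toy_stemmer)]))
# Nothing available: the raw tokens are returned unchanged
print(normalize_tokens(tokens, [("lemmas", None), ("stems", None)]))
```

Returning the label alongside the tokens makes it easy to record which method produced each column, which helps when comparing results across environments.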