# Natural Language Processing with NLTK for Tag Generation

In this notebook, we'll explore how to use the Natural Language Toolkit (NLTK) to generate tags from textual data in vehicle repair records. While regular expressions (regex) are useful for simple pattern matching, NLTK provides more powerful text processing capabilities specifically designed for natural language processing.

We'll cover:

1. Introduction to NLTK and why it's used for text processing
2. Loading and preparing the repair dataset
3. Text preprocessing with NLTK (tokenization, stopword removal, stemming, lemmatization)
4. Part-of-speech tagging to identify important terms, and named entity recognition for identifying specific components
5. Frequency analysis and collocation detection
6. Generating meaningful tags from repair descriptions
7. Explaining the choice of NLTK over regex in an interview

This approach will improve upon the basic regex method we used previously while remaining accessible and easy to explain in an interview setting.

## 1. Introduction to NLTK: What is it and Why Use it?

### What is NLTK?

NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources (like WordNet), along with a suite of text processing libraries for:

- Tokenization (splitting text into words or sentences)
- Stemming and lemmatization (reducing words to their base forms)
- Part-of-speech tagging (identifying nouns, verbs, adjectives, etc.)
- Named entity recognition (identifying names of people, organizations, locations, etc.)
- Parsing and semantic reasoning

### Why Use NLTK for Tag Generation?

1. **Specialized for Language Processing**: Unlike regex, which is a general pattern-matching tool, NLTK is specifically designed for natural language processing tasks.

2. **Linguistic Knowledge**: NLTK incorporates linguistic knowledge that helps it understand language structure and meaning, not just character patterns.

3. **Pre-built Resources**: It comes with pre-built resources like stopword lists, stemmers, and lexicons that would be tedious to create manually.

4. **More Accurate Results**: By understanding language context and structure, NLTK can generate more meaningful and accurate tags than simple regex.

5. **Industry Standard**: NLTK is widely used in industry and academia, making it a valuable skill to demonstrate in interviews.

Let's start by importing the necessary libraries and preparing our data.

In [None]:
# Import basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from collections import Counter

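# Import NLTK and the specific tools used throughout this notebook
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag, ne_chunk
from nltk.probability import FreqDist
from nltk.util import ngrams
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# WordCloud comes from the separate `wordcloud` package (pip install wordcloud)
from wordcloud import WordCloud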

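# Download the NLTK resources used below.
# nltk.download() normally returns False on failure rather than raising,
# so we check its return value instead of relying on try/except.
print("Downloading necessary NLTK resources...")
resources = ['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger',
             'maxent_ne_chunker', 'words']
# Recent NLTK releases look for these renamed resources instead
resources += ['punkt_tab', 'averaged_perceptron_tagger_eng', 'maxent_ne_chunker_tab']
for resource in resources:
    if not nltk.download(resource, quiet=True):
        print(f"Note: could not download '{resource}'.")
        print("You can ignore this if the resource is already available or not used by your NLTK version.")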

print("Libraries imported successfully!")
print("NLTK version:", nltk.__version__)

## 2. Loading and Preparing Dataset

Now, let's load the vehicle repair dataset we previously worked with. For our NLTK-based tag generation, we'll focus on the text columns that contain descriptions of problems and repairs.

In [None]:
# Load the previously saved cleaned data
try:
    # First try to load the cleaned data from the previous notebook
    df = pd.read_csv('cleaned_and_tagged_data_final1.csv')
    print("Loaded cleaned data from previous notebook")
except FileNotFoundError:
    # If the cleaned data is not available, load and clean the original data
    print("Cleaned data not found, loading original data")
    df = pd.read_excel('DA -Task 2..xlsx')
    
    # Basic cleaning (simplified version of the previous notebook)
    # Fill missing values in object columns with 'Unknown' and numeric columns with mean
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].fillna('Unknown')
        elif df[col].dtype in ['int64', 'float64']:
            df[col] = df[col].fillna(df[col].mean())

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Let's focus on the text columns we'll use for tag generation
text_columns = ['CUSTOMER_VERBATIM', 'CORRECTION_VERBATIM']

# Check if these columns exist in the dataframe
text_columns = [col for col in text_columns if col in df.columns]

if not text_columns:
    raise ValueError("Required text columns not found in the dataset")

# Preview the text data we'll be working with
print("Sample text data from each column:")
for col in text_columns:
    print(f"\n{col} (first 3 examples):")
    for i, text in enumerate(df[col].head(3)):
        print(f"{i+1}. {text}")
        
# Combine the text fields as we did in the basic approach
df['combined_text'] = df[text_columns].apply(
    lambda x: ' '.join([str(text) for text in x if not pd.isna(text)]), axis=1
)

print("\nCombined text examples (first 3):")
for i, text in enumerate(df['combined_text'].head(3)):
    print(f"{i+1}. {text[:200]}..." if len(text) > 200 else f"{i+1}. {text}")

## 3. Text Preprocessing with NLTK

Text preprocessing is a critical step in any NLP pipeline. NLTK provides several tools to clean and normalize text data before further analysis. Let's understand each preprocessing step:

### 1. Tokenization

**What is it?** Breaking text into smaller units (tokens) such as words or sentences.

**Why use it?** 
- Text needs to be split into words before it can be analyzed
- NLTK's tokenizers understand language rules better than simple splitting by spaces
- Handles punctuation and special cases properly

**NLTK Functions:**
- `word_tokenize()`: Splits text into words
- `sent_tokenize()`: Splits text into sentences
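
To make the difference from plain splitting concrete, here is a minimal sketch. It uses `TreebankWordTokenizer` as a stand-in for `word_tokenize()` only because it needs no downloaded models; the sample sentence is invented for illustration.

```python
from nltk.tokenize import TreebankWordTokenizer

# Hypothetical repair note, invented for this illustration
sample = "Brakes squeal, pedal feels soft."

# Naive whitespace split leaves punctuation glued to the words
naive_tokens = sample.split()

# Treebank-style tokenization separates punctuation into its own tokens
treebank_tokens = TreebankWordTokenizer().tokenize(sample)

print(naive_tokens)
print(treebank_tokens)
```

With whitespace splitting, "squeal," and "squeal" would count as different words, which is exactly the kind of noise a tokenizer removes before counting terms.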

### 2. Stopword Removal

**What is it?** Filtering out common words that add little meaning (e.g., "the", "is", "and").

**Why use it?**
- Reduces noise in the data
- Focuses analysis on meaningful content words
- Improves efficiency by reducing the number of tokens to process

**NLTK Resources:**
- `nltk.corpus.stopwords`: Pre-defined lists of stopwords in multiple languages
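
As a rough sketch of stopword filtering, the snippet below loads NLTK's English list and extends it with a few domain words. The fallback set and the extra words (`customer`, `states`, `vehicle`) are assumptions made for this example, not part of NLTK or of the dataset.

```python
from nltk.corpus import stopwords

# Small fallback list, used only if the NLTK stopword corpus is not downloaded
FALLBACK_STOPWORDS = {"the", "is", "and", "a", "an", "of", "to", "on"}

try:
    stop_words = set(stopwords.words("english"))
except LookupError:
    stop_words = set(FALLBACK_STOPWORDS)

# Hypothetical domain-specific additions: frequent in repair notes, rarely useful as tags
stop_words |= {"customer", "states", "vehicle"}

tokens = "customer states the brake pedal is soft and the rotor is worn".split()
content_tokens = [t for t in tokens if t not in stop_words]
print(content_tokens)
```

Building the set once and reusing it is also noticeably cheaper than calling `stopwords.words('english')` for every document.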

### 3. Stemming vs. Lemmatization

**Stemming:**
- Removes word endings to get the stem/root
- Fast but often produces non-real words
- Example: "running" → "run", "easily" → "easili"

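**Lemmatization:**
- Reduces words to their dictionary base form
- Slower but produces real words
- Example: "running" → "run" (tagged as a verb), "better" → "good" (tagged as an adjective); `WordNetLemmatizer` assumes a noun unless it is told the part of speech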

**Why use these?**
- Normalizes variations of the same word
- Reduces vocabulary size
- Groups related terms together

**NLTK Tools:**
- `PorterStemmer()`: Popular stemming algorithm
- `WordNetLemmatizer()`: Lemmatization using WordNet dictionary
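
The contrast between the two can be sketched as follows. The word list is illustrative; the lemmatizer calls are wrapped in a `try` block because they need the WordNet data to be downloaded, whereas the stemmer is purely rule-based.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["running", "easily", "rotors"]

# Stemming is rule-based and needs no downloaded data
stemmer = PorterStemmer()
stems = {w: stemmer.stem(w) for w in words}
print(stems)

# Lemmatization looks words up in WordNet, so it needs nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
try:
    print(lemmatizer.lemmatize("running"))           # treated as a noun by default
    print(lemmatizer.lemmatize("running", pos="v"))  # treated as a verb
    print(lemmatizer.lemmatize("better", pos="a"))   # treated as an adjective
except LookupError:
    print("WordNet data not found; run nltk.download('wordnet') first")
```

The `pos` argument matters: without it the lemmatizer looks up every word as a noun, so verb and adjective forms are usually returned unchanged.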

### 4. Part-of-Speech (POS) Tagging

**What is it?** Identifying grammatical parts of speech (noun, verb, adjective, etc.)

**Why use it?**
- Helps understand word function in context
- Allows filtering for specific types of words (e.g., nouns for technical terms)
- Improves accuracy of lemmatization

**NLTK Function:**
- `pos_tag()`: Tags words with their part of speech
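
One way POS tags can improve lemmatization is by converting each Penn Treebank tag into the part-of-speech letter that `WordNetLemmatizer.lemmatize()` accepts. The helper below is a hypothetical sketch (the name `penn_to_wordnet` is ours), and the tagged tokens are written by hand rather than produced by `pos_tag()`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the POS letter WordNetLemmatizer expects."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, matching the lemmatizer's own default

# Hand-tagged tokens for illustration; real tags would come from pos_tag()
tagged = [("steering", "NN"), ("vibrating", "VBG"), ("worn", "JJ")]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)
```

Each pair could then be passed on as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.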

Let's implement these preprocessing steps on our vehicle repair data.

In [None]:
# Define a comprehensive text preprocessing function using NLTK
def preprocess_text(text, tokenize=True, remove_stopwords=True, lemmatize=True, 
                   stem=False, pos_filter=None):
    """
    Preprocess text using NLTK's tools
    
    Args:
        text (str): Input text to preprocess
        tokenize (bool): Whether to tokenize the text
        remove_stopwords (bool): Whether to remove stopwords
        lemmatize (bool): Whether to perform lemmatization
        stem (bool): Whether to perform stemming (not used if lemmatize is True)
        pos_filter (list): List of POS tags to keep (e.g., ['NN', 'NNS'] for nouns)
        
    Returns:
        list or str: Preprocessed tokens or joined text
    """
    # Ensure text is a string
    if not isinstance(text, str):
        text = str(text)
    
    # Convert text to lowercase
    text = text.lower()
    
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    
    # Tokenize the text
    if tokenize:
        tokens = word_tokenize(text)
    else:
        tokens = text.split()
    
    # Remove stopwords if specified
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [token for token in tokens if token not in stop_words and len(token) > 1]
    
    # Apply POS tagging and filter by specified tags if needed
    if pos_filter:
        tagged_tokens = pos_tag(tokens)
        tokens = [word for word, tag in tagged_tokens if tag in pos_filter]
    
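    # Lemmatize tokens if specified
    # (without a pos argument, WordNetLemmatizer treats every token as a noun,
    # so verb forms such as "running" are left unchanged)
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(token) for token in tokens]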
    # Stem tokens if lemmatization is not used
    elif stem:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(token) for token in tokens]
    
    return tokens

# 'combined_text' was already built in the previous cell; preprocess it with NLTK
print("Preprocessing text data with NLTK...")

# Process with different NLTK techniques to demonstrate
df['tokenized'] = df['combined_text'].apply(
    lambda x: preprocess_text(x, tokenize=True, remove_stopwords=True, lemmatize=False, stem=False)
)

df['lemmatized'] = df['combined_text'].apply(
    lambda x: preprocess_text(x, tokenize=True, remove_stopwords=True, lemmatize=True, stem=False)
)

df['stemmed'] = df['combined_text'].apply(
    lambda x: preprocess_text(x, tokenize=True, remove_stopwords=True, lemmatize=False, stem=True)
)

# Extract nouns (often the most important for technical text)
df['nouns_only'] = df['combined_text'].apply(
    lambda x: preprocess_text(x, tokenize=True, remove_stopwords=True, 
                            lemmatize=True, pos_filter=['NN', 'NNS'])
)

# Display examples to show the difference between techniques
print("\nNLTK Preprocessing Examples:")
sample_idx = 0  # Use the first document as an example

print(f"\nOriginal Text:")
print(df['combined_text'].iloc[sample_idx][:200] + "..." if len(df['combined_text'].iloc[sample_idx]) > 200 
      else df['combined_text'].iloc[sample_idx])

print(f"\nAfter Tokenization and Stopword Removal:")
print(df['tokenized'].iloc[sample_idx][:15])

print(f"\nAfter Lemmatization:")
print(df['lemmatized'].iloc[sample_idx][:15])

print(f"\nAfter Stemming:")
print(df['stemmed'].iloc[sample_idx][:15])

print(f"\nNouns Only:")
print(df['nouns_only'].iloc[sample_idx][:15])

# Create a function to compare the techniques
def compare_preprocessing(text):
    """Compare different preprocessing techniques on a sample text"""
    original = text
    tokenized = preprocess_text(text, lemmatize=False, stem=False)
    lemmatized = preprocess_text(text, lemmatize=True, stem=False)
    stemmed = preprocess_text(text, lemmatize=False, stem=True)
    nouns = preprocess_text(text, lemmatize=True, pos_filter=['NN', 'NNS'])
    
    # Create a comparison table
    comparison = pd.DataFrame({
        'Original': [original],
        'Tokenized': [' '.join(tokenized)],
        'Lemmatized': [' '.join(lemmatized)],
        'Stemmed': [' '.join(stemmed)],
        'Nouns Only': [' '.join(nouns)]
    })
    
    return comparison.T.rename(columns={0: 'Result'})

# Display a detailed comparison for a short example
sample_text = "The customer complained about the steering wheel vibrating when braking. Replaced the front rotors and performed wheel alignment."
comparison_df = compare_preprocessing(sample_text)
print("\nDetailed Preprocessing Comparison:")
print(comparison_df)

## 4. Part-of-Speech Tagging and Named Entity Recognition

For generating meaningful tags from technical text like vehicle repair descriptions, identifying the right types of words is crucial. Two NLTK techniques are particularly useful for this:

### Part-of-Speech (POS) Tagging

**What is it?** 
POS tagging is the process of marking up each word in text with its corresponding part of speech (noun, verb, adjective, etc.).

**Why is it useful for tag generation?**
- **Nouns** are usually the most important for technical texts, as they identify parts, components, and systems
- **Adjectives** often describe conditions or qualities of parts
- **Verbs** indicate actions or issues
- Filtering by specific POS tags helps focus on the most relevant terms

**NLTK Implementation:**
- `nltk.pos_tag()` assigns tags like NN (noun), VB (verb), JJ (adjective)
- Common useful tags:
  - NN, NNS: Singular and plural nouns
  - JJ: Adjectives
  - VB, VBG, VBN: Verbs in different forms

### Named Entity Recognition (NER)

**What is it?**
NER identifies named entities in text and classifies them into predefined categories like person names, organizations, locations, etc.

**Why is it useful for tag generation?**
- Can identify specific parts or components that might be proper nouns
- Helps extract structured information from unstructured text
- Captures entities that might be missed by simple word frequency analysis

**NLTK Implementation:**
- `nltk.ne_chunk()` identifies named entities
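
`ne_chunk()` returns an `nltk.Tree` in which each recognized entity is a labeled subtree and every other token stays a plain `(word, tag)` tuple. The sketch below builds such a tree by hand to show how entities are read out of it. It is not real model output, and the entity labels are chosen for illustration.

```python
from nltk.tree import Tree

# Hand-built tree that mimics the shape ne_chunk() returns; not real model output
chunked = Tree("S", [
    ("Replaced", "VBD"),
    Tree("ORGANIZATION", [("Bosch", "NNP")]),
    ("alternator", "NN"),
    ("in", "IN"),
    Tree("GPE", [("Detroit", "NNP")]),
])

# Entities are the nested subtrees; plain (word, tag) tuples are ordinary tokens
entities = [
    (" ".join(word for word, _ in node.leaves()), node.label())
    for node in chunked
    if isinstance(node, Tree)
]
print(entities)
```

Because the built-in chunker targets general entity types such as people, organizations, and places, it may well miss vehicle components, so its output is best treated as a supplement to noun-based tags.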

Let's implement these techniques on our vehicle repair data.

In [None]:
# Apply POS tagging to the text data
def pos_tag_text(text):
    """
    Perform part-of-speech tagging on text
    
    Args:
        text (str): Input text
        
    Returns:
        list: List of (word, tag) tuples
    """
    # Ensure text is a string
    if not isinstance(text, str):
        text = str(text)
    
    # Tokenize and tag
    tokens = word_tokenize(text.lower())
    tagged = pos_tag(tokens)
    
    return tagged

# Extract words by POS tag
def extract_by_pos(tagged_text, target_tags):
    """
    Extract words that match specific POS tags
    
    Args:
        tagged_text (list): List of (word, tag) tuples
        target_tags (list): List of target POS tags
        
    Returns:
        list: Words that match the target tags
    """
    return [word for word, tag in tagged_text if tag in target_tags]

# Apply POS tagging to a random sample of documents (seeded for reproducibility)
sample_size = min(100, len(df))
rng = np.random.default_rng(42)
sample_indices = rng.choice(len(df), sample_size, replace=False)

pos_results = []
for idx in sample_indices:
    text = df['combined_text'].iloc[idx]
    tagged = pos_tag_text(text)
    
    # Extract words by type
    nouns = extract_by_pos(tagged, ['NN', 'NNS', 'NNP', 'NNPS'])
    verbs = extract_by_pos(tagged, ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
    adjectives = extract_by_pos(tagged, ['JJ', 'JJR', 'JJS'])
    
    pos_results.append({
        'index': idx,
        'nouns': nouns,
        'verbs': verbs,
        'adjectives': adjectives
    })

# Convert to DataFrame for easier analysis
pos_df = pd.DataFrame(pos_results)

# Display examples
print("POS Tagging Examples:")
for i in range(min(3, len(pos_df))):
    print(f"\nDocument {pos_df['index'].iloc[i]}:")
    print(f"Nouns: {', '.join(pos_df['nouns'].iloc[i][:10])}")
    print(f"Verbs: {', '.join(pos_df['verbs'].iloc[i][:10])}")
    print(f"Adjectives: {', '.join(pos_df['adjectives'].iloc[i][:10])}")

# Visualize distribution of parts of speech
pos_counts = {
    'Nouns': np.mean([len(x) for x in pos_df['nouns']]),
    'Verbs': np.mean([len(x) for x in pos_df['verbs']]),
    'Adjectives': np.mean([len(x) for x in pos_df['adjectives']])
}

plt.figure(figsize=(10, 6))
plt.bar(pos_counts.keys(), pos_counts.values(), color=['blue', 'green', 'red'])
plt.title('Average Number of Words by Part of Speech')
plt.ylabel('Average Count per Document')
plt.grid(axis='y', alpha=0.3)
plt.show()

# Named Entity Recognition
print("\nNamed Entity Recognition Examples:")

def perform_ner(text):
    """
    Perform Named Entity Recognition on text
    
    Args:
        text (str): Input text
        
    Returns:
        Tree: NLTK Tree containing named entities
    """
    # Ensure text is a string
    if not isinstance(text, str):
        text = str(text)
    
    # Tokenize, tag, and chunk
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    entities = ne_chunk(tagged)
    
    return entities

# Apply NER to a few example documents
for i in range(3):
    if i < len(df):
        text = df['combined_text'].iloc[i]
        entities = perform_ner(text)
        
        # Extract named entities
        named_entities = []
        for subtree in entities:
            if isinstance(subtree, nltk.Tree):
                entity_type = subtree.label()
                entity_text = ' '.join([word for word, tag in subtree.leaves()])
                named_entities.append((entity_text, entity_type))
        
        print(f"\nDocument {i}:")
        print(f"Text: {text[:150]}...")
        print(f"Named Entities: {named_entities}")

# Create a function to generate tags based on POS and NER
def generate_pos_tags(text, num_nouns=5, num_verbs=3, num_adj=2):
    """
    Generate tags based on POS tagging
    
    Args:
        text (str): Input text
        num_nouns (int): Number of noun tags to include
        num_verbs (int): Number of verb tags to include
        num_adj (int): Number of adjective tags to include
        
    Returns:
        list: Generated tags
    """
    # Ensure text is a string
    if not isinstance(text, str):
        text = str(text)
    
    # Process text
    text = text.lower()
    # Remove special characters
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    
    # Tokenize and remove stopwords
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    
    # Tag parts of speech
    tagged = pos_tag(tokens)
    
    # Extract by part of speech
    nouns = extract_by_pos(tagged, ['NN', 'NNS', 'NNP', 'NNPS'])
    verbs = extract_by_pos(tagged, ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
    adjectives = extract_by_pos(tagged, ['JJ', 'JJR', 'JJS'])
    
    # Count frequencies
    noun_counts = Counter(nouns)
    verb_counts = Counter(verbs)
    adj_counts = Counter(adjectives)
    
    # Get the most common words of each type
    top_nouns = [word for word, _ in noun_counts.most_common(num_nouns)]
    top_verbs = [word for word, _ in verb_counts.most_common(num_verbs)]
    top_adjs = [word for word, _ in adj_counts.most_common(num_adj)]
    
    # Combine into tags
    tags = top_nouns + top_verbs + top_adjs
    
    return tags

# Apply tag generation to all documents
df['pos_tags'] = df['combined_text'].apply(generate_pos_tags)

# Display examples
print("\nPOS-based Tag Generation Examples:")
for i in range(5):
    if i < len(df):
        print(f"\nDocument {i}:")
        print(f"Text: {df['combined_text'].iloc[i][:100]}...")
        print(f"POS Tags: {', '.join(df['pos_tags'].iloc[i])}")

# Compare with the original regex-based tags if they exist
if 'TAGS' in df.columns:
    print("\nComparison with original regex-based tags:")
    for i in range(5):
        if i < len(df):
            print(f"\nDocument {i}:")
            print(f"Original Tags: {df['TAGS'].iloc[i]}")
            print(f"POS Tags: {', '.join(df['pos_tags'].iloc[i])}")

## 5. Frequency Analysis and Collocation Detection

Beyond individual words, understanding how words appear together can provide deeper insights. NLTK offers powerful tools for frequency analysis and collocation detection.

### Frequency Analysis

**What is it?**
Counting how often words appear in a text or corpus.

**Why is it useful for tag generation?**
- Identifies the most common terms in repair descriptions
- Highlights potentially important components or issues
- Provides a baseline for understanding the repair data

**NLTK Implementation:**
- `nltk.FreqDist()` creates a frequency distribution of tokens
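
A minimal sketch of `FreqDist` on a toy token list (the tokens are invented for illustration):

```python
from nltk.probability import FreqDist

# Toy token list standing in for preprocessed repair text
tokens = ["brake", "rotor", "brake", "pad", "brake", "rotor"]
fdist = FreqDist(tokens)

top_two = fdist.most_common(2)      # most frequent (token, count) pairs
total = fdist.N()                   # total number of tokens counted
brake_share = fdist.freq("brake")   # relative frequency of one token

print(top_two, total, brake_share)
```

`FreqDist` behaves like a `collections.Counter` with a few extras such as `freq()` and `plot()`.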

### N-grams

**What is it?**
Contiguous sequences of n items (words) from a text.

**Why is it useful for tag generation?**
- Captures multi-word concepts (e.g., "power steering", "front wheel")
- Preserves contextual information lost in single-word analysis
- Often more specific and meaningful than single words

**NLTK Implementation:**
- `nltk.ngrams()` generates n-grams from text
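
The sketch below counts bigrams one document at a time, so that the last word of one record is never paired with the first word of the next. The two token lists are hypothetical examples.

```python
from collections import Counter
from nltk.util import ngrams

# Two hypothetical, already-tokenized repair notes
documents = [
    ["power", "steering", "leak"],
    ["replaced", "power", "steering", "pump"],
]

# Count bigrams per document so no pair spans a document boundary
bigram_counts = Counter()
for tokens in documents:
    bigram_counts.update(ngrams(tokens, 2))

print(bigram_counts.most_common(1))
```

Had the two lists been concatenated first, the meaningless pair `('leak', 'replaced')` would have been counted as well.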

### Collocations

**What is it?**
Words that frequently appear together, beyond what would be expected by chance.

**Why is it useful for tag generation?**
- Identifies meaningful word combinations specific to the domain
- Captures technical terms that consist of multiple words
- More accurate than simple n-gram frequency counting

**NLTK Implementation:**
- `nltk.collocations.BigramCollocationFinder`
- Uses statistical measures like PMI (Pointwise Mutual Information) to find significant word pairs
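
As a small sketch of collocation finding, the example below uses `BigramCollocationFinder.from_documents()`, which keeps pairs from spanning document boundaries, followed by a frequency filter and PMI ranking. The token lists are made up for illustration.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Hypothetical, already-tokenized repair notes
documents = [
    ["steering", "wheel", "vibration"],
    ["replaced", "steering", "wheel"],
    ["steering", "wheel", "noise"],
    ["wheel", "alignment"],
]

finder = BigramCollocationFinder.from_documents(documents)

# PMI tends to reward very rare pairs, so drop bigrams seen fewer than 2 times
finder.apply_freq_filter(2)

top_pairs = finder.nbest(BigramAssocMeasures.pmi, 5)
print(top_pairs)
```

The frequency filter matters in practice: without it, a pair that occurs once between two rare words can outrank a genuinely common term such as "steering wheel".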

Let's apply these techniques to our repair data to generate more meaningful tags.

In [None]:
# Create a corpus for frequency analysis
all_words = []
for tokens in df['lemmatized']:
    all_words.extend(tokens)

# Create a frequency distribution
fdist = FreqDist(all_words)

# Plot the 20 most common words
plt.figure(figsize=(12, 6))
fdist.plot(20, title='20 Most Common Words in Repair Descriptions')
plt.show()

# Generate word cloud
wordcloud = WordCloud(
    width=800, 
    height=400,
    background_color='white',
    max_words=100
).generate(' '.join(all_words))

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Repair Description Terms')
plt.show()

# Generate n-grams (bigrams and trigrams)
def generate_ngrams(tokens, n=2):
    """
    Generate n-grams from a list of tokens
    
    Args:
        tokens (list): List of tokens
        n (int): Size of n-grams
        
    Returns:
        list: List of n-grams
    """
    return list(ngrams(tokens, n))

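# Combine all preprocessed tokens (kept for reference)
all_tokens = []
for tokens in df['lemmatized']:
    all_tokens.extend(tokens)

# Generate bigrams and trigrams one document at a time,
# so that no n-gram spans the boundary between two repair records
bigrams_list = []
trigrams_list = []
for tokens in df['lemmatized']:
    bigrams_list.extend(generate_ngrams(tokens, 2))
    trigrams_list.extend(generate_ngrams(tokens, 3))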

# Count frequencies
bigram_freq = Counter(bigrams_list)
trigram_freq = Counter(trigrams_list)

# Display the most common n-grams
print("Most Common Bigrams (word pairs):")
for gram, count in bigram_freq.most_common(15):
    print(f"{' '.join(gram)}: {count}")

print("\nMost Common Trigrams (three-word sequences):")
for gram, count in trigram_freq.most_common(15):
    print(f"{' '.join(gram)}: {count}")

# Find collocations (statistically significant bigrams)
# First, prepare the text for collocation finding
stop_words = set(stopwords.words('english'))  # build the set once, not per document
all_text_for_collocations = []
for text in df['combined_text']:
    tokens = word_tokenize(str(text).lower())
    # Keep alphabetic tokens only; drop stopwords and single characters
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words and len(t) > 1]
    all_text_for_collocations.extend(tokens)

# Create a collocation finder
bigram_finder = BigramCollocationFinder.from_words(all_text_for_collocations)

# Apply frequency filter
bigram_finder.apply_freq_filter(3)  # Minimum frequency of 3

# Find top collocations using PMI (Pointwise Mutual Information)
top_collocations = bigram_finder.nbest(BigramAssocMeasures.pmi, 20)

print("\nTop Collocations (Statistically Significant Word Pairs):")
for collocation in top_collocations:
    print(' '.join(collocation))

# Generate tags based on frequency, n-grams, and collocations
def generate_ngram_tags(text, unigram_count=3, bigram_count=3, top_collocations=top_collocations):
    """
    Generate tags based on word frequency and collocations
    
    Args:
        text (str): Input text
        unigram_count (int): Number of single-word tags to include
        bigram_count (int): Number of two-word tags to include
        top_collocations (list): List of top collocations to match against
        
    Returns:
        list: Generated tags
    """
    # Preprocess text
    tokens = preprocess_text(text, tokenize=True, remove_stopwords=True, lemmatize=True)
    
    # Generate tags from single words (unigrams)
    word_counts = Counter(tokens)
    unigram_tags = [word for word, _ in word_counts.most_common(unigram_count)]
    
    # Generate bigrams and check for matches with top collocations
    text_bigrams = list(ngrams(tokens, 2))
    bigram_tags = []
    
    # First check for collocations in the text (skip repeats)
    for bigram in text_bigrams:
        tag = ' '.join(bigram)
        if bigram in top_collocations and tag not in bigram_tags:
            bigram_tags.append(tag)
    
    # If we don't have enough collocation matches, add frequent bigrams
    if len(bigram_tags) < bigram_count:
        bigram_counts = Counter(text_bigrams)
        for bigram, _ in bigram_counts.most_common():
            tag = ' '.join(bigram)
            if tag not in bigram_tags:
                bigram_tags.append(tag)
            if len(bigram_tags) >= bigram_count:
                break
    
    # Combine tags (limit to requested counts)
    tags = unigram_tags[:unigram_count] + bigram_tags[:bigram_count]
    
    return tags

# Apply n-gram tag generation to all documents
df['ngram_tags'] = df['combined_text'].apply(generate_ngram_tags)

# Display examples
print("\nN-gram Based Tag Generation Examples:")
for i in range(5):
    if i < len(df):
        print(f"\nDocument {i}:")
        print(f"Text: {df['combined_text'].iloc[i][:100]}...")
        print(f"N-gram Tags: {', '.join(df['ngram_tags'].iloc[i])}")

# Compare with POS tags
print("\nComparison with POS-based tags:")
for i in range(5):
    if i < len(df):
        print(f"\nDocument {i}:")
        print(f"POS Tags: {', '.join(df['pos_tags'].iloc[i])}")
        print(f"N-gram Tags: {', '.join(df['ngram_tags'].iloc[i])}")

## 6. Comprehensive Tag Generation Using NLTK

Now that we've explored individual NLTK techniques, let's combine them into a comprehensive tag generation approach. This will leverage:

1. **Preprocessing** to clean and normalize the text
2. **POS Tagging** to identify nouns, verbs, and adjectives
3. **N-grams** to capture multi-word concepts
4. **Collocations** to find statistically significant word pairs
5. **Frequency Analysis** to prioritize common terms
6. **Named Entity Recognition** to identify specific components

The goal is to generate more meaningful and contextually relevant tags than the basic regex approach. Our comprehensive approach will:

- Prioritize technical nouns (parts, components)
- Include important descriptive adjectives (conditions, symptoms)
- Add action verbs (problem indicators)
- Incorporate significant word combinations
- Filter out irrelevant or common terms

Let's implement a simplified version of this approach (preprocessing, POS filtering, and bigram frequency within each record) and evaluate its effectiveness.

In [None]:
# Define a simplified yet effective tag generation function
def generate_simple_nltk_tags(text, max_tags=6):
    """
    Generate tags using NLTK with a simplified approach
    
    Args:
        text (str): Input text
        max_tags (int): Maximum number of tags to generate
        
    Returns:
        list: Generated tags
    """
    # Ensure text is a string
    if not isinstance(text, str):
        text = str(text)
    
    # Convert to lowercase and clean text
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    
    # Remove stopwords
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    
    # Lemmatize words
    lemmatizer = nltk.stem.WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(t) for t in tokens]
    
    # Get part-of-speech tags
    tagged = nltk.pos_tag(lemmatized)
    
    # Extract nouns (most important for technical text)
    nouns = [word for word, tag in tagged if tag.startswith('N')]
    
    # Extract verbs and adjectives
    verbs = [word for word, tag in tagged if tag.startswith('V')]
    adjectives = [word for word, tag in tagged if tag.startswith('J')]
    
    # Count word frequencies
    noun_counts = Counter(nouns)
    verb_counts = Counter(verbs)
    adj_counts = Counter(adjectives)
    
    # Create list of bigrams for the most important multi-word terms
    bigrams = list(nltk.bigrams(lemmatized))
    bigram_counts = Counter(bigrams)
    
    # Generate tags in order of priority
    tags = []
    
    # Add top nouns (most important for parts identification)
    tags.extend([word for word, _ in noun_counts.most_common(3)])
    
    # Add top bigrams that appear more than once
    for (w1, w2), count in bigram_counts.most_common(2):
        if count > 1:
            tags.append(f"{w1} {w2}")
    
    # Add top adjectives (for condition description)
    tags.extend([word for word, _ in adj_counts.most_common(1)])
    
    # Add top verbs (for action/issue identification)
    tags.extend([word for word, _ in verb_counts.most_common(1)])
    
    # Remove duplicates while preserving order
    unique_tags = []
    for tag in tags:
        if tag not in unique_tags:
            unique_tags.append(tag)
    
    # Limit to maximum number of tags
    return unique_tags[:max_tags]

# Apply the simplified tag generation to the data
df['simple_nltk_tags'] = df['combined_text'].apply(generate_simple_nltk_tags)

# Display examples
print("Simple NLTK Tag Generation Examples:")
for i in range(5):
    if i < len(df):
        print(f"\nDocument {i}:")
        print(f"Text: {df['combined_text'].iloc[i][:100]}...")
        print(f"Simple NLTK Tags: {', '.join(df['simple_nltk_tags'].iloc[i])}")

# Compare with original regex tags if they exist
if 'TAGS' in df.columns:
    print("\nComparison with original regex-based tags:")
    for i in range(5):
        if i < len(df):
            print(f"\nDocument {i}:")
            print(f"Original Tags: {df['TAGS'].iloc[i]}")
            print(f"Simple NLTK Tags: {', '.join(df['simple_nltk_tags'].iloc[i])}")

# Create a simple visualization to compare tag coverage
# Create a list of common vehicle repair terms
repair_terms = {
    'wheel', 'steering', 'brake', 'engine', 'transmission', 'clutch',
    'battery', 'suspension', 'exhaust', 'filter', 'sensor', 'valve', 
    'electrical', 'vibration', 'noise', 'leak', 'alignment',
    'power', 'door', 'window', 'motor'
}

# Check how many of these terms appear in tags
def count_repair_terms(tags, term_list):
    """Count how many repair terms are in the tags"""
    if isinstance(tags, str):
        # For original tags (comma-separated string)
        tags = [t.strip() for t in tags.split(',')]
    
    count = 0
    for tag in tags:
        for term in term_list:
            if term in tag.lower():
                count += 1
                break
    return count

# Evaluate each method
original_count = 0
nltk_count = 0

if 'TAGS' in df.columns:
    original_count = sum(count_repair_terms(tags, repair_terms) 
                         for tags in df['TAGS'] if pd.notna(tags))

nltk_count = sum(count_repair_terms(tags, repair_terms) 
                for tags in df['simple_nltk_tags'])

# Display results
print("\nRepair Term Coverage:")
if 'TAGS' in df.columns:
    print(f"Original Regex Tags: {original_count} repair terms identified")
print(f"Simple NLTK Tags: {nltk_count} repair terms identified")

# Visualize the comparison if both methods are available
if 'TAGS' in df.columns:
    plt.figure(figsize=(8, 5))
    plt.bar(['Original Regex', 'Simple NLTK'], [original_count, nltk_count], 
            color=['blue', 'green'])
    plt.title('Repair Term Coverage by Tag Generation Method')
    plt.ylabel('Number of Repair Terms Identified')
    plt.grid(axis='y', alpha=0.3)
    plt.show()

# Save the tagged data
df.to_csv('vehicle_repair_simple_nltk_tags.csv', index=False)
print("\nSaved tagged data to 'vehicle_repair_simple_nltk_tags.csv'")

## 7. Why NLTK is Better Than Regex for Tag Generation (Interview Explanation)

If you're asked in an interview why you chose NLTK over simple regex for tag generation, here's a concise explanation:

### 1. Language Understanding vs. Pattern Matching

- **Regex:** Matches character patterns with no model of the language behind them.
- **NLTK:** Provides tools that label language components (nouns, verbs, etc.), so tags can be chosen by grammatical role rather than by spelling alone.

### 2. Context Awareness

- **Regex:** A word-level pattern treats each word independently, so context and relationships between words are lost.
- **NLTK:** Can recognize multi-word phrases through bigram and collocation detection.

### 3. Built-in Language Resources

- **Regex:** No built-in knowledge of language. You have to manually create rules for everything.
- **NLTK:** Comes with pre-built resources like stopword lists, lemmatizers, and POS taggers.

### 4. Intelligent Processing

- **Regex:** Can only do what you explicitly program it to do.
- **NLTK:** Offers ready-made linguistic operations like finding root words, identifying parts of speech, and recognizing named entities.

### 5. Practical Example

- **Regex Approach:** Treats "running" and "ran" as completely different words.
- **NLTK Approach:** Lemmatization maps both "running" and "ran" to "run", provided the lemmatizer is told they are verbs (NLTK's WordNet lemmatizer assumes nouns by default).
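
To make the contrast concrete, the sketch below counts word forms with and without normalization. The `VERB_LEMMAS` table is a hand-written, hypothetical stand-in for a real lemmatizer (in NLTK that role is played by `WordNetLemmatizer().lemmatize(word, pos='v')`); it is used here only so the snippet runs without downloading WordNet data. The sample note is made up.

```python
import re
from collections import Counter

# Hypothetical lookup table standing in for a real lemmatizer.
VERB_LEMMAS = {"running": "run", "ran": "run", "runs": "run"}

def count_surface_forms(text):
    """Count tokens exactly as written (what a plain regex split gives you)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def count_lemmas(text):
    """Count tokens after mapping each known inflected form to its base form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(VERB_LEMMAS.get(token, token) for token in tokens)

note = "Engine ran rough, then kept running hot. It runs fine now."
surface = count_surface_forms(note)
lemmas = count_lemmas(note)

# Surface counting spreads the evidence over three different keys.
print(surface["ran"], surface["running"], surface["runs"])
# After normalization all three occurrences are counted under "run".
print(lemmas["run"])
```

With surface forms, no single key reflects that the note mentions the same action three times; after normalization the count is concentrated on one base form, which is what makes frequency-based tagging more reliable.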

### 6. Real-world Benefit in Vehicle Repair Data

- **Regex Limitation:** Might just identify "steering" and "wheel" as separate words.
- **NLTK Advantage:** Can recognize "steering wheel" as a single important component and prioritize technical nouns over common words.
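
The multi-word idea can be sketched with nothing more than counting neighbouring word pairs. This is a simplified illustration on made-up repair notes, not the notebook's implementation: it uses raw pair counts, whereas NLTK's collocation tools (for example `BigramCollocationFinder`) can additionally score pairs with association measures.

```python
import re
from collections import Counter

def adjacent_pairs(text):
    """Return every pair of neighbouring words within each sentence."""
    pairs = []
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        pairs.extend(zip(words, words[1:]))
    return pairs

# Made-up repair notes used only for illustration.
records = [
    "Steering wheel vibration when braking.",
    "Replaced steering wheel cover. Checked wheel alignment.",
    "Customer reports loose steering wheel.",
]

pair_counts = Counter()
for record in records:
    pair_counts.update(adjacent_pairs(record))

# The pair that recurs across records is a candidate multi-word tag.
top_pair, top_count = pair_counts.most_common(1)[0]
print(" ".join(top_pair), top_count)
```

Splitting on sentence boundaries first keeps the last word of one sentence from being paired with the first word of the next.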

The simple NLTK approach we've implemented maintains good accuracy while being much easier to explain and understand than complex machine learning models.

In [None]:
# Simple example to demonstrate NLTK vs. regex for an interview
import re
from collections import Counter

sample_text = "Customer complained about steering wheel vibration when braking. Replaced front rotors and performed wheel alignment."

print("Interview Example: NLTK vs. Regex")
print("Sample text:")
print(sample_text)
print("\n1. Simple Regex Approach (from previous notebook):")

# Simulate regex approach (similar to original notebook)
def regex_tag_generation(text):
    # Convert to lowercase
    text = text.lower()
    # Split into words
    words = re.findall(r'\b[a-z]+\b', text)
    # Remove common words (simplified stopword list)
    stop_words = ['the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'with', 'about', 'when']
    words = [word for word in words if word not in stop_words]
    # Count word frequencies
    word_counts = Counter(words)
    # Get top words as tags
    tags = [word for word, _ in word_counts.most_common(5)]
    return tags

regex_tags = regex_tag_generation(sample_text)
print(f"Regex Tags: {', '.join(regex_tags)}")

print("\n2. NLTK Approach:")
nltk_tags = generate_simple_nltk_tags(sample_text)
print(f"NLTK Tags: {', '.join(nltk_tags)}")

print("\nKey Differences to Highlight in an Interview:")
print("1. NLTK identifies 'steering wheel' as a single component")
print("2. NLTK prioritizes nouns (parts) like 'rotor' over common words")
print("3. NLTK understands words like 'replaced' and 'performed' are verbs (actions)")
print("4. NLTK can normalize different word forms (e.g., 'braking' → 'brake')")
print("5. NLTK requires minimal manual configuration compared to regex")

print("\nSimple Technical Implementation:")
print("1. Import NLTK: 'import nltk'")
print("2. Preprocess text: lowercase, clean, tokenize")
print("3. Remove stopwords using NLTK's built-in list")
print("4. Apply lemmatization to get base word forms")
print("5. Use POS tagging to identify nouns, verbs, etc.")
print("6. Generate tags based on word type and frequency")
print("7. Include important bigrams for multi-word concepts")

## 5. Text Classification with scikit-learn

Now we'll use a supervised learning approach where we train a classifier to predict relevant tags based on text features. This is a more straightforward approach than deep learning but can still be very effective.

For this example, we'll:
1. Use our previously generated TF-IDF tags as "ground truth"
2. Train a simple logistic regression classifier to predict these tags
3. Evaluate the model's performance
4. Generate tags for new documents using the trained model
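
Before training, it helps to see what a multi-label target looks like. The sketch below builds a multi-hot matrix by hand on made-up tag lists, with one column per tag in sorted order (the same column layout scikit-learn's `MultiLabelBinarizer` exposes through `classes_`). It then scores a hypothetical predictor that never assigns any tag, to show why per-tag accuracy should be read with care.

```python
def multi_hot(tag_lists):
    """Encode lists of tags as 0/1 rows, one column per tag in sorted order."""
    classes = sorted({tag for tags in tag_lists for tag in tags})
    rows = [[1 if cls in tags else 0 for cls in classes] for tags in tag_lists]
    return classes, rows

# Made-up tags for four documents.
doc_tags = [
    ["brake", "rotor"],
    ["steering", "vibration"],
    ["brake", "noise"],
    ["battery"],
]

classes, y_true = multi_hot(doc_tags)
print(classes)
print(y_true[0])

# A "classifier" that never predicts any tag still looks fairly accurate
# per tag, because most cells in a multi-hot matrix are 0.
y_pred = [[0] * len(classes) for _ in y_true]
per_tag_accuracy = [
    sum(t[j] == p[j] for t, p in zip(y_true, y_pred)) / len(y_true)
    for j in range(len(classes))
]
mean_accuracy = sum(per_tag_accuracy) / len(classes)
print(round(mean_accuracy, 3))
```

On this toy data the all-zeros predictor reaches roughly 71% average per-tag accuracy without finding a single tag, so a high accuracy figure alone does not show that the classifier has learned useful tags.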

In [None]:
# For demonstration purposes, we'll use our TF-IDF tags as ground truth
# In a real scenario, you would use manually labeled data
if 'tfidf_tags' in df.columns and df['tfidf_tags'].map(len).sum() > 0:
    # We'll create a multi-label classification problem
    # First, get all unique tags
    all_tags = set()
    for tags in df['tfidf_tags']:
        all_tags.update(tags)
    
    print(f"Total unique tags: {len(all_tags)}")
    
    # Convert the tags to a multi-hot encoded format
    mlb = MultiLabelBinarizer()
    tag_matrix = mlb.fit_transform(df['tfidf_tags'])
    
    print(f"Tag matrix shape: {tag_matrix.shape}")
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        tfidf_matrix, tag_matrix, test_size=0.3, random_state=42
    )
    
    print(f"Training data shape: {X_train.shape}")
    print(f"Training labels shape: {y_train.shape}")
    
    # We'll use a LogisticRegression classifier for each tag
    print("\nTraining multi-label classifier...")
    
    # Initialize the multi-output classifier with logistic regression
    clf = MultiOutputClassifier(LogisticRegression(max_iter=1000, C=1.0))
    
    # Train the classifier
    try:
        clf.fit(X_train, y_train)
        
        # Predict on the test set
        y_pred = clf.predict(X_test)
        
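        # Calculate accuracy for each tag
        # Note: most entries in a multi-label matrix are 0, so per-tag accuracy
        # can look high even when few tags are actually predicted correctly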
        tag_accuracies = []
        for i in range(tag_matrix.shape[1]):
            acc = accuracy_score(y_test[:, i], y_pred[:, i])
            tag_accuracies.append(acc)
        
        print(f"Average tag prediction accuracy: {np.mean(tag_accuracies):.4f}")
        
        # Generate classifier-based tags for all documents
        predicted_tags = clf.predict(tfidf_matrix)
        
        # Convert back to lists of tag names
        classifier_tags = []
        for i in range(len(predicted_tags)):
            tags = mlb.inverse_transform(predicted_tags[i:i+1])[0]
            classifier_tags.append(list(tags))
        
        # Add to dataframe
        df['classifier_tags'] = classifier_tags
        
        # Display examples
        print("\nExamples of classifier-based tags:")
        for i in range(min(5, len(df))):
            print(f"\nDocument {i+1}:")
            print(f"Text: {df['combined_text'].iloc[i][:100]}...")
            print(f"Classifier Tags: {', '.join(df['classifier_tags'].iloc[i])}")
            print(f"TF-IDF Tags (Ground Truth): {', '.join(df['tfidf_tags'].iloc[i])}")
            
        # Feature importance for tag prediction
        # Get a sample tag and its corresponding classifier
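        if len(all_tags) > 0:
            sample_tag_idx = 0
            # Take the name from mlb.classes_ so it matches the column order of
            # tag_matrix and clf.estimators_ (a set has no guaranteed order)
            sample_tag = mlb.classes_[sample_tag_idx]
            sample_clf = clf.estimators_[sample_tag_idx]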
            
            # Get feature importance for this tag
            if hasattr(sample_clf, 'coef_'):
                # Get the feature names
                feature_names = tfidf_vectorizer.get_feature_names_out()
                
                # Get the coefficient values
                coefficients = sample_clf.coef_[0]
                
                # Create a dataframe with feature names and coefficients
                feature_importance = pd.DataFrame({
                    'Feature': feature_names,
                    'Importance': coefficients
                })
                
                # Sort by absolute importance
                feature_importance['Abs_Importance'] = np.abs(feature_importance['Importance'])
                feature_importance = feature_importance.sort_values('Abs_Importance', ascending=False)
                
                # Display the top 10 most important features
                print(f"\nTop 10 most important features for predicting tag '{sample_tag}':")
                print(feature_importance.head(10))
                
                # Visualize feature importance
                plt.figure(figsize=(12, 6))
                top_features = feature_importance.head(15)
                colors = ['green' if x > 0 else 'red' for x in top_features['Importance']]
                plt.barh(top_features['Feature'], top_features['Importance'], color=colors)
                plt.title(f'Feature Importance for Tag: {sample_tag}')
                plt.xlabel('Coefficient Value')
                plt.tight_layout()
                plt.show()
    
    except Exception as e:
        print(f"Error training classifier: {e}")
        print("Skipping classifier-based tag generation")
        df['classifier_tags'] = [[] for _ in range(len(df))]
else:
    print("TF-IDF tags not available or empty. Skipping classifier-based tag generation.")
    df['classifier_tags'] = [[] for _ in range(len(df))]

## 6. Comparing Tag Generation Methods

Now let's compare the effectiveness of the different tag generation methods:
1. Original regex-based approach
2. TF-IDF based approach
3. LDA topic modeling approach
4. K-means clustering approach
5. Classifier-based approach

We'll evaluate the quality of tags and compare the different methods.
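
The overlap comparison later in this section relies on Jaccard similarity, which is the size of the intersection of two tag sets divided by the size of their union. A minimal sketch on made-up tags follows; returning 0.0 when both sets are empty is a choice made for this example.

```python
def jaccard(tags_a, tags_b):
    """Intersection over union of two tag collections; 0.0 when both are empty."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up tags from two hypothetical methods for the same document.
tags_from_regex = ["steering", "wheel", "vibration"]
tags_from_tfidf = ["vibration", "rotor", "steering"]

# Two shared tags out of four distinct tags overall.
similarity = jaccard(tags_from_regex, tags_from_tfidf)
print(similarity)
```

A value of 1.0 means both methods produced exactly the same tags for a document, and 0.0 means they share none.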

In [None]:
# Define the tag columns to compare
tag_columns = []
if 'TAGS' in df.columns:
    tag_columns.append('TAGS')
if 'tfidf_tags' in df.columns:
    tag_columns.append('tfidf_tags')
if 'lda_tags' in df.columns:
    tag_columns.append('lda_tags')
if 'cluster_tags' in df.columns:
    tag_columns.append('cluster_tags')
if 'classifier_tags' in df.columns:
    tag_columns.append('classifier_tags')

# Function to calculate tag metrics
def calculate_tag_metrics(df, tag_columns):
    """
    Calculate metrics for tag columns
    
    Args:
        df: DataFrame with tag columns
        tag_columns: List of column names containing tags
        
    Returns:
        dict: Dictionary of metrics for each tag column
    """
    metrics = {}
    
    for col in tag_columns:
        # Count number of tags per document
        if col == 'TAGS':
            # Original tags might be comma-separated strings
            tag_counts = df[col].apply(lambda x: len(str(x).split(',')) if pd.notna(x) else 0)
        else:
            # Other tag columns are lists
            tag_counts = df[col].apply(lambda x: len(x) if isinstance(x, list) else 0)
        
        # Calculate metrics
        metrics[col] = {
            'avg_tags_per_doc': tag_counts.mean(),
            'median_tags_per_doc': tag_counts.median(),
            'max_tags_per_doc': tag_counts.max(),
            'min_tags_per_doc': tag_counts.min(),
            'docs_with_tags': (tag_counts > 0).sum(),
            'docs_without_tags': (tag_counts == 0).sum()
        }
        
        # Get unique tags
        if col == 'TAGS':
            # Original tags might be comma-separated strings
            all_tags = set()
            for tags in df[col]:
                if pd.notna(tags):
                    all_tags.update([t.strip() for t in str(tags).split(',')])
        else:
            # Other tag columns are lists
            all_tags = set()
            for tags in df[col]:
                if isinstance(tags, list):
                    all_tags.update(tags)
        
        metrics[col]['unique_tags'] = len(all_tags)
    
    return metrics

# Calculate metrics
tag_metrics = calculate_tag_metrics(df, tag_columns)

# Display metrics
print("Tag Generation Method Comparison:\n")
# Column widths are sized to fit the longest header so the table stays aligned
print("{:<20} {:<15} {:<18} {:<17} {:<20} {:<12}".format(
    "Method", "Avg Tags/Doc", "Median Tags/Doc", "Docs With Tags", "Docs Without Tags", "Unique Tags"
))
print("-" * 107)
for col, m in tag_metrics.items():
    print("{:<20} {:<15.2f} {:<18.1f} {:<17} {:<20} {:<12}".format(
        col,
        m['avg_tags_per_doc'],
        m['median_tags_per_doc'],
        m['docs_with_tags'],
        m['docs_without_tags'],
        m['unique_tags']
    ))

# Visualize the comparison
plt.figure(figsize=(12, 6))

# Average tags per document
plt.subplot(1, 2, 1)
avg_tags = [m['avg_tags_per_doc'] for m in tag_metrics.values()]
plt.bar(tag_metrics.keys(), avg_tags)
plt.title('Average Tags per Document')
plt.xticks(rotation=45, ha='right')
plt.ylabel('Average Number of Tags')

# Unique tags
plt.subplot(1, 2, 2)
unique_tags = [m['unique_tags'] for m in tag_metrics.values()]
plt.bar(tag_metrics.keys(), unique_tags)
plt.title('Number of Unique Tags')
plt.xticks(rotation=45, ha='right')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

# Display examples comparing all methods
print("\nTag Comparison Examples:")
for i in range(min(5, len(df))):
    print(f"\nDocument {i+1}:")
    print(f"Text: {df['combined_text'].iloc[i][:150]}...")
    for col in tag_columns:
        if col == 'TAGS':
            tags = df[col].iloc[i] if pd.notna(df[col].iloc[i]) else ""
            print(f"{col}: {tags}")
        else:
            tags = ', '.join(df[col].iloc[i]) if isinstance(df[col].iloc[i], list) else ""
            print(f"{col}: {tags}")

# Create a function to calculate tag overlap between methods
def calculate_tag_overlap(df, method1, method2):
    """
    Calculate the overlap between tags from two different methods
    
    Args:
        df: DataFrame with tag columns
        method1: First tag method column name
        method2: Second tag method column name
        
    Returns:
        float: Average Jaccard similarity between tag sets
    """
    similarities = []
    
    for i in range(len(df)):
        # Get tags from method 1
        # (strip whitespace so "a, b" yields {"a", "b"} and can match list-based tags)
        if method1 == 'TAGS':
            tags1 = {t.strip() for t in str(df[method1].iloc[i]).split(',')} if pd.notna(df[method1].iloc[i]) else set()
        else:
            tags1 = set(df[method1].iloc[i]) if isinstance(df[method1].iloc[i], list) else set()
        
        # Get tags from method 2
        if method2 == 'TAGS':
            tags2 = {t.strip() for t in str(df[method2].iloc[i]).split(',')} if pd.notna(df[method2].iloc[i]) else set()
        else:
            tags2 = set(df[method2].iloc[i]) if isinstance(df[method2].iloc[i], list) else set()
        
        # Calculate Jaccard similarity (intersection over union)
        if tags1 or tags2:  # Avoid division by zero
            similarity = len(tags1.intersection(tags2)) / len(tags1.union(tags2))
            similarities.append(similarity)
    
    return np.mean(similarities) if similarities else 0

# Calculate overlap between methods
if len(tag_columns) > 1:
    print("\nTag Overlap Between Methods (Jaccard Similarity):")
    # Start from the identity matrix: each method overlaps fully with itself
    overlap_matrix = np.eye(len(tag_columns))
    
    for i, method1 in enumerate(tag_columns):
        for j, method2 in enumerate(tag_columns):
            if i != j:
                overlap = calculate_tag_overlap(df, method1, method2)
                overlap_matrix[i, j] = overlap
    
    # Create a DataFrame for better visualization
    overlap_df = pd.DataFrame(overlap_matrix, index=tag_columns, columns=tag_columns)
    
    # Display the overlap matrix
    print(overlap_df)
    
    # Visualize the overlap
    plt.figure(figsize=(10, 8))
    sns.heatmap(overlap_df, annot=True, cmap='YlGnBu', vmin=0, vmax=1)
    plt.title('Tag Overlap Between Methods (Jaccard Similarity)')
    plt.tight_layout()
    plt.show()

## 7. Conclusion: Advantages of scikit-learn Approaches

Based on our exploration of different tag generation methods, we can draw some conclusions about the advantages of scikit-learn approaches over basic regex:

1. **Statistical Weighting**: TF-IDF automatically weights terms by how often they appear in a document relative to how common they are across the corpus.

2. **Topic Discovery**: LDA identifies hidden topics and themes that might not be apparent through simple word counting.

3. **Document Clustering**: K-means groups similar documents together, helping to identify common issues or repair types.

4. **Supervised Learning**: Classifier-based approaches can learn patterns from existing tags to apply to new documents.

5. **Reduced Manual Effort**: These approaches reduce the need for manually crafting rules or stopword lists.
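
As an illustration of the first point, the sketch below computes a textbook TF-IDF score, `tf * log(N / df)`, by hand on three made-up repair notes. This is a simplified variant for intuition only: scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each document vector, so its numbers will differ.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score each term in each tokenized document as tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Made-up, already tokenized repair notes.
docs = [
    "replaced brake pads".split(),
    "replaced battery".split(),
    "replaced brake rotor".split(),
]
scores = tfidf_scores(docs)

# "replaced" occurs in every note, so it carries no weight;
# "battery" occurs in only one note, so it scores highest there.
print(scores[1])
```

A word that appears in every record gets a score of zero without anyone adding it to a stopword list, which is the sense in which TF-IDF reduces manual effort.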

Let's save our enriched dataset with the tags from all methods.

In [None]:
# Save the dataset with all tag columns
output_file = 'vehicle_repair_sklearn_tags.csv'
df.to_csv(output_file, index=False)
print(f"Saved enriched dataset with all tag methods to '{output_file}'")

# Provide a summary of the advantages of each method
tag_methods = {
    'Regex (Original)': [
        'Simple to implement and understand',
        'Fast computation',
        'No training required',
        'Works with limited data'
    ],
    'TF-IDF': [
        'Prioritizes distinctive words',
        'Reduces importance of common words automatically',
        'Captures document-specific terminology',
        'Simple yet effective for document characterization'
    ],
    'LDA Topic Modeling': [
        'Discovers latent topics across the corpus',
        'Groups related terms together',
        'Provides insights into document themes',
        'Works well for category/taxonomy generation'
    ],
    'K-means Clustering': [
        'Groups similar documents together',
        'Identifies common repair issues or types',
        'Provides consistent tags within clusters',
        'Easy to interpret and visualize'
    ],
    'Classifier-based': [
        'Can be trained on expert-labeled tags',
        'Adapts to domain-specific tagging patterns',
        'Consistent tag application across documents',
        'Good for standardized tagging systems'
    ]
}

# Display the summary
print("\nAdvantages of Different Tag Generation Methods:")
for method, advantages in tag_methods.items():
    print(f"\n{method}:")
    for adv in advantages:
        print(f"  • {adv}")

print("\nRecommendation:")
print("For your vehicle repair dataset, a combination approach may work best:")
print("1. Use TF-IDF for initial tag candidates (good baseline)")
print("2. Apply LDA to identify major repair themes/categories")
print("3. Use K-means to group similar repair descriptions")
print("4. If you have labeled data, train a custom classifier for specific tag prediction")
print("\nEach method has strengths and weaknesses - the best approach depends on your specific needs!")

## 8. Practical Implementation Guide

To implement these tag generation approaches in your production environment:

1. **Start Simple**: Begin with TF-IDF based tagging as it provides a good balance of effectiveness and simplicity.

2. **Evaluate Needs**: If you need to discover topics or themes, add LDA. If you need to group similar documents, add K-means clustering.

3. **Add Supervision**: Once you have a good set of tagged documents, train a classifier for more consistent tagging.

4. **Iterative Improvement**: Refine your tag generation system based on feedback and results.

5. **Visualization**: Use the visualization techniques shown in this notebook to communicate findings to stakeholders.

This middle-ground approach with scikit-learn gives you capable text analysis without the complexity of deep learning frameworks.

In [None]:
# Create a summary visualization comparing the different tag generation methods
plt.figure(figsize=(12, 10))

# Subplot 1: Tag count comparison
plt.subplot(2, 2, 1)
method_names = list(tag_metrics.keys())
avg_tags = [m['avg_tags_per_doc'] for m in tag_metrics.values()]
plt.bar(method_names, avg_tags, color='skyblue')
plt.title('Average Tags per Document')
plt.xticks(rotation=45, ha='right')
plt.ylabel('Count')

# Subplot 2: Unique tags comparison
plt.subplot(2, 2, 2)
unique_tags = [m['unique_tags'] for m in tag_metrics.values()]
plt.bar(method_names, unique_tags, color='lightgreen')
plt.title('Unique Tags Generated')
plt.xticks(rotation=45, ha='right')
plt.ylabel('Count')

# Subplot 3: Docs with tags
plt.subplot(2, 2, 3)
docs_with_tags = [m['docs_with_tags'] / (m['docs_with_tags'] + m['docs_without_tags']) * 100 
                  for m in tag_metrics.values()]
plt.bar(method_names, docs_with_tags, color='salmon')
plt.title('Documents with Tags (%)')
plt.xticks(rotation=45, ha='right')
plt.ylabel('Percentage')
plt.ylim(0, 100)

# Subplot 4: Method complexity vs. performance
plt.subplot(2, 2, 4)
# Estimated complexity (subjective rating)
complexity = {
    'TAGS': 1,               # Regex approach (simplest)
    'tfidf_tags': 2,         # TF-IDF (simple)
    'lda_tags': 3,           # LDA (moderate)
    'cluster_tags': 3,       # K-means (moderate)
    'classifier_tags': 4     # Classifier (more complex)
}

# Plot complexity vs. unique tags (a rough proxy only: more unique tags does not imply better tags)
complexity_values = [complexity.get(method, 0) for method in method_names]
plt.scatter(complexity_values, unique_tags, s=100, c='purple', alpha=0.7)

# Add method labels
for i, method in enumerate(method_names):
    plt.annotate(method, (complexity_values[i], unique_tags[i]), 
                 xytext=(5, 5), textcoords='offset points')

plt.title('Method Complexity vs. Unique Tags')
plt.xlabel('Complexity (1=Simple, 4=Complex)')
plt.ylabel('Unique Tags (rough proxy)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('tag_methods_comparison.png')
plt.show()

print("\nSummary visualization saved as 'tag_methods_comparison.png'")
print("\nConclusion: The scikit-learn based approaches offer more principled tagging than")
print("basic regex while remaining accessible and interpretable. These methods")
print("strike a good balance between complexity and capability for practical applications.")