# Part-of-Speech (POS) Tagging in Natural Language Processing

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/nlp-learning-journey/blob/main/examples/pos-tagging.ipynb)

## Overview

Part-of-Speech (POS) tagging is the process of assigning a grammatical category (such as noun, verb, or adjective) to each word in a text. This fundamental NLP task provides linguistic information that many downstream applications, such as parsing and information extraction, build on.

## What You'll Learn

- Understanding POS tags and their importance
- Different POS tagging approaches
- Using NLTK for POS tagging
- Using spaCy for advanced POS tagging
- Rule-based vs statistical approaches
- Evaluation metrics for POS tagging
- Applications of POS tagging

## Prerequisites

Basic understanding of Python, linguistic concepts, and NLP preprocessing techniques.

## Setup and Installation

Let's install the required libraries for this notebook.

In [None]:
# Install required libraries
!pip install nltk spacy pandas matplotlib seaborn
!python -m spacy download en_core_web_sm

In [None]:
# Import required libraries
import nltk
import spacy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import re

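# Download NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Recent NLTK releases look for these resource names instead of the two above
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('universal_tagset')
nltk.download('brown')
nltk.download('treebank')
nltk.download('wordnet')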

# Import NLTK modules
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
from nltk.corpus import brown, treebank
from nltk.chunk import ne_chunk
from nltk.tree import Tree

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

## 1. Understanding POS Tags

Let's start by understanding what POS tags are and their common categories.

In [None]:
# Common POS tag categories and examples
pos_categories = {
    'Noun': {
        'description': 'Words that represent people, places, things, or ideas',
        'tags': ['NN', 'NNS', 'NNP', 'NNPS'],
        'examples': ['cat', 'cats', 'London', 'Americans']
    },
    'Verb': {
        'description': 'Words that express actions, states, or occurrences',
        'tags': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
        'examples': ['run', 'ran', 'running', 'eaten', 'eat', 'eats']
    },
    'Adjective': {
        'description': 'Words that describe or modify nouns',
        'tags': ['JJ', 'JJR', 'JJS'],
        'examples': ['big', 'bigger', 'biggest']
    },
    'Adverb': {
        'description': 'Words that modify verbs, adjectives, or other adverbs',
        'tags': ['RB', 'RBR', 'RBS'],
        'examples': ['quickly', 'faster', 'fastest']
    },
    'Pronoun': {
        'description': 'Words that replace nouns',
        'tags': ['PRP', 'PRP$'],
        'examples': ['he', 'she', 'his', 'her']
    },
    'Preposition': {
        'description': 'Words that show relationships between words',
        'tags': ['IN'],
        'examples': ['in', 'on', 'at', 'over']
    },
    'Determiner': {
        'description': 'Words that introduce nouns',
        'tags': ['DT'],
        'examples': ['the', 'a', 'an', 'this', 'that']
    }
}

print("Common Part-of-Speech Categories:")
print("=" * 50)
for category, info in pos_categories.items():
    print(f"\n{category}:")
    print(f"  Description: {info['description']}")
    print(f"  Penn Treebank Tags: {', '.join(info['tags'])}")
    print(f"  Examples: {', '.join(info['examples'])}")

## 2. Sample Text for POS Tagging

Let's use various sentences to demonstrate different POS tagging scenarios.

In [None]:
# Different types of sentences for testing POS tagging
sample_sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Natural Language Processing is fascinating and challenging.",
    "I am learning about machine learning algorithms.",
    "The conference will be held in San Francisco next year.",
    "She quickly ran to the store yesterday.",
    "The beautiful, old building was demolished last week.",
    "Can you help me with this difficult problem?",
    "Time flies like an arrow; fruit flies like a banana.",  # Ambiguous sentence
    "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo."  # Very ambiguous
]

print("Sample sentences for POS tagging analysis:")
for i, sentence in enumerate(sample_sentences, 1):
    print(f"{i}. {sentence}")

## 3. NLTK POS Tagging

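NLTK provides several POS taggers. Its default `pos_tag` function returns Penn Treebank tags, and passing `tagset='universal'` maps them to a smaller set of coarse categories.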

In [None]:
def analyze_sentence_nltk(sentence):
    """
    Analyze a sentence using NLTK POS tagging
    """
    # Tokenize the sentence
    tokens = word_tokenize(sentence)
    
    # POS tagging with Penn Treebank tagset
    pos_tags = pos_tag(tokens)
    
    # POS tagging with Universal tagset
    universal_tags = pos_tag(tokens, tagset='universal')
    
    return tokens, pos_tags, universal_tags

# Analyze the first few sentences
print("NLTK POS Tagging Analysis:")
print("=" * 50)

for i, sentence in enumerate(sample_sentences[:3], 1):
    print(f"\nSentence {i}: {sentence}")
    tokens, pos_tags, universal_tags = analyze_sentence_nltk(sentence)
    
    print("\nPenn Treebank Tags:")
    for token, tag in pos_tags:
        print(f"  {token:12} -> {tag}")
    
    print("\nUniversal Tags:")
    for token, tag in universal_tags:
        print(f"  {token:12} -> {tag}")
    
    print("-" * 30)
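
The relationship between the two tagsets can be sketched as a simple lookup: each fine-grained Penn Treebank tag collapses into one coarse category. The table below is hand-written and deliberately partial; `PENN_TO_UNIVERSAL` and `to_universal` are hypothetical names for this illustration, and NLTK relies on its own, complete mapping rather than this dictionary. The example sentence is tagged by hand, not by a tagger.

```python
# Partial, hand-written Penn Treebank -> universal mapping (illustration only)
PENN_TO_UNIVERSAL = {
    'NN': 'NOUN', 'NNS': 'NOUN', 'NNP': 'NOUN', 'NNPS': 'NOUN',
    'VB': 'VERB', 'VBD': 'VERB', 'VBG': 'VERB', 'VBN': 'VERB',
    'VBP': 'VERB', 'VBZ': 'VERB', 'MD': 'VERB',
    'JJ': 'ADJ', 'JJR': 'ADJ', 'JJS': 'ADJ',
    'RB': 'ADV', 'RBR': 'ADV', 'RBS': 'ADV',
    'PRP': 'PRON', 'PRP$': 'PRON',
    'DT': 'DET', 'IN': 'ADP', 'CC': 'CONJ', 'CD': 'NUM',
    '.': '.', ',': '.',
}

def to_universal(tagged_tokens, default='X'):
    """Map (token, penn_tag) pairs to (token, universal_tag) pairs."""
    return [(token, PENN_TO_UNIVERSAL.get(tag, default)) for token, tag in tagged_tokens]

# Hand-tagged example sentence
penn_tagged = [
    ('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
    ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
    ('dog', 'NN'), ('.', '.'),
]

for token, tag in to_universal(penn_tagged):
    print(f"  {token:12} -> {tag}")
```

Tags missing from the table fall back to `X`, the universal category for "other".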

## 4. spaCy POS Tagging

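spaCy tags each token with a coarse Universal Dependencies category (`pos_`) and a fine-grained tag (`tag_`), and it exposes additional linguistic information such as lemmas and dependency labels.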

In [None]:
def analyze_sentence_spacy(sentence):
    """
    Analyze a sentence using spaCy POS tagging
    """
    doc = nlp(sentence)
    
    analysis = []
    for token in doc:
        analysis.append({
            'text': token.text,
            'lemma': token.lemma_,
            'pos': token.pos_,
            'tag': token.tag_,
            'dep': token.dep_,
            'shape': token.shape_,
            'is_alpha': token.is_alpha,
            'is_stop': token.is_stop
        })
    
    return analysis

# Analyze sentences with spaCy
print("spaCy POS Tagging Analysis:")
print("=" * 80)

for i, sentence in enumerate(sample_sentences[:2], 1):
    print(f"\nSentence {i}: {sentence}")
    analysis = analyze_sentence_spacy(sentence)
    
    # Create a formatted table
    print(f"\n{'Token':<12} {'Lemma':<12} {'POS':<6} {'Tag':<6} {'Dep':<8} {'Shape':<8} {'Stop?':<5}")
    print("-" * 65)
    
    for token_info in analysis:
        print(f"{token_info['text']:<12} {token_info['lemma']:<12} {token_info['pos']:<6} "
              f"{token_info['tag']:<6} {token_info['dep']:<8} {token_info['shape']:<8} "
              f"{str(token_info['is_stop']):<5}")
    
    print("-" * 65)

## 5. Understanding POS Tag Meanings

Let's create a comprehensive guide to Penn Treebank POS tags.

In [None]:
# Penn Treebank POS tag reference
penn_treebank_tags = {
    'CC': 'Coordinating conjunction',
    'CD': 'Cardinal number',
    'DT': 'Determiner',
    'EX': 'Existential there',
    'FW': 'Foreign word',
    'IN': 'Preposition or subordinating conjunction',
    'JJ': 'Adjective',
    'JJR': 'Adjective, comparative',
    'JJS': 'Adjective, superlative',
    'LS': 'List item marker',
    'MD': 'Modal',
    'NN': 'Noun, singular',
    'NNS': 'Noun, plural',
    'NNP': 'Proper noun, singular',
    'NNPS': 'Proper noun, plural',
    'PDT': 'Predeterminer',
    'POS': 'Possessive ending',
    'PRP': 'Personal pronoun',
    'PRP$': 'Possessive pronoun',
    'RB': 'Adverb',
    'RBR': 'Adverb, comparative',
    'RBS': 'Adverb, superlative',
    'RP': 'Particle',
    'SYM': 'Symbol',
    'TO': 'to',
    'UH': 'Interjection',
    'VB': 'Verb, base form',
    'VBD': 'Verb, past tense',
    'VBG': 'Verb, gerund or present participle',
    'VBN': 'Verb, past participle',
    'VBP': 'Verb, non-3rd person singular present',
    'VBZ': 'Verb, 3rd person singular present',
    'WDT': 'Wh-determiner',
    'WP': 'Wh-pronoun',
    'WP$': 'Possessive wh-pronoun',
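    'WRB': 'Wh-adverb',
    # Punctuation tags, so periods and commas are not reported as unknown
    '.': 'Sentence-final punctuation',
    ',': 'Comma',
    ':': 'Colon, semicolon, or dash'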
}

def explain_pos_tags(sentence):
    """
    Explain POS tags for a sentence
    """
    tokens = word_tokenize(sentence)
    pos_tags = pos_tag(tokens)
    
    print(f"Sentence: {sentence}")
    print("\nPOS Tag Explanations:")
    print("-" * 60)
    
    for token, tag in pos_tags:
        explanation = penn_treebank_tags.get(tag, "Unknown tag")
        print(f"{token:15} {tag:6} -> {explanation}")

# Explain tags for a complex sentence
complex_sentence = "The researchers carefully analyzed the surprisingly complex data."
explain_pos_tags(complex_sentence)

## 6. Handling Ambiguous Cases

Some words can have multiple POS tags depending on context. Let's explore these cases.

In [None]:
# Examples of words with multiple possible POS tags
ambiguous_examples = [
    ("I can open the can.", "'can' as modal vs noun"),
    ("The book is on the table. I book a flight.", "'book' as noun vs verb"),
    ("The light is bright. Light the candle.", "'light' as noun vs verb"),
    ("They saw the movie. He used a saw.", "'saw' as verb vs noun"),
    ("She will present the gift. The present situation.", "'present' as verb vs adjective"),
    ("The fast car. He runs fast.", "'fast' as adjective vs adverb"),
    ("Time flies like an arrow.", "Ambiguous parsing")
]

print("Analyzing Ambiguous POS Cases:")
print("=" * 50)

for sentence, description in ambiguous_examples:
    print(f"\nExample: {description}")
    print(f"Sentence: {sentence}")
    
    # NLTK analysis
    tokens = word_tokenize(sentence)
    pos_tags = pos_tag(tokens)
    
    print("NLTK Tags:", [(token, tag) for token, tag in pos_tags])
    
    # spaCy analysis
    doc = nlp(sentence)
    spacy_tags = [(token.text, token.pos_, token.tag_) for token in doc]
    
    print("spaCy Tags:", spacy_tags)
    print("-" * 40)
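
To see why context is needed at all, consider the simplest statistical tagger: give every word the tag it received most often in training. The sketch below uses a tiny hand-tagged training set invented for this illustration; `train_most_frequent_tag` and `tag_with_baseline` are hypothetical helper names, not NLTK functions.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged training set (invented for illustration)
training_data = [
    [('I', 'PRP'), ('can', 'MD'), ('swim', 'VB')],
    [('You', 'PRP'), ('can', 'MD'), ('go', 'VB')],
    [('Open', 'VB'), ('the', 'DT'), ('can', 'NN')],
]

def train_most_frequent_tag(tagged_sentences):
    """Return a dict mapping each lowercased word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

def tag_with_baseline(tokens, model, default='NN'):
    """Tag each token independently of its neighbours; unknown words get `default`."""
    return [(token, model.get(token.lower(), default)) for token in tokens]

model = train_most_frequent_tag(training_data)
print(tag_with_baseline(['I', 'can', 'open', 'the', 'can'], model))
# [('I', 'PRP'), ('can', 'MD'), ('open', 'VB'), ('the', 'DT'), ('can', 'MD')]
```

Both occurrences of "can" receive `MD` because the lookup ignores neighbouring words, so the noun reading after "the" is missed. Context-aware taggers exist to fix exactly this kind of error.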

## 7. POS Tag Distribution Analysis

Let's analyze the distribution of POS tags in different types of text.

In [None]:
def analyze_pos_distribution(text, title="Text"):
    """
    Analyze POS tag distribution in text
    """
    # Tokenize and tag the text
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    
    # Count tag frequencies
    tag_counts = Counter([tag for _, tag in pos_tags])
    
    # Convert to universal tags for simpler analysis
    universal_tags = pos_tag(tokens, tagset='universal')
    universal_counts = Counter([tag for _, tag in universal_tags])
    
    return tag_counts, universal_counts

# Analyze different text types
text_samples = {
    "News": """The government announced new policies yesterday. The prime minister held a press conference to discuss economic reforms. Markets responded positively to the announcement.""",
    
    "Scientific": """The researchers conducted experiments to test the hypothesis. Results showed significant correlations between variables. Further analysis is required to validate these findings.""",
    
    "Narrative": """She walked slowly through the dark forest. The wind whispered through the trees as she carefully stepped over fallen branches. Suddenly, she heard a strange noise behind her.""",
    
    "Social Media": """Just saw the most amazing movie! Can't believe how good it was. Definitely recommend watching it. Best film this year!!!"""
}

# Analyze each text type
all_distributions = {}

for text_type, text in text_samples.items():
    tag_counts, universal_counts = analyze_pos_distribution(text, text_type)
    all_distributions[text_type] = universal_counts
    
    print(f"\n{text_type} Text POS Distribution:")
    print(f"Text: {text[:100]}...")
    print("Universal POS tags:")
    for tag, count in universal_counts.most_common():
        percentage = (count / sum(universal_counts.values())) * 100
        print(f"  {tag:8}: {count:3} ({percentage:5.1f}%)")

# Create visualization
plt.figure(figsize=(12, 8))

# Prepare data for plotting
all_tags = set()
for dist in all_distributions.values():
    all_tags.update(dist.keys())

text_types = list(all_distributions.keys())
tag_data = {}

for tag in sorted(all_tags):  # sorted for a stable legend order
    tag_data[tag] = [all_distributions[text_type].get(tag, 0) for text_type in text_types]

# Create grouped bar chart
x = range(len(text_types))
width = 0.8 / len(all_tags)  # scale bar width so groups do not overlap
colors = plt.cm.Set3(range(len(all_tags)))

for i, (tag, counts) in enumerate(tag_data.items()):
    offset = width * (i - (len(all_tags) - 1) / 2)  # center each group on its tick
    plt.bar([xi + offset for xi in x], counts, width, label=tag, color=colors[i])

plt.xlabel('Text Type')
plt.ylabel('Count')
plt.title('POS Tag Distribution Across Text Types')
plt.xticks(x, text_types)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
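
The chart above plots raw counts, which are hard to compare when the text samples differ in length. One option is to convert each distribution to proportions first. The following is a minimal sketch; `to_proportions` is a hypothetical helper, and the counts are made up to stand in for entries of `all_distributions`.

```python
from collections import Counter

def to_proportions(tag_counts):
    """Convert raw tag counts to shares of the total (empty input gives an empty dict)."""
    total = sum(tag_counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in tag_counts.items()}

# Made-up counts standing in for two entries of all_distributions
example_distributions = {
    'News': Counter({'NOUN': 10, 'VERB': 4, 'DET': 3, '.': 3}),
    'Narrative': Counter({'NOUN': 9, 'VERB': 7, 'ADV': 4, 'DET': 5, '.': 5}),
}

for text_type, counts in example_distributions.items():
    proportions = to_proportions(counts)
    summary = ', '.join(f"{tag}={share:.2f}" for tag, share in sorted(proportions.items()))
    print(f"{text_type}: {summary}")
```

Plotting these shares instead of counts would put texts of different lengths on the same scale.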

## 8. Custom POS Tagging Rules

Sometimes we need custom rules for domain-specific text or to handle special cases.

In [None]:
# word_tokenize splits URLs, e-mail addresses, hashtags, and mentions into
# several tokens, so the rules below need a tokenizer that keeps them whole.
TOKEN_PATTERN = re.compile(
    r"https?://[^\s,;]+"               # URLs
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"   # e-mail addresses
    r"|[#@]\w+"                        # hashtags and mentions
    r"|\d+(?:\.\d+)?\w*%?"             # numbers, optionally with a unit suffix
    r"|\w+(?:'\w+)?"                   # words, including simple contractions
    r"|[^\w\s]"                        # any other single punctuation mark
)

def custom_tokenize(sentence):
    """
    Tokenize while keeping URLs, e-mails, hashtags, mentions, and unit numbers intact
    """
    return TOKEN_PATTERN.findall(sentence)

def custom_pos_tagger(sentence):
    """
    Custom POS tagger with domain-specific rules
    """
    # Start with standard POS tagging on the custom tokens
    tokens = custom_tokenize(sentence)
    pos_tags = pos_tag(tokens)
    
    # Apply custom rules
    custom_tags = []
    
    for token, tag in pos_tags:
        # Custom rule 1: Technical terms ending in -ing that are nouns
        if token.endswith('ing') and token.lower() in ['programming', 'computing', 'processing', 'learning']:
            custom_tags.append((token, 'NN-TECH'))
        
        # Custom rule 2: URLs and email addresses
        elif re.fullmatch(r'https?://\S+', token) or re.fullmatch(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', token):
            custom_tags.append((token, 'URL-EMAIL'))
        
        # Custom rule 3: Hashtags and mentions (a bare '#' or '@' does not count)
        elif token.startswith('#') and len(token) > 1:
            custom_tags.append((token, 'HASHTAG'))
        elif token.startswith('@') and len(token) > 1:
            custom_tags.append((token, 'MENTION'))
        
        # Custom rule 4: Numbers with units
        elif re.fullmatch(r'\d+(?:\.\d+)?(?:kg|km|mb|gb|%)', token.lower()):
            custom_tags.append((token, 'NUM-UNIT'))
        
        # Default: use standard tag
        else:
            custom_tags.append((token, tag))
    
    return custom_tags

# Test custom POS tagger
custom_sentences = [
    "Machine learning and natural language processing are fascinating fields.",
    "Contact me at john@example.com or visit https://example.com for more info.",
    "Check out #NLP and #MachineLearning. Follow @nlp_expert for updates.",
    "The file size is 150MB and the distance is 2.5km."
]

print("Custom POS Tagging Results:")
print("=" * 50)

for sentence in custom_sentences:
    print(f"\nSentence: {sentence}")
    
    # Standard tagging on the same tokens, so both results stay aligned
    standard_tags = pos_tag(custom_tokenize(sentence))
    print("Standard:", standard_tags)
    
    # Custom tagging
    custom_tags = custom_pos_tagger(sentence)
    print("Custom:  ", custom_tags)
    
    # Highlight differences
    differences = [(token, std_tag, cust_tag) for (token, std_tag), (_, cust_tag) 
                   in zip(standard_tags, custom_tags) if std_tag != cust_tag]
    
    if differences:
        print("Changes: ", [(token, f"{std}->{cust}") for token, std, cust in differences])
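
The custom tagger above layers rules on top of a statistical tagger. For contrast, a purely rule-based tagger can be sketched with nothing but suffix patterns. The rules and the names `SUFFIX_RULES` and `rule_based_tag` are assumptions made for this illustration; they are far too coarse for real use.

```python
import re

# Ordered (pattern, tag) rules: the first pattern that matches wins
SUFFIX_RULES = [
    (re.compile(r'\d+(\.\d+)?'), 'CD'),   # plain numbers
    (re.compile(r'.+ing'), 'VBG'),        # gerunds and present participles
    (re.compile(r'.+ed'), 'VBD'),         # simple past
    (re.compile(r'.+ly'), 'RB'),          # many adverbs
    (re.compile(r'.+s'), 'NNS'),          # plural nouns
]

def rule_based_tag(tokens, default='NN'):
    """Tag each token with the first matching suffix rule, else `default`."""
    tagged = []
    for token in tokens:
        for pattern, tag in SUFFIX_RULES:
            if pattern.fullmatch(token.lower()):
                tagged.append((token, tag))
                break
        else:
            # No rule matched: fall back to the most common open-class tag
            tagged.append((token, default))
    return tagged

print(rule_based_tag(['She', 'quickly', 'walked', '3', 'dogs']))
# [('She', 'NN'), ('quickly', 'RB'), ('walked', 'VBD'), ('3', 'CD'), ('dogs', 'NNS')]
```

The result shows both the appeal and the limits of the approach: the suffix rules are fast and transparent, but "She" falls through to the noun default, and a word like "is" would be tagged as a plural noun.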

## 9. Evaluating POS Tagging Accuracy

Let's evaluate POS tagging accuracy using a gold standard dataset.

In [None]:
# Use a small sample from the Brown corpus for evaluation
def evaluate_pos_tagger(test_sentences, tagger_func):
    """
    Evaluate POS tagger accuracy
    """
    correct = 0
    total = 0
    confusion_matrix = defaultdict(lambda: defaultdict(int))
    
    for sentence in test_sentences:
        # Extract words and true tags
        words = [word for word, tag in sentence]
        true_tags = [tag for word, tag in sentence]
        
        # Get predicted tags
        predicted_tags = [tag for word, tag in tagger_func(words)]
        
        # Calculate accuracy
        for true_tag, pred_tag in zip(true_tags, predicted_tags):
            total += 1
            if true_tag == pred_tag:
                correct += 1
            confusion_matrix[true_tag][pred_tag] += 1
    
    accuracy = correct / total if total > 0 else 0
    return accuracy, confusion_matrix

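# Get sample sentences from the Brown corpus. Brown uses its own tagset, not
# Penn Treebank, so gold and predicted tags are both mapped to the universal
# tagset before they are compared.
from nltk.tag import map_tag
from spacy.tokens import Doc

brown_sentences = brown.tagged_sents(tagset='universal')[:100]  # First 100 sentences

# Define different taggers to compare
def nltk_tagger(words):
    return pos_tag(words, tagset='universal')

def spacy_tagger(words):
    # Build the Doc from the gold tokens so spaCy does not re-tokenize the
    # text and the predictions stay aligned with the gold tags
    doc = Doc(nlp.vocab, words=list(words))
    for _, component in nlp.pipeline:
        doc = component(doc)
    # Map spaCy's fine-grained Penn Treebank tags to NLTK's universal tagset
    return [(token.text, map_tag('en-ptb', 'universal', token.tag_)) for token in doc]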

# Evaluate taggers
print("POS Tagging Accuracy Evaluation:")
print("=" * 40)

taggers = {
    'NLTK': nltk_tagger,
    'spaCy': spacy_tagger
}

results = {}
for name, tagger in taggers.items():
    accuracy, confusion = evaluate_pos_tagger(brown_sentences, tagger)
    results[name] = accuracy
    print(f"{name:8}: {accuracy:.3f} ({accuracy*100:.1f}%)")

# Analyze common errors for NLTK tagger
_, nltk_confusion = evaluate_pos_tagger(brown_sentences, nltk_tagger)

print("\nMost Common POS Tagging Errors (NLTK):")
errors = []
for true_tag, predictions in nltk_confusion.items():
    for pred_tag, count in predictions.items():
        if true_tag != pred_tag and count > 2:  # Only show errors that occur multiple times
            errors.append((count, true_tag, pred_tag))

errors.sort(reverse=True)
for count, true_tag, pred_tag in errors[:10]:
    print(f"  {true_tag} -> {pred_tag}: {count} times")
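
Overall accuracy hides how a tagger behaves on individual tags. Per-tag precision, recall, and F1 can be derived from the same nested `confusion_matrix[true_tag][pred_tag]` structure that `evaluate_pos_tagger` returns. The sketch below is one way to do it; `per_tag_metrics` is a hypothetical helper, and the toy matrix is invented so the numbers are easy to check by hand.

```python
def per_tag_metrics(confusion_matrix):
    """Compute precision, recall, and F1 per tag from confusion_matrix[true][pred] counts."""
    # Collect every tag that appears as a gold label or as a prediction
    tags = set(confusion_matrix)
    for predictions in confusion_matrix.values():
        tags.update(predictions)

    metrics = {}
    for tag in sorted(tags):
        true_positive = confusion_matrix.get(tag, {}).get(tag, 0)
        gold_total = sum(confusion_matrix.get(tag, {}).values())
        predicted_total = sum(preds.get(tag, 0) for preds in confusion_matrix.values())

        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / gold_total if gold_total else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
        metrics[tag] = {'precision': precision, 'recall': recall, 'f1': f1}
    return metrics

# Toy confusion matrix (invented counts): rows are gold tags, columns are predictions
toy_confusion = {
    'NOUN': {'NOUN': 8, 'VERB': 2},
    'VERB': {'VERB': 5, 'NOUN': 1},
    'ADJ': {'ADJ': 3, 'NOUN': 1},
}

for tag, scores in per_tag_metrics(toy_confusion).items():
    print(f"{tag:5} P={scores['precision']:.2f} R={scores['recall']:.2f} F1={scores['f1']:.2f}")
```

In the toy data, `ADJ` has perfect precision but loses recall because one adjective was predicted as a noun, a pattern that a single accuracy figure would not reveal.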

## 10. Applications of POS Tagging

Let's explore practical applications of POS tagging in NLP tasks.

In [None]:
# Application 1: Extract specific word types
def extract_word_types(text, target_pos=('NN', 'NNS', 'NNP', 'NNPS')):
    """
    Extract words of specific POS types (default: nouns)
    """
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    
    return [word for word, tag in pos_tags if tag in target_pos]

# Application 2: Generate reading difficulty metrics
def calculate_text_complexity(text):
    """
    Calculate text complexity based on POS distribution
    """
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    
    # Count different POS types
    pos_counts = Counter([tag for _, tag in pos_tags])
    total_words = len(tokens)
    
    # Calculate complexity metrics
    noun_ratio = (pos_counts.get('NN', 0) + pos_counts.get('NNS', 0) + 
                  pos_counts.get('NNP', 0) + pos_counts.get('NNPS', 0)) / total_words
    
    verb_ratio = (pos_counts.get('VB', 0) + pos_counts.get('VBD', 0) + 
                  pos_counts.get('VBG', 0) + pos_counts.get('VBN', 0) + 
                  pos_counts.get('VBP', 0) + pos_counts.get('VBZ', 0)) / total_words
    
    adj_ratio = (pos_counts.get('JJ', 0) + pos_counts.get('JJR', 0) + 
                 pos_counts.get('JJS', 0)) / total_words
    
    adv_ratio = (pos_counts.get('RB', 0) + pos_counts.get('RBR', 0) + 
                 pos_counts.get('RBS', 0)) / total_words
    
    # Simple complexity score (higher = more complex)
    complexity_score = noun_ratio * 1.5 + adj_ratio * 1.2 + adv_ratio * 1.1 + verb_ratio * 0.8
    
    return {
        'noun_ratio': noun_ratio,
        'verb_ratio': verb_ratio,
        'adjective_ratio': adj_ratio,
        'adverb_ratio': adv_ratio,
        'complexity_score': complexity_score
    }

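# Application 3: Extract noun phrases
def extract_noun_phrases(text):
    """
    Extract simple noun phrases using POS patterns
    """
    doc = nlp(text)
    noun_phrases = []
    
    # Pattern: (Determiner)? (Adjective)* Noun+
    current_phrase = []   # tokens collected for the phrase in progress
    has_noun = False      # True once the phrase contains at least one noun
    
    for token in list(doc) + [None]:  # trailing None flushes the last phrase
        pos = token.pos_ if token is not None else None
        
        if pos in ('NOUN', 'PROPN'):  # Noun
            current_phrase.append(token.text)
            has_noun = True
            continue
        
        # A non-noun token closes any phrase that already contains a noun,
        # so each phrase is emitted once rather than once per noun
        if has_noun:
            if len(current_phrase) > 1:  # Only phrases with multiple words
                noun_phrases.append(' '.join(current_phrase))
            current_phrase, has_noun = [], False
        
        if pos == 'DET':  # Determiner starts a new phrase
            current_phrase = [token.text]
        elif pos == 'ADJ':  # Adjective, with or without a determiner before it
            current_phrase.append(token.text)
        else:
            current_phrase = []
    
    return noun_phrases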

# Test applications
test_text = """
The advanced machine learning algorithms can process natural language efficiently. 
These sophisticated computational methods require extensive training data and 
powerful computing resources. Modern artificial intelligence systems demonstrate 
remarkable performance in complex linguistic tasks.
"""

print("POS Tagging Applications:")
print("=" * 40)

print(f"\nTest Text: {test_text.strip()}")

# Application 1: Extract nouns
nouns = extract_word_types(test_text)
print(f"\nExtracted Nouns: {nouns}")

# Extract adjectives
adjectives = extract_word_types(test_text, ['JJ', 'JJR', 'JJS'])
print(f"Extracted Adjectives: {adjectives}")

# Application 2: Text complexity
complexity = calculate_text_complexity(test_text)
print(f"\nText Complexity Analysis:")
for metric, value in complexity.items():
    print(f"  {metric.replace('_', ' ').title()}: {value:.3f}")

# Application 3: Noun phrases
noun_phrases = extract_noun_phrases(test_text)
print(f"\nExtracted Noun Phrases: {noun_phrases}")

## 11. Advanced POS Tagging Features

Exploring more advanced aspects of POS tagging.

In [None]:
# Morphological analysis using spaCy
def detailed_morphological_analysis(text):
    """
    Perform detailed morphological analysis
    """
    doc = nlp(text)
    
    analysis = []
    for token in doc:
        morph_info = {
            'text': token.text,
            'lemma': token.lemma_,
            'pos': token.pos_,
            'tag': token.tag_,
            'morph': str(token.morph),
            'is_oov': token.is_oov,  # Out of vocabulary
            'is_alpha': token.is_alpha,
            'is_digit': token.is_digit,
            'is_punct': token.is_punct
        }
        analysis.append(morph_info)
    
    return analysis

# POS-based text preprocessing
def pos_based_filtering(text, keep_pos=('NOUN', 'VERB', 'ADJ', 'ADV')):
    """
    Filter text to keep only specific POS types
    """
    doc = nlp(text)
    filtered_tokens = [token.lemma_.lower() for token in doc 
                      if token.pos_ in keep_pos and not token.is_stop and token.is_alpha]
    return ' '.join(filtered_tokens)

# Sentence complexity scoring
def score_sentence_complexity(sentence):
    """
    Score sentence complexity based on POS patterns
    """
    doc = nlp(sentence)
    
    # Count different elements
    pos_counts = Counter([token.pos_ for token in doc])
    
    # Complexity factors
    factors = {
        'length': len(doc),
        'unique_pos': len(set(token.pos_ for token in doc)),
        'subordinate_clauses': sum(1 for token in doc if token.dep_ in ['ccomp', 'advcl']),
        'passive_voice': sum(1 for token in doc if token.dep_ == 'auxpass'),
        'adjective_density': pos_counts.get('ADJ', 0) / len(doc),
        'adverb_density': pos_counts.get('ADV', 0) / len(doc)
    }
    
    # Simple complexity score
    complexity = (factors['length'] * 0.1 + 
                 factors['unique_pos'] * 0.5 +
                 factors['subordinate_clauses'] * 2 +
                 factors['passive_voice'] * 1.5 +
                 factors['adjective_density'] * 10 +
                 factors['adverb_density'] * 8)
    
    return complexity, factors

# Test advanced features
test_sentences = [
    "The cat sat on the mat.",  # Simple
    "The incredibly beautiful, ancient cathedral was being carefully restored by skilled craftsmen.",  # Complex
    "Although machine learning algorithms are powerful, they require extensive computational resources.",  # Complex with subordinate clause
]

print("Advanced POS Tagging Features:")
print("=" * 50)

for sentence in test_sentences:
    print(f"\nSentence: {sentence}")
    
    # Detailed morphological analysis
    morph_analysis = detailed_morphological_analysis(sentence)
    print("\nMorphological features (first 5 tokens):")
    for token_info in morph_analysis[:5]:
        print(f"  {token_info['text']}: POS={token_info['pos']}, Morph={token_info['morph']}")
    
    # POS-based filtering
    filtered = pos_based_filtering(sentence)
    print(f"\nFiltered (content words only): {filtered}")
    
    # Complexity scoring
    complexity, factors = score_sentence_complexity(sentence)
    print(f"\nComplexity Score: {complexity:.2f}")
    print(f"  Length: {factors['length']} tokens")
    print(f"  Unique POS: {factors['unique_pos']}")
    print(f"  Subordinate clauses: {factors['subordinate_clauses']}")
    
    print("-" * 40)

## 12. Exercises

Practice POS tagging with these exercises:

### Exercise 1: Domain-Specific POS Analysis

Analyze POS distributions in different domains and identify patterns.

In [None]:
# TODO: Implement analysis for different text domains
# Domains to analyze: Academic papers, News articles, Social media posts, Legal documents

def analyze_domain_pos_patterns(texts_by_domain):
    """
    Analyze POS patterns across different domains
    """
    # Your implementation here
    pass

# Sample texts for different domains
domain_texts = {
    'academic': "The study investigates the correlation between variables using statistical analysis.",
    'news': "The government announced new policies yesterday during a press conference.",
    'social_media': "OMG this is so cool! Can't believe it happened! #amazing",
    'legal': "The party shall comply with all applicable laws and regulations."
}

# Your analysis code here:


### Exercise 2: Custom POS Tagger for Technical Text

Create a custom POS tagger that handles technical terminology better.

In [None]:
# TODO: Implement a custom POS tagger for technical text
# Consider: Programming terms, technical jargon, version numbers, file extensions

def technical_pos_tagger(text):
    """
    POS tagger optimized for technical text
    """
    # Your implementation here
    pass

# Test with technical text
technical_text = """
The algorithm uses TensorFlow 2.0 and Python 3.8. The model.py file contains 
the neural network implementation. GPU acceleration with CUDA 11.0 improves 
training speed by 10x.
"""

# Your test code here:


## Key Takeaways

1. **POS tagging is fundamental**: It provides crucial linguistic information for many NLP tasks

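2. **Multiple approaches available**:
   - Rule-based: Fast and transparent, but hand-written rules cover only the patterns they anticipate
   - Statistical: Learn tag probabilities from annotated corpora, use context to resolve ambiguity, and generalize to unseen words
   - Neural: State-of-the-art performance on standard benchmarks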

3. **Context matters**: Same word can have different POS tags in different contexts

4. **Library comparison**:
   - **NLTK**: Good for learning and basic tasks
   - **spaCy**: Designed for production pipelines; adds lemmas, morphology, and dependency labels to each token

5. **Applications are diverse**:
   - Information extraction
   - Text preprocessing
   - Syntactic analysis
   - Text complexity assessment
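
The statistical approach and the role of context can be made concrete with a toy hidden Markov model decoded by the Viterbi algorithm. This is a sketch only: the probabilities below are invented for illustration rather than estimated from a corpus, and `viterbi` is a hypothetical helper, not a library function.

```python
def viterbi(words, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    if not words:
        return []
    # best[t][state] = (probability of the best path ending in state at t, previous state)
    best = [{s: (start_p.get(s, floor) * emit_p[s].get(words[0], floor), None) for s in states}]
    for t in range(1, len(words)):
        column = {}
        for state in states:
            emission = emit_p[state].get(words[t], floor)
            column[state] = max(
                (best[t - 1][prev][0] * trans_p[prev].get(state, floor) * emission, prev)
                for prev in states
            )
        best.append(column)
    # Follow the back-pointers from the best final state
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(words) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]

states = ['PRP', 'MD', 'VB', 'DT', 'NN']
# Invented probabilities; missing entries fall back to `floor`
start_p = {'PRP': 0.6, 'DT': 0.3, 'NN': 0.1}
trans_p = {
    'PRP': {'MD': 0.5, 'VB': 0.5},
    'MD': {'VB': 0.9, 'DT': 0.1},
    'VB': {'DT': 0.7, 'NN': 0.2, 'PRP': 0.1},
    'DT': {'NN': 1.0},
    'NN': {'MD': 0.4, 'VB': 0.4, 'NN': 0.2},
}
emit_p = {
    'PRP': {'i': 1.0},
    'MD': {'can': 1.0},
    'VB': {'open': 0.6, 'swim': 0.3, 'can': 0.1},
    'DT': {'the': 1.0},
    'NN': {'can': 0.5, 'door': 0.5},
}

words = ['i', 'can', 'open', 'the', 'can']
print(list(zip(words, viterbi(words, states, start_p, trans_p, emit_p))))
# [('i', 'PRP'), ('can', 'MD'), ('open', 'VB'), ('the', 'DT'), ('can', 'NN')]
```

The first "can" is tagged as a modal and the second as a noun because the transition probabilities favour a noun after a determiner, which is the kind of decision a context-free lookup cannot make.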

## Best Practices

1. **Choose the right tool**: spaCy for production, NLTK for experimentation
2. **Consider domain adaptation**: Customize taggers for specific domains
3. **Evaluate on your data**: Generic accuracy metrics may not reflect performance on your specific text
4. **Handle ambiguity**: Be aware that some tagging decisions are inherently ambiguous
5. **Combine with other features**: POS tags work best when combined with other linguistic features

## Common Use Cases

- **Information Extraction**: Identify entities and relationships
- **Text Summarization**: Focus on content words (nouns, verbs, adjectives)
- **Sentiment Analysis**: Adjectives and adverbs carry sentiment information
- **Question Answering**: POS helps identify question types and answer candidates
- **Machine Translation**: Grammatical information improves translation quality

## Next Steps

- Learn about dependency parsing and syntactic analysis
- Explore named entity recognition (NER)
- Study chunking and phrase extraction
- Practice with domain-specific text from your area of interest
- Experiment with neural POS taggers and their architectures

## Resources

- [Penn Treebank POS Tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
- [Universal POS Tags](https://universaldependencies.org/u/pos/)
- [NLTK POS Tagging Documentation](https://www.nltk.org/book/ch05.html)
- [spaCy Linguistic Features](https://spacy.io/usage/linguistic-features#pos-tagging)
- [Natural Language Processing with Python](https://www.nltk.org/book/)
