# Module 10: Named Entity Recognition (NER)

**Sequence Labeling with BiLSTM-CRF**

---

## 1. Objectives

- ✅ Understand NER and BIO tagging
- ✅ Build BiLSTM for token classification
- ✅ Implement CRF layer from scratch
- ✅ Create complete BiLSTM-CRF model

## 2. Prerequisites

- [Module 09: Text Classification](../09_text_classification_rnns/09_text_classification_rnns.ipynb)

## 3. NER Task Definition

### Input/Output
```
Input:  "John works at Google in California"
Output: [B-PER, O, O, B-ORG, O, B-LOC]
```

### BIO Tagging Scheme

| Tag | Meaning |
|-----|--------|
| B-XXX | Beginning of entity XXX |
| I-XXX | Inside entity XXX |
| O | Outside any entity |

**Example:**
```
"New York is a city"
[B-LOC, I-LOC, O, O, O]
```
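
To make the scheme concrete, here is a minimal sketch that turns a BIO tag sequence back into entity spans. The helper name `bio_to_spans` is hypothetical, and the choice to silently ignore a stray `I-XXX` tag with no preceding `B-XXX` is an assumption; other conventions treat it as the start of a new entity.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags to (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, ent_type = None, None
    # A sentinel "O" at the end closes an entity that runs to the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not continues:
            spans.append((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans


tokens = ["New", "York", "is", "a", "city"]
tags = ["B-LOC", "I-LOC", "O", "O", "O"]
for ent_type, start, end in bio_to_spans(tags):
    print(ent_type, " ".join(tokens[start:end]))
```

Because `B-` always opens a new span, two adjacent entities of the same type (`B-PER`, `B-PER`) stay separate.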

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from typing import List, Tuple

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

## 4. Sample Data

In [None]:
# Sample NER data
train_data = [
    (["John", "works", "at", "Google"], ["B-PER", "O", "O", "B-ORG"]),
    (["Mary", "lives", "in", "New", "York"], ["B-PER", "O", "O", "B-LOC", "I-LOC"]),
    (["Apple", "is", "in", "California"], ["B-ORG", "O", "O", "B-LOC"]),
]

# Build vocabularies (sorted so that indices are reproducible across runs)
all_words = sorted({w for sent, _ in train_data for w in sent})
all_tags = sorted({t for _, tags in train_data for t in tags})

word2idx = {'<PAD>': 0, '<UNK>': 1}
word2idx.update({w: i+2 for i, w in enumerate(all_words)})

tag2idx = {'<PAD>': 0}
tag2idx.update({t: i+1 for i, t in enumerate(all_tags)})
idx2tag = {v: k for k, v in tag2idx.items()}

print(f"Words: {len(word2idx)}, Tags: {tag2idx}")
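
The sentences above have different lengths, so they need padding and a mask before they can be batched. The sketch below shows one way to do that with `torch.nn.utils.rnn.pad_sequence`. The helper `encode_batch` and the toy vocabularies are illustrative stand-ins, not part of the module's code; the only assumption carried over is that index 0 is `<PAD>` in both mappings.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Toy vocabularies standing in for word2idx / tag2idx (index 0 is <PAD>)
word2idx = {"<PAD>": 0, "<UNK>": 1, "John": 2, "works": 3, "at": 4, "Google": 5,
            "Mary": 6, "lives": 7, "in": 8, "New": 9, "York": 10}
tag2idx = {"<PAD>": 0, "O": 1, "B-PER": 2, "B-ORG": 3, "B-LOC": 4, "I-LOC": 5}

batch = [
    (["John", "works", "at", "Google"], ["B-PER", "O", "O", "B-ORG"]),
    (["Mary", "lives", "in", "New", "York"], ["B-PER", "O", "O", "B-LOC", "I-LOC"]),
]


def encode_batch(batch, word2idx, tag2idx):
    """Return padded word ids, padded tag ids, and a boolean mask (True = real token)."""
    word_ids = [torch.tensor([word2idx.get(w, word2idx["<UNK>"]) for w in words])
                for words, _ in batch]
    tag_ids = [torch.tensor([tag2idx[t] for t in tags]) for _, tags in batch]
    x = pad_sequence(word_ids, batch_first=True, padding_value=word2idx["<PAD>"])
    y = pad_sequence(tag_ids, batch_first=True, padding_value=tag2idx["<PAD>"])
    mask = x != word2idx["<PAD>"]
    return x, y, mask


x, y, mask = encode_batch(batch, word2idx, tag2idx)
print(x.shape, mask.sum(dim=1))  # batch shape and true sentence lengths
```

The resulting `mask` has the `(batch, seq)` boolean layout that a masked loss or a CRF layer would expect.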

## 5. BiLSTM for NER (Without CRF)

In [None]:
class BiLSTMNER(nn.Module):
    """BiLSTM for token classification."""
    
    def __init__(self, vocab_size, tag_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, tag_size)
    
    def forward(self, x):
        # x: (batch, seq)
        embedded = self.embedding(x)  # (batch, seq, embed)
        lstm_out, _ = self.lstm(embedded)  # (batch, seq, hidden*2)
        logits = self.fc(lstm_out)  # (batch, seq, tag_size)
        return logits

# Test
model = BiLSTMNER(len(word2idx), len(tag2idx))
x = torch.randint(0, len(word2idx), (2, 5))
out = model(x)
print(f"Output shape: {out.shape}  # (batch, seq, tags)")
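
Without a CRF, a model like this is typically trained with per-token cross-entropy. The sketch below uses random logits as a stand-in for `model(x)` and assumes tag index 0 is `<PAD>`, so padded positions are skipped via `ignore_index`. The target values are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, tag_size = 2, 5, 6

# Stand-in for model(x): (batch, seq, tag_size)
logits = torch.randn(batch_size, seq_len, tag_size, requires_grad=True)
targets = torch.tensor([[2, 1, 1, 3, 0],   # last position is padding (tag index 0)
                        [2, 1, 1, 4, 5]])

criterion = nn.CrossEntropyLoss(ignore_index=0)  # padded targets contribute nothing
loss = criterion(logits.reshape(-1, tag_size), targets.reshape(-1))
loss.backward()

pred = logits.argmax(dim=-1)  # each token is decided independently of its neighbors
print(f"loss={loss.item():.4f}, pred shape={tuple(pred.shape)}")
```

The `argmax` on the last line is the independent per-token decision that the CRF layer in the next section replaces.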

## 6. CRF Layer (Key Concept!)

### Why CRF?

Independent softmax per token ignores **tag dependencies**:
- `I-PER` can't follow `B-ORG`
- `I-LOC` should follow `B-LOC` or `I-LOC`

CRF learns **transition scores** between tags.
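
The constraints listed above can be written as a hard rule. The sketch below is only an illustration: the function name `is_valid_transition` is hypothetical, and a CRF does not apply such a rule directly. It learns soft scores, and a well-trained model would be expected to give these transitions low scores.

```python
def is_valid_transition(prev_tag: str, next_tag: str) -> bool:
    """Under BIO, I-X may only follow B-X or I-X; every other transition is allowed."""
    if not next_tag.startswith("I-"):
        return True
    entity_type = next_tag[2:]
    return prev_tag in (f"B-{entity_type}", f"I-{entity_type}")


tags = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]
invalid = [(p, n) for p in tags for n in tags if not is_valid_transition(p, n)]
for prev_tag, next_tag in invalid:
    print(f"{prev_tag} -> {next_tag}")
```

An independent softmax has no way to rule out any of the printed pairs, because it never looks at the previous tag.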

In [None]:
class CRF(nn.Module):
    """Linear-chain CRF layer."""
    
    def __init__(self, num_tags: int):
        super().__init__()
        self.num_tags = num_tags
        
        # Transition scores: transitions[i, j] = score of moving from tag i to tag j
        self.transitions = nn.Parameter(torch.randn(num_tags, num_tags))
        
        # Start and end transitions
        self.start_transitions = nn.Parameter(torch.randn(num_tags))
        self.end_transitions = nn.Parameter(torch.randn(num_tags))
    
    def forward(self, emissions, tags, mask=None):
        """
        Compute negative log likelihood.
        
        Args:
            emissions: (batch, seq, num_tags)
            tags: (batch, seq)
            mask: (batch, seq), bool, True for real tokens; position 0 must be True
        """
        if mask is None:
            mask = torch.ones_like(tags, dtype=torch.bool)
        
        log_numerator = self._score_sentence(emissions, tags, mask)
        log_denominator = self._compute_log_partition(emissions, mask)
        
        return (log_denominator - log_numerator).mean()
    
    def _score_sentence(self, emissions, tags, mask):
        """Score of a specific tag sequence."""
        batch_size, seq_len, _ = emissions.shape
        
        # Start transition
        score = self.start_transitions[tags[:, 0]]
        
        # Emission for first tag
        score += emissions[:, 0].gather(1, tags[:, 0].unsqueeze(1)).squeeze(1)
        
        for i in range(1, seq_len):
            # Transition score, indexed [previous tag, current tag] to match
            # the forward algorithm and Viterbi decoding below
            score += self.transitions[tags[:, i-1], tags[:, i]] * mask[:, i]
            # Emission score
            score += emissions[:, i].gather(1, tags[:, i].unsqueeze(1)).squeeze(1) * mask[:, i]
        
        # End transition
        last_tag_idx = mask.sum(dim=1) - 1
        last_tags = tags.gather(1, last_tag_idx.unsqueeze(1)).squeeze(1)
        score += self.end_transitions[last_tags]
        
        return score
    
    def _compute_log_partition(self, emissions, mask):
        """Compute log partition function (forward algorithm)."""
        batch_size, seq_len, num_tags = emissions.shape
        
        # Initialize with start transitions + first emissions
        score = self.start_transitions + emissions[:, 0]  # (batch, tags)
        
        for i in range(1, seq_len):
            # score: (batch, tags) -> (batch, tags, 1)
            # transitions: (tags, tags)
            # emissions: (batch, tags)
            broadcast_score = score.unsqueeze(2)  # (batch, tags, 1)
            broadcast_emissions = emissions[:, i].unsqueeze(1)  # (batch, 1, tags)
            
            next_score = broadcast_score + self.transitions + broadcast_emissions
            next_score = torch.logsumexp(next_score, dim=1)  # (batch, tags)
            
            score = torch.where(mask[:, i].unsqueeze(1), next_score, score)
        
        score += self.end_transitions
        return torch.logsumexp(score, dim=1)
    
    def decode(self, emissions, mask=None):
        """Viterbi decoding."""
        if mask is None:
            mask = torch.ones(emissions.shape[:2], dtype=torch.bool, device=emissions.device)
        
        batch_size, seq_len, num_tags = emissions.shape
        
        score = self.start_transitions + emissions[:, 0]
        history = []
        
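        # Identity backpointers: a padded step keeps the tag chosen at the last real step
        identity = torch.arange(num_tags, device=emissions.device).expand(batch_size, num_tags)
        
        for i in range(1, seq_len):
            broadcast_score = score.unsqueeze(2)  # (batch, prev_tag, 1)
            broadcast_emissions = emissions[:, i].unsqueeze(1)  # (batch, 1, next_tag)
            
            next_score = broadcast_score + self.transitions + broadcast_emissions
            next_score, indices = next_score.max(dim=1)  # best previous tag for each next tag
            
            step_mask = mask[:, i].unsqueeze(1)
            score = torch.where(step_mask, next_score, score)
            # Without this, backtracking through padded steps would pick the wrong tags
            history.append(torch.where(step_mask, indices, identity))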
        
        score += self.end_transitions
        _, best_last_tags = score.max(dim=1)
        
        # Backtrack
        best_tags = [best_last_tags]
        for hist in reversed(history):
            best_last_tags = hist.gather(1, best_last_tags.unsqueeze(1)).squeeze(1)
            best_tags.append(best_last_tags)
        
        best_tags.reverse()
        return torch.stack(best_tags, dim=1)
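
To check the two dynamic programs in isolation, the sketch below reimplements them in plain Python on a tiny example and compares them against brute-force enumeration of all tag paths. The numbers are arbitrary, the function names are illustrative, and the convention assumed here is `transitions[i][j]` = score of moving from tag `i` to tag `j`.

```python
import itertools
import math


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(path, emissions, transitions, start, end):
    """Unnormalized score of one tag path."""
    score = start[path[0]] + emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score + end[path[-1]]


def log_partition(emissions, transitions, start, end):
    """Forward algorithm: log-sum-exp over all paths in O(seq_len * num_tags^2)."""
    n = len(start)
    alpha = [start[j] + emissions[0][j] for j in range(n)]
    for t in range(1, len(emissions)):
        alpha = [logsumexp([alpha[i] + transitions[i][j] for i in range(n)]) + emissions[t][j]
                 for j in range(n)]
    return logsumexp([alpha[j] + end[j] for j in range(n)])


def viterbi(emissions, transitions, start, end):
    """Same recursion with max instead of log-sum-exp, plus backpointers."""
    n = len(start)
    score = [start[j] + emissions[0][j] for j in range(n)]
    backpointers = []
    for t in range(1, len(emissions)):
        best_prev = [max(range(n), key=lambda i: score[i] + transitions[i][j]) for j in range(n)]
        score = [score[best_prev[j]] + transitions[best_prev[j]][j] + emissions[t][j]
                 for j in range(n)]
        backpointers.append(best_prev)
    path = [max(range(n), key=lambda j: score[j] + end[j])]
    for best_prev in reversed(backpointers):
        path.append(best_prev[path[-1]])
    return path[::-1]


# Arbitrary toy scores: 3 time steps, 2 tags
emissions = [[1.0, 0.2], [0.3, 1.0], [0.5, 0.4]]
transitions = [[0.6, -0.4], [-1.0, 0.8]]
start, end = [0.1, -0.2], [0.0, 0.3]
args = (emissions, transitions, start, end)

all_paths = list(itertools.product(range(2), repeat=3))
brute_log_z = logsumexp([path_score(p, *args) for p in all_paths])
brute_best = max(all_paths, key=lambda p: path_score(p, *args))

print(f"log Z (forward) = {log_partition(*args):.6f}, log Z (brute force) = {brute_log_z:.6f}")
print(f"Viterbi path = {viterbi(*args)}, brute-force best = {list(brute_best)}")
```

Brute force enumerates `num_tags ** seq_len` paths, which is only feasible for toy sizes; the recursions give the same answers in time linear in the sequence length.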

## 7. BiLSTM-CRF Model

In [None]:
class BiLSTMCRF(nn.Module):
    """BiLSTM-CRF for NER."""
    
    def __init__(self, vocab_size, tag_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, tag_size)
        self.crf = CRF(tag_size)
    
    def forward(self, x, tags, mask=None):
        """Compute loss."""
        emissions = self._get_emissions(x)
        return self.crf(emissions, tags, mask)
    
    def decode(self, x, mask=None):
        """Predict tags."""
        emissions = self._get_emissions(x)
        return self.crf.decode(emissions, mask)
    
    def _get_emissions(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        return self.fc(lstm_out)

# Test
model = BiLSTMCRF(len(word2idx), len(tag2idx))
x = torch.tensor([[word2idx.get(w, 1) for w in ["John", "works", "at", "Google"]]])
tags = torch.tensor([[tag2idx[t] for t in ["B-PER", "O", "O", "B-ORG"]]])

loss = model(x, tags)
pred = model.decode(x)

print(f"Loss: {loss.item():.4f}")
print(f"Predicted tags: {[idx2tag[i.item()] for i in pred[0]]}")

## 8. 🔥 Real-World Usage

### NER Solutions (2024)

| Priority | Solution |
|----------|----------|
| **Speed** | spaCy (rule-based + small models) |
| **Accuracy** | Fine-tuned BERT for token classification |
| **Domain-specific** | BiLSTM-CRF with domain embeddings |

### Production Pattern
```
1. Start with spaCy + rules
2. Add ML for entities spaCy misses
3. Ensemble for best results
```
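
One possible way to combine rule-based and model predictions is sketched below. The helper names and the precedence policy (rule spans always win, model spans fill the gaps) are assumptions made for illustration; the pattern above does not prescribe a specific merging strategy.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label), token offsets


def overlaps(a: Span, b: Span) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def merge_entities(rule_spans: List[Span], model_spans: List[Span]) -> List[Span]:
    """Keep every rule-based span; add model spans that overlap nothing kept so far."""
    merged = list(rule_spans)
    for span in model_spans:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged)


# Hypothetical predictions for "John works at Google in California"
rule_spans = [(3, 4, "ORG")]                                  # e.g. from a gazetteer
model_spans = [(0, 1, "PER"), (3, 4, "LOC"), (5, 6, "LOC")]   # e.g. from a tagger
print(merge_entities(rule_spans, model_spans))
```

Here the model's conflicting `LOC` label for "Google" is dropped in favor of the rule, while its `PER` and `LOC` predictions elsewhere are kept.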

## 9. Interview Questions

**Q1: Why use CRF instead of softmax per token?**
<details><summary>Answer</summary>

CRF models dependencies between adjacent tags. It learns that I-PER should follow B-PER, not B-ORG. Softmax treats each position independently.
</details>

**Q2: What is BIO tagging?**
<details><summary>Answer</summary>

- B-XXX: Beginning of entity type XXX
- I-XXX: Inside/continuation of entity
- O: Outside any entity

This distinguishes multi-word entities ("New York" → B-LOC, I-LOC).
</details>

## 10. Summary

- **NER**: Token-level classification
- **BIO scheme**: B-/I-/O tags for multi-word entities
- **BiLSTM-CRF**: Classic architecture; still a strong baseline, though fine-tuned transformers are usually more accurate
- **CRF**: Models tag transitions, better than independent softmax

## 11. References

- [Neural Architectures for NER](https://arxiv.org/abs/1603.01360)
- [pytorch-crf library](https://pytorch-crf.readthedocs.io/)
- [spaCy NER](https://spacy.io/usage/linguistic-features#named-entities)

---
**Next:** [Module 11: Language Modeling](../11_language_modeling/11_language_modeling.ipynb)