# Natural Language Processing - Intermediate Level

Welcome to the intermediate Natural Language Processing tasks! This notebook contains three tasks that test your understanding of text preprocessing, word embeddings, named entity recognition (NER), and text generation.

## Tasks Overview
1. **Task 1: Text Classification with Word Embeddings** - Build a classifier using embeddings
2. **Task 2: Named Entity Recognition (NER)** - Identify entities in text
3. **Task 3: Text Generation with RNNs** - Generate text using recurrent networks

Please refer to `tasks.md` for detailed requirements for each task.

In [None]:
# TODO: Import necessary libraries
# Example:
# import numpy as np

---
## Task 1: Text Classification with Word Embeddings

Build a text classifier using pre-trained word embeddings (Word2Vec, GloVe, or FastText).

**Requirements:**
- Load a text classification dataset (e.g., IMDB reviews, news categories)
- Preprocess text (tokenization, padding, etc.)
- Use pre-trained embeddings or train your own
- Build an LSTM or GRU-based classifier
- Achieve at least 80% accuracy
- Display confusion matrix and classification report

**Hints:**
- Download pre-trained embedding vectors (e.g., GloVe, Word2Vec, or FastText)
- Tokenize consistently and pad or truncate all sequences to one fixed length
- If you use pre-trained embeddings, initialize the Embedding layer with their weights
- Consider using bidirectional LSTMs

In [None]:
# TODO: Step 1 - Load and explore dataset
# Load IMDB, news categories, or another text classification dataset
# Explore the data distribution
# Display sample texts and labels

In [None]:
# TODO: Step 2 - Preprocess text
# Tokenize text (e.g., with the Keras Tokenizer, or the TextVectorization layer in newer Keras versions)
# Convert text to sequences
# Pad sequences to uniform length
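
The preprocessing steps above can be sketched in plain Python to show what the tokenizer and padding utilities do under the hood. This is only an illustration, not a replacement for the Keras `Tokenizer` and `pad_sequences` utilities: the helper names (`build_vocab`, `texts_to_sequences`, `pad_sequences_pre`), the reserved indices, and the two sample sentences are all assumptions made for this example. It pads and truncates at the start of each sequence, which is also the default behaviour of Keras `pad_sequences`.

```python
from collections import Counter

PAD, OOV = 0, 1  # reserved indices (assumed): padding and out-of-vocabulary


def build_vocab(texts, max_words=10000):
    """Map the most frequent lowercase words to indices starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(max_words - 2)  # leave room for PAD and OOV
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}


def texts_to_sequences(texts, vocab):
    """Replace every word by its index; unknown words become OOV."""
    return [[vocab.get(word, OOV) for word in text.lower().split()] for text in texts]


def pad_sequences_pre(sequences, maxlen):
    """Keep the last `maxlen` tokens and left-pad shorter sequences with PAD."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([PAD] * (maxlen - len(seq)) + seq)
    return padded


# Hypothetical sample data
texts = ["the movie was great", "the plot was dull and the acting was worse"]
vocab = build_vocab(texts)
padded = pad_sequences_pre(texts_to_sequences(texts, vocab), maxlen=6)
print(padded)
```

Whichever tool you use, fit the vocabulary on the training split only and reuse it for validation and test data, so that unseen test words are treated as out-of-vocabulary.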

In [None]:
# TODO: Step 3 - Load pre-trained embeddings
# Download and load GloVe, Word2Vec, or FastText embeddings
# Create embedding matrix for your vocabulary
# Map words from tokenizer to embedding vectors
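
A minimal sketch of the embedding-matrix step, assuming GloVe-style text input where each line holds a word followed by its space-separated vector components. The in-memory `fake_glove` data, the 3-dimensional vectors, and the `word_index` dictionary are made up for illustration; in practice you would open the downloaded file and use the word index produced by your tokenizer.

```python
import io

# Hypothetical stand-in for a GloVe text file (real files have 50+ dimensions).
fake_glove = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "movie 0.4 0.5 0.6\n"
    "great 0.7 0.8 0.9\n"
)


def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a {word: vector} dictionary."""
    vectors = {}
    for line in handle:
        word, *values = line.rstrip().split(" ")
        vectors[word] = [float(v) for v in values]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 is padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Assumed tokenizer output: indices start at 1, index 0 is reserved for padding.
word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
vectors = load_vectors(fake_glove)
matrix = build_embedding_matrix(word_index, vectors, dim=3)
missing = [w for w in word_index if w not in vectors]
print(f"{len(missing)} of {len(word_index)} words have no pre-trained vector: {missing}")
```

Words without a pre-trained vector keep an all-zero row here; checking how many words fall into that group is a quick sanity test of your tokenization. The resulting matrix can be converted to a NumPy array and used as the initial weights of the Embedding layer.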

In [None]:
# TODO: Step 4 - Build LSTM/GRU classifier
# Create model with Embedding layer (using pre-trained weights)
# Add LSTM or GRU layers (consider bidirectional)
# Add Dense layers for classification
# Compile the model

In [None]:
# TODO: Step 5 - Train the model
# Train your classifier
# Monitor training and validation metrics
# Aim for at least 80% accuracy

In [None]:
# TODO: Step 6 - Evaluate and visualize results
# Generate predictions on test set
# Create confusion matrix
# Display classification report (precision, recall, F1-score)
# Show sample predictions with actual labels
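
To make the evaluation metrics concrete, here is a small from-scratch sketch of a confusion matrix and per-class precision, recall, and F1. The labels and predictions are invented for the example, and the function names are hypothetical; libraries such as scikit-learn provide equivalent, more complete utilities that you would normally use instead.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] counts samples with true class t that were predicted as p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_report(cm):
    """Compute precision, recall, and F1 for every class from the matrix."""
    report = {}
    for k in range(len(cm)):
        tp = cm[k][k]
        fp = sum(row[k] for row in cm) - tp  # predicted k, but true class differs
        fn = sum(cm[k]) - tp                 # true class k, but predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[k] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Made-up labels for a binary sentiment task (0 = negative, 1 = positive)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=2)
report = per_class_report(cm)
accuracy = sum(cm[k][k] for k in range(2)) / len(y_true)
print(cm, accuracy, report)
```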

---
## Task 2: Named Entity Recognition (NER)

Implement a Named Entity Recognition system to identify entities in text.

**Requirements:**
- Use a labeled NER dataset (e.g., CoNLL-2003)
- Build or use a pre-trained NER model
- Identify at least 3 entity types (Person, Location, Organization)
- Calculate precision, recall, and F1-score
- Demonstrate on custom example sentences

**Hints:**
- Consider using a pre-trained NER model from a popular NLP library (e.g., spaCy)
- For custom training, use a BiLSTM-CRF architecture
- Use the BIO (Begin, Inside, Outside) tagging scheme
- Evaluate per-entity-type and overall performance

In [None]:
# TODO: Step 1 - Load NER dataset
# Load CoNLL-2003 or similar NER dataset
# Explore entity types and distribution
# Display sample annotated sentences

In [None]:
# TODO: Step 2 - Preprocess data for NER
# Tokenize and prepare sequences
# Convert entity tags to numerical format (BIO scheme)
# Create train/validation/test splits
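
The BIO conversion mentioned above can be sketched as follows. The sentence, the span format `(start, end_exclusive, type)`, and the tag list are assumptions for illustration; datasets such as CoNLL-2003 already ship with per-token tags, so there you only need the tag-to-index mapping.

```python
def spans_to_bio(tokens, spans):
    """Turn (start, end_exclusive, type) token spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"       # remaining tokens of the entity
    return tags


# Assumed tag inventory for three entity types
tag_names = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
tag2idx = {tag: i for i, tag in enumerate(tag_names)}

# Hypothetical annotated sentence
tokens = ["Angela", "Merkel", "visited", "New", "York", "."]
spans = [(0, 2, "PER"), (3, 5, "LOC")]

tags = spans_to_bio(tokens, spans)
tag_ids = [tag2idx[t] for t in tags]
print(list(zip(tokens, tags)))
print(tag_ids)
```

Build `tag2idx` once from the full tag inventory rather than from a single sentence, so that every split uses the same numeric encoding.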

In [None]:
# TODO: Step 3 - Build or load NER model
# Option 1: Use spaCy pre-trained model
# Option 2: Build BiLSTM-CRF model from scratch
# Configure for sequence tagging task

In [None]:
# TODO: Step 4 - Train or fine-tune the model
# If using a custom model, train it on the NER dataset
# If using a pre-trained model, skip this step or fine-tune it
# Monitor performance during training

In [None]:
# TODO: Step 5 - Evaluate the model
# Calculate precision, recall, F1-score per entity type
# Display overall metrics
# Show confusion matrix if applicable
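
NER is usually scored at the entity level: a prediction counts only if both the span boundaries and the type match exactly. The sketch below illustrates that idea under a strict reading of BIO tags (an `I-` tag without a preceding matching `B-` tag is ignored). The function names and example sequences are hypothetical, and dedicated evaluation libraries handle more edge cases.

```python
def extract_entities(tags):
    """Return the set of (start, end_exclusive, type) spans in a BIO sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_scores(true_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = fp = fn = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up example: the predicted LOC span is one token too short
true_seqs = [["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]]
pred_seqs = [["B-PER", "I-PER", "O", "B-LOC", "O", "O"]]
precision, recall, f1 = entity_scores(true_seqs, pred_seqs)
print(precision, recall, f1)
```

In this example five of six tokens are tagged correctly, yet only one of two entities matches exactly, which is why token accuracy tends to overstate NER quality. Per-type scores follow by filtering the extracted spans on their type before counting.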

In [None]:
# TODO: Step 6 - Test on custom sentences
# Create custom example sentences
# Run NER on them
# Highlight identified entities with their types

---
## Task 3: Text Generation with RNNs

Create a text generation model using Recurrent Neural Networks.

**Requirements:**
- Train on a text corpus (e.g., Shakespeare, song lyrics, or code)
- Implement character-level or word-level generation
- Use LSTM or GRU cells
- Generate coherent text samples with temperature sampling
- Compare outputs at different temperature values
- Show the model's ability to learn patterns from the training data

**Hints:**
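- Character-level generation is simpler to implement; word-level generation usually produces more coherent text
- Use one-hot encoding or embeddings for input
- Temperature controls randomness: low values give conservative, repetitive text; high values give more varied but more error-prone text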
- Train on enough data for meaningful patterns

In [None]:
# TODO: Step 1 - Load and prepare text corpus
# Load your chosen corpus (Shakespeare, lyrics, code, etc.)
# Clean and preprocess the text
# Decide on character-level or word-level approach

In [None]:
# TODO: Step 2 - Create training sequences
# Generate input-output pairs for training
# Use sliding window approach
# Create vocabulary mapping (char/word to index)
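
A minimal sketch of the sliding-window step for a character-level model. The tiny corpus, the window length of 4, and the function name `make_sequences` are assumptions for illustration; a real corpus would use a much longer window.

```python
def make_sequences(text, char2idx, seq_len, step=1):
    """Slide a window over the text: seq_len characters in, the next one out."""
    inputs, targets = [], []
    for i in range(0, len(text) - seq_len, step):
        inputs.append([char2idx[c] for c in text[i:i + seq_len]])
        targets.append(char2idx[text[i + seq_len]])
    return inputs, targets


# Hypothetical toy corpus
text = "hello world"

# Vocabulary mapping in both directions
chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}
idx2char = {i: c for c, i in char2idx.items()}

X, y = make_sequences(text, char2idx, seq_len=4)
for seq, target in zip(X[:3], y[:3]):
    print("".join(idx2char[i] for i in seq), "->", idx2char[target])
```

A larger `step` produces fewer, less overlapping training pairs, which trades some training signal for lower memory use.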

In [None]:
# TODO: Step 3 - Build LSTM/GRU text generation model
# Create model with Embedding (if word-level) or one-hot encoding
# Add LSTM or GRU layers (stacked if needed)
# Output layer with softmax for next token prediction
# Compile with appropriate loss function

In [None]:
# TODO: Step 4 - Train the model
# Train on your prepared sequences
# Monitor loss to ensure learning
# Save model checkpoints

In [None]:
# TODO: Step 5 - Implement text generation with temperature
# Create generation function with temperature parameter
# Sample next token based on probability distribution
# Generate sequences of desired length
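
Temperature sampling can be sketched independently of any model. The code below assumes the model outputs a probability distribution over the vocabulary and rescales it as `p_i^(1/T)`, renormalized to sum to 1; the three-token distribution and the function names are invented for the example.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution: divide log-probabilities by T, then softmax."""
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_with_temperature(probs, temperature, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


# Hypothetical model output over a three-token vocabulary
probs = [0.6, 0.3, 0.1]
for t in (0.2, 0.5, 1.0, 1.5):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])

rng = random.Random(0)  # seeded generator for reproducible samples
print([sample_with_temperature(probs, 1.0, rng) for _ in range(10)])
```

A temperature of 1.0 leaves the distribution unchanged, values below 1.0 concentrate probability on the most likely token, and values above 1.0 flatten the distribution. In a generation loop, the sampled index is appended to the input window and the model is queried again.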

In [None]:
# TODO: Step 6 - Generate and compare samples
# Generate text with different temperatures (e.g., 0.2, 0.5, 1.0, 1.5)
# Compare outputs: low temperature = conservative, high = creative
# Display multiple samples from different seed texts
# Analyze the model's learned patterns