# Week 1: Introduction to NLP

This notebook covers:
- NLP Tasks Overview
- Data Preprocessing
- Tokenization
- Prompting and Zero-shot Inference

**Prerequisites**: Install required packages:
```bash
pip install openai anthropic transformers torch datasets nltk spacy
```

## 1. Common NLP Tasks

Natural Language Processing encompasses various tasks:

1. **Text Classification**: Categorizing text into predefined classes
2. **Named Entity Recognition (NER)**: Identifying entities like names, locations, organizations
3. **Sentiment Analysis**: Determining the emotional tone of text
4. **Question Answering**: Extracting answers from context
5. **Text Summarization**: Creating concise summaries
6. **Machine Translation**: Converting text between languages
7. **Text Generation**: Creating coherent text based on prompts

## 2. Data Preprocessing

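Preprocessing normalizes raw text (casing, noise, word forms) before it is passed to an NLP model. The cells below apply the classic steps one at a time to a single sample sentence.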

In [None]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Sample text
text = "Natural Language Processing (NLP) is AMAZING! It helps computers understand human language. #AI #ML"

print("Original text:")
print(text)

In [None]:
# 1. Lowercasing
text_lower = text.lower()
print("\n1. Lowercased:")
print(text_lower)

In [None]:
# 2. Remove URLs, mentions, hashtags, and non-letter characters
def clean_text(text):
    # Remove URLs (must run before punctuation is stripped, while "://" is intact)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)
    # Remove everything except ASCII letters and whitespace (digits, punctuation)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

text_clean = clean_text(text_lower)
print("\n2. Cleaned text:")
print(text_clean)
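
The sample sentence above contains no URLs or mentions, so those branches are never exercised. The sketch below runs the same kind of cleanup on a made-up tweet-like string; `clean_tweet` and the sample text are illustrative names, not part of the notebook, and the extra whitespace-collapsing step is an assumption about what a tidy output should look like.

```python
import re

def clean_tweet(text):
    """Strip URLs, mentions, hashtags, and non-letters, then tidy whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs first, while "://" is intact
    text = re.sub(r"[@#]\w+", "", text)                # @mentions and #hashtags
    text = re.sub(r"[^a-zA-Z\s]", "", text)            # digits and punctuation
    return " ".join(text.split())                      # collapse leftover gaps

sample = "check out https://example.com by @user #nlp it scored 10/10!"
print(clean_tweet(sample))  # check out by it scored

# Stripping punctuation before URLs leaves the URL behind as a fake word.
print(re.sub(r"[^a-zA-Z\s]", "", "see https://example.com"))  # see httpsexamplecom
```

The second print shows why the order of the substitutions matters: once the punctuation is gone, the URL pattern can no longer recognize the link.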

In [None]:
# 3. Tokenization
tokens = word_tokenize(text_clean)
print("\n3. Tokens:")
print(tokens)

In [None]:
# 4. Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
print("\n4. Tokens without stopwords:")
print(filtered_tokens)
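
One caveat with stopword removal: NLTK's English list includes negations such as "not" and "no", so filtering can flip the meaning of a sentence before a sentiment model sees it. The sketch below uses a tiny hand-written stop list (an assumption, much smaller than NLTK's) and a hypothetical `keep` parameter to show one way to protect those words.

```python
# Hand-written stand-in for a real stop list (hypothetical, for illustration only).
STOP_WORDS = {"the", "is", "was", "not", "a", "this"}
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stopwords(tokens, stop_words, keep=frozenset()):
    """Drop stopwords, except those explicitly listed in `keep`."""
    return [t for t in tokens if t not in stop_words or t in keep]

tokens = ["the", "movie", "was", "not", "good"]

naive = remove_stopwords(tokens, STOP_WORDS)
safe = remove_stopwords(tokens, STOP_WORDS, keep=NEGATIONS)

print(naive)  # ['movie', 'good']
print(safe)   # ['movie', 'not', 'good']
```

Whether to keep negations depends on the downstream task: for topic classification they rarely matter, for sentiment they often do.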

In [None]:
# 5. Stemming
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in filtered_tokens]
print("\n5. Stemmed tokens:")
print(stemmed)

In [None]:
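# 6. Lemmatization: returns dictionary forms rather than truncated stems.
# Note: lemmatize() treats every word as a noun unless a pos argument is passed.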
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("\n6. Lemmatized tokens:")
print(lemmatized)
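
To make the difference between the two steps concrete, here is a deliberately tiny sketch. `toy_stem` is a crude suffix stripper and `toy_lemmatize` is a lookup table; both are hypothetical stand-ins and far simpler than the Porter algorithm or WordNet, but they show the same contrast: stemming cuts characters by rule, lemmatization maps to a dictionary form.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary; a real lemmatizer consults a full lexicon.
LEMMA_TABLE = {"studies": "study", "ran": "run", "mice": "mouse"}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

for word in ["studies", "ran", "mice", "helps"]:
    print(f"{word:8} stem={toy_stem(word):8} lemma={toy_lemmatize(word)}")
```

The stemmer produces the non-word "studi" and cannot connect "ran" to "run", while the lookup handles irregular forms but only for words it knows about.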

## 3. Modern Tokenization with Transformers

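Modern NLP models use subword tokenization: a rare or unseen word is split into smaller pieces drawn from a fixed vocabulary, so the model can still represent it instead of falling back to an unknown token.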
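
As a rough illustration of the idea, the sketch below implements greedy longest-match splitting in the style of WordPiece, where continuation pieces carry a `##` prefix. The vocabulary is made up for this example and the function is a simplification, not the code that any real tokenizer runs.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary; real ones hold tens of thousands of entries.
TOY_VOCAB = {"token", "##ization", "##ize", "nlp"}

print(wordpiece_tokenize("tokenization", TOY_VOCAB))  # ['token', '##ization']
print(wordpiece_tokenize("tokenize", TOY_VOCAB))      # ['token', '##ize']
print(wordpiece_tokenize("xyz", TOY_VOCAB))           # ['[UNK]']
```

Other schemes such as byte-pair encoding build and apply their vocabularies differently, which is one reason the same sentence can tokenize differently across models.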

In [None]:
from transformers import AutoTokenizer

# Different tokenizers for different models
models = [
    "bert-base-uncased",
    "gpt2",
    "t5-small",
    "facebook/bart-base"
]

sample_text = "Tokenization is fundamental for NLP!"

for model_name in models:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokens = tokenizer.tokenize(sample_text)
    token_ids = tokenizer.encode(sample_text)
    
    print(f"\n{model_name}:")
    print(f"  Tokens: {tokens}")
    print(f"  Token IDs: {token_ids}")
    print(f"  Decoded: {tokenizer.decode(token_ids)}")

### Understanding Special Tokens

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print("Special tokens:")
print(f"  PAD token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
print(f"  CLS token: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")
print(f"  SEP token: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")
print(f"  UNK token: {tokenizer.unk_token} (ID: {tokenizer.unk_token_id})")
print(f"  Vocab size: {tokenizer.vocab_size}")

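# Tokenize with special tokens by calling the tokenizer directly
# (the recommended replacement for the older encode_plus method)
encoded = tokenizer(
    "Hello world!",
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=10
)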

print("\nEncoded output:")
print(f"  Input IDs: {encoded['input_ids']}")
print(f"  Attention Mask: {encoded['attention_mask']}")
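
The attention mask is easier to read once you see how it lines up with padding. The sketch below pads a list of token IDs by hand; `pad_with_mask` and the ID values are hypothetical placeholders chosen for readability, not real BERT vocabulary IDs.

```python
def pad_with_mask(input_ids, max_length, pad_id=0):
    """Pad (or naively cut) a sequence and build the matching attention mask."""
    # Naive cut for brevity; real tokenizers truncate while keeping special tokens.
    input_ids = input_ids[:max_length]
    n_pad = max_length - len(input_ids)
    padded = input_ids + [pad_id] * n_pad
    mask = [1] * len(input_ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return padded, mask

# Placeholder IDs: 1 stands in for [CLS], 2 for [SEP], 11-13 for the words.
ids, mask = pad_with_mask([1, 11, 12, 13, 2], max_length=8)

print(ids)   # [1, 11, 12, 13, 2, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The model uses the mask to ignore the padded positions, which is what lets sequences of different lengths share one batch.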

## 4. Prompting and Zero-Shot Inference

Modern LLMs can perform many tasks without task-specific training: the task is described in a carefully written prompt, and the model answers directly.

### 4.1 Zero-Shot Text Classification with OpenAI

In [None]:
import os
from openai import OpenAI

# Initialize OpenAI client (requires OPENAI_API_KEY environment variable)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def classify_sentiment_openai(text):
    """Classify sentiment using zero-shot prompting with OpenAI."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a sentiment analysis expert. Classify the sentiment as positive, negative, or neutral."},
            {"role": "user", "content": f"Classify the sentiment of this text: '{text}'\n\nRespond with only one word: positive, negative, or neutral."}
        ],
        temperature=0
    )
    return response.choices[0].message.content.strip().lower()

# Test examples
examples = [
    "I love this product! It's amazing!",
    "This is terrible. I want my money back.",
    "The product arrived on time."
]

print("Sentiment Classification (OpenAI):")
for text in examples:
    sentiment = classify_sentiment_openai(text)
    print(f"  Text: {text}")
    print(f"  Sentiment: {sentiment}\n")
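
Even with temperature 0 and a "one word only" instruction, a model may reply with "Positive." or a short sentence. A small post-processing step makes the label safe to use downstream. The helper below is a sketch; `normalize_label` and its `"unknown"` fallback are assumptions for illustration, not part of the OpenAI client.

```python
import re

ALLOWED = ("positive", "negative", "neutral")

def normalize_label(raw, allowed=ALLOWED, default="unknown"):
    """Return the single allowed label found in the reply, else `default`."""
    words = re.findall(r"[a-z]+", raw.lower())
    matches = [w for w in words if w in allowed]
    # Accept the reply only when it names exactly one distinct label.
    return matches[0] if len(set(matches)) == 1 else default

print(normalize_label("Positive."))                          # positive
print(normalize_label("  NEUTRAL\n"))                        # neutral
print(normalize_label("It could be positive or negative."))  # unknown
```

Replies that come back as `"unknown"` can be logged and reviewed, which is usually more useful than silently passing free text along.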

### 4.2 Zero-Shot with Anthropic Claude

In [None]:
import anthropic

# Initialize Anthropic client (requires ANTHROPIC_API_KEY environment variable)
client_anthropic = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def extract_entities_claude(text):
    """Extract named entities using Claude."""
    message = client_anthropic.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": f"Extract all named entities (people, organizations, locations) from this text: '{text}'\n\nProvide the response as a JSON object with keys: people, organizations, locations."}
        ]
    )
    return message.content[0].text

text = "Apple Inc. CEO Tim Cook announced new products in Cupertino, California. Elon Musk from Tesla also attended."

print("Named Entity Recognition (Claude):")
print(extract_entities_claude(text))
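
The function above returns the reply as raw text, and a model may wrap the JSON in a sentence or two. The sketch below pulls out the outermost `{...}` span and parses it with the standard library; `parse_entities` is a hypothetical helper, and the example reply is hand-written rather than real model output.

```python
import json

EXPECTED_KEYS = ("people", "organizations", "locations")

def parse_entities(reply):
    """Parse the first-to-last brace span as JSON; return None if that fails."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(reply[start : end + 1])
    except json.JSONDecodeError:
        return None
    # Fill in missing keys and ignore values that are not lists.
    return {
        key: data[key] if isinstance(data.get(key), list) else []
        for key in EXPECTED_KEYS
    }

# Hand-written example reply (not actual model output).
reply = 'Here are the entities:\n{"people": ["Tim Cook"], "organizations": ["Apple Inc."]}\nDone.'
print(parse_entities(reply))
```

Returning `None` on failure keeps the caller in control of what happens next, for example retrying the request with a stricter prompt.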

### 4.3 Zero-Shot with Hugging Face Transformers

In [None]:
from transformers import pipeline

# Zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "I am excited to learn about natural language processing!"
candidate_labels = ["education", "technology", "sports", "politics"]

result = classifier(text, candidate_labels)

print("Zero-Shot Classification (Hugging Face):")
print(f"Text: {text}\n")
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label}: {score:.4f}")
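
Under the hood, this pipeline reframes classification as natural language inference: each candidate label is inserted into a hypothesis sentence (the default template is "This example is {}."), the model scores how strongly the text entails each hypothesis, and in the single-label case those scores are normalized with a softmax. The sketch below reproduces only that bookkeeping; the entailment logits are made-up numbers, not outputs of `facebook/bart-large-mnli`.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["education", "technology", "sports", "politics"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits, one per hypothesis (invented for illustration).
entailment_logits = [2.0, 1.5, -1.0, -2.0]
scores = softmax(entailment_logits)

ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"  {label}: {score:.4f}")
```

Because the label text itself becomes part of the hypothesis, descriptive labels ("customer complaint") tend to work better than terse codes ("cat_3").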

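### 4.4 Multiple NLP Tasks with Pretrained Pipelines

The pipelines in this section load models that were already fine-tuned for one task each (summarization, question answering, NER). You do not train anything yourself, but unlike the examples above they are not zero-shot in the strict sense.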

In [None]:
# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = """
Artificial intelligence is transforming healthcare in unprecedented ways. Machine learning algorithms 
can now detect diseases earlier and more accurately than traditional methods. Natural language processing 
helps doctors extract insights from medical records. Computer vision assists in analyzing medical images 
like X-rays and MRIs. These technologies are improving patient outcomes and reducing healthcare costs.
"""

summary = summarizer(article, max_length=50, min_length=20, do_sample=False)
print("Summarization:")
print(summary[0]['summary_text'])

In [None]:
# Question Answering
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = "The Eiffel Tower is located in Paris, France. It was completed in 1889 and stands 330 meters tall."
questions = [
    "Where is the Eiffel Tower?",
    "When was it completed?",
    "How tall is it?"
]

print("\nQuestion Answering:")
for question in questions:
    result = qa_pipeline(question=question, context=context)
    print(f"  Q: {question}")
    print(f"  A: {result['answer']} (confidence: {result['score']:.4f})\n")

In [None]:
# Named Entity Recognition
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")

text = "Microsoft was founded by Bill Gates and Paul Allen in Seattle."
entities = ner(text)

print("Named Entity Recognition:")
for entity in entities:
    print(f"  {entity['word']}: {entity['entity_group']} (score: {entity['score']:.4f})")
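
A token-level NER model emits one tag per token, commonly in a B-/I-/O scheme, and the `aggregation_strategy` argument asks the pipeline to merge those into whole entities. The sketch below shows the general idea of that merge on hand-written tags; `group_bio` is a simplified, hypothetical helper and not the library's implementation.

```python
def group_bio(tokens, tags):
    """Merge B-/I-/O tagged tokens into (entity text, entity type) pairs."""
    entities = []
    words, ent_type = [], None

    def flush():
        if words:
            entities.append((" ".join(words), ent_type))

    for token, tag in zip(tokens, tags):
        prefix, _, tag_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and tag_type != ent_type):
            flush()                      # a new entity starts here
            words, ent_type = [token], tag_type
        elif prefix == "I":
            words.append(token)          # continue the current entity
        else:
            flush()                      # "O": outside any entity
            words, ent_type = [], None
    flush()
    return entities

# Hand-written tags for illustration (not model output).
tokens = ["Microsoft", "was", "founded", "by", "Bill", "Gates"]
tags = ["B-ORG", "O", "O", "O", "B-PER", "I-PER"]
print(group_bio(tokens, tags))  # [('Microsoft', 'ORG'), ('Bill Gates', 'PER')]
```

The real pipeline additionally has to rejoin subword pieces and combine per-token scores, which this sketch leaves out.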

## 5. Prompt Engineering Basics

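The same model can give very different answers depending on how precisely the prompt states the task and the expected output format. The cell below sends three prompts of increasing specificity for the same review.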

In [None]:
def compare_prompts(prompts, text):
    """Compare different prompt strategies."""
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    for i, prompt in enumerate(prompts, 1):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "user", "content": prompt.format(text=text)}
            ],
            temperature=0
        )
        print(f"\nPrompt {i}:")
        print(f"{prompt.format(text=text)}")
        print(f"\nResponse:")
        print(response.choices[0].message.content)
        print("-" * 80)

text = "The movie had great visuals but the plot was confusing and dragged on."

prompts = [
    # Bad prompt - vague
    "What do you think about: {text}",
    
    # Better prompt - specific task
    "Analyze the sentiment of this movie review: {text}",
    
    # Best prompt - specific task with format
    """Analyze the sentiment of this movie review: {text}
    
Provide:
1. Overall sentiment (positive/negative/mixed)
2. Positive aspects
3. Negative aspects
4. Sentiment score (0-10)"""
]

compare_prompts(prompts, text)
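
One practical detail when templates are filled with `str.format`, as in `compare_prompts`: if the prompt asks for JSON and shows literal braces, those braces must be doubled, otherwise `format` treats them as placeholders. The template text below is an illustrative assumption, not one of the notebook's prompts.

```python
# Doubled braces survive formatting as literal { and }.
template = (
    "Analyze the sentiment of this movie review: {text}\n"
    'Reply with JSON only: {{"sentiment": "<label>", "score": <0-10>}}'
)
prompt = template.format(text="Great visuals, confusing plot.")
print(prompt)

# The same template with single braces is parsed as a placeholder and fails.
broken = 'Review: {text}\nReply with JSON only: {"sentiment": "<label>"}'
failed = False
try:
    broken.format(text="Great visuals, confusing plot.")
except (KeyError, ValueError) as err:
    failed = True
    print(f"Unescaped braces fail with {type(err).__name__}")
```

This comes up as soon as a prompt specifies a structured output format, for example when extracting fields from a document.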

## 6. Practice Exercises

### Exercise 1: Text Preprocessing Pipeline
Create a complete preprocessing function that takes raw text and returns cleaned, tokenized text.

In [None]:
def preprocess_pipeline(text, remove_stopwords=True, lemmatize=True):
    """Complete preprocessing pipeline."""
    # Your code here
    pass

# Test your function
test_text = "RT @user: Check out this AMAZING article! https://example.com #NLP #AI 🚀"
# result = preprocess_pipeline(test_text)
# print(result)

### Exercise 2: Zero-Shot Multi-Class Classification
Use zero-shot classification to categorize news articles into topics.

In [None]:
articles = [
    "The stock market reached record highs today as investors reacted positively to earnings reports.",
    "Scientists discovered a new species of marine life in the deep ocean.",
    "The championship game went into overtime with an exciting finish."
]

# Use any method (OpenAI, Claude, or Hugging Face) to classify these articles
# Categories: business, science, sports, politics, entertainment

# Your code here

### Exercise 3: Prompt Engineering
Design prompts for extracting structured information from unstructured text.

In [None]:
job_posting = """
We are looking for a Senior Data Scientist with 5+ years of experience in ML and NLP.
Must have Python, TensorFlow, and PyTorch skills. Salary: $120k-$180k.
Location: San Francisco, CA (Hybrid). Apply by December 31, 2024.
"""

# Create a prompt to extract: position, experience, skills, salary, location, deadline
# Your code here

## Summary

In this notebook, you learned:
1. Common NLP tasks and their applications
2. Traditional text preprocessing techniques
3. Modern tokenization with transformer models
4. Zero-shot inference using OpenAI, Anthropic, and Hugging Face
5. Prompt engineering basics for better results

**Next Steps**: Practice with Quiz 1 and prepare for the live session on zero-shot inference!