# Week 1: Introduction to NLP

This notebook covers:
- NLP Tasks Overview
- Data Preprocessing
- Tokenization
- Prompting and Zero-shot Inference

**Prerequisites**: Install required packages:
```bash
pip install openai anthropic transformers torch datasets nltk spacy python-dotenv
```

## 1. Common NLP Tasks

Natural Language Processing encompasses various tasks:

1. **Text Classification**: Categorizing text into predefined classes
2. **Named Entity Recognition (NER)**: Identifying entities like names, locations, organizations
3. **Sentiment Analysis**: Determining the emotional tone of text
4. **Question Answering**: Extracting answers from context
5. **Text Summarization**: Creating concise summaries
6. **Machine Translation**: Converting text between languages
7. **Text Generation**: Creating coherent text based on prompts

## 2. Data Preprocessing

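Preprocessing turns raw text into a cleaner, more uniform form before it reaches a model. The cells below apply the usual steps in order to one sample sentence: lowercasing, noise removal, tokenization, stopword removal, stemming, and lemmatization.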

In [1]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Sample text
text = "Natural Language Processing (NLP) is AMAZING! It helps computers understand human language. #AI #ML"

print("Original text:")
print(text)

Original text:
Natural Language Processing (NLP) is AMAZING! It helps computers understand human language. #AI #ML


In [2]:
# 1. Lowercasing
text_lower = text.lower()
print("\n1. Lowercased:")
print(text_lower)


1. Lowercased:
natural language processing (nlp) is amazing! it helps computers understand human language. #ai #ml


In [3]:
# 2. Remove URLs, mentions, and hashtags
def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

text_clean = clean_text(text_lower)
print("\n2. Cleaned text:")
print(text_clean)


2. Cleaned text:
natural language processing nlp is amazing it helps computers understand human language  


In [4]:
# 3. Tokenization
tokens = word_tokenize(text_clean)
print("\n3. Tokens:")
print(tokens)


3. Tokens:
['natural', 'language', 'processing', 'nlp', 'is', 'amazing', 'it', 'helps', 'computers', 'understand', 'human', 'language']


In [5]:
# 4. Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
print("\n4. Tokens without stopwords:")
print(filtered_tokens)


4. Tokens without stopwords:
['natural', 'language', 'processing', 'nlp', 'amazing', 'helps', 'computers', 'understand', 'human', 'language']


In [6]:
# 5. Stemming
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in filtered_tokens]
print("\n5. Stemmed tokens:")
print(stemmed)


5. Stemmed tokens:
['natur', 'languag', 'process', 'nlp', 'amaz', 'help', 'comput', 'understand', 'human', 'languag']


In [7]:
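# 6. Lemmatization (usually preferred over stemming because it returns dictionary words)
# Note: lemmatize() treats every word as a noun unless a pos argument is given,
# so 'processing' and 'amazing' come back unchanged below.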
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("\n6. Lemmatized tokens:")
print(lemmatized)


6. Lemmatized tokens:
['natural', 'language', 'processing', 'nlp', 'amazing', 'help', 'computer', 'understand', 'human', 'language']


## 3. Modern Tokenization with Transformers

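Modern NLP models use subword tokenization: frequent words stay whole, while rare words are split into smaller known pieces. This keeps the vocabulary small and handles rare words and morphology better than word-level tokenization.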

In [8]:
from transformers import AutoTokenizer

# Different tokenizers for different models
models = [
    "bert-base-uncased",
    "gpt2",
    "t5-small",
    "facebook/bart-base"
]

sample_text = "Tokenization is fundamental for NLP!"

for model_name in models:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokens = tokenizer.tokenize(sample_text)
    token_ids = tokenizer.encode(sample_text)
    
    print(f"\n{model_name}:")
    print(f"  Tokens: {tokens}")
    print(f"  Token IDs: {token_ids}")
    print(f"  Decoded: {tokenizer.decode(token_ids)}")


bert-base-uncased:
  Tokens: ['token', '##ization', 'is', 'fundamental', 'for', 'nl', '##p', '!']
  Token IDs: [101, 19204, 3989, 2003, 8050, 2005, 17953, 2361, 999, 102]
  Decoded: [CLS] tokenization is fundamental for nlp! [SEP]

gpt2:
  Tokens: ['Token', 'ization', 'Ġis', 'Ġfundamental', 'Ġfor', 'ĠN', 'LP', '!']
  Token IDs: [30642, 1634, 318, 7531, 329, 399, 19930, 0]
  Decoded: Tokenization is fundamental for NLP!

t5-small:
  Tokens: ['▁To', 'ken', 'ization', '▁is', '▁fundamental', '▁for', '▁N', 'LP', '!']
  Token IDs: [304, 2217, 1707, 19, 4431, 21, 445, 6892, 55, 1]
  Decoded: Tokenization is fundamental for NLP!</s>

facebook/bart-base:
  Tokens: ['Token', 'ization', 'Ġis', 'Ġfundamental', 'Ġfor', 'ĠN', 'LP', '!']
  Token IDs: [0, 45643, 1938, 16, 6451, 13, 234, 21992, 328, 2]
  Decoded: <s>Tokenization is fundamental for NLP!</s>
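
The `##` pieces in the BERT output above come from WordPiece, which splits a word by repeatedly taking the longest piece found in the vocabulary. The sketch below illustrates that greedy longest-match idea only. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for this example and are not part of the `transformers` API.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (the real bert-base-uncased vocabulary has 30,522 entries).
toy_vocab = {"token", "##ization", "nl", "##p", "is", "fundamental", "for"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("nlp", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

GPT-2 and BART use byte-level BPE instead (the `Ġ` marks a leading space), and T5 uses SentencePiece (the `▁` marks a word start), so their splits follow different rules.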


### Understanding Special Tokens

In [9]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print("Special tokens:")
print(f"  PAD token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
print(f"  CLS token: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")
print(f"  SEP token: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")
print(f"  UNK token: {tokenizer.unk_token} (ID: {tokenizer.unk_token_id})")
print(f"  Vocab size: {tokenizer.vocab_size}")

# Tokenize with special tokens and pad to a fixed length.
# Calling the tokenizer directly is the recommended replacement for encode_plus.
encoded = tokenizer(
    "Hello world!",
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    max_length=10
)

print("\nEncoded output:")
print(f"  Input IDs: {encoded['input_ids']}")
print(f"  Attention Mask: {encoded['attention_mask']}")

Special tokens:
  PAD token: [PAD] (ID: 0)
  CLS token: [CLS] (ID: 101)
  SEP token: [SEP] (ID: 102)
  UNK token: [UNK] (ID: 100)
  Vocab size: 30522

Encoded output:
  Input IDs: [101, 7592, 2088, 999, 102, 0, 0, 0, 0, 0]
  Attention Mask: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
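
The attention mask above marks real tokens with 1 and padding with 0. A minimal sketch of that padding logic follows, written in plain Python so the relationship between the two lists is visible. The helper name `pad_and_mask` is an assumption for illustration; the tokenizer does this work internally.

```python
def pad_and_mask(input_ids, max_length, pad_id=0):
    """Right-pad a list of token IDs and build the matching attention mask."""
    if len(input_ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    n_pad = max_length - len(input_ids)
    padded = input_ids + [pad_id] * n_pad          # pad_id is 0 for bert-base-uncased
    mask = [1] * len(input_ids) + [0] * n_pad      # 1 = real token, 0 = padding
    return padded, mask


# Token IDs for "[CLS] hello world ! [SEP]" taken from the output above.
ids, mask = pad_and_mask([101, 7592, 2088, 999, 102], max_length=10)
print(ids)
print(mask)
```

The model ignores positions where the mask is 0, which is what lets sequences of different lengths share one batch.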


## 4. Prompting and Zero-Shot Inference

Modern LLMs can perform many tasks without task-specific training: the task is described in the prompt, and the model answers with no labeled examples (zero-shot).

### 4.1 Zero-Shot Text Classification with OpenAI

In [10]:
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

# Initialize OpenAI client (requires OPENAI_API_KEY environment variable)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def classify_sentiment_openai(text):
    """Classify sentiment using zero-shot prompting with OpenAI."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a sentiment analysis expert. Classify the sentiment as positive, negative, or neutral."},
            {"role": "user", "content": f"Classify the sentiment of this text: '{text}'\n\nRespond with only one word: positive, negative, or neutral."}
        ],
        temperature=0
    )
    return response.choices[0].message.content.strip().lower()

# Test examples
examples = [
    "I love this product! It's amazing!",
    "This is terrible. I want my money back.",
    "The product arrived on time."
]

print("Sentiment Classification (OpenAI):")
for text in examples:
    sentiment = classify_sentiment_openai(text)
    print(f"  Text: {text}")
    print(f"  Sentiment: {sentiment}\n")

Sentiment Classification (OpenAI):
  Text: I love this product! It's amazing!
  Sentiment: positive

  Text: This is terrible. I want my money back.
  Sentiment: negative

  Text: The product arrived on time.
  Sentiment: positive
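
Even with "respond with only one word" in the prompt, a model may add punctuation or extra words, so it can help to map the raw reply onto the allowed labels before using it. The sketch below shows one way to do that. `normalize_label` and the `"unknown"` fallback are assumptions for this example, not part of the OpenAI client.

```python
import re

ALLOWED = ("positive", "negative", "neutral")


def normalize_label(raw, allowed=ALLOWED, default="unknown"):
    """Map a free-text model reply onto one allowed label, or return the default."""
    words = re.findall(r"[a-z]+", raw.lower())
    hits = {w for w in words if w in allowed}
    # Accept only an unambiguous reply: exactly one distinct allowed label.
    if len(hits) == 1:
        return hits.pop()
    return default


print(normalize_label("Positive."))
print(normalize_label("Sentiment: NEGATIVE"))
print(normalize_label("positive or negative"))
```

Returning an explicit fallback instead of the raw string keeps downstream code from silently treating an unexpected reply as a valid class.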



### 4.2 Zero-Shot with Anthropic Claude

In [14]:
import anthropic
from dotenv import load_dotenv
import os

load_dotenv()

# Initialize Anthropic client (requires ANTHROPIC_API_KEY environment variable)
client_anthropic = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def extract_entities_claude(text):
    """Extract named entities using Claude."""
    message = client_anthropic.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": f"Extract all named entities (people, organizations, locations) from this text: '{text}'\n\nProvide the response as a JSON object with keys: people, organizations, locations."}
        ]
    )
    return message.content[0].text

text = "Apple Inc. CEO Tim Cook announced new products in Cupertino, California. Elon Musk from Tesla also attended."

print("Named Entity Recognition (Claude):")
print(extract_entities_claude(text))

Named Entity Recognition (Claude):
```json
{
  "people": [
    "Tim Cook",
    "Elon Musk"
  ],
  "organizations": [
    "Apple Inc.",
    "Tesla"
  ],
  "locations": [
    "Cupertino",
    "California"
  ]
}
```
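
The reply above is wrapped in a Markdown code fence, so passing it straight to `json.loads` would fail. The sketch below strips an optional fence before parsing. The helper `parse_json_reply` is an assumption for this example, and the sample reply is copied from the output above rather than fetched from the API.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable


def parse_json_reply(reply):
    """Parse a JSON object from a model reply, with or without a Markdown code fence."""
    pattern = FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE
    match = re.search(pattern, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply
    return json.loads(payload)  # raises json.JSONDecodeError if the payload is not valid JSON


reply = (
    FENCE + "json\n"
    '{"people": ["Tim Cook", "Elon Musk"], '
    '"organizations": ["Apple Inc.", "Tesla"], '
    '"locations": ["Cupertino", "California"]}\n'
    + FENCE
)
entities = parse_json_reply(reply)
print(entities["people"])
print(entities["locations"])
```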


### 4.3 Zero-Shot with Hugging Face Transformers

In [19]:
from transformers import pipeline

# Zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "I am excited to learn about natural language processing!"
candidate_labels = ["education", "technology", "sports", "politics"]

result = classifier(text, candidate_labels)

print("Zero-Shot Classification (Hugging Face):")
print(f"Text: {text}\n")
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label}: {score:.4f}")

Device set to use cuda:0


Zero-Shot Classification (Hugging Face):
Text: I am excited to learn about natural language processing!

  technology: 0.8605
  education: 0.1059
  sports: 0.0197
  politics: 0.0139
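
The scores above sum to 1 because, in its default single-label mode, the pipeline (to my understanding) turns each candidate label into an NLI hypothesis such as "This example is technology." and then normalizes the entailment scores across labels with a softmax. The sketch below shows only that last normalization step. The logit values are hypothetical and were not produced by `facebook/bart-large-mnli`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(entailment_logits):
    """Turn {label: entailment logit} into (label, probability) pairs, best first."""
    labels = list(entailment_logits)
    probs = softmax([entailment_logits[label] for label in labels])
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Hypothetical entailment logits, one per candidate label.
logits = {"education": 1.2, "technology": 3.3, "sports": -0.5, "politics": -0.8}
for label, prob in rank_labels(logits):
    print(f"  {label}: {prob:.4f}")
```

Because the scores are relative to the other candidates, adding or removing a label changes every score, even for labels that stay in the list.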


### 4.4 Multiple NLP Tasks with Zero-Shot

In [16]:
# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = """
Artificial intelligence is transforming healthcare in unprecedented ways. Machine learning algorithms 
can now detect diseases earlier and more accurately than traditional methods. Natural language processing 
helps doctors extract insights from medical records. Computer vision assists in analyzing medical images 
like X-rays and MRIs. These technologies are improving patient outcomes and reducing healthcare costs.
"""

summary = summarizer(article, max_length=50, min_length=20, do_sample=False)
print("Summarization:")
print(summary[0]['summary_text'])

Device set to use cuda:0


Summarization:
Artificial intelligence is transforming healthcare in unprecedented ways. Machine learning algorithms can now detect diseases earlier and more accurately than traditional methods. Natural language processing helps doctors extract insights from medical records. Computer vision assists in analyzing medical images like
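
The summary above stops mid-sentence at "like" because `max_length=50` caps the number of generated tokens. One option is to raise `max_length`. Another is to trim the trailing fragment after generation, as in the sketch below. `trim_to_last_sentence` is an assumed helper name, and the rule is deliberately naive: it treats any `.`, `!`, or `?` followed by whitespace as a sentence end, so abbreviations would confuse it.

```python
import re


def trim_to_last_sentence(text):
    """Drop a trailing sentence fragment; return the text unchanged if no sentence end is found."""
    ends = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not ends:
        return text
    return text[: ends[-1].end()]


truncated = (
    "Natural language processing helps doctors extract insights from medical records. "
    "Computer vision assists in analyzing medical images like"
)
print(trim_to_last_sentence(truncated))
```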


In [17]:
# Question Answering
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = "The Eiffel Tower is located in Paris, France. It was completed in 1889 and stands 330 meters tall."
questions = [
    "Where is the Eiffel Tower?",
    "When was it completed?",
    "How tall is it?"
]

print("\nQuestion Answering:")
for question in questions:
    result = qa_pipeline(question=question, context=context)
    print(f"  Q: {question}")
    print(f"  A: {result['answer']} (confidence: {result['score']:.4f})\n")

Device set to use cuda:0



Question Answering:
  Q: Where is the Eiffel Tower?
  A: Paris, France (confidence: 0.9292)

  Q: When was it completed?
  A: 1889 (confidence: 0.9777)

  Q: How tall is it?
  A: 330 meters (confidence: 0.6323)



In [18]:
# Named Entity Recognition
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")

text = "Microsoft was founded by Bill Gates and Paul Allen in Seattle."
entities = ner(text)

print("Named Entity Recognition:")
for entity in entities:
    print(f"  {entity['word']}: {entity['entity_group']} (score: {entity['score']:.4f})")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


Named Entity Recognition:
  Microsoft: ORG (score: 0.9996)
  Bill Gates: PER (score: 0.9918)
  Paul Allen: PER (score: 0.9983)
  Seattle: LOC (score: 0.9982)
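
"Bill Gates" appears as one entity above because `aggregation_strategy="simple"` merges adjacent tokens that belong to the same entity. The sketch below shows the core grouping idea on per-token B-/I- tags. `group_entities` and the hand-written tags are assumptions for this example, and the real pipeline additionally merges subword pieces and combines scores.

```python
def group_entities(tagged_tokens):
    """Merge (word, tag) pairs with B-/I-/O tags into (text, label) entity spans."""
    entities = []
    words, label = [], None
    for word, tag in tagged_tokens:
        prefix, _, tag_label = tag.partition("-")
        # An I- tag with the same label extends the open entity.
        if prefix == "I" and words and tag_label == label:
            words.append(word)
            continue
        # Anything else closes the open entity, if there is one.
        if words:
            entities.append((" ".join(words), label))
            words, label = [], None
        if prefix in ("B", "I"):
            words, label = [word], tag_label
    if words:
        entities.append((" ".join(words), label))
    return entities


# Hand-written tags for the sentence used above (not model output).
tagged = [
    ("Microsoft", "B-ORG"), ("was", "O"), ("founded", "O"), ("by", "O"),
    ("Bill", "B-PER"), ("Gates", "I-PER"), ("and", "O"),
    ("Paul", "B-PER"), ("Allen", "I-PER"), ("in", "O"),
    ("Seattle", "B-LOC"), (".", "O"),
]
print(group_entities(tagged))
```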


## 5. Prompt Engineering Basics

Effective prompting is crucial for getting good results from LLMs.

In [19]:
def compare_prompts(prompts, text):
    """Compare different prompt strategies."""
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    for i, prompt in enumerate(prompts, 1):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "user", "content": prompt.format(text=text)}
            ],
            temperature=0
        )
        print(f"\nPrompt {i}:")
        print(f"{prompt.format(text=text)}")
        print(f"\nResponse:")
        print(response.choices[0].message.content)
        print("-" * 80)

text = "The movie had great visuals but the plot was confusing and dragged on."

prompts = [
    # Bad prompt - vague
    "What do you think about: {text}",
    
    # Better prompt - specific task
    "Analyze the sentiment of this movie review: {text}",
    
    # Best prompt - specific task with format
    """Analyze the sentiment of this movie review: {text}
    
Provide:
1. Overall sentiment (positive/negative/mixed)
2. Positive aspects
3. Negative aspects
4. Sentiment score (0-10)"""
]

compare_prompts(prompts, text)


Prompt 1:
What do you think about: The movie had great visuals but the plot was confusing and dragged on.

Response:
It sounds like you found the movie to be a mixed experience. Great visuals can certainly enhance a film and make it more engaging, but if the plot is confusing and feels drawn out, it can detract from the overall enjoyment. A strong narrative is often essential for connecting with the audience, so it's understandable that a convoluted plot could overshadow the visual appeal. Did any specific scenes or elements stand out to you, either positively or negatively?
--------------------------------------------------------------------------------

Prompt 2:
Analyze the sentiment of this movie review: The movie had great visuals but the plot was confusing and dragged on.

Response:
The sentiment of the movie review is mixed. The reviewer expresses a positive sentiment towards the visuals, indicating that they were impressive or well-done. However, this is contrasted with a nega
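
One practical detail when templating prompts: `compare_prompts` fills in the review with `str.format`, which treats every `{...}` in the prompt as a placeholder. A prompt that shows a literal JSON example would therefore need doubled braces, or a different templating mechanism. The sketch below uses `string.Template` from the standard library for that. The prompt wording and the `render_prompt` helper are assumptions for this example.

```python
from string import Template

# $text is the only placeholder; the JSON braces are plain text for Template.
SENTIMENT_PROMPT = Template(
    "Analyze the sentiment of this movie review: $text\n\n"
    'Respond as JSON, for example: {"sentiment": "mixed", "score": 5}'
)


def render_prompt(template, **fields):
    """Fill a Template; a missing placeholder raises KeyError instead of passing silently."""
    return template.substitute(**fields)


review = "The movie had great visuals but the plot was confusing and dragged on."
prompt = render_prompt(SENTIMENT_PROMPT, text=review)
print(prompt)
```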

## 6. Practice Exercises

### Exercise 1: Text Preprocessing Pipeline
Create a complete preprocessing function that takes raw text and returns cleaned, tokenized text.

In [8]:
def preprocess_pipeline(text, remove_stopwords=True, lemmatize=True):
    """
    Complete preprocessing pipeline for text data.
    
    Args:
        text (str): Raw input text
        remove_stopwords (bool): Whether to remove stopwords
        lemmatize (bool): Whether to lemmatize (True) or stem (False)
    
    Returns:
        list: Cleaned and tokenized text
    """
    # Your code here
    # 1. Lowercase the text
    text_lower = text.lower()
    
    # 2. Remove URLs
    text_clean = re.sub(r'http\S+|www\S+', '', text_lower)
    
    # 3. Remove mentions (@user) and hashtags (#tag)
    text_clean = re.sub(r'@\w+', '', text_clean)
    text_clean = re.sub(r'#\w+', '', text_clean)
    
    # 4. Remove RT (retweet indicator)
    text_clean = re.sub(r'\brt\b', '', text_clean)
    
    # 5. Remove special characters, numbers, and punctuation
    text_clean = re.sub(r'[^a-zA-Z\s]', '', text_clean)
    
    # 6. Remove extra whitespace
    text_clean = ' '.join(text_clean.split())
    
    # 7. Tokenization
    tokens = word_tokenize(text_clean)
    
    # 8. Remove stopwords (optional)
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        filtered_tokens = [word for word in tokens if word not in stop_words]
    else:
        filtered_tokens = tokens
    
    # 9. Lemmatization or Stemming
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        lemmatized = [lemmatizer.lemmatize(word) for word in filtered_tokens]
        final_tokens = lemmatized
    else:
        stemmer = PorterStemmer()
        stemmed = [stemmer.stem(word) for word in filtered_tokens]
        final_tokens = stemmed
    
    # 10. Remove empty strings and single characters (optional)
    final_tokens = [word for word in final_tokens if len(word) > 1]
    
    return final_tokens

# Test your function
test_text = "RT @user: Check out this AMAZING article! https://example.com #NLP #AI 🚀"
result = preprocess_pipeline(test_text)
print("Original text:")
print(test_text)
print("\nPreprocessed tokens:")
print(result)

# Test with different options
print("\n" + "="*80)
print("With stopwords kept:")
result_with_stopwords = preprocess_pipeline(test_text, remove_stopwords=False)
print(result_with_stopwords)

print("\n" + "="*80)
print("With stemming instead of lemmatization:")
result_stemmed = preprocess_pipeline(test_text, lemmatize=False)
print(result_stemmed)



Original text:
RT @user: Check out this AMAZING article! https://example.com #NLP #AI 🚀

Preprocessed tokens:
['check', 'amazing', 'article']

With stopwords kept:
['check', 'out', 'this', 'amazing', 'article']

With stemming instead of lemmatization:
['check', 'amaz', 'articl']


### Exercise 2: Zero-Shot Multi-Class Classification
Use zero-shot classification to categorize news articles into topics.

In [14]:
import os
from openai import OpenAI
from dotenv import load_dotenv
import anthropic
from transformers import pipeline

# Load environment variables
load_dotenv()

True

In [15]:
articles = [
    "The stock market reached record highs today as investors reacted positively to earnings reports.",
    "Scientists discovered a new species of marine life in the deep ocean.",
    "The championship game went into overtime with an exciting finish."
]

# Use any method (OpenAI, Claude, or Hugging Face) to classify these articles
# Categories: business, science, sports, politics, entertainment

# Your code here
# Solution 1: Using Hugging Face Transformers (Recommended for this task)
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
candidate_labels = ["business", "science", "sports", "politics", "entertainment"]

print("Zero-Shot Multi-Class Classification Results:")
print("=" * 80)

for i, article in enumerate(articles, 1):
    result = classifier(article, candidate_labels)
    
    print(f"\nArticle {i}:")
    print(f"Text: {article}")
    print(f"\nPredicted Category: {result['labels'][0]} (confidence: {result['scores'][0]:.4f})")
    print("\nAll scores:")
    for label, score in zip(result['labels'], result['scores']):
        print(f"  {label}: {score:.4f}")
    print("-" * 80)

# Solution 2: Using OpenAI (Alternative approach)
print("\n\nUsing OpenAI API:")
print("=" * 80)

def classify_article_openai(article, categories):
    """Classify article using OpenAI."""
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a text classification expert. Classify articles into one of the given categories."},
            {"role": "user", "content": f"""Classify this article into ONE of these categories: {', '.join(categories)}

Article: {article}

Respond with only the category name."""}
        ],
        temperature=0
    )
    return response.choices[0].message.content.strip().lower()

for i, article in enumerate(articles, 1):
    category = classify_article_openai(article, candidate_labels)
    print(f"\nArticle {i}:")
    print(f"Text: {article}")
    print(f"Predicted Category: {category}")
    print("-" * 80)

# Solution 3: Using Anthropic Claude (Alternative approach)
print("\n\nUsing Anthropic Claude:")
print("=" * 80)

def classify_article_claude(article, categories):
    """Classify article using Claude."""
    client_anthropic = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
    
    message = client_anthropic.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=100,
        messages=[
            {"role": "user", "content": f"""Classify this article into ONE of these categories: {', '.join(categories)}

Article: {article}

Respond with only the category name and a confidence score (0-1)."""}
        ]
    )
    return message.content[0].text

for i, article in enumerate(articles, 1):
    result = classify_article_claude(article, candidate_labels)
    print(f"\nArticle {i}:")
    print(f"Text: {article}")
    print(f"Classification: {result}")
    print("-" * 80)

# Bonus: Batch classification with detailed analysis
print("\n\nBonus: Detailed Multi-Label Analysis")
print("=" * 80)

def detailed_classification(article):
    """Get detailed classification with reasoning."""
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a text classification expert."},
            {"role": "user", "content": f"""Classify this article and provide reasoning.

Article: {article}

Categories: business, science, sports, politics, entertainment

Provide:
1. Primary category
2. Confidence score (0-1)
3. Brief reasoning (one sentence)
4. Secondary category (if applicable)

Format as JSON."""}
        ],
        temperature=0
    )
    return response.choices[0].message.content

for i, article in enumerate(articles, 1):
    result = detailed_classification(article)
    print(f"\nArticle {i}:")
    print(f"Text: {article}")
    print(f"\nDetailed Analysis:")
    print(result)
    print("-" * 80)

Device set to use cuda:0


Zero-Shot Multi-Class Classification Results:

Article 1:
Text: The stock market reached record highs today as investors reacted positively to earnings reports.

Predicted Category: business (confidence: 0.6444)

All scores:
  business: 0.6444
  entertainment: 0.1830
  science: 0.0744
  sports: 0.0572
  politics: 0.0410
--------------------------------------------------------------------------------

Article 2:
Text: Scientists discovered a new species of marine life in the deep ocean.

Predicted Category: science (confidence: 0.9729)

All scores:
  science: 0.9729
  entertainment: 0.0121
  business: 0.0071
  sports: 0.0050
  politics: 0.0028
--------------------------------------------------------------------------------

Article 3:
Text: The championship game went into overtime with an exciting finish.

Predicted Category: sports (confidence: 0.6168)

All scores:
  sports: 0.6168
  entertainment: 0.3617
  business: 0.0136
  science: 0.0044
  politics: 0.0036
-------------------------
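
In the results above, Article 3 is labeled "sports" with only 0.6168, and "entertainment" is not far behind. When a prediction feeds an automated decision, it can be useful to flag such close calls instead of always taking the top label. The sketch below does this with two thresholds. `decide` and the threshold values are assumptions chosen for illustration, not recommended settings.

```python
def decide(labels, scores, min_score=0.5, min_margin=0.2):
    """Return the top label, or 'uncertain' if it is weak or too close to the runner-up."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    top_label, top_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_score < min_score or top_score - runner_up_score < min_margin:
        return "uncertain"
    return top_label


# Scores for Article 3, copied from the output above.
labels = ["sports", "entertainment", "business", "science", "politics"]
scores = [0.6168, 0.3617, 0.0136, 0.0044, 0.0036]

print(decide(labels, scores))                  # default thresholds
print(decide(labels, scores, min_margin=0.3))  # stricter margin
```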

### Exercise 3: Prompt Engineering
Design prompts for extracting structured information from unstructured text.

In [16]:
import os
from openai import OpenAI
from dotenv import load_dotenv
import json
import anthropic
from transformers import pipeline

In [18]:

# Load environment variables
load_dotenv()

job_posting = """
We are looking for a Senior Data Scientist with 5+ years of experience in ML and NLP.
Must have Python, TensorFlow, and PyTorch skills. Salary: $120k-$180k.
Location: San Francisco, CA (Hybrid). Apply by December 31, 2024.
"""

# Create a prompt to extract: position, experience, skills, salary, location, deadline
# Your code here
# Solution 1: Using OpenAI with structured JSON output
print("Solution 1: OpenAI with Structured JSON Extraction")
print("=" * 80)

def extract_job_info_openai(job_text):
    """Extract structured information from job posting using OpenAI."""
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an expert at extracting structured information from job postings. Always respond with valid JSON."},
            {"role": "user", "content": f"""Extract the following information from this job posting:

Job Posting:
{job_text}

Extract and return a JSON object with these fields:
- position: job title
- experience: years of experience required (as a string, e.g., "5+ years")
- skills: list of required technical skills
- salary: salary range (as a string)
- location: work location and type (e.g., "San Francisco, CA (Hybrid)")
- deadline: application deadline (as a string)

If any field is not found, use null.

Respond ONLY with the JSON object, no additional text."""}
        ],
        temperature=0
    )
    
    return response.choices[0].message.content

# Only run if API key is available
if os.getenv("OPENAI_API_KEY"):
    result = extract_job_info_openai(job_posting)
    print("\nExtracted Information:")
    print(result)
    
    # Parse and pretty print
    try:
        parsed = json.loads(result)
        print("\nParsed JSON:")
        print(json.dumps(parsed, indent=2))
    except json.JSONDecodeError:
        print("\nNote: Response may need cleaning to be valid JSON")
else:
    print("OpenAI API key not found. Skipping OpenAI extraction.")

print("\n" + "=" * 80)

# Solution 2: Using Claude with detailed prompt
print("\nSolution 2: Anthropic Claude with Detailed Extraction")
print("=" * 80)

def extract_job_info_claude(job_text):
    """Extract structured information using Claude."""
    client_anthropic = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
    
    message = client_anthropic.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": f"""Analyze this job posting and extract structured information:

Job Posting:
{job_text}

Please extract the following information and format as JSON:
{{
  "position": "job title",
  "experience": "years required",
  "skills": ["skill1", "skill2", ...],
  "salary": "salary range",
  "location": "location and work type",
  "deadline": "application deadline"
}}

Be precise and extract only what's explicitly stated. Use null for missing fields."""}
        ]
    )
    return message.content[0].text

# Only run if API key is available
if os.getenv("ANTHROPIC_API_KEY"):
    result = extract_job_info_claude(job_posting)
    print("\nExtracted Information:")
    print(result)
else:
    print("Anthropic API key not found. Skipping Claude extraction.")

print("\n" + "=" * 80)

# Solution 3: Multi-step extraction with validation
print("\nSolution 3: Multi-Step Extraction with Validation")
print("=" * 80)

def extract_and_validate_job_info(job_text):
    """Extract job info with validation and confidence scores."""
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    # Step 1: Extract information
    extraction_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an expert at extracting structured information from job postings."},
            {"role": "user", "content": f"""Extract information from this job posting:

{job_text}

Provide a JSON object with:
- position
- experience
- skills (as array)
- salary
- location
- deadline

Also include a "confidence" field (0-1) for each extracted piece of information."""}
        ],
        temperature=0
    )
    
    extraction = extraction_response.choices[0].message.content
    
    # Step 2: Validate and format
    validation_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You validate and format extracted job information."},
            {"role": "user", "content": f"""Review this extracted job information and ensure it's accurate:

Original Job Posting:
{job_text}

Extracted Information:
{extraction}

Validate each field and provide:
1. Corrected JSON if needed
2. Brief notes on any corrections made
3. Overall extraction quality (high/medium/low)"""}
        ],
        temperature=0
    )
    
    return {
        "extraction": extraction,
        "validation": validation_response.choices[0].message.content
    }

# Only run if API key is available
if os.getenv("OPENAI_API_KEY"):
    result = extract_and_validate_job_info(job_posting)
    print("\nStep 1 - Extraction:")
    print(result["extraction"])
    print("\nStep 2 - Validation:")
    print(result["validation"])
else:
    print("OpenAI API key not found. Skipping multi-step extraction.")

print("\n" + "=" * 80)

# Solution 4: Comparison of different prompt strategies
print("\nSolution 4: Comparing Different Prompt Strategies")
print("=" * 80)

def compare_extraction_prompts(job_text):
    """Compare different prompting strategies."""
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    prompts = {
        "Basic": f"Extract job information from: {job_text}",
        
        "Structured": f"""Extract these fields from the job posting:
- Position
- Experience
- Skills
- Salary
- Location
- Deadline

Job Posting: {job_text}""",
        
        "Few-Shot": f"""Here's an example of job information extraction:

Input: "Hiring Junior Developer with 2 years experience. Python required. $60k-$80k. NYC."
Output: {{"position": "Junior Developer", "experience": "2 years", "skills": ["Python"], "salary": "$60k-$80k", "location": "NYC", "deadline": null}}

Now extract from this job posting:
{job_text}""",
        
        "Chain-of-Thought": f"""Let's extract job information step by step:

Job Posting: {job_text}

Step 1: Identify the job title
Step 2: Find experience requirements
Step 3: List all technical skills mentioned
Step 4: Extract salary information
Step 5: Determine location and work arrangement
Step 6: Find application deadline

Provide the final answer as JSON."""
    }
    
    results = {}
    for strategy, prompt in prompts.items():
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        results[strategy] = response.choices[0].message.content
    
    return results

# Only run if API key is available
if os.getenv("OPENAI_API_KEY"):
    comparison = compare_extraction_prompts(job_posting)
    
    for strategy, result in comparison.items():
        print(f"\n{strategy} Prompt Strategy:")
        print("-" * 40)
        print(result)
        print()
else:
    print("OpenAI API key not found. Skipping prompt comparison.")

print("\n" + "=" * 80)

# Bonus: Extract from multiple job postings
print("\nBonus: Batch Processing Multiple Job Postings")
print("=" * 80)

job_postings = [
    """
    We are looking for a Senior Data Scientist with 5+ years of experience in ML and NLP.
    Must have Python, TensorFlow, and PyTorch skills. Salary: $120k-$180k.
    Location: San Francisco, CA (Hybrid). Apply by December 31, 2024.
    """,
    """
    Junior Frontend Developer needed! 1-2 years experience with React and JavaScript.
    $50k-$70k annually. Remote position. No deadline specified.
    """,
    """
    Lead DevOps Engineer - 7+ years required. AWS, Docker, Kubernetes essential.
    Competitive salary $150k-$200k. Austin, TX office. Rolling applications.
    """
]

def batch_extract_jobs(job_list):
    """Extract information from several job postings in one API call; returns the raw model reply."""
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    all_jobs = "\n\n---\n\n".join([f"Job {i+1}:\n{job}" for i, job in enumerate(job_list)])
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You extract structured information from multiple job postings."},
            {"role": "user", "content": f"""Extract information from these job postings:

{all_jobs}

Return a JSON array where each element contains:
- job_number
- position
- experience
- skills
- salary
- location
- deadline"""}
        ],
        temperature=0
    )
    
    return response.choices[0].message.content

# Only run if API key is available
if os.getenv("OPENAI_API_KEY"):
    batch_results = batch_extract_jobs(job_postings)
    print("\nBatch Extraction Results:")
    print(batch_results)
    
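    # Try to parse and format nicely
    try:
        # The model may wrap its JSON in a Markdown code fence; strip it before parsing
        cleaned = batch_results.strip().strip("`").removeprefix("json").strip()
        parsed_batch = json.loads(cleaned)
        if isinstance(parsed_batch, dict):
            # Tolerate a single object instead of the requested array
            parsed_batch = [parsed_batch]
        print("\n\nFormatted Results:")
        for job in parsed_batch:
            print(f"\n{job.get('position', 'Unknown Position')}:")
            for key, value in job.items():
                if key != 'position':
                    print(f"  {key}: {value}")
    except json.JSONDecodeError:
        print("\nNote: Could not parse as JSON")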
else:
    print("OpenAI API key not found. Skipping batch extraction.")

print("\n" + "=" * 80)
print("\nKey Takeaways from Exercise 3:")
print("1. Clear, structured prompts tend to yield more complete extractions")
print("2. Specifying the output format (JSON) makes responses easier to parse")
print("3. Few-shot examples can guide the model toward the expected fields")
print("4. Chain-of-thought prompting can help with complex extractions")
print("5. A second validation pass can catch extraction errors")
print("6. No single strategy wins everywhere; compare them on your own task")

Solution 1: OpenAI with Structured JSON Extraction

Extracted Information:
{
  "position": "Senior Data Scientist",
  "experience": "5+ years",
  "skills": [
    "Python",
    "TensorFlow",
    "PyTorch"
  ],
  "salary": "$120k-$180k",
  "location": "San Francisco, CA (Hybrid)",
  "deadline": "December 31, 2024"
}

Parsed JSON:
{
  "position": "Senior Data Scientist",
  "experience": "5+ years",
  "skills": [
    "Python",
    "TensorFlow",
    "PyTorch"
  ],
  "salary": "$120k-$180k",
  "location": "San Francisco, CA (Hybrid)",
  "deadline": "December 31, 2024"
}


Solution 2: Anthropic Claude with Detailed Extraction

Extracted Information:
```json
{
  "position": "Senior Data Scientist",
  "experience": "5+ years",
  "skills": ["ML", "NLP", "Python", "TensorFlow", "PyTorch"],
  "salary": "$120k-$180k",
  "location": "San Francisco, CA (Hybrid)",
  "deadline": "December 31, 2024"
}
```


Solution 3: Multi-Step Extraction with Validation

Step 1 - Extraction:
```json
{
  "position": {
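
The outputs above show that a model may return bare JSON (Solution 1) or JSON wrapped in a Markdown code fence (Solutions 2 and 3), and `json.loads` rejects the fenced form. One way to handle both is sketched below. The helper name `parse_model_json` and the sample replies are hypothetical and exist only for illustration; they are not part of the notebook's solutions.

```python
import json
import re

# Build the three-backtick fence marker programmatically to keep this cell easy to copy.
FENCE = "`" * 3

# Matches an optional "json" language tag and captures whatever sits between two fences.
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_model_json(raw):
    """Hypothetical helper: return the parsed JSON value, or None if parsing fails."""
    match = FENCE_RE.search(raw)
    # Use the fenced payload when a fence is present, otherwise the whole reply.
    candidate = match.group(1) if match else raw.strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


# Made-up replies modeled on the two output styles shown above.
plain_reply = '{"position": "Senior Data Scientist", "skills": ["Python"]}'
fenced_reply = f"{FENCE}json\n{plain_reply}\n{FENCE}"

print(parse_model_json(plain_reply))
print(parse_model_json(fenced_reply))
print(parse_model_json("Sorry, I could not find any job details."))
```

Returning `None` instead of raising keeps the calling cell simple, at the cost of hiding why parsing failed; re-raising the `JSONDecodeError` would be the stricter choice.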

## Summary

In this notebook, you learned:
1. Common NLP tasks and their applications
2. Traditional text preprocessing techniques
3. Modern tokenization with transformer models
4. Zero-shot inference using OpenAI, Anthropic, and Hugging Face
5. Prompt engineering basics for better results

**Next Steps**: Practice with Quiz 1 and prepare for the live session on zero-shot inference!