# Week 7: Natural Language Processing

## 1. Hugging Face Pipeline API
- Sentiment Analysis
- Zero-shot Classification
- Text Generation
- Feature Extraction
- Named Entity Recognition
- Question Answering

In [1]:
from transformers import pipeline

In [2]:
sentiment_analyzer = pipeline("sentiment-analysis")
texts = [
    "I absolutely love this product!",
    "This is the worst experience ever.",
    "It's okay, nothing special.",
]
sentiments = sentiment_analyzer(texts)
for text, sentiment in zip(texts, sentiments):
    print(f"Text: {text}")
    print(f"Sentiment: {sentiment['label']}, Score: {sentiment['score']:.3f}\n")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Text: I absolutely love this product!
Sentiment: POSITIVE, Score: 1.000

Text: This is the worst experience ever.
Sentiment: NEGATIVE, Score: 1.000

Text: It's okay, nothing special.
Sentiment: NEGATIVE, Score: 0.819
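
The `score` in each result is the probability the model assigns to the predicted label. For a two-class classifier such as the default SST-2 model, that probability comes from a softmax over the model's two output logits. A minimal sketch in plain Python follows; the logits are made up for illustration and are not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [NEGATIVE, POSITIVE]
labels = ["NEGATIVE", "POSITIVE"]
logits = [1.2, -0.3]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
print(f"Sentiment: {labels[best]}, Score: {probs[best]:.3f}")
```

When the two logits are close together, the top probability drops toward 0.5, which is one way to read the lower score on the third, lukewarm sentence.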



In [4]:
classifier = pipeline("zero-shot-classification")
labels = ["technology", "sports", "politics"]
text = "The new AI model was released today."
result = classifier(text, labels)
print(f"Text: {text}")
print(f"Best label: {result['labels'][0]}")

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Text: The new AI model was released today.
Best label: technology


In [6]:
result2 = classifier(
    "The course about AI is amazing",
    candidate_labels=["education", "politics", "business"],
)
print(f"Zero-shot: {result2['labels'][0]} (confidence: {result2['scores'][0]:.3f})")

Zero-shot: education (confidence: 0.865)
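
Zero-shot classification is built on natural language inference: each candidate label is turned into a hypothesis sentence, the model scores how strongly the input entails it, and the entailment scores are normalized across labels. The sketch below mimics that last step in plain Python. The entailment logits are hypothetical placeholders, and the template string is assumed to match the pipeline's default `hypothesis_template`.

```python
import math

text = "The course about AI is amazing"
candidate_labels = ["education", "politics", "business"]
template = "This example is {}."  # assumed default hypothesis template

# One (premise, hypothesis) pair per candidate label
pairs = [(text, template.format(label)) for label in candidate_labels]

# Hypothetical entailment logits an NLI model might return for each pair
entailment_logits = [2.5, -1.0, 0.4]

# Softmax across labels so the scores sum to 1 (single-label setting)
peak = max(entailment_logits)
exps = [math.exp(x - peak) for x in entailment_logits]
scores = [e / sum(exps) for e in exps]

# Sort labels by score, highest first, like result["labels"] / result["scores"]
ranked = sorted(zip(candidate_labels, scores), key=lambda p: p[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```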


In [3]:
generator = pipeline("text-generation", model="gpt2")
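# Note: max_length counts prompt plus generated tokens. In this run the default
# max_new_tokens=256 took precedence (see the warning in the output), so the
# text is longer than 50 tokens; pass max_new_tokens to cap the length directly.
generated = generator("The future of AI is", max_length=50, num_return_sequences=1)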
print("Generated text:", generated[0]["generated_text"])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generated text: The future of AI is uncertain. While the technology is advancing it continues to evolve. It can be used to drive intelligent machines, but it also can be used to produce artificial intelligence.

In the early stages AI was not so useful because it was used to create complex games. Today it is so developed that it is capable of producing a lot of interesting games. It is also very complex.

AI is very complex because it has to deal with many different factors, such as its environment, its behavior, its environment and its environment's environment. It has to deal with many different variables, such as its personality, its personality's personality, its disposition, its disposition's disposition, its disposition's disposition, its disposition's disposition's disposition, its disposition's disposition's disposition, its disposition's disposition's disposition, its disposition's disposition's disposition, its disposition's disposition's disposition, its disposition's dispos

In [7]:
feature_extractor = pipeline("feature-extraction", model="bert-base-uncased")
embeddings = feature_extractor("Machine learning is fascinating")
print(
    f"Embeddings shape: {len(embeddings[0])} tokens, {len(embeddings[0][0])} dimensions"
)

Embeddings shape: 6 tokens, 768 dimensions
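
The feature-extraction pipeline returns one 768-dimensional vector per token (six here: the four words plus the `[CLS]` and `[SEP]` special tokens). A common way to collapse these into a single sentence vector is mean pooling. The sketch below uses a tiny hypothetical 3-token, 4-dimensional nested list in place of real model output.

```python
def mean_pool(token_vectors):
    """Average token vectors element-wise into one sentence vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]

# Hypothetical stand-in for embeddings[0]: 3 tokens x 4 dimensions
token_vectors = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]

sentence_vector = mean_pool(token_vectors)
print(f"{len(token_vectors)} tokens -> 1 vector of {len(sentence_vector)} dimensions")
print(sentence_vector)
```

With the real pipeline output, `mean_pool(embeddings[0])` would yield a single 768-dimensional vector.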


In [8]:
ner_pipeline = pipeline("ner", aggregation_strategy="simple")
ner_results = ner_pipeline(
    "Apple Inc. was founded by Steve Jobs in Cupertino, California."
)
print("Named Entities:", ner_results)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Named Entities: [{'entity_group': 'ORG', 'score': np.float32(0.999568), 'word': 'Apple Inc', 'start': 0, 'end': 9}, {'entity_group': 'PER', 'score': np.float32(0.9892235), 'word': 'Steve Jobs', 'start': 26, 'end': 36}, {'entity_group': 'LOC', 'score': np.float32(0.97110385), 'word': 'Cupertino', 'start': 40, 'end': 49}, {'entity_group': 'LOC', 'score': np.float32(0.9988753), 'word': 'California', 'start': 51, 'end': 61}]
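
Each entity carries `start` and `end` character offsets into the input string, so the matched text can be recovered by slicing. The sketch below copies the offsets from the output above (scores rounded to four decimals) and prints one entity per line.

```python
sentence = "Apple Inc. was founded by Steve Jobs in Cupertino, California."

# Entities copied from the pipeline output above, with scores rounded
ner_results = [
    {"entity_group": "ORG", "score": 0.9996, "word": "Apple Inc", "start": 0, "end": 9},
    {"entity_group": "PER", "score": 0.9892, "word": "Steve Jobs", "start": 26, "end": 36},
    {"entity_group": "LOC", "score": 0.9711, "word": "Cupertino", "start": 40, "end": 49},
    {"entity_group": "LOC", "score": 0.9989, "word": "California", "start": 51, "end": 61},
]

for ent in ner_results:
    span = sentence[ent["start"]:ent["end"]]  # slice the original text by offset
    print(f"{ent['entity_group']:<4} {span:<12} {ent['score']:.3f}")
```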


In [10]:
qa_pipeline = pipeline("question-answering")
context = "Hugging Face is a company based in New York that focuses on natural language processing."
question = "Where is Hugging Face based?"
answer = qa_pipeline(question=question, context=context)
print(f"Q: {question}")
print(f"A: {answer['answer']} (score: {answer['score']:.3f})")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Q: Where is Hugging Face based?
A: New York (score: 0.997)
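
Extractive question answering selects a span of the context rather than generating text: the model scores every token as a possible start and a possible end, and the answer is the best-scoring valid span. A minimal sketch with hypothetical per-token scores follows; the tokens are a simple whitespace split, not the model's real tokenization, and `max_answer_len` is an illustrative cap.

```python
context = "Hugging Face is a company based in New York"
tokens = context.split()

# Hypothetical start/end scores, one per token
start_scores = [0.1, 0.0, 0.0, 0.0, 0.2, 0.3, 0.5, 4.0, 1.0]
end_scores = [0.0, 0.2, 0.0, 0.0, 0.1, 0.1, 0.2, 1.5, 4.2]

max_answer_len = 5  # hypothetical cap on span length, in tokens
best_score, best_span = float("-inf"), (0, 0)
for i, s in enumerate(start_scores):
    # Only consider spans that end at or after the start token
    for j in range(i, min(i + max_answer_len, len(tokens))):
        if s + end_scores[j] > best_score:
            best_score, best_span = s + end_scores[j], (i, j)

start, end = best_span
print("Answer:", " ".join(tokens[start:end + 1]))
```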


## 2. BERT Embeddings

In [9]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

texts = [
    "Machine learning is fascinating",
    "Natural language processing enables computers to understand text",
    "Transformers revolutionized NLP",
]

for text in texts:
    # Tokenize
    inputs = tokenizer(
        text, return_tensors="pt", padding=True, truncation=True, max_length=512
    )

    # Get model outputs
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract embeddings (using [CLS] token for sentence embedding)
    last_hidden_states = outputs.last_hidden_state
    cls_embedding = last_hidden_states[:, 0, :]  # [CLS] token embedding

    print(f"Text: {text}")
    print(f"Embedding shape: {last_hidden_states.shape}")
    print(f"CLS Embedding shape: {cls_embedding.shape}")
    print(f"Sample embedding values: {cls_embedding[0][:5].tolist()}\n")

Text: Machine learning is fascinating
Embedding shape: torch.Size([1, 6, 768])
CLS Embedding shape: torch.Size([1, 768])
Sample embedding values: [-0.04144934192299843, 0.14025096595287323, -0.2169020175933838, 0.17884008586406708, -0.3127765357494354]

Text: Natural language processing enables computers to understand text
Embedding shape: torch.Size([1, 10, 768])
CLS Embedding shape: torch.Size([1, 768])
Sample embedding values: [-0.08259174227714539, -0.12789715826511383, -0.20781317353248596, 0.1806444525718689, -0.27416887879371643]

Text: Transformers revolutionized NLP
Embedding shape: torch.Size([1, 7, 768])
CLS Embedding shape: torch.Size([1, 768])
Sample embedding values: [-0.3289090394973755, -0.09865826368331909, 0.32339486479759216, 0.19192452728748322, -0.43156522512435913]
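
Sentence embeddings such as the `[CLS]` vectors above are usually compared with cosine similarity. The sketch below applies it to the five sample values printed for the first two sentences (rounded). Five of 768 dimensions are far too few to say anything about how similar the sentences really are; the numbers only illustrate the computation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First five [CLS] values from the output above, rounded to four decimals
v1 = [-0.0414, 0.1403, -0.2169, 0.1788, -0.3128]
v2 = [-0.0826, -0.1279, -0.2078, 0.1806, -0.2742]

similarity = cosine_similarity(v1, v2)
print(f"Cosine similarity (first 5 dims only): {similarity:.3f}")
```

For the full vectors, the same function could be called on `cls_embedding[0].tolist()` from two different sentences.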

