# Unit 1 Hands-on: Generative AI & NLP Fundamentals

Welcome to your interactive guide to **Generative AI**. This notebook is designed to be a step-by-step tutorial, explaining not just *how* to code, but *why* we use these tools.


## 1. Introduction & Setup

In this section, we will set up our environment. But first, let's understand the tools we are using.


### What is Hugging Face?

Hugging Face (https://huggingface.co/) is often called the "GitHub of AI". It is a massive repository where researchers and companies share their trained models, datasets, and demos.

Instead of training a model from scratch (which can cost millions of dollars for the largest models), we can download pretrained models like GPT-2, BERT, or RoBERTa directly from Hugging Face and use them.


### What is the `transformers` library?

The `transformers` library is the bridge between the models on Hugging Face and your code. It provides APIs to easily download, load, and run state-of-the-art pretrained models.

It supports framework interoperability, meaning you can often move between PyTorch, TensorFlow, and JAX.


### What is `pipeline()`?

The `pipeline()` function is the library's highest-level entry point. It wraps tokenization, the model call, and output decoding into three steps:

1.  **Preprocessing**: Splits your raw text into tokens and maps each token to a numeric ID that the model can process.
2.  **Model Inference**: The model processes the numbers and outputs predictions (logits).
3.  **Post-processing**: The raw predictions are converted back into human-readable text (labels, answers, summaries).

With just one line, `pipeline('task-name')` handles all of this for you.
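
The three steps above can be sketched in plain Python. This is only a toy illustration, not how `transformers` is implemented: the vocabulary, the weights, the labels, and every function name below are hypothetical and were invented for this example.

```python
import math

# Hypothetical toy vocabulary and weights, invented for illustration only.
VOCAB = {"good": 0, "great": 1, "bad": 2, "awful": 3}
UNKNOWN_ID = len(VOCAB)  # shared ID for out-of-vocabulary words
LABELS = ["NEGATIVE", "POSITIVE"]
# One row per token ID: [negative_weight, positive_weight]
WEIGHTS = [
    [0.0, 2.0],  # good
    [0.0, 3.0],  # great
    [2.0, 0.0],  # bad
    [3.0, 0.0],  # awful
    [0.0, 0.0],  # unknown word
]


def preprocess(text):
    """Step 1: split raw text into tokens and map each token to an ID."""
    tokens = text.lower().split()
    return [VOCAB.get(tok, UNKNOWN_ID) for tok in tokens]


def model_inference(token_ids):
    """Step 2: turn token IDs into one raw score (logit) per label."""
    logits = [0.0] * len(LABELS)
    for token_id in token_ids:
        for label_index, weight in enumerate(WEIGHTS[token_id]):
            logits[label_index] += weight
    return logits


def postprocess(logits):
    """Step 3: softmax the logits and return the best label with its score."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps in order."""
    return postprocess(model_inference(preprocess(text)))


print(toy_pipeline("a great movie"))
```

A real pipeline replaces the whitespace split with a trained tokenizer and the weight table with a neural network, but the flow of data (text, then IDs, then logits, then a readable result) has the same shape.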


### Import Pipeline
Let's import this powerful function.


In [1]:
!pip install torch transformers tensorflow



In [2]:
from transformers import pipeline, set_seed, GPT2Tokenizer




### Import Utilities
We also need `nltk` for some traditional NLP tasks and `os` for file handling.


In [3]:
import os
import nltk


### Loading the Course Material
We will define the path to our course text file (`unit 1.txt`).


In [4]:
file_path = "/content/unit 1.txt"


Now we read the file. This text will be the 'Knowledge Base' for our tasks later.


In [5]:
try:
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()
    print("File loaded successfully!")
except FileNotFoundError:
    # Keep `text` defined so later cells do not fail with a NameError.
    text = ""
    print(f"Error: '{file_path}' not found.")


File loaded successfully!


Let's look at the first 500 characters to make sure we have the right data.


In [6]:
print("--- Data Preview ---")
print(text[:500] + "...")


--- Data Preview ---
Generative AI and Its Applications: A Foundational Briefing

Executive Summary

This document provides a comprehensive overview of Generative AI, synthesizing foundational concepts, technological underpinnings, and practical applications as outlined in the course materials from PES University. Generative AI represents a transformative subset of Artificial Intelligence focused on creating novel content, a capability primarily driven by the advent of Large Language Models (LLMs). The evolution of ...


## 2. Generative AI: Dumb vs. Smart Models

Generative AI creates new content (text, images, audio), but the quality of that content depends heavily on the model's size and training.

We will compare two models:
1.  **`distilgpt2`**: A distilled version of GPT-2 (about 82M parameters, versus about 124M for `gpt2`). It is smaller, faster, and requires less memory, but it might be less coherent (a "Dumb" model for this comparison).
2.  **`gpt2`**: The standard version (The "Smart" model, though still small by modern standards).

**How to access a model?**
1.  Go to the Hugging Face Models page (https://huggingface.co/models).
2.  Search for a task (e.g., 'Text Generation').
3.  Pick a model (e.g., `gpt2`).
4.  Copy the model name.


### Step 1: Set a Seed

A **seed value** is used to make random results **reproducible**. When we set a seed, the random number generator starts from the same point each time, which means it will produce the **same sequence of random values**.

Try running the code multiple times using the **same seed value** (re-run the `set_seed` cell before each generation) and observe that the output stays the same.

Now, change the seed value and run the code again. This time, the output **will change** because a different seed creates a different sequence of random numbers.
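
The effect of a seed can be seen without any model at all. The sketch below uses Python's built-in `random` module; `sample_numbers` is a hypothetical helper written for this example. The `set_seed` function from `transformers` applies the same idea by seeding Python's `random`, NumPy, and the installed deep learning framework together.

```python
import random


def sample_numbers(seed, count=5):
    """Return `count` pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator; global state is untouched
    return [rng.randint(0, 999) for _ in range(count)]


first_run = sample_numbers(42)
second_run = sample_numbers(42)
other_seed = sample_numbers(43)

print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # different seed, different sequence
```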


In [7]:
set_seed(42)


### Step 2: Define a Prompt
Each model will be asked to complete this sentence.


In [8]:
prompt = "Generative AI is a revolutionary technology that"


# Text Generation

In [9]:
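# BERT is an encoder-only model trained to fill in masked tokens, so it is not expected to produce fluent continuations.
bert_generator = pipeline('text-generation', model='bert-base-uncased')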
output_bert = bert_generator(prompt, max_length=50, num_return_sequences=1)
print(output_bert[0]['generated_text'])


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Ple

Generative AI is a revolutionary technology that................................................................................................................................................................................................................................................................


In [10]:
# RoBERTa is also an encoder-only masked language model, so the same caveat applies.
roberta_generator = pipeline('text-generation', model='roberta-base')
output_roberta = roberta_generator(prompt, max_length=50, num_return_sequences=1)
print(output_roberta[0]['generated_text'])


If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generative AI is a revolutionary technology that


In [11]:
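# BART is an encoder-decoder model; loaded this way, part of it is newly initialized (see the warning in the output).
bart_generator = pipeline('text-generation', model='facebook/bart-base')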
output_bart = bart_generator(prompt, max_length=50, num_return_sequences=1)
print(output_bart[0]['generated_text'])


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generative AI is a revolutionary technology that squatSpoilerOtherwise Drawn Shak Shak32 Shak Shak Shak denim df df dfPatrickSel religious au au Shak Shak dfSpoiler slipsino chuck chuck32 df df32 Mavericks32 df Walk Drawn Drawn ShakJan chuck df df Walk slips32 dfOtherwise df df Drawn3232Thor df df chuck df Cr Walk Drawn CrazeazeazeFrames Drawn df Drawn Drawn Drawn37232ino32spawn df df aisle banned3232 Drawn Drawnaxe Drawn Drawneller Drawn Drawn Alvin df Alvin Drawn Drawn df 361 Shak Drawn Drawn workload Drawn Drawn32 df32 dfSpoiler Drawn Drawn debuggeraze df spots df origins Drawn Drawn spots df df slips charged df dfSel Molecular df dfStatusFrames Drawn Drawn Walk Drawn Shak Drawntravel Drawn DrawnSel df Drawn spots spots df spots spots Walk DrawnOtherwise dfFrames Drawn Walk Beet df Beet Walk DrawnPost Drawn Alvin skysc df df Beet Beet Drawnload spots df debugger Alvin df df sure Drawn Drawn origins df Beet debugger dfPost finding df Drawn finding df df futures df spots floral df dfP
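
Text generation is autoregressive: the model predicts one next token, appends it to the input, and repeats. Models such as GPT-2 are trained on exactly this next-token objective, while BERT and RoBERTa are trained to fill in masked tokens, which is one likely reason the outputs above are not usable. The loop can be sketched as follows. The lookup table `NEXT_TOKEN` and the function `greedy_generate` are hypothetical stand-ins for a trained model, invented for this example.

```python
# Hypothetical next-token table standing in for a trained language model.
NEXT_TOKEN = {
    "that": "can",
    "can": "create",
    "create": "new",
    "new": "content",
    "content": "<eos>",
}


def greedy_generate(prompt, max_new_tokens=3):
    """Append one predicted token at a time until the limit or end-of-sequence."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return " ".join(tokens)


prompt = "Generative AI is a revolutionary technology that"
print(greedy_generate(prompt, max_new_tokens=3))
print(greedy_generate(prompt, max_new_tokens=10))
```

In this sketch `max_new_tokens` counts only the appended tokens. In `transformers`, `max_length` counts the prompt tokens as well, which is why the warnings above say which of the two settings takes precedence.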

# Fill-Mask

In [12]:
mask_filler = pipeline("fill-mask", model="bert-base-uncased")

masked_sentence = "The goal of Generative AI is to create new [MASK]."
preds = mask_filler(masked_sentence)

for p in preds:
    print(f"{p['token_str']}: {p['score']:.2f}")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


applications: 0.06
ideas: 0.05
problems: 0.05
systems: 0.04
information: 0.03
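
Each `score` printed above is a probability: the model produces one logit per vocabulary entry for the masked position, a softmax turns the logits into probabilities, and the pipeline keeps the top five by default. The sketch below reproduces that post-processing on made-up numbers. The logits and the helper `top_k_predictions` are hypothetical.

```python
import math


def top_k_predictions(logits_by_token, k=5):
    """Softmax over every candidate's logit, then keep the k most probable."""
    highest = max(logits_by_token.values())
    exps = {tok: math.exp(logit - highest) for tok, logit in logits_by_token.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda item: item[1], reverse=True)
    return [{"token_str": tok, "score": value / total} for tok, value in ranked[:k]]


# Hypothetical logits for a six-word vocabulary.
toy_logits = {"applications": 2.0, "ideas": 1.8, "problems": 1.7,
              "systems": 1.5, "information": 1.2, "banana": -3.0}

for p in top_k_predictions(toy_logits):
    print(f"{p['token_str']}: {p['score']:.2f}")
```

The real scores are much smaller than in this toy case because the probability mass is spread over a vocabulary of roughly 30,000 tokens for `bert-base-uncased`, so a top score of 0.06 can still be the model's clear favorite.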


In [13]:
mask_filler = pipeline("fill-mask", model="roberta-base")

masked_sentence = "The goal of Generative AI is to create new <mask>."
preds = mask_filler(masked_sentence)

for p in preds:
    print(f"{p['token_str']}: {p['score']:.2f}")


Device set to use cpu


 AI: 0.07
 agents: 0.06
 intelligence: 0.05
 applications: 0.04
 insights: 0.04


In [14]:
mask_filler = pipeline("fill-mask", model="facebook/bart-base")

masked_sentence = "The goal of Generative AI is to create new <mask>."
preds = mask_filler(masked_sentence)

for p in preds:
    print(f"{p['token_str']}: {p['score']:.2f}")


Device set to use cpu


 ways: 0.16
 AI: 0.10
 and: 0.05
 models: 0.04
,: 0.03
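
Note that the three fill-mask cells do not use the same placeholder: BERT expects `[MASK]`, while RoBERTa and BART expect `<mask>`. One way to avoid hard-coding the wrong token is to build the sentence from a template. The sketch below is a minimal version of that idea; `MASK_TOKENS` and `build_masked_sentence` are hypothetical names. With a loaded pipeline, the same string can be read from `mask_filler.tokenizer.mask_token` instead of a hand-written table.

```python
# Mask tokens as used in the cells above.
MASK_TOKENS = {
    "bert-base-uncased": "[MASK]",
    "roberta-base": "<mask>",
    "facebook/bart-base": "<mask>",
}


def build_masked_sentence(template, model_name):
    """Replace the `{mask}` placeholder with the model's own mask token."""
    return template.format(mask=MASK_TOKENS[model_name])


template = "The goal of Generative AI is to create new {mask}."
for name in MASK_TOKENS:
    print(f"{name}: {build_masked_sentence(template, name)}")
```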


# Question Answering

In [15]:
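# bert-base-uncased has no fine-tuned question-answering head (see the warning in the output), so the answers are not meaningful.
qa_pipeline = pipeline("question-answering", model="bert-base-uncased")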

questions = [
    "What is the fundamental innovation of the Transformer?",
    "What are the risks of using Generative AI?"
]

for q in questions:
    res = qa_pipeline(question=q, context=text[:5000])
    print(f"\nQ: {q}")
    print(f"A: {res['answer']}")


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu



Q: What is the fundamental innovation of the Transformer?
A: not spam"

Q: What are the risks of using Generative AI?
A: not spam"


In [16]:
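# roberta-base also gets a newly initialized question-answering head, so the answers are not meaningful.
qa_pipeline = pipeline("question-answering", model="roberta-base")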

questions = [
    "What is the fundamental innovation of the Transformer?",
    "What are the risks of using Generative AI?"
]

for q in questions:
    res = qa_pipeline(question=q, context=text[:5000])
    print(f"\nQ: {q}")
    print(f"A: {res['answer']}")


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu



Q: What is the fundamental innovation of the Transformer?
A: it

Q: What are the risks of using Generative AI?
A: it


In [17]:
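# facebook/bart-base also gets a newly initialized question-answering head, so the answers are not meaningful.
qa_pipeline = pipeline("question-answering", model="facebook/bart-base")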

questions = [
    "What is the fundamental innovation of the Transformer?",
    "What are the risks of using Generative AI?"
]

for q in questions:
    res = qa_pipeline(question=q, context=text[:5000])
    print(f"\nQ: {q}")
    print(f"A: {res['answer']}")


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu



Q: What is the fundamental innovation of the Transformer?
A: LLMs). The evolution of these models, from early

Q: What are the risks of using Generative AI?
A: LLMs). The evolution of these models, from early
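
The question-answering pipeline is extractive: the model assigns a start score and an end score to every token of the context, and the answer is the span with the best combined score. The warnings above say the `qa_outputs` layer is newly initialized, so those scores are effectively arbitrary, which is consistent with each model returning the same fragment for both questions. The sketch below is a simplified illustration of the span selection. The context sentence, all score values, and the function `best_span` are hypothetical and were chosen for this example.

```python
import random


def best_span(tokens, start_scores, end_scores, max_answer_len=5):
    """Return the token span whose start score + end score is highest."""
    best_score, best_start, best_end = float("-inf"), 0, 0
    for start in range(len(tokens)):
        last = min(start + max_answer_len, len(tokens))
        for end in range(start, last):
            score = start_scores[start] + end_scores[end]
            if score > best_score:
                best_score, best_start, best_end = score, start, end
    return " ".join(tokens[best_start:best_end + 1])


# Hypothetical context, split on whitespace into 7 tokens.
context = "The Transformer relies on the self-attention mechanism".split()

# Made-up scores resembling what a fine-tuned head might produce.
trained_start = [0.1, 0.2, 0.0, 0.1, 1.5, 3.0, 0.3]
trained_end = [0.0, 0.1, 0.0, 0.0, 0.1, 1.0, 3.5]
print(best_span(context, trained_start, trained_end))

# Random scores standing in for a newly initialized (untrained) head.
rng = random.Random(0)
random_start = [rng.uniform(-1, 1) for _ in context]
random_end = [rng.uniform(-1, 1) for _ in context]
print(best_span(context, random_start, random_end))
```

With the made-up "trained" scores the selected span is `self-attention mechanism`. With random scores the selected span is unrelated to the question, which mirrors the answers printed above.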
