# **Introduction to Generative AI and Large Language Models (LLMs)**

## **1. Introduction**

### **What is Generative AI?**
Generative AI refers to a class of artificial intelligence models that can generate new content, such as text, images, music, or code, based on patterns learned from existing data. Unlike traditional AI, which focuses on classification or prediction, generative AI creates **new data** that resembles the training data.

### **What are Large Language Models (LLMs)?**
Large Language Models (LLMs) are a subset of generative AI models specifically designed to understand and generate human-like text. They are trained on massive datasets and can perform tasks like:
- Text generation
- Translation
- Summarization
- Question answering
- Code generation

### **Objective**
To provide a comprehensive introduction to Generative AI and LLMs, including their applications, key players, and technical foundations. We’ll also demonstrate how to use an LLM for text generation using Hugging Face's `transformers` library.

---

## **2. Metadata and Dataset Overview**

### **Dataset Used**
- **Dataset Name**: IMDB Movie Reviews
- **Source**: [Hugging Face Datasets](https://huggingface.co/datasets/imdb)
- **Description**: 50,000 movie reviews labeled as positive (1) or negative (0).


### **Acknowledgement**
This dataset is publicly available via Hugging Face and is widely used for training and evaluating LLMs.

---

## **3. Technical Concepts**

### **How Do LLMs Work?**
LLMs are based on the **Transformer architecture**, which uses self-attention mechanisms to process input sequences. Key components include:
1. **Self-Attention**: Captures relationships between words in a sequence.
2. **Scaled Dot-Product Attention**: Computes attention weights efficiently.
3. **Multi-Head Attention**: Parallelizes attention mechanisms for richer context.

### **Training LLMs**
1. **Pre-training**: The model is trained on a large corpus of text to predict the next word in a sequence (causal language modeling).
2. **Fine-tuning**: The pre-trained model is fine-tuned on specific tasks (e.g., summarization, translation).

### **Applications of LLMs**
1. **Chatbots**: Generate human-like responses in conversational AI.
2. **Content Creation**: Write articles, stories, or marketing copy.
3. **Code Generation**: Assist developers by generating code snippets.
4. **Translation**: Translate text between languages.
5. **Summarization**: Generate concise summaries of long documents.

---

## **4. Setup and Data Preprocessing**

### **Install Dependencies**


In [2]:
!pip install transformers datasets torch
import warnings
warnings.filterwarnings('ignore')

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupt

### **Explanation**
- **`transformers`**: Hugging Face's library for pre-trained models and tokenizers. It provides easy access to state-of-the-art models like GPT-2, BERT, and T5.
- **`datasets`**: A library for loading and processing datasets. It supports a wide range of datasets, including IMDB, OpenWebText, and more.
- **`torch`**: PyTorch, a deep learning framework used for building and training neural networks.


### **Load Dataset**

In [3]:
from datasets import load_dataset
import warnings
warnings.filterwarnings('ignore')
# Load IMDB dataset
dataset = load_dataset("imdb")
print(dataset["train"][0])  # Sample review

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

### **Explanation**
- **`load_dataset`**: This function loads the IMDB dataset from Hugging Face's dataset hub. The dataset contains 50,000 movie reviews labeled as positive (1) or negative (0).
- **`dataset["train"][0]`**: We inspect the first training example to understand the data format. Each example contains:
  - `text`: The movie review text.
  - `label`: The sentiment label (0 for negative, 1 for positive).


### **Tokenization**


In [5]:
from transformers import AutoTokenizer

# Load GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize a sample text
# Access the 'train' split and then the first element
sample_text = dataset["train"][0]["text"]
tokenized_input = tokenizer(
    sample_text,
    return_tensors="pt",
    truncation=True,
    max_length=512  # Limit sequence length
)
print(tokenized_input)

{'input_ids': tensor([[   40, 26399,   314,  3001,   327, 47269, 20958,    12,    56, 23304,
          3913,   422,   616,  2008,  3650,   780,   286,   477,   262, 10386,
           326, 11191,   340,   618,   340,   373,   717,  2716,   287, 15904,
            13,   314,   635,  2982,   326,   379,   717,   340,   373, 12000,
           416,   471,    13,    50,    13, 17112,   611,   340,  1683,  3088,
           284,  3802,   428,  1499,    11,  4361,   852,   257,  4336,   286,
          7328,  3177,   366,  3642, 46927,     1,   314,  1107,   550,   284,
           766,   428,   329,  3589, 29847,  1671,  1220,  6927,  1671, 11037,
           464,  7110,   318, 19254,  1088,   257,  1862, 14023, 10512,  3710,
          3706, 44131,   508,  3382,   284,  2193,  2279,   673,   460,   546,
          1204,    13,   554,  1948,   673,  3382,   284,  2962,   607, 32649,
           507,   284,  1642,   617,  3297,   286, 11648,   319,   644,   262,
          2811,  7289,    68,  1807,  

### **Explanation**
- **`AutoTokenizer`**: This class automatically selects the appropriate tokenizer for the model (in this case, GPT-2).
- **`tokenizer`**: Converts text into numerical input that the model can process. It performs the following steps:
  - **Tokenization**: Splits the text into individual tokens (words, subwords, or characters).
  - **Encoding**: Converts tokens into numerical IDs.
  - **Padding/Truncation**: Ensures all sequences have the same length (512 tokens in this case).
- **`return_tensors="pt"`**: Returns PyTorch tensors instead of Python lists.
- **`truncation=True`**: Truncates sequences longer than 512 tokens.
- **`max_length=512`**: Limits the sequence length to 512 tokens.


## **5. Build a Text Generation Model**

### **Code: Load Pre-Trained GPT-2**


In [6]:
from transformers import AutoModelForCausalLM

# Load pre-trained GPT-2 model
model = AutoModelForCausalLM.from_pretrained("gpt2")

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

### **Explanation**
- **`AutoModelForCausalLM`**: This class loads a pre-trained model for causal language modeling (predicting the next word in a sequence).
- **`from_pretrained("gpt2")`**: Loads the GPT-2 model, which is a popular LLM for text generation tasks.
- GPT-2 is a **decoder-only Transformer model**, meaning it generates text one token at a time based on previous tokens.


### **Code: Generate Text**

In [12]:
# Generate text
input_text = "Due to heavy rain in the past few days"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Generate output
output = model.generate(
    input_ids,
    max_length=50,  # Limit output length
    num_return_sequences=1,  # Generate one sequence
    no_repeat_ngram_size=2,  # Avoid repeating phrases
    top_k=50,  # Limit sampling to top-k tokens
    top_p=0.95,  # Nucleus sampling
    temperature=0.7  # Control randomness
)

# Decode output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Due to heavy rain in the past few days, the city has been forced to close its streets and streetscape.

The city's mayor, John Tory, said the closure was necessary to protect the public from the storm. "We're going


### **Explanation**
- **`input_text`**: The starting prompt for text generation.
- **`tokenizer(input_text, return_tensors="pt")`**: Tokenizes the input text and returns PyTorch tensors.
- **`model.generate`**: Generates text based on the input. Key parameters:
  - **`max_length=50`**: Limits the generated text to 50 tokens.
  - **`num_return_sequences=1`**: Generates only one sequence.
  - **`no_repeat_ngram_size=2`**: Prevents the model from repeating 2-grams (pairs of words).
  - **`top_k=50`**: Limits sampling to the top 50 most likely tokens at each step.
  - **`top_p=0.95`**: Uses nucleus sampling, which selects tokens from the smallest set whose cumulative probability exceeds 0.95.
  - **`temperature=0.7`**: Controls randomness. Lower values make the output more deterministic.
- **`tokenizer.decode`**: Converts the generated token IDs back into human-readable text.
- **`skip_special_tokens=True`**: Removes special tokens like `<end>` from the output.


## **6. Applications of LLMs**

### **1. Chatbots**
LLMs like GPT-3 and ChatGPT are used to build conversational AI systems that can answer questions, provide recommendations, and engage in natural dialogue.

### **2. Content Creation**
LLMs can generate articles, stories, and marketing copy. For example:
- **Blog Writing**: Generate blog posts on specific topics.
- **Creative Writing**: Write short stories or poems.

### **3. Code Generation**
LLMs like GitHub Copilot assist developers by generating code snippets, completing functions, and debugging.

### **4. Translation**
LLMs can translate text between languages with high accuracy. For example:
- **English to French**: Translate sentences or documents.

### **5. Summarization**
LLMs can generate concise summaries of long documents, making it easier to extract key information.

---

## **7. Key Players in Generative AI**

### **1. OpenAI**
- **Models**: GPT-3, GPT-4, ChatGPT, DALL·E.
- **Applications**: Text generation, image generation, conversational AI.

### **2. Hugging Face**
- **Models**: BERT, GPT-2, T5, DistilBERT.
- **Applications**: Open-source NLP tools, model hosting, fine-tuning.

### **3. Google**
- **Models**: BERT, T5, LaMDA.
- **Applications**: Search, translation, conversational AI.

### **4. Meta (Facebook)**
- **Models**: LLaMA, OPT.
- **Applications**: Research, open-source contributions.

---

## **8. Advantages and Disadvantages**

### **Advantages**
1. **Versatility**: LLMs can perform a wide range of tasks.
2. **Scalability**: Can handle large datasets and complex tasks.
3. **Human-Like Output**: Generates high-quality, coherent text.

### **Disadvantages**
1. **Compute-Intensive**: Requires GPUs/TPUs for training and inference.
2. **Bias**: May inherit biases from training data.
3. **Data Hunger**: Needs massive datasets for training.

---

## **9. Conclusion**

### **Key Learnings**
- Generative AI and LLMs are transforming industries by automating content creation, translation, and more.
- The Transformer architecture is the foundation of modern LLMs.
- Hugging Face's `transformers` library makes it easy to use pre-trained LLMs for text generation.

---

## **10. References**
- [Attention Is All You Need (Paper)](https://arxiv.org/abs/1706.03762)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [OpenAI GPT-3](https://openai.com/research/gpt-3)
- [OpenWebText Dataset](https://huggingface.co/datasets/openwebtext)
