
## A Crash Course in Using LLMs to Build GenAI-Powered Applications 🚀

### Introduction 👋
Welcome to this crash course on using Large Language Models (LLMs) to add intelligence to your applications. In this 30-minute session, we'll explore the key use cases of LLMs and how you can quickly get started building GenAI-powered applications. Get ready to dive in! 🌟

#### Agenda 📋
- Setting up the environment ⚙️
- Zero-shot Chat 🎯
    - What is zero-shot learning?
    - Why use zero-shot learning?
- Few-shot Learning 🎓
    - What is few-shot learning?
    - Why use few-shot learning?
- Retrieval Augmented Generation (RAG) 🔍
    - What is RAG?
    - Why use RAG?
- Fine-tuning LLMs 🚀
    - What is fine-tuning?
    - Why use fine-tuning?
___

#### Setting up the Environment ⚙️
First, let's install the necessary library:

In [73]:
!pip install predictionguard > /dev/null 2>&1

Next, import the required modules:


In [4]:
import os
import json

# FREE access token for usage at: tinyurl.com/pg-intel-hack
import predictionguard as pg
from getpass import getpass

Set up your Prediction Guard access token:

In [5]:
pg_access_token = getpass('Enter your Prediction Guard access token: ')
os.environ['PREDICTIONGUARD_TOKEN'] = pg_access_token

Enter your Prediction Guard access token: ··········


___
#### Zero-shot Chat 🎯

Zero-shot learning lets an LLM perform a task without task-specific training or worked examples in the prompt. The model relies on the knowledge it acquired during pre-training to respond to the given prompt.

Why use zero-shot learning?

- Quick and easy to implement
- No need for task-specific training data
- Suitable for simple and straightforward tasks

Let's try a simple example of zero-shot chat using the Prediction Guard API:

In [75]:
messages = [
{
"role": "system",
"content": """You are a question-answering bot and will give an answer based on the question you get.
It is critical to limit your answers to the question and don't print anything else.
If you cannot answer the question, respond with 'Sorry, I don't know.'"""
},
{
"role": "user",
"content": "I am going to meet my friend for a night out on the town."
}
]

result = pg.Chat.create(
    model="Neural-Chat-7B",
    messages=messages
)

print(result['choices'][0]['message']['content'].split('\n')[0])

Enjoy your night out with your friend.
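The call above pairs one system instruction with one user message and keeps only the first line of the reply. A minimal sketch of that pattern as two helpers follows. The names `build_zero_shot_messages` and `first_line` are hypothetical, and the response dictionary is a stand-in that assumes the same shape as the result indexed above, so the sketch runs without calling any API.

```python
def build_zero_shot_messages(instruction, user_input):
    """Return a chat payload with no worked examples: one system message, one user message."""
    return [
        {"role": "system", "content": instruction},
        {"role": "user", "content": user_input},
    ]


def first_line(result):
    """Keep the first line of the first choice, as the print statement above does."""
    content = result["choices"][0]["message"]["content"]
    return content.split("\n")[0]


# Stand-in response (hypothetical content) so the sketch runs offline.
fake_result = {"choices": [{"message": {"content": "Enjoy your night out.\nExtra text"}}]}
demo_messages = build_zero_shot_messages("Answer briefly.", "I am going out tonight.")
print(len(demo_messages), first_line(fake_result))
```

Keeping the payload construction in one place makes it easier to swap the instruction while leaving the rest of the call unchanged.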


___
#### Few-shot Learning 🎓

Few-shot learning involves providing a small number of examples in the prompt to guide the model's output. It lets LLMs adapt to a specific task or style from a few representative examples, without updating the model's weights.

Why use few-shot learning?

- Enables task-specific customization
- Improves the model's performance on desired tasks
- Requires only a handful of examples rather than a training dataset

Let's explore few-shot learning for linguistic style transfer with the Prediction Guard API:





In [101]:
examples = """Neutral: "I'm looking for directions to the nearest bank, can you help me?"
Yoda: "Directions to the nearest bank, you seek. Help you, I can."

Neutral: "It's a pleasure to meet you. What's your name?"
Yoda: "A pleasure to meet you, it is. Yoda, my name is."

Neutral: "I've lost my way, could you point me in the right direction?"
Yoda: "Lost your way, you have. Point you in the right direction, I will."

Neutral: "This weather is wonderful, isn't it?"
Yoda: "Wonderful, this weather is. Agree, do you not?"

Neutral: "I'm feeling a bit under the weather today."
Yoda: "Under the weather, you are feeling today. Better soon, you will be."

Neutral: "Could you please lower the volume? It's quite loud."
Yoda: "Lower the volume, could you please? Quite loud, it is."

Neutral: "I'm here to collect the documents you mentioned."
Yoda: "The documents I mentioned, collect them, you are here to."

Neutral: "Thank you for your assistance. I really appreciate it."
Yoda: "For your assistance, thank you. Appreciate it, I really do."

Neutral: "I'm sorry, I didn't catch your last sentence."
Yoda: "Sorry, I am. Your last sentence, catch it, I did not."

Neutral: "Let's schedule a meeting for next week to discuss the project."
Yoda: "A meeting for next week, schedule, let us. The project, discuss, we will."""

messages = [
{
"role": "system",
"content": "You are a text editor that takes in Neutral text from the user and outputs modified text in the way Yoda would speak similar to these examples:\n\n" + examples
},
{
"role": "user",
"content": 'Neutral: "I am going to meet my friend for a night out on the town."\nYoda:'
}
]

for model in ["Neural-Chat-7B","Hermes-2-Pro-Mistral-7B", "Yi-34B-Chat"]:
    result = pg.Chat.create(
        model,
        messages=messages
    )
    print("="*71)
    print(f"Using Model: {model}")
    print(f"Neutral Text: I am going to meet my friend for a night out on the town.")
    lines = result['choices'][0]['message']['content'].split('\n')
    print(f"How would Yoda say this?: {lines[0]}")
    print("="*71)


Using Model: Neural-Chat-7B
Neutral Text: I am going to meet my friend for a night out on the town.
How would Yoda say this?: "A night out on the town, to meet my friend, I am going for."
Using Model: Hermes-2-Pro-Mistral-7B
Neutral Text: I am going to meet my friend for a night out on the town.
How would Yoda say this?: "A night out on the town, you're going. Meet your friend, you will." 
Using Model: Yi-34B-Chat
Neutral Text: I am going to meet my friend for a night out on the town.
How would Yoda say this?: "A night out on the town, to meet your friend, you are going. Enjoy, may you." 
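The examples string above was written by hand. When the example pairs live in a list or a file, the same layout can be generated. The sketch below is one way to do that; `format_examples` and `format_query` are hypothetical helper names, and the two pairs are copied from the examples above.

```python
def format_examples(pairs, source_label="Neutral", target_label="Yoda"):
    """Render (source, target) pairs in the two-line layout used by the examples string."""
    blocks = [
        f'{source_label}: "{src}"\n{target_label}: "{tgt}"'
        for src, tgt in pairs
    ]
    return "\n\n".join(blocks)


def format_query(text, source_label="Neutral", target_label="Yoda"):
    """End the prompt on the target label so the model continues in the target style."""
    return f'{source_label}: "{text}"\n{target_label}:'


pairs = [
    ("It's a pleasure to meet you. What's your name?",
     "A pleasure to meet you, it is. Yoda, my name is."),
    ("This weather is wonderful, isn't it?",
     "Wonderful, this weather is. Agree, do you not?"),
]
few_shot_block = format_examples(pairs)
print(few_shot_block)
print(format_query("I am going to meet my friend for a night out on the town."))
```

Ending the user message on the bare `Yoda:` label is what invites the model to complete the pattern rather than comment on it.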


___
#### Retrieval Augmented Generation (RAG) 🔍

RAG combines LLMs with external knowledge retrieval to generate more informative and accurate responses. It allows models to access and incorporate relevant information from external sources during the generation process.

**Why use RAG?**

- Enhances the model's knowledge beyond its training data
- Generates more accurate and informative responses
- Enables the model to handle a wider range of topics and domains
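Before bringing in embeddings, the retrieve-then-generate flow can be sketched in plain Python. The example below is a toy under stated assumptions: it scores documents by word overlap (Jaccard similarity) as a stand-in for embedding similarity, and the documents, question, and function names are all hypothetical.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and drop basic punctuation."""
    return {word.strip(".,?!'\"").lower() for word in text.split()} - {""}


def retrieve(question, documents):
    """Return the document with the highest word overlap (Jaccard) with the question."""
    q_words = tokenize(question)

    def score(doc):
        d_words = tokenize(doc)
        return len(q_words & d_words) / len(q_words | d_words)

    return max(documents, key=score)


def build_prompt(question, context):
    """Place the retrieved context ahead of the question for the language model."""
    return f"Context: {context}\nQuestion: {question}\nAnswer:"


docs = [
    "The office cafeteria opens at eight in the morning.",
    "Support tickets are answered within two business days.",
]
demo_question = "When does the cafeteria open?"
demo_context = retrieve(demo_question, docs)
print(build_prompt(demo_question, demo_context))
```

Word overlap misses paraphrases ("hours" versus "opens"), which is the gap that embedding-based retrieval is meant to close.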


##### 🚀 Building a Question Answering RAG pipeline with Sentence Transformers and FAISS 📚

Install required libraries:

In [83]:
!echo "installing required libraries..."
!pip install faiss-cpu > /dev/null  # for indexing
!pip install sentence_transformers > /dev/null  # for generating embeddings
!echo "installing completed..."


installing required libraries...
installing completed...


In this section, we'll explore how to build a powerful question answering system using Sentence Transformers for generating embeddings and FAISS for efficient similarity search. Let's dive in! 🌊

###### 📥 Importing the Required Libraries
First, let's import the necessary libraries:

In [84]:
import predictionguard as pg
from sentence_transformers import SentenceTransformer
import faiss

###### 🏗️ Defining the Knowledge Base
Next, we'll define our simplified knowledge base as a list of strings:

In [106]:
knowledge_base = [
    "Prediction Guard is an AI company that provides APIs for language models.",
    "Prediction Guard is an Intel Liftoff startup",
    "Intel Liftoff is Intel's premier startup accelerator program for early stage startups",
    "Prediction Guard offers a variety of models for different tasks like text generation, classification, and question answering.",
    "Prediction Guard's APIs are easy to use and integrate into your applications.",
    "Prediction Guard is deployed on the Intel Developer Cloud using Intel Habana Gaudi 2 machines.",
    "Intel Habana Gaudi 2 is a purpose-built AI processor designed for high-performance deep learning training and inference.",
    "Gaudi 2 offers high efficiency, scalability, and ease of use for AI workloads.",
    "By leveraging Gaudi 2 on the Intel Developer Cloud, Prediction Guard can provide powerful and efficient AI capabilities to its users."
]

###### 💬 Creating the Question Answering Prompt Template
We'll create a simple prompt template for our question answering system. Optionally, if you use LangChain, you can use its PromptTemplate class to build a composable prompt:

In [126]:
prompt_template = """
### Instruction:
Read the below input context and respond with a short answer to the given question.
Use only the information in the below input to answer the question.
It is critical to limit your answers to the question and don't print anything else.
If you cannot answer the question, respond with "Sorry, I don't know."

### Input:
Context: {}
Question: {}

### Response:
"""

###### 🤖 Loading the Sentence Transformer Model
We'll load the Sentence Transformer model for generating embeddings:

In [108]:
model = SentenceTransformer("all-MiniLM-L6-v2") # a small and fast embedding model

###### 🎨 Generating Embeddings for the Knowledge Base
Using the loaded model, we'll generate embeddings for our knowledge base:

In [109]:
kb_embeddings = model.encode(knowledge_base)
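Each embedding is a vector, and retrieval compares vectors either by distance or by cosine similarity. For unit-length vectors the two agree: squared L2 distance equals 2 minus twice the cosine similarity, so the nearest vector by L2 is also the most similar by cosine. The sketch below checks that identity on two made-up vectors. Whether a given embedding model returns unit-length vectors is an assumption worth verifying, for example by inspecting the norms of `kb_embeddings`.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two arbitrary illustrative vectors, normalized to unit length.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])
cosine = dot(a, b)  # for unit vectors, the dot product is the cosine similarity
print(squared_l2(a, b), 2 - 2 * cosine)
```

The two printed values match up to floating-point rounding.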

###### 🔍 Initializing the FAISS Index

We'll initialize a FAISS index for efficient similarity search:

In [110]:
index = faiss.IndexFlatL2(kb_embeddings.shape[1])  # embedding dimension: 384 for all-MiniLM-L6-v2
index.add(kb_embeddings)
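`IndexFlatL2` is a brute-force index: it stores the vectors as they are and compares a query against every one of them using squared Euclidean distance. The pure-Python sketch below mimics that computation on three made-up 2-D vectors. It is for illustration only; `flat_l2_search` is a hypothetical name and not part of the FAISS API.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Score every stored vector and return (distances, indices) of the k nearest."""
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = flat_l2_search(stored, [0.9, 1.2], k=2)
print(distances, indices)
```

Exhaustive search is exact and fine for a nine-sentence knowledge base. Its cost grows linearly with the number of stored vectors, which is why larger collections usually move to approximate indexes.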

###### 🎯 Defining the Question Answering Function
Let's define a function that takes a question, finds the most relevant chunk from the knowledge base, and generates an answer using the language model:

In [127]:
def rag_answer(question):
    try:
        # Generate embedding for the question
        question_embedding = model.encode([question])

        # Find the most similar text from the knowledge base using FAISS
        _, most_relevant_idx = index.search(question_embedding, 1)
        relevant_chunk = knowledge_base[most_relevant_idx[0][0]]
        # Fill the prompt template with the retrieved context and the question
        prompt = prompt_template.format(relevant_chunk, question)

        # Get a response from the language model
        result = pg.Completion.create(
            model="Neural-Chat-7B", #"Nous-Hermes-Llama2-13B",
            prompt=prompt
        )
        return result['choices'][0]['text']
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return "Sorry, something went wrong. Please try again later."
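The function above passes only the single closest chunk to the model, so a question that spans two facts may receive partial context. One possible extension is to request several neighbors (for example, `index.search(question_embedding, 3)`) and join them. The sketch below shows only the joining step on toy data. `build_context` is a hypothetical helper, and the handling of `-1` assumes the index marks unfilled result slots that way.

```python
def build_context(knowledge_base, indices_row, separator="\n"):
    """Join retrieved chunks in rank order, skipping -1 placeholders for missing results."""
    chunks = [knowledge_base[i] for i in indices_row if i >= 0]
    return separator.join(chunks)


toy_kb = ["chunk A", "chunk B", "chunk C"]
# Mirrors one row of the index array from a k=3 search;
# the -1 stands for a slot with no result (an assumption in this sketch).
toy_indices_row = [2, 0, -1]
print(build_context(toy_kb, toy_indices_row))
```

More context gives the model more to work with but also lengthens the prompt, so the number of neighbors is a trade-off to tune.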

###### 🧪 Testing the Question Answering System
Let's test our question answering system with a few examples:

In [128]:
print("="*71)
question1 = "What hardware does Prediction Guard use for its AI services?"
response1 = rag_answer(question1)
print(f"Question 1: {question1}")
print(f"Response 1: {response1}")
print("="*71)

question2 = "Where is Prediction Guard's headquarters located?"
response2 = rag_answer(question2)
print(f"Question 2: {question2}")
print(f"Response 2: {response2}")
print("="*71)

question3 = "What is Intel Liftoff and what is prediction guards relatioship to liftoff?"
response3 = rag_answer(question3)
print(f"Question 3: {question3}")
print(f"Response 3: {response3}")
print("="*71)

Question 1: What hardware does Prediction Guard use for its AI services?
Response 1: Prediction Guard uses Intel Habana Gaudi 2 machines for its AI services.
Question 2: Where is Prediction Guard's headquarters located?
Response 2: Sorry, I don't know.
Question 3: What is Intel Liftoff and what is prediction guards relatioship to liftoff?
Response 3: Intel Liftoff is a startup incubator that supports promising technology companies. Prediction Guard is one such startup that is a part of the Intel Liftoff program. The relationship between Prediction Guard and Intel Liftoff is that Prediction Guard is an Intel Liftoff startup which is being nurtured and supported by the incubator program.


As we can see, our question answering system provides a relevant answer for the first question, which can be found in the knowledge base. For the second question, since the information is not present in the knowledge base, it responds with "Sorry, I don't know." 🙌
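In the run above, the "Sorry, I don't know." reply comes from the model following the prompt instruction. The similarity search also returns a distance for the nearest chunk, so a complementary option is to decline before calling the model when that distance is large. The sketch below illustrates the idea. `answer_or_decline` is a hypothetical helper, and `max_distance=1.0` is an arbitrary placeholder that would need tuning on real questions.

```python
def answer_or_decline(nearest_distance, generate, max_distance=1.0,
                      fallback="Sorry, I don't know."):
    """Call `generate` only when the closest chunk is near enough to the question."""
    if nearest_distance > max_distance:
        return fallback
    return generate()


calls = []


def fake_generate():
    """Stand-in for the language model call, recording that it was invoked."""
    calls.append("called")
    return "model answer"


print(answer_or_decline(0.4, fake_generate))  # close match: the model is called
print(answer_or_decline(1.7, fake_generate))  # distant match: the model is skipped
```

Skipping the model call for distant matches saves a request and avoids relying solely on the model to admit it lacks context.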

___
#### Fine-tuning LLMs 🚀

Fine-tuning involves training LLMs on task-specific data to adapt them to particular domains or applications. It allows for more precise control over the model's behavior and can lead to better performance on specialized tasks.

**Why use fine-tuning?**

- Often achieves better performance on specific tasks than prompting alone
- Enables customization to match desired output style and format
- Suitable for complex and domain-specific applications

In [131]:
!pip install -U --no-cache datasets transformers


##### Load the Dataset 📚

First, let's load a subset of the Dolly 15k dataset that contains only question-and-answer pairs, using the Hugging Face Datasets library:

In [None]:
from datasets import load_dataset

dataset = load_dataset("dave-does-data/databricks-dolly-qa-subset-7k", split="train")

##### Load the Model and Tokenizer 🤖
Next, we load the distilgpt2 model and tokenizer from Hugging Face:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

##### Perform Inference Before Fine-tuning 🔍
Let's select a question from the dataset and perform inference before fine-tuning:

In [None]:
from transformers import pipeline

question = dataset[42]["instruction"]

print("Inference before fine-tuning:")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = generator(question, max_length=100)
print("Question:", question)
print("Answer:", output[0]["generated_text"])
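By default, the `text-generation` pipeline returns the prompt followed by the continuation in `generated_text`, which is why the recorded answer later in this notebook begins by repeating the question. A small sketch for separating the two follows; `strip_prompt` is a hypothetical helper and the sample strings are made up. The pipeline's `return_full_text` argument is another route worth checking in the Transformers documentation.

```python
def strip_prompt(generated_text, prompt):
    """Return only the continuation when the generated text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text


demo_prompt = "What is the capital of France?"
demo_generated = "What is the capital of France? It is Paris."
print(strip_prompt(demo_generated, demo_prompt))
```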

##### Tokenize the Dataset 🔠
We need to tokenize the dataset before fine-tuning:

In [None]:
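def tokenize_function(examples):
    texts = ["Question: " + q + " Answer: " + a for q, a in zip(examples["instruction"], examples["response"])]
    tokenized = tokenizer(texts, truncation=True, padding="max_length", max_length=128)
    # Causal language modeling uses the input IDs as labels; without labels the Trainer cannot compute a loss
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)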
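Padding every sequence to `max_length=128` with the EOS token as the pad token means short examples end in a long run of padding. When labels for causal language modeling are a plain copy of `input_ids`, as in the complete script later in this notebook, the loss also covers those padding positions. A common refinement is to set the labels at padded positions to `-100`, the value the Hugging Face causal LM loss ignores. The sketch below shows the masking step on made-up token IDs; `mask_padding` is a hypothetical helper, not part of the original notebook.

```python
IGNORE_INDEX = -100  # label value skipped by the Hugging Face causal LM loss


def mask_padding(input_ids, attention_mask):
    """Copy input IDs into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# Toy sequence: two real tokens followed by two pad tokens (50256 is GPT-2's EOS id).
toy_input_ids = [15496, 995, 50256, 50256]
toy_attention_mask = [1, 1, 0, 0]
toy_labels = mask_padding(toy_input_ids, toy_attention_mask)
print(toy_labels)
```

Using the attention mask rather than the token ID keeps any genuine EOS token inside the text as a training target.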

##### Set Up Training Arguments 🎛️
Let's set up the training arguments for fine-tuning:

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    max_steps=50,
    bf16=True,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    save_total_limit=2,
)
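With `max_steps=50` and a per-device batch size of 8, this run sees only a small slice of the data, which is worth keeping in mind when comparing the before and after outputs. The arithmetic is sketched below. The dataset size of 7,706 is taken from the row count reported when this notebook was run, and a single device with no gradient accumulation is assumed.

```python
def training_coverage(max_steps, per_device_batch_size, dataset_size,
                      num_devices=1, gradient_accumulation_steps=1):
    """Return (examples seen, fraction of one epoch) for a step-limited training run."""
    examples_seen = (
        max_steps * per_device_batch_size * num_devices * gradient_accumulation_steps
    )
    return examples_seen, examples_seen / dataset_size


# Assumed values: 7706 rows, one device, no gradient accumulation.
examples_seen, fraction = training_coverage(50, 8, 7706)
print(f"{examples_seen} examples, {fraction:.1%} of one epoch")
```

A run this short is a quick demonstration; raising `max_steps` or training for full epochs would expose the model to more of the dataset.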

##### Create the Trainer 🏋️‍♀️
Now, we create the trainer object using the model, training arguments, and tokenized dataset:

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

##### Fine-tune the Model 🚀
It's time to fine-tune the model:

In [None]:
trainer.train()

##### Save the Fine-tuned Model 💾
After fine-tuning, let's save the fine-tuned model:

In [None]:
trainer.save_model("./fine_tuned_model")

##### Perform Inference After Fine-tuning 🔍
Finally, let's load the fine-tuned model and perform inference again:

In [None]:
fine_tuned_model = AutoModelForCausalLM.from_pretrained("./fine_tuned_model")

print("\nInference after fine-tuning:")
generator = pipeline("text-generation", model=fine_tuned_model, tokenizer=tokenizer)
output = generator(question, max_length=100)
print("Question:", question)
print("Answer:", output[0]["generated_text"])

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, pipeline
from datasets import load_dataset

# Load the question-and-answer subset of the Dolly 15k dataset
dataset = load_dataset("dave-does-data/databricks-dolly-qa-subset-7k", split="train")

# Load the distilgpt2 model and tokenizer from Hugging Face
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Select a question from the Dolly 15k dataset
question = dataset[42]["instruction"]

# Perform inference before fine-tuning
print("Inference before fine-tuning:")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = generator(question, max_length=100)
print("Question:", question)
print("Answer:", output[0]["generated_text"])

# Tokenize the dataset
def tokenize_function(examples):
    # Adding a prompt to each question and answer pair
    prompt_examples = ["Question: " + instr + " Answer: " + resp for instr, resp in zip(examples['instruction'], examples['response'])]
    # Tokenize the prompt examples
    tokenized_examples = tokenizer(prompt_examples, truncation=True, padding="max_length", max_length=128)
    # The labels should be the same as input_ids for causal language modeling
    tokenized_examples["labels"] = tokenized_examples["input_ids"].copy()
    return tokenized_examples

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)

# Set up the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    max_steps=50,
    bf16=True,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    save_total_limit=2,
)

# Create the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("./fine_tuned_model")

# Load the fine-tuned model
fine_tuned_model = AutoModelForCausalLM.from_pretrained("./fine_tuned_model")

# Perform inference after fine-tuning
print("\nInference after fine-tuning:")
generator = pipeline("text-generation", model=fine_tuned_model, tokenizer=tokenizer)
output = generator(question, max_length=100)
print("Question:", question)
print("Answer:", output[0]["generated_text"])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Inference before fine-tuning:
Question: What happens when someone throws a phone onto a mattress?
Answer: What happens when someone throws a phone onto a mattress? What happens when people are sleeping? What happens when your phone hits the floor?


It looks like the real estate industry can't even get past that. As The Post reports, "Most Americans are in a better position than in the past."
The company also says it is aware this is not a big deal.
But we've learned from the Post and can still be confident it will follow in the coming months.
What


Map:   0%|          | 0/7706 [00:00<?, ? examples/s]

