In [None]:
%%html
<style>
/* Notebook-wide styles */
.notebook-container {
    font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;
    line-height: 1.6;
    color: #333;
    background-color: #f5f5f5;
}

/* Headings */
h1, h2, h3, h4, h5, h6 {
    font-weight: bold;
    margin-top: 40px;
    margin-bottom: 20px;
    color: #1d1d1f;
}

/* Images */
img {
    display: block;
    margin: 30px auto;
    border-radius: 10px;
    box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1), 0 1px 3px rgba(0, 0, 0, 0.08);
    transition: transform 0.3s ease;
    border: 2px solid #fff;
}

img:hover {
    transform: translateY(-5px);
}

/* Image captions */
.image-caption {
    text-align: center;
    font-size: 14px;
    color: #666;
    margin-top: -10px;
    margin-bottom: 30px;
}

/* Links */
a {
    color: #007aff;
    text-decoration: none;
}

a:hover {
    text-decoration: underline;
}

/* Code cells */
.input_area pre {
    background-color: #1d1d1f;
    color: #fff;
    padding: 15px;
    border-radius: 10px;
}

/* Markdown cells */
.text_cell_render {
    padding: 30px;
    background-color: #fff;
    border-radius: 10px;
    box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1), 0 1px 3px rgba(0, 0, 0, 0.08);
}
</style>


## A Crash Course in Using LLMs to Build GenAI Powered Applications 🚀

### Introduction 👋
Welcome to this crash course on leveraging Large Language Models (LLMs) to infuse intelligence into your applications. In this 30-minute session, we'll explore the key use cases of LLMs and how you can quickly get started with building GenAI powered applications. Get ready to dive in! 🌟

#### Agenda 📋
- Getting Started with the Intel Developer Cloud ☁️
- Setting up the environment ⚙️
- Zero-shot Chat 🎯
    - What is zero-shot learning?
    - Why use zero-shot learning?
- Few-shot Learning 🎓
    - What is few-shot learning?
    - Why use few-shot learning?
- Retrieval Augmented Generation (RAG) 🔍
    - What is RAG?
    - Why use RAG?
- Fine-tuning LLMs 🚀
    - What is fine-tuning?
    - Why use fine-tuning?
___
#### 🌟 Getting Started with Intel Developer Cloud 🌟
To prototype and build your GenAI solution on Intel Developer Cloud, you'll first need to create a free account. Here's how:

1. Go to [cloud.intel.com/hoohacks](https://cloud.intel.com/hoohacks) and click "Get Started" if you don't have an account yet.
   
   <img src="./images/idc_home_page.png" alt="Intel Developer Cloud homepage" width="500">
   <div class="image-caption">Intel Developer Cloud homepage</div>

2. Register for a Standard (free) account by filling out the sign-up form.
   
   <img src="./images/idc_service_tiers.png" alt="Intel Developer Cloud service tiers" width="500">
   <div class="image-caption">Intel Developer Cloud service tiers</div>

3. Once registered, log in with your new credentials. You'll land on the Intel Developer Cloud Console home page at [https://console.cloud.intel.com](https://console.cloud.intel.com).
   
   <img src="./images/console_genai_essentials.png" alt="Intel Developer Cloud Console - GenAI Essentials" width="500">
   <div class="image-caption">Intel Developer Cloud Console - GenAI Essentials</div>

4. From the home page or the left sidebar, navigate to **"Training"** and look for the "Gen AI Essentials" content. This is where you'll find hands-on tutorials for building GenAI applications.

#### 🤖 Prototyping Fast with Intel Hardware 🤖
When you launch a Jupyter notebook from GenAI Essentials, it will have shared access to a system with:

- **112 CPU cores**
- **512 GB RAM**
- **1 Intel Data Center GPU Max with 48GB VRAM**

This notebook environment is perfect for building your GenAI solution prototype. Go wild and have fun experimenting! 😄

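If you are developing in PyTorch, select the `pytorch-gpu` kernel. Importing `intel_extension_for_pytorch` adds the `torch.xpu` namespace, which you can use to confirm that the Intel GPU is visible: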

```python
import torch
import intel_extension_for_pytorch as ipex
print(f"GPU is: {torch.xpu.get_device_name()}")
```
#### 🚀 Deploying Your Solution 🚀
Once you have a working prototype and are ready to deploy it for others to use, come talk to me. I can provide you with cloud credits to launch a dedicated CPU VM to host your deployed app.

To expose your GenAI application to the internet, you can create a simple Flask server and use a tool like ngrok to give it a public URL. Here's how:


1. Install Flask and pyngrok (the Python wrapper for ngrok that is imported in step 3):

```bash
pip install flask pyngrok
```

2. Create a simple Flask server in a new Python file (the default port is 5000):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/api', methods=['POST'])
def api():
    # Parse the JSON body sent by the client
    data = request.get_json()
    # Process the request and generate a response
    response = {'message': 'Hello, World!'}
    return jsonify(response)

if __name__ == '__main__':
    app.run()  # listens on port 5000 by default
```

3. In a new Python file, set up ngrok to expose the Flask server:

```python
from pyngrok import ngrok

# Start ngrok tunnel
public_url = ngrok.connect(5000)
print(f'Public URL: {public_url}')
```
4. Run the Flask server and ngrok python files. The public URL printed by ngrok can be used to access your GenAI application from anywhere on the internet.
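
Once both scripts are running, a client can call the endpoint with a JSON `POST` request. The following is a minimal sketch that uses only the Python standard library. The helper names `build_api_request` and `call_api` are hypothetical, and the URL is a placeholder for whatever public URL ngrok prints on your machine.

```python
import json
import urllib.request


def build_api_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the Flask `/api` route shown above."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_api(base_url: str, payload: dict, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON reply (needs a running server)."""
    request = build_api_request(base_url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))


# Placeholder URL: replace it with the public URL that ngrok prints.
example = build_api_request("https://example.ngrok.app", {"prompt": "Hello"})
print(example.get_method(), example.full_url)
```

Building the request separately from sending it makes it easy to inspect the method, URL, and body before any network traffic happens. Call `call_api(...)` only when the server and tunnel are actually running.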

> **Note:** This is a basic example to get you started. You'll need to modify the Flask server to integrate your GenAI application logic and handle the appropriate requests and responses. Anyone with the public URL can then reach your application over the internet.

Let me know if you have any other questions! I'm excited to see what you build 🙌

#### Setting up the Environment ⚙️
First, let's install the necessary libraries:

In [None]:
import os
import site
import sys

!echo "installing required python libraries, please wait..."
!{sys.executable} -m pip install --upgrade predictionguard #> /dev/null # for accessing LLM APIs
!{sys.executable} -m pip install --upgrade "transformers>=4.38" #> /dev/null
!{sys.executable} -m pip install --upgrade "datasets>=2.18" #> /dev/null
!{sys.executable} -m pip install --upgrade "accelerate>=0.28" #> /dev/null
!{sys.executable} -m pip install --upgrade peft #> /dev/null  # for LoRA fine-tuning
!{sys.executable} -m pip install --upgrade faiss-cpu #> /dev/null  # for indexing
!{sys.executable} -m pip install --upgrade sentence_transformers #> /dev/null # for generating embeddings
!echo "installation complete..."

# add the location where we installed these libraries to the python pkg path (~/.local/lib/python3.9/*)
# Get the site-packages directory
site_packages_dir = site.getsitepackages()[0]

# add the site pkg directory where these pkgs are installed to the top of sys.path
if not os.access(site_packages_dir, os.W_OK):
    user_site_packages_dir = site.getusersitepackages()
    if user_site_packages_dir in sys.path:
        sys.path.remove(user_site_packages_dir)
    sys.path.insert(0, user_site_packages_dir)
else:
    if site_packages_dir in sys.path:
        sys.path.remove(site_packages_dir)
    sys.path.insert(0, site_packages_dir)


# adding ~/.local/bin to PATH as well
home_dir = os.path.expanduser('~')
bin_path = os.path.join(home_dir, '.local', 'bin')
os.environ['PATH'] += os.pathsep + bin_path
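
After the install cell finishes, it can help to confirm that the packages are importable from the current kernel before moving on. The following is a small sketch using the standard library; `missing_packages` is a hypothetical helper, and the list of import names is an assumption based on the packages installed above (for example, `faiss-cpu` is imported as `faiss`).

```python
import importlib.util


def missing_packages(module_names):
    """Return the names from `module_names` that cannot be imported."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed top-level import names for the packages installed above.
required = [
    "predictionguard",
    "transformers",
    "datasets",
    "accelerate",
    "faiss",
    "sentence_transformers",
]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If anything is reported as missing, restarting the kernel after installation is often enough for the new packages to be picked up.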

Next, import the required modules:


In [None]:
import os
import json

# FREE access token for usage at: tinyurl.com/pg-intel-hack
import predictionguard as pg
from getpass import getpass

Set up your Prediction Guard access token:

In [None]:
pg_access_token = getpass('Enter your Prediction Guard access token: ')
os.environ['PREDICTIONGUARD_TOKEN'] = pg_access_token

___
#### Zero-shot Chat 🎯

Zero-shot learning allows LLMs to perform tasks without any explicit training or examples. It leverages the model's pre-existing knowledge to generate responses based on the given prompt.

Why use zero-shot learning?

- Quick and easy to implement
- No need for task-specific training data
- Suitable for simple and straightforward tasks

Let's try a simple example of zero-shot chat using the Prediction Guard API:

In [None]:
messages = [
{
"role": "system",
"content": """You are a question-answering bot and will give an answer based on the question you get.
It is critical to limit your answers to the question and don't print anything else.
If you cannot answer the question, respond with 'Sorry, I don't know.'"""
},
{
"role": "user",
"content": "I am going to meet my friend for a night out on the town."
}
]

result = pg.Chat.create(
    model="Neural-Chat-7B",
    messages=messages
)

print(result['choices'][0]['message']['content'].split('\n')[0])
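
The message list above is the whole "zero-shot" setup: one system instruction and one user message, with no worked examples. The following is a minimal sketch of a helper that builds that structure. The function name `build_zero_shot_messages` and the wording of the system prompt are illustrative assumptions, not part of the Prediction Guard API.

```python
def build_zero_shot_messages(user_text, fallback="Sorry, I don't know."):
    """Return a system + user message pair with no worked examples."""
    system_prompt = (
        "You are a question-answering bot. Answer only the question you are given. "
        f"If you cannot answer, respond with '{fallback}'"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


messages = build_zero_shot_messages("What is the capital of France?")
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

The resulting list has the same shape as the `messages` argument passed to the chat call above, so it could be passed in its place.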

___
#### Few-shot Learning 🎓

Few-shot learning involves providing a small number of examples to guide the model's output. It allows LLMs to adapt to specific tasks or styles by learning from a few representative examples.

Why use few-shot learning?

- Enables task-specific customization
- Improves the model's performance on desired tasks
- Requires minimal training data

Let's explore few-shot learning for linguistic style transfer with the Prediction Guard API:





In [None]:
examples = """Neutral: "I'm looking for directions to the nearest bank, can you help me?"
Yoda: "Directions to the nearest bank, you seek. Help you, I can."

Neutral: "It's a pleasure to meet you. What's your name?"
Yoda: "A pleasure to meet you, it is. Yoda, my name is."

Neutral: "I've lost my way, could you point me in the right direction?"
Yoda: "Lost your way, you have. Point you in the right direction, I will."

Neutral: "This weather is wonderful, isn't it?"
Yoda: "Wonderful, this weather is. Agree, do you not?"

Neutral: "I'm feeling a bit under the weather today."
Yoda: "Under the weather, you are feeling today. Better soon, you will be."

Neutral: "Could you please lower the volume? It's quite loud."
Yoda: "Lower the volume, could you please? Quite loud, it is."

Neutral: "I'm here to collect the documents you mentioned."
Yoda: "The documents I mentioned, collect them, you are here to."

Neutral: "Thank you for your assistance. I really appreciate it."
Yoda: "For your assistance, thank you. Appreciate it, I really do."

Neutral: "I'm sorry, I didn't catch your last sentence."
Yoda: "Sorry, I am. Your last sentence, catch it, I did not."

Neutral: "Let's schedule a meeting for next week to discuss the project."
Yoda: "A meeting for next week, schedule, let us. The project, discuss, we will."""

messages = [
{
"role": "system",
"content": "You are a text editor that takes in Neutral text from the user and outputs modified text in the way Yoda would speak similar to these examples:\n\n" + examples
},
{
"role": "user",
"content": 'Neutral: "I am going to meet my friend for a night out on the town."\nYoda:'
}
]

for model in ["Neural-Chat-7B","Hermes-2-Pro-Mistral-7B", "Yi-34B-Chat"]:
    result = pg.Chat.create(
        model=model,
        messages=messages
    )
    print("="*71)
    print(f"Using Model: {model}")
    print("Neutral Text: I am going to meet my friend for a night out on the town.")
    lines = result['choices'][0]['message']['content'].split('\n')
    print(f"How would Yoda say this?: {lines[0]}")
    print("="*71)
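
The few-shot prompt above is a block of labelled example pairs followed by an open label for the model to complete. The following sketch shows one way to assemble that layout from a list of pairs. The helper names `format_examples` and `build_few_shot_messages` are hypothetical, and the two short pairs are only for illustration.

```python
def format_examples(pairs, source_label="Neutral", target_label="Yoda"):
    """Render (source, target) pairs in the labelled layout used above."""
    blocks = [
        f'{source_label}: "{source}"\n{target_label}: "{target}"'
        for source, target in pairs
    ]
    return "\n\n".join(blocks)


def build_few_shot_messages(pairs, new_text, source_label="Neutral", target_label="Yoda"):
    """Put the examples in the system message and leave the target label open."""
    system_prompt = (
        f"You rewrite {source_label} text from the user in the {target_label} style, "
        "following these examples:\n\n" + format_examples(pairs, source_label, target_label)
    )
    user_prompt = f'{source_label}: "{new_text}"\n{target_label}:'
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


pairs = [
    ("This weather is wonderful, isn't it?", "Wonderful, this weather is. Agree, do you not?"),
    ("I've lost my way.", "Lost your way, you have."),
]
few_shot_messages = build_few_shot_messages(pairs, "I am going to meet my friend.")
print(few_shot_messages[1]["content"])
```

Keeping the examples as data rather than one long string makes it easier to add, remove, or swap pairs when experimenting with how many examples a model needs.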


___
#### Retrieval Augmented Generation (RAG) 🔍

RAG combines LLMs with external knowledge retrieval to generate more informative and accurate responses. It allows models to access and incorporate relevant information from external sources during the generation process.

**Why use RAG?**

- Enhances the model's knowledge beyond its training data
- Generates more accurate and informative responses
- Enables the model to handle a wider range of topics and domains
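
Before bringing in real embedding models, the retrieve-then-prompt idea can be sketched with the standard library alone. In this toy version, word-count vectors and cosine similarity stand in for a learned embedding model and a vector index, so it only illustrates the flow; the documents and the question are made up for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents):
    """Return the document most similar to the question."""
    query = embed(question)
    return max(documents, key=lambda doc: cosine(query, embed(doc)))


documents = [
    "The library opens at nine in the morning.",
    "The cafeteria serves lunch from noon until two.",
]
question = "When does the library open?"
context = retrieve(question, documents)

# The retrieved text is placed in the prompt so the model can ground its answer.
prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
print(prompt)
```

Word counts miss synonyms and paraphrases, which is the main reason to use a learned embedding model for retrieval in practice.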


##### 🚀 Building a Question Answering RAG pipeline with Sentence Transformers and FAISS 📚

Install required libraries:

In [None]:
#!echo "installing required libraries..."
#!pip install faiss-cpu > /dev/null  # for indexing
#!pip install sentence_transformers > /dev/null  # for generating embeddings
#!echo "installing completed..."


In this section, we'll explore how to build a powerful question answering system using Sentence Transformers for generating embeddings and FAISS for efficient similarity search. Let's dive in! 🌊

###### 📥 Importing the Required Libraries
First, let's import the necessary libraries:

In [None]:
import predictionguard as pg
from sentence_transformers import SentenceTransformer
import faiss

###### 🏗️ Defining the Knowledge Base
Next, we'll define our simplified knowledge base as a list of strings:

In [None]:
knowledge_base = [
    "Prediction Guard is an AI company that provides APIs for language models.",
    "Prediction Guard is an Intel Liftoff startup",
    "Intel Liftoff is Intel's premier startup accelerator program for early stage startups",
    "Prediction Guard offers a variety of models for different tasks like text generation, classification, and question answering.",
    "Prediction Guard's APIs are easy to use and integrate into your applications.",
    "Prediction Guard is deployed on the Intel Developer Cloud using Intel Habana Gaudi 2 machines.",
    "Intel Habana Gaudi 2 is a purpose-built AI processor designed for high-performance deep learning training and inference.",
    "Gaudi 2 offers high efficiency, scalability, and ease of use for AI workloads.",
    "By leveraging Gaudi 2 on the Intel Developer Cloud, Prediction Guard can provide powerful and efficient AI capabilities to its users."
]

###### 💬 Creating the Question Answering Prompt Template
We'll create a simple prompt template for our question answering system. If you use LangChain, you can optionally use its `PromptTemplate` class to build a composable prompt instead:

In [None]:
# The two {} placeholders are filled with the context and the question via str.format()
prompt_template = """
### Instruction:
Read the below input context and respond with a short answer to the given question.
Use only the information in the below input to answer the question.
It is critical to limit your answers to the question and don't print anything else.
If you cannot answer the question, respond with "Sorry, I don't know."

### Input:
Context: {}
Question: {}

### Response:
"""

###### 🤖 Loading the Sentence Transformer Model
We'll load the Sentence Transformer model for generating embeddings:

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2") # a small and fast embedding model

###### 🎨 Generating Embeddings for the Knowledge Base
Using the loaded model, we'll generate embeddings for our knowledge base:

In [None]:
kb_embeddings = model.encode(knowledge_base)

###### 🔍 Initializing the FAISS Index

We'll initialize a FAISS index for efficient similarity search:

In [None]:
index = faiss.IndexFlatL2(kb_embeddings.shape[1]) # 384
index.add(kb_embeddings)
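
As far as the FAISS documentation describes it, `IndexFlatL2` performs an exhaustive search: it compares the query against every stored vector and reports squared Euclidean distances. The following pure-Python sketch mirrors that behavior on tiny 2-D vectors. It is not how FAISS is implemented; `flat_l2_search` is a hypothetical name, and the vectors are made up.

```python
def flat_l2_search(vectors, query, k=1):
    """Return (squared_distances, indices) of the k stored vectors nearest to query."""
    scored = []
    for idx, vector in enumerate(vectors):
        squared_distance = sum((x - q) ** 2 for x, q in zip(vector, query))
        scored.append((squared_distance, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = flat_l2_search(stored, query=[0.9, 1.2], k=2)
print(indices, distances)
```

An exhaustive scan is exact and perfectly adequate for a knowledge base of nine sentences. Approximate index types become interesting only when the number of stored vectors grows large.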

###### 🎯 Defining the Question Answering Function
Let's define a function that takes a question, finds the most relevant chunk from the knowledge base, and generates an answer using the language model:

In [None]:
def rag_answer(question):
    try:
        # Generate embedding for the question
        question_embedding = model.encode([question])

        # Find the most similar text from the knowledge base using FAISS
        _, most_relevant_idx = index.search(question_embedding, 1)
        relevant_chunk = knowledge_base[most_relevant_idx[0][0]]
        # Fill the prompt template with the relevant context and the question using str.format()
        prompt = prompt_template.format(relevant_chunk, question)

        # Get a response from the language model
        result = pg.Completion.create(
            model="Neural-Chat-7B", #"Nous-Hermes-Llama2-13B",
            prompt=prompt
        )
        return result['choices'][0]['text']
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return "Sorry, something went wrong. Please try again later."
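
`rag_answer` passes only the single closest chunk to the model. When a question touches more than one fact, one option is to retrieve several chunks and join them into the context. The following sketch shows only the prompt-assembly step; `PROMPT_TEMPLATE`, `build_rag_prompt`, and the hard-coded ranking are assumptions for illustration (in practice the ranking would come from a similarity search with `k` greater than 1).

```python
PROMPT_TEMPLATE = (
    "### Instruction:\n"
    "Answer the question using only the context below.\n\n"
    "### Input:\nContext: {context}\nQuestion: {question}\n\n"
    "### Response:\n"
)


def build_rag_prompt(question, chunks, ranked_indices, max_chunks=2):
    """Join the top-ranked chunks into one context string and fill the template."""
    selected = [chunks[i] for i in ranked_indices[:max_chunks]]
    context = "\n".join(f"- {chunk}" for chunk in selected)
    return PROMPT_TEMPLATE.format(context=context, question=question)


chunks = [
    "Prediction Guard is an Intel Liftoff startup",
    "Gaudi 2 offers high efficiency, scalability, and ease of use for AI workloads.",
    "Intel Liftoff is Intel's premier startup accelerator program for early stage startups",
]
# Hypothetical ranking, e.g. the index row returned by a k=3 similarity search.
ranked = [2, 0, 1]
rag_prompt = build_rag_prompt("What is Intel Liftoff?", chunks, ranked)
print(rag_prompt)
```

More context gives the model more to work with but also lengthens the prompt, so `max_chunks` is a trade-off worth tuning for your data.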

###### 🧪 Testing the Question Answering System
Let's test our question answering system with a couple of examples:

In [None]:
print("="*71)
question1 = "What hardware does Prediction Guard use for its AI services?"
response1 = rag_answer(question1)
print(f"Question 1: {question1}")
print(f"Response 1: {response1}")
print("="*71)

question2 = "Where is Prediction Guard's headquarters located?"
response2 = rag_answer(question2)
print(f"Question 2: {question2}")
print(f"Response 2: {response2}")
print("="*71)

question3 = "What is Intel Liftoff and what is Prediction Guard's relationship to Liftoff?"
response3 = rag_answer(question3)
print(f"Question 3: {question3}")
print(f"Response 3: {response3}")
print("="*71)

The first question can be answered from the knowledge base, so the system should return a relevant answer. The second asks for information that is not in the knowledge base, so the model should respond with "Sorry, I don't know." The third question spans two knowledge base entries, but `rag_answer` retrieves only the single closest chunk, so its answer may be incomplete. 🙌

___
#### Fine-tuning LLMs 🚀

Fine-tuning involves training LLMs on task-specific data to adapt them to particular domains or applications. It allows for more precise control over the model's behavior and can lead to better performance on specialized tasks.

**Why use fine-tuning?**

- Achieves the best performance on specific tasks
- Enables customization to match desired output style and format
- Suitable for complex and domain-specific applications

In [None]:
import warnings
warnings.filterwarnings("ignore")
import torch
import intel_extension_for_pytorch   # to add intel GPU namespace (torch.xpu) to pytorch

if torch.xpu.is_available():
    def get_memory_usage():
        memory_reserved = round(torch.xpu.memory_reserved() / 1024**3, 3)
        memory_allocated = round(torch.xpu.memory_allocated() / 1024**3, 3)
        max_memory_reserved = round(torch.xpu.max_memory_reserved() / 1024**3, 3)
        max_memory_allocated = round(torch.xpu.max_memory_allocated() / 1024**3, 3)
        return memory_reserved, memory_allocated, max_memory_reserved, max_memory_allocated
   
    def print_memory_usage():
        device_name = torch.xpu.get_device_name()
        print(f"XPU available!! \n - Name: {device_name}")
        memory_reserved, memory_allocated, max_memory_reserved, max_memory_allocated = get_memory_usage()
        memory_usage_text = f" - XPU Memory: Reserved={memory_reserved} GB, Allocated={memory_allocated} GB, Max Reserved={max_memory_reserved} GB, Max Allocated={max_memory_allocated} GB"
        print(f"\r{memory_usage_text}", end="", flush=True)
    
    print_memory_usage()
    torch.xpu.empty_cache()

    

##### Load the Dataset 📚

First, let's load the Databricks Dolly 15k dataset using the Hugging Face Datasets library:

In [None]:
from datasets import load_dataset

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

##### Load the Model and Tokenizer 🤖
Next, we load the `microsoft/phi-1_5` model and tokenizer from Hugging Face:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
device = "cpu" #"xpu" if torch.xpu.is_available() else "cpu"
model_name = "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name).to(device) # loads on the CPU here; set device to "xpu" above to use the Intel GPU

##### Perform Inference Before Fine-tuning 🔍
Let's select a question from the dataset and perform inference before fine-tuning:

In [None]:
from transformers import pipeline

question = dataset[42]["instruction"]

print("Inference before fine-tuning:")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, truncation=True)
output = generator(question, max_length=42)
print("Question:", question)
print("Answer:", output[0]["generated_text"])

##### Format and Tokenize the Dataset 🔠
We need to format and tokenize the dataset before fine-tuning:

In [None]:
def format_dataset(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    sample["text"] = f"{prompt}{tokenizer.eos_token}"
    return sample

dataset = dataset.map(format_dataset)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)

# Split the dataset into train and test subsets
train_test_split = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]
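
To see what the formatting step produces, here is a standalone sketch of the same prompt layout applied to two hand-written samples, one with context and one without. `format_sample` is a hypothetical stand-in for `format_dataset`, and `EOS_TOKEN` is an assumed placeholder for `tokenizer.eos_token`.

```python
EOS_TOKEN = "<|endoftext|>"  # assumed stand-in for tokenizer.eos_token


def format_sample(sample, eos_token=EOS_TOKEN):
    """Build the training text; the Context section is skipped when empty."""
    sections = [f"### Instruction\n{sample['instruction']}"]
    if sample.get("context"):
        sections.append(f"### Context\n{sample['context']}")
    sections.append(f"### Answer\n{sample['response']}")
    return "\n\n".join(sections) + eos_token


with_context = format_sample(
    {"instruction": "Who wrote it?", "context": "The note was written by Ada.", "response": "Ada"}
)
without_context = format_sample(
    {"instruction": "Say hello.", "context": "", "response": "Hello!"}
)
print(with_context)
print("---")
print(without_context)
```

Ending each sample with the end-of-sequence token gives the model a signal for where an answer stops, and using the same section headers at inference time keeps prompts consistent with the training data.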

##### 🔧 Configuring PEFT (Parameter-Efficient Fine-Tuning)

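We will use parameter-efficient fine-tuning, specifically LoRA (Low-Rank Adaptation), which trains small adapter matrices on the selected modules while the base model weights stay frozen: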

In [None]:
from peft import get_peft_model, LoraConfig, TaskType

config = LoraConfig(
    r=16,
    lora_alpha=2,
    target_modules=["fc1", "fc2","Wqkv", "out_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
#model.gradient_checkpointing_enable()  # enable if low on VRAM
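
To get a feel for why LoRA is cheap, the arithmetic can be sketched directly. For a weight matrix of shape `d_out x d_in`, a rank-`r` adapter adds two small matrices with `r * d_in + d_out * r` parameters in total, and in the standard LoRA formulation the update is scaled by `lora_alpha / r`. The layer size below is a made-up example, not a value read from the model.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scale(lora_alpha, r):
    """Factor applied to the low-rank update B @ A (standard LoRA scaling)."""
    return lora_alpha / r


# Hypothetical layer size chosen for illustration, not read from phi-1_5.
d_in, d_out = 2048, 2048
full = d_in * d_out
extra = lora_extra_params(d_in, d_out, r=16)
print(f"full weight: {full:,} params, LoRA adapter: {extra:,} params "
      f"({100 * extra / full:.2f}% of the layer)")
print("update scale:", lora_scale(lora_alpha=2, r=16))
```

With `r=16` and `lora_alpha=2` as configured above, the scale works out to 0.125, which damps the adapter's contribution. Raising `lora_alpha` is a common knob if the adapter seems to have too little effect.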

##### Set Up Training Arguments 🎛️
Let's set up the training arguments for fine-tuning:

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    bf16=True,
    use_ipex=True,
    max_grad_norm=0.6,
    weight_decay=0.01,
    group_by_length=True,
    optim="adamw_hf",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    save_strategy="epoch",
    logging_steps=30,
    max_steps=200,
    #num_train_epochs=3,
    report_to="wandb"
)

##### Create the Trainer 🏋️‍♀️
Now, we create the trainer object using the model, training arguments, and tokenized dataset:

In [None]:
from transformers import DataCollatorForLanguageModeling 
from transformers import Trainer
import os
os.environ["TOKENIZERS_PARALLELISM"] = "0"  # prevent warnings from training on process forking
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
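
A couple of the training arguments are easier to reason about with the numbers worked out. The sketch below computes the effective batch size from the per-device batch size and gradient accumulation steps, and evaluates a plain cosine decay at a few steps. It assumes a single device and no warmup, and the helper names are hypothetical. It is meant to approximate what the `cosine` scheduler does, not to reproduce the library's exact implementation.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Samples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def cosine_lr(step, total_steps, peak_lr):
    """Cosine decay from peak_lr to 0 with no warmup (half a cosine period)."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


batch = effective_batch_size(per_device_batch=2, grad_accum_steps=2)
print("effective batch size:", batch)
print("samples seen in 200 steps:", batch * 200)
for step in (0, 100, 200):
    print(f"step {step:3d}: lr = {cosine_lr(step, 200, 2e-4):.2e}")
```

Under these assumptions, 200 optimizer steps cover 800 training samples, a small fraction of the dataset. That keeps the demo short, and it also means the fine-tuned model has seen only a sliver of the data.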

##### Fine-tune the Model 🚀
It's time to fine-tune the model:

In [None]:
if torch.xpu.is_available():
    torch.xpu.empty_cache()
results = trainer.train()

##### Save the Fine-tuned Model 💾
After fine-tuning, let's save the fine-tuned model:

In [None]:
trainer.save_model("./fine_tuned_model")

##### Perform Inference After Fine-tuning 🔍
Finally, let's load the fine-tuned model and perform inference again:

In [None]:
print("\nInference after fine-tuning with context:")
if torch.xpu.is_available():
    torch.xpu.empty_cache()
# Use the Intel GPU when it is available, otherwise fall back to the CPU
inference_device = "xpu" if torch.xpu.is_available() else "cpu"
fine_tuned_model = AutoModelForCausalLM.from_pretrained("./fine_tuned_model").to(inference_device)
prompt = "What is oneAPI?"
context = "Intel's oneAPI is a unified programming model that simplifies development across diverse architectures, including CPUs, GPUs, FPGAs, and other accelerators. It provides a set of tools and libraries for high-performance computing, AI, and machine learning workloads."
input_text = f"### Instruction\n{prompt}\n\n### Context\n{context}\n\n### Answer\n"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(inference_device)

outputs = fine_tuned_model.generate(input_ids=input_ids, max_length=100, num_return_sequences=1, temperature=0.1, do_sample=True, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
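
The decoded output includes the prompt itself, so the instruction and context are printed back along with the answer. The following sketch pulls out only the text after the answer header. `extract_answer` is a hypothetical helper, and the `decoded` string is written by hand for illustration rather than taken from a model run.

```python
def extract_answer(generated_text, marker="### Answer\n"):
    """Return the text after the last answer marker, or the whole text if absent."""
    _, found, answer = generated_text.rpartition(marker)
    return answer.strip() if found else generated_text.strip()


# Hypothetical decoded output, written by hand for illustration.
decoded = (
    "### Instruction\nWhat is oneAPI?\n\n"
    "### Context\nA unified programming model.\n\n"
    "### Answer\noneAPI is a unified programming model.\n"
)
print(extract_answer(decoded))
```

Splitting on the same header used in the training format keeps post-processing aligned with how the model was taught to structure its output.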

#### Conclusion 🎉
Congratulations on completing this crash course on using Large Language Models (LLMs) to build GenAI powered applications! 🙌

Throughout this session, we've covered a range of techniques and approaches:

- We explored zero-shot learning, where you learned how to leverage pre-trained LLMs without additional training data. This powerful technique allows you to tackle a wide range of tasks out of the box. 🎯
- Next, we delved into few-shot learning, where you discovered how providing a few examples can significantly improve the performance and accuracy of LLMs on specific tasks. 🎓
- We then saw Retrieval Augmented Generation (RAG), a technique that combines the power of LLMs with external knowledge retrieval to generate more informed and contextually relevant responses. 🔍
- Finally, we went through fine-tuning, where you learned how to adapt pre-trained LLMs to specific domains or tasks by training them on a smaller dataset. This enables you to build highly specialized and efficient GenAI applications. 🚀
  
With the knowledge and skills you've gained in this crash course, you're now well-equipped to use the potential of LLMs and build intelligent, GenAI-powered applications.


Thank you for joining this crash course, and I hope you found it informative and engaging. If you have any further questions or want to dive deeper into any of the topics we covered, feel free to reach out or explore the vast array of resources available online.

Happy hacking, and may your GenAI applications be intelligent, intuitive, and impactful! 🚀✨