
# Using Open Source LLMs Natively

This notebook briefly shows how to use popular open-source LLMs through the following libraries and APIs:

- Hugging Face Transformers
- Hugging Face Serverless Inference APIs
- Hugging Face Inference Client
- Groq Cloud

## Install Dependencies

In [None]:
!pip install transformers==4.44.2
!pip install accelerate==0.34.2 # useful when using models with GPUs locally via huggingface
!pip install groq==0.11.0


## Get Hugging Face Access Token

You need an access token to download or access models on Hugging Face's platform:

- Hugging Face Access Token: Go [here](https://huggingface.co/settings/tokens) and create a token with write permissions. You need to set up an account, which is free.


1. After creating your account, go to [Settings -> Access Tokens](https://huggingface.co/settings/tokens) and create a new access token with write permissions

![](https://i.imgur.com/dtS6tFr.png)

2. __Save__ your token somewhere safe, because it is shown only once, as in the screenshot below. Copy it into a secure local file for later use. If you lose it, you can create a new token at any time.

![](https://i.imgur.com/NmZmpmw.png)

## Load Hugging Face Access Token


In [None]:
from getpass import getpass

hf_key = getpass("Enter your Hugging Face Access Token: ")

Enter your Hugging Face Access Token: ··········


## Configure Key in Environment


In [None]:
import os

os.environ["HF_TOKEN"] = hf_key
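When the notebook is re-run often, typing the token every time gets tedious. The following is a minimal sketch, assuming you are happy to reuse a token that is already present in the environment; `load_hf_token` and its `environ` parameter are hypothetical names introduced here for illustration.

```python
import os
from getpass import getpass


def load_hf_token(env_var="HF_TOKEN", environ=os.environ):
    """Return the token stored under env_var, prompting only if it is missing."""
    token = environ.get(env_var, "").strip()
    if not token:
        # Fall back to an interactive prompt so the token is never hard-coded.
        token = getpass("Enter your Hugging Face Access Token: ").strip()
        environ[env_var] = token
    return token
```

Calling `hf_key = load_hf_token()` would then replace the two cells above with a single step.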

## Using LLMs Locally with Hugging Face

Use this approach if you want to download an LLM and run it entirely on your own machine, without sending your data to an external server. In practice you will need a GPU to run these models at a reasonable speed, as even the smaller language models are still quite large.

Certain LLMs, such as [Meta Llama 3.2 1B Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), are gated, so make sure to apply for access as shown below; otherwise you will get an error when loading the model.

![](https://i.imgur.com/M88MOu5.png)

## Load the LLM Locally Using Hugging Face

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=torch.bfloat16
)


In [None]:
chat = [
    { "role": "user", "content": "Explain what is Generative AI in 2 bullet points" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Explain what is Generative AI in 2 bullet points<|eot_id|><|start_header_id|>assistant<|end_header_id|>
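
To make the layout printed above less mysterious, here is a rough hand-written approximation of what the chat template produces. It is a simplified illustration, not the tokenizer's actual template: it omits the default system block with the dates, and `format_llama3_prompt` is a hypothetical helper name.

```python
def format_llama3_prompt(messages):
    """Hand-rolled approximation of the Llama 3 chat layout (illustration only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn marker.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Counterpart of add_generation_prompt=True: open an assistant turn for the model to fill.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


print(format_llama3_prompt([{"role": "user", "content": "Hello"}]))
```

In real code, prefer `tokenizer.apply_chat_template`, since each model family ships its own template.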




Always refer to the [__documentation__](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate), which describes all the arguments of the `generate` method in detail. Most notably:

- **max_length:** The maximum length of the full sequence (prompt plus generated tokens)
- **max_new_tokens:** The maximum number of tokens to generate, ignoring the number of tokens in the prompt. Use either max_new_tokens or max_length, not both, since they serve the same purpose
- **do_sample:** Whether or not to use sampling. False means greedy decoding, which behaves like a temperature approaching 0
- **temperature:** A positive value (typically between 0 and 1) used to modulate the next-token probabilities; it only takes effect when do_sample is True. Higher temperatures make the results more varied and creative
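
To see what modulating the next-token probabilities means, the sketch below applies temperature scaling to a made-up set of logits using only the standard library. The logits and the helper name `softmax_with_temperature` are illustrative assumptions and are not taken from the model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy decoding instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures concentrate almost all probability on the top-scoring token, while higher temperatures flatten the distribution, which is why sampled outputs become more varied.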

In [None]:
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=1000)
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Explain what is Generative AI in 2 bullet points<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here are 2 bullet points explaining Generative AI:

• **AI that generates new content**: Generative AI uses algorithms and machine learning to create new content, such as text, images, music, or videos, that is similar to existing content but with unique characteristics. This can be useful for tasks like image editing, writing, or even generating new ideas for products or services.

• **AI that learns from data**: Generative AI can also learn from large datasets and improve its performance over time. By analyzing patterns and relationships in the data, the AI can generate new content that is more accurate, creative, and relevant to the user's needs.<|eot_id|>
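
The decoded output above contains the whole prompt and the special tokens. A minimal sketch for keeping only the assistant's reply, assuming the Llama 3 markers shown in the output; `extract_assistant_reply` is a hypothetical helper.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def extract_assistant_reply(decoded):
    """Return the text of the last assistant turn in a decoded Llama 3 sequence."""
    # Keep everything after the last assistant header (or the whole text if there is none).
    _, _, tail = decoded.rpartition(ASSISTANT_HEADER)
    # Drop the end-of-turn marker and anything after it.
    reply, _, _ = tail.partition(END_OF_TURN)
    return reply.strip()
```

An alternative that avoids string handling is to decode only the newly generated tokens, for example `tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)`.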


### Pipelines make it easier to send prompts

You don't need to encode and decode your inputs and outputs every time.

In [None]:
# The model is already loaded on the GPU in bfloat16, so the pipeline only
# needs the model and tokenizer objects
llama_pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

In [None]:
chat = [
    { "role": "user", "content": "Explain what is Generative AI in 2 bullet points" },
]

In [None]:
response = llama_pipe(chat, max_new_tokens=1000)
print(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': [{'role': 'user', 'content': 'Explain what is Generative AI in 2 bullet points'}, {'role': 'assistant', 'content': 'Here are 2 bullet points explaining Generative AI:\n\n• **Artificial Intelligence that generates new content**: Generative AI is a type of artificial intelligence that uses algorithms and machine learning to create new content, such as images, music, videos, text, or even entire stories. It can take input from users or existing data and generate new content that is similar in style or spirit.\n\n• **Generative models can be fine-tuned for specific tasks**: Generative models can be fine-tuned for specific tasks, such as image generation, text-to-image synthesis, or music generation. This means that users can train their generative models on specific datasets or tasks and use them to create new content that meets their specific needs.'}]}]


In [None]:
print(response[0]["generated_text"][-1]['content'])

Here are 2 bullet points explaining Generative AI:

• **Artificial Intelligence that generates new content**: Generative AI is a type of artificial intelligence that uses algorithms and machine learning to create new content, such as images, music, videos, text, or even entire stories. It can take input from users or existing data and generate new content that is similar in style or spirit.

• **Generative models can be fine-tuned for specific tasks**: Generative models can be fine-tuned for specific tasks, such as image generation, text-to-image synthesis, or music generation. This means that users can train their generative models on specific datasets or tasks and use them to create new content that meets their specific needs.
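
Because the pipeline returns the full message list, a follow-up question can reuse it as conversation history. This is a small sketch under that assumption; `continue_chat` is a hypothetical helper.

```python
def continue_chat(response, user_message):
    """Build the message list for the next turn from a chat pipeline response."""
    # The pipeline stores the whole conversation under "generated_text".
    history = list(response[0]["generated_text"])  # copy, so the response is not mutated
    history.append({"role": "user", "content": user_message})
    return history
```

The result could then be passed back in, for example `llama_pipe(continue_chat(response, "Give one example"), max_new_tokens=1000)`.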


## Using LLMs via Hugging Face Inference APIs

Hugging Face has made its [__Inference API__](https://huggingface.co/docs/api-inference/quicktour) free to use, with some basic rate limits in place so that you cannot make unlimited requests to its servers.

The best part is that you can access 150,000+ deep learning models without worrying about your own infrastructure.


### Create LLM API Access Function

Here we create a basic function that can call any LLM API endpoint available on Hugging Face.

For more details refer to the [detailed documentation](https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task) as needed.

In [None]:
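import requests

# Every request authenticates with the access token as a Bearer token
headers = {"Authorization": "Bearer " + hf_key}

def query(payload, MODEL_API_URL):
    """POST a JSON payload to a model endpoint and return the parsed JSON response."""
    response = requests.post(MODEL_API_URL, headers=headers, json=payload)
    print('API Response:', response)  # shows the HTTP status, e.g. <Response [200]>
    return response.json()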

## Create LLM API Access Config

Here we decide which LLMs we will access by getting their inference API endpoints.

We also set some general configuration settings. You can find the [detailed documentation](https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task) here.

Some useful config settings include:

- max_new_tokens: The maximum number of new tokens to generate in the response
- do_sample: Whether or not to use sampling. False means greedy decoding, which behaves like a temperature approaching 0
- temperature: A positive value (typically between 0 and 1) used to modulate the next-token probabilities. Higher temperatures make the results more varied and creative
- return_full_text: If set to False, the response contains only the generated text, without your input prompt
- wait_for_model: If the model is not ready, wait for it instead of receiving a 503 error. This limits the number of requests required to get your inference done
- repetition_penalty: The more a token is used within generation, the more it is penalized so that it is less likely to be picked in successive generation passes
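
The settings above can be assembled into a request body with a small helper. This is a sketch, not the official client: `build_payload` is a hypothetical name, and placing `wait_for_model` under a separate `options` key is an assumption based on how the linked detailed-parameters documentation groups it.

```python
def build_payload(prompt, max_new_tokens=1000, temperature=None, wait_for_model=True):
    """Assemble a text-generation request body for the Inference API."""
    parameters = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,  # return only the generated continuation
    }
    if temperature is None:
        parameters["do_sample"] = False  # greedy decoding
    else:
        parameters["do_sample"] = True
        parameters["temperature"] = temperature
    return {
        "inputs": prompt,
        "parameters": parameters,
        # Assumption: request-level options live next to "parameters", not inside it.
        "options": {"wait_for_model": wait_for_model},
    }
```

Leaving `temperature` unset keeps the output deterministic, which is convenient when comparing models on the same prompt.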

In [None]:
HF_API_URL = "https://api-inference.huggingface.co/models/"
model_name = "meta-llama/Llama-3.2-1B-Instruct"
LLAMA_API_URL = HF_API_URL + model_name
params = {
    "wait_for_model": True,
    "return_full_text": False,
    "max_new_tokens": 1000,
}

In [None]:
prompt =  "Explain what is Generative AI in 2 bullet points"

In [None]:
output = query(payload={
                "inputs": prompt,
                "parameters": params
                },
                MODEL_API_URL=LLAMA_API_URL)

print(output[0]['generated_text'])

API Response: <Response [200]>
:

• **Artificial Intelligence (AI) that creates new content**: Generative AI uses algorithms to create new content, such as images, music, or text, based on patterns and structures learned from existing data.
• **AI that generates new ideas and solutions**: Generative AI can also be used to generate new ideas, solutions, or even entire products, such as new products, services, or even entire industries, by combining existing data and patterns.
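
Indexing `output[0]` directly fails with a confusing error when the request does not succeed. The sketch below assumes that failures come back as a JSON object with an `error` key and that successes are a list of objects with `generated_text`, as in the call above; `parse_generation` is a hypothetical helper.

```python
def parse_generation(result):
    """Return the generated text from an Inference API result, or raise on an error payload."""
    # Assumption: failed requests return a JSON object such as {"error": "..."}.
    if isinstance(result, dict) and "error" in result:
        raise RuntimeError(f"Inference API error: {result['error']}")
    is_valid = (
        isinstance(result, list)
        and len(result) > 0
        and isinstance(result[0], dict)
        and "generated_text" in result[0]
    )
    if not is_valid:
        raise ValueError(f"Unexpected response shape: {result!r}")
    return result[0]["generated_text"]
```

With this in place, `print(parse_generation(output))` reports the server's message instead of raising a `KeyError`.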


## Using LLMs via Hugging Face Inference Client

Hugging Face also provides the newer [__Inference Client__](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client), which is free to use with some basic rate limits in place so that you cannot make unlimited requests to its servers.

As with the Inference API, you can access 150,000+ deep learning models without worrying about your own infrastructure.

In [None]:
from huggingface_hub import InferenceClient

Feel free to refer to the [documentation](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client#huggingface_hub.InferenceClient) at any time as needed for more details on function names, arguments and more.

In [None]:
model_name = "meta-llama/Llama-3.2-1B-Instruct"
client = InferenceClient(model=model_name, api_key=hf_key)

chat = [
    { "role": "user", "content": "Explain what is Generative AI in 2 bullet points" },
]

response = client.chat_completion(chat, max_tokens=1000)
print(response)

ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='stop', index=0, message=ChatCompletionOutputMessage(role='assistant', content="Here are 2 bullet points explaining what Generative AI is:\n\n• **Artificial Intelligence that creates**: Generative AI is a type of artificial intelligence (AI) that creates content, such as text, images, or music, without being explicitly programmed to do so. This technology uses complex algorithms and neural networks to generate new content that is often unpredictable, unique, and of high quality.\n\n• **Generative process triggers creativity**: Generative AI works by generating new content through a complex process of learning, training, and refinement. This process often begins with a set of parameters or prompts, which the AI then uses to generate new content. The goal of this content is to showcase the AI's creative capabilities, often in a way that is visually or narratively interesting.", tool_calls=None), logprobs=None)], cre

In [None]:
print(response.choices[0].message.content)

Here are 2 bullet points explaining what Generative AI is:

• **Artificial Intelligence that creates**: Generative AI is a type of artificial intelligence (AI) that creates content, such as text, images, or music, without being explicitly programmed to do so. This technology uses complex algorithms and neural networks to generate new content that is often unpredictable, unique, and of high quality.

• **Generative process triggers creativity**: Generative AI works by generating new content through a complex process of learning, training, and refinement. This process often begins with a set of parameters or prompts, which the AI then uses to generate new content. The goal of this content is to showcase the AI's creative capabilities, often in a way that is visually or narratively interesting.
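
The call pattern above can be wrapped in a small function that also accepts an optional system prompt. This is a sketch: `ask` is a hypothetical helper, and `EchoClient` is an offline stand-in (not part of `huggingface_hub`) so the example runs without a network call.

```python
from types import SimpleNamespace


def ask(client, user_prompt, system_prompt=None, max_tokens=1000):
    """Send one user prompt (plus an optional system prompt) and return the reply text."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    response = client.chat_completion(messages, max_tokens=max_tokens)
    return response.choices[0].message.content


class EchoClient:
    """Offline stand-in that mimics the response shape; used only to exercise ask()."""

    def chat_completion(self, messages, max_tokens):
        self.last_messages = messages
        message = SimpleNamespace(content=f"echo: {messages[-1]['content']}")
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


demo_client = EchoClient()
demo_reply = ask(demo_client, "Explain Generative AI", system_prompt="Answer briefly.")
print(demo_reply)
```

Passing the real `client` created earlier instead of `demo_client` should work the same way, since `ask` only relies on `chat_completion` and the `choices[0].message.content` path used above.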


## Get Groq API Key

Here you need to get an API key to access models on Groq's platform via APIs:

- Groq API Key: Go [here](https://console.groq.com/keys) and create an API key. You need to set up an account, which is free. Groq has a generous free tier, and paid plans are also available if you are interested.


1. After creating your account, go to [Groq Cloud -> Create API Key](https://console.groq.com/keys) and create a new API key as shown

![](https://i.imgur.com/tgHXlcV.png)

2. __Save__ your key somewhere safe, because it is shown only once, as in the screenshot below. Copy it into a secure local file for later use. If you lose it, you can create a new key at any time.

![](https://i.imgur.com/Q27AgA1.png)

## Load Groq API Credentials


In [None]:
from getpass import getpass

groq_key = getpass("Enter your Groq API Key: ")

Enter your Groq API Key: ··········


## Using Open Source LLMs Directly via Groq API

Use this approach if you want to call the models without wrappers like LangChain. We will show how to use open LLMs like Meta Llama 3.2 Instruct through the Groq API. The free tier should be good enough for most experiments.

## API Pricing

At the time of writing, the best models to use include Mistral, Gemma 2, and Llama 3.1 and 3.2. See the [rate limits for the free API](https://console.groq.com/settings/limits) and the [pricing for the paid API](https://groq.com/pricing/).

![](https://i.imgur.com/JE8lfXV.png)

## Use Groq for Prompting Open Source LLMs

In [None]:
from groq import Groq

groq_client = Groq(api_key=groq_key)

In [None]:
def get_completion_chatgroq(prompt, model="llama-3.2-3b-preview"):
    messages = [{"role": "user", "content": prompt}]
    response = groq_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0, # degree of randomness of the model's output
    )
    return response.choices[0].message.content
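
On the free tier, requests can occasionally fail because of rate limits. Below is a minimal retry sketch with exponential backoff. `complete_with_retry` and `FlakyClient` are hypothetical names, and catching the generic `Exception` is a simplification; in real use you would likely narrow it to the SDK's rate-limit error.

```python
import time
from types import SimpleNamespace


def complete_with_retry(client, prompt, model, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call the chat completions endpoint, retrying with exponential backoff on failure."""
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model=model, messages=messages, temperature=0
            )
            return response.choices[0].message.content
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x ... the base delay


class FlakyClient:
    """Offline stand-in that fails twice, then answers; not part of the Groq SDK."""

    def __init__(self):
        self.calls = 0
        self.chat = SimpleNamespace(completions=SimpleNamespace(create=self._create))

    def _create(self, model, messages, temperature):
        self.calls += 1
        if self.calls < 3:
            raise RuntimeError("simulated rate limit")
        message = SimpleNamespace(content=f"echo: {messages[-1]['content']}")
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


demo_client = FlakyClient()
demo_reply = complete_with_retry(demo_client, "hi", model="demo-model", sleep=lambda _: None)
print(demo_reply, demo_client.calls)
```

With the real client, the call would look like `complete_with_retry(groq_client, prompt, model="llama-3.2-3b-preview")`.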

In [None]:
prompt = 'Explain Generative AI in 2 bullet points'
response = get_completion_chatgroq(prompt=prompt, model="llama-3.2-3b-preview")

print(response)

Here are 2 bullet points explaining Generative AI:

• **Creating New Content**: Generative AI is a type of artificial intelligence that can generate new, original content such as images, videos, music, text, and even entire articles or stories. This is achieved through complex algorithms that learn patterns and relationships within existing data, allowing them to create novel and often surprising outputs.

• **Learning from Data**: Generative AI models learn from large datasets, which enables them to understand the underlying structure and patterns of the data. This learning process allows the models to generate new content that is similar in style, tone, and quality to the training data, making them useful for applications such as image and video generation, music composition, and language translation.


In [None]:
prompt = 'Explain Generative AI in 2 bullet points'
response = get_completion_chatgroq(prompt=prompt, model="llama-3.2-90b-text-preview")

print(response)

Here are 2 bullet points explaining Generative AI:

• **Creating new content**: Generative AI is a type of artificial intelligence that can generate new, original content, such as text, images, music, or videos, based on patterns and structures learned from existing data. This is achieved through complex algorithms and neural networks that allow the AI to create novel outputs that are often indistinguishable from those created by humans.

• **Learning from data**: Generative AI models are trained on large datasets, which enables them to learn the underlying patterns, relationships, and structures of the data. This training process allows the AI to develop a deep understanding of the data, which it can then use to generate new content that is coherent, realistic, and often surprising.
