# Groq API Complete Tutorial

## Introduction

Groq is an AI inference platform powered by **Language Processing Units (LPUs)**. Unlike general-purpose GPUs, the LPU architecture is designed specifically for the sequential computations in language models, which Groq credits for its fast inference speeds.

### Key Features:
- **Ultra-fast inference**: Faster than traditional GPU-based solutions
- **OpenAI-compatible API**: Easy migration from existing projects
- **Advanced tool use**: Function calling and web search capabilities
- **Streaming support**: Real-time response generation
- **Open source model support**: Kimi K2, Llama, Mixtral, and more
- **compound-beta**: Integrated web search and code execution for research and complex tasks
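
To make the OpenAI-compatibility point concrete, here is a minimal sketch that builds (but does not send) an OpenAI-style chat completion request against Groq's OpenAI-compatible base URL, using only the standard library. The helper name `build_chat_request` is an illustrative assumption, not part of any SDK.

```python
import json
import os
import urllib.request

# Groq's OpenAI-compatible base URL; the route mirrors OpenAI's chat completions path.
GROQ_OPENAI_BASE_URL = "https://api.groq.com/openai/v1"


def build_chat_request(prompt: str, model: str = "llama-3.3-70b-versatile") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (hypothetical helper; nothing is sent)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{GROQ_OPENAI_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Same bearer-token scheme OpenAI clients use
            "Authorization": f"Bearer {os.environ.get('GROQ_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request("Hello")
print(request.full_url)
```

Because the payload shape and authentication scheme match, an existing OpenAI-style client can usually be pointed at Groq by changing only the base URL, the API key, and the model name.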

### Why Groq Matters:
Speed transforms user experience in AI applications. Real-time chatbots, instant code generation, and responsive AI assistants become possible with Groq's sub-second response times.

## Setup and Installation

### Installation

In [2]:
# Install required packages
# !pip install groq python-dotenv -q

In [19]:
import os
import groq
from groq import Groq, AsyncGroq
import asyncio
import json
from typing import Dict, Any
from IPython.display import display, Markdown

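# Load variables from a local .env file if one exists (uses python-dotenv from the install step)
from dotenv import load_dotenv
load_dotenv()

# Or set your API key directly (replace with your actual key)
# os.environ["GROQ_API_KEY"] = "your-api-key-here"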

# Check if the API key is loaded and print the result
api_key = os.environ.get("GROQ_API_KEY")
if api_key:
    print("✅ GROQ_API_KEY loaded successfully!")
else:
    print("❌ GROQ_API_KEY not found. Please set it in your environment or .env file.")

# Initialize client
client = Groq(api_key=api_key)

✅ GROQ_API_KEY loaded successfully!


## Basic Chat Completion

Start with simple text generation using a general-purpose model that balances speed and quality.

In [7]:
def basic_chat_completion(prompt: str) -> str:
    """Generate a basic chat completion"""
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # Fast, high-quality model
        messages=[
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=1024
    )
    return response.choices[0].message.content

# Example usage
result = basic_chat_completion("Explain quantum computing in simple terms in 50 words")
display(Markdown(f"**Response:** {result}"))

**Response:** Quantum computing uses tiny particles to process info, allowing for super-fast calculations and solving complex problems beyond classical computers, with potential to revolutionize fields like medicine, finance, and security.

## Streaming Responses

Stream responses for real-time user experience, crucial for interactive applications.

In [12]:
def streaming_chat(prompt: str):
    """Stream chat responses in real-time"""
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "user", "content": prompt}
        ],
        stream=True,
        temperature=0.5
    )
    
    print("AI Response: ", end="")
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()  # New line at end

# Example usage
streaming_chat("Write a haiku about artificial intelligence in 50 words")

AI Response: Metal minds awake
Learning, growing, cold and bright
Future's gentle grasp
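
Printing chunks as they arrive is enough for a console demo, but applications often also need the complete text once the stream ends. The sketch below collects the deltas into one string. `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only imitate the `choices[0].delta.content` shape used above so the example runs without an API call.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(stream: Iterable) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # defensively skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # content can be None or empty, e.g. on the last chunk
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in for one streamed chunk."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


demo_stream = [fake_chunk("Metal "), fake_chunk("minds"), fake_chunk(None)]
full_text = collect_stream(demo_stream)
print(full_text)
```

With a real client, the same function should accept the object returned by `client.chat.completions.create(..., stream=True)`.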


## Asynchronous Operations

Handle multiple requests concurrently for high-throughput applications.

In [15]:
async def async_chat_completion(prompt: str, client: AsyncGroq) -> str:
    """Async chat completion for concurrent processing"""
    response = await client.chat.completions.create(
        model="llama-3.1-8b-instant",  # Fastest model for high concurrency
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    return response.choices[0].message.content

async def process_multiple_prompts():
    """Process multiple prompts concurrently"""
    async_client = AsyncGroq(api_key=os.environ.get("GROQ_API_KEY"))
    
    prompts = [
        "What is machine learning in 50 words?",
        "Explain neural networks in 50 words?",
        "What is deep learning in 50 words?"
    ]
    
    # Process all prompts concurrently
    tasks = [async_chat_completion(prompt, async_client) for prompt in prompts]
    results = await asyncio.gather(*tasks)
    
    for prompt, result in zip(prompts, results):
        display(Markdown(f"**Q:** {prompt}"))
        display(Markdown(f"**A:** {result}\n"))

# Run async example
await process_multiple_prompts()

**Q:** What is machine learning in 50 words?

**A:** Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. It involves training algorithms on large datasets, allowing the system to improve its performance and make predictions or decisions based on patterns and relationships within the data.


**Q:** Explain neural networks in 50 words?

**A:** Neural networks are computational models inspired by the human brain, consisting of interconnected nodes (neurons) that process and transmit information. They learn from data by adjusting connection strengths (weights) to minimize errors, allowing them to recognize patterns, classify objects, and make predictions in complex tasks like image and speech recognition.


**Q:** What is deep learning in 50 words?

**A:** Deep learning is a subset of machine learning that uses neural networks with multiple layers to analyze and interpret complex data. It's inspired by the human brain's structure and function, enabling computers to learn from data and make predictions, classify images, and understand natural language with high accuracy.
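
`asyncio.gather` starts every request at once, which can run into rate limits when the prompt list is long. One common way to cap the number of in-flight requests is an `asyncio.Semaphore`. This is a minimal sketch: `gather_limited` and `fake_worker` are hypothetical names, and the worker simulates an API call so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_limited(
    prompts: List[str],
    worker: Callable[[str], Awaitable[str]],
    max_concurrent: int = 2,
) -> List[str]:
    """Run worker(prompt) for every prompt, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(prompt: str) -> str:
        async with semaphore:  # wait here while max_concurrent calls are in flight
            return await worker(prompt)

    # gather preserves input order in its results
    return await asyncio.gather(*(run_one(p) for p in prompts))


async def fake_worker(prompt: str) -> str:
    """Hypothetical stand-in for an async API call."""
    await asyncio.sleep(0.01)
    return prompt.upper()


results = asyncio.run(gather_limited(["a", "b", "c"], fake_worker))
print(results)
```

Inside a notebook, where an event loop is already running, use `await gather_limited(...)` instead of `asyncio.run(...)`. To call the API, pass a worker such as `lambda p: async_chat_completion(p, async_client)`.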


## Tool Use (Function Calling)

Enable models to call external functions and APIs for dynamic capabilities.

In [16]:
def get_weather(location: str) -> str:
    """Mock weather function - replace with real API"""
    weather_data = {
        "New York": "Sunny, 72°F",
        "London": "Rainy, 60°F",
        "Tokyo": "Cloudy, 68°F"
    }
    return weather_data.get(location, "Weather data not available")

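def calculate(expression: str) -> str:
    """Calculator for basic arithmetic (+, -, *, /, parentheses)"""
    # eval() on model-supplied text is risky: allow only arithmetic characters,
    # reject the power operator, and evaluate without builtins
    allowed = set("0123456789+-*/(). ")
    if "**" in expression or not set(expression) <= allowed:
        return json.dumps({"error": "Invalid expression"})
    try:
        result = eval(expression, {"__builtins__": {}}, {})
        return json.dumps({"result": result})
    except Exception:
        return json.dumps({"error": "Invalid expression"})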

def tool_use_example(user_query: str):
    """Demonstrate tool use with function calling"""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City name"
                        }
                    },
                    "required": ["location"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "calculate",
                "description": "Perform mathematical calculations",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "expression": {
                            "type": "string",
                            "description": "Mathematical expression"
                        }
                    },
                    "required": ["expression"]
                }
            }
        }
    ]
    
    messages = [{"role": "user", "content": user_query}]
    
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages,
        tools=tools,
        tool_choice="auto"
    )
    
    response_message = response.choices[0].message
    
    if response_message.tool_calls:
        messages.append(response_message)
        
        available_functions = {
            "get_weather": get_weather,
            "calculate": calculate
        }
        
        for tool_call in response_message.tool_calls:
            function_name = tool_call.function.name
            function_args = json.loads(tool_call.function.arguments)
            function_response = available_functions[function_name](**function_args)
            
            messages.append({
                "tool_call_id": tool_call.id,
                "role": "tool",
                "name": function_name,
                "content": function_response
            })
        
        final_response = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=messages
        )
        return final_response.choices[0].message.content
    
    return response_message.content

# Example usage
result = tool_use_example("What's the weather in Tokyo and calculate 25 * 8?")
display(Markdown(f"**Response:** {result}\n"))

**Response:** The current weather in Tokyo is cloudy with a temperature of 68°F.

Now, let's calculate 25 * 8:
25 * 8 = 200
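
The `calculate` tool above relies on `eval`, and its argument is text chosen by the model, so it deserves extra care. One alternative is to parse the expression with the standard `ast` module and evaluate only a small whitelist of arithmetic nodes. This is a sketch, not a hardened sandbox, and `safe_calculate` is a hypothetical drop-in name.

```python
import ast
import json
import operator

# Only these operators are evaluated; any other syntax is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node: ast.AST):
    """Recursively evaluate a whitelisted arithmetic AST node."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("Unsupported expression")


def safe_calculate(expression: str) -> str:
    """Evaluate +, -, *, / and parentheses; return the same JSON shape as calculate()."""
    try:
        tree = ast.parse(expression, mode="eval")
        return json.dumps({"result": _evaluate(tree.body)})
    except (SyntaxError, ValueError, ZeroDivisionError, OverflowError):
        return json.dumps({"error": "Invalid expression"})


print(safe_calculate("25 * 8"))                      # {"result": 200}
print(safe_calculate("__import__('os').getcwd()"))   # {"error": "Invalid expression"}
```

Function calls, attribute access, and names never reach evaluation because only numeric constants and the listed operators are accepted.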


## Advanced Features

### JSON Mode for Structured Output

In [17]:
def structured_output_example():
    """Generate structured JSON output"""
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {
                "role": "system",
                "content": "You are a data extraction assistant. Return only valid JSON."
            },
            {
                "role": "user",
                "content": "Extract key information about Python: {\"name\": \"\", \"type\": \"\", \"features\": [], \"year_created\": 0}"
            }
        ],
        response_format={"type": "json_object"},
        temperature=0.1
    )
    
    return json.loads(response.choices[0].message.content)

# Example usage
structured_data = structured_output_example()
print(json.dumps(structured_data, indent=2))

{
  "name": "Python",
  "type": "High-level programming language",
  "features": [
    "Object-oriented",
    "Interpreted",
    "Dynamic typing",
    "Large standard library",
    "Cross-platform"
  ],
  "year_created": 1991
}
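
JSON mode aims for syntactically valid JSON, but it does not by itself guarantee that the fields you asked for are present or correctly typed. A small validation step before using the result can catch that. The sketch below assumes the four fields from the prompt above; `EXPECTED_FIELDS` and `validate_extraction` are hypothetical names.

```python
import json
from typing import Any, Dict

# Field names and types taken from the extraction prompt above
EXPECTED_FIELDS = {"name": str, "type": str, "features": list, "year_created": int}


def validate_extraction(raw: str) -> Dict[str, Any]:
    """Parse a JSON string and check that the expected fields exist with the expected types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            raise ValueError(f"Missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"Field {field!r} should be of type {expected_type.__name__}")
    return data


sample = '{"name": "Python", "type": "High-level programming language", "features": ["Interpreted"], "year_created": 1991}'
validated = validate_extraction(sample)
print(validated["year_created"])  # 1991
```

In `structured_output_example`, this would replace the bare `json.loads(...)` call, so malformed responses fail with a clear `ValueError` instead of surfacing later.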


### Error Handling and Best Practices

In [21]:
import time

def robust_chat_completion(prompt: str, max_retries: int = 3) -> str:
    """Chat completion with error handling and retries"""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
                max_tokens=1024,
                timeout=30
            )
            return response.choices[0].message.content

        except groq.APIConnectionError as e:
            print(f"Connection error (attempt {attempt + 1}): {e}")
            if attempt == max_retries - 1:
                return "Error: Could not connect to Groq API"

        except groq.RateLimitError as e:
            print(f"Rate limit exceeded (attempt {attempt + 1}): {e}")
            if attempt < max_retries - 1:
                # Blocking sleep: asyncio.sleep() would need to be awaited
                # and has no effect in a synchronous function
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
            else:
                return "Error: Rate limit exceeded"

        except groq.APIStatusError as e:
            # Other non-2xx responses are not retried
            print(f"API error (attempt {attempt + 1}): {e.status_code} - {e.response}")
            return f"Error: API returned status {e.status_code}"

        except Exception as e:
            print(f"Unexpected error (attempt {attempt + 1}): {e}")
            if attempt == max_retries - 1:
                return "Error: Unexpected error occurred"

    # Only reached when max_retries is 0 or negative
    return "Error: max_retries must be at least 1"

# Example usage
result = robust_chat_completion("Explain the benefits of Groq's Language Processing Unit LPU architecture in 50 words")
print(result)

Groq's LPU architecture accelerates AI workloads, offering benefits like increased performance, reduced latency, and improved energy efficiency, making it ideal for large-scale language processing and AI applications, such as natural language processing and machine translation.
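
The retry logic above can also be factored into a reusable helper. The sketch below adds random jitter to the exponential delay so that many clients do not retry in lockstep. `retry_with_backoff`, its injectable `sleep` parameter, and `flaky_call` are hypothetical, and the built-in `ConnectionError` stands in for an API exception such as `groq.RateLimitError` so the example runs offline.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on the given exceptions with exponential backoff plus jitter."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt; jitter spreads out simultaneous retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("unreachable")


attempts = {"count": 0}


def flaky_call() -> str:
    """Hypothetical stand-in for an API call: fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


recorded_delays = []  # capture delays instead of actually sleeping
result = retry_with_backoff(flaky_call, retry_on=(ConnectionError,), sleep=recorded_delays.append)
print(result, len(recorded_delays))  # ok 2
```

Passing `sleep` as a parameter keeps the demo fast and makes the helper easy to test. In real use, leave it at the default `time.sleep`.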


## Compound AI Systems

Groq's `compound-beta` system integrates built-in tools, including web search and code execution.

In [22]:
def compound_ai_example(query: str):
    """Use compound-beta for research with built-in tools"""
    response = client.chat.completions.create(
        model="compound-beta",  # AI system with built-in tools
        messages=[
            {
                "role": "system",
                "content": "You are a research assistant with access to web search and code execution."
            },
            {
                "role": "user",
                "content": query
            }
        ],
        max_tokens=2000,
        temperature=0.3
    )
    
    return response.choices[0].message.content

# Example usage - this will use web search automatically
research_result = compound_ai_example("What are the latest developments in AI inference acceleration in 50 words?")
print(research_result)

Latest AI inference acceleration developments include WEKA's low-latency solutions, Intel's speculative decoding, and NVIDIA's serverless inference, aiming to improve performance, efficiency, and scalability in AI workloads.


In [23]:
# Example 2: Code Execution Capability
code_exec_query = "Write and execute Python code to calculate the sum of the first 10 prime numbers. Show the code and the result."
code_exec_result = compound_ai_example(code_exec_query)
display(Markdown(f"**Code Execution Result:**\n{code_exec_result}"))

**Code Execution Result:**
## Step 1: Define a function to check if a number is prime
To solve this problem, we first need a function that checks if a number is prime. A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself.

## Step 2: Initialize variables to track prime numbers and their sum
We will initialize an empty list to store the prime numbers and a variable to keep track of the sum.

## Step 3: Iterate through numbers to find prime numbers
We will start checking numbers from 2 (the first prime number) and continue until we have found 10 prime numbers.

## Step 4: Calculate the sum of the prime numbers
Once we have the list of the first 10 prime numbers, we will calculate their sum.

## Step 5: Execute the code
Let's execute the following Python code to find the sum of the first 10 prime numbers:


```python
def is_prime(n):
    if n <= 1:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    max_divisor = int(n**0.5) + 1
    for d in range(3, max_divisor, 2):
        if n % d == 0:
            return False
    return True

prime_numbers = []
num = 2
while len(prime_numbers) < 10:
    if is_prime(num):
        prime_numbers.append(num)
    num += 1

prime_sum = sum(prime_numbers)
print("The first 10 prime numbers are:", prime_numbers)
print("The sum of the first 10 prime numbers is:", prime_sum)
```

When you run this code, it will output the first 10 prime numbers and their sum. The first 10 prime numbers are: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
The sum of the first 10 prime numbers is: 129 

The final answer is: $\boxed{129}$

## Performance Tips and Best Practices

### Model Selection Guide:
- **Moonshot Kimi K2**: Advanced open-source model, excels at multilingual and reasoning tasks
- **llama-3.1-8b-instant**: Fastest, best for simple tasks
- **llama-3.3-70b-versatile**: Best balance of speed and quality
- **compound-beta**: For research and complex reasoning
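
The guide above can be encoded as a small lookup so that model choice lives in one place. This is only a sketch: the task labels (`"simple"`, `"general"`, `"research"`) and the `pick_model` helper are assumptions made for illustration, and Kimi K2 is left out because this tutorial does not give its model ID.

```python
# Task labels are hypothetical; model IDs come from the guide above.
MODEL_BY_TASK = {
    "simple": "llama-3.1-8b-instant",       # fastest, best for simple tasks
    "general": "llama-3.3-70b-versatile",   # balance of speed and quality
    "research": "compound-beta",            # built-in web search and code execution
}


def pick_model(task: str) -> str:
    """Return the model ID for a task label, falling back to the general-purpose model."""
    return MODEL_BY_TASK.get(task, MODEL_BY_TASK["general"])


print(pick_model("simple"))
print(pick_model("unknown-task"))
```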

## Conclusion

Groq's LPU architecture targets fast inference without sacrificing output quality. The combination of fast inference, tool support, and OpenAI compatibility makes Groq a strong fit for building responsive, intelligent applications.