# Groq API Complete Tutorial

## Introduction

Groq is an AI inference platform built on **Language Processing Units (LPUs)**, custom chips designed for the sequential computation that dominates language-model inference. Groq reports generation speeds of **up to 750 tokens/second**, which is considerably faster than typical GPU-based serving.

### Key Features:
- **Ultra-fast inference**: claimed to be up to 10x faster than GPU-based solutions, depending on model and workload
- **OpenAI-compatible API**: Easy migration from existing projects
- **Advanced tool use**: Function calling and web search capabilities
- **Streaming support**: Real-time response generation
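- **Multiple model support**: Llama 3.3, Mixtral, and more (the hosted model list changes over time, so check the Groq console for what is currently available)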
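
To make the OpenAI-compatible API point concrete, here is a minimal sketch that builds (but does not send) an OpenAI-style chat completion request using only the standard library. It assumes the compatible endpoint is served under `https://api.groq.com/openai/v1`; the helper name `build_chat_request` is hypothetical and exists only for illustration.

```python
import json
import os
import urllib.request

# Assumption: Groq's OpenAI-compatible endpoint is served under this base URL.
GROQ_BASE_URL = "https://api.groq.com/openai/v1"


def build_chat_request(prompt: str, model: str = "llama-3.3-70b-versatile") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (hypothetical helper; nothing is sent)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{GROQ_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # The key is read from the environment; an empty string is used if it is unset.
            "Authorization": f"Bearer {os.environ.get('GROQ_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request("Hello")
print(request.get_method(), request.full_url)
```

Because the request shape matches the OpenAI format, existing client code can usually be pointed at this base URL instead of being rewritten. Check the Groq documentation for the details that apply to your client library.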

### Why Groq Matters:
Speed transforms user experience in AI applications. Real-time chatbots, instant code generation, and responsive AI assistants become possible with Groq's sub-second response times.

## Setup and Installation

### Installation

In [None]:
# Install required packages
!pip install groq python-dotenv -q

In [None]:
import os
from groq import Groq, AsyncGroq
import asyncio
import json
from typing import Dict, Any
from dotenv import load_dotenv

# Load GROQ_API_KEY from a local .env file if one exists
load_dotenv()

# Or set the key directly (avoid committing real keys to source control)
# os.environ["GROQ_API_KEY"] = "your-api-key-here"

# Initialize client
client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

## Basic Chat Completion

Start with simple text generation using a general-purpose model.

In [None]:
def basic_chat_completion(prompt: str) -> str:
    """Generate a basic chat completion"""
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # Fast, high-quality model
        messages=[
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=1024
    )
    return response.choices[0].message.content

# Example usage
result = basic_chat_completion("Explain quantum computing in simple terms")
print(result)
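
The chat completions API is stateless between calls, so a multi-turn conversation means resending earlier turns in `messages`. The sketch below shows one way to grow and trim that list. The helper `append_turn` and its `max_messages` cutoff are hypothetical, and trimming by message count is only a rough stand-in for trimming by tokens.

```python
from typing import Dict, List

Message = Dict[str, str]


def append_turn(history: List[Message], role: str, content: str, max_messages: int = 10) -> List[Message]:
    """Return a new history with the turn added, keeping the system prompt and the newest messages."""
    updated = history + [{"role": role, "content": content}]
    system = [m for m in updated if m["role"] == "system"][:1]
    rest = [m for m in updated if m["role"] != "system"]
    keep = max(max_messages - len(system), 0)
    # rest[-0:] would return the whole list, so handle keep == 0 explicitly
    return system + (rest[-keep:] if keep else [])


conversation = [{"role": "system", "content": "You are a helpful AI assistant."}]
conversation = append_turn(conversation, "user", "Explain quantum computing in simple terms")
conversation = append_turn(conversation, "assistant", "(model reply goes here)")
conversation = append_turn(conversation, "user", "Now give a one-sentence summary")
print(len(conversation), conversation[-1]["content"])
```

The resulting list can be passed as `messages` in place of the two-element list used in `basic_chat_completion`.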

## Streaming Responses

Stream responses chunk by chunk so users see output as it is generated, which matters for interactive applications.

In [None]:
def streaming_chat(prompt: str):
    """Stream chat responses in real-time"""
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "user", "content": prompt}
        ],
        stream=True,
        temperature=0.5
    )
    
    print("AI Response: ", end="")
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()  # New line at end

# Example usage
streaming_chat("Write a haiku about artificial intelligence")
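
Applications often need the full text after streaming finishes, for example to store it in a conversation history. The sketch below accumulates the deltas into one string. It runs against a hand-built fake stream that mimics the `chunk.choices[0].delta.content` shape used above, so `collect_stream` and `fake_chunk` are hypothetical helpers, not part of the Groq SDK.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def collect_stream(stream: Iterable[Any]) -> str:
    """Join the text deltas of a chat completion stream into a single string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        delta = chunk.choices[0].delta.content
        if delta:  # the content may be None or empty on some chunks
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build an object shaped like a streamed chunk (for offline testing only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [fake_chunk("Hello"), fake_chunk(", "), fake_chunk(None), fake_chunk("world")]
full_text = collect_stream(fake_stream)
print(full_text)
```

With a real client, the same function should accept the object returned by `client.chat.completions.create(..., stream=True)`, assuming the chunk shape matches the one used in `streaming_chat`.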

## Asynchronous Operations

Handle multiple requests concurrently for high-throughput applications.

In [None]:
async def async_chat_completion(prompt: str, client: AsyncGroq) -> str:
    """Async chat completion for concurrent processing"""
    response = await client.chat.completions.create(
        model="llama-3.1-8b-instant",  # Fastest model for high concurrency
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    return response.choices[0].message.content

async def process_multiple_prompts():
    """Process multiple prompts concurrently"""
    async_client = AsyncGroq(api_key=os.environ.get("GROQ_API_KEY"))
    
    prompts = [
        "What is machine learning?",
        "Explain neural networks",
        "What is deep learning?"
    ]
    
    # Process all prompts concurrently
    tasks = [async_chat_completion(prompt, async_client) for prompt in prompts]
    results = await asyncio.gather(*tasks)
    
    for prompt, result in zip(prompts, results):
        print(f"Q: {prompt}")
        print(f"A: {result[:100]}...\n")

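# Run async example (top-level await works in notebooks;
# in a plain script use asyncio.run(process_multiple_prompts()) instead)
await process_multiple_prompts()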
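
Firing every request at once with `asyncio.gather` can run into rate limits when the prompt list is long. One common mitigation, sketched below, is to cap the number of in-flight requests with `asyncio.Semaphore`. The names `gather_limited` and `fake_worker` are hypothetical, and `fake_worker` stands in for a real call such as `async_chat_completion`.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_limited(
    worker: Callable[[str], Awaitable[str]],
    prompts: List[str],
    limit: int = 2,
) -> List[str]:
    """Run worker(prompt) for every prompt with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(prompt: str) -> str:
        async with semaphore:  # wait here while `limit` calls are already running
            return await worker(prompt)

    # gather preserves the order of the inputs in its results
    return await asyncio.gather(*(run_one(p) for p in prompts))


async def fake_worker(prompt: str) -> str:
    """Stand-in for a real API call (for offline testing only)."""
    await asyncio.sleep(0.01)
    return prompt.upper()


# In a notebook:  results = await gather_limited(fake_worker, ["a", "b", "c"], limit=2)
# In a script:    results = asyncio.run(gather_limited(fake_worker, ["a", "b", "c"], limit=2))
```

A suitable `limit` depends on the rate limits of your account and model, so treat the default of 2 as a placeholder.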

## Tool Use (Function Calling)

Enable models to call external functions and APIs for dynamic capabilities.

In [None]:
def get_weather(location: str) -> str:
    """Mock weather function - replace with real API"""
    weather_data = {
        "New York": "Sunny, 72°F",
        "London": "Rainy, 60°F",
        "Tokyo": "Cloudy, 68°F"
    }
    return weather_data.get(location, "Weather data not available")

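def calculate(expression: str) -> str:
    """Calculator tool for basic arithmetic (digits, + - * / %, and parentheses only)"""
    # eval() on model-generated text is dangerous, so reject anything that is not plain arithmetic
    allowed_chars = set("0123456789+-*/%.() ")
    if not expression or not set(expression) <= allowed_chars or "**" in expression:
        return json.dumps({"error": "Invalid expression"})
    try:
        result = eval(expression, {"__builtins__": {}}, {})
        return json.dumps({"result": result})
    except Exception:
        return json.dumps({"error": "Invalid expression"})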

def tool_use_example(user_query: str):
    """Demonstrate tool use with function calling"""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City name"
                        }
                    },
                    "required": ["location"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "calculate",
                "description": "Perform mathematical calculations",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "expression": {
                            "type": "string",
                            "description": "Mathematical expression"
                        }
                    },
                    "required": ["expression"]
                }
            }
        }
    ]
    
    messages = [{"role": "user", "content": user_query}]
    
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages,
        tools=tools,
        tool_choice="auto"
    )
    
    response_message = response.choices[0].message
    
    if response_message.tool_calls:
        messages.append(response_message)
        
        available_functions = {
            "get_weather": get_weather,
            "calculate": calculate
        }
        
        for tool_call in response_message.tool_calls:
            function_name = tool_call.function.name
            function_args = json.loads(tool_call.function.arguments)
            function_response = available_functions[function_name](**function_args)
            
            messages.append({
                "tool_call_id": tool_call.id,
                "role": "tool",
                "name": function_name,
                "content": function_response
            })
        
        final_response = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=messages
        )
        return final_response.choices[0].message.content
    
    return response_message.content

# Example usage
result = tool_use_example("What's the weather in Tokyo and calculate 25 * 8?")
print(result)
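
The dispatch step in `tool_use_example` assumes the model always names a known function and sends well-formed arguments. Models occasionally get either wrong, so a more defensive dispatcher can return an error string for the model to read instead of raising. The following is a minimal sketch of that idea; `dispatch_tool_call` and `echo_city` are hypothetical names used only for illustration.

```python
import json
from typing import Callable, Dict


def dispatch_tool_call(name: str, raw_arguments: str, registry: Dict[str, Callable[..., str]]) -> str:
    """Run one tool call and always return a string the model can read."""
    if name not in registry:
        return json.dumps({"error": f"Unknown tool: {name}"})
    try:
        arguments = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError:
        return json.dumps({"error": "Arguments were not valid JSON"})
    if not isinstance(arguments, dict):
        return json.dumps({"error": "Arguments must be a JSON object"})
    try:
        return str(registry[name](**arguments))
    except TypeError as exc:
        # Raised for missing or unexpected keyword arguments
        # (this also catches a TypeError raised inside the tool itself)
        return json.dumps({"error": f"Bad arguments for {name}: {exc}"})


def echo_city(location: str) -> str:
    """Toy tool used only to exercise the dispatcher."""
    return f"Weather requested for {location}"


registry = {"echo_city": echo_city}
print(dispatch_tool_call("echo_city", '{"location": "Tokyo"}', registry))
print(dispatch_tool_call("missing_tool", "{}", registry))
```

In the loop over `response_message.tool_calls`, a call like `dispatch_tool_call(tool_call.function.name, tool_call.function.arguments, available_functions)` could replace the direct dictionary lookup.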

## Advanced Features

### JSON Mode for Structured Output

In [None]:
def structured_output_example():
    """Generate structured JSON output"""
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {
                "role": "system",
                "content": "You are a data extraction assistant. Return only valid JSON."
            },
            {
                "role": "user",
                "content": "Extract key information about Python: {\"name\": \"\", \"type\": \"\", \"features\": [], \"year_created\": 0}"
            }
        ],
        response_format={"type": "json_object"},
        temperature=0.1
    )
    
    return json.loads(response.choices[0].message.content)

# Example usage
structured_data = structured_output_example()
print(json.dumps(structured_data, indent=2))
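
JSON mode is generally understood to constrain the output to valid JSON, not to a particular set of fields, so it is worth checking the parsed result before using it. The sketch below validates a response against the template from the prompt above. `EXPECTED_FIELDS` and `validate_payload` are hypothetical names, and `sample` is a hand-written stand-in for a model response.

```python
import json
from typing import Any, Dict

# Field names and types taken from the template in the prompt above
EXPECTED_FIELDS = {"name": str, "type": str, "features": list, "year_created": int}


def validate_payload(raw: str) -> Dict[str, Any]:
    """Parse a JSON string and check it has the expected fields and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            raise ValueError(f"Missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"Field {field!r} should be {expected_type.__name__}")
    return data


# Hand-written sample standing in for a model response
sample = '{"name": "Python", "type": "programming language", "features": ["dynamic typing"], "year_created": 1991}'
print(validate_payload(sample)["name"])
```

For larger schemas, a validation library would be a more maintainable choice than hand-written checks.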

### Error Handling and Best Practices

In [None]:
import time

import groq

def robust_chat_completion(prompt: str, max_retries: int = 3) -> str:
    """Chat completion with error handling and retries"""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
                max_tokens=1024,
                timeout=30
            )
            return response.choices[0].message.content
            
        except groq.APIConnectionError as e:
            print(f"Connection error (attempt {attempt + 1}): {e}")
            if attempt == max_retries - 1:
                return "Error: Could not connect to Groq API"
                
        except groq.RateLimitError as e:
            print(f"Rate limit exceeded (attempt {attempt + 1}): {e}")
            if attempt < max_retries - 1:
                # time.sleep blocks as intended; asyncio.sleep would need to be awaited
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
            else:
                return "Error: Rate limit exceeded"
                
        except groq.APIStatusError as e:
            print(f"API error (attempt {attempt + 1}): {e.status_code} - {e.response}")
            return f"Error: API returned status {e.status_code}"
            
        except Exception as e:
            print(f"Unexpected error (attempt {attempt + 1}): {e}")
            if attempt == max_retries - 1:
                return "Error: Unexpected error occurred"

# Example usage
result = robust_chat_completion("Explain the benefits of Groq's LPU architecture")
print(result)
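
The retry loop above hard-codes its backoff schedule. A reusable variant, sketched here under stated assumptions, wraps any callable and adds random jitter so that many clients do not retry in lockstep. `retry_with_backoff` is a hypothetical helper. The `sleep` parameter is injectable so the example can run without waiting, and `ConnectionError` stands in for the Groq exception classes.

```python
import random
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[Exception], ...],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on the given exceptions with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(base_delay * 2 ** attempt, max_delay)
            sleep(delay + random.uniform(0, delay / 2))  # add up to 50% jitter
    raise ValueError("max_retries must be at least 1")


# Demo with a function that fails twice before succeeding
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


recorded: List[float] = []  # delays are recorded instead of slept
print(retry_with_backoff(flaky, (ConnectionError,), sleep=recorded.append))
```

With the Groq client, `retry_on` could be something like `(groq.RateLimitError, groq.APIConnectionError)`, and `func` a small lambda around `client.chat.completions.create(...)`.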

## Compound AI Systems

Groq's `compound-beta` system pairs a language model with built-in tools such as web search and code execution, so the request needs no tool definitions of its own.

In [None]:
def compound_ai_example(query: str):
    """Use compound-beta for research with built-in tools"""
    response = client.chat.completions.create(
        model="compound-beta",  # AI system with built-in tools
        messages=[
            {
                "role": "system",
                "content": "You are a research assistant with access to web search and code execution."
            },
            {
                "role": "user",
                "content": query
            }
        ],
        max_tokens=2000,
        temperature=0.3
    )
    
    return response.choices[0].message.content

# Example usage - the system decides on its own whether to call web search for a query like this
research_result = compound_ai_example("What are the latest developments in AI inference acceleration?")
print(research_result)

## Performance Tips and Best Practices

### Model Selection Guide:
- **llama-3.1-8b-instant**: Fastest, best for simple tasks
- **llama-3.3-70b-versatile**: Best balance of speed and quality
- **compound-beta**: For research and complex reasoning

### Optimization Strategies:
1. **Use streaming** for better user experience
2. **Implement async** for concurrent requests
3. **Choose appropriate models** based on task complexity
4. **Handle errors gracefully** with retry logic
5. **Leverage tool use** for dynamic capabilities

### Rate Limits:
- Free tier: lower limits, suitable for experimentation
- Paid tiers: Higher throughput for production use
- Monitor usage via the Groq console
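
When a request is rejected for exceeding a rate limit, HTTP APIs commonly include a `retry-after` header that says how long to wait. The sketch below reads that header from a plain mapping. It assumes the value is given in seconds, which you should confirm against the Groq rate-limit documentation. `seconds_to_wait` is a hypothetical helper.

```python
from typing import Mapping


def seconds_to_wait(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Return the wait time suggested by a retry-after header, or a default."""
    # Header names are case-insensitive, so compare in lowercase
    value = next((v for k, v in headers.items() if k.lower() == "retry-after"), None)
    if value is None:
        return default
    try:
        return max(float(value), 0.0)
    except ValueError:
        # Assumption: non-numeric values (such as HTTP dates) fall back to the default
        return default


print(seconds_to_wait({"Retry-After": "2"}))
print(seconds_to_wait({}))
```

A value obtained this way could replace the fixed `2 ** attempt` delay in a retry loop when the header is present.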

## Conclusion

Groq's LPU architecture delivers very fast inference without changing how you call the model. Fast generation, built-in tool support, and OpenAI compatibility make Groq a good fit for responsive, interactive applications.

Start building with Groq today at [console.groq.com](https://console.groq.com)!