# Autonomous Agent with LangGraph for Data Analysis

> **Created by [Build Fast with AI](https://www.buildfastwithai.com)**

This notebook builds an autonomous data analysis agent with LangGraph and Gemini 3 Pro. The agent plans an analysis, calls pandas-backed tools to gather results, and writes a final report.

## What you'll learn:
- Building autonomous agents with LangGraph
- Creating agents that analyze data
- Implementing multi-step reasoning
- Using tools for data manipulation
- Agent state management
- Visualization generation

## 1. Installation and Setup

In [None]:
!pip install -q langgraph langchain langchain-google-genai pandas matplotlib seaborn numpy

In [None]:
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import TypedDict, Annotated, List, Dict, Any
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.tools import tool
from langgraph.graph import StateGraph, END
from IPython.display import Markdown, display

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

In [None]:
# Configure API key
try:
    from google.colab import userdata
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
except Exception:  # Not running in Colab, or the secret is unavailable; fall back to the environment
    GOOGLE_API_KEY = os.environ.get('GOOGLE_API_KEY', 'your-api-key-here')

os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY

## 2. Create Sample Dataset

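Let's create a synthetic sales dataset for our agent to analyze: one row per day, product, and region for 2024 (365 days × 4 products × 4 regions = 5,840 rows), with random sales counts and revenue.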

In [None]:
# Create sample sales data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=365, freq='D')
products = ['Product A', 'Product B', 'Product C', 'Product D']
regions = ['North', 'South', 'East', 'West']

data = []
for date in dates:
    for product in products:
        for region in regions:
            sales = np.random.randint(50, 500)
            revenue = sales * np.random.uniform(10, 100)
            data.append({
                'date': date,
                'product': product,
                'region': region,
                'sales': sales,
                'revenue': revenue
            })

df = pd.DataFrame(data)
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter

print(f"Dataset created: {len(df)} rows")
df.head(10)

## 3. Define Data Analysis Tools

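Create tools that the agent can use to analyze data. Each tool reads the global `df`, takes column names as string arguments, and returns a plain-text result that the LLM can read.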

In [None]:
@tool
def get_basic_stats(column: str) -> str:
    """Get basic statistics for a column.
    
    Args:
        column: Name of the column to analyze (sales, revenue, etc.)
    
    Returns:
        String with statistical summary
    """
    if column not in df.columns:
        return f"Error: Column '{column}' not found. Available columns: {', '.join(df.columns)}"
    
    # is_numeric_dtype covers every integer/float width, not only int64 and float64
    stats = df[column].describe() if pd.api.types.is_numeric_dtype(df[column]) else df[column].value_counts()
    return f"Statistics for {column}:\n{stats.to_string()}"

@tool
def group_by_analysis(group_column: str, agg_column: str, agg_func: str = 'sum') -> str:
    """Perform group by analysis.
    
    Args:
        group_column: Column to group by (product, region, month, quarter)
        agg_column: Column to aggregate (sales, revenue)
        agg_func: Aggregation function (sum, mean, max, min)
    
    Returns:
        String with grouped results
    """
    try:
        result = df.groupby(group_column)[agg_column].agg(agg_func).sort_values(ascending=False)
        return f"{agg_func.upper()} of {agg_column} by {group_column}:\n{result.to_string()}"
    except Exception as e:
        return f"Error: {str(e)}"

@tool
def get_top_performers(metric: str, top_n: int = 5, by: str = 'product') -> str:
    """Get top performers.
    
    Args:
        metric: Metric to rank by (sales, revenue)
        top_n: Number of top performers to return
        by: Group by column (product, region)
    
    Returns:
        String with top performers
    """
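    # Return a readable error instead of raising KeyError on an unknown column
    if metric not in df.columns or by not in df.columns:
        return f"Error: Unknown column. Available columns: {', '.join(df.columns)}"
    
    result = df.groupby(by)[metric].sum().sort_values(ascending=False).head(top_n)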
    return f"Top {top_n} {by}s by {metric}:\n{result.to_string()}"

@tool
def calculate_growth(column: str, period: str = 'month') -> str:
    """Calculate growth over time.
    
    Args:
        column: Column to analyze (sales, revenue)
        period: Time period (month, quarter)
    
    Returns:
        String with growth analysis
    """
    time_series = df.groupby(period)[column].sum()
    growth = time_series.pct_change() * 100
    
    result = pd.DataFrame({
        'total': time_series,
        'growth_%': growth
    })
    
    return f"Growth analysis for {column} by {period}:\n{result.to_string()}"

@tool
def find_trends(column: str, group_by: str = 'month') -> str:
    """Find trends in data.
    
    Args:
        column: Column to analyze
        group_by: Time period to group by
    
    Returns:
        String with trend analysis
    """
    time_series = df.groupby(group_by)[column].mean()
    trend = "increasing" if time_series.iloc[-1] > time_series.iloc[0] else "decreasing"
    change_pct = ((time_series.iloc[-1] - time_series.iloc[0]) / time_series.iloc[0]) * 100
    
    return f"Trend for {column}: {trend} ({change_pct:.2f}% change)\nDetails:\n{time_series.to_string()}"

@tool
def correlation_analysis(col1: str, col2: str) -> str:
    """Calculate correlation between two columns.
    
    Args:
        col1: First column
        col2: Second column
    
    Returns:
        String with correlation coefficient
    """
    if col1 not in df.columns or col2 not in df.columns:
        return "Error: One or both columns not found"
    
    corr = df[col1].corr(df[col2])
    return f"Correlation between {col1} and {col2}: {corr:.4f}"

# List all tools
tools = [
    get_basic_stats,
    group_by_analysis,
    get_top_performers,
    calculate_growth,
    find_trends,
    correlation_analysis
]

print(f"Created {len(tools)} data analysis tools")
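
The growth figures returned by `calculate_growth` are period-over-period percent changes, so the first period has nothing to compare against and appears as `NaN` in the pandas output. The sketch below reproduces that arithmetic in plain Python on made-up totals. The `percent_change` helper is hypothetical and is not part of the notebook's tools.

```python
from typing import List, Optional

def percent_change(totals: List[float]) -> List[Optional[float]]:
    """Period-over-period growth in percent; None where no prior period exists."""
    if not totals:
        return []
    growth: List[Optional[float]] = [None]  # first period has no predecessor
    for prev, cur in zip(totals, totals[1:]):
        growth.append((cur - prev) * 100 / prev)
    return growth

# Made-up totals for three consecutive periods (illustration only)
monthly_totals = [100.0, 120.0, 90.0]
growth = percent_change(monthly_totals)
print(growth)
```

Note that each value is relative to the immediately preceding period, not to the first one, which is why a drop from 120 to 90 reads as a 25% decline.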

## 4. Define Agent State

In [None]:
class AnalysisState(TypedDict):
    """State for the data analysis agent."""
    query: str                          # User's analysis request
    messages: List[Dict[str, str]]      # Conversation messages
    analysis_steps: List[str]           # Steps taken by agent
    findings: List[str]                 # Key findings
    next_action: str                    # Next action to take
    final_report: str                   # Final analysis report
    iterations: int                     # Number of iterations

print("Analysis state defined")
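
LangGraph also lets a state field carry a reducer through `Annotated`, for example `Annotated[List[str], operator.add]`, so that a node can return only the new items and have them appended to the existing list. The sketch below illustrates the idea with the standard library only. `ReducedState` and `apply_update` are hypothetical stand-ins written for this example; they are not LangGraph code and not the state class used in this notebook.

```python
import operator
from typing import Annotated, Any, Dict, List, TypedDict, get_type_hints

class ReducedState(TypedDict):
    query: str
    findings: Annotated[List[str], operator.add]  # reducer: concatenate lists
    iterations: int                               # no reducer: last write wins

def apply_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Hand-rolled merge that mimics reducer-aware state updates."""
    hints = get_type_hints(ReducedState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        reducer = metadata[0] if metadata else None
        merged[key] = reducer(state[key], value) if reducer else value
    return merged

state = {"query": "demo", "findings": ["a"], "iterations": 1}
# A node returns only what changed instead of mutating the whole state
new_state = apply_update(state, {"findings": ["b"], "iterations": 2})
print(new_state)
```

With this pattern, nodes stay small and side-effect free, which makes them easier to test in isolation.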

## 5. Initialize LLM with Tools

In [None]:
# Initialize Gemini with tools
llm = ChatGoogleGenerativeAI(
    model="gemini-3-pro",  # Verify the exact model ID available to your API key; it may differ
    temperature=0.1,  # Low temperature for consistent analysis
    google_api_key=GOOGLE_API_KEY
)

llm_with_tools = llm.bind_tools(tools)

print("LLM initialized with data analysis tools")

## 6. Define Agent Nodes

In [None]:
MAX_ITERATIONS = 10

def planner_node(state: AnalysisState) -> AnalysisState:
    """Plan the analysis steps."""
    query = state['query']
    
    planning_prompt = f"""
    You are a data analyst. Create an analysis plan for this request:
    
    Request: {query}
    
    Available data columns: date, product, region, sales, revenue, month, quarter
    
    Create a step-by-step plan using the available tools.
    """
    
    response = llm.invoke(planning_prompt)
    
    state['messages'].append({
        "role": "planner",
        "content": response.content
    })
    state['analysis_steps'].append(f"Plan created: {response.content[:100]}...")
    state['next_action'] = 'analyze'
    
    return state

def analyzer_node(state: AnalysisState) -> AnalysisState:
    """Perform analysis using tools."""
    # Check iteration limit
    state['iterations'] = state.get('iterations', 0) + 1
    
    if state['iterations'] > MAX_ITERATIONS:
        state['next_action'] = 'report'
        return state
    
    # Build context from previous messages
    context = "\n".join([
        f"{msg['role']}: {msg['content']}"
        for msg in state['messages'][-5:]  # Last 5 messages
    ])
    
    analysis_prompt = f"""
    Context: {context}
    
    Continue the analysis. Use tools to gather data and insights.
    When you have enough information, respond with "ANALYSIS COMPLETE" and summarize findings.
    """
    
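    response = llm_with_tools.invoke(analysis_prompt)
    
    # Gemini can return content as a list of blocks; normalize it to a string
    content = response.content if isinstance(response.content, str) else str(response.content)
    tool_calls = getattr(response, 'tool_calls', None) or []
    
    state['messages'].append({
        "role": "analyzer",
        "content": content,
        # Serialized so tool_executor_node can run exactly what the model requested
        "tool_calls": json.dumps(tool_calls, default=str)
    })
    
    # Check if analysis is complete
    if "ANALYSIS COMPLETE" in content.upper():
        state['next_action'] = 'report'
    elif tool_calls:
        state['next_action'] = 'tools'
    else:
        state['next_action'] = 'analyze'
    
    return state

def tool_executor_node(state: AnalysisState) -> AnalysisState:
    """Execute the tool calls requested by the analyzer."""
    last_message = state['messages'][-1]
    tool_calls = json.loads(last_message.get('tool_calls', '[]'))
    tools_by_name = {t.name: t for t in tools}
    
    # Run each requested tool with the arguments the model supplied
    tool_results = []
    for call in tool_calls:
        selected_tool = tools_by_name.get(call['name'])
        if selected_tool is None:
            tool_results.append(f"Error: Unknown tool '{call['name']}'")
            continue
        try:
            tool_results.append(str(selected_tool.invoke(call['args'])))
        except Exception as e:
            tool_results.append(f"Error running {call['name']}: {e}")
    
    # Add results to state
    results_text = "\n\n".join(tool_results)
    state['messages'].append({
        "role": "system",
        "content": f"Tool Results:\n{results_text}"
    })
    state['findings'].append(results_text)
    state['analysis_steps'].append(f"Executed tools, got {len(tool_results)} results")
    state['next_action'] = 'analyze'
    
    return state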

def reporter_node(state: AnalysisState) -> AnalysisState:
    """Generate final report."""
    findings = "\n".join(state['findings'])
    
    report_prompt = f"""
    Create a comprehensive analysis report based on these findings:
    
    {findings}
    
    Original request: {state['query']}
    
    Format the report with:
    1. Executive Summary
    2. Key Findings
    3. Detailed Analysis
    4. Recommendations
    """
    
    response = llm.invoke(report_prompt)
    
    state['final_report'] = response.content
    state['next_action'] = 'end'
    
    return state

def route_next(state: AnalysisState) -> str:
    """Determine next node."""
    action = state.get('next_action', 'end')
    
    if action == 'end':
        return END
    return action

print("Agent nodes defined")
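
The iteration cap in `analyzer_node` is what guarantees termination when the model never signals completion. The sketch below simulates that control flow with stub nodes and no LLM, under the assumption that the analyzer always asks to analyze again. `stub_analyzer`, `stub_reporter`, and `run_loop` are hypothetical names used only for this illustration.

```python
from typing import Any, Dict, List, Tuple

MAX_ITERATIONS = 10

def stub_analyzer(state: Dict[str, Any]) -> Dict[str, Any]:
    """Mimics analyzer_node's cap, but never finishes on its own."""
    state["iterations"] += 1
    if state["iterations"] > MAX_ITERATIONS:
        state["next_action"] = "report"   # forced exit once the cap is exceeded
    else:
        state["next_action"] = "analyze"  # worst case: keeps looping
    return state

def stub_reporter(state: Dict[str, Any]) -> Dict[str, Any]:
    state["next_action"] = "end"
    return state

def run_loop(state: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Follow next_action from node to node until it reads 'end'."""
    nodes = {"analyze": stub_analyzer, "report": stub_reporter}
    visited: List[str] = []
    while state["next_action"] != "end":
        name = state["next_action"]
        visited.append(name)
        state = nodes[name](state)
    return state, visited

final_state, visited = run_loop({"iterations": 0, "next_action": "analyze"})
print(len(visited), visited[-1])
```

In this worst case the analyzer is entered `MAX_ITERATIONS + 1` times: ten full passes, plus one final entry that only trips the cap and hands off to the reporter.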

## 7. Build the Analysis Agent Graph

In [None]:
# Create workflow
workflow = StateGraph(AnalysisState)

# Add nodes
workflow.add_node("plan", planner_node)
workflow.add_node("analyze", analyzer_node)
workflow.add_node("tools", tool_executor_node)
workflow.add_node("report", reporter_node)

# Set entry point
workflow.set_entry_point("plan")

# Add edges
workflow.add_edge("plan", "analyze")
workflow.add_edge("tools", "analyze")
workflow.add_edge("report", END)

# Add conditional routing from analyze
workflow.add_conditional_edges(
    "analyze",
    route_next,
    {
        "tools": "tools",
        "analyze": "analyze",
        "report": "report",
        END: END
    }
)

# Compile
analysis_agent = workflow.compile()

print("Analysis agent built successfully!")
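
Before running the agent, it can help to confirm that the wiring has no dead ends. The sketch below restates the edges defined above as a plain dictionary and checks, with a breadth-first search, that every node is reachable from `plan` and can reach the end of the graph. `END_NODE`, `EDGES`, and `reachable` are hypothetical names for this example and do not touch the compiled LangGraph object.

```python
from collections import deque
from typing import Dict, List, Set

END_NODE = "__end__"  # placeholder label for the terminal node in this sketch

# Edges as wired in the workflow above
EDGES: Dict[str, List[str]] = {
    "plan": ["analyze"],
    "analyze": ["tools", "analyze", "report", END_NODE],
    "tools": ["analyze"],
    "report": [END_NODE],
}

def reachable(start: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Return every node reachable from start, including start itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

from_plan = reachable("plan", EDGES)
dead_ends = [node for node in EDGES if END_NODE not in reachable(node, EDGES)]
print(sorted(from_plan), dead_ends)
```

An empty `dead_ends` list means that every node has some path to the end; the iteration cap is still needed because the `analyze` self-loop and the `tools` cycle can repeat indefinitely.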

## 8. Run Analysis Examples

In [None]:
def run_analysis(query: str):
    """Run the analysis agent."""
    print(f"\n{'='*80}")
    print(f"Analysis Request: {query}")
    print(f"{'='*80}\n")
    
    # Initialize state
    initial_state = {
        "query": query,
        "messages": [],
        "analysis_steps": [],
        "findings": [],
        "next_action": "plan",
        "final_report": "",
        "iterations": 0
    }
    
    # Run agent
    result = analysis_agent.invoke(initial_state)
    
    # Display report
    print("\nFinal Report:")
    print("="*80)
    display(Markdown(result['final_report']))
    
    return result

# Example 1: Sales analysis
result1 = run_analysis(
    "Analyze total sales and revenue by product. Which product performs best?"
)

In [None]:
# Example 2: Regional analysis
result2 = run_analysis(
    "Compare regional performance. Which region has the highest growth?"
)

In [None]:
# Example 3: Trend analysis
result3 = run_analysis(
    "Identify trends in monthly sales and revenue. Are we growing or declining?"
)

## 9. Manual Data Analysis with Tools

You can also use the tools directly for quick analysis.

In [None]:
# Get basic stats
print(get_basic_stats.invoke({"column": "revenue"}))
print("\n" + "="*80 + "\n")

# Top products by sales
print(get_top_performers.invoke({"metric": "sales", "top_n": 3, "by": "product"}))
print("\n" + "="*80 + "\n")

# Growth analysis
print(calculate_growth.invoke({"column": "revenue", "period": "quarter"}))

## 10. Visualization Generation

Create visualizations to complement the analysis.

In [None]:
# Sales by product
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: Total sales by product
product_sales = df.groupby('product')['sales'].sum().sort_values(ascending=False)
product_sales.plot(kind='bar', ax=axes[0, 0], color='skyblue')
axes[0, 0].set_title('Total Sales by Product', fontsize=14, fontweight='bold')
axes[0, 0].set_ylabel('Sales')

# Plot 2: Revenue by region
region_revenue = df.groupby('region')['revenue'].sum().sort_values(ascending=False)
region_revenue.plot(kind='bar', ax=axes[0, 1], color='lightcoral')
axes[0, 1].set_title('Total Revenue by Region', fontsize=14, fontweight='bold')
axes[0, 1].set_ylabel('Revenue')

# Plot 3: Monthly sales trend
monthly_sales = df.groupby('month')['sales'].mean()
monthly_sales.plot(kind='line', ax=axes[1, 0], marker='o', color='green')
axes[1, 0].set_title('Average Monthly Sales Trend', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Month')
axes[1, 0].set_ylabel('Average Sales')

# Plot 4: Quarterly revenue by product
quarterly_product = df.groupby(['quarter', 'product'])['revenue'].sum().unstack()
quarterly_product.plot(kind='bar', ax=axes[1, 1], stacked=False)
axes[1, 1].set_title('Quarterly Revenue by Product', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Quarter')
axes[1, 1].set_ylabel('Revenue')
axes[1, 1].legend(title='Product')

plt.tight_layout()
plt.show()

## 11. Advanced Agent with Self-Correction

In [None]:
class SelfCorrectingAnalysisAgent:
    """An agent that can validate and correct its own analysis."""
    
    def __init__(self):
        self.llm = ChatGoogleGenerativeAI(
            model="gemini-3-pro",
            temperature=0.1,
            google_api_key=GOOGLE_API_KEY
        )
        self.analysis_history = []
    
    def analyze(self, query: str) -> str:
        """Perform analysis with self-correction."""
        # Step 1: Initial analysis
        analysis = self._perform_analysis(query)
        
        # Step 2: Validate
        validation = self._validate_analysis(query, analysis)
        
        # Step 3: Correct if needed
        if validation['needs_correction']:
            analysis = self._correct_analysis(query, analysis, validation['issues'])
        
        # Save to history
        self.analysis_history.append({
            'query': query,
            'analysis': analysis,
            'validated': not validation['needs_correction']
        })
        
        return analysis
    
    def _perform_analysis(self, query: str) -> str:
        """Perform initial analysis."""
        prompt = f"Analyze this data request: {query}\nProvide detailed insights."
        response = self.llm.invoke(prompt)
        return response.content
    
    def _validate_analysis(self, query: str, analysis: str) -> Dict[str, Any]:
        """Validate the analysis."""
        validation_prompt = f"""
        Original query: {query}
        Analysis: {analysis}
        
        Validate this analysis. Check for:
        1. Accuracy
        2. Completeness
        3. Logical consistency
        
        Respond with JSON: {{"needs_correction": true/false, "issues": [list of issues]}}
        """
        
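        # Ask the model to critique the analysis; treat unparseable replies as "no issues"
        response = self.llm.invoke(validation_prompt)
        text = response.content if isinstance(response.content, str) else str(response.content)
        try:
            parsed = json.loads(text[text.find('{'):text.rfind('}') + 1])
            return {
                "needs_correction": bool(parsed.get("needs_correction", False)),
                "issues": [str(issue) for issue in parsed.get("issues", [])]
            }
        except (ValueError, AttributeError, TypeError):
            return {"needs_correction": False, "issues": []}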
    
    def _correct_analysis(self, query: str, analysis: str, issues: List[str]) -> str:
        """Correct the analysis."""
        correction_prompt = f"""
        Original query: {query}
        Previous analysis: {analysis}
        Issues found: {', '.join(issues)}
        
        Provide a corrected analysis addressing these issues.
        """
        
        response = self.llm.invoke(correction_prompt)
        return response.content

# Test the self-correcting agent
agent = SelfCorrectingAnalysisAgent()
result = agent.analyze("What are the key insights from our sales data?")
display(Markdown(result))
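
The class above applies at most one correction pass. One possible extension is to re-validate after each correction and stop after a fixed number of rounds. The sketch below shows that loop with stub functions in place of LLM calls. `refine`, `stub_validate`, `stub_correct`, and `max_rounds` are hypothetical names introduced for this illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

def refine(
    draft: str,
    validate: Callable[[str], Dict[str, Any]],
    correct: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, int, bool]:
    """Validate and correct until clean, or until max_rounds corrections have run."""
    rounds = 0
    verdict = validate(draft)
    while verdict["needs_correction"] and rounds < max_rounds:
        draft = correct(draft, verdict["issues"])
        rounds += 1
        verdict = validate(draft)
    return draft, rounds, not verdict["needs_correction"]

def stub_validate(draft: str) -> Dict[str, Any]:
    """Stub validator: the draft must mention both metrics."""
    issues = [f"missing {m}" for m in ("sales", "revenue") if m not in draft]
    return {"needs_correction": bool(issues), "issues": issues}

def stub_correct(draft: str, issues: List[str]) -> str:
    """Stub corrector: addresses only the first issue per round."""
    return draft + " " + issues[0].replace("missing ", "")

final_draft, rounds_used, validated = refine("Summary:", stub_validate, stub_correct)
print(final_draft, rounds_used, validated)
```

The round limit plays the same role as `MAX_ITERATIONS` in the graph: it bounds cost when the validator and corrector never converge.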

## 12. Best Practices for Autonomous Agents

### Key Principles:

1. **Clear State Management**: Define explicit states and transitions
2. **Tool Design**: Create focused, single-purpose tools
3. **Iteration Limits**: Prevent infinite loops with max iterations
4. **Error Handling**: Handle errors gracefully at each step
5. **Validation**: Validate outputs before moving to next step
6. **Logging**: Track agent decisions for debugging
7. **Human-in-the-Loop**: Allow human review for critical decisions
8. **Testing**: Test each node independently
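
As a rough illustration of the logging and testing principles above, a node can be wrapped in a decorator that records each routing decision, and then exercised on a hand-built state without compiling a graph. `logged_node`, `decision_log`, and `toy_planner` are hypothetical names for this sketch.

```python
import functools
from typing import Any, Callable, Dict, List

decision_log: List[Dict[str, Any]] = []

def logged_node(fn: Callable[[Dict[str, Any]], Dict[str, Any]]):
    """Record which node ran and which action it chose next."""
    @functools.wraps(fn)
    def wrapper(state: Dict[str, Any]) -> Dict[str, Any]:
        result = fn(state)
        decision_log.append({
            "node": fn.__name__,
            "next_action": result.get("next_action"),
        })
        return result
    return wrapper

@logged_node
def toy_planner(state: Dict[str, Any]) -> Dict[str, Any]:
    """Toy node with no LLM call, so it can be tested deterministically."""
    state["next_action"] = "analyze"
    return state

# Test the node on its own with a hand-built state
out = toy_planner({"query": "demo", "next_action": "plan"})
print(decision_log)
```

Because the wrapper preserves the node's signature, a wrapped function can be passed wherever the unwrapped node function would be used.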

### Common Patterns:

- **Plan-Execute-Reflect**: Plan → Execute → Validate → Report
- **Multi-step Reasoning**: Break complex tasks into smaller steps
- **Tool Composition**: Combine multiple tools for complex analysis
- **Self-Correction**: Validate and correct outputs automatically

## Next Steps

Explore more advanced topics:
- Build Streamlit applications with Gemini
- Implement multimodal analysis (text, images, video)
- Create production-ready agent systems
- Deploy agents at scale

---

## Learn More

Master autonomous AI agents with the **[Gen AI Crash Course](https://www.buildfastwithai.com/genai-course)** by Build Fast with AI!
