# 🤖 GenAI for Data Analysis: Ask Questions, Get SQL, Understand Results

Welcome! This project demonstrates how to use **Google Gemini AI** to enhance the data analysis process with:
- AI-generated **questions and SQL queries**
- Natural language **summaries of query results**
- Support for **custom CSV uploads** or a built-in **Kaggle dataset**

🎯 Goal: Show how GenAI can act like a smart analyst assistant for any dataset.


# 1. 🔧 Setup: Libraries, API Key, and Imports

In [1]:
# Gemini + API Setup
from google import genai
from google.genai import types
from kaggle_secrets import UserSecretsClient
import warnings



In [2]:
# For API retry handling
from google.api_core import retry

In [3]:
# Displaying Data
import pandas as pd

In [4]:
# SQLite DB
import sqlite3

In [5]:
# Libraries for the uploaded CSV:
from IPython.display import display
import ipywidgets as widgets
import io

In [6]:
# Retry API calls that fail with 429 (rate limit) or 503 (service unavailable)
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

genai.models.Models.generate_content = retry.Retry(
    predicate=is_retriable)(genai.models.Models.generate_content)
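
To make the retry behaviour above more concrete, here is a minimal standard-library sketch of the same idea: call a function again, with a growing delay, only when the error matches a predicate. `FakeAPIError`, `retry_call`, and `flaky` are hypothetical names used for illustration. They are not part of `google.genai` or `google.api_core`, and the real `retry.Retry` has its own backoff defaults.

```python
import time

class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""
    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code

def is_retriable(e):
    # Same idea as the predicate above: retry only on 429 and 503.
    return isinstance(e, FakeAPIError) and e.code in {429, 503}

def retry_call(fn, predicate, max_attempts=5, initial_delay=0.01, multiplier=2.0):
    """Call fn(), retrying with exponential backoff while predicate(error) is true."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as e:
            # Give up on the last attempt or on errors that should not be retried.
            if attempt == max_attempts or not predicate(e):
                raise
            time.sleep(delay)
            delay *= multiplier

calls = {"count": 0}

def flaky():
    # Fails twice with a retriable error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeAPIError(429)
    return "ok"

result = retry_call(flaky, is_retriable)
print(result, calls["count"])
```

Errors with other status codes (for example 400) are raised immediately, which matches the intent of the predicate in the cell above.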

In [7]:
# getting API key
GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
client = genai.Client(api_key=GOOGLE_API_KEY)

# 2. 📂 Load a Dataset (Kaggle OR Upload your own CSV)

## Loading Kaggle data

In [8]:
# Data Set: 
data = pd.read_csv('/kaggle/input/amazon-sales-2025/amazon_sales_data 2025.csv')
data

Unnamed: 0,Order ID,Date,Product,Category,Price,Quantity,Total Sales,Customer Name,Customer Location,Payment Method,Status
0,ORD0001,14-03-25,Running Shoes,Footwear,60,3,180,Emma Clark,New York,Debit Card,Cancelled
1,ORD0002,20-03-25,Headphones,Electronics,100,4,400,Emily Johnson,San Francisco,Debit Card,Pending
2,ORD0003,15-02-25,Running Shoes,Footwear,60,2,120,John Doe,Denver,Amazon Pay,Cancelled
3,ORD0004,19-02-25,Running Shoes,Footwear,60,3,180,Olivia Wilson,Dallas,Credit Card,Pending
4,ORD0005,10-03-25,Smartwatch,Electronics,150,3,450,Emma Clark,New York,Debit Card,Pending
...,...,...,...,...,...,...,...,...,...,...,...
245,ORD0246,17-03-25,T-Shirt,Clothing,20,2,40,Daniel Harris,Miami,Debit Card,Cancelled
246,ORD0247,30-03-25,Jeans,Clothing,40,1,40,Sophia Miller,Dallas,Debit Card,Cancelled
247,ORD0248,05-03-25,T-Shirt,Clothing,20,2,40,Chris White,Denver,Debit Card,Cancelled
248,ORD0249,08-03-25,Smartwatch,Electronics,150,3,450,Emily Johnson,New York,Debit Card,Cancelled


## Loading input data

> 👉 **Tip**: You can upload your own `.csv` file using the widget below. If you skip this step, the notebook will use a default Amazon Sales dataset.


In [9]:
# Upload CSV widget
upload = widgets.FileUpload(accept='.csv', multiple=False)
display(upload)

# Handle upload
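def handle_upload():
    """Return the uploaded CSV as a DataFrame, or None if nothing was uploaded."""
    if not upload.value:
        return None  # No file uploaded
    # ipywidgets 7 exposes a dict keyed by filename; ipywidgets 8 exposes a tuple of dicts
    files = upload.value.values() if isinstance(upload.value, dict) else upload.value
    uploaded = next(iter(files))
    df = pd.read_csv(io.BytesIO(uploaded['content']))
    df.columns = [col.strip().replace(" ", "_") for col in df.columns]
    return df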

# Try user upload
df = handle_upload()

# Fallback if nothing is uploaded
if df is None:
    print("⚠️ No user file uploaded — using default Kaggle dataset instead.")
    df = pd.read_csv('/kaggle/input/amazon-sales-2025/amazon_sales_data 2025.csv')
    df.columns = [col.strip().replace(" ", "_") for col in df.columns]


FileUpload(value=(), accept='.csv', description='Upload')

⚠️ No user file uploaded — using default Kaggle dataset instead.
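
Both loading paths clean the column names the same way (strip whitespace, replace spaces with underscores) so that columns can be referenced in SQL without quoting. A minimal sketch of that step as a reusable helper is shown here. `normalize_columns` is a hypothetical name, and the numeric suffix for duplicate names is an added assumption, not something the notebook does.

```python
def normalize_columns(names):
    """Strip whitespace and replace spaces with underscores.

    Assumption for this sketch: duplicate names get a numeric suffix so that
    every column stays addressable in SQL.
    """
    seen = {}
    cleaned = []
    for name in names:
        base = str(name).strip().replace(" ", "_")
        count = seen.get(base, 0)
        seen[base] = count + 1
        cleaned.append(base if count == 0 else f"{base}_{count}")
    return cleaned

columns = normalize_columns(["Order ID", " Total Sales", "Status", "Status "])
print(columns)  # ['Order_ID', 'Total_Sales', 'Status', 'Status_1']
```

With pandas this could be applied as `df.columns = normalize_columns(df.columns)`.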


# 3. 🧠 Ask AI: What Should We Explore?

## Prompt for the Kaggle dataset

In [10]:
# Prompt:
sample = data.head(5).to_markdown()  # Only show a small sample in the prompt
prompt = f"""
Here are a few rows of our dataset:

{sample}

Based on this dataset, what are some useful questions we should ask during further data analysis?
"""

short_config = types.GenerateContentConfig(max_output_tokens=200)

response = client.models.generate_content(
    model='gemini-2.0-flash',
    config=short_config,
    contents=prompt
)
print(response.text)

Okay, based on the sample data provided, here's a breakdown of useful questions we can ask during further data analysis, categorized by theme:

**1. Sales Performance & Trends:**

*   **Overall Sales:**
    *   What is the total revenue generated?
    *   What is the average order value?
    *   How many orders were placed in total?
*   **Temporal Analysis (Time-Based):**
    *   What are the monthly/quarterly/yearly sales trends? (Look for seasonality or growth patterns).  Is there a particular month that has higher sales?
    *   Are there specific dates or periods with unusually high or low sales? (e.g., holiday promotions).
    *   How has sales performance changed over time? (Year-over-year growth, etc.).
*   **Product Performance:**
    *   Which products are the best sellers?
    *   Which product categories are the


## Prompt for the input data

In [11]:
# Prompt:
sample_1 = df.head(5).to_markdown()

prompt_1 = f"""
Here are a few rows of our dataset:

{sample_1}

Based on this dataset, what are some useful questions we should ask during further data analysis?
"""

response = client.models.generate_content(
    model='gemini-2.0-flash',
    config=short_config,
    contents=prompt_1
)

print(response.text)


Okay, based on the provided dataset, here are some useful questions to ask during further data analysis, categorized for clarity:

**Sales Performance & Trends:**

*   **Overall Sales Performance:**
    *   What is the total revenue generated during the period covered by the data?
    *   What is the average order value?
    *   What is the distribution of order values? (Are there many small orders or a few large ones?)
    *   What are the minimum and maximum values for 'Price', 'Quantity', and 'Total_Sales'?

*   **Temporal Trends:**
    *   How do sales fluctuate over time (daily, weekly, monthly)? (Requires more data than currently visible, but is a critical question.)
    *   Are there any seasonal patterns in sales?
    *   Are there any trends of products becoming more or less popular?

*   **Product Performance:**
    *   Which products generate the most revenue?


# 4. 🧾 Generate & Run SQL Queries from Natural Language

In [12]:
# Description function for both the Kaggle and user-uploaded CSV:
def describe_table(table_name: str):
    cursor = db_conn.cursor()
    cursor.execute(f"PRAGMA table_info({table_name});")
    return [(col[1], col[2]) for col in cursor.fetchall()]
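
`describe_table` interpolates the table name directly into the `PRAGMA` statement, which is fine for the fixed names used in this notebook. If the name ever came from user input, one option is to check it first. The sketch below is a hypothetical variant (`describe_table_safe` is not part of the notebook) that takes the connection as an argument and only accepts plain identifiers.

```python
import re
import sqlite3

def describe_table_safe(conn, table_name):
    """Return (column, type) pairs, rejecting names that are not plain identifiers."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"Invalid table name: {table_name!r}")
    cursor = conn.cursor()
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
    cursor.execute(f"PRAGMA table_info({table_name});")
    return [(col[1], col[2]) for col in cursor.fetchall()]

# Small in-memory table for demonstration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Order_ID TEXT, Price INTEGER)")
columns = describe_table_safe(conn, "demo")
print(columns)  # [('Order_ID', 'TEXT'), ('Price', 'INTEGER')]
```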

In [13]:
# Query function for both the Kaggle and user-uploaded CSV:
def execute_query(sql: str) -> list[tuple]:
    """Execute an SQL statement, returning the results."""
    print(f' - DB CALL: execute_query({sql})')

    cursor = db_conn.cursor()

    cursor.execute(sql)
    return cursor.fetchall()
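
Because `execute_query` runs whatever SQL it is given, and that SQL will later come from a model, it may be worth refusing anything that is not a plain `SELECT`. The following is a rough sketch of such a guard, assuming a simple keyword check is acceptable. `execute_select_only` is a hypothetical helper, and the check is deliberately conservative (it also rejects queries with a semicolon inside a string literal).

```python
import sqlite3

def execute_select_only(conn, sql):
    """Run sql only if it looks like a single SELECT; otherwise raise ValueError."""
    statement = sql.strip().rstrip(";").strip()
    first_word = statement.split(None, 1)[0].upper() if statement else ""
    if first_word != "SELECT":
        raise ValueError(f"Refusing to run non-SELECT statement: {first_word or '<empty>'}")
    if ";" in statement:
        raise ValueError("Refusing to run more than one statement")
    return conn.execute(statement).fetchall()

# Small in-memory table for demonstration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (Category TEXT, Total_Sales INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("Footwear", 180), ("Electronics", 400)],
)
rows = execute_select_only(conn, "SELECT Category FROM data WHERE Total_Sales > 200;")
print(rows)  # [('Electronics',)]
```

This is a keyword filter, not a SQL parser, so it should be treated as a first line of defence rather than a guarantee.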

## Running SQL queries on the Kaggle dataset

In [14]:
# Using sqlite3 to create the database
data.columns = [col.strip().replace(" ", "_") for col in data.columns]  # Clean column names

db_conn = sqlite3.connect("sample.db")
data.to_sql("data", db_conn, if_exists="replace", index=False)


250

In [15]:
describe_table("data")

[('Order_ID', 'TEXT'),
 ('Date', 'TEXT'),
 ('Product', 'TEXT'),
 ('Category', 'TEXT'),
 ('Price', 'INTEGER'),
 ('Quantity', 'INTEGER'),
 ('Total_Sales', 'INTEGER'),
 ('Customer_Name', 'TEXT'),
 ('Customer_Location', 'TEXT'),
 ('Payment_Method', 'TEXT'),
 ('Status', 'TEXT')]
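
The `(name, type)` pairs returned by `describe_table` can also be turned into a schema description for a prompt, so the model sees that, for instance, `Date` is stored as `TEXT`. A minimal sketch follows. `format_schema` is a hypothetical helper and the exact wording of the schema text is an assumption.

```python
def format_schema(table_name, columns):
    """Render (name, type) pairs as a schema description for an LLM prompt."""
    lines = [f"Table: {table_name}", "Columns:"]
    lines += [f"- {name} ({col_type})" for name, col_type in columns]
    return "\n".join(lines)

schema_text = format_schema(
    "data",
    [("Order_ID", "TEXT"), ("Date", "TEXT"), ("Price", "INTEGER")],
)
print(schema_text)
```

In the notebook this could be called as `format_schema("data", describe_table("data"))`.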

In [16]:
execute_query("select * from data where Category == 'Footwear'")

 - DB CALL: execute_query(select * from data where Category == 'Footwear')


[('ORD0001',
  '14-03-25',
  'Running Shoes',
  'Footwear',
  60,
  3,
  180,
  'Emma Clark',
  'New York',
  'Debit Card',
  'Cancelled'),
 ('ORD0003',
  '15-02-25',
  'Running Shoes',
  'Footwear',
  60,
  2,
  120,
  'John Doe',
  'Denver',
  'Amazon Pay',
  'Cancelled'),
 ('ORD0004',
  '19-02-25',
  'Running Shoes',
  'Footwear',
  60,
  3,
  180,
  'Olivia Wilson',
  'Dallas',
  'Credit Card',
  'Pending'),
 ('ORD0019',
  '22-03-25',
  'Running Shoes',
  'Footwear',
  60,
  3,
  180,
  'Olivia Wilson',
  'Houston',
  'Credit Card',
  'Completed'),
 ('ORD0046',
  '06-03-25',
  'Running Shoes',
  'Footwear',
  60,
  2,
  120,
  'David Lee',
  'Houston',
  'Debit Card',
  'Cancelled'),
 ('ORD0053',
  '24-03-25',
  'Running Shoes',
  'Footwear',
  60,
  4,
  240,
  'Emily Johnson',
  'Los Angeles',
  'PayPal',
  'Completed'),
 ('ORD0079',
  '09-03-25',
  'Running Shoes',
  'Footwear',
  60,
  2,
  120,
  'Emily Johnson',
  'Denver',
  'Gift Card',
  'Cancelled'),
 ('ORD0080',
  '23-02
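
The rows above show that `Date` is stored as text such as `'14-03-25'`. Time-based questions (monthly trends, for example) need that text rearranged into a sortable form first. The sketch below assumes the format is DD-MM-YY and that all years fall in the 2000s. Both are assumptions based on the visible sample, not something the dataset documents. It uses a small in-memory table rather than `sample.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (Order_ID TEXT, Date TEXT, Total_Sales INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        ("ORD0001", "14-03-25", 180),
        ("ORD0002", "20-03-25", 400),
        ("ORD0003", "15-02-25", 120),
    ],
)

# substr() is 1-indexed: characters 7-8 are the year, 4-5 the month (DD-MM-YY assumed)
monthly_sql = """
SELECT '20' || substr(Date, 7, 2) || '-' || substr(Date, 4, 2) AS month,
       SUM(Total_Sales) AS total_sales
FROM data
GROUP BY month
ORDER BY month
"""
monthly = conn.execute(monthly_sql).fetchall()
print(monthly)  # [('2025-02', 120), ('2025-03', 580)]
```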

## Running SQL queries on the input data

In [17]:
# 🛢️ Save uploaded data to SQLite
db_conn = sqlite3.connect("sample.db")
df.to_sql("df", db_conn, if_exists="replace", index=False)

250

In [18]:
describe_table("df")

[('Order_ID', 'TEXT'),
 ('Date', 'TEXT'),
 ('Product', 'TEXT'),
 ('Category', 'TEXT'),
 ('Price', 'INTEGER'),
 ('Quantity', 'INTEGER'),
 ('Total_Sales', 'INTEGER'),
 ('Customer_Name', 'TEXT'),
 ('Customer_Location', 'TEXT'),
 ('Payment_Method', 'TEXT'),
 ('Status', 'TEXT')]

In [19]:
"""Uncomment the line below to execute a query on the user-uploaded dataset."""
# execute_query("SELECT * FROM df LIMIT 5")

'Uncomment the line below to execute a query on the user-uploaded dataset.'
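
As an illustration of the kind of query that could be run against the `df` table, here is a small self-contained sketch that aggregates revenue by category. It rebuilds a reduced table in memory from the first five rows shown earlier (only four of the columns), so the numbers are for those five rows, not for the full dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (Order_ID TEXT, Category TEXT, Total_Sales INTEGER, Status TEXT)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?, ?, ?)",
    [
        ("ORD0001", "Footwear", 180, "Cancelled"),
        ("ORD0002", "Electronics", 400, "Pending"),
        ("ORD0003", "Footwear", 120, "Cancelled"),
        ("ORD0004", "Footwear", 180, "Pending"),
        ("ORD0005", "Electronics", 450, "Pending"),
    ],
)

sql = """
SELECT Category, COUNT(*) AS orders, SUM(Total_Sales) AS revenue
FROM df
GROUP BY Category
ORDER BY revenue DESC
"""
summary = conn.execute(sql).fetchall()
print(summary)  # [('Electronics', 2, 850), ('Footwear', 3, 480)]
```

In the notebook itself, the same SQL string could be passed to `execute_query` once `df` has been saved to SQLite.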

# Agents

In [20]:
# Agent function for both the Kaggle and user-uploaded CSV:
def agent_loop(df, table_name="data"):
    print("🔍 Ask a question about the dataset (or type 'exit'):")

    # Dynamically build schema from df
    schema = f"Table: {table_name}\nColumns:\n"
    for col in df.columns:
        schema += f"- {col}\n"

    # Start interaction loop
    while True:
        user_input = input("\n🧍 You: ")
        if user_input.lower() == "exit":
            break

        # Prompt for SQL
        prompt = f"""
        You are a helpful assistant that answers data questions by generating SQL queries.
        Here is the table schema:
        {schema}
        Question: {user_input}
        Only respond with a valid SQL query.
        """
        response = client.models.generate_content(
            model='gemini-2.0-flash',
            config=short_config,
            contents=prompt
        )
        sql = response.text.strip().replace("```sql", "").replace("```", "").strip()
        print(f"\n🧾 Cleaned SQL:\n{sql}")

        try:
            result = execute_query(sql)
            print("\n📊 Query Results:")
            for row in result:
                print(row)

            # Explanation
            summary_prompt = f"""
            Here is the result of the SQL query:
            {result}
            Explain this result in plain English for a data analyst.
            """
            summary = client.models.generate_content(
                model='gemini-2.0-flash',
                config=short_config,
                contents=summary_prompt
            )
            print(f"\n🗣️ Summary:\n{summary.text}")
        except Exception as e:
            print(f"❌ Error: {e}")
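
The agent cleans the model's reply by deleting Markdown code-fence markers with `replace`. If the model adds a sentence before or after the fenced query, that text would end up in the SQL. One alternative, sketched here, is to pull out only the contents of the first fenced block. `extract_sql` is a hypothetical helper, and the assumption is that the reply contains at most one fenced block tagged `sql` or untagged.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this example readable

def extract_sql(response_text):
    """Return the SQL inside the first fenced block, or the stripped text if unfenced."""
    pattern = FENCE + r"(?:sql)?\s*(.*?)" + FENCE
    match = re.search(pattern, response_text, flags=re.DOTALL | re.IGNORECASE)
    sql = match.group(1) if match else response_text
    return sql.strip()

reply = f"Here is the query:\n{FENCE}sql\nSELECT COUNT(*) FROM data;\n{FENCE}\nHope this helps!"
cleaned = extract_sql(reply)
print(cleaned)  # SELECT COUNT(*) FROM data;
```

A reply without any fence is returned unchanged apart from surrounding whitespace.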


## 5. 🤖 Agent Mode: Chat with Your Data (Kaggle Data and Input Data)

In [21]:
agent_loop(data) 

🔍 Ask a question about the dataset (or type 'exit'):



🧍 You:  exit


In [22]:
"""Uncomment the line below to execute the agent_loop on the user-uploaded dataset."""
# agent_loop(df, table_name="df")  

'Uncomment the line below to execute the agent_loop on the user-uploaded dataset.'