# 1. 🔧 Setup: Libraries, API Key, and Imports

In [1]:
# Core Libraries
import pandas as pd
import sqlite3
import io
import warnings

In [2]:
# Widgets for file upload
from IPython.display import display
import ipywidgets as widgets

In [3]:
# Gemini API Setup
from google import genai
from google.genai import types
%pip install google.api_core
from google.api_core import retry
from google.api_core.exceptions import GoogleAPIError
from google.generativeai.types import GenerationConfig


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [4]:
# 🔧 Setup: Load API Key + Gemini Configuration
import os
from dotenv import load_dotenv
%pip install google.generativeai
import google.generativeai as genai

# Load your .env file
load_dotenv()

# Grab the key
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

# Configure Gemini
genai.configure(api_key=GOOGLE_API_KEY)



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [5]:
from google.api_core import retry

# Retry handler for transient API errors
is_retriable = lambda e: (
    isinstance(e, genai.errors.APIError) and e.code in {429, 503}
)

genai.GenerativeModel.generate_content = retry.Retry(
    predicate=is_retriable
)(genai.GenerativeModel.generate_content)


# 2. 📂 Load a Dataset (Kaggle OR Upload your own CSV)

## Loading kaggle data

In [6]:
# Data Set: 
data = pd.read_csv('amazon_sales_data 2025.csv')
data

Unnamed: 0,Order ID,Date,Product,Category,Price,Quantity,Total Sales,Customer Name,Customer Location,Payment Method,Status
0,ORD0001,14-03-25,Running Shoes,Footwear,60,3,180,Emma Clark,New York,Debit Card,Cancelled
1,ORD0002,20-03-25,Headphones,Electronics,100,4,400,Emily Johnson,San Francisco,Debit Card,Pending
2,ORD0003,15-02-25,Running Shoes,Footwear,60,2,120,John Doe,Denver,Amazon Pay,Cancelled
3,ORD0004,19-02-25,Running Shoes,Footwear,60,3,180,Olivia Wilson,Dallas,Credit Card,Pending
4,ORD0005,10-03-25,Smartwatch,Electronics,150,3,450,Emma Clark,New York,Debit Card,Pending
...,...,...,...,...,...,...,...,...,...,...,...
245,ORD0246,17-03-25,T-Shirt,Clothing,20,2,40,Daniel Harris,Miami,Debit Card,Cancelled
246,ORD0247,30-03-25,Jeans,Clothing,40,1,40,Sophia Miller,Dallas,Debit Card,Cancelled
247,ORD0248,05-03-25,T-Shirt,Clothing,20,2,40,Chris White,Denver,Debit Card,Cancelled
248,ORD0249,08-03-25,Smartwatch,Electronics,150,3,450,Emily Johnson,New York,Debit Card,Cancelled


## Loading input data

> 👉 **Tip**: You can upload your own `.csv` file using the widget below. If you skip this step, the notebook will use a default Amazon Sales dataset.


In [8]:
# Upload CSV widget
upload = widgets.FileUpload(accept='.csv', multiple=False)
display(upload)

# Ask user if they want to upload a CSV
use_upload = input("📤 Do you want to upload your own CSV file? (y/n): ").lower()

if use_upload == 'y':
    # Show the upload widget
    upload = widgets.FileUpload(accept='.csv', multiple=False)
    display(upload)
    print("📤 Please upload your file using the widget above, then run the next cell.")
else:
    print("📂 Skipping upload. We'll use the default dataset instead.")


FileUpload(value=(), accept='.csv', description='Upload')

FileUpload(value=(), accept='.csv', description='Upload')

📤 Please upload your file using the widget above, then run the next cell.


In [13]:
def handle_upload():
    if 'upload' in globals() and upload.value:
        for file_info in upload.value:
            content = file_info['content']  # Access the content attribute of the file info
            df = pd.read_csv(io.BytesIO(content))
            df.columns = [col.strip().replace(" ", "_") for col in df.columns]
            return df
    return None  # No file uploaded

df = handle_upload()

if df is None:
    print("⚠️ No user file uploaded — using default Kaggle dataset instead.")
    df = pd.read_csv('amazon_sales_data 2025.csv')
    df.columns = [col.strip().replace(" ", "_") for col in df.columns]

df.head()


Unnamed: 0,Unnamed:_0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


# 3. 🧠 Ask AI: What Should We Explore?

## prompt for the kaggle data set

In [14]:
# Create the model
model = genai.GenerativeModel("gemini-2.0-flash")

## Prompt example
sample = data.head(5).to_markdown()
prompt = f"""
Here is a few rows of our dataset:

{sample}

Based on this dataset, what are some useful questions we should ask during further data analysis?
"""

# Config
short_config = GenerationConfig(max_output_tokens=200)

# Generate
response = model.generate_content(prompt, generation_config=short_config)
print(response.text)


Okay, based on the provided dataset columns, here are some useful questions to explore during further data analysis, categorized for clarity:

**Sales & Revenue Analysis:**

*   **What are the overall total sales?** (Simple, but foundational)
*   **What are the average sales per order?**
*   **Which product category generates the most revenue?**
*   **Which products are the top sellers (by quantity and revenue)?**
*   **What is the distribution of order quantities?  Are most orders for single items, or are customers buying multiple items?**
*   **What is the average price per product category?**
*   **How does the average order value vary by payment method?**
*   **How does the sales performance change over time? (Requires more data points, but essential for understanding trends).**
*   **What is the distribution of 'Total Sales'?**

**Customer Behavior & Segmentation:**

*


## Prompt for the input data

In [15]:
# Prompt:
sample_1 = df.head(5).to_markdown()

prompt_1 = f"""
Here is a few rows of our dataset:

{sample_1}

Based on this dataset, what are some useful questions we should ask during further data analysis?
"""

# Config
short_config = GenerationConfig(max_output_tokens=200)

# Generate
response = model.generate_content(prompt, generation_config=short_config)
print(response.text)


Okay, based on the provided dataset, here are some useful questions to ask during further data analysis, categorized for clarity:

**I. Sales Performance & Trends:**

*   **Overall Sales:**
    *   What is the total revenue generated?
    *   What is the average order value?
    *   What are the minimum and maximum sales amounts?
    *   What are the total units sold?
*   **Temporal Analysis:**
    *   What are the sales trends over time (daily, weekly, monthly)?  Can we identify any seasonality?
    *   Are there any peak sales periods?
    *   How does sales performance vary across different months/quarters of the year?
*   **Product Analysis:**
    *   Which products are the top sellers?
    *   Which products are the least popular?
    *   What is the average quantity sold per product?
    *   What are the sales trends for each


# 4. 🧾 Generate & Run SQL Queries from Natural Language

In [16]:
# description function for both kaggle and user inputed csv:
def describe_table(conn, table_name: str):
    cursor = conn.cursor()
    cursor.execute(f"PRAGMA table_info({table_name});")
    return [(col[1], col[2]) for col in cursor.fetchall()]

In [17]:
# Query function for both kaggle and user inputed csv: 
def execute_query(conn, sql: str) -> list[list[str]]:
    print(f' - DB CALL: execute_query({sql})')
    cursor = conn.cursor()
    cursor.execute(sql)
    return cursor.fetchall()

## running SQL Queries for the kaggle data set

In [18]:
kaggle_conn = sqlite3.connect("sample.db")
data.columns = [col.strip().replace(" ", "_") for col in data.columns]
data.to_sql("data", kaggle_conn, if_exists="replace", index=False)

250

In [19]:
describe_table(kaggle_conn, "data")

[('Order_ID', 'TEXT'),
 ('Date', 'TEXT'),
 ('Product', 'TEXT'),
 ('Category', 'TEXT'),
 ('Price', 'INTEGER'),
 ('Quantity', 'INTEGER'),
 ('Total_Sales', 'INTEGER'),
 ('Customer_Name', 'TEXT'),
 ('Customer_Location', 'TEXT'),
 ('Payment_Method', 'TEXT'),
 ('Status', 'TEXT')]

In [20]:
execute_query(kaggle_conn, "select * from data where Category == 'Footwear'")

 - DB CALL: execute_query(select * from data where Category == 'Footwear')


[('ORD0001',
  '14-03-25',
  'Running Shoes',
  'Footwear',
  60,
  3,
  180,
  'Emma Clark',
  'New York',
  'Debit Card',
  'Cancelled'),
 ('ORD0003',
  '15-02-25',
  'Running Shoes',
  'Footwear',
  60,
  2,
  120,
  'John Doe',
  'Denver',
  'Amazon Pay',
  'Cancelled'),
 ('ORD0004',
  '19-02-25',
  'Running Shoes',
  'Footwear',
  60,
  3,
  180,
  'Olivia Wilson',
  'Dallas',
  'Credit Card',
  'Pending'),
 ('ORD0019',
  '22-03-25',
  'Running Shoes',
  'Footwear',
  60,
  3,
  180,
  'Olivia Wilson',
  'Houston',
  'Credit Card',
  'Completed'),
 ('ORD0046',
  '06-03-25',
  'Running Shoes',
  'Footwear',
  60,
  2,
  120,
  'David Lee',
  'Houston',
  'Debit Card',
  'Cancelled'),
 ('ORD0053',
  '24-03-25',
  'Running Shoes',
  'Footwear',
  60,
  4,
  240,
  'Emily Johnson',
  'Los Angeles',
  'PayPal',
  'Completed'),
 ('ORD0079',
  '09-03-25',
  'Running Shoes',
  'Footwear',
  60,
  2,
  120,
  'Emily Johnson',
  'Denver',
  'Gift Card',
  'Cancelled'),
 ('ORD0080',
  '23-02

## Running SQL Queries from the input data

In [21]:
user_conn = sqlite3.connect("sample_1.db")
df.columns = [col.strip().replace(" ", "_") for col in df.columns]
df.to_sql("df", user_conn, if_exists="replace", index=False)

53940

In [22]:
describe_table(user_conn, "df")

[('Unnamed:_0', 'INTEGER'),
 ('carat', 'REAL'),
 ('cut', 'TEXT'),
 ('color', 'TEXT'),
 ('clarity', 'TEXT'),
 ('depth', 'REAL'),
 ('table', 'REAL'),
 ('price', 'INTEGER'),
 ('x', 'REAL'),
 ('y', 'REAL'),
 ('z', 'REAL')]

In [23]:
print("📂 Default dataset loaded into SQLite as 'data'")
print("📂 User-uploaded dataset loaded into SQLite as 'df'")

📂 Default dataset loaded into SQLite as 'data'
📂 User-uploaded dataset loaded into SQLite as 'df'


# 5. Agents

In [24]:
from tabulate import tabulate
import google.generativeai as genai

# Make sure you have your API key configured somewhere before this
genai.configure(api_key=GOOGLE_API_KEY)

# Create model instance
model = genai.GenerativeModel("gemini-2.0-flash")

# Global config
short_config = genai.types.GenerationConfig(max_output_tokens=200)

In [25]:
# Agents function for both kaggle and user inputed csv:
def agent_loop(df, conn, table_name="data"):
    print("🔍 Ask a question about the dataset (or type 'exit'):")

    # Dynamically build schema from df
    schema = f"Table: {table_name}\nColumns:\n"
    for col in df.columns:
        schema += f"- {col}\n"

    # Start interaction loop
    while True:
        user_input = input("\n🧍 You: ")
        if user_input.lower() == "exit":
            break

        # Prompt for SQL
        prompt = f"""
        You are a helpful assistant that answers data questions by generating SQL queries.
        Here is the table schema:
        {schema}
        Question: {user_input}
        Only respond with a valid SQL query.
        """
        response = model.generate_content(prompt, generation_config=short_config)
        sql = response.text.strip().replace("```sql", "").replace("```", "").strip()
        print(f"\n🧾 Cleaned SQL:\n{sql}")

        try:
            result = execute_query(conn, sql)
            print("\n📊 Query Results:")
            for row in result:
                print(row)

            # Explanation
            summary_prompt = f"""
            Here is the result of the SQL query:
            {result}
            Explain this result in plain English for a data analyst.
            """
            summary = model.generate_content(summary_prompt, generation_config=short_config)
            print(f"\n🗣️ Summary:\n{summary.text}")
        except Exception as e:
            print(f"❌ Error: {e}")


## 5. 🤖 Agent Mode: Chat with Your Data W kaggle data and input data

In [26]:
agent_loop(data, kaggle_conn)

🔍 Ask a question about the dataset (or type 'exit'):

🧾 Cleaned SQL:
SELECT sum(Total_Sales) FROM data
 - DB CALL: execute_query(SELECT sum(Total_Sales) FROM data)

📊 Query Results:
(243845,)

🗣️ Summary:
Okay, here's the explanation in plain English for a data analyst:

"The SQL query you ran returned a single row containing one value: 243845.  It's being presented as a tuple with one element, which is a common way Python (and some SQL query tools) return data.  Therefore, the query appears to have isolated a single numerical result, most likely representing some kind of count, ID, or aggregated value based on the query's criteria. You should look back at the original SQL query to understand what that single number (243845) actually represents in your data."

**In simpler terms:**

"Your query pulled out just one number: 243845. It's the only result. To understand what it *means*, you need to look back at the SQL you wrote to get it."

**Key things to consider based on this result:**


In [27]:
"""Uncomment the line below to execute the agent_loop on the user-uploaded dataset."""
agent_loop(df, user_conn, table_name="df")

🔍 Ask a question about the dataset (or type 'exit'):

🧾 Cleaned SQL:
SELECT sum(price) FROM df
 - DB CALL: execute_query(SELECT sum(price) FROM df)

📊 Query Results:
(212135217,)

🗣️ Summary:
Okay, here's the explanation for a data analyst:

"The SQL query you ran returned a single row, and within that row, there's one column.  The value in that column is 212135217.  Without knowing the context of the query, it's difficult to say exactly what this number represents, but the query selected one value so it might be :

*   **A count:** For example, it could be the total number of customers, orders, products, etc. that match certain criteria.
*   **An ID:** This number may be a single ID that the query was intended to retrieve.
*   **A sum/average/other aggregate:** The number could be the result of a calculation, such as the total revenue, average order value, or maximum price, etc.

To understand the meaning of this number you need to look at the query and see which table was queried and

In [28]:
"""Uncomment the line below to execute a query on the user-uploaded dataset."""
# execute_query()

'Uncomment the line below to execute a query on the user-uploaded dataset.'