# Demonstrating the `create` function in pandas-ai

This notebook demonstrates the `create` function in the `pandas-ai` library. We create datasets in several ways: directly from a DataFrame, with transformations and aggregations, from an external data source, and as a view that joins other datasets.

## 1. Setup

First, let's import the necessary libraries and create the sample sales DataFrame used in the first examples.

In [1]:
import pandas as pd
import pandasai as pai

# Sample DataFrame for sales data
sales_data = {
    'category': ['A', 'B', 'A', 'C', 'B', 'A'],
    'region': ['North', 'South', 'North', 'South', 'East', 'East'],
    'amount': [100, 150, 120, 80, 200, 90],
    'quantity': [10, 15, 12, 8, 20, 9]
}
sales_df = pai.DataFrame(sales_data)


## 2. Basic Dataset Creation

Here's how to create a simple dataset from a DataFrame, providing a description and a schema for the columns.

In [2]:
pai.create(
    path="my-org/my-dataset",
    df=sales_df,
    description="This is a sample dataset.",
    columns=[
        {"name": "category", "type": "string", "description": "Product category" },
        {"name": "region", "type": "string", "description": "Sales region" },
        {"name": "amount", "type": "float", "description": "Sale amount" },
        {"name": "quantity", "type": "integer", "description": "Number of items sold" }
    ],
)

Dataset saved successfully to path: my-org/my-dataset


   category  region  amount  quantity
0         A   North     100        10
1         B   South     150        15
2         A   North     120        12
3         C   South      80         8
4         B    East     200        20
5         A    East      90         9
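
The `columns` schema is meant to describe the DataFrame it is saved with, so it can be worth comparing the two before calling `pai.create`. The sketch below does that in plain Python. `check_schema` and the `PYTHON_TYPES` mapping are hypothetical helpers written for this notebook, not part of `pandas-ai`, and the mapping from schema type names to Python types is an assumption.

```python
from typing import Any

# Assumed mapping from schema type names to acceptable Python types.
# Integers are accepted for "float" because the sample amounts are whole numbers.
PYTHON_TYPES = {"string": (str,), "integer": (int,), "float": (int, float)}


def check_schema(data: dict[str, list[Any]], columns: list[dict[str, str]]) -> list[str]:
    """Return a list of human-readable mismatches between data and schema."""
    problems = []
    declared = {col["name"] for col in columns}
    # Every data column should have a schema entry.
    for name in data:
        if name not in declared:
            problems.append(f"column '{name}' has no schema entry")
    # Every schema entry should exist in the data with compatible values.
    for col in columns:
        name, type_name = col["name"], col["type"]
        if name not in data:
            problems.append(f"schema column '{name}' is missing from the data")
            continue
        expected = PYTHON_TYPES.get(type_name)
        if expected is None:
            problems.append(f"column '{name}' has unknown type '{type_name}'")
        elif not all(isinstance(value, expected) for value in data[name]):
            problems.append(f"column '{name}' has values that are not {type_name}")
    return problems


sales_data = {
    "category": ["A", "B", "A", "C", "B", "A"],
    "region": ["North", "South", "North", "South", "East", "East"],
    "amount": [100, 150, 120, 80, 200, 90],
    "quantity": [10, 15, 12, 8, 20, 9],
}
columns = [
    {"name": "category", "type": "string"},
    {"name": "region", "type": "string"},
    {"name": "amount", "type": "float"},
    {"name": "quantity", "type": "integer"},
]
print(check_schema(sales_data, columns))  # prints []
```

An empty list means every data column is described and every described column holds values of the declared type under the assumed mapping.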


## 3. Dataset with Transformations and Group By

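This example creates a dataset that cleans the data with transformations and aggregates it with `group_by`. Here `fill_na` replaces missing `amount` values with 0, `map_values` renames the category codes, and each `expression`/`alias` pair defines an aggregated column computed per `category` and `region` group.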

In [3]:
pai.create(
    path="my-org/sales-summary",
    df=sales_df,
    description="Sales data with transformations and aggregations",
    columns=[
        {"name": "category", "type": "string", "description": "Product category" },
        {"name": "region", "type": "string", "description": "Sales region" },
        {"name": "amount", "type": "float", "expression": "sum(amount)", "alias": "total_sales" },
        {"name": "quantity", "type": "integer", "expression": "avg(quantity)", "alias": "avg_quantity" }
    ],
    transformations=[
        {
            "type": "fill_na",
            "params": {"column": "amount", "value": 0}
        },
        {
            "type": "map_values",
            "params": {
                "column": "category",
                "mapping": {"A": "Premium", "B": "Standard", "C": "Basic"}
            }
        }
    ],
    group_by=["category", "region"]
)

        # source (dict, optional): A dictionary specifying the data source configuration.
        #     Required if `df` is not provided. The connector may include keys like 'type',
        #     'table', or 'view' to define the data source type and structure.
        # relations (dict, optional): A dictionary specifying relationships between tables
        #     when the dataset is created as a view. Each relationship should be defined
        #     using keys such as 'type', 'source', and 'target'.
        # view (bool, optional): If True, the dataset will be created as a view instead


Dataset saved successfully to path: my-org/sales-summary


   category  region  total_sales  avg_quantity
0  Standard   South        150.0          15.0
1     Basic   South         80.0           8.0
2   Premium   North        220.0          11.0
3   Premium    East         90.0           9.0
4  Standard    East        200.0          20.0
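
To make the semantics of the summary concrete, the following sketch reproduces the same numbers with plain SQL, using Python's built-in `sqlite3` module. It is an illustration under assumptions, not what `pandas-ai` runs internally: the `sales` table name, the use of `COALESCE` for `fill_na`, and the `CASE` expression for `map_values` are choices made for this example.

```python
import sqlite3

rows = [
    ("A", "North", 100, 10),
    ("B", "South", 150, 15),
    ("A", "North", 120, 12),
    ("C", "South", 80, 8),
    ("B", "East", 200, 20),
    ("A", "East", 90, 9),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (category TEXT, region TEXT, amount REAL, quantity INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

query = """
WITH cleaned AS (
    SELECT
        -- counterpart of the map_values transformation
        CASE category
            WHEN 'A' THEN 'Premium'
            WHEN 'B' THEN 'Standard'
            WHEN 'C' THEN 'Basic'
            ELSE category
        END AS category,
        region,
        -- counterpart of the fill_na transformation
        COALESCE(amount, 0) AS amount,
        quantity
    FROM sales
)
SELECT
    category,
    region,
    SUM(amount) AS total_sales,     -- expression "sum(amount)", alias "total_sales"
    AVG(quantity) AS avg_quantity   -- expression "avg(quantity)", alias "avg_quantity"
FROM cleaned
GROUP BY category, region
ORDER BY category, region
"""
summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```

The rows come back sorted by category and region, so their order differs from the saved dataset, but the five groups and their totals and averages match the output shown for `my-org/sales-summary`.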


### Comparing Chat Results

In [4]:
import os

from pandasai import Agent
from pandasai_litellm.litellm import LiteLLM

api_key = os.getenv("OPENAI_API_KEY", "your-api-key")
llm = LiteLLM(model="gpt-5-mini", api_key=api_key)
# Configure PandasAI using global config
pai.config.set({
    "llm": llm,
    "save_logs": True,
    "max_retries": 3,
    "verbose": True,
})

# Load the original dataset
original_sales_ds = pai.load("my-org/my-dataset")
original_agent = Agent([original_sales_ds])
response_original = original_agent.chat("What is the total sales amount?")
print("Chat with Original Dataset:", response_original)


Dataset loaded successfully.
2025-11-24 11:30:16 [INFO] Question: What is the total sales amount?
2025-11-24 11:30:17 [INFO] Running PandasAI with litellm LLM...
2025-11-24 11:30:17 [INFO] Prompt ID: a60e858b-33e0-4387-b026-674b795c6779
2025-11-24 11:30:17 [INFO] Generating new code...
2025-11-24 11:30:17 [INFO] Using Prompt: <tables>

<table dialect="duckdb" table_name="my_dataset" description="This is a sample dataset." columns="[{"name": "category", "type": "string", "description": "Product category", "expression": null, "alias": null}, {"name": "region", "type": "string", "description": "Sales region", "expression": null, "alias": null}, {"name": "amount", "type": "float", "description": "Sale amount", "expression": null, "alias": null}, {"name": "quantity", "type": "integer", "description": "Number of items sold", "expression": null, "alias": null}]" dimensions="6x4">
category,region,amount,quantity
A,North,100,10
B,South,150,15
A,North,120,12
C,South,80,8
B,East,200,20
</table>
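
In the logged prompts, the dataset saved at `my-org/my-dataset` appears with `table_name="my_dataset"`, and `my-org/sales-summary` appears as `sales_summary`. The helper below sketches that observed naming pattern. `table_name_from_path` is a hypothetical function written for illustration and inferred only from the logs in this notebook; the library's actual naming logic may differ.

```python
def table_name_from_path(path: str) -> str:
    """Derive a table name from an "organization/dataset" path.

    Mirrors the pattern seen in the logs: the organization part is dropped
    and hyphens in the dataset part become underscores.
    """
    org, _, dataset = path.partition("/")
    if not org or not dataset:
        raise ValueError(f"expected 'organization/dataset', got {path!r}")
    return dataset.replace("-", "_")


print(table_name_from_path("my-org/my-dataset"))     # prints my_dataset
print(table_name_from_path("my-org/sales-summary"))  # prints sales_summary
```

Knowing the table name can help when reading the generated SQL in the verbose logs, since queries refer to the table name rather than the dataset path.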



In [5]:
import os

from pandasai import Agent
from pandasai_litellm.litellm import LiteLLM

api_key = os.getenv("OPENAI_API_KEY", "your-api-key")
llm = LiteLLM(model="gpt-5-mini", api_key=api_key)
# Configure PandasAI using global config
pai.config.set({
    "llm": llm,
    "save_logs": True,
    "max_retries": 3,
    "verbose": True,
})

# Load the transformed dataset
transformed_sales_ds = pai.load("my-org/sales-summary")
transformed_agent = Agent([transformed_sales_ds])
response_transformed = transformed_agent.chat("What are the total sales for each category and region? If the column 'category' does not exist, consider a similar column or indicate the error.")
print("Chat with Transformed Dataset:", response_transformed)


Dataset loaded successfully.
2025-11-24 11:30:25 [INFO] Question: What are the total sales for each category and region? If the column 'category' does not exist, consider a similar column or indicate the error.
2025-11-24 11:30:25 [INFO] Running PandasAI with litellm LLM...
2025-11-24 11:30:25 [INFO] Prompt ID: 8e874e07-2e65-45a2-b68a-ba29e2dc81ab
2025-11-24 11:30:25 [INFO] Generating new code...
2025-11-24 11:30:25 [INFO] Using Prompt: <tables>

<table dialect="duckdb" table_name="sales_summary" description="Sales data with transformations and aggregations" columns="[{"name": "category", "type": "string", "description": "Product category", "expression": null, "alias": "category"}, {"name": "region", "type": "string", "description": "Sales region", "expression": null, "alias": "region"}, {"name": "amount", "type": "float", "description": null, "expression": "sum(amount)", "alias": "total_sales"}, {"name": "quantity", "type": "integer", "description": null, "expression": "avg(quantity)"

## 4. Dataset from an External Source

You can also create a 'virtual' dataset by pointing to an external data source, like a CSV file. This avoids loading the entire dataset into memory.

## 5. Dataset with Relations (View)

This example shows how to create a dataset as a 'view', which is a virtual table defined by relationships between other datasets.

In [7]:
# Create sample DataFrames for orders and customers
orders_data = {
    'order_id': [1, 2, 3, 4, 5],
    'customer_id': [101, 102, 101, 103, 102],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
    'quantity': [1, 2, 1, 1, 1],
    'order_date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19']
}
orders_df = pai.DataFrame(orders_data)

customers_data = {
    'customer_id': [101, 102, 103, 104],
    'name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince'],
    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'diana@example.com'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
customers_df = pai.DataFrame(customers_data)

print("Orders DataFrame:")
print(orders_df)
print("\nCustomers DataFrame:")
print(customers_df)


Orders DataFrame:
PandasAI DataFrame(name='table_d13d73c88da84d78b6236e0bc1d5a0d8')
   order_id  customer_id     product  quantity  order_date
0         1          101      Laptop         1  2024-01-15
1         2          102       Mouse         2  2024-01-16
2         3          101    Keyboard         1  2024-01-17
3         4          103     Monitor         1  2024-01-18
4         5          102  Headphones         1  2024-01-19

Customers DataFrame:
PandasAI DataFrame(name='table_4947a8753320ceae08b0ea9dfb2414f9')
   customer_id           name                email         city
0          101  Alice Johnson    alice@example.com     New York
1          102      Bob Smith      bob@example.com  Los Angeles
2          103  Charlie Brown  charlie@example.com      Chicago
3          104   Diana Prince    diana@example.com      Houston


In [None]:
# First, let's create the base datasets for orders and customers
pai.create(path="my-org/orders", df=orders_df, description="Customer orders")
pai.create(path="my-org/customers", df=customers_df, description="Customer information")

# Now, create a view that joins them
# Note: In views, columns must use the format "table_name.column_name"
# and relations use "from" and "to" fields with the same format
pai.create(
    path="my-org/customer-orders-view",
    view=True,
    relations=[
        {
            "from": "orders.customer_id",
            "to": "customers.customer_id"
        }
    ],
    columns=[
        {"name": "orders.order_id", "type": "integer" },
        {"name": "customers.name", "type": "string", "description": "Customer Name" },
        {"name": "orders.product", "type": "string" }
    ],
    description="A view of customer orders with customer names."
)

Dataset saved successfully to path: my-org/customer-orders-view
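
The relation from `orders.customer_id` to `customers.customer_id` describes a join between the two base datasets. The sketch below expresses a comparable join in plain SQL with Python's built-in `sqlite3` module, using the same sample rows. It is an illustration only: the choice of an inner join is an assumption, and the column aliases simply follow the `orders_order_id` style that appears in the logged prompt for this view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, product TEXT)")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 101, "Laptop"),
        (2, 102, "Mouse"),
        (3, 101, "Keyboard"),
        (4, 103, "Monitor"),
        (5, 102, "Headphones"),
    ],
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        (101, "Alice Johnson"),
        (102, "Bob Smith"),
        (103, "Charlie Brown"),
        (104, "Diana Prince"),
    ],
)

# Join on the relation's "from" and "to" columns and keep the three
# columns listed in the view definition.
view_rows = conn.execute(
    """
    SELECT
        orders.order_id AS orders_order_id,
        customers.name  AS customers_name,
        orders.product  AS orders_product
    FROM orders
    JOIN customers ON orders.customer_id = customers.customer_id
    ORDER BY orders.order_id
    """
).fetchall()
for row in view_rows:
    print(row)
```

With an inner join, Diana Prince (customer 104) does not appear because she has no orders, which leaves five rows, one per order.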


In [11]:
# Now let's use pai.chat to query the view
# First, we need to configure the LLM (if not already configured)
from pandasai_litellm.litellm import LiteLLM
import os

api_key = os.getenv("OPENAI_API_KEY", "your-api-key")
llm = LiteLLM(model="gpt-5-mini", api_key=api_key)
pai.config.set({
    "llm": llm,
    "save_logs": True,
    "max_retries": 3,
    "verbose": True,
})

# Load the view dataset
view_df = pai.load("my-org/customer-orders-view")

# Use pai.chat to ask questions about the joined data
print("Question 1: Who ordered a Laptop?")
response1 = pai.chat("Who ordered a Laptop?", view_df)
print(f"Answer: {response1}\n")

print("Question 2: What products did Alice Johnson order?")
response2 = pai.chat("What products did Alice Johnson order?", view_df)
print(f"Answer: {response2}\n")

print("Question 3: Show me all orders with customer names")
response3 = pai.chat("Show me all orders with customer names", view_df)
print(f"Answer:\n{response3}")


Dataset loaded successfully.
Question 1: Who ordered a Laptop?
2025-11-24 11:37:04 [INFO] Question: Who ordered a Laptop?
2025-11-24 11:37:04 [INFO] Running PandasAI with litellm LLM...
2025-11-24 11:37:04 [INFO] Prompt ID: e297c4a7-6c90-4162-a676-60b584ad1cac
2025-11-24 11:37:04 [INFO] Generating new code...
2025-11-24 11:37:04 [INFO] Using Prompt: <tables>

<table dialect="postgres" table_name="customer_orders_view" description="A view of customer orders with customer names." columns="[{"name": "orders.order_id", "type": "integer", "description": null, "expression": null, "alias": null}, {"name": "customers.name", "type": "string", "description": "Customer Name", "expression": null, "alias": null}, {"name": "orders.product", "type": "string", "description": null, "expression": null, "alias": null}]" dimensions="5x0">
orders_order_id,customers_name,orders_product
1,Alice Johnson,Laptop
2,Bob Smith,Mouse
3,Alice Johnson,Keyboard
4,Charlie Brown,Monitor
5,Bob Smith,Headphones
</table>
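
The chat cells in this notebook fall back to the placeholder string `"your-api-key"` when `OPENAI_API_KEY` is not set, which means a missing key only surfaces later as a failed LLM request. A fail-fast alternative is sketched below. `require_api_key` is a hypothetical helper for this notebook, not part of `pandas-ai` or LiteLLM, and the optional `env` argument exists only to make the function easy to test.

```python
import os
from typing import Mapping, Optional


def require_api_key(
    name: str = "OPENAI_API_KEY",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running the chat cells."
        )
    return value


# Example with an explicit mapping instead of the real environment.
print(require_api_key(env={"OPENAI_API_KEY": "sk-example"}))  # prints sk-example
```

In a chat cell, the returned value could be passed as `api_key` in place of the `os.getenv(...)` call with its placeholder default.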

