# Simplify the Transition from a Stacked Lakehouse (Spark, MinIO, StarRocks) to the Singdata Native Lakehouse

The table below compares setting up a Stacked Lakehouse with Docker Compose against deploying a Native Cloud Lakehouse on a fully managed cloud data infrastructure.

| Stacked Lakehouse | Native Cloud Lakehouse |
|-------------------|------------------------|
| Deploy object storage, Apache Spark, an Iceberg catalog, and StarRocks using Docker Compose | One-stop, all-in-cloud data infrastructure that includes object storage, optimized Iceberg, and Singdata Lakehouse for Spark and StarRocks workloads |
| Load New York City Green Taxi data for May 2023 into the Iceberg data lake | Load New York City Green Taxi data for May 2023 into a Singdata Lake Volume with RBAC |
| Configure StarRocks to access the Iceberg catalog | Singdata Lakehouse accesses the Lake Volume directly |
| Query the data with StarRocks where the data sits | Query the data with Singdata Lakehouse where the data sits, via SQL or the Zettapark Python API |


## Stacked Lakehouse Overview
This stack, taken from the [StarRocks Quick Start: Apache Iceberg Lakehouse](https://docs.starrocks.io/docs/quick_start/iceberg/), comprises MinIO for object storage, Apache Spark with PySpark for data processing, an Iceberg catalog for metadata management, and StarRocks for real-time analytics, all deployed with Docker Compose.

## Singdata Lakehouse Overview
[Singdata Lakehouse](https://www.singdata.com) provides unified management of data lake files and data warehouse tables through its abstract storage layer (Volume) and Python API. This guide demonstrates how to perform file management operations in the data lake, including uploading (PUT), downloading (GET), and listing (LIST) files.

Key Concepts:

- **Volume Storage Abstraction**: All data lake storage is mapped to Volume objects.
  - **External Volume**: Managed by customers, supporting integration with cloud storage like AWS S3 and Alibaba Cloud OSS.
  - **Internal Volume**: Managed by Singdata, divided into USER VOLUME and TABLE VOLUME.
- **Python API**: Provides a unified interface for seamless integration of files and tables.

In [1]:
# !pip install clickzetta_zettapark_python  -U -i https://pypi.tuna.tsinghua.edu.cn/simple

In [2]:
from clickzetta.zettapark.session import Session
import json,requests
import os
from datetime import datetime

## Import Libraries and Create a Session

In [3]:
import json

# Read the connection parameters from the configuration file
with open('config.json', 'r') as config_file:
    config = json.load(config_file)

print("Connecting to Lakehouse.....\n")

# Create the session
session = Session.builder.configs(config).create()

print("Connected and context as below...\n")

# print(session.sql("SELECT current_instance_id(), current_workspace(),current_workspace_id(), current_schema(), current_user(),current_user_id(), current_vcluster()").collect())

Connecting to Lakehouse.....

Connected and context as below...
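
The connection parameters come from `config.json`, so a typo or an empty field only surfaces when the session fails to connect. A minimal sketch of an early check follows. The key names in `REQUIRED_KEYS` and the helper `load_config` are assumptions made for illustration, not part of the original notebook, and should be matched to your own account settings.

```python
import json

# Assumed key names; check them against your own Singdata account settings.
REQUIRED_KEYS = ("username", "password", "service", "instance",
                 "workspace", "schema", "vcluster")


def load_config(path="config.json"):
    """Read the connection parameters and fail early if any are missing or empty."""
    with open(path, "r", encoding="utf-8") as config_file:
        config = json.load(config_file)
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"config is missing values for: {', '.join(missing)}")
    return config
```

The returned dictionary can be passed to `Session.builder.configs(...)` in the same way as the `config` variable in the cell above.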



## File Operations

Before starting, clean up the USER VOLUME to ensure a clean environment:

In [None]:
session.sql("REMOVE USER VOLUME SUBDIRECTORY '/'").show()

### List Files in USER VOLUME

In [None]:
session.sql("LIST USER VOLUME").show(10)

### Download the Dataset with curl

In [None]:
!mkdir -p data/parquet/
!curl -o data/parquet/green_tripdata_2023-05.parquet https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/iceberg/datasets/green_tripdata_2023-05.parquet
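
Without the `--fail` flag, `curl` saves whatever the server returns, so a failed download can leave an HTML error page under the `.parquet` name. The sketch below is one possible sanity check before uploading: it looks for the four-byte `PAR1` marker that Parquet files carry at both the start and the end. The helper name `looks_like_parquet` is hypothetical.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # Parquet files start and end with these four bytes


def looks_like_parquet(path):
    """Return True if the file has the Parquet magic bytes at both ends."""
    data_path = Path(path)
    if not data_path.is_file() or data_path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with data_path.open("rb") as handle:
        head = handle.read(len(PARQUET_MAGIC))
        handle.seek(-len(PARQUET_MAGIC), 2)  # 2 = seek relative to the end of the file
        tail = handle.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

For example, `looks_like_parquet("data/parquet/green_tripdata_2023-05.parquet")` should be true after a successful download. The check only inspects the markers and does not validate the file contents.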

### Upload Dataset to Singdata Lakehouse USER VOLUME

In [None]:
file_path = "data/parquet/green_tripdata_2023-05.parquet"
session.file.put(file_path,"volume:user://~/nyc/greentaxis/parquet/")
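
The upload target is a directory URI, while the later SQL queries refer to the file by its path relative to the USER VOLUME root. The sketch below shows how the two relate, assuming the upload keeps the local file name (which is consistent with the listing in the next step). The helper `volume_relative_path` is hypothetical and only manipulates strings.

```python
import posixpath

USER_VOLUME_PREFIX = "volume:user://~/"


def volume_relative_path(local_file, volume_dir):
    """Map a local file and its USER VOLUME target directory to the path shown by LIST."""
    if not volume_dir.startswith(USER_VOLUME_PREFIX):
        raise ValueError(f"expected a URI starting with {USER_VOLUME_PREFIX!r}")
    directory = volume_dir[len(USER_VOLUME_PREFIX):]
    # Assumption: the uploaded file keeps its local base name.
    return posixpath.join(directory, posixpath.basename(local_file))


print(volume_relative_path(
    "data/parquet/green_tripdata_2023-05.parquet",
    "volume:user://~/nyc/greentaxis/parquet/",
))
```

The printed value is the relative path used by the SQL queries in Option 1.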

### Verify Upload Results

In [4]:
session.sql("LIST USER VOLUME").show(100)

-------------------------------------------------------------------------------------------------------------------------------------------------
|relative_path                                       |url                                                 |size     |last_modified_time         |
-------------------------------------------------------------------------------------------------------------------------------------------------
|nyc/greentaxis/parquet/green_tripdata_2023-05.p...  |oss://cz-lh-sh-prod/123/workspaces/qiliang_ws_d...  |1673841  |2025-03-10 14:56:29+08:00  |
-------------------------------------------------------------------------------------------------------------------------------------------------



## Option 1: Query the File Directly with Singdata Lakehouse SQL

### Verify pickup datetime format

In [5]:
file_path = "nyc/greentaxis/parquet/green_tripdata_2023-05.parquet"

In [6]:
session.sql(f"SELECT lpep_pickup_datetime FROM user volume  using parquet files('{file_path}')").show(10)

-----------------------------
|lpep_pickup_datetime       |
-----------------------------
|2023-05-01 00:52:10+00:00  |
|2023-05-01 00:29:49+00:00  |
|2023-05-01 00:25:19+00:00  |
|2023-05-01 00:07:06+00:00  |
|2023-05-01 00:43:31+00:00  |
|2023-05-01 00:51:54+00:00  |
|2023-05-01 00:27:46+00:00  |
|2023-05-01 00:27:14+00:00  |
|2023-05-01 00:24:14+00:00  |
|2023-05-01 00:46:55+00:00  |
-----------------------------



### Find the busy hours
This query counts trips by hour of day and shows that the busiest hour is 18:00.

In [7]:
session.sql(f"SELECT COUNT(*) AS trips, hour(lpep_pickup_datetime) AS hour_of_day FROM user volume  using parquet files('{file_path}') GROUP BY hour_of_day ORDER BY trips DESC").show(24)

-----------------------
|trips  |hour_of_day  |
-----------------------
|5381   |18           |
|5253   |17           |
|5091   |16           |
|4736   |15           |
|4393   |14           |
|4275   |19           |
|3893   |12           |
|3816   |11           |
|3685   |13           |
|3616   |9            |
|3530   |10           |
|3361   |20           |
|3315   |8            |
|2917   |21           |
|2680   |7            |
|2322   |22           |
|1735   |23           |
|1202   |6            |
|1189   |0            |
|806    |1            |
|606    |2            |
|513    |3            |
|451    |5            |
|408    |4            |
-----------------------
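
To put the peak in context, the result rows can be summarized in plain Python. The sketch below uses the `(trips, hour_of_day)` pairs copied by hand from the output above; the helper `summarize_peak` is hypothetical and not part of the Zettapark API.

```python
# (trips, hour_of_day) pairs copied from the query output above
ROWS = [
    (5381, 18), (5253, 17), (5091, 16), (4736, 15), (4393, 14), (4275, 19),
    (3893, 12), (3816, 11), (3685, 13), (3616, 9), (3530, 10), (3361, 20),
    (3315, 8), (2917, 21), (2680, 7), (2322, 22), (1735, 23), (1202, 6),
    (1189, 0), (806, 1), (606, 2), (513, 3), (451, 5), (408, 4),
]


def summarize_peak(rows):
    """Return (peak_hour, peak_trips, total_trips) for (trips, hour) pairs."""
    total_trips = sum(trips for trips, _ in rows)
    peak_trips, peak_hour = max(rows)  # tuples compare on trips first
    return peak_hour, peak_trips, total_trips


peak_hour, peak_trips, total_trips = summarize_peak(ROWS)
print(f"{peak_hour}:00 has {peak_trips} of {total_trips} trips "
      f"({peak_trips / total_trips:.1%})")
```

The same function would work on rows fetched from the session, provided they are first converted to `(trips, hour_of_day)` tuples.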



## Option 2: Query the File with the Singdata Lakehouse Zettapark Python API

In [8]:
from clickzetta.zettapark.functions import hour, count, col, from_utc_timestamp
from clickzetta.zettapark.types import StructType, StructField, LongType, StringType, TimestampType, DoubleType
schema = StructType([StructField("lpep_pickup_datetime", TimestampType()), StructField("total_amount", DoubleType())])
path = "volume:user://~/nyc/greentaxis/parquet/green_tripdata_2023-05.parquet"

### Verify pickup datetime format

In [9]:
parquet_loaded = session.read.option("FORMAT_NAME", "parquet").schema(schema).parquet(path)

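# The session displays timestamps in UTC+8. Shifting by the Pacific/Pitcairn
# offset (UTC-8) brings the wall-clock time back to the value stored in the file.
parquet_converted = parquet_loaded.withColumn(
    "lpep_pickup_utc",
    from_utc_timestamp(col("lpep_pickup_datetime"), "Pacific/Pitcairn")
)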

parquet_converted.show()

------------------------------------------------------------------------
|lpep_pickup_datetime       |total_amount  |lpep_pickup_utc            |
------------------------------------------------------------------------
|2023-05-01 08:52:10+08:00  |31.4          |2023-05-01 00:52:10+08:00  |
|2023-05-01 08:29:49+08:00  |40.55         |2023-05-01 00:29:49+08:00  |
|2023-05-01 08:25:19+08:00  |14.16         |2023-05-01 00:25:19+08:00  |
|2023-05-01 08:07:06+08:00  |32.57         |2023-05-01 00:07:06+08:00  |
|2023-05-01 08:43:31+08:00  |9.0           |2023-05-01 00:43:31+08:00  |
|2023-05-01 08:51:54+08:00  |12.5          |2023-05-01 00:51:54+08:00  |
|2023-05-01 08:27:46+08:00  |51.25         |2023-05-01 00:27:46+08:00  |
|2023-05-01 08:27:14+08:00  |20.4          |2023-05-01 00:27:14+08:00  |
|2023-05-01 08:24:14+08:00  |20.88         |2023-05-01 00:24:14+08:00  |
|2023-05-01 08:46:55+08:00  |21.72         |2023-05-01 00:46:55+08:00  |
------------------------------------------------------------------------
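
The two timestamp columns differ by exactly eight hours. The sketch below reproduces that arithmetic with the standard library, assuming `from_utc_timestamp` behaves like the Spark function of the same name and moves the instant by the target zone's UTC offset. Fixed offsets stand in for the named time zones, and the helper `shift_like_from_utc_timestamp` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

SESSION_TZ = timezone(timedelta(hours=8))  # display zone seen in the output (+08:00)
PITCAIRN_OFFSET = timedelta(hours=-8)      # Pacific/Pitcairn is UTC-8 all year


def shift_like_from_utc_timestamp(instant, offset):
    """Move the instant by the target zone's UTC offset (assumed Spark-like behavior)."""
    return instant + offset


# First pickup time in the file, as shown by the SQL query in Option 1
stored = datetime(2023, 5, 1, 0, 52, 10, tzinfo=timezone.utc)
displayed = stored.astimezone(SESSION_TZ)
shifted = shift_like_from_utc_timestamp(stored, PITCAIRN_OFFSET).astimezone(SESSION_TZ)

print(displayed.isoformat(sep=" "))  # 2023-05-01 08:52:10+08:00
print(shifted.isoformat(sep=" "))    # 2023-05-01 00:52:10+08:00
```

The shift cancels the session's UTC+8 display offset, so the hour extracted later matches the hour stored in the file. A session in a different time zone would need a different compensating zone.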

### Find the busy hours
This aggregation counts trips by hour of day and shows that the busiest hour is 18:00.

In [10]:
# Load the data and adjust for the session's display time zone
greentaxis_df = session.read.option("FORMAT_NAME", "parquet").schema(schema).parquet(path)
result_df = (greentaxis_df
    # Shift by the Pacific/Pitcairn offset (UTC-8) so the wall-clock time matches the value stored in the file
    .withColumn("ltz_time", from_utc_timestamp(col("lpep_pickup_datetime"), "Pacific/Pitcairn"))
    # Extract the hour from the shifted time (use a different time zone if your session's display zone differs)
    .withColumn("hour_of_day", hour("ltz_time"))
    .groupBy("hour_of_day")
    .agg(count("*").alias("trips"))
    .orderBy("trips", ascending=False)
)

# Adjust column order (optional)
result_df = result_df.select("trips", "hour_of_day")
result_df.show(24)


-----------------------
|trips  |hour_of_day  |
-----------------------
|5381   |18           |
|5253   |17           |
|5091   |16           |
|4736   |15           |
|4393   |14           |
|4275   |19           |
|3893   |12           |
|3816   |11           |
|3685   |13           |
|3616   |9            |
|3530   |10           |
|3361   |20           |
|3315   |8            |
|2917   |21           |
|2680   |7            |
|2322   |22           |
|1735   |23           |
|1202   |6            |
|1189   |0            |
|806    |1            |
|606    |2            |
|513    |3            |
|451    |5            |
|408    |4            |
-----------------------



## Close the Session

In [11]:
session.close()
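
If a query fails partway through the notebook, the final `session.close()` cell may never run. One way to guard against that is `contextlib.closing`, which calls `close()` on exit even when an exception is raised. The sketch below uses a stand-in `FakeSession` class so it runs without a Lakehouse connection; both `FakeSession` and `run_with_session` are hypothetical names.

```python
from contextlib import closing


class FakeSession:
    """Hypothetical stand-in for a Zettapark session; only close() is modeled."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def run_with_session(session, work):
    """Run work(session) and close the session even if work raises."""
    with closing(session):
        return work(session)


fake = FakeSession()
try:
    run_with_session(fake, lambda s: 1 / 0)  # simulate a failing query
except ZeroDivisionError:
    pass
print(fake.closed)
```

With a real session, `work` would be a function that runs the queries from the earlier sections.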

## Summary
This guide demonstrated the advantages of the Native Cloud Lakehouse:

- **Integrated Cloud Infrastructure**: Provides a fully managed, all-in-one cloud platform with pre-optimized components (Object Storage, Iceberg, Singdata Lakehouse), removing the need to orchestrate multiple tools while staying compatible with Spark and StarRocks workloads.
- **Enhanced Data Governance**: Supports Role-Based Access Control (RBAC) for secure data management within the Singdata Lake Volume, enabling granular permissions and compliance with enterprise security policies.
- **Direct Data Query**: Allows Singdata Lakehouse to query data directly from the Lake Volume without intermediate layers, reducing latency and simplifying the architecture.
- **Flexible Query Capabilities**: Enables unified data access via SQL or the Zettapark Python API, serving diverse analytical workflows while keeping computation close to the data.
- **Optimized Storage and Scalability**: Uses Singdata optimizations for Iceberg and scalable object storage for cost-efficiency and adaptability to large datasets.
