# Segmentation in Pinot

## Introduction

Table contents in Pinot are expected to grow without bound and thus need to be distributed across multiple nodes. Therefore, each table's data is split into segments, which are comparable to shards or partitions in classical RDBMSs. In Pinot, segmentation is time-based, meaning that the values of the configured time column for the records in a given segment are close to each other.
Segments store all columns of a table and organize data in columnar orientation for high encoding efficiency and optional pre-aggregation of metrics. In addition to the data itself, segments contain indices and other lookup-related data structures like dictionaries.
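
To make the idea of time-based, columnar segments more concrete, here is a minimal sketch in plain Python. It is not how Pinot builds segments internally: `SegmentSketch` and `build_segments` are hypothetical names, and sorting by time is a simplification (a realtime table consumes rows in stream order).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SegmentSketch:
    """Hypothetical stand-in for a segment: rows stored column by column."""
    columns: Dict[str, list]
    start_time: int
    end_time: int


def build_segments(records: List[dict], time_column: str, max_rows: int) -> List[SegmentSketch]:
    """Cut time-ordered records into chunks of at most max_rows rows
    and pivot each chunk from rows into columns."""
    ordered = sorted(records, key=lambda r: r[time_column])
    segments = []
    for i in range(0, len(ordered), max_rows):
        chunk = ordered[i:i + max_rows]
        # columnar orientation: one list of values per column
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        times = columns[time_column]
        segments.append(SegmentSketch(columns, min(times), max(times)))
    return segments


# made-up records, only for illustration
example_records = [
    {"trip_start_time_millis": 3000, "driver_name": "c"},
    {"trip_start_time_millis": 1000, "driver_name": "a"},
    {"trip_start_time_millis": 5000, "driver_name": "e"},
    {"trip_start_time_millis": 2000, "driver_name": "b"},
    {"trip_start_time_millis": 4000, "driver_name": "d"},
]
example_segments = build_segments(example_records, "trip_start_time_millis", max_rows=2)
for seg in example_segments:
    print(seg.start_time, seg.end_time, seg.columns["driver_name"])
```

Each resulting segment covers a narrow, non-overlapping time range, which is the property the broker later exploits when deciding which segments a query has to touch.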

As Pinot is not a general-purpose database (data is immutable), it cannot be used as an application's "main datastore". Like other OLAP stores, Pinot is meant to run next to the application's "main datastore", and its data has to be imported separately (ingestion). To facilitate near-realtime analytical queries, such as those powering LinkedIn's well-known "Who viewed my profile" functionality, data is typically ingested into Pinot via event streaming platforms like Apache Kafka (stream ingestion). In contrast to classical RDBMSs, Pinot comes with built-in support for reading directly from Kafka event streams.
However, data can also be ingested from traditional batch processing workflows, for example ones implemented with Apache Hadoop or Apache Spark (batch ingestion).

Pinot tables are defined as either realtime or offline tables. Tables of both types are broken into segments. For realtime tables, Pinot servers consume data directly from event streams as-is, without any additional processing. Segments are built inside Pinot and are completed once a given size or time threshold is reached. Segments for offline tables are built outside of Pinot in batch processing jobs, which might perform additional data deduplication or similar processing, and are then uploaded to the Pinot controller. Both table types can be combined to form hybrid tables, which allow both realtime analytics and long-term data storage (covered later on).

## Realtime Data Ingestion

To demonstrate how segments work in Pinot, we're going to focus on realtime data ingestion first. In the following examples, we'll be using the controller's and broker's REST APIs in order to dynamically create realtime tables, retrieve segment metadata and execute SQL queries.

In [None]:
# all imports
import copy
import requests
import json
import io
import re
import os
import shutil
import fileinput
import tarfile
import time
import pandas as pd

In [None]:
# some helpers for the upcoming examples
def server_name_from_instance(instance):
    return re.search('pinot-server-[0-9]+', instance).group()

def query_sql(query):
    print("query: " + query)
    return requests.get('http://pinot-broker.pinot:8099/query/sql', params={
        "sql" : query,
        "trace": "true"
    }).json()

def query_result_to_dataframe(result):
    return pd.DataFrame(columns=result['resultTable']['dataSchema']['columnNames'], data=result['resultTable']['rows'])

def extract_query_statistics_from_result(result):
    query_statistics_fields = ["numServersQueried","numServersResponded","numSegmentsQueried","numSegmentsProcessed","numSegmentsMatched","numConsumingSegmentsQueried","numDocsScanned","numEntriesScannedInFilter","numEntriesScannedPostFilter","numGroupsLimitReached","totalDocs","timeUsedMs"]
    return { key: result[key] for key in query_statistics_fields }

def extract_query_statistics_from_result_dataframe(result):
    return pd.DataFrame({"value": extract_query_statistics_from_result(result)})

ordinal_pattern = re.compile(r'__[0-9]+__([0-9]+)__')
def sort_by_ascending_ordinal(segments):
    segments.sort(key=lambda L: (int(ordinal_pattern.search(L).group(1)), L))

def segment_metadata_for_table(table):
    segments = requests.get(f'http://pinot-controller.pinot:9000/segments/{table}').json()
    
    segment_metadata = {}
    for segments_item in segments:
        for table_type, type_segments in segments_item.items():
            for segment in type_segments:
                segment_type_name = f"{segment}_{table_type}"
                segment_metadata[segment_type_name] = requests.get(f'http://pinot-controller.pinot:9000/segments/{table}/{segment}/metadata').json()
    
    return segment_metadata

def segment_metadata_of_nth_segment(segment_metadata, n, table_type="REALTIME"):
    segments_of_type = []
    for segment in segment_metadata.keys():
        if segment.endswith("_" + table_type):
            segments_of_type.append(segment)
    
    sort_by_ascending_ordinal(segments_of_type)
    return segment_metadata[segments_of_type[n]]


def start_time_of_nth_segment(segment_metadata, n, table_type="REALTIME"):
    return segment_metadata_of_nth_segment(segment_metadata, n, table_type)["segment.start.time"]

def wait_for_table_to_finish_loading(table, wait_time=15):
    last_total_docs = -1
    while True:
        response = requests.post('http://pinot-broker.pinot:8099/query/sql', json={"sql" : f"SELECT * FROM {table} LIMIT 1"}).json()
        total_docs = response["totalDocs"]
        if total_docs == last_total_docs:
            print(f"--Consumption of generated data for table {table} finished, (loaded {last_total_docs} docs)--")
            break
        
        last_total_docs = total_docs
        print(f"waiting for table {table} to finish loading (loaded {last_total_docs} docs)")
        time.sleep(wait_time)

First, we will create two realtime tables. Both will use the `trips` schema created above and read from the `trips` topic in Kafka, which was also created and filled with random records above.

In [None]:
# common configuration used for both tables
table_config_template = {
  "tableName": "",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "trip_start_time_millis",
    "timeType": "MILLISECONDS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "60",
    "schemaName": "trips",
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": [
        "rider_name",
        "driver_name",
        "start_location",
        "end_location"
    ],
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "trips",
      "stream.kafka.consumer.type": "simple",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.zk.broker.url": "pinot-kafka-zookeeper:2181",
      "stream.kafka.broker.list": "pinot-kafka:9092",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}

Pinot servers continuously read from the Kafka topic into memory and build up a segment until a configured threshold is reached. The first table is configured to flush the in-memory segment to disk once either 12 hours have passed or the segment contains 80,000 rows (which will be the case in our example, as the data is already waiting in the Kafka stream).

In [None]:
# create first table
table_config = copy.deepcopy(table_config_template)
table_config["tableName"] = "trips_segmentation_1"
table_config["segmentsConfig"]["replication"] = "1"
table_config["segmentsConfig"]["replicasPerPartition"] = "1"
table_config["tableIndexConfig"]["streamConfigs"]["realtime.segment.flush.threshold.time"] = "12h"
table_config["tableIndexConfig"]["streamConfigs"]["realtime.segment.flush.threshold.size"] = "80000"
display(requests.post('http://pinot-controller.pinot:9000/tables', json=table_config).json())
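
The two flush thresholds configured above can be thought of as a simple "whichever comes first" rule. The following sketch is only an illustration under simplifying assumptions (a single stream partition, no commit protocol); `should_flush` and `expected_segment_row_counts` are hypothetical helpers, not Pinot functions, and the total of 200,000 rows is a made-up number.

```python
def should_flush(rows_in_segment, seconds_consuming,
                 max_rows=80_000, max_seconds=12 * 60 * 60):
    """Return True once either the row or the time threshold is reached."""
    return rows_in_segment >= max_rows or seconds_consuming >= max_seconds


def expected_segment_row_counts(total_rows, max_rows):
    """Row counts per segment if only the row threshold ever triggers.
    A trailing partial entry corresponds to the still-consuming segment."""
    full, rest = divmod(total_rows, max_rows)
    return [max_rows] * full + ([rest] if rest else [])


print(should_flush(80_000, 10))                       # row threshold reached
print(should_flush(10, 12 * 60 * 60))                 # time threshold reached
print(expected_segment_row_counts(200_000, 80_000))   # hypothetical topic size
```

With a lower row threshold, the same amount of data ends up in more, smaller segments, which is what the second table is set up to show.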

In contrast to the first table, the second one will target a segment size of 50,000 rows and will additionally create 3 replicas of each segment on different server instances for data availability (fault tolerance) and load distribution of queries.

In [None]:
# create second table
table_config = copy.deepcopy(table_config_template)
table_config["tableName"] = "trips_segmentation_2"
table_config["segmentsConfig"]["replication"] = "3"
table_config["segmentsConfig"]["replicasPerPartition"] = "3"
table_config["tableIndexConfig"]["streamConfigs"]["realtime.segment.flush.threshold.time"] = "12h"
table_config["tableIndexConfig"]["streamConfigs"]["realtime.segment.flush.threshold.size"] = "50000"
display(requests.post('http://pinot-controller.pinot:9000/tables', json=table_config).json())

Let's wait for the tables to finish loading the data from Kafka:

In [None]:
wait_for_table_to_finish_loading("trips_segmentation_1")
wait_for_table_to_finish_loading("trips_segmentation_2")

The controller stores metadata for each segment, which can be viewed via its REST API. Each segment's metadata contains general information such as the table type, table name and time unit as well as segment-specific information such as the number of records (`segment.total.docs`), the timestamp of the segment's first and last record (`segment.start.time`, `segment.end.time`) and the segment's status (`segment.realtime.status`).
New realtime segments start in status `IN_PROGRESS`, which means that the segment is currently consuming data from the Kafka topic. Once the size or time threshold is reached, the consuming servers start a segment commit protocol in order to agree on the last record that shall be included in the segment. Once the commit protocol is completed, the segment transitions to `DONE` and the servers flush the data to disk. Afterwards, a new segment is started again to consume further data from the event stream.
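
As a small illustration of the segment lifecycle, the sketch below counts segments per status. It assumes metadata in the shape returned by the `segment_metadata_for_table` helper (a dict of segment name to metadata dict); `count_segments_by_status` is a hypothetical helper and the segment names and values in the example are invented.

```python
from collections import Counter


def count_segments_by_status(segment_metadata):
    """Count segments per `segment.realtime.status` value."""
    statuses = (meta.get("segment.realtime.status", "UNKNOWN")
                for meta in segment_metadata.values())
    return dict(Counter(statuses))


# invented example data, mimicking the structure of the controller's metadata
example_metadata = {
    "example__0__0__T0_REALTIME": {"segment.realtime.status": "DONE", "segment.total.docs": "80000"},
    "example__0__1__T1_REALTIME": {"segment.realtime.status": "DONE", "segment.total.docs": "80000"},
    "example__0__2__T2_REALTIME": {"segment.realtime.status": "IN_PROGRESS", "segment.total.docs": "-1"},
}
status_counts = count_segments_by_status(example_metadata)
print(status_counts)
```

For a stream partition that is being consumed normally, one would expect exactly one segment in status `IN_PROGRESS` and all older ones in status `DONE`.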

We can now query the controller's REST API to retrieve metadata for all segments in both our tables.
The first table contains fewer segments, but each segment contains a higher number of records.

In [None]:
segment_metadata_1 = segment_metadata_for_table("trips_segmentation_1")
pd.DataFrame(segment_metadata_1)

The segment metadata for the second table shows more segments. Each of them has a lower number of total records and 3 replicas (`segment.realtime.numReplicas`).

In [None]:
segment_metadata_2 = segment_metadata_for_table("trips_segmentation_2")
pd.DataFrame(segment_metadata_2)

Pinot brokers are responsible for executing queries against the database. When a broker receives a new query, it sends multiple subqueries to Pinot servers that are hosting the segments belonging to the queried table. Once it has received results from all queried servers, it merges the subresults and returns the aggregated result to the client.
To execute queries efficiently, brokers use segment metadata to figure out which segments need to be queried. For example, if we want to list the top 5 drivers by trip count in a given time frame, only the segments holding data from that time frame need to be queried.

To demonstrate this behaviour, we call the broker's REST API and query data from the time range of the first segment (before the start time of the second segment). In the returned query statistics, we can see that not all segments of the table (`numSegmentsQueried`) are actually processed, but only 2 of them (`numSegmentsMatched`). There are 2 rather than 1 because the last (consuming) segment is always queried: its metadata is not yet complete, so the broker cannot tell up front whether it might contain relevant data.

In [None]:
# get data from first segment (consuming segment is always queried because of uncompleted metadata)
query_for_trips_segmentation_1 = f"""
    SELECT driver_name, sum(count) AS trips_count
    FROM trips_segmentation_1
    WHERE trip_start_time_millis BETWEEN {start_time_of_nth_segment(segment_metadata_1, 0)} AND {int(start_time_of_nth_segment(segment_metadata_1, 1))-1}
    GROUP BY driver_name
    ORDER BY trips_count desc
    LIMIT 5"""

query_result = query_sql(query_for_trips_segmentation_1)
display(query_result_to_dataframe(query_result))
display(extract_query_statistics_from_result_dataframe(query_result))
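
The pruning idea behind these statistics can be sketched in a few lines. This is a simplified model, not the broker's actual implementation: `prune_segments_by_time` is a hypothetical function that works on metadata shaped like the controller's response, keeps every completed segment whose time range overlaps the query range, and always keeps consuming segments.

```python
def prune_segments_by_time(segment_metadata, query_start, query_end):
    """Return the names of segments that may contain rows in [query_start, query_end]."""
    selected = []
    for name, meta in segment_metadata.items():
        if meta.get("segment.realtime.status") == "IN_PROGRESS":
            # time range of a consuming segment is not final yet: always query it
            selected.append(name)
            continue
        start = int(meta["segment.start.time"])
        end = int(meta["segment.end.time"])
        # two closed intervals overlap if each starts before the other ends
        if start <= query_end and end >= query_start:
            selected.append(name)
    return selected


# invented example metadata with tiny time values
example_segments_meta = {
    "seg_0": {"segment.realtime.status": "DONE", "segment.start.time": "0", "segment.end.time": "99"},
    "seg_1": {"segment.realtime.status": "DONE", "segment.start.time": "100", "segment.end.time": "199"},
    "seg_2": {"segment.realtime.status": "IN_PROGRESS", "segment.start.time": "200", "segment.end.time": "-1"},
}
pruned = prune_segments_by_time(example_segments_meta, query_start=0, query_end=99)
print(pruned)
```

A query restricted to the first segment's time range selects that segment plus the consuming one, which mirrors the count of 2 seen in the statistics above.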

The second query targets the second table and lists the top 5 drivers by average rating over the time range of the first 3 segments.
Similar to the query above, only the relevant segments need to be processed for this query.
However, in contrast to the first query execution, the broker can make use of the segment replication and can distribute the subqueries for individual segments across different servers (note that `numServersQueried` is now 3 instead of 1).

In [None]:
# get data from first 3 segments (consuming segment is always queried because of uncompleted metadata)
query_for_trips_segmentation_2 = f"""
    SELECT driver_name, avg(driver_rating) AS rating
    FROM trips_segmentation_2
    WHERE trip_start_time_millis BETWEEN {start_time_of_nth_segment(segment_metadata_2, 0)} AND {int(start_time_of_nth_segment(segment_metadata_2, 3))-1}
    GROUP BY driver_name
    ORDER BY rating desc
    LIMIT 5"""

query_result = query_sql(query_for_trips_segmentation_2)
display(query_result_to_dataframe(query_result))
display(extract_query_statistics_from_result_dataframe(query_result))

## Query Routing

To distribute queries efficiently across the fleet of servers, brokers maintain so-called routing tables, which map the segments of a table to the servers hosting them.
In the case of replicated segments (as in the second table), the routing table contains entries for all servers hosting a given segment. When queries arrive at the broker, the routing tables and segment metadata allow it to scatter queries across servers and thereby balance load across the cluster.
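
The following sketch shows one conceivable way to pick a replica per segment for a single query. It is a simplified round-robin illustration and should not be read as the selection strategy Pinot actually implements; `build_routing_table` and the segment names are hypothetical, while the server names follow the `pinot-server-N` pattern used by the helpers in this notebook.

```python
from collections import defaultdict


def build_routing_table(segment_to_servers, request_id):
    """Pick one hosting server per segment, rotating with the request id.
    Returns a dict of server -> list of segments to query on that server."""
    routing = defaultdict(list)
    for i, (segment, servers) in enumerate(sorted(segment_to_servers.items())):
        server = servers[(request_id + i) % len(servers)]
        routing[server].append(segment)
    return dict(routing)


# invented replica layout: every segment is hosted on all three servers
all_servers = ["pinot-server-0", "pinot-server-1", "pinot-server-2"]
example_replicas = {"seg_0": all_servers, "seg_1": all_servers, "seg_2": all_servers}

print(build_routing_table(example_replicas, request_id=0))
print(build_routing_table(example_replicas, request_id=1))
```

Because the choice depends on the request id, two executions of the same query can be routed differently, and with an unreplicated table (one server per segment) the routing is fixed.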

In [None]:
# some helpers for the upcoming examples
def routing_table_for_query(query):
    print("query: " + query)
    return requests.get('http://pinot-broker.pinot:8099/debug/routingTable/sql', params={
        "query" : query
    }).json()

def routing_table_for_table(table):
    return requests.get(f'http://pinot-broker.pinot:8099/debug/routingTable/{table}').json()

def external_view_for_table(table):
    return requests.get(f'http://pinot-controller.pinot:9000/tables/{table}/externalview').json()

def routing_table_for_query_dataframe(query):
    rt = routing_table_for_query(query)
    rt_data = {}

    for server, server_segments in rt.items():
        server_name = server_name_from_instance(server)
        for s in server_segments:
            rt_data[s] = server_name

    rt_data_list = []
    for segment, server in rt_data.items():
        rt_data_list.append({"segment": segment, "server": server})

    # use the segment name as tie-breaker; dicts themselves cannot be compared
    rt_data_list.sort(key=lambda L: (int(ordinal_pattern.search(L["segment"]).group(1)), L["segment"]))
    return pd.DataFrame(rt_data_list)

def routing_table_for_table_dataframe(table):
    rt = routing_table_for_table(table)
    rt_data = {}

    for table_name_type, table_rt in rt.items():
        table_type = re.search('REALTIME|OFFLINE', table_name_type).group()
        for server, server_segments in table_rt.items():
            server_name = server_name_from_instance(server)
            for s in server_segments:
                try:
                    rt_data[s][table_type] = server_name
                except KeyError:
                    rt_data[s] = {table_type: server_name}

    rt_data_list = []
    for segment, type_server in rt_data.items():
        segment_data = {"segment": segment}
        for table_type, server in type_server.items():
            segment_data[table_type] = server
        rt_data_list.append(segment_data)

    # use the segment name as tie-breaker; dicts themselves cannot be compared
    rt_data_list.sort(key=lambda L: (int(ordinal_pattern.search(L["segment"]).group(1)), L["segment"]))
    return pd.DataFrame(rt_data_list)

def external_view_for_table_dataframe(table):
    ev = external_view_for_table(table)
    ev_data = {}

    for table_type, ev_per_type in ev.items():
        if ev_per_type is None:
            continue
        
        for segment, segment_servers in ev_per_type.items():
            if segment not in ev_data:
                ev_data[segment] = {}
            for server, state in segment_servers.items():
                server_name = server_name_from_instance(server)
                try:
                    ev_data[segment][table_type].append(server_name)
                except KeyError:
                    ev_data[segment][table_type] = [server_name]

    return pd.DataFrame(ev_data).transpose()

First, let's take a look at the external view for both tables. The external view shows which segments are available on which servers. In the case of the first table, each segment is only available on a single server. The second table has a replica of each segment on every server.

In [None]:
display(external_view_for_table_dataframe("trips_segmentation_1"))
display(external_view_for_table_dataframe("trips_segmentation_2"))

We can use the broker's debug endpoint to retrieve a routing table for a specific SQL query. This can be seen as a query execution plan for segments distributed across multiple servers. Similar to calculating an efficient query execution plan in classical RDBMSs, Pinot takes a look at metadata, statistics and server associations.
The routing table might change every time an identical query is executed, as brokers try to distribute compute load across the servers hosting the same segment.

In [None]:
routing_table_for_query_dataframe(query_for_trips_segmentation_1.replace("trips_segmentation_1", "trips_segmentation_1_REALTIME"))

For the second query, the routing table shows that the broker tries to distribute load equally between all servers, as the segments are replicated.

In [None]:
routing_table_for_query_dataframe(query_for_trips_segmentation_2.replace("trips_segmentation_2", "trips_segmentation_2_REALTIME"))

## Advanced Configuration

The presented tables are rather simple and just demonstrate the basic mechanisms of segmentation, replication and query routing in Pinot. However, Pinot offers much more advanced configuration options for tweaking segment replication, availability and placement in large-scale Pinot clusters.

For example, Pinot servers can be grouped into so-called "replica groups", which can be spread across different availability zones. Segment replicas are then assigned to servers in different replica groups in order to achieve high-availability setups. Furthermore, segments can be partitioned based on column values to further increase query performance by decreasing the number of segments that need to be processed for a given query. This is very similar to partitioning/sharding in typical RDBMSs.
Additionally, servers can be assigned to different tenants for sharing a cluster across teams or grouped into server-pools to achieve no-downtime rolling restarts of large clusters.

All of these options show that segmentation in Pinot is, in its simplest aspects, quite comparable to the sharding mechanisms of other database systems, but also considerably more advanced in order to support large-scale analytical use cases while maintaining high performance.

# Batch Ingestion and Hybrid Tables

As mentioned earlier, Pinot also supports ingesting data from batch processing jobs. For offline tables, the same principles apply as for realtime tables with regard to segmentation and query routing.
However, segments are compiled and packaged outside Pinot. For this purpose, Pinot offers different mechanisms to load pre-built segments from object stores (such as S3) or HDFS, or to build new segments using Hadoop and/or Spark.
Segments are packaged as gzipped tar-archives (including data, index maps, column statistics) and can be uploaded to and downloaded from the controller.
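
To get a feeling for what such an archive looks like, the sketch below builds a tiny gzipped tar in memory and lists its contents. The layout `<segment name>/v3/metadata.properties` mirrors the path used by the helper functions later in this notebook; the segment name and the file contents are invented, and a real segment contains further files for data and indices.

```python
import io
import tarfile


def make_example_segment_tar(segment_name, metadata_text):
    """Create an in-memory .tar.gz that contains only a metadata.properties file."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        data = metadata_text.encode("utf-8")
        info = tarfile.TarInfo(name=f"{segment_name}/v3/metadata.properties")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


def list_segment_tar(fileobj):
    """Return the member names of a gzipped segment tar."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        return tar.getnames()


example_tar = make_example_segment_tar("example_segment", "some.key = some value\n")
tar_members = list_segment_tar(example_tar)
print(tar_members)
```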

While offline tables can be used standalone similar to the realtime tables presented above, a more interesting option is to combine an offline and a realtime table to form a hybrid table.
Hybrid tables consist of two individual tables, one offline table and one realtime table, which share the same name, schema and, most importantly, time column. The hybrid table can be queried just like any other table, but the broker transparently rewrites queries to fetch older records from the offline table and newer records from the realtime table.
This makes it possible to process, deduplicate and sanitize records before pushing them to long-term storage. This is a key differentiator between Pinot and other databases and OLAP stores: it allows Pinot to achieve high-throughput ingestion and low-latency realtime analytics while still allowing data to be backfilled via batch processing.
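
The query rewriting can be pictured as splitting one query at a time boundary. The sketch below is a deliberately simplified model: `split_hybrid_query` is a hypothetical function, the boundary value is made up, and the way Pinot actually derives the boundary from the offline segments is not reproduced here. The `_OFFLINE` and `_REALTIME` suffixes are the type-specific table names also used elsewhere in this notebook.

```python
def split_hybrid_query(select_clause, table, time_column, time_boundary, suffix=""):
    """Split a query on a hybrid table into an offline part (up to the boundary)
    and a realtime part (after the boundary)."""
    offline_query = (f"{select_clause} FROM {table}_OFFLINE "
                     f"WHERE {time_column} <= {time_boundary} {suffix}").strip()
    realtime_query = (f"{select_clause} FROM {table}_REALTIME "
                      f"WHERE {time_column} > {time_boundary} {suffix}").strip()
    return offline_query, realtime_query


offline_query, realtime_query = split_hybrid_query(
    "SELECT count(*)", "trips_hybrid", "trip_start_time_millis",
    time_boundary=1600000000000,  # made-up boundary, for illustration only
)
print(offline_query)
print(realtime_query)
```

Since the two time filters do not overlap, a record that exists in both tables is only read from one of them in this model.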

Since version `0.6.0`, Pinot also offers a mechanism to regularly move records from a realtime table to the corresponding offline table. To configure this, the user schedules a task that is executed on a minion instance, for example once every day. The task then takes care of downloading, transforming, aggregating, sorting and uploading segments.
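
As a rough sketch, such a task is enabled through an additional `task` section in the realtime table's configuration. The key names below (`taskTypeConfigsMap`, `RealtimeToOfflineSegmentsTask`, `bucketTimePeriod`, `bufferTimePeriod`) are given as recalled from the Pinot documentation and should be verified against the docs of the version in use; the helper `with_realtime_to_offline_task` and the period values are assumptions made for illustration. The config is only built here, not sent to the controller.

```python
import copy


def with_realtime_to_offline_task(table_config, bucket_period="1d", buffer_period="1d"):
    """Return a copy of a realtime table config with a task section added.
    Key names are assumptions to be checked against the Pinot documentation."""
    config = copy.deepcopy(table_config)
    config["task"] = {
        "taskTypeConfigsMap": {
            "RealtimeToOfflineSegmentsTask": {
                "bucketTimePeriod": bucket_period,  # time window moved per task run
                "bufferTimePeriod": buffer_period,  # only move data older than this
            }
        }
    }
    return config


# minimal placeholder config, not a complete table definition
realtime_config_sketch = {"tableName": "trips_hybrid", "tableType": "REALTIME"}
config_with_task = with_realtime_to_offline_task(realtime_config_sketch)
print(config_with_task["task"])
```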

To demonstrate how batch ingestion and hybrid tables work in Pinot without setting up an external batch processing system or periodic segment transformation job, we're going to create a realtime table reading from our Kafka `trips` topic, download completed segments from the controller and re-upload them as offline segments.

First, we need to create both tables (note the shared name and schema):

In [None]:
# common configuration used for both table types
table_config_template = {
  "tableName": "trips_hybrid",
  "segmentsConfig": {
    "timeColumnName": "trip_start_time_millis",
    "timeType": "MILLISECONDS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "60",
    "schemaName": "trips",
    "replication": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": [
        "rider_name",
        "driver_name",
        "start_location",
        "end_location"
    ]
  },
  "metadata": {
    "customConfigs": {}
  }
}

In [None]:
# create offline table
table_config = copy.deepcopy(table_config_template)
table_config["tableType"] = "OFFLINE"
print(requests.post('http://pinot-controller.pinot:9000/tables', json=table_config).json())

In [None]:
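# create realtime table (start from a fresh copy of the shared template)
table_config = copy.deepcopy(table_config_template)
table_config["tableType"] = "REALTIME"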
table_config["segmentsConfig"]["replicasPerPartition"] = "1"
table_config["tableIndexConfig"]["streamConfigs"] = {
  "streamType": "kafka",
  "stream.kafka.consumer.type": "simple",
  "stream.kafka.topic.name": "trips",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.zk.broker.url": "pinot-kafka-zookeeper:2181",
  "stream.kafka.broker.list": "pinot-kafka:9092",
  "realtime.segment.flush.threshold.time": "12h",
  "realtime.segment.flush.threshold.size": "50000",
  "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
}
print(requests.post('http://pinot-controller.pinot:9000/tables', json=table_config).json())

Let's again wait for our table to finish loading the data from Kafka.

In [None]:
wait_for_table_to_finish_loading("trips_hybrid")

Let's take a look at the external view of the hybrid table before touching it. We can see some realtime segments that were built from the Kafka data stream, but there are no offline segments so far:

In [None]:
external_view_for_table_dataframe("trips_hybrid")

In [None]:
# helpers for transforming realtime segments into offline segments
tmp_hybrid_basedir = "/tmp/trips_hybrid"
# cleanup old artifacts if any
shutil.rmtree(tmp_hybrid_basedir, ignore_errors=True)
os.mkdir(tmp_hybrid_basedir)

def path_for_realtime_tar(segment_name):
    return f"{tmp_hybrid_basedir}/{segment_name}.tar.gz"

def path_for_offline_dir(segment_name):
    return f"{tmp_hybrid_basedir}/{segment_name}_offline"

def path_for_offline_tar(segment_name):
    return f"{tmp_hybrid_basedir}/{segment_name}_offline.tar.gz"

def download_segment(segment_metadata):
    segment_name = segment_metadata["segment.name"]
    download_url = segment_metadata["segment.realtime.download.url"]
    segment_realtime_tar = path_for_realtime_tar(segment_name)

    # cleanup old downloads
    try:
        os.remove(segment_realtime_tar)
    except OSError:
        pass

    # download realtime segment tar
    response = requests.get(download_url, stream=True)
    with open(segment_realtime_tar, 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
    
    print(f"segment {segment_name} downloaded from {download_url} to {segment_realtime_tar}")
    return segment_realtime_tar

def untar_segment(segment_metadata):
    segment_name = segment_metadata["segment.name"]
    segment_offline_basedir = path_for_offline_dir(segment_name)
    segment_realtime_tar = path_for_realtime_tar(segment_name)

    # cleanup old artifacts if any
    shutil.rmtree(segment_offline_basedir, ignore_errors=True)

    # extract downloaded segment tar
    with tarfile.open(segment_realtime_tar, 'r:gz') as tar:
        tar.extractall(path=segment_offline_basedir)

    print(f"segment {segment_name} untarred to {segment_offline_basedir}")
    return segment_offline_basedir

def transform_segment(segment_metadata):
    realtime_table_name = segment_metadata["segment.table.name"]
    # replace only the type suffix, consistent with upload_segment_to_offline_table
    offline_table_name = realtime_table_name.replace("_REALTIME", "_OFFLINE")
    segment_name = segment_metadata["segment.name"]
    segment_offline_basedir = path_for_offline_dir(segment_name)
    
    # modify metadata.properties of segment
    segment_offline_dir = segment_offline_basedir + "/" + segment_name
    metadata_file = segment_offline_dir + "/v3/metadata.properties"
    metadata_contents = None
    with open(metadata_file, 'r') as file:
        metadata_contents = file.read()
    
    metadata_contents = metadata_contents.replace(realtime_table_name, offline_table_name)
    
    with open(metadata_file, 'w') as file:
        file.write(metadata_contents)
    del metadata_contents

    # create new offline segment tar
    segment_offline_tar = path_for_offline_tar(segment_name)
    with tarfile.open(segment_offline_tar, 'w:gz') as tar:
        tar.add(segment_offline_dir, arcname=segment_name)

    print(f"segment {segment_name} transformed to offline segment to {segment_offline_tar}")
    return segment_offline_tar

def upload_segment_to_offline_table(segment_metadata):
    realtime_table_name = segment_metadata["segment.table.name"]
    segment_name = segment_metadata["segment.name"]
    segment_offline_tar = path_for_offline_tar(segment_name)
    table_name = realtime_table_name.replace("_REALTIME", "_OFFLINE")
    
    # POST segment as multipart/form-data for key 'segment'
    with open(segment_offline_tar, 'rb') as tar:
        response = requests.post(f'http://pinot-controller.pinot:9000/v2/segments?table={table_name}', files={
            'segment': tar
        })
        print(response)
        print(response.json())

def transform_and_upload_nth_segment_to_offline_table(segment_metadata, n):
    nth_meta = segment_metadata_of_nth_segment(segment_metadata, n, table_type="REALTIME")
    
    # download, transform and upload all in one go
    download_segment(nth_meta)
    untar_segment(nth_meta)
    transform_segment(nth_meta)
    upload_segment_to_offline_table(nth_meta)

Now, we fetch the first two segments from the controller, manipulate the metadata and re-upload them to the controller as offline segments:

In [None]:
segment_metadata_hybrid = segment_metadata_for_table("trips_hybrid")

transform_and_upload_nth_segment_to_offline_table(segment_metadata_hybrid, 0)
transform_and_upload_nth_segment_to_offline_table(segment_metadata_hybrid, 1)

The external view for our hybrid table now shows the newly added offline segments:

In [None]:
external_view_for_table_dataframe("trips_hybrid")

This example query lists the top 5 riders in terms of total trip time. It shows that hybrid tables can be queried in exactly the same way as realtime tables:

In [None]:
query_for_hybrid = """
    SELECT rider_name, sum(trip_end_time_millis - trip_start_time_millis) / (60*60*1000) AS trip_time_sum
    FROM trips_hybrid
    GROUP BY rider_name
    ORDER BY trip_time_sum DESC
    LIMIT 5
    """

query_result = query_sql(query_for_hybrid)
query_result_to_dataframe(query_result)