# Introduction to Pinot

## Use Cases

Pinot is an open-source, distributed, highly available OLAP datastore built to serve analytical queries on real-time event data. It was developed by engineers at LinkedIn and Uber.
LinkedIn operates Pinot clusters for real-time Online Analytical Processing. It divides its analytics applications into two main categories: internal applications and user-facing applications. Internal applications need to process large data volumes (trillions of records), but higher query latencies are tolerated. In contrast, user-facing applications are available to hundreds of millions of LinkedIn members. These applications have a very high query volume and are expected to respond with lower latency.
Pinot production clusters at LinkedIn serve tens of thousands of queries per second. Overall, more than 50 analytical use cases are supported, and millions of records are ingested per second.
[Rogers, Ryan & Subramaniam, Subbu & Peng, Sean & Durfee, David & Lee, Seunghyun & Kancha, Santosh & Sahay, Shraddha & Ahammad, Parvez. (2020). LinkedIn's Audience Engagements API: A Privacy Preserving Data Analytics System at Scale. https://arxiv.org/pdf/2002.05839.pdf]

## Design Principles

Key requirements for Pinot include:

- high performance (low latency) query execution
- near-realtime data ingestion
- linear horizontal scalability (in terms of data size, ingestion rate and query rate)
- query flexibility to cover a wide range of analytical use cases
- high availability of data as well as components (fault tolerance)

All of these requirements influence Pinot's fundamental design principles and distributed architecture. The following sections show how Pinot achieves these goals by describing its core concepts and demonstrating its most important mechanisms.

## Architecture

A Pinot cluster consists of multiple distributed components: a controller, one or more brokers, and multiple servers. Pinot supports multi-tenancy out of the box, as brokers and servers can be assigned to serve specific tenants. A table in Pinot consists of columns and rows, which are split horizontally into shards (called segments).

Apache Helix is a generic cluster management framework used for the automatic management of partitioned and replicated distributed systems by creating and assigning tasks. Apache Zookeeper takes care of coordinating and maintaining the overall cluster state and health. In addition, it stores information about the cluster, such as the server locations of a segment and table schema information. The controller embeds the Helix agent and is the driver of the cluster. It provides a REST interface for CRUD (Create, Read, Update, Delete) operations on logical storage resources.

When a client wants to query data in Pinot tables, it sends the request to a broker. The broker routes queries to the appropriate server instances and keeps track of the query routing tables. These routing tables map each segment to the servers on which it resides, which ensures that a query is routed to the servers hosting the relevant segments. Segments can either consume real-time data, or data can be pushed into offline segments. By default, the query load is balanced across all available servers. The broker returns one consolidated result to the client, regardless of whether the table is divided into real-time and offline segments.
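
The routing step can be pictured as a lookup from segments to the servers hosting them. The following is a minimal sketch, not Pinot's actual implementation: the segment and server names are made up, and the random replica choice only stands in for the broker's load balancing.

```python
import random
from collections import defaultdict

# Hypothetical routing table: segment name -> servers hosting a replica of it
routing_table = {
    "trips__0__0": ["Server_1", "Server_2"],
    "trips__0__1": ["Server_2", "Server_3"],
    "trips__0__2": ["Server_1", "Server_3"],
}

def plan_query(routing_table, rng=random):
    """Pick one replica per segment and group the segments by chosen server."""
    plan = defaultdict(list)
    for segment, servers in routing_table.items():
        plan[rng.choice(servers)].append(segment)
    return dict(plan)

def merge_partial_counts(partial_results):
    """Combine per-server partial counts into one consolidated result."""
    return sum(partial_results)

plan = plan_query(routing_table)
for server, segments in sorted(plan.items()):
    print(server, "->", segments)
```

Every segment is queried exactly once, on one of the servers that host it, and the partial results are merged before they are returned to the client.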

Servers are categorized into offline and real-time servers. According to this categorization, servers in Pinot either host offline or real-time data. The responsibility of a server is defined by the table assignment strategy.

If a new real-time table is configured, the real-time server starts consuming data from the streaming source (e.g. a Kafka topic). The broker watches the consumption, detects new segments and maintains them in the query routing list. Once a segment has been completed (it has reached a configured number of records or has been consuming for a configured period of time), the controller uploads the segment to the cluster's segment store. The status of the uploaded segment changes from "consuming" to "online" and the controller starts a new consumption on the real-time server.
With batch ingestion, already existing data (e.g. in Hadoop) can be loaded into a Pinot table.
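
The completion rule for consuming segments, where a segment is sealed once it reaches a row count or an age threshold, can be sketched as follows. This is an illustration only: the class, the function name and the default thresholds are assumptions chosen for the example (they mirror the flush thresholds used in the table config later in this notebook), not Pinot internals.

```python
from dataclasses import dataclass

@dataclass
class ConsumingSegment:
    num_rows: int        # records consumed so far
    started_at_ms: int   # time at which consumption started (epoch millis)

def should_complete(segment, now_ms, max_rows=20000, max_age_ms=12 * 60 * 60 * 1000):
    """Return True once either the row threshold or the time threshold is reached."""
    reached_rows = segment.num_rows >= max_rows
    reached_age = now_ms - segment.started_at_ms >= max_age_ms
    return reached_rows or reached_age

segment = ConsumingSegment(num_rows=100, started_at_ms=0)
print(should_complete(segment, now_ms=1000))
```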

<img src='https://gblobscdn.gitbook.com/assets%2F-LtH6nl58DdnZnelPdTc%2F-M1pSGleddLn2q1vYEeM%2F-M1pvo4yOL0qNSjSS5nc%2FPinot-architecture%20(1).svg?alt=media&token=b0d011d8-4457-4bea-b29d-55d409eae7df' width="35%" height="35%">
                                                 
Image source: https://docs.pinot.apache.org/basics/architecture (accessed April 4th, 2021)

In addition to the components shown in the architecture diagram above, minions can be deployed to the cluster. They leverage Apache Helix and execute tasks provided by the Helix Task Executor Framework. A minion takes over resource-intensive tasks from other components, such as indexing or purging data from a Pinot cluster (for example, for GDPR compliance).
The Pinot minion can also be used for Pinot's offline flow, which moves records from REALTIME tables to the corresponding OFFLINE tables (covered later on).

### API Interface for Broker and Controller

Queries are sent to the broker's REST API (listening on port 8099 by default).
To get information about the resources of the Pinot cluster, we access the controller's REST API, which listens on port 9000.
Broker configurations are defined in a dedicated broker.conf file. Its properties define settings such as the broker's query port or a limit for queries. The purpose of the query limit is to protect brokers and servers against queries that return a very large number of records. The query limit needs to be enabled at cluster level. In our scenario, the parameter `pinot.broker.enable.query.limit.override` is set to false, which means that the broker won't override or add a query limit when the number of returned records is larger than the limit defined in the broker config file.
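
To make the effect of the override flag concrete, here is a rough sketch of the decision as described above. It is a simplified model under assumptions: `effective_limit` and its parameters are hypothetical names, and the real broker logic may differ in detail.

```python
def effective_limit(requested_limit, max_response_limit, override_enabled):
    """Return the LIMIT applied to a query in this simplified model.

    requested_limit: LIMIT given in the query, or None if the query has none
    max_response_limit: maximum number of records configured for the broker
    override_enabled: value of the cluster-level override flag
    """
    if not override_enabled:
        # Flag disabled: the limit of the query is neither changed nor added
        return requested_limit
    if requested_limit is None or requested_limit > max_response_limit:
        # Flag enabled: cap (or add) the limit at the configured maximum
        return max_response_limit
    return requested_limit

# With the flag disabled, as in our scenario, a large limit passes through unchanged
print(effective_limit(1_000_000, 10_000, override_enabled=False))
```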

In [15]:
import requests
import json

print("\033[1m" + "Broker: "+ "\033[0m" + json.dumps((requests.get('http://pinot-controller.pinot:9000/v2/brokers/tenants')).json(), indent=2))
print("\033[1m" + "Health of Controller: "+ "\033[0m" + requests.get('http://pinot-controller.pinot:9000/pinot-controller/admin').text)
print("\033[1m" + "Cluster: "+ "\033[0m" + json.dumps((requests.get('http://pinot-controller.pinot:9000/cluster/configs')).json(), indent=2))

[1mBroker: [0m{
  "DefaultTenant": [
    {
      "instanceName": "Broker_pinot-broker-0.pinot-broker-headless.pinot.svc.cluster.local_8099",
      "host": "Broker_pinot-broker-0.pinot-broker-headless.pinot.svc.cluster.local",
      "port": 8099
    }
  ]
}
[1mHealth of Controller: [0mGOOD
[1mCluster: [0m{
  "allowParticipantAutoJoin": "true",
  "enable.case.insensitive": "false",
  "pinot.broker.enable.query.limit.override": "false",
  "default.hyperloglog.log2m": "8"
}


### Key differences to well-known database technologies

In Pinot, data ingestion is append-only. Values cannot be modified after ingestion with operations like `UPDATE`, as known from databases like PostgreSQL. Pinot is therefore not a replacement for databases in an operational business environment, which usually require updates to data because of the nature of the events or for data correction. Pinot does not fit these use cases; instead, it complements them where fast analytics are required. Data can still be purged after ingestion to fulfill compliance requirements (e.g. GDPR). For this, the minion can be used to replace entire segments, but single records can never be manipulated in place.
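
The difference between updating a record and replacing a segment can be illustrated with a small sketch. It models a segment as a plain list of rows, which is a strong simplification, and `purge_segment` is a hypothetical helper, not a Pinot or minion API.

```python
def purge_segment(rows, should_purge):
    """Build the rows of a replacement segment, leaving the original list untouched."""
    return [row for row in rows if not should_purge(row)]

old_segment = [
    {"rider_name": "Alice Example", "payment_amount": 12.0},
    {"rider_name": "Bob Example", "payment_amount": 30.5},
    {"rider_name": "Alice Example", "payment_amount": 8.0},
]

# Purge all records of one rider by building a new segment that replaces the old one
new_segment = purge_segment(old_segment, lambda row: row["rider_name"] == "Alice Example")
print(len(old_segment), "rows before,", len(new_segment), "rows after")
```

No row of the old segment is modified; the old segment as a whole is swapped for the rebuilt one.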

Another difference of Apache Pinot compared to databases like PostgreSQL is that it doesn't support queries requiring movements of large amounts of data between the nodes, like joins. The query engine Presto can be used to join different tables in Pinot, but Presto needs to be set up additionally and is not part of Pinot.

Tables in Pinot typically have one primary time column, which is used to manage the time boundary between offline and real-time data in a hybrid table. This may sound similar to the concept of time series databases like InfluxDB. Both databases are built to handle events with a timestamp, but the timestamp in Pinot is only strictly required for hybrid tables. In addition, Pinot is not only focused on storing time series of metrics: besides numeric data types and date time fields, it can also store string and bytes values. Although InfluxDB also supports strings to a certain extent, Pinot additionally offers, for example, text indexing for enhanced full-text search.
Compared to time series databases like InfluxDB, Pinot is optimized for storing time-based data with a focus on append operations and queries. Update and delete operations on single records are not supported in Apache Pinot, with one exception: stream ingestion supports upserts if a primary key has been defined in the schema.
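
One way to picture the time boundary in a hybrid table: the offline part answers for records up to the boundary and the real-time part for everything after it, so that records present in both parts are not counted twice. The sketch below assumes the boundary value is already known; how Pinot derives it is not shown, and `query_hybrid` is a hypothetical helper.

```python
def query_hybrid(offline_rows, realtime_rows, boundary_ms, time_column="trip_start_time_millis"):
    """Merge offline and real-time rows using complementary time filters."""
    offline_part = [row for row in offline_rows if row[time_column] <= boundary_ms]
    realtime_part = [row for row in realtime_rows if row[time_column] > boundary_ms]
    return offline_part + realtime_part

# The record at t=2000 exists in both parts, but is only read from the offline part
offline_rows = [{"trip_start_time_millis": 1000}, {"trip_start_time_millis": 2000}]
realtime_rows = [{"trip_start_time_millis": 2000}, {"trip_start_time_millis": 3000}]

merged = query_hybrid(offline_rows, realtime_rows, boundary_ms=2000)
print(len(merged))
```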

Another key difference between Pinot and other distributed databases is the heterogeneous nature of its components. Some traditional RDBMSs, for example PostgreSQL, can be scaled horizontally to form a cluster by adding more instances, each of which stores and manages different partitions (shards) of the dataset. Such a distributed setup consists of a single stateful component that is started on multiple machines (homogeneous distributed system).
In contrast, a Pinot cluster comprises multiple heterogeneous components (described above), each of which serves a specific purpose and is only responsible for a given subtask of the entire system. For example, servers are the stateful components of Pinot that store and query the actual dataset, while brokers are stateless components that don't host data themselves and only serve as the query frontend for the database. Pinot can therefore be seen as a heterogeneous distributed system, which makes it more complex to deploy and operate, but also serves the key requirements described above (mainly horizontal scalability and fault tolerance).

## Schemas and Tables

### Schemas

To create a table in Pinot, a schema is required. A schema configuration defines fields and data types; this metadata is stored in Zookeeper.
In our examples, we work with data of a fictional online platform which connects car drivers and passengers to travel together in Germany (ride sharing).

Columns in Pinot are of different categories: 
- dimension columns: support operations like `GROUP BY` and `WHERE` ("slice and dice"), e.g. name of the car driver, trip start and end location
- metric columns: represent quantitative data and can be used e.g. for aggregation clauses (e.g. payment amount, rating of the driver)
- DateTime columns: represent timestamps of records. One DateTime column can be treated as the primary time column, which is defined in the segment config of a table. The primary time column is used to determine the boundaries of segments and the boundary between offline and real-time data in hybrid tables. A typical operation on DateTime columns is `WHERE`, e.g. filtering on the time when a ride was requested by the rider

Let's define the example `trips` schema:

In [17]:
import requests

schemaConfiguration = {
  "schemaName": "trips",
  "dimensionFieldSpecs": [
    {
      "name": "rider_name",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "driver_name",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "license_plate",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "start_location",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "start_zip_code",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "start_location_state",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "end_location",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "end_zip_code",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "end_location_state",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "rider_is_premium",
      "dataType": "INT",
      "defaultNullValue": 0
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "count",
      "dataType": "LONG",
      "defaultNullValue": 1
    },
    {
      "name": "payment_amount",
      "dataType": "FLOAT",
      "defaultNullValue": 0
    },
    {
      "name": "payment_tip_amount",
      "dataType": "FLOAT",
      "defaultNullValue": 0
    },
    {
      "name": "trip_wait_time_millis",
      "dataType": "LONG",
      "defaultNullValue": 0
    },
    {
      "name": "rider_rating",
      "dataType": "INT",
      "defaultNullValue": 0
    },
    {
      "name": "driver_rating",
      "dataType": "INT",
      "defaultNullValue": 0
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "trip_start_time_millis",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MINUTES",
      "dateTimeType": "PRIMARY"
    },
    {
      "name": "request_time_millis",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MINUTES",
      "dateTimeType": "SECONDARY"
    },
    {
      "name": "trip_end_time_millis",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MINUTES",
      "dateTimeType": "SECONDARY"
    }
  ]
}

# create the trips schema
response = requests.post('http://pinot-controller.pinot:9000/schemas?override=true', json=schemaConfiguration)
print("Create Schema: " + response.text)

# list all Schemas
response = (requests.get('http://pinot-controller.pinot:9000/schemas')).json()
print("Get all schemas: " + str(response))

Create Schema: {"status":"trips successfully added"}
Get all schemas: ['trips']
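
Before producing records for a schema like this, it can help to check that a record matches the declared fields and data types. The following is a sketch under assumptions: `schema_fields` and `validate_record` are hypothetical helpers, the mapping from Pinot data types to Python types only covers the types used here, and the check is stricter than necessary, since the schema declares default null values for missing fields.

```python
# Python types accepted for each Pinot data type in this sketch (an assumption)
PYTHON_TYPES = {
    "STRING": (str,),
    "INT": (int,),
    "LONG": (int,),
    "FLOAT": (int, float),
    "DOUBLE": (int, float),
}

def schema_fields(schema):
    """Map every field name in a schema dict to its declared data type."""
    fields = {}
    for key in ("dimensionFieldSpecs", "metricFieldSpecs", "dateTimeFieldSpecs"):
        for spec in schema.get(key, []):
            fields[spec["name"]] = spec["dataType"]
    return fields

def validate_record(record, schema):
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    fields = schema_fields(schema)
    for name, data_type in fields.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif isinstance(record[name], bool) or not isinstance(record[name], PYTHON_TYPES[data_type]):
            problems.append(f"wrong type for {name}: expected {data_type}")
    for name in record:
        if name not in fields:
            problems.append(f"unknown field: {name}")
    return problems

# A cut-down schema for illustration (the full trips schema has more fields)
small_schema = {
    "schemaName": "trips",
    "dimensionFieldSpecs": [{"name": "rider_name", "dataType": "STRING"}],
    "metricFieldSpecs": [{"name": "payment_amount", "dataType": "FLOAT"}],
    "dateTimeFieldSpecs": [{"name": "request_time_millis", "dataType": "LONG"}],
}

good_record = {"rider_name": "Ada", "payment_amount": 12.5, "request_time_millis": 1618137855289}
bad_record = {"rider_name": 42, "payment_amount": 12.5}
print(validate_record(good_record, small_schema))
print(validate_record(bad_record, small_schema))
```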


### Data Generation

Our Pinot tables will consume data from a Kafka topic in real time. Before messages can be consumed from this topic, data needs to be produced and sent to it.

To create and fill our Kafka topic, we first need to create a Kafka producer client.

In [20]:
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['pinot-kafka.pinot:9092'], value_serializer=lambda v: json.dumps(v).encode('utf-8'))

The functions below generate random data records for car rides in Germany, which are then inserted into the Kafka topic. Each ride consists of driver and passenger details such as name and rating, measures like payments, details about the origin and destination of the trip, and several time measures, for example the timestamp when the trip was requested. The date and time of each trip are generated based on the current timestamp (advancing by roughly 1 second per record).

In [21]:
import csv
import random
import names
import time
import json
import os
import requests

# Download the file of German places with postal codes once and cache it locally
if not os.path.exists("./pgeocodeDE.txt"):
    response = requests.get("https://symerio.github.io/postal-codes-data/data/geonames/DE.txt")
    with open("./pgeocodeDE.txt", 'w', encoding='utf8') as out_file:
        out_file.write(response.text)
    del response

# Load the tab-separated file and keep a random sample of places
with open('./pgeocodeDE.txt', encoding='utf8') as geocode_file:
    geocode_list = list(csv.reader(geocode_file, delimiter='\t'))[1:]  # skip first line (header)
random.shuffle(geocode_list)
geocode_list = geocode_list[:1000]  # take only 1000 random places to generate more overlapping data

def choose_random_city():
    return random.choice(geocode_list)

# generate only 1000 driver/rider names to generate more overlapping data
names_list = []
for i in range(1000):
    names_list.append(names.get_full_name())

def choose_random_name():
    return random.choice(names_list)
    
# Generation of License Plate
# create a pool of letters to choose from
letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
numbers = '0123456789'

def generate_license_plate():
    # generate 4 randomly chosen letters: L1, L2, L3, L4
    L1 = random.choice(letters)
    L2 = random.choice(letters)
    L3 = random.choice(letters)
    L4 = random.choice(letters)
    # generate 2 randomly chosen digits: N1, N2
    N1 = random.choice(numbers)
    N2 = random.choice(numbers)

    # combine them into a plate of the form "AB-CD-12"
    return L1 + L2 + '-' + L3 + L4 + '-' + N1 + N2

# Calculation of price based on distance between start city and end destination
def calculate_price(v_distance):
    v_multiplicator=round(random.uniform(0.8, 2.0),2)
    v_price=round(v_distance*v_multiplicator,2)
    return(v_price)

Let's generate our sample dataset, containing about 300,000 records in total, in order to demonstrate the different Pinot concepts and mechanisms later on:

In [None]:
# begin generating trips data at current time
start_timestamp_ms = time.time_ns() // 1000000

# Generate data
num_records = 300000 + random.randint(5000,10000)
for i in range(num_records):
    v_start_location = choose_random_city()
    v_end_location = choose_random_city()
    v_distance = random.randint(5,1000)

    # add random jitter: in a large system, the event stream is probably not strictly sorted either
    v_requesttime = start_timestamp_ms + i*1000 + random.randint(0,100)

    v_waiting_time_millis = random.randint(1,3600000)
    v_trip_time = round((v_distance/random.randint(45,60)) * 60 *60*1000)

    record = {
        "rider_name": choose_random_name(),
        "driver_name": choose_random_name(),
        "license_plate": generate_license_plate(),
        "start_location": v_start_location[2],
        "start_zip_code": v_start_location[1],
        "start_location_state": v_start_location[3],
        "end_location": v_end_location[2],
        "end_zip_code": v_end_location[1],
        "end_location_state": v_end_location[3],
        "rider_is_premium": random.randint(0, 1),
        "count": 1,
        "payment_amount": calculate_price(v_distance),
        "payment_tip_amount": random.randint(5,50),
        "trip_wait_time_millis": v_waiting_time_millis,
        "rider_rating": random.randint(0,5),
        "driver_rating": random.randint(0,5),
        "trip_start_time_millis": v_requesttime + v_waiting_time_millis,
        "request_time_millis": v_requesttime,
        "trip_end_time_millis": v_requesttime + v_waiting_time_millis + v_trip_time
    }
 
    producer.send('trips', value=record)
        
    if i % 5000 == 0:
        print(f'{i} records generated')

print(f'done generating {num_records} records, ready to do some fancy analytics!')

### Tables

Tables represent a collection of related data in Pinot. A table has either the type `OFFLINE` (ingesting pre-built Pinot segments from external stores) or the type `REALTIME` (ingesting data from streams). The user is not required to know the type of a table when querying it.

To configure a table, properties like name, type and indexing are required. In the following example, we create a table which consumes data from the Kafka topic filled above:

In [23]:
json_tableConfig = {
  "tableName": "trips",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "trip_start_time_millis",
    "timeType": "MILLISECONDS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "60",
    "schemaName": "trips",
    "replication": "1",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "simple",
      "stream.kafka.topic.name": "trips",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.zk.broker.url": "pinot-kafka-zookeeper:2181",
      "stream.kafka.broker.list": "pinot-kafka:9092",
      "realtime.segment.flush.threshold.time": "12h",
      "realtime.segment.flush.threshold.size": "20000",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    },
    "noDictionaryColumns": ["count", "payment_amount", "payment_tip_amount", "trip_wait_time_millis", "rider_rating", "driver_rating"],
    "aggregateMetrics": True
  },
  "metadata": {
    "customConfigs": {}
  }
} 

response = requests.post('http://pinot-controller.pinot:9000/tables', json=json_tableConfig)
print(response)
print(response.json())

<Response [200]>
{'status': 'Table trips_REALTIME succesfully added'}


After creation, data records of the Kafka Topic are loaded into the table. To execute a query, the SQL statement is sent to the broker of the Pinot cluster. The response contains the result records, as well as query statistics of the execution.

While our data is loading, let's query the example table to find out how many trips have already been completed with passengers who are premium members of our ride sharing platform:


In [24]:
print(requests.post('http://pinot-broker.pinot:8099/query/sql', json={
    "sql" : "SELECT SUM(count) as trips_count FROM trips WHERE rider_is_premium = 1"
}).json())

{'resultTable': {'dataSchema': {'columnNames': ['trips_count'], 'columnDataTypes': ['DOUBLE']}, 'rows': [[157457.0]]}, 'exceptions': [], 'numServersQueried': 1, 'numServersResponded': 1, 'numSegmentsQueried': 16, 'numSegmentsProcessed': 16, 'numSegmentsMatched': 16, 'numConsumingSegmentsQueried': 1, 'numDocsScanned': 157457, 'numEntriesScannedInFilter': 315170, 'numEntriesScannedPostFilter': 157457, 'numGroupsLimitReached': False, 'totalDocs': 315170, 'timeUsedMs': 17, 'segmentStatistics': [], 'traceInfo': {}, 'minConsumingFreshnessTimeMs': 1618137855289}
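
The raw response is hard to read, so a small helper can separate the result rows from the execution statistics. This is a sketch: `summarize_response` is a hypothetical helper, and it is applied here to an abbreviated copy of the response shown above instead of a live broker call.

```python
def summarize_response(response):
    """Extract result rows and a few execution statistics from a broker response dict."""
    table = response.get("resultTable", {})
    columns = table.get("dataSchema", {}).get("columnNames", [])
    rows = [dict(zip(columns, row)) for row in table.get("rows", [])]
    stats_keys = [
        "numServersQueried", "numServersResponded", "numSegmentsQueried",
        "numDocsScanned", "totalDocs", "timeUsedMs",
    ]
    stats = {key: response[key] for key in stats_keys if key in response}
    return rows, stats

# Abbreviated copy of the broker response shown above
example_response = {
    "resultTable": {
        "dataSchema": {"columnNames": ["trips_count"], "columnDataTypes": ["DOUBLE"]},
        "rows": [[157457.0]],
    },
    "exceptions": [],
    "numServersQueried": 1,
    "numServersResponded": 1,
    "numSegmentsQueried": 16,
    "numDocsScanned": 157457,
    "totalDocs": 315170,
    "timeUsedMs": 17,
}

rows, stats = summarize_response(example_response)
print(rows)
print(stats)
```

In this response, 157457 of 315170 documents were scanned, and the query took 17 ms on a single server.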


### Pre-aggregation

One feature of Pinot that contributes to performance improvements is the pre-aggregation of rows that have the same dimension values. With it, real-time stream data is aggregated as it is consumed, which reduces segment sizes.
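
The idea can be sketched in a few lines of plain Python. This is a conceptual illustration, not Pinot's implementation: `pre_aggregate` is a hypothetical helper, the sample records are made up, and summing is assumed as the aggregation function for all metric columns.

```python
from collections import defaultdict

def pre_aggregate(records, dimension_columns, metric_columns):
    """Sum the metric columns of all records that share the same dimension values."""
    totals = defaultdict(lambda: defaultdict(int))
    for record in records:
        key = tuple(record[col] for col in dimension_columns)
        for metric in metric_columns:
            totals[key][metric] += record[metric]
    return [
        {**dict(zip(dimension_columns, key)), **dict(metrics)}
        for key, metrics in totals.items()
    ]

records = [
    {"start_location": "Berlin", "rider_is_premium": 1, "count": 1, "payment_amount": 10.0},
    {"start_location": "Berlin", "rider_is_premium": 1, "count": 1, "payment_amount": 5.0},
    {"start_location": "Hamburg", "rider_is_premium": 0, "count": 1, "payment_amount": 7.5},
]

aggregated = pre_aggregate(records, ["start_location", "rider_is_premium"], ["count", "payment_amount"])
for row in aggregated:
    print(row)
```

The two Berlin records collapse into a single row, so three input records are stored as two rows. A query like `SUM(count)` still returns the original number of trips.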