In [66]:
import os
import requests

# Download the architecture diagram once and cache it as a local file
if not os.path.exists("./pinotarchitecture.svg"):
    response = requests.get("https://gblobscdn.gitbook.com/assets%2F-LtH6nl58DdnZnelPdTc%2F-M1pSGleddLn2q1vYEeM%2F-M1pvo4yOL0qNSjSS5nc%2FPinot-architecture%20(1).svg?alt=media&token=b0d011d8-4457-4bea-b29d-55d409eae7df")
    response.raise_for_status()  # do not cache an error page as the image
    with open("./pinotarchitecture.svg", 'w', encoding='utf8') as out_file:
        out_file.write(response.text)
    del response

# Introduction to Pinot
## Use Cases
Pinot is a distributed, highly available OLAP datastore built to serve analytical queries on real-time event data. It was developed by engineers at LinkedIn and Uber.
LinkedIn operates Pinot clusters for real-time Online Analytical Processing. Its analytics applications fall into two main categories: internal applications and site-facing applications. Internal applications need to process large data volumes (trillions of records), but higher query latencies are tolerated for them. In contrast, site-facing applications are available to hundreds of millions of LinkedIn members. These applications have a very high query volume and are expected to respond with lower latency.
Pinot production clusters at LinkedIn serve tens of thousands of queries per second. Overall, more than 50 analytical use cases are supported, and millions of records are ingested per second.
[Rogers, Ryan & Subramaniam, Subbu & Peng, Sean & Durfee, David & Lee, Seunghyun & Kancha, Santosh & Sahay, Shraddha & Ahammad, Parvez. (2020). LinkedIn's Audience Engagements API: A Privacy Preserving Data Analytics System at Scale. https://arxiv.org/pdf/2002.05839.pdf]




## Architecture

A Pinot cluster is built from multiple distributed system components. Each cluster consists of a controller, one or more brokers, and multiple servers. Pinot supports multi-tenancy out of the box, as multiple brokers and servers can be grouped together. A table in Pinot consists of columns and rows. It is divided horizontally into shards, which are called segments.

Apache Helix is a generic cluster management framework used to manage partitioned and replicated distributed systems automatically by creating and assigning tasks. Apache ZooKeeper coordinates and maintains the overall cluster state and health. In addition, it stores cluster metadata such as the server locations of a segment and table schema information. The controller embeds the Helix agent and is the driver of the cluster. It provides a REST interface for CRUD (Create, Read, Update, Delete) operations on logical storage resources.

When a client queries Pinot tables, the request is sent to the broker. The broker routes queries to the appropriate server instances and keeps track of the query routing tables. These routing tables map each segment to the servers it resides on, which ensures that a query reaches the correct segments. Segments either contain real-time data or are offline segments into which data has been pushed. By default, the query load is balanced across all available servers. The broker returns one consolidated reply to the client, regardless of whether the table is divided into real-time and offline segments.
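The routing idea can be pictured with a toy model. The following is a minimal sketch, not Pinot's actual routing code: the segment and server names are made up, and the round-robin replica choice is only one possible way to balance load.

```python
import itertools
from collections import defaultdict

# Hypothetical routing table: segment name -> servers hosting a replica of it.
routing_table = {
    "trips_0": ["server-1", "server-2"],
    "trips_1": ["server-2", "server-3"],
    "trips_2": ["server-1", "server-3"],
}

def plan_query(routing_table, counter):
    """Pick one replica per segment, rotating through replicas to spread load."""
    plan = defaultdict(list)
    for segment, servers in sorted(routing_table.items()):
        server = servers[next(counter) % len(servers)]
        plan[server].append(segment)
    return dict(plan)

counter = itertools.count()
plan = plan_query(routing_table, counter)
print(plan)  # server -> segments this server is asked to scan
```

Each segment is queried on exactly one server that hosts it; the broker would then merge the partial results into a single reply.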

Servers are categorized into offline and real-time servers and host offline or real-time data accordingly. The responsibility of a server is defined by the table assignment strategy.

When a new real-time table is configured, the real-time server starts consuming data from the streaming source, for example a Kafka topic. The broker watches the consumption, detects new segments and maintains them in the query routing list. Once a segment is complete (it has reached a specific number of records or has been consuming for a specific time), the controller uploads it to the cluster's segment store. The status of the uploaded segment changes from "consuming" to "online", and the controller starts a new consumption on the real-time server.
With batch ingestion, existing data (e.g. in Hadoop) can be loaded into a Pinot table.
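The completion rule for real-time segments can be sketched as a small state model. This is an illustration only: the class, the function names and the segment naming pattern are hypothetical, and the thresholds (20000 rows, 12 hours) are example values.

```python
from dataclasses import dataclass

@dataclass
class ConsumingSegment:
    name: str
    rows: int = 0
    age_seconds: int = 0
    status: str = "CONSUMING"

def should_complete(segment, max_rows, max_age_seconds):
    """A segment is complete once either threshold is reached."""
    return segment.rows >= max_rows or segment.age_seconds >= max_age_seconds

def complete_if_due(segment, sequence, max_rows=20000, max_age_seconds=12 * 3600):
    """Return (segment, next_consuming_segment); the second item is None if nothing changed."""
    if not should_complete(segment, max_rows, max_age_seconds):
        return segment, None
    segment.status = "ONLINE"  # sealed segment is now served as a completed segment
    # Hypothetical naming pattern for the follow-up consuming segment.
    return segment, ConsumingSegment(name=f"trips__0__{sequence + 1}")

sealed, following = complete_if_due(ConsumingSegment("trips__0__0", rows=20000), 0)
print(sealed.status, following.name)
```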


<img src="pinotarchitecture.svg" width="35%" height="35%">
                                                 
Source of the screenshot: https://docs.pinot.apache.org/basics/architecture (accessed 4 April 2021)

In addition to the components shown in the architecture screenshot, minions can be deployed to the cluster. They leverage Apache Helix and execute tasks provided by the Helix Task Executor Framework. A minion takes over resource-intensive tasks from other components, such as indexing or purging data from a Pinot cluster, for example for GDPR compliance.
The Pinot minion is also required for the offline flow in Pinot, which moves records from REALTIME tables to the corresponding OFFLINE tables.


### API Interface for Broker and Controller
Query requests via the REST API are sent to port 8099, the port on which the broker runs in our setup.
To get information about the resources of the Pinot cluster, we access the controller, which runs on port 9000.
Broker configurations are defined in a broker.conf file. Its properties cover settings such as the query port of the broker or a limit for queries. The latter protects brokers and servers against queries that return a very large number of records. The query limit needs to be enabled at cluster level. In our scenario, the parameter pinot.broker.enable.query.limit.override is set to false, which means the broker does not override or add a query limit when a query would return more records than defined in the broker config file.
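The effect of the override flag can be illustrated with a simplified model. The function below is a hypothetical sketch of the behaviour described above, not broker code; `broker_max_limit` stands for whatever limit is configured in the broker config file.

```python
def effective_limit(requested_limit, broker_max_limit, override_enabled):
    """Return the LIMIT that would apply to a query under this simplified model."""
    if override_enabled and requested_limit > broker_max_limit:
        return broker_max_limit  # broker caps the query
    return requested_limit       # query limit is left untouched

print(effective_limit(1_000_000, 10_000, override_enabled=False))  # 1000000
print(effective_limit(1_000_000, 10_000, override_enabled=True))   # 10000
```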

In [22]:
import requests
import json

print("\033[1m" + "Broker: "+ "\033[0m" + json.dumps((requests.get('http://pinot-controller.pinot:9000/v2/brokers/tenants')).json(), indent=2))
print("\033[1m" + "Health of Controller: "+ "\033[0m" + requests.get('http://pinot-controller.pinot:9000/pinot-controller/admin').text)
print("\033[1m" + "Cluster: "+ "\033[0m" + json.dumps((requests.get('http://pinot-controller.pinot:9000/cluster/configs')).json(), indent=2))

Broker: {
  "DefaultTenant": [
    {
      "instanceName": "Broker_pinot-broker-0.pinot-broker-headless.pinot.svc.cluster.local_8099",
      "host": "Broker_pinot-broker-0.pinot-broker-headless.pinot.svc.cluster.local",
      "port": 8099
    }
  ]
}
Health of Controller: GOOD
Cluster: {
  "allowParticipantAutoJoin": "true",
  "enable.case.insensitive": "false",
  "pinot.broker.enable.query.limit.override": "false",
  "default.hyperloglog.log2m": "8"
}


### Comparison with known database technologies

In Pinot, data ingestion is append-only. Apart from upserts during stream ingestion, values cannot be modified after ingestion with update operations as known from databases like PostgreSQL. Pinot is not a replacement for databases in an operational business environment, which usually require updates to data because of the nature of the events or for data correction. Pinot does not fit these use cases. Instead, it can enhance use cases that require fast analytics. To purge data, the minion can be used.
Another limitation of Apache Pinot compared to databases like PostgreSQL is that it does not support queries that require moving large amounts of data between nodes, such as joins. The query engine Presto can be used to join tables in Pinot, but Presto needs to be set up separately and is not part of Pinot.

Tables in Pinot can have one primary time column, which is used to manage the time boundary between offline and realtime data in a hybrid table. This may sound similar to the concept of time-series databases like InfluxDB. Both databases are built to handle events with a timestamp; in Pinot the timestamp is not mandatory, but it is what makes hybrid tables possible. In addition, Pinot is not focused only on storing metrics: besides numeric data types and date-time fields, columns can be of type string and bytes. Although InfluxDB also supports strings to a certain extent, Pinot additionally offers features such as text search.
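The time boundary can be pictured as a cut-off on the primary time column: older records are answered from the offline part, newer ones from the real-time part. The sketch below is a simplified illustration under that assumption; the hybrid table name `trips`, the helper names and the boundary value are hypothetical.

```python
def split_hybrid_query(table, time_column, time_boundary_ms):
    """Build one query per table part so that no record is counted twice."""
    offline_sql = (
        f"SELECT COUNT(*) FROM {table}_OFFLINE "
        f"WHERE {time_column} <= {time_boundary_ms}"
    )
    realtime_sql = (
        f"SELECT COUNT(*) FROM {table}_REALTIME "
        f"WHERE {time_column} > {time_boundary_ms}"
    )
    return offline_sql, realtime_sql

def merge_counts(offline_count, realtime_count):
    """Combine the two partial results into the single reply seen by the client."""
    return offline_count + realtime_count

offline_sql, realtime_sql = split_hybrid_query("trips", "trip_start_time_millis", 1618000000000)
print(offline_sql)
print(realtime_sql)
```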

Like the time-series database InfluxDB, Pinot is optimized for storing time-based data with a focus on write operations and queries. Update and delete operations are not supported in Apache Pinot, except for upserting data via stream ingestion if a primary key has been defined in the schema.

Pinot can be categorized as a homogeneous distributed database, as all sites have identical software and access to the same defined schemas. In addition, all components are aware of each other and cooperate. 

## Introduction to Schemas and Tables in Pinot
### Schemas
To create a table in Pinot, a schema is required. A schema configuration defines fields and data types; this metadata is stored in ZooKeeper.
In our example, the data is based on a fictional online platform that connects car drivers and passengers to travel together in Germany.

Columns in Pinot fall into different categories:
- Dimension columns support operations like GROUP BY and WHERE (e.g. name of the car driver, license plate).
- Metric columns represent quantitative data and can be used e.g. in aggregations and filter clauses (e.g. payment amount, rating of the driver).
- DateTime columns represent time columns. One DateTime column can be treated as the primary time column, which is defined in the segment config part of a table. The primary time column is used to manage the time boundary between offline and realtime data in a hybrid table. Supported operations include GROUP BY and WHERE (e.g. time when the car driver was requested by the rider).
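The three categories map to three lists in a schema configuration. As a small illustration, the sketch below groups the column names of a trimmed-down schema by category; the helper `columns_by_category` and the shortened schema are assumptions for this example, with field names following the car-sharing scenario.

```python
# Trimmed-down schema with one field per category.
schema = {
    "schemaName": "trips",
    "dimensionFieldSpecs": [{"name": "driver_name", "dataType": "STRING"}],
    "metricFieldSpecs": [{"name": "payment_amount", "dataType": "FLOAT"}],
    "dateTimeFieldSpecs": [
        {
            "name": "trip_start_time_millis",
            "dataType": "LONG",
            "format": "1:MILLISECONDS:EPOCH",
            "granularity": "1:MINUTES",
        }
    ],
}

CATEGORY_KEYS = {
    "dimension": "dimensionFieldSpecs",
    "metric": "metricFieldSpecs",
    "datetime": "dateTimeFieldSpecs",
}

def columns_by_category(schema):
    """Return the column names of a schema configuration, grouped by category."""
    return {
        category: [field["name"] for field in schema.get(key, [])]
        for category, key in CATEGORY_KEYS.items()
    }

print(columns_by_category(schema))
```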

In [4]:
import requests

schemaConfiguration = {
  "schemaName": "trips",
  "dimensionFieldSpecs": [
    {
      "name": "rider_name",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "driver_name",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "license_plate",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "start_location",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "start_zip_code",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "start_location_state",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "end_location",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "end_zip_code",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "end_location_state",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "rider_is_premium",
      "dataType": "INT",
      "defaultNullValue": 0
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "count",
      "dataType": "LONG",
      "defaultNullValue": 1
    },
    {
      "name": "payment_amount",
      "dataType": "FLOAT",
      "defaultNullValue": 0
    },
    {
      "name": "payment_tip_amount",
      "dataType": "FLOAT",
      "defaultNullValue": 0
    },
    {
      "name": "trip_wait_time_millis",
      "dataType": "LONG",
      "defaultNullValue": 0
    },
    {
      "name": "rider_rating",
      "dataType": "INT",
      "defaultNullValue": 0
    },
    {
      "name": "driver_rating",
      "dataType": "INT",
      "defaultNullValue": 0
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "trip_start_time_millis",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MINUTES",
      "dateTimeType": "PRIMARY"
    },
    {
      "name": "request_time_millis",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MINUTES",
      "dateTimeType": "SECONDARY"
    },
    {
      "name": "trip_end_time_millis",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MINUTES",
      "dateTimeType": "SECONDARY"
    }
  ]
}

# Create Schema
response = requests.post('http://pinot-controller.pinot:9000/schemas?override=false', json=schemaConfiguration)
print("Create Schema: " + response.text)
# Display all Schemas
response = (requests.get('http://pinot-controller.pinot:9000/schemas')).json()
print("Get all schemas: " + str(response))

Create Schema: {"status":"trips successfully added"}
Get all schemas: ['trips']


### Data Generation
Our Pinot tables will consume data from a Kafka topic in real time. Before messages can be consumed, data needs to be produced and sent to the topic.

To create a Kafka topic, we first need to create a Kafka admin client.

In [10]:
from kafka.admin import KafkaAdminClient, NewTopic

admin_client = KafkaAdminClient(
    bootstrap_servers="pinot-kafka.pinot:9092", 
    client_id='test')
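The cell above imports `NewTopic` and creates the admin client, but does not create the topic itself. A minimal sketch of that step with kafka-python could look as follows; the partition count and replication factor of 1 are assumptions, as is the decision to treat an already existing topic as a non-error.

```python
def topic_spec(name, num_partitions=1, replication_factor=1):
    """Keyword arguments for kafka-python's NewTopic (the defaults are assumptions)."""
    return {
        "name": name,
        "num_partitions": num_partitions,
        "replication_factor": replication_factor,
    }

def create_topic(admin_client, spec):
    """Create the topic; return False if it already exists."""
    # Imported here so that topic_spec stays usable without kafka-python installed.
    from kafka.admin import NewTopic
    from kafka.errors import TopicAlreadyExistsError

    try:
        admin_client.create_topics([NewTopic(**spec)])
        return True
    except TopicAlreadyExistsError:
        return False

# Usage with the admin client created above:
# create_topic(admin_client, topic_spec("trips"))
```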

The code below generates data records for car rides in Germany and sends them to the Kafka topic. Each ride consists of driver and passenger details such as name and rating, measures such as payments, details about the origin and destination of the trip, and several time measures, for example the timestamp when the trip was requested. The date and time of each trip are derived from the current timestamp plus a random offset.
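The relation between the three time fields can be shown in isolation. The helper below is a hypothetical sketch with fixed inputs instead of random ones: the trip starts after the waiting time and ends after the travel time, which follows from distance and average speed.

```python
def trip_timestamps(request_time_ms, wait_ms, distance_km, speed_kmh):
    """Derive start and end time (epoch milliseconds) of a trip from its request time."""
    trip_duration_ms = round(distance_km / speed_kmh * 3_600_000)  # hours -> milliseconds
    start_ms = request_time_ms + wait_ms
    return {
        "request_time_millis": request_time_ms,
        "trip_start_time_millis": start_ms,
        "trip_end_time_millis": start_ms + trip_duration_ms,
    }

# 100 km at 50 km/h takes 2 hours, i.e. 7,200,000 ms.
example = trip_timestamps(1_000_000, 60_000, 100, 50)
print(example)
```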

In [44]:
from kafka import KafkaProducer
import csv
import os
import random
import names
import time
import json
import requests

producer = KafkaProducer(bootstrap_servers=['pinot-kafka.pinot:9092'], value_serializer=lambda v: json.dumps(v).encode('utf-8'))

# Download the file of German cities with postcodes once and cache it locally
if not os.path.exists("./pgeocodeDE.txt"):
    response = requests.get("https://symerio.github.io/postal-codes-data/data/geonames/DE.txt")
    with open("./pgeocodeDE.txt", 'w', encoding='utf8') as out_file:
        out_file.write(response.text)
    del response

# Read the cached file; the context manager closes it again
with open('./pgeocodeDE.txt', encoding='utf8') as geocode_file:
    geocode_list = list(csv.reader(geocode_file, delimiter='\t'))[1:]  # skip first line (header)
random.shuffle(geocode_list)
geocode_list = geocode_list[:1000]  # keep only 1000 random places to generate more overlapping data

def choose_random_city():
    return random.choice(geocode_list)

# generate only 1000 driver/rider names to generate more overlapping data
names_list = []
for i in range(1000):
    names_list.append(names.get_full_name())

def choose_random_name():
    return random.choice(names_list)
    
# Generation of License Plate
# create a pool of letters to choose from
letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
numbers = '0123456789'

def generate_license_plate():
    # generate 4 randomly chosen letters, L1, L2, L3, L4
    L1 = random.choice(letters)
    L2 = random.choice(letters)
    L3 = random.choice(letters)
    L4 = random.choice(letters)
    # generate 2 randomly chosen digits, N1, N2
    N1 = random.choice(numbers)
    N2 = random.choice(numbers)

    # combine them into the pattern LL-LL-NN
    return L1 + L2 + '-' + L3 + L4 + '-' + N1 + N2

# Calculation of price based on distance between start city and end destination
def calculate_price(v_distance):
    v_multiplicator=round(random.uniform(0.8, 2.0),2)
    v_price=round(v_distance*v_multiplicator,2)
    return(v_price)

# begin generating trips data at current time
start_timestamp_ms = time.time_ns() // 1000000

# Generate data
num_records = 100000 + random.randint(5000,10000)
for i in range(num_records):
    v_start_location=choose_random_city()
    v_end_location=choose_random_city()
    v_distance = random.randint(5,1000)

    # add random jitter; in a large system the event stream is probably not strictly sorted either
    v_requesttime = start_timestamp_ms + i*1000 + random.randint(0,100)

    v_waiting_time_millis = random.randint(1,3600000)
    v_trip_time = round((v_distance/random.randint(45,60)) * 60 *60*1000)

    record = {
        "rider_name": choose_random_name(),
        "driver_name": choose_random_name(),
        "license_plate":generate_license_plate(),
        "start_location": v_start_location[2],
        "start_zip_code": v_start_location[1],
        "start_location_state": v_start_location[3],
        "end_location": v_end_location[2],
        "end_zip_code": v_end_location[1],
        "end_location_state": v_end_location[3],
        "rider_is_premium": random.randint(0, 1),
        "count": 1,
        "payment_amount": calculate_price(v_distance),
        "payment_tip_amount": random.randint(5,50),
        "trip_wait_time_millis": v_waiting_time_millis,
        "rider_rating": random.randint(0,5),
        "driver_rating": random.randint(0,5),
        "trip_start_time_millis": v_requesttime+v_waiting_time_millis,
        "request_time_millis": v_requesttime,
        "trip_end_time_millis": v_requesttime+v_waiting_time_millis+v_trip_time
    }
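    producer.send('trips', value=record)

    if i % 5000 == 0:
        print(f'{i} records generated')

# send() is asynchronous: flush so that buffered records are delivered before reporting
producer.flush()
print(f'{num_records} records generated')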

0 records generated
1 records generated


### Tables
Tables represent a collection of related data in Pinot. A table can be of type OFFLINE (ingesting pre-built Pinot segments from external stores), REALTIME (ingesting data from streams) or HYBRID (consisting of an OFFLINE and a REALTIME table). The user is not required to know the type of a table when executing a query.
To configure a table, properties such as name, type and indexing are required. In the following example, the table consumes data from the Kafka topic *trips*.
One Pinot feature that contributes to performance is pre-aggregation: real-time stream data is aggregated as it is consumed, which reduces segment sizes.
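What pre-aggregation does to the data can be sketched in plain Python. This is a simplified model, assuming that records with identical dimension and time values are collapsed into one row whose metrics are summed; the function `pre_aggregate` and the sample records are made up for illustration.

```python
from collections import defaultdict

def pre_aggregate(records, key_columns, metric_columns):
    """Collapse records that share all key columns into one row with summed metrics."""
    totals = defaultdict(lambda: defaultdict(float))
    for record in records:
        key = tuple(record[column] for column in key_columns)
        for metric in metric_columns:
            totals[key][metric] += record[metric]
    return [dict(zip(key_columns, key), **metrics) for key, metrics in totals.items()]

records = [
    {"driver_name": "A", "trip_start_time_millis": 1000, "count": 1, "payment_amount": 10.0},
    {"driver_name": "A", "trip_start_time_millis": 1000, "count": 1, "payment_amount": 5.5},
    {"driver_name": "B", "trip_start_time_millis": 1000, "count": 1, "payment_amount": 7.0},
]

rows = pre_aggregate(records, ["driver_name", "trip_start_time_millis"], ["count", "payment_amount"])
print(len(records), "records stored as", len(rows), "rows")
```

Because rows are merged, `COUNT(*)` no longer reflects the number of ingested events; summing a metric column such as `count` recovers it.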

In [46]:
json_tableConfig = {
  "tableName": "trips_aggregate",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "trip_start_time_millis",
    "timeType": "MILLISECONDS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "60",
    "schemaName": "trips",
    "replication": "1",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "simple",
      "stream.kafka.topic.name": "trips",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.zk.broker.url": "pinot-kafka-zookeeper:2181",
      "stream.kafka.broker.list": "pinot-kafka:9092",
      "realtime.segment.flush.threshold.time": "12h",
      "realtime.segment.flush.threshold.size": "20000",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    },
    "noDictionaryColumns": ["count", "payment_amount", "payment_tip_amount", "trip_wait_time_millis", "rider_rating", "driver_rating"],
    "aggregateMetrics": True
  },
  "metadata": {
    "customConfigs": {}
  }
} 

response = requests.post('http://pinot-controller.pinot:9000/tables', json=json_tableConfig)
print(response)
print(response.text)

<Response [200]>
{"status":"Table trips_aggregate_REALTIME succesfully added"}


After creation, the table loads the data records from the Kafka topic. To execute a query, the statement string is sent to the broker of the Pinot cluster. The response contains the result records as well as details about the execution.

In [64]:
print(requests.post('http://pinot-broker.pinot:8099/query/sql', json={
            "sql" : "SELECT COUNT(*) FROM trips_aggregate"
    }).json())

{'resultTable': {'dataSchema': {'columnNames': ['count(*)'], 'columnDataTypes': ['LONG']}, 'rows': [[308641]]}, 'exceptions': [], 'numServersQueried': 1, 'numServersResponded': 1, 'numSegmentsQueried': 16, 'numSegmentsProcessed': 16, 'numSegmentsMatched': 16, 'numConsumingSegmentsQueried': 1, 'numDocsScanned': 308641, 'numEntriesScannedInFilter': 0, 'numEntriesScannedPostFilter': 0, 'numGroupsLimitReached': False, 'totalDocs': 308641, 'timeUsedMs': 6, 'segmentStatistics': [], 'traceInfo': {}, 'minConsumingFreshnessTimeMs': 1618089542751}
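The broker response separates column names from row values. A small helper, sketched here under the assumption that the response has the shape shown above, turns the result table into a list of dictionaries and surfaces query exceptions; the name `rows_as_dicts` is hypothetical and the sample is a shortened copy of the output above.

```python
def rows_as_dicts(response):
    """Convert the resultTable of a broker response into one dict per row."""
    if response.get("exceptions"):
        raise RuntimeError(f"query failed: {response['exceptions']}")
    table = response["resultTable"]
    columns = table["dataSchema"]["columnNames"]
    return [dict(zip(columns, row)) for row in table["rows"]]

# Shortened copy of the response printed above.
sample = {
    "resultTable": {
        "dataSchema": {"columnNames": ["count(*)"], "columnDataTypes": ["LONG"]},
        "rows": [[308641]],
    },
    "exceptions": [],
    "numDocsScanned": 308641,
    "timeUsedMs": 6,
}

print(rows_as_dicts(sample))  # [{'count(*)': 308641}]
```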
