# Architecture
## About Pinot
Pinot is a real-time distributed OLAP datastore. It was built by engineers at LinkedIn and Uber.
LinkedIn operates Pinot clusters for real-time Online Analytical Processing. It divides its analytics applications into two main categories: internal applications and site-facing applications. Internal applications need to process large data volumes (trillions of records), but higher query latencies are tolerated here. In contrast, site-facing applications are available to hundreds of millions of LinkedIn members. These applications have a very high query volume and a much tighter latency budget.
Pinot production clusters at LinkedIn serve tens of thousands of queries per second. Overall, more than 50 analytical use cases are supported, and millions of records are ingested per second. [3]

Pinot is an in-memory database. It provides configuration settings for consuming and completing segments (the parts a table is divided into), so that off-heap memory can also be used efficiently.

Data ingestion is append-only: values cannot be modified after ingestion, for example with an UPSERT operation as known from databases like PostgreSQL. Pinot is therefore not a replacement for databases in an operational business environment, but it can enhance use cases that require fast analytics.

## Pinot Cluster

TBD: Show what components are running in the cluster? or part of cluster set up?

In [8]:
!kubectl get pods

/bin/bash: kubectl: command not found


## Architecture


**Apache Helix** is used for cluster management, and **Apache ZooKeeper** for coordination and for maintaining the overall cluster state and health.
The **Controller** provides a REST interface for CRUD operations on logical storage resources such as tables.
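To make the CRUD idea concrete, the following is a minimal sketch that maps each table operation to an HTTP method and controller URL. The base URL is the one used by the requests in this notebook. The `/tables` paths reflect the controller's REST API as we understand it and should be checked against the Swagger UI of the running controller. The helper `table_request` and the table name `myTable` are hypothetical.

```python
from urllib.parse import quote

# Base URL as used by the other requests in this notebook.
CONTROLLER_URL = "http://pinot-controller.pinot:9000"

# Assumed mapping of CRUD actions to controller endpoints; verify against Swagger.
ROUTES = {
    "list": ("GET", "/tables"),
    "create": ("POST", "/tables"),
    "read": ("GET", "/tables/{name}"),
    "update": ("PUT", "/tables/{name}"),
    "delete": ("DELETE", "/tables/{name}"),
}


def table_request(action, table_name=None, base_url=CONTROLLER_URL):
    """Return the (HTTP method, URL) pair for a CRUD action on tables."""
    method, path = ROUTES[action]
    if "{name}" in path:
        if not table_name:
            raise ValueError(f"action '{action}' needs a table name")
        # Escape the table name so it is safe to use as a path segment.
        path = path.format(name=quote(table_name, safe=""))
    return method, base_url.rstrip("/") + path


print(table_request("list"))
print(table_request("read", "myTable"))
```

The function only builds the request description and sends nothing, so it can be tried without a running cluster.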

The **Broker** routes queries to the appropriate server instances. It also maintains the query routing tables, which map each segment to the server it resides on. A table in Pinot consists of columns and rows and is divided horizontally into shards called segments. A segment either contains real-time data, or it is an offline segment into which data has been pushed. By default, the query load is balanced across all available servers. The broker returns one consolidated reply to the client, regardless of whether the table is split into real-time and offline segments.
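The routing idea can be illustrated with a small sketch. It is not Pinot's routing implementation: the segment and server names are made up, and the greedy "least-loaded replica" rule is a stand-in chosen only to show how a segment-to-server mapping can be turned into a balanced query plan whose partial results are merged into one reply.

```python
from collections import defaultdict

# Hypothetical routing table: segment name -> servers hosting a replica.
ROUTING_TABLE = {
    "events_OFFLINE_0": ["server-0", "server-1"],
    "events_OFFLINE_1": ["server-1", "server-2"],
    "events_REALTIME_0": ["server-0", "server-2"],
}


def plan_query(routing_table):
    """Assign every segment to exactly one replica, spreading the load."""
    load = defaultdict(int)
    plan = defaultdict(list)
    for segment in sorted(routing_table):
        # Pick the replica with the fewest segments so far (ties: by name).
        server = min(routing_table[segment], key=lambda s: (load[s], s))
        load[server] += 1
        plan[server].append(segment)
    return dict(plan)


def merge_counts(partial_results):
    """Combine the per-server partial counts into one consolidated reply."""
    return sum(partial_results.values())


plan = plan_query(ROUTING_TABLE)
print(plan)
# Pretend each server returned a partial COUNT(*) for its segments.
print(merge_counts({server: 10 * len(segs) for server, segs in plan.items()}))
```

Offline and real-time segments are treated alike here, which mirrors the point that the client sees a single answer regardless of how the table is split.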


<img src="images/Architecture.png" width="40%" height="40%">
Source: https://docs.pinot.apache.org/basics/architecture (accessed 4 April 2021)

Broker configurations are defined in the broker.conf file. The properties define, for example, the query timeout, the query port of the broker, and a query limit. The query limit protects broker and server against queries that return a large number of records without a limit of their own. A query limit needs to be enabled at cluster level.

In [3]:
import requests
import json
print("\033[1m" + "Cluster: "+ "\033[0m" + json.dumps((requests.get('http://pinot-controller.pinot:9000/cluster/configs')).json(), indent=2))

Cluster: {
  "allowParticipantAutoJoin": "true",
  "enable.case.insensitive": "false",
  "pinot.broker.enable.query.limit.override": "false",
  "default.hyperloglog.log2m": "8"
}


By default, Pinot queries are case-sensitive (enable.case.insensitive is false). The parameter pinot.broker.enable.query.limit.override is also set to false, which means the broker will not override the limit of a query when it is larger than the limit defined in the broker configuration.
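The effect of the override flag can be sketched in a few lines. This only models the behavior described above and is not taken from Pinot's source code. The function name `effective_limit` and the numbers are illustrative assumptions.

```python
def effective_limit(query_limit, broker_limit, override_enabled):
    """Return the row limit that is applied to a query.

    query_limit      -- LIMIT requested by the query
    broker_limit     -- limit configured for the broker (assumed value)
    override_enabled -- models pinot.broker.enable.query.limit.override
    """
    if override_enabled and query_limit > broker_limit:
        # The broker caps the query at its own configured limit.
        return broker_limit
    # Override disabled, or the query already stays within the broker limit.
    return query_limit


# With the cluster setting shown above (override disabled) the query wins.
print(effective_limit(5000, 1000, override_enabled=False))
# With the override enabled the broker limit caps the query.
print(effective_limit(5000, 1000, override_enabled=True))
```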

In [4]:
import json
import requests

print("\033[1m" + "Broker: "+ "\033[0m" + json.dumps((requests.get('http://pinot-controller.pinot:9000/v2/brokers/tenants')).json(), indent=2))
print("\033[1m" + "Health of Controller: "+ "\033[0m" + requests.get('http://pinot-controller.pinot:9000/pinot-controller/admin').text)

Broker: {
  "DefaultTenant": [
    {
      "instanceName": "Broker_pinot-broker-0.pinot-broker-headless.pinot.svc.cluster.local_8099",
      "host": "Broker_pinot-broker-0.pinot-broker-headless.pinot.svc.cluster.local",
      "port": 8099
    }
  ]
}
Health of Controller: GOOD


Because the broker is running on port 8099, queries sent via the REST API go to port 8099 in the following sections.
Information about the resources of the Pinot cluster, for example, is requested from the controller, which is running on port 9000.
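As a sketch of what such a query request could look like, the snippet below builds a POST request for the broker. The host name `pinot-broker.pinot` is an assumption that mirrors the controller's service name. The `/query/sql` path and the `{"sql": ...}` payload reflect the broker's REST API as we understand it and should be verified against the deployed version. The table name `myTable` is hypothetical.

```python
import json
import urllib.request

# Assumed broker address: service name guessed by analogy with the controller.
BROKER_URL = "http://pinot-broker.pinot:8099"


def build_sql_request(sql, broker_url=BROKER_URL):
    """Build (but do not send) a POST request carrying a SQL query."""
    payload = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=broker_url.rstrip("/") + "/query/sql",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, broker_url=BROKER_URL, timeout=10):
    """Send the query to the broker and return the decoded JSON reply."""
    request = build_sql_request(sql, broker_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_sql_request("SELECT COUNT(*) FROM myTable")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the example usable without cluster access. `run_query` is only needed once a broker is reachable.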

#### Sources
[1] S. A. Javadi, H. Gupta, R. Manhas, S. Sahu and A. Gandhi, "EASY: Efficient Segment Assignment Strategy for Reducing Tail Latencies in Pinot," 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, 2018, pp. 1432-1437, doi: 10.1109/ICDCS.2018.00144.

[2] Pinot development Team, "Pinot Documentation - Release 0.016", https://thirdeye.readthedocs.io/_/downloads/en/stable/pdf/

[3] https://arxiv.org/pdf/2002.05839.pdf