
Commit 9d7a521

Add basic documentation.

haohui committed Oct 8, 2017
1 parent 2d6360f commit 9d7a521
Showing 39 changed files with 2,593 additions and 0 deletions.
13 changes: 13 additions & 0 deletions README.md
@@ -0,0 +1,13 @@
[![ReadTheDocs][doc-img]][doc]

# AthenaX: SQL-based streaming analytic platform at scale

AthenaX is a streaming analytics platform that enables users to run production-quality, large-scale streaming analytics using Structured Query Language (SQL). Originally developed at [Uber Technologies][ubeross], AthenaX scales across hundreds of machines and processes hundreds of billions of real-time events every day.

See also:

* AthenaX [documentation][doc] for getting started, operational details, and other information.
* Blog post [Introducing AthenaX, Uber Engineering’s Open Source Streaming Analytics Platform](https://eng.uber.com/athenax/).

[doc-img]: https://readthedocs.org/projects/athenax/badge/?version=latest
[doc]: http://athenax.readthedocs.org/en/latest/
[ubeross]: http://uber.github.io
20 changes: 20 additions & 0 deletions docs/README.md
@@ -0,0 +1,20 @@
[![ReadTheDocs][doc-img]][doc]

# AthenaX Documentation

This directory contains the AthenaX documentation hosted on [Read the Docs][doc].

## Building

The documentation is built with [MkDocs](http://www.mkdocs.org/).
You need to have [virtualenv](https://virtualenv.pypa.io/en/stable/) installed.

Then simply run `make docs`.

## Deploying

Raise a PR on GitHub. Once the PR is approved and merged,
ask one of the AthenaX maintainers to kick off a build on [readthedocs](https://readthedocs.org/projects/athenax/).

[doc-img]: https://readthedocs.org/projects/athenax/badge/?version=latest
[doc]: http://athenax.readthedocs.org/en/latest/
39 changes: 39 additions & 0 deletions docs/design.md
@@ -0,0 +1,39 @@
# Design

## Overview

![Architecture](images/architecture.svg)
***

The figure above shows the overall architecture of AthenaX, which consists of two major components. The first is the AthenaX master, which manages the full lifecycle of jobs. The second is the catalogs and connectors, which describe how AthenaX jobs interact with the external world.

To run an AthenaX job, the user first submits it through the [REST APIs](https://github.com/uber/AthenaX/blob/master/athenax-backend/src/main/resources/athenax-backend-api.yaml). The AthenaX master then compiles the job into a native Flink application and deploys it to the desired cluster. After deployment, the master continuously monitors the status of the job and recovers it in case of failures.

## Design considerations

The subsections below use the following terminology:

* *Job*. A job describes a streaming analytics application. It consists of the SQL, the clusters that the job should run on, and the amount of resources (i.e., memory and CPU cores) that the job requires.
* *Catalog*. A catalog describes how to translate a table in the SQL to a data source or data sink.
* *Cluster*. A cluster is a YARN cluster that is capable of running AthenaX jobs.
* *Instance*. An instance is a Flink application running on a specific cluster, realized from a specific AthenaX job: AthenaX compiles the job down to a Flink job and realizes it on that cluster (see the sketch below).
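
To make the terminology concrete, here is a minimal sketch of how a job might be modeled as a value class. The class and field names are illustrative assumptions, not AthenaX's actual data model.

```java
// Illustrative sketch only: the class and field names are assumptions,
// not AthenaX's actual data model.
import java.util.List;

/** A job: the SQL plus the clusters and resources it should run with. */
final class JobDefinition {
  final String sql;            // the streaming SQL statement
  final List<String> clusters; // YARN clusters the job should run on
  final int vCores;            // CPU cores required per instance
  final long memoryMb;         // memory required per instance, in MB

  JobDefinition(String sql, List<String> clusters, int vCores, long memoryMb) {
    this.sql = sql;
    this.clusters = clusters;
    this.vCores = vCores;
    this.memoryMb = memoryMb;
  }
}
```

An instance, by contrast, pairs such a definition with the handle of a concrete Flink application on one cluster.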

### Plugging in catalogs and connectors

We found that it is essential to allow users to plug in their own catalogs and connectors. Recall that catalogs specify how to map a table in SQL to data sources or data sinks, while connectors specify how AthenaX should interact with external systems (e.g., publishing to a Kafka topic or making RPC calls).

User-defined catalogs must implement [AthenaXTableCatalogProvider](https://github.com/uber/AthenaX/blob/master/athenax-vm-api/src/main/java/com/uber/athenax/vm/api/AthenaXTableCatalogProvider.java). A connector must implement [DataSinkProvider](https://github.com/uber/AthenaX/blob/master/athenax-vm-api/src/main/java/com/uber/athenax/vm/api/DataSinkProvider.java) if it provides a data sink, or [TableSourceConverter](https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/table/catalog/TableSourceConverter.scala) if it provides a data source. Please see the [Kafka-JSON connector](https://github.com/uber/AthenaX/tree/master/athenax-vm-connectors/athenax-vm-connector-kafka) for more details.
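
As a rough illustration of the plug-in surface, a user-defined catalog provider might look like the sketch below. The method set and signatures are assumptions modeled on the linked interface, and `MyKafkaCatalog` is hypothetical; consult `AthenaXTableCatalogProvider.java` for the authoritative API.

```java
// Rough sketch of a user-defined catalog provider. The method signatures are
// assumptions modeled on the linked interface; MyKafkaCatalog is hypothetical.
import com.uber.athenax.vm.api.AthenaXTableCatalog;
import com.uber.athenax.vm.api.AthenaXTableCatalogProvider;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class MyCatalogProvider implements AthenaXTableCatalogProvider {
  @Override
  public Map<String, AthenaXTableCatalog> getInputCatalog(String cluster) {
    // Expose a hypothetical Kafka topic "orders" as a table visible to the SQL.
    return Collections.singletonMap("input", new MyKafkaCatalog("orders"));
  }

  @Override
  public AthenaXTableCatalog getOutputCatalog(String cluster, List<String> outputs) {
    // Resolve the sink tables that the job writes to.
    return new MyKafkaCatalog(outputs);
  }
}
```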

### Compiling AthenaX jobs to Flink applications

To execute AthenaX jobs efficiently, AthenaX compiles them into native Flink applications using Flink's [Table and SQL APIs](https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/table/index.html). At a high level, AthenaX combines the catalogs and the parameters (e.g., parallelism) with the SQL specified in the job and compiles it into a Flink application, which is represented as a [JobGraph](https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/job_scheduling.html) in Flink.

AthenaX supports SQL with user-defined functions (UDFs). Users can specify additional JARs to be loaded along with the SQL. To compile such jobs safely, AthenaX compiles the SQL in a dedicated [process](https://github.com/uber/AthenaX/blob/master/athenax-vm-compiler/src/main/java/com/uber/athenax/vm/compiler/executor/ContainedExecutor.java). Note, however, that this functionality, particularly localizing the UDF JARs, has not been fully implemented in the current version.
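
The isolation can be pictured as running the compilation in a child JVM, so that a misbehaving UDF JAR cannot take down the master. The sketch below conveys the idea only; the entry-point class and flags are assumptions, not the actual `ContainedExecutor`.

```java
// Conceptual sketch: compile the SQL in an isolated child JVM, in the spirit
// of ContainedExecutor. The entry-point class and flags are assumptions.
import java.io.File;
import java.io.IOException;

final class IsolatedCompiler {
  static int compileInChildJvm(String classpath, String sqlFile)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder(
            "java", "-Xmx512m",          // bound the compiler JVM's memory
            "-cp", classpath,            // would include the user's UDF JARs
            "com.example.CompilerMain",  // hypothetical compiler entry point
            sqlFile)
        .redirectErrorStream(true)
        .redirectOutput(new File("compile.log"))
        .start();
    return p.waitFor();                  // non-zero exit code = failure
  }
}
```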

### Managing job instances

AthenaX abstracts the running status of jobs using two states. The first is the desired state, which describes the clusters and the resources that the job should start with. The second is the actual state, which contains the same information for all realized instances of the job.

The [watchdog](https://github.com/uber/AthenaX/blob/master/athenax-backend/src/main/java/com/uber/athenax/backend/server/jobs/WatchdogPolicy.java) periodically computes the actual states and compares them with the desired states. Based on the differences, the AthenaX master starts or kills the corresponding YARN applications. In this model, starting a new job is just a special case of recovering a failed one. Auto scaling is handled similarly.
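
The reconciliation amounts to a small diff loop over the two states, as in the conceptual sketch below. The types and method names are assumptions for illustration, not the actual `WatchdogPolicy` API.

```java
// Conceptual sketch of desired-vs-actual reconciliation, in the spirit of
// the watchdog. All types and method names are assumptions.
import java.util.Map;

final class Reconciler {
  /** Placeholder for the cluster and resources a job instance should have. */
  static final class JobSpec { /* cluster, vcores, memory, ... */ }

  void reconcile(Map<String, JobSpec> desired, Map<String, JobSpec> actual) {
    // Start whatever should exist but does not. Launching a new job and
    // recovering a failed one are the same code path.
    for (Map.Entry<String, JobSpec> e : desired.entrySet()) {
      if (!actual.containsKey(e.getKey())) {
        startYarnApplication(e.getKey(), e.getValue());
      }
    }
    // Kill whatever is running but is no longer desired.
    for (String jobId : actual.keySet()) {
      if (!desired.containsKey(jobId)) {
        killYarnApplication(jobId);
      }
    }
  }

  void startYarnApplication(String jobId, JobSpec spec) { /* YARN client call */ }
  void killYarnApplication(String jobId) { /* YARN client call */ }
}
```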

One thing worth noting is that all actual states are "soft states", that is, they can be recovered by scanning the YARN clusters. This design allows the AthenaX master to run in multiple availability zones, provided that (1) the underlying persistent data store is available in the deployed availability zones, and (2) the watchdog is aware of the active masters in the multiple availability zones.
45 changes: 45 additions & 0 deletions docs/examples.md
@@ -0,0 +1,45 @@
# Examples

This section briefly walks through a few examples of building streaming analytics applications using SQL. Please read Flink's [Table and SQL API](https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/table/sql.html) documentation for more details.

## Transforming a stream

The following query selects two fields from the table:

```sql
SELECT
  a, b
FROM
  orders
```

## Aggregation over group windows

Aggregation over a group window is one of the most common usage patterns in streaming analytics. The following query computes, every minute, the total number of orders over the last 10 minutes:

```sql
SELECT
  COUNT(*)
FROM
  orders
GROUP BY
  HOP(rowtime, INTERVAL '1' MINUTE, INTERVAL '10' MINUTE)
```

## Using user-defined functions

AthenaX supports user-defined functions (UDFs) in queries. A UDF becomes available in a query once it has been registered through the [CREATE FUNCTION](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL) statement. For example, the following query uses the `GetRegion` UDF to compute the region ID from the longitude and latitude fields in the orders table:

```sql
CREATE FUNCTION
  GetRegion
AS
  'com.uber.athenax.geo.GetRegion'
USING JAR
  'http://.../geo.jar';

SELECT
  GetRegion(lat, lng)
FROM
  orders
```
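
On the Java side, a UDF such as `GetRegion` would typically be a Flink `ScalarFunction` whose public `eval` method is invoked per row. The sketch below shows one plausible shape; the lookup logic is a placeholder, not the actual Uber implementation.

```java
// One plausible shape for the GetRegion UDF: a Flink ScalarFunction whose
// public eval() method is resolved by signature and invoked per row.
// The lookup logic below is a placeholder, not a real geo index.
import org.apache.flink.table.functions.ScalarFunction;

public class GetRegion extends ScalarFunction {
  public String eval(double lat, double lng) {
    // A real implementation would consult an index of region polygons.
    if (lat >= 37.0 && lat < 38.0 && lng >= -123.0 && lng < -122.0) {
      return "san_francisco";
    }
    return "unknown";
  }
}
```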
68 changes: 68 additions & 0 deletions docs/getting_started.md
@@ -0,0 +1,68 @@
# Getting Started

## Building AthenaX and Flink

To run AthenaX, you need to build both AthenaX and Flink. Building both requires Java 8 and Maven 3.

### Build AthenaX

```bash
$ git clone git@github.com:uber/AthenaX.git
$ cd AthenaX
$ mvn clean install
```

### Build Flink

```bash
$ git clone git@github.com:apache/flink.git
$ cd flink
$ mvn clean install
```

## Configuring AthenaX

The next step is to write a YAML-based configuration file. Here is an example:

```yaml
athenax.master.uri: http://localhost:8083
catalog.impl: com.foo.MyCatalogProvider
clusters:
  foo:
    yarn.site.location: hdfs:///app/athenax/yarn-site.xml
    athenax.home.dir: hdfs:///tmp/athenax
    flink.uber.jar.location: hdfs:///app/athenax/flink.jar
    localize.resources:
      - http://foo/log4j.properties
    additional.jars:
      - http://foo/connectors.jar
      - http://foo/foo.jar
extras:
  foo: bar
```

The configuration entries have the following meanings:

Entry | Required | Description
----------------------- | -------- | -----------
athenax.master.uri | Yes | The REST endpoint that the AthenaX master listens on.
catalog.impl | Yes | The class name of your [catalog provider](https://github.com/uber/AthenaX/blob/master/athenax-vm-api/src/main/java/com/uber/athenax/vm/api/AthenaXTableCatalogProvider.java).
clusters | Yes | Describes the YARN clusters; each cluster is a sub-entry of this configuration.
yarn.site.location | Yes | The location of the `yarn-site.xml` that contains the Hadoop-specific configuration of the cluster. It is used by all instances in the cluster.
athenax.home.dir | Yes | A temporary directory used when starting instances.
flink.uber.jar.location | Yes | The location of the Flink uber JAR generated in the previous step.
localize.resources | Yes | Additional files that will be localized and shipped with all job instances but not added to their classpaths (e.g., `log4j.properties`).
additional.jars | Yes | Additional JARs that will be localized and added to the classpaths of the instances (e.g., the connector JARs and their dependencies).
extras | Yes | Additional configuration that can be used by your customizations.

Please see [AthenaXConfiguration](https://github.com/uber/AthenaX/blob/master/athenax-backend/src/main/java/com/uber/athenax/backend/server/AthenaXConfiguration.java) for more details on the configuration.

## Starting AthenaX

```bash
$ java -jar athenax-backend-0.1-SNAPSHOT.jar --conf <your configuration>
```

AthenaX will start serving the [REST API](https://github.com/uber/AthenaX/blob/master/athenax-backend/src/main/resources/athenax-backend-api.yaml) on the configured endpoint, and users can then start submitting jobs.
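
As a rough sketch of what submission might look like from Java, the snippet below POSTs a payload to the master. Both the endpoint path and the payload shape are assumptions; the authoritative contract is the REST API definition linked above.

```java
// Rough sketch of submitting a job over the REST API. The endpoint path and
// payload shape are assumptions; consult athenax-backend-api.yaml for the
// authoritative contract.
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SubmitJob {
  public static void main(String[] args) throws Exception {
    String body = "{\"query\": \"SELECT a, b FROM orders\"}";  // hypothetical payload
    URL url = new URL("http://localhost:8083/jobs");           // hypothetical path
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP " + conn.getResponseCode());
  }
}
```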

## Customizing AthenaX

AthenaX is designed to be pluggable in order to satisfy the needs of different environments. The [catalog](https://github.com/uber/AthenaX/blob/master/athenax-vm-api/src/main/java/com/uber/athenax/vm/api/AthenaXTableCatalogProvider.java), [connectors](https://github.com/uber/AthenaX/blob/master/athenax-vm-api/src/main/java/com/uber/athenax/vm/api/DataSinkProvider.java), [data store](https://github.com/uber/AthenaX/blob/master/athenax-backend/src/main/java/com/uber/athenax/backend/server/jobs/JobStore.java), and the [watchdog](https://github.com/uber/AthenaX/blob/master/athenax-backend/src/main/java/com/uber/athenax/backend/server/jobs/WatchdogPolicy.java) are all pluggable. Please refer to the respective implementations for more details.
