
Performance Tuning


Segment mode

  • vlt - The vlt mode is the newly designed segment mode and is now the default mode in IndexR. It is the fastest storage format on Hadoop in the world, offering over 2x the scan speed of Parquet with only 75% of the disk cost. It is a great fit for most cases.

  • basic - The basic mode uses the compression algorithm from open source Infobright, see here. It can compress your data to an extremely small size; we get a 10:1 compression ratio in our production environment, while the scan speed is still faster than many other storage formats. If you want to store a very large data set with minimal disk cost, like hundreds of TBs of historical trading data accumulated over years, basic mode is a good choice. With the Rough Set Index used in IndexR, you can still access your data quickly.
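If you create IndexR tables through Hive, the segment mode can be chosen per table. A minimal sketch, assuming the 'indexr.segment.mode' table property of the IndexR Hive plugin (verify the property name against your version):

TBLPROPERTIES (
  ...
  'indexr.segment.mode'='vlt',
  ...
)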

Field Types

Try using number types; they are much more CPU and IO efficient than strings.
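For example, a timestamp is better stored as a long than as a formatted string, and enum-like values are better stored as small integers than as their names. A hypothetical column definition in the IndexR JSON schema (the column name is made up): instead of

{"name": "event_time", "dataType": "string"}

prefer

{"name": "event_time", "dataType": "long"}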

Table schema

  • Add an outer index to columns. (v0.5.0+)

We currently support inverted indexes. An outer index can speed up filtering, especially for string types. e.g. in an IndexR table schema:

{
    "schema":{
        "columns":
        [
            {"name": "col0", "dataType": "string", "index": true},
            ...
        ]
    },
    ...
}

and in Hive table schema:

TBLPROPERTIES (
  ...
  'indexr.index.columns'='col0,col1',
  ...
) 

You should re-import your data after a schema update for the index to take effect.
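For example, an equality or IN filter on an indexed string column can then skip whole chunks of data instead of scanning them. A hypothetical query (the table name test is made up):

SELECT count(*) FROM test WHERE col0 = 'some_value';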

  • Use agg settings if you can.

You can set grouping to true if you want to pre-aggregate those events. The dims are not only used for rolling up rows by dimensions, but also for sorting the rows. Sorting is good for both compression and indexing. e.g.

{
    "schema":{
        "columns":
        [
            {"name": "c0", "dataType": "int"},
            {"name": "c1", "dataType": "string"},
            {"name": "c2", "dataType": "int"}
        ]
    },
    "agg":{
        "grouping": true,
        "dims": [
            "c0",
            "c1"
        ],
        "metrics": [
            {"name": "c2", "agg": "sum"}
        ]
    }
}

The rows in realtime segments will be arranged like SELECT c0, c1, sum(c2) FROM XXX GROUP BY c0, c1 ORDER BY c0, c1.

And don't forget to specify the agg setting in the Hive table schema.
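A sketch of the corresponding Hive table properties, assuming the 'indexr.agg.*' property names of the IndexR Hive plugin (verify them against your version):

TBLPROPERTIES (
  ...
  'indexr.agg.grouping'='true',
  'indexr.agg.dims'='c0,c1',
  'indexr.agg.metrics'='c2:sum',
  ...
)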

Drill settings

Optimize Drill. The most important settings:

  • planner.width.max_per_node - Max scan threads per node. Set it to a reasonable value to prevent big or slow queries from swallowing all machine resources and interfering with other queries.
  • planner.width.max_per_query - Max scan threads per query.
  • planner.in_subquery_threshold - The threshold above which an IN condition is translated into a JOIN. We normally set it to 100.

For example, if you have 12 cores and 24 threads on each node, a good starting point is the values below; then tweak them to balance single-query performance against concurrency.

alter system set `planner.width.max_per_node` = 5;
alter system set `planner.width.max_per_query` = 1000;
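And the IN condition threshold mentioned above:

alter system set `planner.in_subquery_threshold` = 100;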

You can also follow the official Drill documentation for further tuning: https://drill.apache.org/docs/performance-tuning-introduction/.

HDFS Short-Circuit Local Reads

IndexR has received a lot of IO optimization since version 0.4.1, and HDFS Short-Circuit Local Reads is highly recommended with IndexR. It minimizes the HDFS client overhead of disk IO. Normally you can see at least a 35% query speedup after enabling Short-Circuit Local Reads.

Steps:

  • On each Hadoop datanode machine, create the directory /var/lib/hadoop-hdfs if it does not exist yet, and chown it to the user running the Hadoop process.
mkdir -p /var/lib/hadoop-hdfs
chown -R hadoop /var/lib/hadoop-hdfs
  • Modify hdfs-site.xml in both Hadoop (e.g. etc/hadoop/hdfs-site.xml) and Drill (conf/hdfs-site.xml), add
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
  </property>
  • Add -Djava.library.path=<hadoop native lib dir> to DRILL_JAVA_OPTS in drill-env.sh. For example, suppose your Hadoop is installed in /usr/local/hadoop, then:
export DRILL_JAVA_OPTS="... -Djava.library.path=/usr/local/hadoop/lib/native"

  • Restart the HDFS cluster and the Drill cluster. (A restart sketch follows this list.)
  • Run some queries, and check your datanode log files. Once you find log entries like the one below, you are good to go!
2017-05-17 12:15:08,855 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: 40f0d080fa46195d4f3e0c425622f707, slotIdx: 126, srvID: b7d8fa4c-13c5-4cfc-abff-044effc64db7, success: true
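For the restart step above, a typical sketch for a tarball install (paths depend on your deployment):

# on the HDFS cluster
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh

# on each Drill node
$DRILL_HOME/bin/drillbit.sh restart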

After you are done with the configuration, please check carefully whether Short-Circuit Local Reads is actually working. If it is not, you may have run into a Hadoop version incompatibility: the libhadoop.so (see here) may not be compatible across different Hadoop versions. Drill by default builds against Hadoop 2.7.1, so you may need to recompile your Drill distribution against your Hadoop version. Follow this commit to modify some Drill code and recompile it. Note that you should change hadoop.version to your own Hadoop cluster's version!
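One quick way to verify that the native Hadoop library can actually be loaded is Hadoop's built-in checknative tool; run it on each node:

hadoop checknative -a

If its output shows hadoop: false, libhadoop.so is not being picked up, and reads will silently fall back to going through the datanode over TCP.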