forked from apache/spark
[SPARK-9023] [SQL] Efficiency improvements for UnsafeRows in Exchange
This pull request aims to improve the performance of SQL's Exchange operator when shuffling UnsafeRows. It also makes several general efficiency improvements to Exchange. Key changes:

- When performing hash partitioning, the old Exchange projected the partitioning columns into a new row, then passed a `(partitioningColumnRow: InternalRow, row: InternalRow)` pair into the shuffle. This is very inefficient because it redundantly serializes the partitioning columns only to discard them immediately after the shuffle. After this patch's changes, Exchange shuffles `(partitionId: Int, row: InternalRow)` pairs. This still isn't optimal, since we are still shuffling extra data we don't need, but it is significantly more efficient than the old implementation; in the future, we may be able to optimize this further once we implement a new shuffle write interface that accepts non-key-value-pair inputs.
- Exchange's `compute()` method has been significantly simplified; the new code has less duplication and is therefore easier to understand.
- When Exchange's input operator produces UnsafeRows, Exchange uses a specialized `UnsafeRowSerializer` to serialize these rows. This serializer is significantly more efficient since it simply copies the UnsafeRow's underlying bytes. Note that this approach does not work for UnsafeRows that use the ObjectPool mechanism; I did not add support for this because we are planning to remove ObjectPool in the next few weeks.

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#7456 from JoshRosen/unsafe-exchange and squashes the following commits:

7e75259 [Josh Rosen] Fix cast in SparkSqlSerializer2Suite
0082515 [Josh Rosen] Some additional comments + small cleanup to remove an unused parameter
a27cfc1 [Josh Rosen] Add missing newline
741973c [Josh Rosen] Add simple test of UnsafeRow shuffling in Exchange.
359c6a4 [Josh Rosen] Remove println() and add comments
93904e7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-exchange
8dd3ff2 [Josh Rosen] Exchange outputs UnsafeRows when its child outputs them
dd9c66d [Josh Rosen] Fix for copying logic
035af21 [Josh Rosen] Add logic for choosing when to use UnsafeRowSerializer
7876f31 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-shuffle
cbea80b [Josh Rosen] Add UnsafeRowSerializer
0f2ac86 [Josh Rosen] Import ordering
3ca8515 [Josh Rosen] Big code simplification in Exchange
3526868 [Josh Rosen] Initial cut at removing shuffle on KV pairs
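The first change above can be sketched without any Spark dependencies: compute the partition id eagerly and pair it with the row, instead of materializing the partitioning columns into a separate key row that gets serialized and then thrown away. Everything here (`PartitionIdSketch`, the `Row` alias) is a hypothetical illustration of the idea, not Spark's actual API:

```scala
// Spark-free sketch of the two shuffle-input schemes described above.
object PartitionIdSketch {
  // Stand-in for an InternalRow: just a sequence of column values.
  type Row = Seq[Any]

  // Non-negative modulo: a hash partitioner must map any hash code
  // (possibly negative) into [0, numPartitions).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Old scheme: project the partitioning columns into a separate key row.
  // That key row is serialized through the shuffle, then discarded.
  def oldPairs(rows: Seq[Row], keyCols: Seq[Int]): Seq[(Row, Row)] =
    rows.map(r => (keyCols.map(r), r))

  // New scheme: hash the partitioning columns directly to an Int partition
  // id, so only a single Int rides along with each row.
  def newPairs(rows: Seq[Row], keyCols: Seq[Int], numPartitions: Int): Seq[(Int, Row)] =
    rows.map { r =>
      val pid = nonNegativeMod(keyCols.map(r).hashCode, numPartitions)
      (pid, r)
    }
}
```

Rows with equal partitioning columns hash to the same id, so the shuffle still groups them together while carrying only four extra bytes per record instead of a full projected row.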
Showing 8 changed files with 398 additions and 116 deletions.
sql/core/src/main/scala/org/apache/spark/sql/execution/ShuffledRowRDD.scala (80 additions, 0 deletions)
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution

import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.serializer.Serializer
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.DataType

private class ShuffledRowRDDPartition(val idx: Int) extends Partition {
  override val index: Int = idx
  override def hashCode(): Int = idx
}

/**
 * A dummy partitioner for use with records whose partition ids have been pre-computed (i.e. for
 * use on RDDs of (Int, Row) pairs where the Int is a partition id in the expected range).
 */
private class PartitionIdPassthrough(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}

/**
 * This is a specialized version of [[org.apache.spark.rdd.ShuffledRDD]] that is optimized for
 * shuffling rows instead of Java key-value pairs. Note that something like this should eventually
 * be implemented in Spark core, but that is blocked by some more general refactorings to shuffle
 * interfaces / internals.
 *
 * @param prev the RDD being shuffled. Elements of this RDD are (partitionId, Row) pairs.
 *             Partition ids should be in the range [0, numPartitions - 1].
 * @param serializer the serializer used during the shuffle.
 * @param numPartitions the number of post-shuffle partitions.
 */
class ShuffledRowRDD(
    @transient var prev: RDD[Product2[Int, InternalRow]],
    serializer: Serializer,
    numPartitions: Int)
  extends RDD[InternalRow](prev.context, Nil) {

  private val part: Partitioner = new PartitionIdPassthrough(numPartitions)

  override def getDependencies: Seq[Dependency[_]] = {
    List(new ShuffleDependency[Int, InternalRow, InternalRow](prev, part, Some(serializer)))
  }

  override val partitioner = Some(part)

  override def getPartitions: Array[Partition] = {
    Array.tabulate[Partition](part.numPartitions)(i => new ShuffledRowRDDPartition(i))
  }

  override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = {
    val dep = dependencies.head.asInstanceOf[ShuffleDependency[Int, InternalRow, InternalRow]]
    SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
      .read()
      .asInstanceOf[Iterator[Product2[Int, InternalRow]]]
      .map(_._2)
  }

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}
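The `UnsafeRowSerializer` that this RDD is paired with gets its speed from copying raw bytes rather than encoding fields one by one. That strategy can be sketched in a Spark-free way: treat each row as an opaque byte array and frame it with a length prefix, so deserialization is a single bulk read. `RowFraming` and the `-1` end-of-stream marker are illustrative assumptions for this sketch, not the serializer's actual wire format:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical sketch of length-prefixed row framing: each "row" is an
// opaque byte array written as a 4-byte length followed by the raw bytes,
// so reading a row back is one readFully with no per-field decoding.
object RowFraming {
  def writeRows(rows: Seq[Array[Byte]]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    rows.foreach { row =>
      out.writeInt(row.length)  // length prefix
      out.write(row)            // row bytes, copied verbatim
    }
    out.writeInt(-1)            // end-of-stream marker
    out.flush()
    bytes.toByteArray
  }

  def readRows(data: Array[Byte]): Seq[Array[Byte]] = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    Iterator.continually(in.readInt())
      .takeWhile(_ != -1)       // stop at the end marker
      .map { len =>
        val buf = new Array[Byte](len)
        in.readFully(buf)       // one bulk copy per row
        buf
      }
      .toSeq
  }
}
```

Because the bytes pass through untouched, this framing only works when a row's binary form is self-contained; that is exactly why the commit message notes the approach cannot handle UnsafeRows that reference an external ObjectPool.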