## Parquet Deserialization Benchmarks

To check whether Spark is introducing undue latency in simple counting/aggregation operations, this notebook uses [parquet4s](https://github.com/mjakubowski84/parquet4s) to read Parquet dumps of PLINK data directly and time those reads against Spark.

**Conclusion**: Running single-threaded, Spark takes roughly 9 to 10 seconds to traverse one ~100k-record partition of a VCF-like dataset. Other operations that do little more than traverse a file, like those in the Glow GWAS tutorial for counting unique variant/sample ids, take somewhere between 10 and 20 seconds, so simply deserializing the data appears to account for a large share of that time (roughly half or more). The direct Parquet reader was slower than Spark (about 15.5 seconds for a two-field projection and about 30 seconds for the full schema), so that comparison does not show that Spark itself adds overhead and contributes little further information.

In [1]:
import $ivy.`com.github.mjakubowski84::parquet4s-core:1.0.0`
import $ivy.`sh.almond::almond-spark:0.6.0`
import $ivy.`org.apache.spark::spark-sql:2.4.4`
import $file.^.init.paths, paths._
import $file.^.init.benchmark, benchmark._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.DataFrame
import com.github.mjakubowski84.parquet4s.ParquetReader
Logger.getLogger("org").setLevel(Level.WARN)

val data_dir = GWAS_TUTORIAL_DATA_DIR / "1_QC_GWAS"
val path = data_dir / "HapMap_3_r3_1.parquet"

val ss = {
  NotebookSparkSession
    .builder()
    .progress(enable=false, keep=false)
    .config("spark.sql.shuffle.partitions", "1")
    .config("spark.ui.enabled", "false")
    .config("spark.driver.host", "localhost")
    .master("local[1]") // Use single-threaded reads so timings are comparable
    .getOrCreate()
}
import ss.implicits._

Loading spark-stubs
Creating SparkSession


Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/01/13 23:07:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


[32mimport [39m[36m$ivy.$                                               
[39m
[32mimport [39m[36m$ivy.$                              
[39m
[32mimport [39m[36m$ivy.$                                  
[39m
[32mimport [39m[36m$file.$           , paths._
[39m
[32mimport [39m[36m$file.$               , benchmark._
[39m
[32mimport [39m[36morg.apache.spark.sql.functions._
[39m
[32mimport [39m[36morg.apache.spark.sql._
[39m
[32mimport [39m[36morg.apache.log4j.{Level, Logger}
[39m
[32mimport [39m[36morg.apache.spark.sql.DataFrame
[39m
[32mimport [39m[36mcom.github.mjakubowski84.parquet4s.ParquetReader
[39m
data_dir: better.files.File = /home/eczech/data/gwas/tutorial/1_QC_GWAS
path: better.files.File = /home/eczech/data/gwas/tutorial/1_QC_GWAS/HapMap_3_r3_1.parquet
ss: SparkSession = org.apache.spark.sql.SparkSession@5ed9e86a
[32mimport [39m[36mss.implic

In [2]:
ss.read.parquet(path.toString).printSchema

root
 |-- contigName: string (nullable = true)
 |-- names: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- position: double (nullable = true)
 |-- start: long (nullable = true)
 |-- end: long (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAlleles: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- genotypes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- sampleId: string (nullable = true)
 |    |    |-- calls: array (nullable = true)
 |    |    |    |-- element: integer (containsNull = true)



In [3]:
// Select a single file to compare file deserialization times on
val file = path.glob("*.parquet").toList(0).toString

file: String = "/home/eczech/data/gwas/tutorial/1_QC_GWAS/HapMap_3_r3_1.parquet/part-00005-72073dbc-7b2b-49c4-91c5-a14cdb8b553d-c000.snappy.parquet"

In [4]:
ss.read.parquet(file.toString).count

res3: Long = 99865L

In [6]:
// Test reads on a projection with two scalar (i.e. small) fields
case class Record (
    contigName: Option[String],
    position: Option[Double]
)

(1 to 5).foreach(_ => time {
    val records = ParquetReader.read[Record](file.toString)
    records.size // traverse once
    records.close()
})

Elapsed time: 15.5 seconds
Elapsed time: 15.4 seconds
Elapsed time: 15.7 seconds
Elapsed time: 15.5 seconds
Elapsed time: 15.4 seconds


defined class Record
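
The `time` helper used above is imported from the `benchmark` script, whose source is not shown in this notebook. As a rough illustration only, a helper with the same output format could look like the sketch below; the signature and body are assumptions, not the actual definition.

```scala
// Hypothetical stand-in for the `time` helper from the benchmark script.
// The real implementation is not shown in the notebook; this is only a sketch.
def time[T](block: => T): T = {
  val start = System.nanoTime()
  val result = block // the by-name argument is evaluated exactly once here
  val elapsedSeconds = (System.nanoTime() - start) / 1e9
  println(f"Elapsed time: $elapsedSeconds%.1f seconds")
  result
}

// Usage: time a small in-memory traversal and keep its result
val n = time { (1 to 1000000).iterator.size }
```

Taking the block by name means the timed expression only runs inside the helper, so the measurement covers the whole read-and-traverse step and nothing else.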

In [7]:
// Test traversal over full projection
case class Genotype (
    sampleId: Option[String],
    calls: Array[Int]
)
case class Record (
    contigName: Option[String],
    names: Option[Array[String]],
    position: Option[Double],
    start: Option[Long],
    end: Option[Long],
    referenceAllele: Option[String],
    alternateAlleles: Option[Array[String]],
    genotypes: Option[Array[Genotype]]
)

(1 to 5).foreach(_ => time {
    val records = ParquetReader.read[Record](file.toString)
    records.size // traverse once
    records.close()
})

Elapsed time: 29.7 seconds
Elapsed time: 29.8 seconds
Elapsed time: 29.7 seconds
Elapsed time: 29.8 seconds
Elapsed time: 30.0 seconds


defined class Genotype
defined class Record

In [5]:
// Compare to Spark timings by forcing Spark to compute an arbitrary
// aggregation that touches many of the fields, especially the genotypes
// field, since it is by far the largest
(1 to 5).foreach(_ => time {
    ss.read.parquet(file.toString)
    .agg(
        max(length($"contigName")) + 
        sum($"position") + 
        sum(size($"genotypes")) + 
        sum(size($"names")) + 
        sum(size($"alternateAlleles"))
    )
    .collect()
})

Elapsed time: 10.2 seconds
Elapsed time: 9.2 seconds
Elapsed time: 9.2 seconds
Elapsed time: 9.1 seconds
Elapsed time: 9.2 seconds
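
To put the three sets of timings side by side, the sketch below averages the elapsed times printed in the cells above and expresses the parquet4s runs as multiples of the Spark run. The numbers are copied by hand from the outputs; the variable names and the `mean` helper are illustrative and not part of the notebook.

```scala
// Elapsed times in seconds, copied from the cell outputs above
val parquet4sTwoFields = Seq(15.5, 15.4, 15.7, 15.5, 15.4)
val parquet4sFull      = Seq(29.7, 29.8, 29.7, 29.8, 30.0)
val sparkAggregation   = Seq(10.2, 9.2, 9.2, 9.1, 9.2)

// Arithmetic mean of a non-empty sequence
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

val twoFieldMean = mean(parquet4sTwoFields)
val fullMean     = mean(parquet4sFull)
val sparkMean    = mean(sparkAggregation)

println(f"parquet4s, two fields: $twoFieldMean%.2f s")
println(f"parquet4s, full:       $fullMean%.2f s")
println(f"Spark aggregation:     $sparkMean%.2f s")
println(f"two fields / Spark:    ${twoFieldMean / sparkMean}%.2f")
println(f"full / Spark:          ${fullMean / sparkMean}%.2f")
```

On these figures the two-field parquet4s read is about 1.7 times slower than the Spark aggregation and the full projection about 3.2 times slower. The first Spark run (10.2 seconds) is included in the mean, so the steady-state Spark time is slightly lower than the average suggests.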
