
Matrix API Optimization summary and a few local mode benchmarks


This page summarizes the Scalding Matrix API optimization work that was done as part of the Google Summer of Code project, including a few benchmarks comparing the old and new Matrix API in local mode. The main goal was to optimize the matrix product, because it is a quite expensive operation on Map/Reduce. This was done mainly in two ways:

1) Making use of associativity, i.e. (AB)C = A(BC) (disclaimer: values may differ slightly in floating point), even though the costs of the two sides may be very different. For that, the well-known dynamic programming algorithm for matrix chain multiplication (http://en.wikipedia.org/wiki/Matrix_chain_multiplication) was used, making decisions based on the provided SizeHints, with a slight change: the costs of subchains are multiplied by their lengths, to ensure that more spread out ("bushy") trees/plans are preferred (a sketch follows below).

2) Reusing computed pipes. This becomes very useful in graph propagations G^k V – for G^k alone, only a fraction of all joins has to be computed; e.g. G^8 planned as (((G G) (G G)) ((G G) (G G))) needs only three distinct products instead of seven.
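To make the planning step concrete, below is a minimal standalone sketch of the matrix chain DP with the length-weighted cost described above. It is only an illustration of the idea in plain Scala – the cost formula is my reading of the rule, and it is not the actual Scalding planner code (which works on SizeHints rather than exact dimensions):

```scala
// Illustrative sketch only: matrix-chain ordering where each subchain's cost is
// weighted by its length, so more balanced ("bushy") plans are preferred.
// dims has n+1 entries for a chain of n matrices: matrix i is dims(i-1) x dims(i).
object ChainPlanSketch {
  def bestPlan(dims: Array[Long]): (Long, String) = {
    val n = dims.length - 1
    val cost = Array.fill(n + 1, n + 1)(0L)
    val plan = Array.tabulate(n + 1, n + 1)((i, _) => s"A$i")
    for (len <- 2 to n; i <- 1 to n - len + 1) {
      val j = i + len - 1
      var best = Long.MaxValue
      var bestSplit = ""
      for (k <- i until j) {
        // weight each subchain's cost by its length, then add the cost of the top product
        val c = (k - i + 1) * cost(i)(k) + (j - k) * cost(k + 1)(j) +
          dims(i - 1) * dims(k) * dims(j)
        if (c < best) { best = c; bestSplit = s"(${plan(i)(k)} ${plan(k + 1)(j)})" }
      }
      cost(i)(j) = best
      plan(i)(j) = bestSplit
    }
    (cost(1)(n), plan(1)(n))
  }

  def main(args: Array[String]): Unit =
    // the dimensions of the five-matrix chain benchmarked below
    println(bestPlan(Array(35L, 15L, 5L, 10L, 200L, 250L)))
}
```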

After consulting with @posco, we chose to write the optimized Matrix API from scratch rather than optimizing the existing one – firstly, it was easier to incorporate the above-mentioned dynamic programming algorithm; secondly, the new API uses TypedPipes (the original Matrix API pre-dates the Typed API), which also (I think) brings some performance improvement.

Other smaller optimizations were made in different areas. If there is a chain of matrix sums, they are all done in one groupBy (see the sketch after this paragraph). If we have a tree like (A * B) + C, the product A*B is computed into the outer sum, so that the whole expression can be done in two M/R passes (otherwise it would require three). Results of briefly benchmarking these features are shown below.
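As a sketch of the sum-chain optimization: if matrices are kept as TypedPipes of (row, col, value) entries, all the addends can be concatenated and pushed through a single sumByKey, so one shuffle suffices no matter how many matrices are summed. The helper below and the entry representation are my own illustration, not the Matrix2 internals:

```scala
import com.twitter.scalding.typed.TypedPipe

// Illustrative only: sum an arbitrary number of matrices with a single groupBy/shuffle.
def sumAll(matrices: Seq[TypedPipe[(Int, Int, Double)]]): TypedPipe[(Int, Int, Double)] =
  matrices
    .reduce(_ ++ _)                          // concatenating pipes is map-side only
    .map { case (r, c, v) => ((r, c), v) }
    .sumByKey                                // one shuffle adds up all the matrices at once
    .toTypedPipe
    .map { case ((r, c), v) => (r, c, v) }
```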

There was also some less significant work that I did not benchmark, namely:

  • implementing the Hadamard product;
  • optimizing negation and difference, using the algebraic rule (-A)B = A(-B), so that only the smaller matrix is negated (a small sketch follows this list);
  • adding scalar operations and optimizing multiplication by a scalar (same idea as the previous case);
  • optimizing traces: the trace of a sum of matrices is the sum of their traces, and trace(BA) = trace(AB) even though the costs of the two products may differ;
  • optimizing row L2 normalization, where computing the column sum of a sum of matrices equals computing the sum of their column sums;
  • adding some extra features, such as infinite column and row vectors, which can be used in intermediate computations during planning.
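As a small illustration of the negation rule: when a product has to be negated, the sign can be pushed onto whichever operand is expected to be smaller. The toy Mat type and the dense rows*cols estimate below stand in for Scalding's SizeHint machinery; this is only a sketch of the decision, not the real optimizer code:

```scala
// Illustrative sketch: by (-A)B = A(-B), negate the operand with fewer expected entries.
case class Mat(name: String, rows: Long, cols: Long, negated: Boolean = false) {
  def unary_- : Mat = copy(negated = !negated)
  def entries: Long = rows * cols   // crude stand-in for a SizeHint-based estimate
}

// Rewrite -(A * B) by pushing the negation onto the cheaper operand.
def negatedProduct(a: Mat, b: Mat): (Mat, Mat) =
  if (a.entries <= b.entries) (-a, b) else (a, -b)
```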

Benchmarks

The following benchmarks do not provide a definitive answer to what speedup is obtained with the new Matrix API... they rather give an idea of what kind of improvement one may see. This is mainly because they were all executed in local mode (I did not have a Hadoop cluster to test on). This limited the possible dimensions – larger ones either crashed with java.lang.OutOfMemoryError: Java heap space, or did not seem to finish in a “reasonable” time with the old Matrix API. Given this, I skipped a more rigorous analysis and just give the rough results of each benchmark below. So, feel free to contribute your experience and/or results if you have tried the new Matrix API. It would be more interesting to see the speedups on some real applications (rather than these randomly generated artificial test cases) and on a real Hadoop cluster.

Each benchmark ran two jobs – one using the old Matrix API, the other using the new API. I wrote a quick-and-dirty script for generating these jobs as well as the corresponding input dense matrices. Each job was run 10 times (the runs did not vary much) and the execution time was measured with the 'time' command. All benchmarks were run on the following system configuration:

  • Kernel: Linux 3.2.0-51-generic #77-Ubuntu SMP Wed Jul 24 20:18:19 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
  • JVM: java version "1.6.0_45", Java(TM) SE Runtime Environment (build 1.6.0_45-b06), Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
  • CPU: Intel(R) Core(TM) i5-3570 CPU (4 cores @ 3.40GHz)
  • RAM: 4 GB DDR3 @ 1600 MHz
  • HDD: Hitachi Hds721010cla332 Sata 7200rpm
  • Latest Scalding from twitter/scalding/develop (as of 9-5-2013)

Random matrix chain multiplication

In this benchmark, there is a chain multiplication of 5 matrices with the following dimensions: (35, 15), (15, 5), (5, 10), (10, 200), (200, 250)... which can be optimally factorized as ((A1 (A2 A3)) ((A4 A5) A6)). Here are the generated job with the old Matrix API and the generated job with the new Matrix API. The corresponding run times are below – it took about 6 seconds with the old API and 3 seconds with the new one.

Old API runtimes

real	0m5.896s
user	0m8.453s
sys	0m0.152s

real	0m5.899s
user	0m8.281s
sys	0m0.164s

real	0m5.924s
user	0m8.361s
sys	0m0.184s

real	0m6.030s
user	0m8.557s
sys	0m0.148s

real	0m6.032s
user	0m8.445s
sys	0m0.180s

real	0m6.055s
user	0m8.561s
sys	0m0.164s

real	0m6.179s
user	0m8.573s
sys	0m0.160s

real	0m5.810s
user	0m8.481s
sys	0m0.168s

real	0m6.417s
user	0m8.913s
sys	0m0.180s

real	0m5.797s
user	0m8.449s
sys	0m0.200s

New API runtimes

real	0m3.135s
user	0m5.108s
sys	0m0.152s

real	0m3.127s
user	0m4.944s
sys	0m0.104s

real	0m2.932s
user	0m4.920s
sys	0m0.128s

real	0m3.108s
user	0m4.928s
sys	0m0.116s

real	0m2.937s
user	0m4.716s
sys	0m0.100s

real	0m3.132s
user	0m4.832s
sys	0m0.132s

real	0m2.953s
user	0m4.840s
sys	0m0.108s

real	0m2.941s
user	0m4.828s
sys	0m0.148s

real	0m2.911s
user	0m4.780s
sys	0m0.088s

real	0m2.907s
user	0m4.740s
sys	0m0.128s

Reusing computed pipes

The purpose of this benchmark was to test reusing already computed pipes / subtrees in products. We compute G^8 where G has dimensions 100x100 (when I tried 200x200, the new API finished within 30-40 seconds, whereas the old one did not seem to finish after more than 15 minutes). It is optimized as (((G G) (G G)) ((G G) (G G))), and the products computed on the left-hand side are reused on the right-hand side. Here are the generated job with the old Matrix API and the generated job with the new Matrix API. The corresponding run times are below – it took about 18 seconds with the old API and 6.5 seconds with the new one.
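The reuse here is essentially exponentiation by squaring: only three distinct products are materialized (G·G, the square of that, and the square of that) instead of seven chained multiplications. A standalone in-memory sketch of that plan shape (plain arrays, not the Scalding pipe code):

```scala
// Illustrative only: G^8 with reuse needs three products instead of seven.
def multiply(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
  val n = a.length
  Array.tabulate(n, n) { (i, k) =>
    var s = 0.0
    var j = 0
    while (j < n) { s += a(i)(j) * b(j)(k); j += 1 }
    s
  }
}

def power8(g: Array[Array[Double]]): Array[Array[Double]] = {
  val g2 = multiply(g, g)     // (G G)
  val g4 = multiply(g2, g2)   // ((G G) (G G)) – g2 is reused, not recomputed
  multiply(g4, g4)            // (((G G) (G G)) ((G G) (G G))) – g4 is reused
}
```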

Old API runtimes

real	0m17.944s
user	0m26.770s
sys	0m0.388s

real	0m18.504s
user	0m28.210s
sys	0m0.416s

real	0m17.907s
user	0m26.826s
sys	0m0.332s

real	0m18.022s
user	0m26.822s
sys	0m0.428s

real	0m18.167s
user	0m27.006s
sys	0m0.404s

real	0m18.166s
user	0m27.054s
sys	0m0.344s

real	0m17.763s
user	0m25.730s
sys	0m0.400s

real	0m18.028s
user	0m26.822s
sys	0m0.424s

real	0m18.354s
user	0m27.146s
sys	0m0.416s

real	0m18.045s
user	0m26.862s
sys	0m0.384s

New API runtimes

real	0m6.525s
user	0m8.097s
sys	0m0.196s

real	0m6.521s
user	0m7.932s
sys	0m0.368s

real	0m6.518s
user	0m8.025s
sys	0m0.272s

real	0m6.525s
user	0m8.049s
sys	0m0.264s

real	0m6.518s
user	0m7.984s
sys	0m0.280s

real	0m6.530s
user	0m7.900s
sys	0m0.312s

real	0m6.509s
user	0m8.057s
sys	0m0.252s

real	0m6.530s
user	0m7.996s
sys	0m0.256s

real	0m6.322s
user	0m7.860s
sys	0m0.240s

real	0m6.519s
user	0m8.141s
sys	0m0.180s

Sums in one groupBy

In this benchmark, five different matrices, each with dimensions 300x300, are summed to test the optimization where a chain of sums is done using one groupBy. Here are the generated job with the old Matrix API and the generated job with the new Matrix API. The corresponding run times are below – it took about 5.5 seconds with the old API and 3.4 seconds with the new one.

Old API runtimes

real	0m5.522s
user	0m10.405s
sys	0m0.216s

real	0m5.308s
user	0m9.593s
sys	0m0.212s

real	0m5.307s
user	0m9.465s
sys	0m0.208s

real	0m5.721s
user	0m11.437s
sys	0m0.236s

real	0m5.601s
user	0m10.801s
sys	0m0.208s

real	0m5.448s
user	0m10.009s
sys	0m0.220s

real	0m5.397s
user	0m10.081s
sys	0m0.204s

real	0m5.426s
user	0m9.905s
sys	0m0.256s

real	0m5.637s
user	0m10.237s
sys	0m0.192s

real	0m5.411s
user	0m10.021s
sys	0m0.232s

New API runtimes

real	0m3.330s
user	0m6.984s
sys	0m0.292s

real	0m3.400s
user	0m7.080s
sys	0m0.288s

real	0m3.337s
user	0m7.084s
sys	0m0.256s

real	0m3.412s
user	0m7.324s
sys	0m0.248s

real	0m3.343s
user	0m7.036s
sys	0m0.236s

real	0m3.343s
user	0m7.016s
sys	0m0.272s

real	0m3.336s
user	0m7.020s
sys	0m0.260s

real	0m3.431s
user	0m7.232s
sys	0m0.328s

real	0m3.340s
user	0m6.988s
sys	0m0.332s

real	0m3.391s
user	0m7.008s
sys	0m0.336s

(A*B) + C with two map/reduce ops

This benchmark tested the feature where a product is computed into the outer sum, so that (A*B) + C requires only two Map/Reduce operations. Here, all three matrices had dimensions 100x100. Here are the generated job with the old Matrix API and the generated job with the new Matrix API. The corresponding run times are below – it took about 4.5 seconds with the old API and 3.5 seconds with the new one.
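A rough sketch of how the product can be folded into the outer sum, again assuming (row, col, value) TypedPipes as above: the first pass joins A and B on the inner dimension j, and the single sumByKey of the second pass both sums the partial products over j and adds in C's entries. The function name and representation are my own illustration, not the Matrix2 internals:

```scala
import com.twitter.scalding.typed.TypedPipe

// Illustrative only: (A*B) + C in two M/R passes.
def productPlusC(a: TypedPipe[(Int, Int, Double)],
                 b: TypedPipe[(Int, Int, Double)],
                 c: TypedPipe[(Int, Int, Double)]): TypedPipe[(Int, Int, Double)] = {
  val aByJ = a.map { case (i, j, v) => (j, (i, v)) }.group
  val bByJ = b.map { case (j, k, v) => (j, (k, v)) }.group
  val partials = aByJ.join(bByJ).toTypedPipe          // pass 1: join on the inner dimension
    .map { case (_, ((i, av), (k, bv))) => ((i, k), av * bv) }
  (partials ++ c.map { case (i, k, v) => ((i, k), v) })
    .sumByKey                                         // pass 2: sums over j and adds C in one reduce
    .toTypedPipe
    .map { case ((i, k), v) => (i, k, v) }
}
```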

Old API runtimes

real	0m4.500s
user	0m7.444s
sys	0m0.136s

real	0m4.450s
user	0m7.568s
sys	0m0.112s

real	0m4.345s
user	0m7.076s
sys	0m0.120s

real	0m4.357s
user	0m7.296s
sys	0m0.124s

real	0m4.414s
user	0m7.300s
sys	0m0.152s

real	0m4.271s
user	0m6.928s
sys	0m0.120s

real	0m4.362s
user	0m7.356s
sys	0m0.124s

real	0m4.190s
user	0m7.336s
sys	0m0.144s

real	0m4.378s
user	0m7.300s
sys	0m0.140s

real	0m4.622s
user	0m7.840s
sys	0m0.128s

New API runtimes

real	0m3.609s
user	0m5.544s
sys	0m0.148s

real	0m3.384s
user	0m5.456s
sys	0m0.168s

real	0m3.591s
user	0m5.792s
sys	0m0.148s

real	0m3.410s
user	0m5.328s
sys	0m0.148s

real	0m3.406s
user	0m5.280s
sys	0m0.148s

real	0m3.607s
user	0m5.808s
sys	0m0.140s

real	0m3.611s
user	0m5.616s
sys	0m0.184s

real	0m3.594s
user	0m5.700s
sys	0m0.176s

real	0m3.414s
user	0m5.352s
sys	0m0.164s

real	0m3.789s
user	0m5.920s
sys	0m0.220s

Overhead of optimization and untyped Pipe

The purpose of this benchmark was originally to measure the overhead of the optimization procedure, using just a multiplication of two matrices. With larger dimensions (300x150 and 150x200), however, the new API still ran faster than the old one (around 16.5 seconds vs. 28.5 seconds), possibly due to the overhead of runtime checks in the untyped API. With smaller dimensions (60x30 and 30x40), the old API ran slightly faster (around 1.3 seconds with the old API vs. 1.56 seconds with the new one). Here are the generated job with the old Matrix API and the generated job with the new Matrix API.

Smaller dimensions

Old API runtimes

real	0m1.302s
user	0m2.044s
sys	0m0.064s

real	0m1.304s
user	0m2.076s
sys	0m0.036s

real	0m1.298s
user	0m2.064s
sys	0m0.032s

real	0m1.305s
user	0m2.064s
sys	0m0.040s

real	0m1.295s
user	0m2.048s
sys	0m0.044s

real	0m1.311s
user	0m2.080s
sys	0m0.036s

real	0m1.306s
user	0m2.036s
sys	0m0.052s

real	0m1.315s
user	0m2.056s
sys	0m0.044s

real	0m1.302s
user	0m2.076s
sys	0m0.036s

real	0m1.305s
user	0m2.048s
sys	0m0.056s

New API runtimes

real	0m1.572s
user	0m2.504s
sys	0m0.048s

real	0m1.564s
user	0m2.484s
sys	0m0.048s

real	0m1.566s
user	0m2.484s
sys	0m0.068s

real	0m1.567s
user	0m2.512s
sys	0m0.060s

real	0m1.555s
user	0m2.484s
sys	0m0.056s

real	0m1.563s
user	0m2.496s
sys	0m0.064s

real	0m1.557s
user	0m2.504s
sys	0m0.048s

real	0m1.559s
user	0m2.480s
sys	0m0.056s

real	0m1.560s
user	0m2.520s
sys	0m0.048s

real	0m1.557s
user	0m2.492s
sys	0m0.052s

Larger dimensions

Old API runtimes

real	0m28.119s
user	0m41.811s
sys	0m0.524s

real	0m29.470s
user	0m44.395s
sys	0m0.572s

real	0m29.411s
user	0m44.587s
sys	0m0.540s

real	0m28.734s
user	0m42.035s
sys	0m0.576s

real	0m28.705s
user	0m42.175s
sys	0m0.480s

real	0m29.383s
user	0m43.991s
sys	0m0.636s

real	0m28.379s
user	0m41.851s
sys	0m0.472s

real	0m29.174s
user	0m42.587s
sys	0m0.520s

real	0m28.509s
user	0m42.135s
sys	0m0.596s

real	0m28.585s
user	0m42.035s
sys	0m0.604s

New API runtimes

real	0m16.606s
user	0m18.465s
sys	0m0.312s

real	0m16.253s
user	0m18.309s
sys	0m0.288s

real	0m16.999s
user	0m18.869s
sys	0m0.344s

real	0m16.811s
user	0m18.681s
sys	0m0.368s

real	0m16.206s
user	0m18.177s
sys	0m0.268s

real	0m17.607s
user	0m19.509s
sys	0m0.348s

real	0m17.625s
user	0m19.601s
sys	0m0.372s

real	0m15.011s
user	0m16.997s
sys	0m0.336s

real	0m16.606s
user	0m18.353s
sys	0m0.384s

real	0m16.019s
user	0m18.045s
sys	0m0.352s