Matrix API optimization summary and a few local-mode benchmarks
This is a summary of the Scalding Matrix API optimization work done as part of the Google Summer of Code project, including a few benchmarks of the old and new Matrix API in local mode. The main goal was to optimize the matrix product, because it is quite an expensive operation on Map/Reduce. This was done mainly in two ways: 1) exploiting associativity, i.e. (AB)C = A(BC) (disclaimer: floating-point values may differ slightly), since the costs of the two orders may differ. For this, the well-known dynamic programming algorithm for matrix chain multiplication was used (http://en.wikipedia.org/wiki/Matrix_chain_multiplication), making decisions based on the provided SizeHints, with one slight change: the costs of subchains are multiplied by their lengths, which ensures that more spread-out ("bushy") trees/plans are preferred. 2) Reusing already-computed pipes. This is very useful in graph propagations G^k V – for G^k alone, we only need to compute a fraction of all joins, e.g. G^8 = ((G G)(G G))((G G)(G G)) with each repeated subtree computed once.
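As a rough illustration of the first idea, here is a sketch of the chain-ordering dynamic program in plain Python. This is not the actual Scalding planner code, and the exact way the length weighting enters the cost is my reading of the description above; the function name and representation are mine.

```python
# Sketch of matrix-chain ordering via dynamic programming, with the tweak
# described above: a subchain's cost is weighted by its length, biasing
# the planner toward "bushy" plans. Illustrative only, not Scalding code.

def chain_order(dims):
    """dims[i], dims[i+1] are the dimensions of matrix i (0-based).
    Returns (weighted cost, plan), where plan is a nested tuple of
    matrix indices describing the parenthesization."""
    n = len(dims) - 1
    best = {(i, i): (0, i) for i in range(n)}  # (i, j) -> (cost, plan)
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            candidates = []
            for k in range(i, j):
                lc, lp = best[(i, k)]
                rc, rp = best[(k + 1, j)]
                # weight each subchain's cost by its length ("bushy" bias),
                # then add the cost of the combining product
                cost = ((k - i + 1) * lc + (j - k) * rc
                        + dims[i] * dims[k + 1] * dims[j + 1])
                candidates.append((cost, (lp, rp)))
            best[(i, j)] = min(candidates, key=lambda c: c[0])
    return best[(0, n - 1)]

cost, plan = chain_order([35, 15, 5, 10, 200, 250])
print(plan)  # ((0, 1), ((2, 3), 4))
```

On the five-matrix chain used in the first benchmark below, this picks the plan (A1 A2)((A3 A4) A5).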
After consulting with @posco, we chose to write the optimized Matrix API from scratch rather than optimizing the existing one – firstly, it was easier to incorporate the above-mentioned dynamic programming algorithm; secondly, the new API uses TypedPipes (the original Matrix API pre-dates the Typed API), which (I think) also brings some performance improvement.
Other, smaller optimizations were made in different areas. If there is a chain of matrix sums, they are all done in one groupBy. If we have a tree such as (A * B) + C, A*B is computed into an outer sum, so that the whole expression can be done in two Map/Reduce passes (otherwise it would require three). Results of briefly benchmarking these features are shown below.
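The sum-chain idea can be sketched in plain Python, with dicts of (row, col) entries standing in for pipes: instead of N-1 pairwise merges, all entries flow through a single grouping step. Names here are illustrative, not the Scalding API.

```python
# Sketch of folding a chain of matrix sums into a single grouping step:
# concatenate all (row, col, value) entries and reduce each key once,
# rather than merging matrices pairwise. Illustrative only.

from collections import defaultdict

def sum_matrices(*matrices):
    """Each matrix is a dict mapping (row, col) -> value."""
    acc = defaultdict(float)
    for m in matrices:            # "mapper": emit every entry of every matrix
        for key, value in m.items():
            acc[key] += value     # "reducer": one groupBy over all entries
    return dict(acc)

a = {(0, 0): 1.0, (0, 1): 2.0}
b = {(0, 0): 3.0, (1, 1): 4.0}
c = {(0, 1): -2.0}
print(sum_matrices(a, b, c))  # {(0, 0): 4.0, (0, 1): 0.0, (1, 1): 4.0}
```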
There was also some less significant work that I did not benchmark, namely:
- implementing the Hadamard product;
- optimizing negation and difference, using the algebraic rule (-A)B = A(-B), so only the smaller operand is negated;
- adding scalar operations and optimizing multiplication by a scalar (same idea as the previous case);
- optimizing the trace: the trace of a matrix sum is the sum of the traces of its matrices, and trace(BA) = trace(AB) even though the costs of the two products may differ;
- optimizing row L2 normalization, where the column sum of a sum of matrices equals the sum of their column sums;
- adding some extra features, such as infinite column and row vectors, which can be used in intermediate computations during planning.
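The algebraic rules behind the negation and trace rewrites are easy to check numerically. A quick plain-Python verification on small dense matrices (illustrative helper functions, not library code):

```python
# Numeric check of the algebraic identities used by the optimizer:
# (-A)B = A(-B) (so only the smaller operand need be negated), the trace
# is linear over sums, and trace(AB) = trace(BA).

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def neg(a):
    return [[-x for x in row] for row in a]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

assert matmul(neg(A), B) == matmul(A, neg(B))
assert trace([[A[i][j] + B[i][j] for j in range(2)]
              for i in range(2)]) == trace(A) + trace(B)
assert trace(matmul(A, B)) == trace(matmul(B, A))
```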
The following benchmarks do not give a definitive answer as to what speedup the new Matrix API provides; rather, they give an idea of the kind of improvement one may see. This is mainly because they were all executed in local mode (I did not have a Hadoop cluster to test on). This limited the possible dimensions – larger ones either crashed with java.lang.OutOfMemoryError: Java heap space or did not seem to finish in a "reasonable" time with the old Matrix API. Given this, I skipped a more rigorous analysis and simply list the raw results of each benchmark below. Feel free to contribute your experience and/or results if you have tried the new Matrix API – it would be more interesting to see the speedups on real applications (rather than these randomly generated artificial test cases) and on a real Hadoop cluster.
Each benchmark ran two jobs – one using the old Matrix API, the other using the new one. I wrote a quick-and-dirty script to generate these jobs as well as the corresponding input dense matrices. Each job was run 10 times (run times did not vary much) and the execution time was measured with the 'time' command. All benchmarks were run on the following system configuration:
- Kernel: Linux 3.2.0-51-generic #77-Ubuntu SMP Wed Jul 24 20:18:19 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
- JVM: java version "1.6.0_45", Java(TM) SE Runtime Environment (build 1.6.0_45-b06), Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
- CPU: Intel(R) Core(TM) i5-3570 (4 cores @ 3.40 GHz)
- RAM: 4 GB DDR3 @ 1600 MHz
- HDD: Hitachi HDS721010CLA332, SATA, 7200 RPM
- Latest Scalding from twitter/scalding/develop (as of 9-5-2013)
In this benchmark, a chain of 5 matrices with the following dimensions is multiplied: (35, 15), (15, 5), (5, 10), (10, 200), (200, 250). This chain is optimally factorized as (A1 A2)((A3 A4) A5). Here are the generated job with the old Matrix API and the generated job with the new Matrix API. The corresponding run times are below – it took about 6 seconds with the old API and 3 seconds with the new one.
Old Matrix API:
real 0m5.896s
user 0m8.453s
sys 0m0.152s
real 0m5.899s
user 0m8.281s
sys 0m0.164s
real 0m5.924s
user 0m8.361s
sys 0m0.184s
real 0m6.030s
user 0m8.557s
sys 0m0.148s
real 0m6.032s
user 0m8.445s
sys 0m0.180s
real 0m6.055s
user 0m8.561s
sys 0m0.164s
real 0m6.179s
user 0m8.573s
sys 0m0.160s
real 0m5.810s
user 0m8.481s
sys 0m0.168s
real 0m6.417s
user 0m8.913s
sys 0m0.180s
real 0m5.797s
user 0m8.449s
sys 0m0.200s
New Matrix API:
real 0m3.135s
user 0m5.108s
sys 0m0.152s
real 0m3.127s
user 0m4.944s
sys 0m0.104s
real 0m2.932s
user 0m4.920s
sys 0m0.128s
real 0m3.108s
user 0m4.928s
sys 0m0.116s
real 0m2.937s
user 0m4.716s
sys 0m0.100s
real 0m3.132s
user 0m4.832s
sys 0m0.132s
real 0m2.953s
user 0m4.840s
sys 0m0.108s
real 0m2.941s
user 0m4.828s
sys 0m0.148s
real 0m2.911s
user 0m4.780s
sys 0m0.088s
real 0m2.907s
user 0m4.740s
sys 0m0.128s
The purpose of this benchmark was to test reusing already-computed pipes/subtrees in products. We compute G^8, where G has dimensions 100x100 (when I tried 200x200, the new API finished within 30-40 seconds, whereas the old one did not seem to finish after more than 15 minutes). The product is optimized as (((G G) (G G)) ((G G) (G G))), and the results computed on the left-hand side are reused on the right-hand side. Here are the generated job with the old Matrix API and the generated job with the new Matrix API. The corresponding run times are below – it took about 18 seconds with the old API and 6.5 seconds with the new one.
Old Matrix API:
real 0m17.944s
user 0m26.770s
sys 0m0.388s
real 0m18.504s
user 0m28.210s
sys 0m0.416s
real 0m17.907s
user 0m26.826s
sys 0m0.332s
real 0m18.022s
user 0m26.822s
sys 0m0.428s
real 0m18.167s
user 0m27.006s
sys 0m0.404s
real 0m18.166s
user 0m27.054s
sys 0m0.344s
real 0m17.763s
user 0m25.730s
sys 0m0.400s
real 0m18.028s
user 0m26.822s
sys 0m0.424s
real 0m18.354s
user 0m27.146s
sys 0m0.416s
real 0m18.045s
user 0m26.862s
sys 0m0.384s
New Matrix API:
real 0m6.525s
user 0m8.097s
sys 0m0.196s
real 0m6.521s
user 0m7.932s
sys 0m0.368s
real 0m6.518s
user 0m8.025s
sys 0m0.272s
real 0m6.525s
user 0m8.049s
sys 0m0.264s
real 0m6.518s
user 0m7.984s
sys 0m0.280s
real 0m6.530s
user 0m7.900s
sys 0m0.312s
real 0m6.509s
user 0m8.057s
sys 0m0.252s
real 0m6.530s
user 0m7.996s
sys 0m0.256s
real 0m6.322s
user 0m7.860s
sys 0m0.240s
real 0m6.519s
user 0m8.141s
sys 0m0.180s
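The pipe reuse in this G^8 benchmark is essentially exponentiation by squaring: with every repeated subtree computed once, G^8 needs 3 products instead of the 7 a left-deep chain would take. A small counting sketch (plain Python, not Scalding; the function name is mine):

```python
# Count how many distinct products G^k needs when every repeated power
# is computed once, i.e. exponentiation by squaring with caching.

def products_with_reuse(k):
    built = {1}  # powers of G already available (G itself is free)

    def build(e):
        if e in built:
            return 0           # this power was already computed; reuse it
        half = e // 2
        n = build(half) + build(e - half) + 1  # one product joins the halves
        built.add(e)
        return n

    return build(k)

print(products_with_reuse(8))  # 3 products: G^2, G^4, G^8 (vs 7 without reuse)
```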
In this benchmark, five different matrices, each with dimensions 300x300, are summed, to test the optimization where a chain of sums is done in one groupBy. Here are the generated job with the old Matrix API and the generated job with the new Matrix API. The corresponding run times are below – it took about 5.5 seconds with the old API and 3.4 seconds with the new one.
Old Matrix API:
real 0m5.522s
user 0m10.405s
sys 0m0.216s
real 0m5.308s
user 0m9.593s
sys 0m0.212s
real 0m5.307s
user 0m9.465s
sys 0m0.208s
real 0m5.721s
user 0m11.437s
sys 0m0.236s
real 0m5.601s
user 0m10.801s
sys 0m0.208s
real 0m5.448s
user 0m10.009s
sys 0m0.220s
real 0m5.397s
user 0m10.081s
sys 0m0.204s
real 0m5.426s
user 0m9.905s
sys 0m0.256s
real 0m5.637s
user 0m10.237s
sys 0m0.192s
real 0m5.411s
user 0m10.021s
sys 0m0.232s
New Matrix API:
real 0m3.330s
user 0m6.984s
sys 0m0.292s
real 0m3.400s
user 0m7.080s
sys 0m0.288s
real 0m3.337s
user 0m7.084s
sys 0m0.256s
real 0m3.412s
user 0m7.324s
sys 0m0.248s
real 0m3.343s
user 0m7.036s
sys 0m0.236s
real 0m3.343s
user 0m7.016s
sys 0m0.272s
real 0m3.336s
user 0m7.020s
sys 0m0.260s
real 0m3.431s
user 0m7.232s
sys 0m0.328s
real 0m3.340s
user 0m6.988s
sys 0m0.332s
real 0m3.391s
user 0m7.008s
sys 0m0.336s
This benchmark tested the feature where products are computed into an outer sum, so that (A*B) + C requires only two Map/Reduce passes. Here, all three matrices had dimensions 100x100. Here are the generated job with the old Matrix API and the generated job with the new Matrix API. The corresponding run times are below – it took about 4.5 seconds with the old API and 3.5 seconds with the new one.
Old Matrix API:
real 0m4.500s
user 0m7.444s
sys 0m0.136s
real 0m4.450s
user 0m7.568s
sys 0m0.112s
real 0m4.345s
user 0m7.076s
sys 0m0.120s
real 0m4.357s
user 0m7.296s
sys 0m0.124s
real 0m4.414s
user 0m7.300s
sys 0m0.152s
real 0m4.271s
user 0m6.928s
sys 0m0.120s
real 0m4.362s
user 0m7.356s
sys 0m0.124s
real 0m4.190s
user 0m7.336s
sys 0m0.144s
real 0m4.378s
user 0m7.300s
sys 0m0.140s
real 0m4.622s
user 0m7.840s
sys 0m0.128s
New Matrix API:
real 0m3.609s
user 0m5.544s
sys 0m0.148s
real 0m3.384s
user 0m5.456s
sys 0m0.168s
real 0m3.591s
user 0m5.792s
sys 0m0.148s
real 0m3.410s
user 0m5.328s
sys 0m0.148s
real 0m3.406s
user 0m5.280s
sys 0m0.148s
real 0m3.607s
user 0m5.808s
sys 0m0.140s
real 0m3.611s
user 0m5.616s
sys 0m0.184s
real 0m3.594s
user 0m5.700s
sys 0m0.176s
real 0m3.414s
user 0m5.352s
sys 0m0.164s
real 0m3.789s
user 0m5.920s
sys 0m0.220s
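The outer-sum rewrite benchmarked above can be sketched in plain Python, with dicts of (row, col) entries standing in for pipes (names are illustrative, not the Scalding API): the partial terms of A*B and the entries of C are folded in the same accumulation step, so the addition does not need its own pass.

```python
# Sketch of "compute the product into an outer sum": partial products of
# A*B and the entries of C go through one shared accumulation, so
# (A*B) + C costs a join plus a single groupBy. Illustrative only.

from collections import defaultdict

def product_plus(a, b, c):
    """a, b, c: dicts mapping (row, col) -> value. Returns A*B + C."""
    acc = defaultdict(float)
    # join A and B on the inner index, emitting partial products...
    for (i, k1), av in a.items():
        for (k2, j), bv in b.items():
            if k1 == k2:
                acc[(i, j)] += av * bv
    # ...and fold C's entries into the same accumulation step
    for key, cv in c.items():
        acc[key] += cv
    return dict(acc)

A = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
I = {(0, 0): 1.0, (1, 1): 1.0}          # identity matrix
C = {(0, 0): 10.0}
print(product_plus(A, I, C))  # A plus 10 at (0, 0)
```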
The original purpose of this benchmark was to measure the overhead of the optimization procedure by running just a single multiplication of two matrices. With larger dimensions (300x150 and 150x200), however, the new API still ran faster than the old one (around 16.5 seconds vs. 28.5 seconds), possibly due to the overhead of runtime checks in the untyped API. With smaller dimensions (60x30 and 30x40), the old API ran slightly faster (around 1.3 seconds with the old API vs. 1.56 seconds with the new one). Here are the generated job with the old Matrix API and the generated job with the new Matrix API.
Old Matrix API (60x30 by 30x40):
real 0m1.302s
user 0m2.044s
sys 0m0.064s
real 0m1.304s
user 0m2.076s
sys 0m0.036s
real 0m1.298s
user 0m2.064s
sys 0m0.032s
real 0m1.305s
user 0m2.064s
sys 0m0.040s
real 0m1.295s
user 0m2.048s
sys 0m0.044s
real 0m1.311s
user 0m2.080s
sys 0m0.036s
real 0m1.306s
user 0m2.036s
sys 0m0.052s
real 0m1.315s
user 0m2.056s
sys 0m0.044s
real 0m1.302s
user 0m2.076s
sys 0m0.036s
real 0m1.305s
user 0m2.048s
sys 0m0.056s
New Matrix API (60x30 by 30x40):
real 0m1.572s
user 0m2.504s
sys 0m0.048s
real 0m1.564s
user 0m2.484s
sys 0m0.048s
real 0m1.566s
user 0m2.484s
sys 0m0.068s
real 0m1.567s
user 0m2.512s
sys 0m0.060s
real 0m1.555s
user 0m2.484s
sys 0m0.056s
real 0m1.563s
user 0m2.496s
sys 0m0.064s
real 0m1.557s
user 0m2.504s
sys 0m0.048s
real 0m1.559s
user 0m2.480s
sys 0m0.056s
real 0m1.560s
user 0m2.520s
sys 0m0.048s
real 0m1.557s
user 0m2.492s
sys 0m0.052s
Old Matrix API (300x150 by 150x200):
real 0m28.119s
user 0m41.811s
sys 0m0.524s
real 0m29.470s
user 0m44.395s
sys 0m0.572s
real 0m29.411s
user 0m44.587s
sys 0m0.540s
real 0m28.734s
user 0m42.035s
sys 0m0.576s
real 0m28.705s
user 0m42.175s
sys 0m0.480s
real 0m29.383s
user 0m43.991s
sys 0m0.636s
real 0m28.379s
user 0m41.851s
sys 0m0.472s
real 0m29.174s
user 0m42.587s
sys 0m0.520s
real 0m28.509s
user 0m42.135s
sys 0m0.596s
real 0m28.585s
user 0m42.035s
sys 0m0.604s
New Matrix API (300x150 by 150x200):
real 0m16.606s
user 0m18.465s
sys 0m0.312s
real 0m16.253s
user 0m18.309s
sys 0m0.288s
real 0m16.999s
user 0m18.869s
sys 0m0.344s
real 0m16.811s
user 0m18.681s
sys 0m0.368s
real 0m16.206s
user 0m18.177s
sys 0m0.268s
real 0m17.607s
user 0m19.509s
sys 0m0.348s
real 0m17.625s
user 0m19.601s
sys 0m0.372s
real 0m15.011s
user 0m16.997s
sys 0m0.336s
real 0m16.606s
user 0m18.353s
sys 0m0.384s
real 0m16.019s
user 0m18.045s
sys 0m0.352s