Commits on Jun 26, 2016
  1. @johnynek

    Setting version to 0.12.1

    johnynek committed Jun 25, 2016
Commits on Jun 25, 2016
  1. @johnynek

    Minor fix for backwards compatibility (proposed 0.12.1) (#535)

    * Minor fix for backwards compatibility
    
    * fix some test failures
    
    * fix stray typo
    
    * try to fix the flaky spark test
    
    * try to fix noisy decayedvector test
    
    * comment out the spark tests :(
    johnynek committed on GitHub Jun 25, 2016
Commits on Jun 24, 2016
  1. @non @johnynek

    Optimize count-min sketch (#533)

    * Update CMSBenchmark for performance testing.
    
    This commit copies the previous benchmark to TopCMSBenchmark (since
    it was actually using the TopCMS implementation) and then converts
    CMSBenchmark to benchmark CMS itself. It also standardizes and
    simplifies the code a little bit to make it easier to see what's
    happening, and removes some unused benchmark parameters.
    
    On the author's machine (MBP retina), here are the current benchmark
    results:
    
    [info] Benchmark                         (delta)  (eps)  (size)   Mode  Cnt    Score     Error  Units
    [info] CMSBenchmark.sumLargeBigIntCms  0.0000001    0.1    1000  thrpt    5   49.939 ±   1.152  ops/s
    [info] CMSBenchmark.sumLargeBigIntCms  0.0000001  0.005    1000  thrpt    5   48.309 ±   2.453  ops/s
    [info] CMSBenchmark.sumLargeStringCms  0.0000001    0.1    1000  thrpt    5   50.682 ±  14.959  ops/s
    [info] CMSBenchmark.sumLargeStringCms  0.0000001  0.005    1000  thrpt    5   51.256 ±   3.108  ops/s
    [info] CMSBenchmark.sumSmallBigIntCms  0.0000001    0.1    1000  thrpt    5  556.247 ±  34.430  ops/s
    [info] CMSBenchmark.sumSmallBigIntCms  0.0000001  0.005    1000  thrpt    5  377.775 ±  34.519  ops/s
    [info] CMSBenchmark.sumSmallLongCms    0.0000001    0.1    1000  thrpt    5  594.712 ±  26.725  ops/s
    [info] CMSBenchmark.sumSmallLongCms    0.0000001  0.005    1000  thrpt    5  449.726 ± 110.672  ops/s
    
    * Optimize the CMS implementation and machinery.
    
    This commit does a number of things:
    
     1. Split CMSHasher into its own file.
     2. Optimize the CMS monoid's .sum and .sumOption.
     3. Reduce allocations for +, ++, frequency, etc.
     4. Fix CMS law-checking so it runs during testing.
     5. Use faster CMS hashing functions.
    
    Running on the author's machine (MBP retina), after this commit the
    CMS benchmarks are:
    
    [info] Benchmark                         (delta)  (eps)  (size)   Mode  Cnt     Score      Error  Units
    [info] CMSBenchmark.sumLargeBigIntCms  0.0000001    0.1    1000  thrpt    5   626.961 ±   32.379  ops/s
    [info] CMSBenchmark.sumLargeBigIntCms  0.0000001  0.005    1000  thrpt    5   573.082 ±  186.829  ops/s
    [info] CMSBenchmark.sumLargeStringCms  0.0000001    0.1    1000  thrpt    5   173.149 ±   64.034  ops/s
    [info] CMSBenchmark.sumLargeStringCms  0.0000001  0.005    1000  thrpt    5   146.868 ±  136.613  ops/s
    [info] CMSBenchmark.sumSmallBigIntCms  0.0000001    0.1    1000  thrpt    5  1369.887 ±  188.736  ops/s
    [info] CMSBenchmark.sumSmallBigIntCms  0.0000001  0.005    1000  thrpt    5  1144.827 ±  238.539  ops/s
    [info] CMSBenchmark.sumSmallLongCms    0.0000001    0.1    1000  thrpt    5  7998.298 ± 6520.702  ops/s
    [info] CMSBenchmark.sumSmallLongCms    0.0000001  0.005    1000  thrpt    5  4708.305 ± 1749.729  ops/s
    
    * Clean up TopCMS benchmark.
    
    This removes the (now unnecessary) import for CMSHasher instances.
    It also cleans things up a little bit.
    
    * improve TopCMS performance
    
    * Respond to review comments.
    
    Specifically, this removes some dead code (which we had added
    to tests), fixes an out-of-date comment, and reduces the range
    of integers we generate to ensure more collisions.
    
    * Revert to the older CMSHasher[BigInt] implementation.
    
    I still would like to improve this, but will do so in a follow-on PR.
    non committed with johnynek Jun 24, 2016
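As background for the optimization described above, the following is a minimal, self-contained sketch of the count-min idea in Scala. Everything here (the class name, the toy hash family, the parameters) is hypothetical and for illustration only; Algebird's CMS implementation handles hashing, eps/delta bounds, and approximate results far more carefully.

```scala
import scala.util.Random

// A toy count-min sketch, for illustration only: not Algebird's CMS.
// `depth` rows of `width` counters; each row has its own hash function.
class SimpleCMS(depth: Int, width: Int, seed: Int) {
  private val counts = Array.ofDim[Long](depth, width)
  private val rng = new Random(seed)
  // One (a, b) pair per row for a toy linear hash family (hypothetical scheme).
  private val params = Array.fill(depth)((rng.nextInt(1000000) + 1, rng.nextInt(1000000)))

  private def bucket(row: Int, item: Long): Int = {
    val (a, b) = params(row)
    ((a * item + b) % width).toInt
  }

  def add(item: Long): Unit =
    (0 until depth).foreach(r => counts(r)(bucket(r, item)) += 1L)

  // The estimate is the minimum count across rows: collisions can make it
  // over-count, but it never under-counts.
  def frequency(item: Long): Long =
    (0 until depth).map(r => counts(r)(bucket(r, item))).min
}

val cms = new SimpleCMS(depth = 5, width = 256, seed = 1)
val estimate = {
  (1L to 100L).foreach(cms.add)
  cms.add(42L) // 42 is added a second time
  cms.frequency(42L) // at least 2
}
```

Summing many such sketches is just pointwise addition of the counter arrays, which is why a specialized .sum/.sumOption (as in the commit above) can avoid most intermediate allocations.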
Commits on Jun 17, 2016
  1. @joshualande @johnynek

    Add MapAlgebra.group function, update documentation on MapAlgebra.sumByKey, slightly expand the sumByKey documentation. (#532)

    joshualande committed with johnynek Jun 17, 2016
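For context, the semantics of summing values by key can be sketched in a few lines. This standalone version takes an explicit `plus` function and is an illustration only; Algebird's MapAlgebra.sumByKey combines values through an implicit Semigroup[V] instead.

```scala
// A standalone illustration of summing values by key. Algebird's
// MapAlgebra.sumByKey uses an implicit Semigroup[V]; here we pass
// the combining function `plus` explicitly instead.
def sumByKey[K, V](pairs: Seq[(K, V)])(plus: (V, V) => V): Map[K, V] =
  pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
    acc.updated(k, acc.get(k).fold(v)(plus(_, v)))
  }

val totals = sumByKey(Seq("a" -> 1, "b" -> 2, "a" -> 3))(_ + _)
// totals == Map("a" -> 4, "b" -> 2)
```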
Commits on Jun 15, 2016
  1. @johnynek

    Merge pull request #531 from joshualande/joshualande/default_k_approximate_percentile
    
    Add a default k value for Aggregator.approximatePercentile
    johnynek committed on GitHub Jun 15, 2016
  2. @johnynek

    Merge pull request #530 from non/topic/batching-and-efficiency

    Add Batched[A] type for efficient lazy addition
    johnynek committed on GitHub Jun 14, 2016
Commits on Jun 14, 2016
  1. @striation

    Fix a couple more oversights.

    striation committed Jun 14, 2016
  2. @striation

    Respond to review comments.

    There are a number of changes here:
    
     - Added Equiv[Batched[A]] (though it's slow)
     - Removed the non-compacting Monoid[Batched[A]]
     - Added/improved comments
     - Added .compact method
     - Ensured that folds are correctly compacted
     - Made Batched[A] serializable
    striation committed Jun 14, 2016
  3. @striation

    Make Aggregator.toList more efficient.

    This commit uses internal batching to avoid creating intermediate
    lists. We store a tree structure and then just render the list
    once at the end.
    
    There are probably other places that can benefit from the same
    approach, but this was one that was pointed out to me.
    striation committed Jun 14, 2016
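The approach described above can be sketched in miniature: instead of concatenating lists at every step, accumulate the values in a tree of appends and render the list once at the end with a single linear traversal. The names and details below are illustrative, not Algebird's actual internals.

```scala
// Illustrative only: a tree of appends that is rendered to a List once,
// rather than concatenating lists at every step.
sealed trait Tree[A] {
  def ++(that: Tree[A]): Tree[A] = Node(this, that) // O(1) append
  def toList: List[A] = {
    // Render with an explicit stack: one linear left-to-right pass.
    @annotation.tailrec
    def loop(stack: List[Tree[A]], acc: List[A]): List[A] = stack match {
      case Nil                => acc.reverse
      case Leaf(a) :: rest    => loop(rest, a :: acc)
      case Node(l, r) :: rest => loop(l :: r :: rest, acc)
    }
    loop(List(this), Nil)
  }
}
case class Leaf[A](a: A) extends Tree[A]
case class Node[A](l: Tree[A], r: Tree[A]) extends Tree[A]

val t = (1 to 5).map(i => Leaf(i): Tree[Int]).reduce(_ ++ _)
val rendered = t.toList // List(1, 2, 3, 4, 5)
```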
  4. @striation

    Add Batched data type.

    This is the free semigroup -- it represents combining one or more
    elements using an associative operation. The values are stored in a
    tree structure, which can be traversed. If a semigroup for the
    underlying values becomes available, the structure can be summed to a
    single value.
    
    Batching represents a space/time trade-off (we use more space to use
    less time). In some cases these structures may become too big. To
    address this, there are constructors for Monoid[Batched[A]] and
    Semigroup[Batched[A]] instances that will periodically compact
    themselves (summing the tree to a single value) to keep the overall
    size below a specific parameter.
    
    Batched is appropriate anytime an eager concatenation or combination
    will produce intermediate values that will be thrown away, and a more
    efficient aggregate summation is possible. In those cases, building up
    a batch and then summing it should result in much more efficient
    operations.
    
    This commit adds the type, with documentation. It also adds some basic
    law-checks and tests.
    striation committed Jun 14, 2016
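The periodic-compaction idea described above can be sketched with a plain buffer: keep appending elements, and whenever the buffer grows past a threshold, sum it down to a single value. Because the operation is associative, compaction does not change the final result. This is an illustrative sketch, not Algebird's Batched implementation.

```scala
// Illustrative sketch of periodic compaction (not Algebird's Batched):
// buffer incoming values and, past a threshold, sum the buffer down to a
// single value. With an associative `plus` the result is unchanged.
def compactingSum[A](items: Iterator[A], batchSize: Int)(plus: (A, A) => A): Option[A] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[A]
  items.foreach { a =>
    buf += a
    if (buf.size > batchSize) {
      val compacted = buf.reduceLeft(plus) // compact: one value again
      buf.clear()
      buf += compacted
    }
  }
  buf.reduceLeftOption(plus)
}

val total = compactingSum((1 to 100).iterator, batchSize = 10)(_ + _)
// total == Some(5050), same as summing eagerly
```

The threshold is the space/time dial mentioned above: a larger `batchSize` uses more memory but compacts less often.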
  5. @johnynek

    Merge pull request #529 from joshualande/joshualande/random_sample_aggregator
    
    Add an Aggregator.randomSample aggregator
    johnynek committed on GitHub Jun 14, 2016
  6. Add Aggregator.randomSample aggregator

    joshualande committed Jun 11, 2016
Commits on Jun 7, 2016
  1. @johnynek

    Merge pull request #527 from joshualande/joshualande/add_sorted_take_by

    Add sortedTakeBy and sortedReverseTakeBy to Aggregator.scala
    johnynek committed Jun 7, 2016
Commits on Jun 1, 2016
  1. @johnynek

    Merge pull request #525 from twitter/jnievelt/524

    Fixing BytesSpec to not reference head on an empty list
    johnynek committed Jun 1, 2016
  2. @johnynek

    Merge pull request #526 from dossett/develop

    Update Apache Storm link
    johnynek committed Jun 1, 2016
  3. @dossett

    Update Apache Storm link

    dossett committed Jun 1, 2016
Commits on May 16, 2016
  1. Fixing BytesSpec to not reference head on an empty list

    Joe Nievelt committed May 16, 2016
Commits on Apr 27, 2016
  1. @johnynek

    Merge pull request #519 from tresata/feature/spacesaver-cleanup

    make SSOne/SSMany constructors private and add apply method with count
    johnynek committed Apr 27, 2016
Commits on Apr 26, 2016
  1. @johnynek

    Merge pull request #520 from piyushnarang/develop

    Bump scala version
    johnynek committed Apr 26, 2016
Commits on Apr 25, 2016
  1. @piyushnarang

    Bump scala version

    piyushnarang committed Apr 25, 2016
Commits on Apr 17, 2016
  1. @koertkuipers
Commits on Apr 14, 2016
  1. @johnynek

    Merge pull request #518 from joshualande/fix_type

    Fix the input type of Operators.toRichTraversable
    johnynek committed Apr 13, 2016
Commits on Apr 9, 2016
  1. fix input type of toRichTraversable

    jlande committed Apr 9, 2016
Commits on Feb 26, 2016
  1. @johnynek

    Merge pull request #514 from twitter/ggonzalez/fix_flaky_minhasherspec

    Fix flaky `MinHasherSpec` test.  Fixes #500
    johnynek committed Feb 26, 2016
  2. @Gabriel439

    Fix flaky `MinHasherSpec` test

    One `MinHasher32` being tested was using 247 hash functions internally, which
    means that the expected absolute error bound was:
    
        1 / sqrt(# of hash functions) = 1 / sqrt(247) ≈ 0.064
    
    The same test required an upper bound on the error of `0.05`, which was below
    the expected error bound, meaning that this test would give flaky results due to
    occasionally exceeding the required error bound.  This change increases the
    upper bound on the error to `0.1`, which should give enough headroom over the
    expected error bound to reduce the number of random test failures.
    Gabriel439 committed Feb 26, 2016
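The arithmetic above is easy to check: with 247 hash functions the expected absolute error is roughly 0.064, which exceeds the old 0.05 bound but sits comfortably under the new 0.1 bound.

```scala
// With 247 hash functions, the expected absolute error:
val expectedError = 1.0 / math.sqrt(247.0) // roughly 0.0636
// It exceeds the old bound but fits under the new one.
val oldBoundTooTight = expectedError > 0.05 // true: hence the flaky failures
val newBoundOk = expectedError < 0.1        // true: hence the headroom
```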
Commits on Feb 22, 2016
  1. @Gabriel439

    Merge pull request #510 from twitter/jnievelt/503

    Using dropRight instead of tail for shortening BytesSpec array
    Gabriel439 committed Feb 22, 2016
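A small illustration of the distinction behind this change: on an empty collection, `dropRight(1)` is a safe no-op, while `tail` throws. (The two also trim opposite ends of a non-empty collection.)

```scala
val empty = Vector.empty[Byte]
val safe = empty.dropRight(1) // Vector(): a no-op, does not throw
// empty.tail would throw UnsupportedOperationException

val bytes = Vector[Byte](1, 2, 3)
val fromEnd = bytes.dropRight(1) // Vector(1, 2): drops the last element
val fromFront = bytes.tail       // Vector(2, 3): drops the first element
```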
Commits on Feb 17, 2016
  1. @johnynek

    Merge pull request #511 from NathanHowell/identity

    Identity class and Monad instance
    johnynek committed Feb 17, 2016
  2. @NathanHowell

    Add the Identity monad

    NathanHowell committed Feb 17, 2016
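For readers unfamiliar with it, a minimal Identity type with map and flatMap can be sketched as follows; this is illustrative only, and the PR should be consulted for Algebird's actual definition and its Monad instance.

```scala
// A minimal Identity: a value in a trivial context, with map/flatMap so it
// can be used in for-comprehensions. Illustrative, not Algebird's definition.
case class Identity[A](get: A) {
  def map[B](f: A => B): Identity[B] = Identity(f(get))
  def flatMap[B](f: A => Identity[B]): Identity[B] = f(get)
}

val r = for {
  x <- Identity(20)
  y <- Identity(22)
} yield x + y
// r == Identity(42)
```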
Commits on Feb 16, 2016
  1. BytesSpec tests empty Arrays

    Joe Nievelt committed Feb 16, 2016