Commits on Aug 24, 2015
  1. @arunthirupathi @FelixGV

    Fork lift corrupts the data on schema mismatch

    arunthirupathi authored FelixGV committed
    1) If the source and destination schemas do not match, the forklift
    currently corrupts the destination data by streaming in the bytes
    from the source as-is.
    After this commit, the forklift fails when a schema mismatch is
    detected. The old behavior, if required, can be enabled via the
    undocumented parameter ignore-schema-mismatch.
    The default of the forklift tool, which forklifts all stores, is
    changed to fail when no store name is specified. I can't imagine a
    situation where you often want to forklift everything from one
    cluster to another, and if an admin forgets to specify this
    parameter, they would forklift the entire cluster, which is
    definitely not the intended default.
    Added 5 unit tests (3 for key mismatch and 2 for value mismatch).
    Added pretty-print functions to Compression and SerializerDefinition.
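A minimal sketch of the kind of guard this commit describes: compare the source and destination schemas up front and fail fast unless the override flag is set. The class and method names below are illustrative, not Voldemort's actual API.

```java
// Hypothetical sketch of the forklift schema guard: refuse to stream bytes
// when the source and destination schemas differ, unless the (undocumented)
// ignore-schema-mismatch flag is set. Names are illustrative only.
public class SchemaGuard {

    /** Throws if the schemas differ and the override flag is not set. */
    public static void checkSchemas(String sourceSchema, String destSchema,
                                    boolean ignoreSchemaMismatch) {
        if (!sourceSchema.equals(destSchema) && !ignoreSchemaMismatch) {
            throw new IllegalStateException(
                "Source and destination schemas do not match; refusing to forklift. "
                + "Pass ignore-schema-mismatch to restore the old (unsafe) behavior.");
        }
    }

    /** Convenience wrapper for testing: true if the guard would reject this pair. */
    public static boolean wouldFail(String src, String dst, boolean ignore) {
        try {
            checkSchemas(src, dst, ignore);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }
}
```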
Commits on Aug 12, 2015
  1. @FelixGV

    Ensure that all the AbstractStorageEngineTest tests get run by all the subclasses.

    James Lent authored FelixGV committed
    Add the @Test annotation to several tests. Without that annotation, it
    appears that the subclasses that are parameterized do not run these
    specific test cases. This is most likely a base Gradle issue. Perhaps
    somewhat related to:
Commits on Aug 11, 2015
  1. @FelixGV

    Fixed DataCleanupJobTest.

    FelixGV authored
Commits on Jul 20, 2015
  1. @FelixGV

    Voldemort BnP pushes to all colos in parallel.

    FelixGV authored
    Also contains many logging improvements to discriminate between hosts and clusters.
Commits on Jul 16, 2015
  1. @FelixGV

    Rewrite of the EventThrottler code to use Tehuti.

    FelixGV authored
    - Makes throttling less vulnerable to spiky traffic sneaking in "between the interval".
    - Also fixes throttling for the HdfsFetcher when compression is enabled.
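An illustrative sketch (not the actual Tehuti-based implementation) of why a windowed rate measurement is more robust than per-interval counters: events are timestamped and the rate is computed over a sliding window, so a burst straddling an interval boundary can no longer sneak under the limit.

```java
import java.util.ArrayDeque;

// Toy sliding-window throttler: admits an event only if fewer than
// maxEventsPerWindow events occurred in the trailing windowMs milliseconds.
// This is a simplified stand-in for the Tehuti-based rewrite, for intuition only.
public class SlidingWindowThrottler {
    private final long windowMs;
    private final long maxEventsPerWindow;
    private final ArrayDeque<Long> timestamps = new ArrayDeque<>();

    public SlidingWindowThrottler(long windowMs, long maxEventsPerWindow) {
        this.windowMs = windowMs;
        this.maxEventsPerWindow = maxEventsPerWindow;
    }

    /** Returns true if an event at time nowMs is admitted, false if throttled. */
    public boolean tryAcquire(long nowMs) {
        // Drop events that have aged out of the trailing window.
        while (!timestamps.isEmpty() && nowMs - timestamps.peekFirst() >= windowMs) {
            timestamps.pollFirst();
        }
        if (timestamps.size() >= maxEventsPerWindow) {
            return false;
        }
        timestamps.addLast(nowMs);
        return true;
    }
}
```

With fixed intervals, a spike at the end of one interval plus a spike at the start of the next could exceed the intended rate; the trailing window above sees both spikes together.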
Commits on Jun 30, 2015
  1. @FelixGV
  2. @FelixGV

    First-cut implementation of Build and Push High Availability.

    FelixGV authored
    This commit introduces a limited form of HA for BnP. The new functionality is disabled by default and can be enabled via the following server-side configurations, all of which are necessary:
    push.ha.enabled=true
    push.ha.cluster.id=<some arbitrary name which is unique per physical cluster>
    push.ha.lock.path=<some arbitrary HDFS path used for shared state>
    The Build and Push job will interrogate each cluster it pushes to and honor each cluster's individual settings (i.e.: one can enable HA on one cluster at a time, if desired). However, even if the server settings enable HA, this should be considered a best-effort behavior, since some BnP users may be running older versions of BnP which will not honor HA settings. Furthermore, up-to-date BnP users can also set the following config to disable HA, regardless of server-side settings:
    Below is a description of the behavior of BnP HA, when enabled.
    When a Voldemort server fails to do some fetch(es), the BnP job attempts to acquire a lock by moving a file into a shared directory in HDFS. Once the lock is acquired, it will check the state in HDFS to see if any nodes have already been marked as disabled by other BnP jobs. It then determines if the Voldemort node(s) which failed the current BnP job would bring the total number of unique failed nodes above the configured maximum, with the following outcome in each case:
    - If the total number of failed nodes is equal to or lower than the max allowed, then metadata is added to HDFS to mark the store/version currently being pushed as disabled on the problematic node. Afterwards, if the Voldemort server that failed the fetch is still online, it will be asked to go into offline mode (this is best effort, as the server could be down). Finally, BnP proceeds with swapping the new data set version on, as if all nodes had fetched properly.
    - If, on the other hand, the total number of unique failed nodes is above the configured max, then the BnP job will fail and the nodes that succeeded the fetch will be asked to delete the new data, just like before.
    In either case, BnP will then release the shared lock by moving the lock file outside of the lock directory, so that other BnP instances can go through the same process one at a time, in a globally coordinated (mutually exclusive) fashion. All HA-related HDFS operations are retried every 10 seconds up to 90 times (thus for a total of 15 minutes). These are configurable in the BnP job via push.ha.lock.hdfs.timeout and push.ha.lock.hdfs.retries respectively.
    When a Voldemort server is in offline mode, in order for BnP to continue working properly, the BnP jobs must be configured so that push.cluster points to the admin port, not the socket port. Configured in this way, transient HDFS issues may lead to the Voldemort server being put in offline mode, but wouldn't prevent future pushes from populating the newer data organically.
    External systems can be notified when the BnP HA code gets triggered via two new BuildAndPushStatus values passed to the custom BuildAndPushHooks registered with the job: SWAPPED (when things work normally) and SWAPPED_WITH_FAILURES (when a swap occurred despite some failed Voldemort node(s)). BnP jobs that fail because the maximum number of failed Voldemort nodes would have been exceeded still fail normally and trigger the FAILED hook.
    Future work:
    - Auto-recovery: Transitioning the server from offline to online mode, as well as cleaning up the shared metadata in HDFS, is not handled automatically as part of this commit (which is the main reason why BnP HA should not be enabled by default). The recovery process currently needs to be handled manually, though it could be automated (at least for the common cases) as part of future work.
    - Support non-HDFS based locking mechanisms: the HdfsFailedFetchLock is an implementation of a new FailedFetchLock interface, which can serve as the basis for other distributed state/locking mechanisms (such as Zookeeper, or a native Voldemort-based solution).
    Unrelated minor fixes and clean ups included in this commit:
    - Cleaned up some dead code.
    - Cleaned up abusive admin client instantiations in BnP.
    - Cleaned up the closing of resources at the end of the BnP job.
    - Fixed a NPE in the ReadOnlyStorageEngine.
    - Fixed a broken sanity check in Cluster.getNumberOfTags().
    - Improved some server-side logging statements.
    - Fixed exception type thrown in ConfigurationStorageEngine's and FileBackedCachingStorageEngine's getCapability().
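The swap-or-fail decision described above reduces to a simple set computation: take the union of nodes already marked disabled in the shared HDFS state with the nodes that failed the current push, and compare its size to the configured maximum. A hypothetical sketch of just that decision logic (the real implementation also acquires the HDFS lock by moving a lock file around this check; all names here are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the BnP HA decision: may this push still swap,
// given the nodes already disabled by previous pushes and the nodes that
// failed this one? Not Voldemort's actual class.
public class HaDecision {

    /**
     * Returns true if the push may proceed (the newly failed nodes get marked
     * disabled and the swap happens anyway), false if the job must fail.
     */
    public static boolean canSwap(Set<Integer> alreadyDisabled,
                                  Set<Integer> newlyFailed,
                                  int maxNodeFailures) {
        // Unique failed nodes across all pushes so far, including this one.
        Set<Integer> union = new HashSet<>(alreadyDisabled);
        union.addAll(newlyFailed);
        return union.size() <= maxNodeFailures;
    }
}
```

Note that a node which fails repeatedly only counts once, since the comparison is on unique nodes.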
Commits on Jun 12, 2015
  1. @arunthirupathi

    ConnectionException is not catastrophic

    arunthirupathi authored
    1) If a connection times out or fails during protocol negotiation, it
    is treated as a normal error instead of a catastrophic error. The
    connection timeout was a regression from the NIO connect fix; the
    protocol negotiation timeout is a new change to detect failed servers.
    2) When a node is marked down, the outstanding queued requests are no
    longer left to go through the connection creation cycle, where, with
    no other outstanding requests, they could wait indefinitely until the
    next request came in.
    3) UnreachableStoreException is sometimes double wrapped, which causes
    catastrophic errors to not be detected accurately. Created a utility
    method that handles this case; use it when you are not sure whether
    the thrown exception could be an UnreachableStoreException.
    4) In a non-blocking connect, if the DNS name does not resolve, Java
    throws UnresolvedAddressException instead of UnknownHostException,
    which is probably an issue in Java. Also, UnresolvedAddressException
    is derived not from IOException but from IllegalArgumentException,
    which is weird. Fixed the code to handle this.
    5) Tuned the remembered exceptions timeout to twice the connection
    timeout. Previously it was hardcoded to 3 seconds, which was too
    aggressive when the connection timeout for some use cases was set to
    more than 5 seconds.
    Added unit tests to verify all the above cases.
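The double-wrapping fix in point 3 amounts to walking the cause chain instead of checking only the outermost exception. A sketch of that idea (this is not Voldemort's actual utility; the nested exception class below is a stand-in for voldemort.store.UnreachableStoreException so the example stays self-contained):

```java
// Illustrative sketch: detect an UnreachableStoreException even when it is
// wrapped (possibly more than once) inside other exceptions.
public class ExceptionUtils {

    /** Walks the cause chain looking for an UnreachableStoreException. */
    public static boolean isUnreachableStore(Throwable t) {
        while (t != null) {
            if (t instanceof UnreachableStoreException) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    // Stand-in for voldemort.store.UnreachableStoreException, so this
    // example compiles on its own.
    public static class UnreachableStoreException extends RuntimeException {
        public UnreachableStoreException(String msg) { super(msg); }
        public UnreachableStoreException(String msg, Throwable cause) { super(msg, cause); }
    }
}
```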
Commits on May 27, 2015
  1. @arunthirupathi

    Add more testing for Serialization.

    arunthirupathi authored
    Added more testing for Serialization. I was doing some tests on what
    the expected input and output for the serializers are, and I thought
    it would be a good idea, instead of just documenting them, to write
    unit tests to validate them. Most of the serializers have very poor
    test coverage, so I decided to add these unit tests. I will add more
    tests as I continue working on the expected input/output.
Commits on May 15, 2015
  1. @cshaxu

    fix slop pusher unit test

    cshaxu authored
Commits on May 13, 2015
  1. @cshaxu
Commits on May 12, 2015
  1. @cshaxu
Commits on May 6, 2015
  1. @arunthirupathi

    NIO style connect

    arunthirupathi authored
    Problems :
          1) Connect blocks the selector. This causes other operations
    (read/write) queued on the selector to incur additional latency or
    timeouts. This is worse when the data centers are far away.
          2) The ProtocolNegotiation request is done after the connection
    establishment, which blocks the selector in the same manner.
          3) If exceptions are encountered while getting connections from
    the queue, they are ignored.
    Solutions :
             Connection creation is now async. The create method is
    renamed to createAsync and takes in the pool object. For NIO,
    createAsync triggers an async operation which checks the connection in
    to the pool when it is ready. For blocking connections, createAsync
    blocks, creates the connection, and checks it in to the pool before
    returning.
             As connection creation is now async, exceptions are
    remembered (for 5 seconds) in the pool. When a thread asks for a
    connection while an exception is remembered, it gets that exception
    immediately.
             There is no ordering in the way connections are handed out:
    one thread can request a connection and, before it starts to wait,
    another thread could steal that connection. This is avoided to a
    certain extent by splitting the single blocking wait into two halves
    and creating a connection in between if required. This should not be a
    problem in the real world, as it can't happen once you reach steady
    state (the required number of connections has been created).
             Upgraded the source compatibility from Java 5 to 6. Most of
    the code is written with the assumption of Java 6, and I don't believe
    this code can run on Java 5, so the impact should be minimal; if it
    goes in the Client V2 branch, it will get the benefit of additional
    testing.
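The "remembered exceptions" behavior above can be sketched as a small time-bounded cache: when an async connect fails, the failure is recorded with a timestamp, and any thread asking for a connection inside the remember window fails fast instead of waiting for its own timeout. This is a simplified illustration, not the actual pool code; names are hypothetical.

```java
// Illustrative sketch of remembering connect failures for a bounded time so
// that callers fail fast while the failure is fresh. The real pool tracks
// this per destination; this toy version tracks a single failure.
public class RememberedFailure {
    private final long rememberForMs;
    private Exception lastFailure;
    private long failureAtMs;

    public RememberedFailure(long rememberForMs) {
        this.rememberForMs = rememberForMs;
    }

    /** Record a connect failure observed at nowMs. */
    public synchronized void remember(Exception e, long nowMs) {
        lastFailure = e;
        failureAtMs = nowMs;
    }

    /** Returns the remembered failure if still within the window, else null. */
    public synchronized Exception check(long nowMs) {
        if (lastFailure != null && nowMs - failureAtMs < rememberForMs) {
            return lastFailure;
        }
        return null;
    }
}
```

(The later "ConnectionException is not catastrophic" commit tunes this window from a hardcoded 5 seconds to twice the connection timeout.)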
Commits on Apr 27, 2015
  1. @arunthirupathi

    Rebalance unit tests fail intermittently

    arunthirupathi authored
    There are 2 issues:
    1) Put is asynchronous, so there needs to be a wait before the put is
    verified on all the nodes.
    2) Repeated puts need to generate different vector clocks.
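The first fix follows a common pattern for testing asynchronous writes: poll until the value is visible (up to a timeout) instead of asserting immediately after the put returns. A generic sketch of such a helper, with hypothetical names, not the test code from this commit:

```java
import java.util.function.BooleanSupplier;

// Illustrative polling helper for asserting on eventually-visible state,
// e.g. waiting until an async put is readable on all nodes.
public class TestWait {

    /** Polls the condition every pollMs until it is true or timeoutMs elapses. */
    public static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        // One final check after the deadline, in case the condition just flipped.
        return condition.getAsBoolean();
    }
}
```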
Commits on Apr 25, 2015
  1. @cshaxu
Commits on Apr 20, 2015
  1. @cshaxu

    create admin api for quota operations

    cshaxu authored
    1. Get quota by node id
    2. Set quota by node id
    3. Rebalance quota
    4. Unit test for the new admin apis
  2. @cshaxu
Commits on Apr 16, 2015
  1. @arunthirupathi

    Vector clock deserializer from Input Stream

    arunthirupathi authored
    Avoid double allocating the value for puts, which can potentially be a
    few kilobytes. The vector clock now has a deserializer that reads from
    an InputStream, and it is used to avoid the double allocation on the
    hot path.
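The idea is to consume the clock field-by-field straight off the stream, leaving the stream positioned at the value, instead of first copying (clock + value) bytes into a second array and slicing it. A sketch with a deliberately simplified wire format (this is not Voldemort's actual vector clock encoding):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Illustrative sketch: read a clock directly from the stream rather than
// materializing an intermediate byte[]. Simplified format: a short entry
// count, then (short nodeId, long version) pairs.
public class StreamClockReader {

    /** Consumes the clock from the stream; returns the number of entries read. */
    public static int readClockEntries(DataInputStream in) throws IOException {
        int numEntries = in.readShort();
        for (int i = 0; i < numEntries; i++) {
            in.readShort(); // node id
            in.readLong();  // version
        }
        // Stream is now positioned at the value; no second allocation needed.
        return numEntries;
    }

    /** Convenience wrapper for testing against a raw byte array. */
    public static int parse(byte[] bytes) {
        try {
            return readClockEntries(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```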
  2. @arunthirupathi

    Separate Client and Admin Request Handler

    arunthirupathi authored
    Separated the Admin and Client Request Handlers.
    Currently the client port will answer admin requests and the admin
    port will answer client requests. You can bootstrap from either of
    these ports, and the client, after bootstrapping, sends its queries to
    the correct port.
    This is dangerous, as most security implementations for Voldemort rely
    on blocking the admin port via a firewall, and an attacker could
    change the Voldemort source code to send admin requests to the client
    port.
    My intention for the fix was to make sure that the client port answers
    only client requests. This will help me make the client request
    handler share the read and write buffer without touching the admin
    request handler. Though it could be done for both client and admin,
    admin requests are too few and there are too many places to touch, so
    only the client request handler is fixed.
    The AdminClient expects both the client and admin request handlers:
    the admin client does some get-remote-metadata calls which use the
    Voldemort native v1 requests on the admin port. So the admin request
    handler is left unchanged; some code was just moved so that the client
    request handlers are isolated.
  3. @arunthirupathi

    client sharing read/write buffer

    arunthirupathi authored
    The client either writes to or reads from the socket, never both at
    once, so the buffer can be shared, which brings down the memory
    requirement for the client by half.
    But the client has to watch for 2 things:
    1) On write, the buffer expands as necessary, so the buffer needs to
    be reinitialized if it grows.
    2) On read, if the buffer can't accommodate the data, it grows as
    necessary; this case also needs to be handled.
    This works as expected and the unit tests are passing. Will put it
    through VPL to measure the efficiency of the fixes.
    Created a new class to hold the buffer reference. This helps to share
    the buffer between input and output streams easily. Previously you had
    to watch out for places where one buffer moved away from the other and
    call an explicit method to update it.
    Also moved much of the buffer growing and resetting logic into common
    code, so it is more readable and understandable.
    Should I rename the ByteBufferContainer to MutableByteBuffer? That
    would fit the MutableInt pattern nicely, where a single int can be
    shared by multiple classes and an update by one is visible to the
    others.
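The container idea can be sketched as follows: the input and output streams hold a reference to the container rather than to the ByteBuffer itself, so when a write grows the buffer, every holder sees the replacement without explicit update calls. A minimal illustration, not the actual Voldemort class:

```java
import java.nio.ByteBuffer;

// Illustrative sketch of a shared, growable buffer holder. Streams keep a
// reference to the container, so replacing the underlying buffer here is
// immediately visible to all of them.
public class ByteBufferContainer {
    private ByteBuffer buffer;

    public ByteBufferContainer(int initialCapacity) {
        this.buffer = ByteBuffer.allocate(initialCapacity);
    }

    /** Always returns the current (possibly replaced) buffer. */
    public ByteBuffer getBuffer() {
        return buffer;
    }

    /** Replaces the buffer with a larger one if needed, preserving written bytes. */
    public void ensureCapacity(int required) {
        if (buffer.capacity() < required) {
            ByteBuffer bigger = ByteBuffer.allocate(required);
            buffer.flip();      // prepare written bytes for copying
            bigger.put(buffer); // copy them into the larger buffer
            buffer = bigger;
        }
    }
}
```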
Commits on Apr 13, 2015
  1. @arunthirupathi

    Increase the heap size for Tests

    arunthirupathi authored
    Increase the heap size for tests to 8GB.
    The ZoneShrinkage tests fail from time to time with errors, as they
    run out of memory.
Commits on Apr 6, 2015
  1. @arunthirupathi

    Metadata queries are not sent to same zone

    arunthirupathi authored
    Metadata queries for system stores are sent to the lowest-numbered
    node in the cluster instead of staying within the zone. Added a hack
    to the local-pref strategy: if the client zone is set, use zone-local
    routing.
    The code is (unnecessarily) very complicated. I did not clean it up,
    as I don't want to re-run it for all the scenarios and wanted to make
    a safe fix.
  2. @arunthirupathi

    RouteToAllStrategy routes to node 0..n

    arunthirupathi authored
    RouteToAllStrategy always tries the nodes in a fixed order.
    This sends too many metadata queries to node 0.
    For a zoned cluster, the node with the lowest id gets bombarded with
    too many connections and get queries.
    Create a shuffled node list when the cluster is initialized and use it
    in the routing strategy. A random seed is used at initialization, so
    the order differs every time the cluster is re-initialized.
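The fix above can be sketched as shuffling once at cluster-initialization time and then iterating in that fixed shuffled order, so the load lands on a different first node per client instance. Names here are hypothetical, not Voldemort's actual classes:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch: shuffle the node list once (seeded at initialization)
// and have the route-to-all strategy attempt nodes in that order, so node 0
// no longer absorbs every metadata query.
public class ShuffledNodeOrder {
    private final List<Integer> shuffledNodeIds;

    public ShuffledNodeOrder(List<Integer> nodeIds, long seed) {
        this.shuffledNodeIds = new ArrayList<>(nodeIds);
        Collections.shuffle(this.shuffledNodeIds, new Random(seed));
    }

    /** The order in which route-to-all requests are attempted. */
    public List<Integer> routingOrder() {
        return shuffledNodeIds;
    }
}
```

Each re-initialization uses a fresh seed, so different client instances spread their first attempts across different nodes, while any one instance keeps a stable order.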
Commits on Mar 31, 2015
  1. @arunthirupathi

    Added more tests for the ClientRequestFormat

    arunthirupathi authored
    Added more tests to validate isCompleteResponse for the
    ClientRequestFormat.
    Noticed that the protocol-buffers format will break if the server
    sends less than 4 bytes of data.
  2. @arunthirupathi

    Test cases for client request response

    arunthirupathi authored
    1) Validates the request response.
    2) Added more validation for missing version timestamps and other issues
    3) Added backward compatibility tests.
Commits on Mar 26, 2015
  1. @cshaxu
Commits on Mar 22, 2015
  1. @bhasudha

    Incorporating code review feedback

    bhasudha authored
    - Adding optional read lock for get API
    - Config change for RocksDb default data directory
    - minor log fixes
  2. @arunthirupathi @bhasudha

    Warn on add/delete store in SetMetadata command

    arunthirupathi authored bhasudha committed
    1) Currently, if you add or delete a store using set-metadata, the
    cluster will be in an inconsistent state. Added a warning to the
    server-side log if this happens.
    2) ReplaceNodeCLI does not work correctly if you start the node with
    an empty stores.xml. Fixed that. Now it accepts an empty stores.xml or
    the same stores.xml as the other nodes.
    3) Getting stores.xml returned a different order at different times.
    Made the ordering constant, sorted by the store name.
    4) meta check stores.xml now verifies that the store exists and is
    queriable on the node.
Commits on Mar 18, 2015
  1. @bhasudha

    Add logging messages

    bhasudha authored
  2. @bhasudha

    Adding Parameterized StorageEngineTest for Rocksdb

    bhasudha authored
    In this commit:
    * RocksdbStorageEngineTest that extends AbstractStorageEngineTest
    * Some fixes to the RocksdbStorageEngine
    * Adding support for getVersions(ByteArray key)
  3. @bhasudha

    Modifying the unit test to be a parameterized test

    bhasudha authored
    * The unit test now tests both RocksdbStorageEngine and PartitionPrefixedRocksDbStorageEngine
    * Fixed the getAll unit test.
  4. @bhasudha

    Adding test case for getall

    bhasudha authored
  5. @bhasudha

    Adding more unit test cases

    bhasudha authored
  6. @bhasudha

    Adding basic get after put test to test RocksDB APIs

    bhasudha authored
    * My tests fail with "java.lang.UnsatisfiedLinkError: no rocksdbjni in java.library.path" . Need to fix this later.
Commits on Mar 17, 2015
  1. @FelixGV

    Dependency clean up:

    FelixGV authored
    - Removed tusk, RestHadoopFetcher and related classes and tests.
    - Now fetching libthrift, catalina-ant and tehuti from Maven.