Commits on Jan 13, 2017
  1. Fix the connection leaking issue. (#469)

    gaojieliu committed on GitHub Jan 13, 2017
Commits on Jan 7, 2017
Commits on Jan 6, 2017
  1. Added build.replica.factor check when deciding whether Voldemort should…

    … fail the data push when data fetches fail in some nodes in HA mode.
    gaojieliu committed Jan 6, 2017
Commits on Nov 29, 2016
Commits on Nov 28, 2016
  1. Adhere to target/source Java compatibility for contrib projects

    We are using Java 1.7 on our Hadoop cluster, but locally I use JDK 1.8
    for building, so contrib got built with 1.8 by default because its
    source and target versions are not specified explicitly. Consequently,
    the resulting jar didn't work in our cluster environment. This
    reconfiguration fixes this.
    
    But since the contrib projects don't build anymore with 1.6 source
    compatibility, I bumped up the configured javac.version to 1.7, even
    though two other options are available:
    
    1. Fix contrib to make it 1.6 compatible again
    2. Specify different versions for the main and contrib projects
    
    But since Java 1.6 has been EOL for quite a while, I suggest that no
    Voldemort servers with a Java 1.6 runtime should be running anymore, and
    it should be safe to upgrade and keep the configuration simple.
    bitti committed with FelixGV Oct 13, 2016
  2. Fix detection for fetcher protocol warning

    The fetcher protocol warning was shown even if the recommended protocol
    was specified explicitly, which is unnecessarily confusing.
    bitti committed with FelixGV Oct 19, 2016
Commits on Nov 24, 2016
  1. Make nodes option work again for generate_cluster_xml.py

    Option -n or --nodes has been broken since d2452a9, when the
    input file check was added.
    bitti committed Nov 24, 2016
Commits on Nov 10, 2016
  1. Releasing Voldemort 1.10.23

    FelixGV committed Nov 10, 2016
  2. BnP now retries fetches when cluster.xml is stale.

    Previously, there was a race condition where a BnP job would initialize
    its AdminClients at the beginning of the job, and then hang on to that
    Cluster state throughout the job. If maintenance is going on while
    the job is running, then it's possible that by the time the job gets to
    the "Push" phase, the Cluster representation may be stale. In those
    cases, it is possible that a BnP job may attempt to push to a node
    which has been swapped out of the cluster. This may cause BnP HA to
    trigger even though the cluster is actually healthy at that time.
    
    In order to fix this, two changes are made in this commit:
    1. In the VoldemortSwapJob, the AdminClient is constructed from scratch
       rather than being created based on the previous Cluster state. This
       should minimize the window during which it is possible to change
       the cluster.xml and make BnP hit the wrong node, but it does not
       completely eliminate the race condition.
    2. In the AdminStoreSwapper, the invokeFetch() code will check if an
       exception is caused by a stale cluster state. If it is, it will
       get a fresh AdminClient and retry the operation. This should totally
       prevent the race condition.
    
    The BnP job will do a limited amount of fetch retries (10 attempts with
    30 seconds of wait time between each) and only when hitting soft errors
    (i.e.: connection failure, etc.).
    FelixGV committed Nov 8, 2016
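
The bounded retry described in the commit above (10 attempts, 30 seconds apart, soft errors only, with a fresh client before each retry) can be sketched roughly as follows. This is a minimal illustration and not the actual AdminStoreSwapper code: `FetchRetrySketch`, `SoftFetchException`, `fetchWithRetry` and the `refreshClient` callback are hypothetical names introduced here.

```java
import java.util.function.Supplier;

/** Hypothetical helper, not part of Voldemort: bounded retry for soft errors only. */
final class FetchRetrySketch {
    /** Stand-in for a "soft" failure such as a connection error or stale cluster state. */
    static final class SoftFetchException extends RuntimeException {
        SoftFetchException(String message) {
            super(message);
        }
    }

    /**
     * Runs the fetch up to maxAttempts times. Only SoftFetchException triggers a
     * retry; any other exception propagates immediately. Before each retry the
     * refreshClient callback runs (e.g. to rebuild a client from fresh cluster state).
     */
    static <T> T fetchWithRetry(Supplier<T> fetch, Runnable refreshClient,
                                int maxAttempts, long waitMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SoftFetchException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.get();
            } catch (SoftFetchException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // retries exhausted, rethrow below
                }
                refreshClient.run();
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
        throw last;
    }
}
```

With the numbers from the commit message, a caller would pass `maxAttempts = 10` and `waitMillis = 30_000L`.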
Commits on Nov 9, 2016
  1. Tweaked the AdminClient's currentVersion so that it is not stale.

    Previously, there could be a case where an AdminClient is created from
    a stale Cluster instance, which would lead to isClusterModified() not
    returning the correct result. This was because the AdminClient would
    always set its currentVersion to the current time, no matter how long
    ago the passed in Cluster instance was originally generated. In cases
    where the cluster.xml configuration is altered after the Cluster
    instance is constructed, but before the AdminClient is constructed,
    then there is a potentially very long window during which the wrong
    currentVersion would be set.
    FelixGV committed Nov 8, 2016
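
To see why stamping the client with its construction time hides a change, consider the sketch below. It assumes that staleness is decided by comparing a recorded version timestamp with the time cluster.xml was last modified; the class, the method signature and the timestamps are hypothetical, and the commit message does not say how the real fix derives the version.

```java
/** Hypothetical model of the version comparison, not the real AdminClient. */
final class ClusterVersionSketch {
    /** True when cluster.xml changed after the version the client recorded. */
    static boolean isClusterModified(long clientVersionMs, long clusterXmlLastModifiedMs) {
        return clusterXmlLastModifiedMs > clientVersionMs;
    }

    public static void main(String[] args) {
        long snapshotTakenAt = 100L; // Cluster instance generated
        long xmlModifiedAt = 200L;   // cluster.xml altered afterwards
        long clientBuiltAt = 300L;   // client constructed from the old snapshot

        // Version set to construction time: the change goes unnoticed.
        System.out.println(isClusterModified(clientBuiltAt, xmlModifiedAt));
        // Version tied to the snapshot's age: the change is detected.
        System.out.println(isClusterModified(snapshotTakenAt, xmlModifiedAt));
    }
}
```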
Commits on Nov 8, 2016
  1. BnP now kills an async job that it is waiting on if that job times out.

    Previously, BnP would just leave the async job running if it timed out,
    which is wasteful, and could cause a subsequent job retry to fail if
    there are two fetch jobs running concurrently for the same store.
    FelixGV committed Nov 8, 2016
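
The wait-then-kill pattern from the commit above can be illustrated locally with `java.util.concurrent`. This is only an analogy: in BnP the async job runs on a Voldemort server, whereas here a plain `Future` stands in for it, and `AsyncJobTimeoutSketch` and `waitOrKill` are hypothetical names.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical local analogy: cancel a job instead of abandoning it on timeout. */
final class AsyncJobTimeoutSketch {
    /** Returns true if the job finished in time, false if it was cancelled. */
    static boolean waitOrKill(Future<?> job, long timeoutMillis) {
        try {
            job.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Do not leave the job running: a later retry could collide with it.
            job.cancel(true);
            return false;
        } catch (InterruptedException e) {
            job.cancel(true);
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("job failed", e.getCause());
        }
    }
}
```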
Commits on Nov 7, 2016
  1. Changed the DeleteAllFailedFetchStrategy so that it affects all nodes.

    Previously, the DeleteAllFailedFetchStrategy would only attempt to
    delete data from nodes which succeeded in their fetch. In some failure
    modes, this is appropriate, but in other cases, it isn't. In any case,
    there is no harm in trying to delete data on all nodes, even those
    that failed their fetch. This commit makes it so.
    FelixGV committed Nov 7, 2016
Commits on Nov 4, 2016
  1. Made admin connection/socket timeout configurable in BnP.

    Also changed the default socket timeout to 180 seconds.
    
    This fixes the following problem: when a node is unreachable and
    completely shut down, requests to it will time out, which takes
    60 seconds. When BnP notices this, it will reach one of the live
    nodes in the cluster and ask it to deal with the failure. The
    live node will try to talk to the dead node, which will also
    take 60 seconds to time out. By the time the live node decides
    that the dead node is unreachable, and responds to the BnP job,
    the BnP job will have already timed out. Then, the BnP job will
    think that the HandleFailedFetchRequest could not complete
    successfully (even though it did in fact complete successfully)
    and BnP HA will be aborted.
    
    The solution is that the BnP job's socket timeout must be greater
    than the server's default connection timeout.
    
    This was not an issue before when we had insanely long timeouts,
    but those timeouts have been reduced considerably in commit
    34debd3. This is likely when we regressed on the handling of this
    failure mode.
    FelixGV committed Nov 4, 2016
  2. Fix lots of typos and spelling mistakes

    This shouldn't entail any functional changes (besides some corrected log
    or assertion error messages)
    bitti committed with FelixGV Oct 20, 2016
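
The timeout relationship stated in the admin connection/socket timeout commit above (the BnP socket timeout must exceed the server's connection timeout) reduces to a one-line check. The sketch below uses the figures from that commit message, 180 seconds for the new BnP default and 60 seconds for the server side; the class and method names are hypothetical.

```java
/** Hypothetical check of the timeout budget described in the commit message. */
final class TimeoutBudgetSketch {
    /**
     * True when the client is still waiting by the time the live node has
     * given up on the dead node and is ready to respond.
     */
    static boolean clientOutlastsServer(long clientSocketTimeoutSec, long serverConnectTimeoutSec) {
        return clientSocketTimeoutSec > serverConnectTimeoutSec;
    }

    public static void main(String[] args) {
        System.out.println(clientOutlastsServer(180, 60)); // new default: response arrives in time
        System.out.println(clientOutlastsServer(60, 60));  // equal timeouts: client gives up first
    }
}
```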
Commits on Oct 5, 2016
  1. Added some extra logging when OOM occurs in BnP.

    The AvroStoreBuilderMapper can OOM when manipulating certain bad Avro records.
    
    This change does not actually prevent the OOM, but merely prints some useful
    info before dying.
    FelixGV committed Oct 4, 2016
Commits on Sep 26, 2016
  1. Replaced the following instances with http://www.project-voldemort.com

    …since URLs don't resolve from some places
    
    $ grep -r 'http://project-voldemort' .
    ./clients/python/setup.py:      url='http://project-voldemort.com',
    ./NOTES:For the most up-to-date information see http://project-voldemort.com
    ./contrib/collections/src/java/voldemort/collections/VStack.java: *        voldemort JSON formats: http://project-voldemort.com/design.php
    mattwisein committed Sep 26, 2016
Commits on Sep 20, 2016
  1. Releasing Voldemort 1.10.22

    mattwisein committed Sep 20, 2016
Commits on Sep 13, 2016
  1. The BnP job should be resilient to colo failures, but this regressed.

    This commit adds a safeguard to bring back resilience to full colo
    failures.
    
    Now, if a colo is unreachable, the BnP job will still push to the
    other (healthy) colos, but it will fail the job afterwards with a
    message saying which colo failed.
    FelixGV committed Sep 12, 2016
Commits on Sep 6, 2016
  1. Python client has an issue with inconsistent indentation (#446)

    The indentation in the code is mostly spaces while the offending
    line is tab indented. Hence, importing and initializing the client
    fails with an Indentation error.
    esawtooth committed with mattwisein Sep 6, 2016
Commits on Aug 30, 2016
  1. Provide chunk size suggestion for BnP jobs with chunk overflow except…

    …ions and fix num chunks algorithm to round up
    mattwisein committed Aug 29, 2016
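
Rounding the chunk count up rather than down is a ceiling division. The sketch below shows one overflow-safe way to write it; `ChunkCountSketch` and `numChunks` are hypothetical names, and the actual BnP computation may use different inputs.

```java
/** Hypothetical illustration of rounding a chunk count up. */
final class ChunkCountSketch {
    /** Number of chunks needed to hold totalBytes when each chunk holds chunkSizeBytes. */
    static long numChunks(long totalBytes, long chunkSizeBytes) {
        if (totalBytes < 0 || chunkSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and chunk size positive");
        }
        // Plain integer division truncates, so add one chunk for any remainder.
        return totalBytes / chunkSizeBytes + (totalBytes % chunkSizeBytes == 0 ? 0 : 1);
    }
}
```

For example, 11 bytes with a chunk size of 5 needs 3 chunks, whereas truncating division would yield 2 and overflow the last chunk.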
Commits on Aug 29, 2016
  1. Introduced new boolean "readonly.omit.port" server configuration.

    When set to true, the port will be removed from the fetch URI. In
    this case, the already-existing "readonly.modify.port" setting is
    ignored.
    
    When set to false (which is the default), then the port will be
    left as part of the fetch URI (according to the already-existing
    "readonly.modify.port" setting).
    FelixGV committed Aug 25, 2016
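
As a rough illustration of what removing the port from a fetch URI means, the sketch below rebuilds a `java.net.URI` without its port. It does not model "readonly.modify.port", and the class name, method names and the example host are hypothetical; the server's real implementation may differ.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical illustration of the "readonly.omit.port" behaviour. */
final class OmitPortSketch {
    /** Returns a copy of the URI with the port component dropped. */
    static URI withoutPort(URI uri) {
        try {
            // A port of -1 means "undefined", so it is left out of the result.
            return new URI(uri.getScheme(), uri.getUserInfo(), uri.getHost(), -1,
                           uri.getPath(), uri.getQuery(), uri.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Applies the setting: omit the port when true, leave the URI untouched when false. */
    static URI fetchUri(URI original, boolean omitPort) {
        return omitPort ? withoutPort(original) : original;
    }
}
```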
Commits on Aug 17, 2016
  1. vadmin.sh stream support for system stores

    The commands
    bin/vadmin.sh stream fetch-entries
    bin/vadmin.sh stream fetch-keys
    
    do not work on system stores like voldsys$_client_registry.
    
    There is a client-side check for valid stores, which only
    validates the user stores. Added a check to include the system
    stores as well.
    arunthirupathi committed Aug 17, 2016
  2. Data cleanup job does not run on system stores

    1) The client registry system store is an in-memory store and is supposed to be cleaned up after 7 days.
    The last change to the DataCleanupJob made the system stores fail with the missing store exception.
    
    Clients re-use the same client id, so unless many clients die and are
    removed, this will not cause a leak of server resources. The effect is negligible.
    
    Now the DataCleanupJob checks both system stores and normal stores for a store definition.
    
    2) If a store's retention days are changed to zero, then the store will
    delete all the records. But if the store is started with 0 retention days,
    it means that data retention is not enabled. Fixed the discrepancy.
    arunthirupathi committed with arunthirupathi Aug 17, 2016
Commits on Aug 11, 2016
  1. Revert "Provide chunk size suggestion for BnP jobs with chunk overflo…

    …w exceptions and fix num chunks algorithm to round up"
    
    This reverts commit fdd2ca9.
    squarY committed Aug 11, 2016
Commits on Aug 10, 2016
  1. Releasing Voldemort 1.10.21

    Fix release notes.
    squarY committed Aug 10, 2016
  2. Merge pull request #436 from squarY/timeoutfix

    Fix: Extend the timeout of admin request
    squarY committed on GitHub Aug 10, 2016
  3. Fix: extend admin request time out from 1min to 5min.

    Add more logs when handling failed fetch request.
    
    Fix issues based on RB.
    squarY committed Aug 10, 2016
Commits on Aug 5, 2016
  1. Provide chunk size suggestion for BnP jobs with chunk overflow except…

    …ions and fix num chunks algorithm to round up
    FelixGV committed with mattwisein Jun 10, 2016
Commits on Jul 28, 2016
  1. RO Store Create floods the Log with error messages

    Creating a RO store queries for an existing store, which
    fails with StoreNotFoundException. This exception is logged with call
    stack, which floods the logs on every store creation.
    
    This may trick the alerting system into treating this as an error.
    This change stops logging the call stack and reduces the log level
    to info when such exceptions are logged.
    arunthirupathi committed Jul 28, 2016
Commits on Jul 25, 2016
Commits on Jul 21, 2016
  1. Log verifyOrAddStore time in the logs.

    Currently the time spent in verifyOrAddStore is not tracked.
    This change introduces a log line to track this time.
    
    The following log message will be added for these calls.
    
    [18:36:23,113 voldemort.client.protocol.admin.AdminClient] INFO
    verifyOrAddStore() BootStrapUrls: [tcp://localhost:48150] Store:
    abc-xyz-read-only Verification Time: 10 ms, Creation Time: 39 ms [main]
    arunthirupathi committed Jul 21, 2016
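
A log line of the shape shown in the commit above could be assembled roughly as follows. This is a sketch under assumptions: the class and method names are hypothetical, only the message text is taken from the sample log line, and the real AdminClient may measure and format the times differently.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that reproduces the message part of the sample log line. */
final class VerifyOrAddStoreTimingSketch {
    /** Converts a pair of System.nanoTime() readings into elapsed milliseconds. */
    static long elapsedMillis(long startNanos, long endNanos) {
        return TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
    }

    /** Builds the message text, without the timestamp, logger name and level. */
    static String formatTiming(String bootstrapUrls, String store,
                               long verificationMs, long creationMs) {
        return "verifyOrAddStore() BootStrapUrls: " + bootstrapUrls
                + " Store: " + store
                + " Verification Time: " + verificationMs + " ms,"
                + " Creation Time: " + creationMs + " ms";
    }
}
```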
Commits on Jul 20, 2016
  1. Rest Server Port is not serialized correctly

    Problems
    1) Running ./gradlew clean build exits the process, and the
    build fails in the middle.
    2) When Voldemort server rest validation fails it exits the process.
    3) Cluster does not serialize the rest port correctly, which caused the
    rest port validation to fail.
    4) Before auto node detection the tests were using in memory
    cluster instead of the cluster in the metadata. Now both tests and
    product code use the same code path, which caused the tests to fail.
    
    Fix:
    1) Cluster serializes the rest port, if it is greater than zero.
    2) When rest server validation fails, it throws an exception, instead
    of exiting the process. (Searched code for System.exit and Coordinator
    Server does the same, but saving that for a different day).
    3) Node state string contains the rest port, if it is present.
    4) Let the RestServiceR2StoreTest fail with an actual error message,
    instead of a boilerplate error message, which made debugging harder.
    arunthirupathi committed Jul 20, 2016
  2. InsufficientOperationalNodes concurrent exception

    From time to time, Insufficient operational nodes can throw a
    concurrent modification exception, because failures is not a thread-safe list.
    
    Changed the list to a CopyOnWriteArrayList. The code path is only used
    when nodes fail, so there should not be any noticeable impact on
    performance.
    arunthirupathi committed Jul 14, 2016
  3. Fix the HintedHandOff flaky tests

    1) I made some changes to metadata store in auto detect node id,
    and noticed these tests were failing. On investigation the test
    failures are caused by using static variables some of which are
    modified and based on the ordering they may or may not fail.
    
    2) Removed most of the static usage and made most of them as
    parameters.
    
    I still don't completely understand the test as it is quite complicated,
    but sprinkled in some sleeps to make sure that slops are registered.
    
    Tests passed successfully on 50 consecutive runs.
    arunthirupathi committed Jul 14, 2016
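
The CopyOnWriteArrayList change in the InsufficientOperationalNodes commit above can be illustrated with a single-threaded stand-in: modifying a plain ArrayList while iterating over it throws ConcurrentModificationException, while a CopyOnWriteArrayList iterates over a snapshot and does not. `FailureListSketch` and its method are hypothetical names, not Voldemort code.

```java
import java.util.ConcurrentModificationException;
import java.util.List;

/** Hypothetical demonstration of snapshot iteration versus fail-fast iteration. */
final class FailureListSketch {
    /**
     * Iterates over the list and appends one element mid-iteration.
     * Returns true if the iteration completed, false if it failed fast.
     */
    static boolean survivesAddDuringIteration(List<String> failures) {
        try {
            boolean added = false;
            for (String failure : failures) {
                if (!added) {
                    // Simulates another code path recording a failure mid-iteration.
                    failures.add("late failure after " + failure);
                    added = true;
                }
            }
            return true;
        } catch (ConcurrentModificationException e) {
            return false;
        }
    }
}
```

The trade-off is that every write copies the backing array, which is acceptable when, as the commit notes, the list is only written on the node-failure path.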