Branch: master
Commits on Jul 29, 2015
  1. @singhsiddharth

    Merge pull request #288 from FelixGV/remove_scala_and_ec2_testing

    singhsiddharth authored
    Remove scala, ec2 testing and public-lib directory
  2. @FelixGV
  3. @FelixGV
Commits on Jul 27, 2015
  1. @FelixGV
Commits on Jul 23, 2015
  1. @FelixGV

    Improved BnP logging.

    FelixGV authored
Commits on Jul 20, 2015
  1. @FelixGV

    Releasing Voldemort 1.9.18

    FelixGV authored
  2. @FelixGV

    Voldemort BnP pushes to all colos in parallel.

    FelixGV authored
    Also contains many logging improvements to discriminate between hosts and clusters.
Commits on Jul 16, 2015
  1. @FelixGV

    Rewrite of the EventThrottler code to use Tehuti.

    FelixGV authored
    - Makes throttling less vulnerable to spiky traffic sneaking in "between the interval".
    - Also fixes throttling for the HdfsFetcher when compression is enabled.
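
    The "between the interval" problem can be illustrated with a small sketch. A throttler that resets its counter at fixed interval boundaries can admit a full quota just before a boundary and another full quota just after it. A sliding window avoids this by counting everything seen in the last window, whatever the boundary. The class below is a hypothetical illustration only: it is not the Tehuti API and not the actual EventThrottler code, and the caller passes the clock value explicitly so the behavior is easy to test.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window throttle; not the Tehuti or Voldemort API. */
final class SlidingWindowThrottle {
    private final long maxUnitsPerWindow;
    private final long windowMs;
    // Each entry is {timestampMs, units}, oldest first.
    private final Deque<long[]> events = new ArrayDeque<>();
    private long unitsInWindow = 0;

    SlidingWindowThrottle(long maxUnitsPerWindow, long windowMs) {
        if (maxUnitsPerWindow <= 0 || windowMs <= 0) {
            throw new IllegalArgumentException("quota and window must be positive");
        }
        this.maxUnitsPerWindow = maxUnitsPerWindow;
        this.windowMs = windowMs;
    }

    /**
     * Returns true and records the units if they fit in the quota for the
     * window (nowMs - windowMs, nowMs]; returns false otherwise.
     */
    synchronized boolean tryAcquire(long units, long nowMs) {
        // Forget events that have slid out of the window.
        while (!events.isEmpty() && events.peekFirst()[0] <= nowMs - windowMs) {
            unitsInWindow -= events.pollFirst()[1];
        }
        if (unitsInWindow + units > maxUnitsPerWindow) {
            return false; // caller should back off and retry later
        }
        events.addLast(new long[] {nowMs, units});
        unitsInWindow += units;
        return true;
    }
}
```

    With a quota of 100 units per 1000 ms, a burst of 100 units at t=900 followed by another at t=1100 would both pass a counter that resets at t=1000; the sliding window rejects the second burst until the first has aged out.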
Commits on Jul 10, 2015
  1. @arunthirupathi

    Fix the white space changes

    arunthirupathi authored
    The previous refactor was done from my MacBook, which did not
    replace tabs with spaces. This messed up a lot of the formatting.
    Instead of redoing the change with spaces, I just reformatted the code,
    which is easier and requires no re-verification.
    You can review the commit by adding ?w=1 to the GitHub URL,
    or by using git diff -w on the command line, to ignore
    whitespace; there are not many changes otherwise.
Commits on Jul 1, 2015
  1. @arunthirupathi

    Pass in additional parameters to fetch

    arunthirupathi authored
    1) Currently the AsyncOperationStatus is set on the HdfsFetcher;
    if 2 or more fetches are running at once, this produces
    erroneous results.
    2) Add StoreName, Version and MetadataStore for use
    in future fetches.
    3) Enabled the Hadoop* tests; I don't know why they were not
    run in the ant tests. When I ported them for parity reasons
    I disabled them too, but I am now enabling them as the tests seem valid.
    4) Made fetch throw IOException instead of Throwable, which
    seemed less reliable and caught more than intended.
  2. @arunthirupathi

    Refactor file fetcher to Strategy Interface/class

    arunthirupathi authored
    Refactor the file fetcher into a Strategy interface and class.
    In the future this lets you modify the file fetching strategy,
    for example having BuildAndPush build only one copy per partition and chunk,
    which the fetcher can then fetch under different names.
    There is no logic change; the code is only refactored.
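
    As a rough illustration of the Strategy refactor described above, the sketch below separates "which files to fetch and under which local names" from the fetcher that uses the plan. All type and method names here (FetchStrategy, IdentityFetchStrategy, FileFetcher, plan, describeFetch) are hypothetical and are not taken from the Voldemort code base.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical strategy: maps each remote file name to the local name to store it under. */
interface FetchStrategy {
    Map<String, String> plan(List<String> remoteFiles);
}

/** Default behavior: every file is fetched under its own name. */
final class IdentityFetchStrategy implements FetchStrategy {
    @Override
    public Map<String, String> plan(List<String> remoteFiles) {
        Map<String, String> plan = new LinkedHashMap<>(); // keeps the input order
        for (String remote : remoteFiles) {
            plan.put(remote, remote);
        }
        return plan;
    }
}

/** The fetcher depends only on the interface, so strategies can be swapped freely. */
final class FileFetcher {
    private final FetchStrategy strategy;

    FileFetcher(FetchStrategy strategy) {
        this.strategy = strategy;
    }

    /** Returns a description of each copy; a real fetcher would perform the copies here. */
    List<String> describeFetch(List<String> remoteFiles) {
        List<String> steps = new ArrayList<>();
        for (Map.Entry<String, String> e : strategy.plan(remoteFiles).entrySet()) {
            steps.add(e.getKey() + " -> " + e.getValue());
        }
        return steps;
    }
}
```

    A strategy that fetches one shared copy under several local names would only need a different plan implementation; the fetcher itself would not change.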
  3. @FelixGV
Commits on Jun 30, 2015
  1. @arunthirupathi

    Merge pull request #271 from dallasmarlow/coordinator-class

    arunthirupathi authored
    Thanks for the fix @dallasmarlow 
    update coordinator class name in server script
  2. @FelixGV
  3. @FelixGV

    First-cut implementation of Build and Push High Availability.

    FelixGV authored
    This commit introduces a limited form of HA for BnP. The new functionality is disabled by default and can be enabled via the following server-side configurations, all of which are necessary:
    push.ha.enabled=true
    <some arbitrary name which is unique per physical cluster>
    push.ha.lock.path=<some arbitrary HDFS path used for shared state>
    The Build and Push job will interrogate each cluster it pushes to and honor each cluster's individual settings (i.e.: one can enable HA on one cluster at a time, if desired). However, even if the server settings enable HA, this should be considered a best-effort behavior, since some BnP users may be running older versions of BnP which will not honor HA settings. Furthermore, up-to-date BnP users can also set the following config to disable HA, regardless of server-side settings:
    Below is a description of the behavior of BnP HA, when enabled.
    When a Voldemort server fails to do some fetch(es), the BnP job attempts to acquire a lock by moving a file into a shared directory in HDFS. Once the lock is acquired, it will check the state in HDFS to see if any nodes have already been marked as disabled by other BnP jobs. It then determines if the Voldemort node(s) which failed the current BnP job would bring the total number of unique failed nodes above the configured maximum, with the following outcome in each case:
    - If the total number of failed nodes is equal to or lower than the max allowed, then metadata is added to HDFS to mark the store/version currently being pushed as disabled on the problematic node. Afterwards, if the Voldemort server that failed the fetch is still online, it will be asked to go into offline mode (this is best effort, as the server could be down). Finally, BnP proceeds with swapping the new data set version on, as if all nodes had fetched properly.
    - If, on the other hand, the total number of unique failed nodes is above the configured max, then the BnP job will fail and the nodes whose fetch succeeded will be asked to delete the new data, just like before.
    In either case, BnP will then release the shared lock by moving the lock file outside of the lock directory, so that other BnP instances can go through the same process one at a time, in a globally coordinated (mutually exclusive) fashion. All HA-related HDFS operations are retried every 10 seconds up to 90 times (thus for a total of 15 minutes). These are configurable in the BnP job via push.ha.lock.hdfs.timeout and push.ha.lock.hdfs.retries respectively.
    When a Voldemort server is in offline mode, in order for BnP to continue working properly, the BnP jobs must be configured so that push.cluster points to the admin port, not the socket port. Configured in this way, transient HDFS issues may lead to the Voldemort server being put in offline mode, but wouldn't prevent future pushes from populating the newer data organically.
    External systems can be notified when the BnP HA code gets triggered via two new BuildAndPushStatus values passed to the custom BuildAndPushHooks registered with the job: SWAPPED (when things work normally) and SWAPPED_WITH_FAILURES (when a swap occurred despite some failed Voldemort node(s)). BnP jobs that failed because the maximum number of failed Voldemort nodes would have been exceeded still fail normally and trigger the FAILED hook.
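
    The decision described above can be sketched as a small pure function. This is only an illustration under stated assumptions: the status names come from this commit message, but the enum, the class, the method and its parameters are hypothetical stand-ins, HA is assumed to be enabled, and the HDFS locking and retry logic is left out entirely.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the BnP HA outcome decision; not the actual Voldemort code. */
final class PushHaDecision {
    /** Status names as given in the commit message; the enum itself is a stand-in. */
    enum Status { SWAPPED, SWAPPED_WITH_FAILURES, FAILED }

    /**
     * @param alreadyDisabledNodes node ids already marked as disabled in the shared state
     * @param newlyFailedNodes     node ids that failed their fetch in the current push
     * @param maxNodeFailures      configured maximum number of unique failed nodes
     */
    static Status decide(Set<Integer> alreadyDisabledNodes,
                         Set<Integer> newlyFailedNodes,
                         int maxNodeFailures) {
        if (newlyFailedNodes.isEmpty()) {
            return Status.SWAPPED; // every node fetched properly
        }
        // Count unique failed nodes: a node that failed before is not counted twice.
        Set<Integer> uniqueFailed = new HashSet<>(alreadyDisabledNodes);
        uniqueFailed.addAll(newlyFailedNodes);
        return uniqueFailed.size() <= maxNodeFailures
                ? Status.SWAPPED_WITH_FAILURES // swap proceeds despite the failures
                : Status.FAILED;               // too many failures: the push is rolled back
    }
}
```

    Because the count is over unique nodes, a node that was already disabled by an earlier push and fails again does not move the total closer to the maximum.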
    Future work:
    - Auto-recovery: Transitioning the server from offline to online mode, as well as cleaning up the shared metadata in HDFS, is not handled automatically as part of this commit (which is the main reason why BnP HA should not be enabled by default). The recovery process currently needs to be handled manually, though it could be automated (at least for the common cases) as part of future work.
    - Support non-HDFS based locking mechanisms: the HdfsFailedFetchLock is an implementation of a new FailedFetchLock interface, which can serve as the basis for other distributed state/locking mechanisms (such as Zookeeper, or a native Voldemort-based solution).
    Unrelated minor fixes and clean ups included in this commit:
    - Cleaned up some dead code.
    - Cleaned up abusive admin client instantiations in BnP.
    - Cleaned up the closing of resources at the end of the BnP job.
    - Fixed an NPE in the ReadOnlyStorageEngine.
    - Fixed a broken sanity check in Cluster.getNumberOfTags().
    - Improved some server-side logging statements.
    - Fixed exception type thrown in ConfigurationStorageEngine's and FileBackedCachingStorageEngine's getCapability().
Commits on Jun 29, 2015
  1. @arunthirupathi

    Merge pull request #273 from bitti/master

    arunthirupathi authored
    @bitti  thanks for the fix, merged it in.
    Fix SecurityException when running HadoopStoreJobRunner in an oozie java action
Commits on Jun 17, 2015
Commits on Jun 12, 2015
  1. @arunthirupathi
  2. @arunthirupathi

    ConnectionException is not catastrophic

    arunthirupathi authored
    1) If a connection times out or fails during protocol negotiation,
    it is treated as a normal error instead of a catastrophic error.
    The connection timeout was a regression from the NIO connect fix. The protocol
    negotiation timeout is a new change to detect failed servers.
    2) When a node is marked down, the outstanding queued requests are
    not failed; they are left to go through the connection creation cycle.
    When there are no outstanding requests, they can wait indefinitely until
    the next request comes up.
    3) UnreachableStoreException is sometimes double wrapped. This prevents
    catastrophic errors from being detected accurately. Created a utility
    method that handles this case; use it when you are not sure whether the
    thrown exception could be an UnreachableStoreException.
    4) In a non-blocking connect, if the DNS name does not resolve, Java throws
    UnresolvedAddressException instead of UnknownHostException. Probably an
    issue in Java. Also, UnresolvedAddressException is not derived from IOException
    but from IllegalArgumentException, which is weird. Fixed the code to handle this.
    5) Tuned the remembered exceptions timeout to twice the connection timeout.
    Previously it was hardcoded to 3 seconds, which was too aggressive when the
    connection timeout for some use cases was set to more than 5 seconds.
    Added unit tests to verify all the above cases.
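
    Points 3 and 4 above can be sketched with a small helper that walks the cause chain instead of checking only the outermost exception. This is a minimal illustration, not the utility method added by the commit: the class and method names are hypothetical, and it uses JDK exception types only. The same findCause call could be made with UnreachableStoreException.class where that class is available.

```java
import java.io.IOException;
import java.nio.channels.UnresolvedAddressException;

/** Hypothetical helper; not the utility method from the Voldemort code base. */
final class ExceptionChains {
    private static final int MAX_DEPTH = 32; // guards against cyclic cause chains

    /** Returns the first throwable of the given type in the cause chain, or null. */
    static <T extends Throwable> T findCause(Throwable t, Class<T> type) {
        for (int depth = 0; t != null && depth < MAX_DEPTH; depth++, t = t.getCause()) {
            if (type.isInstance(t)) {
                return type.cast(t);
            }
        }
        return null;
    }

    /**
     * Treats both IOException and UnresolvedAddressException as connect failures.
     * The second check is needed because UnresolvedAddressException extends
     * IllegalArgumentException, so a catch of IOException alone would miss it.
     */
    static boolean isConnectFailure(Throwable t) {
        return findCause(t, IOException.class) != null
                || findCause(t, UnresolvedAddressException.class) != null;
    }
}
```

    Checking the whole chain means a doubly wrapped exception is classified the same way as a singly wrapped one.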
  3. update coordinator class name in server script

    Dallas Marlow authored
Commits on Jun 9, 2015
  1. @FelixGV

    Releasing Voldemort 1.9.16

    FelixGV authored
  2. @FelixGV
  3. @FelixGV
  4. @gnb @FelixGV

    Fix error reporting in AvroUtils.getSchemaFromPath()

    gnb authored FelixGV committed
    - report errors with an exception
    - report errors exactly once
    - provide the failing pathname
    - don't generate spurious cascading NPE failures
Commits on Jun 8, 2015
  1. @arunthirupathi

    Merge pull request #269 from FelixGV/VoldemortConfig_bug

    arunthirupathi authored
    Fixed VoldemortConfig bug introduced in 3692fa3.
  2. @FelixGV
Commits on Jun 6, 2015
  1. @arunthirupathi

    Merge pull request #265 from gnb/VOLDENG-1912

    arunthirupathi authored
    Unregister the "-streaming-stats" mbean correctly
  2. @gnb

    Unregister the "-streaming-stats" mbean correctly

    gnb authored
    This avoids littering up the logs with JMX exceptions like this
    2015/06/04 23:55:58.105 ERROR [JmxUtils] [voldemort-admin-server-t21] [voldemort] [] Error unregistering mbean voldemort.server.StoreRepository:type=cmp_comparative_insights
            at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(
            at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.exclusiveUnregisterMBean(
            at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.unregisterMBean(
            at com.sun.jmx.mbeanserver.JmxMBeanServer.unregisterMBean(
            at voldemort.utils.JmxUtils.unregisterMbean(
            at voldemort.server.StoreRepository.removeStorageEngine(
            at voldemort.server.protocol.admin.AdminServiceRequestHandler.handleDeleteStore(
            at voldemort.server.protocol.admin.AdminServiceRequestHandler.handleRequest(
            at java.util.concurrent.ThreadPoolExecutor.runWorker(
            at java.util.concurrent.ThreadPoolExecutor$
  3. @arunthirupathi
Commits on Jun 5, 2015
  1. @arunthirupathi

    Fix Log message

    arunthirupathi authored
    HdfsFile does not have a toString method, which causes the object id
    to be printed in the log message. This broke the script we had
    for collecting the download speed. Speed can now be calculated
    better using the stats file, but that is a separate project.
    Added the number of directories and files being downloaded, in addition to size.
    This will help to track some more details, since when files do not exist,
    dummy files are created in their place.
    Renamed HDFSFetcherAdvancedTest to HdfsFetcherAdvancedTest to keep it in
    sync with other naming conventions.
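
    The logging problem described above is easy to reproduce: without a toString override, string concatenation falls back to Object.toString, which prints the class name and an identity hash. The class below is a hypothetical stand-in for HdfsFile, with made-up fields, to show the kind of override that keeps log lines parseable.

```java
/** Hypothetical stand-in for HdfsFile; field names are illustrative only. */
final class RemoteFile {
    private final String path;
    private final long sizeBytes;

    RemoteFile(String path, long sizeBytes) {
        this.path = path;
        this.sizeBytes = sizeBytes;
    }

    // Without this override, "Fetching " + file would log something like
    // "Fetching RemoteFile@1b6d3586", which log-scraping scripts cannot use.
    @Override
    public String toString() {
        return "RemoteFile(path=" + path + ", size=" + sizeBytes + ")";
    }
}
```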
Commits on Jun 4, 2015
  1. @arunthirupathi

    Merge pull request #263 from FelixGV/hung_async_task_mitigation

    arunthirupathi authored
    Added SO_TIMEOUT config (default 30 mins) in ConfigurableSocketFactory.
    Looks good.
  2. @FelixGV

    Added SO_TIMEOUT config (default 30 mins) in ConfigurableSocketFactory and VoldemortConfig.

    FelixGV authored
    Added logging to detect hung async jobs in AdminClient.waitForCompletion
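
    For context, SO_TIMEOUT bounds how long a blocking read on a socket may wait before failing, which is what lets a hung peer be detected. The sketch below shows the standard java.net.Socket call with a 30 minute value; the class and constant names are hypothetical, and this is not the ConfigurableSocketFactory code.

```java
import java.net.Socket;
import java.net.SocketException;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not the actual ConfigurableSocketFactory. */
final class TimeoutSockets {
    /** 30 minutes expressed in milliseconds, the unit Socket.setSoTimeout expects. */
    static final int DEFAULT_SO_TIMEOUT_MS = (int) TimeUnit.MINUTES.toMillis(30);

    /** Creates an unconnected socket whose blocking reads time out after soTimeoutMs. */
    static Socket newSocket(int soTimeoutMs) throws SocketException {
        Socket socket = new Socket(); // unconnected; the caller connects it later
        // After this, a read() that blocks longer than soTimeoutMs throws
        // java.net.SocketTimeoutException. A value of 0 means wait forever.
        socket.setSoTimeout(soTimeoutMs);
        return socket;
    }
}
```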
Commits on May 31, 2015
  1. @arunthirupathi

    HdfsCopyStatsTest fails intermittently

    arunthirupathi authored
    The OS returns the expected files in random order. Use set instead of list.
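
    The fix described above amounts to comparing file listings without regard to order. A minimal sketch, with hypothetical names and assuming file names are unique within a listing (a set would hide duplicates):

```java
import java.util.HashSet;
import java.util.List;

/** Hypothetical helper illustrating an order-insensitive comparison of file names. */
final class OrderInsensitiveCompare {
    /** True if both listings contain the same names, in any order. */
    static boolean sameFiles(List<String> expected, List<String> actual) {
        return new HashSet<>(expected).equals(new HashSet<>(actual));
    }
}
```

    List.equals is order sensitive, so a test that compares lists passes or fails depending on the order in which the OS happens to return directory entries.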
Commits on May 27, 2015
  1. @arunthirupathi

    Add more testing for Serialization.

    arunthirupathi authored
    Added more testing for Serialization. I was doing some tests on what
    the expected input and output of the serializers are. I thought it
    would be a good idea to write unit tests that validate them, instead
    of just documenting them. Most of the serializers have very poor test
    coverage, so I decided to add the unit tests. I will add more testing
    as I start working more on the expected input/output.
Commits on May 22, 2015
  1. @arunthirupathi

    Release 1.9.14

    arunthirupathi authored
    Release version 1.9.14