Commits on Jul 1, 2015
  1. @FelixGV
Commits on Jun 30, 2015
  1. @FelixGV

    First-cut implementation of Build and Push High Availability.

    FelixGV authored
    This commit introduces a limited form of HA for BnP. The new functionality is disabled by default and can be enabled via the following server-side configurations, all of which are necessary:
    
    push.ha.enabled=true
    push.ha.cluster.id=<some arbitrary name which is unique per physical cluster>
    push.ha.lock.path=<some arbitrary HDFS path used for shared state>
    push.ha.lock.implementation=voldemort.store.readonly.swapper.HdfsFailedFetchLock
    push.ha.max.node.failure=1
    
    The Build and Push job will interrogate each cluster it pushes to and honor each cluster's individual settings (i.e., one can enable HA on one cluster at a time, if desired). However, even if the server settings enable HA, this should be considered best-effort behavior, since some BnP users may be running older versions of BnP which will not honor HA settings. Furthermore, up-to-date BnP users can also set the following config to disable HA, regardless of server-side settings:
    
    push.ha.enabled=false
    
    Below is a description of the behavior of BnP HA, when enabled.
    
    When a Voldemort server fails one or more fetches, the BnP job attempts to acquire a lock by moving a file into a shared directory in HDFS. Once the lock is acquired, it checks the state in HDFS to see if any nodes have already been marked as disabled by other BnP jobs. It then determines whether the Voldemort node(s) which failed the current BnP job would bring the total number of unique failed nodes above the configured maximum, with the following outcome in each case:
    
    - If the total number of failed nodes is equal to or lower than the max allowed, then metadata is added to HDFS to mark the store/version currently being pushed as disabled on the problematic node. Afterwards, if the Voldemort server that failed the fetch is still online, it will be asked to go into offline mode (this is best-effort, as the server could be down). Finally, BnP proceeds with swapping the new data set version on, as if all nodes had fetched properly.
    - If, on the other hand, the total number of unique failed nodes is above the configured max, then the BnP job will fail, and the nodes that succeeded in fetching will be asked to delete the new data, just like before.
    
    In either case, BnP will then release the shared lock by moving the lock file out of the lock directory, so that other BnP instances can go through the same process one at a time, in a globally coordinated (mutually exclusive) fashion. All HA-related HDFS operations are retried every 10 seconds, up to 90 times (i.e., for a total of 15 minutes). These values are configurable in the BnP job via push.ha.lock.hdfs.timeout and push.ha.lock.hdfs.retries, respectively.
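
    For illustration only, here is a rough sketch (in Java, against the Hadoop FileSystem API) of the rename-based mutual exclusion described above; the class, method, and file names are made up for this sketch and are not the actual HdfsFailedFetchLock implementation:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical sketch of rename-based mutual exclusion over HDFS.
    // Not the actual HdfsFailedFetchLock; naming and error handling are assumed.
    public class HdfsLockSketch {
        private final FileSystem fs;
        private final Path stagedLockFile;   // staged outside the lock directory, one per BnP run
        private final Path acquiredLockFile; // fixed destination inside the shared lock directory

        public HdfsLockSketch(Configuration conf, String lockPath, String jobId) throws IOException {
            this.fs = FileSystem.get(conf);
            this.stagedLockFile = new Path(lockPath + ".staging", jobId);
            this.acquiredLockFile = new Path(lockPath, "lock");
        }

        /** Try to acquire the shared lock; the HDFS rename fails if another job already holds it. */
        public boolean tryAcquire() throws IOException {
            if (!fs.exists(stagedLockFile)) {
                fs.create(stagedLockFile).close(); // stage an empty lock file to move
            }
            return fs.rename(stagedLockFile, acquiredLockFile);
        }

        /** Release the lock by moving the file back out of the lock directory. */
        public void release() throws IOException {
            fs.rename(acquiredLockFile, stagedLockFile);
        }
    }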
    
    When a Voldemort server is in offline mode, the BnP jobs must be configured so that push.cluster points to the admin port, not the socket port, in order for BnP to continue working properly. Configured this way, transient HDFS issues may lead to the Voldemort server being put into offline mode, but will not prevent future pushes from populating the newer data organically.
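
    A minimal BnP job property sketch of the above; the host and port below are placeholders (use whatever admin port the target cluster actually exposes):

    push.cluster=tcp://voldemort-host:6667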
    
    External systems can be notified when the BnP HA code is triggered via two new BuildAndPushStatus values passed to the custom BuildAndPushHooks registered with the job: SWAPPED (when things work normally) and SWAPPED_WITH_FAILURES (when a swap occurred despite some failed Voldemort node(s)). BnP jobs that fail because the maximum number of failed Voldemort nodes would have been exceeded still fail normally and trigger the FAILED hook.
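
    For illustration, a hook could react to the new statuses roughly as follows; only the status names come from this commit, the class, enum, and method names below are made up for the sketch:

    // Illustrative only: the real BuildAndPushHook interface may differ; only the
    // status names SWAPPED, SWAPPED_WITH_FAILURES and FAILED come from this commit.
    public class AlertingHookSketch {

        enum Status { SWAPPED, SWAPPED_WITH_FAILURES, FAILED } // stand-in for BuildAndPushStatus

        public void onStatus(Status status) {
            switch (status) {
                case SWAPPED:
                    // Normal completion: nothing special to report.
                    break;
                case SWAPPED_WITH_FAILURES:
                    // The swap went through despite failed node(s); flag for manual recovery.
                    alert("BnP swapped with failures: some node(s) were marked disabled");
                    break;
                case FAILED:
                    alert("BnP failed: too many Voldemort nodes failed their fetch");
                    break;
            }
        }

        private void alert(String message) {
            // Placeholder: hook this up to whatever alerting system is in use.
            System.err.println(message);
        }
    }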
    
    Future work:
    
    - Auto-recovery: Transitioning the server from offline to online mode, as well as cleaning up the shared metadata in HDFS, is not handled automatically as part of this commit (which is the main reason why BnP HA should not be enabled by default). The recovery process currently needs to be handled manually, though it could be automated (at least for the common cases) as part of future work.
    - Support non-HDFS-based locking mechanisms: the HdfsFailedFetchLock is an implementation of a new FailedFetchLock interface (sketched below), which can serve as the basis for other distributed state/locking mechanisms (such as ZooKeeper, or a native Voldemort-based solution).
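
    A rough sketch of what such a pluggable interface could look like; only the FailedFetchLock / HdfsFailedFetchLock names come from this commit, the methods below are assumptions for illustration:

    import java.util.Set;

    // Hypothetical shape of a pluggable locking/state mechanism; method names are assumed.
    public interface FailedFetchLockSketch extends AutoCloseable {
        void acquireLock() throws Exception;
        void releaseLock() throws Exception;
        Set<Integer> getDisabledNodeIds() throws Exception;
        void addDisabledNode(int nodeId, String storeName, long pushVersion) throws Exception;
    }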
    
    Unrelated minor fixes and clean ups included in this commit:
    
    - Cleaned up some dead code.
    - Cleaned up excessive admin client instantiations in BnP.
    - Cleaned up the closing of resources at the end of the BnP job.
    - Fixed an NPE in the ReadOnlyStorageEngine.
    - Fixed a broken sanity check in Cluster.getNumberOfTags().
    - Improved some server-side logging statements.
    - Fixed exception type thrown in ConfigurationStorageEngine's and FileBackedCachingStorageEngine's getCapability().
Commits on Jun 17, 2015
Commits on Jun 9, 2015
  1. @FelixGV
  2. @gnb @FelixGV

    Fix error reporting in AvroUtils.getSchemaFromPath()

    gnb authored FelixGV committed
    - report errors with an exception
    - report errors exactly once
    - provide the failing pathname
    - don't generate spurious cascading NPE failures
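
    Roughly, the reporting pattern looks like the following sketch (illustrative only, not the actual AvroUtils code):

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.avro.Schema;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative only: read a schema and fail with a single descriptive exception
    // that names the offending path, instead of returning null and triggering NPEs later.
    public final class SchemaReadSketch {
        public static Schema readSchema(FileSystem fs, Path path) {
            try (InputStream in = fs.open(path)) {
                return new Schema.Parser().parse(in);
            } catch (IOException | RuntimeException e) {
                throw new IllegalArgumentException("Failed to read Avro schema from " + path, e);
            }
        }
    }
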
Commits on Jun 5, 2015
  1. @arunthirupathi

    Fix Log message

    arunthirupathi authored
    HdfsFile did not have a toString method, which caused the object id
    to be printed in the log message; this broke the script we had
    for collecting the download speed. The speed could now be calculated
    more accurately using the stats file, but that is a separate project.
    
    Added the number of directories and files being downloaded, in addition to the size.
    This will help track some more details, since dummy files are created
    in place when files do not exist.
    
    Renamed HDFSFetcherAdvancedTest to HdfsFetcherAdvancedTest to keep it in
    sync with other naming conventions.
Commits on Jun 4, 2015
  1. @FelixGV

    Added SO_TIMEOUT config (default 30 mins) in ConfigurableSocketFactory and VoldemortConfig

    FelixGV authored
    Added logging to detect hung async jobs in AdminClient.waitForCompletion.
Commits on May 31, 2015
  1. @arunthirupathi

    HdfsCopyStatsTest fails intermittently

    arunthirupathi authored
    The OS returns the expected files in random order. Use a set instead of a list.
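
    A minimal sketch of the idea (illustrative only, not the actual test code):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative only: compare listings as sets so the check does not depend on
    // the order in which the OS returns the files.
    public class OrderIndependentCheckSketch {
        public static void main(String[] args) {
            String[] listedByOs = { "b.stats", "a.stats" };          // order varies across runs
            Set<String> expected = new HashSet<>(Arrays.asList("a.stats", "b.stats"));
            Set<String> actual = new HashSet<>(Arrays.asList(listedByOs));
            System.out.println(expected.equals(actual));             // true, regardless of order
        }
    }
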
Commits on May 18, 2015
  1. @arunthirupathi

    Output stats file for RO files download

    arunthirupathi authored
    A .stats directory will be created and will contain the last X (default: 50)
    stats files.
    
    If a version-X directory is fetched, a stats file with the same name as that
    directory will contain the stats for the download.
    
    The stats file contains each individual file name, the time it took to download,
    and some other information.
    
    Added unit tests in HdfsCopyStatsTest.
Commits on May 12, 2015
  1. @arunthirupathi

    Refactor HdfsFetcher

    arunthirupathi authored
    1) Created directory and file classes to help with future work.
    2) Cleaned up some code for easier readability.
Commits on May 1, 2015
  1. @bhasudha
Commits on Apr 23, 2015
  1. @bhasudha

    Adding compression to RO path - first pass commit

    bhasudha authored
    VoldemortConfig
    - Added a new config for compression codec. Default value for this
      property is GZIP. This is used by the AdminServiceRequestHandler to
    respond to the VoldemortBuildAndPushJob on what codec is supported.
    
    VAdminProto
    - Added a new request type for getting the supported compression codecs
      from the RO Voldemort Server
    
    AdminServiceRequestHandler
    - New method to handle the above request type.
    
    AdminClient
    - Provides a method, getSupportedROStorageCompressionCodecs, that
      supports the above request type.
    
    VoldemortBuildAndPushJob
    - Inside run(), immediately after checking cluster equalities, an admin
      request is issued to the VoldemortServer (specified by the property
    "push.node") to fetch the RO compression codecs supported by the Server.
    - If any of the supported codecs match the COMPRESSION_CODEC, then
      compression-specific properties are set. Otherwise, no compression is
    enabled.
    
    AbstractHadoopJob
    - This is where the RO compression specific properties are set in
      Jobconf inside the createJobConf() Method
    
    HadoopStoreWriter and HadoopStoreWriterPerBucket
    - Adding dummy test-only constructors
    - Creating index and value file streams based on compression settings
    - Got rid of some unused variables
    - Minor movement of code
    
    HDFSFetcher
    - Changed copyFileWithCheckSum() to check if the files end with ".gz"
      and create a GZIPInputStream based on that.
    - The GZIPInputStream (if compression is enabled) wraps the original
      FSDataInputStream
    
    Tests for HadoopStoreWriter and HadoopStoreWriterPerBucket
    - These are parameterized tests - they take a boolean to either save keys
      or not
    - Run two tests - compressed and uncompressed
    - Have tighter assumptions and use the test-specific constructors in the
      corresponding classes
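
    A rough sketch of the ".gz"-aware stream handling described under HDFSFetcher above (illustrative only, not the actual copyFileWithCheckSum code):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative only: wrap the HDFS stream in a GZIPInputStream when the source
    // file ends with ".gz", otherwise return the raw stream unchanged.
    public final class CompressedOpenSketch {
        public static InputStream open(FileSystem fs, Path source) throws IOException {
            InputStream raw = fs.open(source);
            return source.getName().endsWith(".gz") ? new GZIPInputStream(raw) : raw;
        }
    }
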
Commits on Apr 7, 2015
  1. @arunthirupathi

    Refactor the HDFS fetcher

    arunthirupathi authored
    1) Move some code into a method
    2) Allocate the buffer per fetch instead of per file.
    
    Tested by fetching 2 directories on HDFS and verified the output.
Commits on Mar 30, 2015
  1. @gnb
  2. @gnb

    Allow HdfsFetcher to fetch individual files

    gnb authored
    This only applies when invoked from main(), not when fetch() is invoked by the server.
Commits on Mar 27, 2015
  1. @FelixGV

    BnP improvement:

    FelixGV authored
    - Removed a bunch of redundant constructors that made code unreadable.
    - Added a min.number.of.records config (defaults to 1; see the example below) to prevent pushing empty stores.
    - Improved error handling and reporting in BnP's run function.
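
    A minimal job property sketch of the new config (1 is the documented default):

    min.number.of.records=1
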
Commits on Mar 17, 2015
  1. @FelixGV

    Further BnP cleanups:

    FelixGV authored
    - Removed Azkaban dependency as much as possible.
    - Standardized on using voldemort.utils.Props as much as possible.
    - Deleted VoldemortMultiStoreBuildAndPushJob which is not actively used and suffering from code rot.
    - Added Content-Length header support in BnP HttpHook.
    - Added more utility functions to voldemort.utils.Props.
    - Added safeguards to BnP HttpHook's concurrent code.
    - Removed System.exit calls from BnP.
Commits on Feb 19, 2015
  1. @FelixGV

    Many Build and Push improvements:

    FelixGV authored
    - Set umask for recursive HDFS permissions.
    - Added some retry logic for an HDFS operation in the HadoopStoreBuilder.
    - Added BnP Abstract and Http Hook classes for common use cases.
    - Improved logging and debuggability.
    - Removed Azkaban dependency from AvroStoreBuilderMapper.
    - Deleted the deprecated VoldemortBatchIndexJob.
Commits on Feb 17, 2015
  1. @gnb
Commits on Feb 3, 2015
  1. @FelixGV

    Many Build And Push improvements:

    FelixGV authored
    - Upgraded Azkaban dependency to 2.5.0, and fetch from Maven Central. Removed Azkaban from private-lib.
    - Upgraded Jackson dependency to 1.9.13.
    - Fixed BnP hooks default config (otherwise it failed when unspecified).
    - Rethrow exceptions caught in BnP's run function to maintain previous behavior.
    - Added sanity checks and better error reporting for Avro configs.
    - Cleaned up duplicate references of UndefinedPropertyException.
Commits on Feb 2, 2015
  1. @cshaxu
Commits on Jan 28, 2015
  1. @FelixGV

    Added support for custom hooks in the VoldemortBuildAndPushJob

    FelixGV authored
    Heartbeat hooks run in a daemon thread.
Commits on Dec 30, 2014
Commits on Dec 18, 2014
  1. @arunthirupathi

    Delete Keys CLI

    arunthirupathi authored
    Add a tool, Delete Keys CLI. This tool reads keys from a keyfile and deletes
    them from the supplied stores. The keys are assumed to be in
    human-readable format, and conversion to the appropriate key type
    will be attempted.
    
    The tool also supports the following options (a hypothetical invocation is sketched after this list):
      1) --delete-all-versions. If a key has more than one value with
    conflicting versions, the tool will otherwise fail, because it may not have the
    value schema to de-serialize the values and resolve the conflict. The
    conflict resolution needs to happen before the key is deleted.
      2) --nodeid <> --admin-url <>. Use these options to delete keys only from a
    particular node. This is useful when a node went down during a delete run
    and you want to rerun the tool against just that node.
      3) --find-keys-exist <>. After the delete, you can run with this
    option to find whether any of the keys still exist. If any keys are found, the tool
    dumps the version of each of them. The tool waits for the expected number of
    keys from each store before it completes.
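
    A hypothetical invocation sketch; the tool name, URL, and key-file/store arguments below are placeholders, only the option names come from this commit:

    $ delete-keys-cli --admin-url tcp://voldemort-host:6666 --nodeid 3 --delete-all-versions <keyfile and store arguments>
    $ delete-keys-cli --admin-url tcp://voldemort-host:6666 --find-keys-exist <keyfile and store arguments>
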
Commits on Oct 16, 2014
  1. @cshaxu
Commits on Oct 7, 2014
  1. @bhasudha
  2. @bhasudha
Commits on Oct 3, 2014
  1. @FelixGV
Commits on Oct 2, 2014
  1. @cshaxu

    addressed comments

    cshaxu authored
Commits on Oct 1, 2014
  1. @cshaxu
Commits on Sep 27, 2014
  1. @bhasudha

    * Refactoring the CoordinatorProxyService's fat-client initialization methods

    bhasudha authored
    * Fixing StoreClientConfigService and refactoring FileBasedStoreClientConfigService
    * Fixing Coordinator unit tests
Commits on Sep 22, 2014
  1. @cshaxu

    create coord-admin-tool-test

    cshaxu authored
Commits on Sep 19, 2014
  1. @cshaxu
Commits on Sep 16, 2014
  1. @cshaxu
  2. @cshaxu

    some fix on admin client

    cshaxu authored