
SHS-NG M1: Add KVStore abstraction, LevelDB implementation. #3

Closed
wants to merge 30 commits

Commits on May 5, 2017

  1. [SPARK-20603][SS][TEST] Set default number of topic partitions to 1 to reduce the load
    
    ## What changes were proposed in this pull request?
    
    I checked the logs of https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.2-test-maven-hadoop-2.7/47/ and found it took several seconds to create the Kafka internal topic `__consumer_offsets`. As Kafka creates this topic lazily, the topic creation happens in the first test `deserialization of initial offset with Spark 2.1.0` and causes it to time out.
    
    This PR changes `offsets.topic.num.partitions` from the default value 50 to 1 to make creating `__consumer_offsets` (50 partitions -> 1 partition) much faster.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#17863 from zsxwing/fix-kafka-flaky-test.
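    For illustration, a minimal sketch (not this commit's code) of the broker setting involved, assuming an embedded test broker configured through plain `Properties`:
    
    ```
    import java.util.Properties;
    
    // Hypothetical helper: broker properties for an embedded Kafka server in tests.
    // Lowering offsets.topic.num.partitions makes the lazy creation of the
    // internal __consumer_offsets topic much faster (1 partition instead of 50).
    public class KafkaTestBrokerConfig {
        public static Properties brokerProps() {
            Properties props = new Properties();
            props.put("broker.id", "0");
            props.put("offsets.topic.num.partitions", "1"); // default is 50
            return props;
        }
    }
    ```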
    zsxwing committed May 5, 2017
    Commit: bd57882
  2. [SPARK-20557][SQL] Support for db column type TIMESTAMP WITH TIME ZONE

    ## What changes were proposed in this pull request?
    
    SparkSQL can now read from a database table with column type [TIMESTAMP WITH TIME ZONE](https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE).
    
    ## How was this patch tested?
    
    Tested against Oracle database.
    
    JoshRosen, you seem to know the class, would you look at this? Thanks!
    
    Author: Jannik Arndt <jannik@jannikarndt.de>
    
    Closes apache#17832 from JannikArndt/spark-20557-timestamp-with-timezone.
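    As a hedged usage sketch (the JDBC URL, credentials, and table name are made up), reading such a column through the DataFrame JDBC API:
    
    ```
    import java.util.Properties;
    
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    
    public class ReadTimestampWithTimeZone {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("tstz-demo").getOrCreate();
    
            Properties props = new Properties();
            props.put("user", "scott");      // hypothetical credentials
            props.put("password", "tiger");
    
            // With this patch, a TIMESTAMP WITH TIME ZONE column is read as
            // Spark's TimestampType instead of failing as an unsupported type.
            Dataset<Row> df = spark.read()
                .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "EVENTS_WITH_TZ", props);
            df.printSchema();
            spark.stop();
        }
    }
    ```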
    JannikArndt authored and gatorsmile committed May 5, 2017
    Commit: b31648c
  3. [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch
    
    ## What changes were proposed in this pull request?
    
    Due to a likely typo, the logDebug message printing the diff of query plans shows a diff to the initial plan, not a diff to the start of the batch.
    
    ## How was this patch tested?
    
    Now the debug message prints the diff between start and end of batch.
    
    Author: Juliusz Sompolski <julek@databricks.com>
    
    Closes apache#17875 from juliuszsompolski/SPARK-20616.
    juliuszsompolski authored and rxin committed May 5, 2017
    Commit: 5d75b14

Commits on May 6, 2017

  1. [SPARK-20614][PROJECT INFRA] Use the same log4j configuration with Jenkins in AppVeyor
    
    ## What changes were proposed in this pull request?
    
    Currently, AppVeyor floods the console with logs. This has been fine because we can download all the logs. However, from my observations so far, logs are truncated when there are too many. The log volume has grown recently and output has started to get truncated. For example, see https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1209-master
    
    Even after the log is downloaded, it looks truncated as below:
    
    ```
    [00:44:21] 17/05/04 18:56:18 INFO TaskSetManager: Finished task 197.0 in stage 601.0 (TID 9211) in 0 ms on localhost (executor driver) (194/200)
    [00:44:21] 17/05/04 18:56:18 INFO Executor: Running task 199.0 in stage 601.0 (TID 9213)
    [00:44:21] 17/05/04 18:56:18 INFO Executor: Finished task 198.0 in stage 601.0 (TID 9212). 2473 bytes result sent to driver
    ...
    ```
    
    It seems better to use the same log4j configuration that we use for the SparkR tests in Jenkins (please see https://github.com/apache/spark/blob/fc472bddd1d9c6a28e57e31496c0166777af597e/R/run-tests.sh#L26 and https://github.com/apache/spark/blob/fc472bddd1d9c6a28e57e31496c0166777af597e/R/log4j.properties):
    ```
    # Set everything to be logged to the file target/unit-tests.log
    log4j.rootCategory=INFO, file
    log4j.appender.file=org.apache.log4j.FileAppender
    log4j.appender.file.append=true
    log4j.appender.file.file=R/target/unit-tests.log
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
    
    # Ignore messages below warning level from Jetty, because it's a bit verbose
    log4j.logger.org.eclipse.jetty=WARN
    org.eclipse.jetty.LEVEL=WARN
    ```
    
    ## How was this patch tested?
    
    Manually tested with spark-test account
      - https://ci.appveyor.com/project/spark-test/spark/build/672-r-log4j (there is an example for flaky test here)
      - https://ci.appveyor.com/project/spark-test/spark/build/673-r-log4j (I re-ran the build).
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes apache#17873 from HyukjinKwon/appveyor-reduce-logs.
    HyukjinKwon authored and Felix Cheung committed May 6, 2017
    Commit: b433aca

Commits on May 7, 2017

  1. [SPARK-20557][SQL] Support JDBC data type Time with Time Zone

    ### What changes were proposed in this pull request?
    
    This PR is to support the JDBC data type TIME WITH TIME ZONE, which is converted to TIMESTAMP.
    
    In addition, before this PR, for unsupported data types, we simply output the type number instead of the type name.
    
    ```
    java.sql.SQLException: Unsupported type 2014
    ```
    After this PR, the message is like
    ```
    java.sql.SQLException: Unsupported type TIMESTAMP_WITH_TIMEZONE
    ```
    
    - Also upgrade the H2 version to `1.4.195`, which has the type fix for "TIMESTAMP WITH TIMEZONE". However, H2 still does not fully support the type, so we capture the exception. We still need H2 to partially test the "TIMESTAMP WITH TIMEZONE" support, because the Docker tests are not run regularly.
    
    ### How was this patch tested?
    Added test cases.
    
    Author: Xiao Li <gatorsmile@gmail.com>
    
    Closes apache#17835 from gatorsmile/h2.
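    The improved message can be reproduced with the standard `java.sql.JDBCType` enum (a sketch, not necessarily the PR's exact code):
    
    ```
    import java.sql.JDBCType;
    
    public class TypeNameDemo {
        public static void main(String[] args) {
            int unsupported = 2014; // java.sql.Types.TIMESTAMP_WITH_TIMEZONE
            // Resolves the numeric code to a readable name: prints
            // "Unsupported type TIMESTAMP_WITH_TIMEZONE" instead of "Unsupported type 2014".
            System.out.println("Unsupported type " + JDBCType.valueOf(unsupported).getName());
        }
    }
    ```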
    gatorsmile committed May 7, 2017
    Commit: cafca54
  2. [SPARK-18777][PYTHON][SQL] Return UDF from udf.register

    ## What changes were proposed in this pull request?
    
    - Move udf wrapping code from `functions.udf` to `functions.UserDefinedFunction`.
    - Return wrapped udf from `catalog.registerFunction` and dependent methods.
    - Update docstrings in `catalog.registerFunction` and `SQLContext.registerFunction`.
    - Unit tests.
    
    ## How was this patch tested?
    
    - Existing unit tests and doctests.
    - Additional tests covering new feature.
    
    Author: zero323 <zero323@users.noreply.github.com>
    
    Closes apache#17831 from zero323/SPARK-18777.
    zero323 authored and gatorsmile committed May 7, 2017
    Commit: 63d90e7
  3. [SPARK-20518][CORE] Supplement the new blockidsuite unit tests

    ## What changes were proposed in this pull request?
    
    This PR adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, TempShuffleBlockId, and TempLocalBlockId.
    
    ## How was this patch tested?
    
    The new unit tests.
    
    Author: caoxuewen <cao.xuewen@zte.com.cn>
    
    Closes apache#17794 from heary-cao/blockidsuite.
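    A hedged sketch of the kind of assertion such tests add, assuming the usual `shuffle_<shuffleId>_<mapId>_<reduceId>.data` naming convention:
    
    ```
    import org.apache.spark.storage.ShuffleDataBlockId;
    
    public class BlockIdNameDemo {
        public static void main(String[] args) {
            ShuffleDataBlockId id = new ShuffleDataBlockId(1, 2, 3);
            // The block's name encodes the shuffle, map, and reduce ids.
            if (!id.name().equals("shuffle_1_2_3.data")) {
                throw new AssertionError("unexpected block id name: " + id.name());
            }
            System.out.println(id.name());
        }
    }
    ```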
    heary-cao authored and srowen committed May 7, 2017
    Commit: 37f963a
  4. [SPARK-20484][MLLIB] Add documentation to ALS code

    ## What changes were proposed in this pull request?
    
    This PR adds documentation to the ALS code.
    
    ## How was this patch tested?
    
    Existing tests were used.
    
    mengxr srowen
    
    This contribution is my original work.  I have the license to work on this project under the Spark project’s open source license.
    
    Author: Daniel Li <dan@danielyli.com>
    
    Closes apache#17793 from danielyli/spark-20484.
    danielyli authored and srowen committed May 7, 2017
    Commit: 88e6d75
  5. [SPARK-7481][BUILD] Add spark-hadoop-cloud module to pull in object store access.
    
    ## What changes were proposed in this pull request?
    
    Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.
    
    It restores `s3n://` access to S3, and adds its `s3a://` replacement, OpenStack `swift://`, and Azure `wasb://`.
    
    There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised to be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.
    
    (this is the successor to apache#12004; I can't re-open it)
    
    ## How was this patch tested?
    
    Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)
    
    Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.
    
    Manually clean build & verify that assembly contains the relevant aws-* hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile.
    
    SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
    Maven build: `mvn install -Phadoop-cloud -Phadoop-2.7`
    
    This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.
    
    Author: Steve Loughran <stevel@apache.org>
    Author: Steve Loughran <stevel@hortonworks.com>
    
    Closes apache#17834 from steveloughran/cloud/SPARK-7481-current.
    steveloughran authored and srowen committed May 7, 2017
    Commit: 2cf83c4
  6. [SPARK-20543][SPARKR][FOLLOWUP] Don't skip tests on AppVeyor

    ## What changes were proposed in this pull request?
    
    add environment
    
    ## How was this patch tested?
    
    wait for appveyor run
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes apache#17878 from felixcheung/appveyorrcran.
    felixcheung authored and Felix Cheung committed May 7, 2017
    Commit: 7087e01
  7. [MINOR][SQL][DOCS] Improve unix_timestamp's scaladoc (and typo hunting)

    ## What changes were proposed in this pull request?
    
    * Docs are consistent (across different `unix_timestamp` variants and their internal expressions)
    * typo hunting
    
    ## How was this patch tested?
    
    local build
    
    Author: Jacek Laskowski <jacek@japila.pl>
    
    Closes apache#17801 from jaceklaskowski/unix_timestamp.
    jaceklaskowski authored and gatorsmile committed May 7, 2017
    Commit: 500436b
  8. [SPARK-20550][SPARKR] R wrapper for Dataset.alias

    ## What changes were proposed in this pull request?
    
    - Add SparkR wrapper for `Dataset.alias`.
    - Adjust roxygen annotations for `functions.alias` (including example usage).
    
    ## How was this patch tested?
    
    Unit tests, `check_cran.sh`.
    
    Author: zero323 <zero323@users.noreply.github.com>
    
    Closes apache#17825 from zero323/SPARK-20550.
    zero323 authored and Felix Cheung committed May 7, 2017
    Commit: 1f73d35

Commits on May 8, 2017

  1. [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucketBy

    ## What changes were proposed in this pull request?
    
    Adds Python wrappers for `DataFrameWriter.bucketBy` and `DataFrameWriter.sortBy` ([SPARK-16931](https://issues.apache.org/jira/browse/SPARK-16931))
    
    ## How was this patch tested?
    
    Unit tests covering new feature.
    
    __Note__: Based on work of GregBowyer (f49b9a2)
    
    CC HyukjinKwon
    
    Author: zero323 <zero323@users.noreply.github.com>
    Author: Greg Bowyer <gbowyer@fastmail.co.uk>
    
    Closes apache#17077 from zero323/SPARK-16931.
    zero323 authored and cloud-fan committed May 8, 2017
    Commit: f53a820
  2. [SPARK-12297][SQL] Hive compatibility for Parquet Timestamps

    ## What changes were proposed in this pull request?
    
    This change allows timestamps in parquet-based hive tables to behave as a "floating time", without a timezone, as timestamps do for other file formats.  If the storage timezone is the same as the session timezone, this conversion is a no-op.  When data is read from a hive table, the table property is *always* respected.  This allows Spark to keep its old behavior when reading old data, but read newly written data correctly (whatever the source of the data is).
    
    Spark inherited the original behavior from Hive, but Hive is also updating behavior to use the same  scheme in HIVE-12767 / HIVE-16231.
    
    The default for Spark remains unchanged; created tables do not include the new table property.
    
    This will only apply to hive tables; nothing is added to parquet metadata to indicate the timezone, so data that is read or written directly from parquet files will never have any conversions applied.
    
    ## How was this patch tested?
    
    Added a unit test which creates tables, reads and writes data, under a variety of permutations (different storage timezones, different session timezones, vectorized reading on and off).
    
    Author: Imran Rashid <irashid@cloudera.com>
    
    Closes apache#16781 from squito/SPARK-12297.
    squito authored and ueshin committed May 8, 2017
    Commit: 2269155
  3. [SPARK-20626][SPARKR] address date test warning with timezone on windows

    ## What changes were proposed in this pull request?
    
    set timezone on windows
    
    ## How was this patch tested?
    
    unit test, AppVeyor
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes apache#17892 from felixcheung/rtimestamptest.
    felixcheung authored and Felix Cheung committed May 8, 2017
    Commit: c24bdaa
  4. [SPARK-20380][SQL] Unable to set/unset table comment property using ALTER TABLE SET/UNSET TBLPROPERTIES ddl
    
    ### What changes were proposed in this pull request?
    The table comment was not being set or unset by an **ALTER TABLE SET/UNSET TBLPROPERTIES** query,
    e.g.: ALTER TABLE table_with_comment SET TBLPROPERTIES ('comment' = 'modified comment')
    When a user altered the table properties to add or update the table comment, the comment field of the **CatalogTable** instance was not updated, so the old comment (if any) was still shown. To handle this, the comment field in **CatalogTable** is now updated with the newly added or modified comment, along with the other table-level properties, when the user executes an **ALTER TABLE SET TBLPROPERTIES** query.
    
    This PR also takes care of unsetting the table comment when the user executes an **ALTER TABLE UNSET TBLPROPERTIES** query, in order to remove the table comment,
    e.g.: ALTER TABLE table_comment UNSET TBLPROPERTIES IF EXISTS ('comment')
    
    ### How was this patch tested?
    Added test cases as part of **SQLQueryTestSuite** verifying the table comment (via a DESC FORMATTED query) after adding or modifying it through **AlterTableSetPropertiesCommand** and unsetting it through **AlterTableUnsetPropertiesCommand**.
    
    Author: sujith71955 <sujithchacko.2010@gmail.com>
    
    Closes apache#17649 from sujith71955/alter_table_comment.
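    For illustration, a minimal end-to-end sketch of the fixed behavior (hypothetical table name; requires Hive support):
    
    ```
    import org.apache.spark.sql.SparkSession;
    
    public class AlterTableCommentDemo {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("comment-demo").enableHiveSupport().getOrCreate();
    
            spark.sql("CREATE TABLE table_with_comment (id INT) COMMENT 'initial comment'");
            // With this fix, the CatalogTable comment field is updated too,
            // so DESC FORMATTED shows the new comment.
            spark.sql("ALTER TABLE table_with_comment SET TBLPROPERTIES ('comment' = 'modified comment')");
            spark.sql("DESC FORMATTED table_with_comment").show(100, false);
    
            // Unsetting removes the comment again.
            spark.sql("ALTER TABLE table_with_comment UNSET TBLPROPERTIES IF EXISTS ('comment')");
            spark.stop();
        }
    }
    ```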
    sujith71955 authored and gatorsmile committed May 8, 2017
    Commit: 42cc6d1
  5. [SPARKR][DOC] fix typo in vignettes

    ## What changes were proposed in this pull request?
    Fix typo in vignettes
    
    Author: Wayne Zhang <actuaryzhang@uber.com>
    
    Closes apache#17884 from actuaryzhang/typo.
    Wayne Zhang authored and Felix Cheung committed May 8, 2017
    Commit: 2fdaeb5
  6. [SPARK-20519][SQL][CORE] Modify to prevent some possible runtime exceptions
    
    Signed-off-by: liuxian <liu.xian3@zte.com.cn>
    
    ## What changes were proposed in this pull request?
    
    When an input parameter is null, a runtime exception may occur.
    
    ## How was this patch tested?
    Existing unit tests
    
    Author: liuxian <liu.xian3@zte.com.cn>
    
    Closes apache#17796 from 10110346/wip_lx_0428.
    10110346 authored and srowen committed May 8, 2017
    Commit: 0f820e2
  7. [SPARK-19956][CORE] Optimize a location order of blocks with topology information
    
    ## What changes were proposed in this pull request?
    
    When calling BlockManager's getLocations method, we only compare the hosts of the data blocks; non-local blocks are selected at random, which may pick a block in a different rack. This patch adds sorting by rack, as sketched below.
    
    ## How was this patch tested?
    
    New test case.
    
    
    Author: Xianyang Liu <xianyang.liu@intel.com>
    
    Closes apache#17300 from ConeyLiu/blockmanager.
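    A hedged, self-contained sketch of the idea (not the actual BlockManager code): rank candidate locations host-local first, then rack-local, then off-rack:
    
    ```
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    
    public class LocationOrderDemo {
        static final class Location {
            final String host;
            final String rack; // may be null when topology info is unavailable
            Location(String host, String rack) { this.host = host; this.rack = rack; }
        }
    
        static List<Location> sortByTopology(List<Location> locs, String localHost, String localRack) {
            List<Location> sorted = new ArrayList<>(locs);
            sorted.sort(Comparator.comparingInt(l -> {
                if (l.host.equals(localHost)) return 0;                   // host-local
                if (l.rack != null && l.rack.equals(localRack)) return 1; // rack-local
                return 2;                                                 // off-rack
            }));
            return sorted;
        }
    }
    ```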
    ConeyLiu authored and cloud-fan committed May 8, 2017
    Commit: 1552665
  8. [SPARK-20596][ML][TEST] Consolidate and improve ALS recommendAll test cases
    
    Existing test cases for `recommendForAllX` methods (added in [SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)) test `k < num items` and `k = num items`. Technically we should also test that `k > num items` returns the same results as `k = num items`.
    
    ## How was this patch tested?
    
    Updated existing unit tests.
    
    Author: Nick Pentreath <nickp@za.ibm.com>
    
    Closes apache#17860 from MLnick/SPARK-20596-als-rec-tests.
    Nick Pentreath committed May 8, 2017
    Commit: 58518d0
  9. SHS-NG M1: Add KVStore abstraction, LevelDB implementation.

    The interface is described in KVIndex.java (see javadoc). Specifics
    of the LevelDB implementation are discussed in the javadocs of both
    LevelDB.java and LevelDBTypeInfo.java.
    
    Included also are a few small benchmarks just to get some idea of
    latency. Because they're too slow for regular unit test runs, they're
    disabled by default.
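    A hedged usage sketch of the abstraction; the package name and exact signatures are assumptions based on this description, not verified against this commit:
    
    ```
    import java.io.File;
    
    import org.apache.spark.kvstore.KVIndex;
    import org.apache.spark.kvstore.KVStore;
    import org.apache.spark.kvstore.LevelDB;
    
    public class KVStoreDemo {
        public static class AppInfo {
            @KVIndex              // natural key, as described in KVIndex.java
            public String id;
            public String name;
        }
    
        public static void main(String[] args) throws Exception {
            try (KVStore store = new LevelDB(new File("/tmp/kvstore-demo"))) {
                AppInfo app = new AppInfo();
                app.id = "app-001";
                app.name = "demo";
                store.write(app);                                    // stored and indexed
                AppInfo read = store.read(AppInfo.class, "app-001"); // lookup by natural key
                System.out.println(read.name);
            }
        }
    }
    ```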
    Marcelo Vanzin committed May 8, 2017
    Commit: f3b7e0b
  10. SHS-NG M1: Add support for arrays when indexing.

    This is needed because some UI types have compound keys.
    Marcelo Vanzin committed May 8, 2017
    Commit: 52ed2b4
  11. SHS-NG M1: Fix counts in LevelDB when updating entries.

    Also add unit test. When updating, the code needs to keep track of
    the aggregated delta to be added to each count stored in the db,
    instead of reading the count from the db for each update.
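    The gist of the fix, as a hedged sketch (a plain map stands in for the db):
    
    ```
    import java.util.HashMap;
    import java.util.Map;
    
    public class CountDeltaDemo {
        private final Map<String, Long> deltas = new HashMap<>();
    
        void recordWrite(String indexKey) {
            deltas.merge(indexKey, 1L, Long::sum);  // accumulate in memory, don't hit the db
        }
    
        void recordDelete(String indexKey) {
            deltas.merge(indexKey, -1L, Long::sum);
        }
    
        void flush(Map<String, Long> db) {
            // One read-modify-write per index, not one per update.
            deltas.forEach((k, d) -> db.merge(k, d, Long::sum));
            deltas.clear();
        }
    }
    ```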
    Marcelo Vanzin committed May 8, 2017
    Commit: 4112afe
  12. SHS-NG M1: Try to prevent db use after close.

    This causes JVM crashes in the leveldb library, so try to avoid it;
    if there are still issues, we'll need locking.
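    A minimal sketch of such a guard (assumed, not this commit's exact code):
    
    ```
    public class CloseGuard {
        private volatile boolean closed = false;
    
        public void close() {
            closed = true;
            // ... release the native db handle ...
        }
    
        private void checkOpen() {
            if (closed) {
                throw new IllegalStateException("DB is closed");
            }
        }
    
        public byte[] get(byte[] key) {
            checkOpen();        // fail fast instead of crashing in native code
            return new byte[0]; // placeholder for the real native read
        }
    }
    ```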
    Marcelo Vanzin committed May 8, 2017
    Commit: 718cabd
  13. SHS-NG M1: Use Java 8 lambdas.

    Also rename LevelDBIteratorSuite to work around some super weird
    issue with sbt.
    Marcelo Vanzin committed May 8, 2017
    Commit: 45a027f
  14. SHS-NG M1: Compress values stored in LevelDB.

    LevelDB has built-in support for snappy compression, but it seems
    to be buggy in the leveldb-jni library; the compression threads
    don't seem to run by default, and when you enable them, there are
    weird issues when stopping the DB.
    
    So just do compression manually using the JRE libraries; it's probably
    a little slower but it saves a good chunk of disk space.
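    A hedged sketch using the JRE's GZIP streams (this commit may use a different JRE codec):
    
    ```
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;
    
    public class ValueCodec {
        // Compress a serialized value before handing it to LevelDB.
        static byte[] compress(byte[] value) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
                out.write(value);
            }
            return bytes.toByteArray();
        }
    
        // Decompress a value read back from LevelDB.
        static byte[] decompress(byte[] stored) throws IOException {
            try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(stored))) {
                return in.readAllBytes();
            }
        }
    }
    ```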
    Marcelo Vanzin committed May 8, 2017
    Commit: e592bf6
  15. SHS-NG M1: Use type aliases as keys in Level DB.

    The type name gets repeated a lot in the store, so using it as the prefix
    for every key causes disk usage to grow unnecessarily. Instead, create a
    short alias for the type and keep a mapping of aliases to known types in
    a map in memory; the map is also saved to the database so it can be read
    later.
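    A hedged sketch of the aliasing scheme (names and persistence details assumed):
    
    ```
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;
    
    public class TypeAliases {
        private final Map<String, String> aliases = new ConcurrentHashMap<>();
        private final AtomicInteger nextId = new AtomicInteger();
    
        String aliasFor(Class<?> type) {
            return aliases.computeIfAbsent(type.getName(), name -> {
                // Short, stable token; also persisted to the db so the
                // mapping can be reloaded when the store is reopened.
                return Integer.toString(nextId.getAndIncrement(), Character.MAX_RADIX);
            });
        }
    
        byte[] key(Class<?> type, String naturalKey) {
            // Short prefix instead of repeating the full class name in every key.
            return (aliasFor(type) + "/" + naturalKey).getBytes();
        }
    }
    ```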
    Marcelo Vanzin committed May 8, 2017
    Commit: 889963f
  16. SHS-NG M1: Separate index introspection from storage.

    The new KVTypeInfo class can help with writing different implementations
    of KVStore without duplicating logic from LevelDBTypeInfo.
    Marcelo Vanzin committed May 8, 2017
    Commit: 84ab160
  17. SHS-NG M1: Remove unused methods from KVStore.

    Turns out I ended up not using the raw storage methods in KVStore, so
    this change removes them to simplify the API and save some code.
    Marcelo Vanzin committed May 8, 2017
    Commit: 7b87021
  18. SHS-NG M1: Add "max" and "last" to kvstore iterators.

    This makes it easier for callers to control the end of iteration,
    making it easier to write Scala code that automatically closes
    underlying iterator resources. Before, code had to use Scala's
    "takeWhile", convert the result to a list, and manually close the
    iterators; with these two parameters, that can be avoided in a
    bunch of cases, with iterators auto-closing when the last element
    is reached.
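    A hedged sketch of iteration with the two new bounds, reusing the hypothetical AppInfo type from the KVStore sketch above (signatures assumed):
    
    ```
    import org.apache.spark.kvstore.KVStore;
    import org.apache.spark.kvstore.KVStoreView;
    
    public class ViewBoundsDemo {
        static void printUpToFifty(KVStore store) throws Exception {
            KVStoreView<KVStoreDemo.AppInfo> view = store.view(KVStoreDemo.AppInfo.class)
                .last("app-050") // stop at this key...
                .max(50);        // ...or after 50 elements, whichever comes first
    
            // No Scala takeWhile plus manual close: iteration ends at the bound
            // and the underlying iterator closes when the last element is hit.
            for (KVStoreDemo.AppInfo app : view) {
                System.out.println(app.id);
            }
        }
    }
    ```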
    Marcelo Vanzin committed May 8, 2017
    Commit: 5197c21