Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SHS-NG M1: Add KVStore abstraction, LevelDB implementation. #3

Closed
wants to merge 30 commits into from

Conversation

vanzin
Copy link
Owner

@vanzin vanzin commented Apr 17, 2017

The interface is described in KVIndex.java (see javadoc). Specifics
of the LevelDB implementation are discussed in the javadocs of both
LevelDB.java and LevelDBTypeInfo.java.

Included also are a few small benchmarks just to get some idea of
latency. Because they're too slow for regular unit test runs, they're
disabled by default.

@vanzin vanzin force-pushed the shs-ng/M1 branch 4 times, most recently from 86ea388 to 98838e6 Compare May 1, 2017 22:58
zsxwing and others added 2 commits May 5, 2017 11:08
…o reduce the load

## What changes were proposed in this pull request?

I checked the logs of https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.2-test-maven-hadoop-2.7/47/ and found it took several seconds to create Kafka internal topic `__consumer_offsets`. As Kafka creates this topic lazily, the topic creation happens in the first test `deserialization of initial offset with Spark 2.1.0` and causes it timeout.

This PR changes `offsets.topic.num.partitions` from the default value 50 to 1 to make creating `__consumer_offsets` (50 partitions -> 1 partition) much faster.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#17863 from zsxwing/fix-kafka-flaky-test.
## What changes were proposed in this pull request?

SparkSQL can now read from a database table with column type [TIMESTAMP WITH TIME ZONE](https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE).

## How was this patch tested?

Tested against Oracle database.

JoshRosen, you seem to know the class, would you look at this? Thanks!

Author: Jannik Arndt <jannik@jannikarndt.de>

Closes apache#17832 from JannikArndt/spark-20557-timestamp-with-timezone.
… to start of batch

## What changes were proposed in this pull request?

Due to a likely typo, the logDebug msg printing the diff of query plans shows a diff to the initial plan, not diff to the start of batch.

## How was this patch tested?

Now the debug message prints the diff between start and end of batch.

Author: Juliusz Sompolski <julek@databricks.com>

Closes apache#17875 from juliuszsompolski/SPARK-20616.
HyukjinKwon and others added 21 commits May 5, 2017 21:26
…nkins in AppVeyor

## What changes were proposed in this pull request?

Currently, there are flooding logs in AppVeyor (in the console). This has been fine because we can download all the logs. However, (given my observations so far), logs are truncated when there are too many. It has been grown recently and it started to get truncated. For example, see  https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1209-master

Even after the log is downloaded, it looks truncated as below:

```
[00:44:21] 17/05/04 18:56:18 INFO TaskSetManager: Finished task 197.0 in stage 601.0 (TID 9211) in 0 ms on localhost (executor driver) (194/200)
[00:44:21] 17/05/04 18:56:18 INFO Executor: Running task 199.0 in stage 601.0 (TID 9213)
[00:44:21] 17/05/04 18:56:18 INFO Executor: Finished task 198.0 in stage 601.0 (TID 9212). 2473 bytes result sent to driver
...
```

Probably, it looks better to use the same log4j configuration that we are using for SparkR tests in Jenkins(please see https://github.com/apache/spark/blob/fc472bddd1d9c6a28e57e31496c0166777af597e/R/run-tests.sh#L26 and https://github.com/apache/spark/blob/fc472bddd1d9c6a28e57e31496c0166777af597e/R/log4j.properties)
```
# Set everything to be logged to the file target/unit-tests.log
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.append=true
log4j.appender.file.file=R/target/unit-tests.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n

# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=WARN
org.eclipse.jetty.LEVEL=WARN
```

## How was this patch tested?

Manually tested with spark-test account
  - https://ci.appveyor.com/project/spark-test/spark/build/672-r-log4j (there is an example for flaky test here)
  - https://ci.appveyor.com/project/spark-test/spark/build/673-r-log4j (I re-ran the build).

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#17873 from HyukjinKwon/appveyor-reduce-logs.
### What changes were proposed in this pull request?

This PR is to support JDBC data type TIME WITH TIME ZONE. It can be converted to TIMESTAMP

In addition, before this PR, for unsupported data types, we simply output the type number instead of the type name.

```
java.sql.SQLException: Unsupported type 2014
```
After this PR, the message is like
```
java.sql.SQLException: Unsupported type TIMESTAMP_WITH_TIMEZONE
```

- Also upgrade the H2 version to `1.4.195` which has the type fix for "TIMESTAMP WITH TIMEZONE". However, it is not fully supported. Thus, we capture the exception, but we still need it to partially test the support of "TIMESTAMP WITH TIMEZONE", because Docker tests are not regularly run.

### How was this patch tested?
Added test cases.

Author: Xiao Li <gatorsmile@gmail.com>

Closes apache#17835 from gatorsmile/h2.
## What changes were proposed in this pull request?

- Move udf wrapping code from `functions.udf` to `functions.UserDefinedFunction`.
- Return wrapped udf from `catalog.registerFunction` and dependent methods.
- Update docstrings in `catalog.registerFunction` and `SQLContext.registerFunction`.
- Unit tests.

## How was this patch tested?

- Existing unit tests and docstests.
- Additional tests covering new feature.

Author: zero323 <zero323@users.noreply.github.com>

Closes apache#17831 from zero323/SPARK-18777.
## What changes were proposed in this pull request?

This PR adds the new unit tests to support ShuffleDataBlockId , ShuffleIndexBlockId , TempShuffleBlockId , TempLocalBlockId

## How was this patch tested?

The new unit test.

Author: caoxuewen <cao.xuewen@zte.com.cn>

Closes apache#17794 from heary-cao/blockidsuite.
## What changes were proposed in this pull request?

This PR adds documentation to the ALS code.

## How was this patch tested?

Existing tests were used.

mengxr srowen

This contribution is my original work.  I have the license to work on this project under the Spark project’s open source license.

Author: Daniel Li <dan@danielyli.com>

Closes apache#17793 from danielyli/spark-20484.
…tore access.

## What changes were proposed in this pull request?

Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.

It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://` and azure `wasb://`.

There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.

(this is the successor to apache#12004; I can't re-open it)

## How was this patch tested?

Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)

Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.

Manually clean build & verify that assembly contains the relevant aws-* hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile.

SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
maven build `mvn install -Phadoop-cloud -Phadoop-2.7`

This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.

Author: Steve Loughran <stevel@apache.org>
Author: Steve Loughran <stevel@hortonworks.com>

Closes apache#17834 from steveloughran/cloud/SPARK-7481-current.
## What changes were proposed in this pull request?

add environment

## How was this patch tested?

wait for appveyor run

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes apache#17878 from felixcheung/appveyorrcran.
## What changes were proposed in this pull request?

* Docs are consistent (across different `unix_timestamp` variants and their internal expressions)
* typo hunting

## How was this patch tested?

local build

Author: Jacek Laskowski <jacek@japila.pl>

Closes apache#17801 from jaceklaskowski/unix_timestamp.
## What changes were proposed in this pull request?

- Add SparkR wrapper for `Dataset.alias`.
- Adjust roxygen annotations for `functions.alias` (including example usage).

## How was this patch tested?

Unit tests, `check_cran.sh`.

Author: zero323 <zero323@users.noreply.github.com>

Closes apache#17825 from zero323/SPARK-20550.
## What changes were proposed in this pull request?

Adds Python wrappers for `DataFrameWriter.bucketBy` and `DataFrameWriter.sortBy` ([SPARK-16931](https://issues.apache.org/jira/browse/SPARK-16931))

## How was this patch tested?

Unit tests covering new feature.

__Note__: Based on work of GregBowyer (f49b9a2)

CC HyukjinKwon

Author: zero323 <zero323@users.noreply.github.com>
Author: Greg Bowyer <gbowyer@fastmail.co.uk>

Closes apache#17077 from zero323/SPARK-16931.
## What changes were proposed in this pull request?

This change allows timestamps in parquet-based hive table to behave as a "floating time", without a timezone, as timestamps are for other file formats.  If the storage timezone is the same as the session timezone, this conversion is a no-op.  When data is read from a hive table, the table property is *always* respected.  This allows spark to not change behavior when reading old data, but read newly written data correctly (whatever the source of the data is).

Spark inherited the original behavior from Hive, but Hive is also updating behavior to use the same  scheme in HIVE-12767 / HIVE-16231.

The default for Spark remains unchanged; created tables do not include the new table property.

This will only apply to hive tables; nothing is added to parquet metadata to indicate the timezone, so data that is read or written directly from parquet files will never have any conversions applied.

## How was this patch tested?

Added a unit test which creates tables, reads and writes data, under a variety of permutations (different storage timezones, different session timezones, vectorized reading on and off).

Author: Imran Rashid <irashid@cloudera.com>

Closes apache#16781 from squito/SPARK-12297.
## What changes were proposed in this pull request?

set timezone on windows

## How was this patch tested?

unit test, AppVeyor

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes apache#17892 from felixcheung/rtimestamptest.
…LTER TABLE SET/UNSET TBLPROPERTIES ddl

### What changes were proposed in this pull request?
Table comment was not getting  set/unset using **ALTER TABLE  SET/UNSET TBLPROPERTIES** query
eg: ALTER TABLE table_with_comment SET TBLPROPERTIES("comment"= "modified comment)
 when user alter the table properties  and adds/updates table comment,table comment which is a field  of **CatalogTable**  instance is not getting updated and  old table comment if exists was shown to user, inorder  to handle this issue, update the comment field value in **CatalogTable** with the newly added/modified comment along with other table level properties when user executes **ALTER TABLE  SET TBLPROPERTIES** query.

This pr has also taken care of unsetting the table comment when user executes query  **ALTER TABLE  UNSET TBLPROPERTIES** inorder to unset or remove table comment.
eg: ALTER TABLE table_comment UNSET TBLPROPERTIES IF EXISTS ('comment')

### How was this patch tested?
Added test cases  as part of **SQLQueryTestSuite** for verifying  table comment using desc formatted table query after adding/modifying table comment as part of **AlterTableSetPropertiesCommand** and unsetting the table comment using **AlterTableUnsetPropertiesCommand**.

Author: sujith71955 <sujithchacko.2010@gmail.com>

Closes apache#17649 from sujith71955/alter_table_comment.
## What changes were proposed in this pull request?
Fix typo in vignettes

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes apache#17884 from actuaryzhang/typo.
…ptions

Signed-off-by: liuxian <liu.xian3zte.com.cn>

## What changes were proposed in this pull request?

When the input parameter is null, may be a runtime exception occurs

## How was this patch tested?
Existing unit tests

Author: liuxian <liu.xian3@zte.com.cn>

Closes apache#17796 from 10110346/wip_lx_0428.
… information

## What changes were proposed in this pull request?

When call the method getLocations of BlockManager, we only compare the data block host. Random selection for non-local data blocks, this may cause the selected data block to be in a different rack. So in this patch to increase the sort of the rack.

## How was this patch tested?

New test case.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Xianyang Liu <xianyang.liu@intel.com>

Closes apache#17300 from ConeyLiu/blockmanager.
… cases

Existing test cases for `recommendForAllX` methods (added in [SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)) test `k < num items` and `k = num items`. Technically we should also test that `k > num items` returns the same results as `k = num items`.

## How was this patch tested?

Updated existing unit tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes apache#17860 from MLnick/SPARK-20596-als-rec-tests.
The interface is described in KVIndex.java (see javadoc). Specifics
of the LevelDB implementation are discussed in the javadocs of both
LevelDB.java and LevelDBTypeInfo.java.

Included also are a few small benchmarks just to get some idea of
latency. Because they're too slow for regular unit test runs, they're
disabled by default.
This is needed because some UI types have compound keys.
Also add unit test. When updating, the code needs to keep track of
the aggregated delta to be added to each count stored in the db,
instead of reading the count from the db for each update.
This causes JVM crashes in the leveldb library, so try to avoid it;
if there are still issues, we'll neeed locking.
Marcelo Vanzin added 6 commits May 8, 2017 10:18
Also rename LevelDBIteratorSuite to work around some super weird
issue with sbt.
LevelDB has built-in support for snappy compression, but it seems
to be buggy in the leveldb-jni library; the compression threads
don't seem to run by default, and when you enable them, there are
weird issues when stopping the DB.

So just do compression manually using the JRE libraries; it's probably
a little slower but it saves a good chunk of disk space.
The type name gets repeated a lot in the store, so using it as the prefix
for every key causes disk usage to grow unnecessarily. Instead, create a
short alias for the type and keep a mapping of aliases to known types in
a map in memory; the map is also saved to the database so it can be read
later.
The new KVTypeInfo class can help with writing different implementations
of KVStore without duplicating logic from LevelDBTypeInfo.
Turns out I ended up not using the raw storage methods in KVStore, so
this change removes them to simplify the API and save some code.
This makes it easier for callers to control the end of iteration,
making it easier to write Scala code that automatically closes
underlying iterator resources. Before, code had to use Scala's
"takeWhile", convert the result to a list, and manually close the
iterators; with these two parameters, that can be avoided in a
bunch of cases, with iterators auto-closing when the last element
is reached.
@vanzin vanzin closed this May 8, 2017
vanzin pushed a commit that referenced this pull request Nov 5, 2019
### What changes were proposed in this pull request?
`org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite` failed lately. After had a look at the logs it just shows the following fact without any details:
```
Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database
```
Since the issue is intermittent and not able to reproduce it we should add more debug information and wait for reproduction with the extended logs.

### Why are the changes needed?
Failing test doesn't give enough debug information.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
I've started the test manually and checked that such additional debug messages show up:
```
>>> KrbApReq: APOptions are 00000000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Looking for keys for: kafka/localhostEXAMPLE.COM
Added key: 17version: 0
Added key: 23version: 0
Added key: 16version: 0
Found unsupported keytype (3) for kafka/localhostEXAMPLE.COM
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Using builtin default etypes for permitted_enctypes
default etypes for permitted_enctypes: 17 16 23.
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
MemoryCache: add 1571936500/174770/16C565221B70AAB2BEFE31A83D13A2F4/client/localhostEXAMPLE.COM to client/localhostEXAMPLE.COM|kafka/localhostEXAMPLE.COM
MemoryCache: Existing AuthList:
#3: 1571936493/200803/8CD70D280B0862C5DA1FF901ECAD39FE/client/localhostEXAMPLE.COM
#2: 1571936499/985009/BAD33290D079DD4E3579A8686EC326B7/client/localhostEXAMPLE.COM
#1: 1571936499/995208/B76B9D78A9BE283AC78340157107FD40/client/localhostEXAMPLE.COM
```

Closes apache#26252 from gaborgsomogyi/SPARK-29580.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet