
Add a build in CI for pyspark 3.0 #521

Merged · 19 commits · Mar 31, 2020

Conversation

@liangz1 (Collaborator) commented Mar 26, 2020

The PyPI package is not released yet. Once it is available, I'll update the code.

codecov bot commented Mar 26, 2020

Codecov Report

Merging #521 into master will increase coverage by 0.05%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #521      +/-   ##
==========================================
+ Coverage   86.18%   86.24%   +0.05%     
==========================================
  Files          81       81              
  Lines        4467     4471       +4     
  Branches      717      718       +1     
==========================================
+ Hits         3850     3856       +6     
+ Misses        505      503       -2     
  Partials      112      112              
Impacted Files Coverage Δ
petastorm/unischema.py 95.79% <100.00%> (+1.03%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4df17f3...2cab881.

@WeichenXu123 (Collaborator) commented:

@liangz1
I built the latest master pyspark and uploaded it to S3.
Run pip install https://ml-team-public-read.s3-us-west-2.amazonaws.com/pyspark-3.1.0.dev0-60dd1a690fed62b1d6442cdc8cf3f89ef4304d5a.tar.gz to install the latest Spark.

.travis.yml (review comment, outdated, resolved)
@liangz1 liangz1 changed the title [WIP] Add a build in CI for spark 3.0.0.rc (waiting for pypi release) Add a build in CI for pyspark 3.0 Mar 31, 2020
Comment on lines 393 to 395
schema_field_indices = {field_name: i for i, field_name in enumerate(unischema.fields)}
sorted_dict = OrderedDict(sorted(encoded_dict.items(), key=lambda item: schema_field_indices[item[0]]))
return pyspark.Row(**sorted_dict)
@WeichenXu123 (Collaborator) commented Mar 31, 2020


sorted_dict still cannot guarantee a correct result: Python < 3.6 does not guarantee the kwargs key order.
See https://issues.apache.org/jira/browse/SPARK-29748

Replace the 3 lines with:

field_list = list(unischema.fields.keys())
# generate a value list that matches the schema column order
value_list = [encoded_dict[name] for name in field_list]
# create a row by value list
row = pyspark.Row(*value_list)
# set row fields
row.__fields__ = field_list
return row

CC @mengxr
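The intent of the suggested fix can be sketched without Spark (the `schema_fields` and `encoded_dict` names below are illustrative, not petastorm API): build the value list in the schema's field order rather than relying on kwargs ordering, which Python < 3.6 does not preserve.

```python
from collections import OrderedDict

# Illustrative stand-in for unischema.fields: an ordered name -> field mapping.
schema_fields = OrderedDict([("id", None), ("name", None), ("score", None)])

# Encoded values may arrive with keys in arbitrary order.
encoded_dict = {"score": 0.9, "id": 1, "name": "a"}

# Reorder the values to match the schema column order, as the suggested fix
# does, instead of passing kwargs (whose order is undefined before Python 3.6).
field_list = list(schema_fields.keys())
value_list = [encoded_dict[name] for name in field_list]

print(field_list)  # ['id', 'name', 'score']
print(value_list)  # [1, 'a', 0.9]
```

In the actual fix, `value_list` is then passed positionally to `pyspark.Row(*value_list)` and `row.__fields__` is set to `field_list`, so the row never depends on kwargs ordering.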

@WeichenXu123 (Collaborator) commented:

Summary:

@WeichenXu123 WeichenXu123 merged commit c4cc57a into uber:master Mar 31, 2020
@WeichenXu123 WeichenXu123 deleted the ci-spark-3.0 branch March 31, 2020 06:18
tkakantousis pushed a commit to logicalclocks/petastorm that referenced this pull request Sep 16, 2020
* This PR adds a CI test against the latest Spark master version.
* Fixes the `dict_to_spark_row` method; see uber#521 (comment)
* For spark<3.0 with pyarrow>=0.15, adds ARROW_PRE_0_15_IPC_FORMAT=1 in CI and removes an unnecessary skipif in tests. See the explanation in https://spark.apache.org/docs/3.0.0-preview/sql-pyspark-pandas-with-arrow.html#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x

Co-authored-by: WeichenXu <weichen.xu@databricks.com>
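As a sketch of the compatibility setting mentioned in the last bullet: the environment variable name comes from the linked Spark docs, while the surrounding commands are illustrative of how a CI step might apply it.

```shell
# Enable the legacy Arrow IPC format so pyarrow >= 0.15 interoperates with
# Spark 2.3/2.4 (per the linked Spark documentation). Export this before
# running the test suite in the CI job.
export ARROW_PRE_0_15_IPC_FORMAT=1
echo "ARROW_PRE_0_15_IPC_FORMAT=$ARROW_PRE_0_15_IPC_FORMAT"
```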
3 participants