
Add a build in CI for pyspark 3.0 #521

Merged · 19 commits · Mar 31, 2020

Conversation

@liangz1 (Collaborator) commented Mar 26, 2020

The PyPI package is not released yet. Once it is available, I'll update the code.

codecov bot commented Mar 26, 2020

Codecov Report

Merging #521 into master will increase coverage by 0.05%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #521      +/-   ##
==========================================
+ Coverage   86.18%   86.24%   +0.05%     
==========================================
  Files          81       81              
  Lines        4467     4471       +4     
  Branches      717      718       +1     
==========================================
+ Hits         3850     3856       +6     
+ Misses        505      503       -2     
  Partials      112      112              
Impacted Files Coverage Δ
petastorm/unischema.py 95.79% <100.00%> (+1.03%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4df17f3...2cab881.

@WeichenXu123 (Collaborator) commented:

@liangz1
I built the latest master pyspark and uploaded it to S3.
Run pip install https://ml-team-public-read.s3-us-west-2.amazonaws.com/pyspark-3.1.0.dev0-60dd1a690fed62b1d6442cdc8cf3f89ef4304d5a.tar.gz to install the latest Spark.

.travis.yml (review comment, outdated, resolved)
@liangz1 liangz1 changed the title [WIP] Add a build in CI for spark 3.0.0.rc (waiting for pypi release) Add a build in CI for pyspark 3.0 Mar 31, 2020
Comment on lines 393 to 395
schema_field_indices = {field_name: i for i, field_name in enumerate(unischema.fields)}
sorted_dict = OrderedDict(sorted(encoded_dict.items(), key=lambda item: schema_field_indices[item[0]]))
return pyspark.Row(**sorted_dict)
@WeichenXu123 (Collaborator) commented Mar 31, 2020


sorted_dict still cannot guarantee a correct result: Python < 3.6 does not guarantee the kwargs key order.
See https://issues.apache.org/jira/browse/SPARK-29748

Replace the 3 lines with:

field_list = list(unischema.fields.keys())
# generate a value list that matches the schema column order
value_list = [encoded_dict[name] for name in field_list]
# create a row by value list
row = pyspark.Row(*value_list)
# set row fields
row.__fields__ = field_list
return row

CC @mengxr
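The intent of the suggested fix can be sketched without Spark (the `schema_fields` and `encoded_dict` names below are illustrative, not petastorm API): build the value list in the schema's field order rather than relying on kwargs ordering, which Python < 3.6 does not preserve.

```python
from collections import OrderedDict

# Illustrative stand-in for unischema.fields: an ordered name -> field mapping.
schema_fields = OrderedDict([("id", None), ("name", None), ("score", None)])

# Encoded values may arrive with keys in arbitrary order.
encoded_dict = {"score": 0.9, "id": 1, "name": "a"}

# Reorder the values to match the schema column order, as the suggested fix
# does, instead of passing kwargs (whose order is undefined before Python 3.6).
field_list = list(schema_fields.keys())
value_list = [encoded_dict[name] for name in field_list]

print(field_list)  # ['id', 'name', 'score']
print(value_list)  # [1, 'a', 0.9]
```

In the actual fix, `value_list` is then passed positionally to `pyspark.Row(*value_list)` and `row.__fields__` is set to `field_list`, so the row never depends on kwargs ordering.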

@WeichenXu123 (Collaborator) commented:

Summary:

@WeichenXu123 WeichenXu123 merged commit c4cc57a into uber:master Mar 31, 2020
@WeichenXu123 WeichenXu123 deleted the ci-spark-3.0 branch March 31, 2020 06:18
tkakantousis pushed a commit to logicalclocks/petastorm that referenced this pull request Sep 16, 2020
* This PR adds a CI test against the latest Spark master version.
* Fixes the `dict_to_spark_row` method; see uber#521 (comment)
* For spark<3.0 with pyarrow>=0.15, adds ARROW_PRE_0_15_IPC_FORMAT=1 in CI and removes an unnecessary skipif in tests. See the explanation in https://spark.apache.org/docs/3.0.0-preview/sql-pyspark-pandas-with-arrow.html#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x

Co-authored-by: WeichenXu <weichen.xu@databricks.com>
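As a sketch of the compatibility setting mentioned in the last bullet: the environment variable name comes from the linked Spark docs, while the surrounding commands are illustrative of how a CI step might apply it.

```shell
# Enable the legacy Arrow IPC format so pyarrow >= 0.15 interoperates with
# Spark 2.3/2.4 (per the linked Spark documentation). Export this before
# running the test suite in the CI job.
export ARROW_PRE_0_15_IPC_FORMAT=1
echo "ARROW_PRE_0_15_IPC_FORMAT=$ARROW_PRE_0_15_IPC_FORMAT"
```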
3 participants