Add a build in CI for pyspark 3.0 #521
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #521      +/-   ##
==========================================
+ Coverage   86.18%   86.24%   +0.05%
==========================================
  Files          81       81
  Lines        4467     4471       +4
  Branches      717      718       +1
==========================================
+ Hits         3850     3856       +6
+ Misses        505      503       -2
  Partials      112      112

Continue to review the full report at Codecov.
@liangz1
petastorm/unischema.py
Outdated
schema_field_indices = {field_name: i for i, field_name in enumerate(unischema.fields)}
sorted_dict = OrderedDict(sorted(encoded_dict.items(), key=lambda item: schema_field_indices[item[0]]))
return pyspark.Row(**sorted_dict)
Sorting the dict still cannot guarantee a correct result: Python < 3.6 does not guarantee kwargs key order.
See https://issues.apache.org/jira/browse/SPARK-29748
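For context, a minimal sketch of the problem the JIRA describes, assuming the legacy Row(**kwargs) semantics that SPARK-29748 removed:

from pyspark.sql import Row

# Under the legacy semantics, Row(**kwargs) sorts field names
# alphabetically, and on Python < 3.6 **kwargs does not preserve
# insertion order either, so the Row's field order is not the order
# the caller wrote (and may no longer match the target schema):
row = Row(b=1, a=2)
# legacy result: Row(a=2, b=1)

# The positional form keeps values in the given order; assigning
# __fields__ afterwards attaches the matching names:
row = Row(1, 2)
row.__fields__ = ['b', 'a']
# row.b == 1, row.a == 2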
Replace the 3 lines with:
field_list = list(unischema.fields.keys())
# Generate a value list that matches the schema column order.
value_list = [encoded_dict[name] for name in field_list]
# Create a row from the value list.
row = pyspark.Row(*value_list)
# Set the row field names.
row.__fields__ = field_list
return row
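A self-contained sketch of this pattern; the helper name and the plain OrderedDict standing in for unischema.fields are illustrative, not Petastorm API:

from collections import OrderedDict
from pyspark.sql import Row

def row_from_dict(field_order, encoded_dict):
    """Build a pyspark Row whose field order follows field_order."""
    field_list = list(field_order)
    value_list = [encoded_dict[name] for name in field_list]
    row = Row(*value_list)       # positional: order is preserved
    row.__fields__ = field_list  # attach the matching field names
    return row

# Field order comes from the schema, not from the input dict:
fields = OrderedDict([('id', None), ('name', None)])
print(row_from_dict(fields, {'name': 'abc', 'id': 7}))
# Row(id=7, name='abc')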
CC @mengxr
Summary:
* This PR adds a test against the latest Spark master version in CI.
* Fix the `dict_to_spark_row` method, see uber#521 (comment).
* For spark < 3.0 and pyarrow >= 0.15, set ARROW_PRE_0_15_IPC_FORMAT=1 in CI and remove the unnecessary skipif in tests. See the explanation in https://spark.apache.org/docs/3.0.0-preview/sql-pyspark-pandas-with-arrow.html#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x

Co-authored-by: WeichenXu <weichen.xu@databricks.com>
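The environment variable itself comes from the linked Spark docs; as a hedged sketch, one way a test entry point could opt into the legacy Arrow IPC format from Python (exporting it in the CI shell is equivalent):

import os

# Spark < 3.0 with pyarrow >= 0.15 needs the legacy Arrow IPC format;
# this must be set before any Spark workers are launched (CI can
# equivalently export it in the shell environment).
os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'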
The PyPI package is not released yet. Once it is available, I'll update the code.