String-to-boolean conversion is different from Pandas #8549

Merged: 11 commits into rapidsai:branch-21.08 on Jul 1, 2021

Conversation

skirui-source
Contributor

@skirui-source skirui-source commented Jun 18, 2021

Fixes: #7875

Previously, the two libraries disagreed: Pandas treats every non-empty string as True when it converts strings to booleans, whereas cuDF returned True only for strings that matched the true string ("True" by default).

This PR resolves the mismatch by introducing the str_to_boolean method, which marks each string in a string column as True when its length is greater than 0 and replaces null values with False, mimicking the Pandas behavior.

Example:

>>> import pandas as pd
>>> import cudf

>>> gs = cudf.Series(["True", None, "", "True", "False", "False"])
>>> gs
0     True
1     <NA>
2         
3     True
4    False
5    False
dtype: object

>>> gs.astype(bool)
0     True
1    False
2    False
3     True
4     True
5     True
dtype: bool

>>> gs.to_pandas().astype(bool)
0     True
1    False
2    False
3     True
4     True
5     True
dtype: bool
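
For readers without a GPU at hand, the two rules described above can be sketched in plain Python. This is only an illustration of the semantics, not the cuDF implementation: the function names `match_true_string` and `non_empty_is_true` are hypothetical, and mapping None to False under the old rule is an assumption, since the thread does not say how nulls were handled before this change.

```python
from typing import Iterable, List, Optional


def match_true_string(
    values: Iterable[Optional[str]], true_string: str = "True"
) -> List[bool]:
    """Old rule (sketch): only exact matches of true_string are True.

    Assumption: None maps to False here; the PR thread does not state
    how the previous conversion treated nulls.
    """
    return [v == true_string for v in values]


def non_empty_is_true(values: Iterable[Optional[str]]) -> List[bool]:
    """New rule (sketch): any non-empty string is True; None and "" are False."""
    return [v is not None and len(v) > 0 for v in values]


data = ["True", None, "", "True", "False", "False"]

print(match_true_string(data))  # [True, False, False, True, False, False]
print(non_empty_is_true(data))  # [True, False, False, True, True, True]
```

The second list matches the gs.astype(bool) output shown in the example: the string "False" is non-empty, so it converts to True.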

@skirui-source added the bug and Python labels Jun 18, 2021
@skirui-source self-assigned this Jun 18, 2021
@ttnghia
Contributor

ttnghia commented Jun 21, 2021

FYI, please see https://nvidia.slack.com/archives/CDTANRCTT/p1616688610161600

@skirui-source added the breaking label Jun 30, 2021
@skirui-source marked this pull request as ready for review June 30, 2021 19:36
@skirui-source requested a review from a team as a code owner June 30, 2021 19:36
@skirui-source added the 3 - Ready for Review label Jun 30, 2021
@codecov

codecov bot commented Jun 30, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@fa50b7d). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 702ab58 differs from pull request most recent head 3ece4ca. Consider uploading reports for the commit 3ece4ca to get more accurate results

@@               Coverage Diff               @@
##             branch-21.08    #8549   +/-   ##
===============================================
  Coverage                ?   10.61%           
===============================================
  Files                   ?      109           
  Lines                   ?    18645           
  Branches                ?        0           
===============================================
  Hits                    ?     1980           
  Misses                  ?    16665           
  Partials                ?        0           


@skirui-source
Contributor Author

@gpucibot re-run tests

@skirui-source removed the 3 - Ready for Review label Jul 1, 2021
@skirui-source
Contributor Author

@gpucibot merge

@rapids-bot merged commit 51b1c23 into rapidsai:branch-21.08 Jul 1, 2021
@skirui-source deleted the strbool branch October 19, 2021 20:06
Labels
breaking (Breaking change), bug (Something isn't working), Python (Affects Python cuDF API)
Development

Successfully merging this pull request may close these issues.

[BUG] String-to-boolean conversion is different from Pandas'
3 participants