Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Modify concatenate_columns ignore_empty output #1166

Draft
wants to merge 10 commits into
base: dev
Choose a base branch
from

Conversation

Fu-Jie
Copy link

@Fu-Jie Fu-Jie commented Sep 8, 2022

PR Description

Please describe the changes proposed in the pull request:

  • modify concatenate_column ignore_empty output (pd.NA,pd.NaT,None,np.nan => "")
  • modify concatenate_columns test case

This PR resolves #1164.

PR Checklist

Please ensure that you have done the following:

  1. PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  1. If you're not on the contributors list, add yourself to AUTHORS.md.
  1. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests are passed
  • Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@Fu-Jie Fu-Jie changed the title Modify concatenate_column ignore_empty output [BUG] Modify concatenate_column ignore_empty output Sep 8, 2022
@Fu-Jie Fu-Jie changed the title [BUG] Modify concatenate_column ignore_empty output Modify concatenate_column ignore_empty output Sep 8, 2022
@Fu-Jie Fu-Jie changed the title Modify concatenate_column ignore_empty output [ENH] Modify concatenate_column ignore_empty output Sep 8, 2022
@Fu-Jie Fu-Jie changed the title [ENH] Modify concatenate_column ignore_empty output [BUG] Modify concatenate_column ignore_empty output Sep 8, 2022
@codecov
Copy link

codecov bot commented Sep 8, 2022

Codecov Report

Merging #1166 (66ea4c0) into dev (68b8bb0) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev    #1166      +/-   ##
==========================================
- Coverage   98.04%   98.01%   -0.03%     
==========================================
  Files          76       76              
  Lines        3524     3525       +1     
==========================================
  Hits         3455     3455              
- Misses         69       70       +1     

Copy link
Member

@Zeroto521 Zeroto521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this problem.

I thought ignore_empty=True we could fill nan.

    df[new_column_name] = (
        df[column_names].fillna("").astype(str).agg(sep.join, axis=1)
        if ignore_empty
        else df[column_names].astype(str).agg(sep.join, axis=1)
    )

And It's better to do the fill operation and then do the change type operation.
It's hard to say how many nan strings there have.

df.fillna("").astype(str)

df.astype(str).replace(["NaT", "nan", "<NA>"], "")

AUTHORS.md Outdated Show resolved Hide resolved
janitor/functions/concatenate_columns.py Show resolved Hide resolved
Copy link
Member

@ericmjl ericmjl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fu-Jie thank you for your contribution here! I'm noticing that the change will likely be a breaking change, i.e. it modifies the old expected behaviour of the function. Can we ensure that the suggested changes are toggleable via function arguments?

Comment on lines 55 to 58
df[column_names]
.astype(str)
.replace(["NaT", "nan", "<NA>"], "")
.agg(sep.join, axis=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want an argument here to toggle between old and new behaviours. Would you be open to doing so?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, the translation software I used may not have described it clearly。
For ignoring null values, my idea comes from the implementation of Excel, feeling that the implementation of Excel is more in line with the actual use
https://support.microsoft.com/en-us/office/textjoin-function-357b449a-ec91-49d0-80c3-0e8fc845691c

Copy link
Author

@Fu-Jie Fu-Jie Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericmjl
It is my understanding that if null values are not ignored, then this is more reasonable.

Pseudo-code

split = ','
if ignore_empty =  False   then  1,2,pd.NA  ->  1,2,
if ignore_empty =  True    then  1,2,pd.NA  ->  1,2

Fu-Jie and others added 2 commits September 13, 2022 10:21
Co-authored-by: 40% <Zeroto521@gmail.com>
Co-authored-by: 40% <Zeroto521@gmail.com>
@Fu-Jie
Copy link
Author

Fu-Jie commented Sep 13, 2022

And It's better to do the fill operation and then do the change type operation. It's hard to say how many nan strings there have.

df.fillna("").astype(str)

df.astype(str).replace(["NaT", "nan", "<NA>"], "")

@Zeroto521 I am test None、pd.NA、pd.NaT、np.nan,there should be nothing else.

@Fu-Jie Fu-Jie changed the title [BUG] Modify concatenate_column ignore_empty output [BUG] Modify concatenate_columns ignore_empty output Sep 13, 2022
@Fu-Jie
Copy link
Author

Fu-Jie commented Sep 13, 2022

Thanks for fixing this problem.

I thought ignore_empty=True we could fill nan.

    df[new_column_name] = (
        df[column_names].fillna("").astype(str).agg(sep.join, axis=1)
        if ignore_empty
        else df[column_names].astype(str).agg(sep.join, axis=1)
    )

@Zeroto521
At first, I also thought of this solution,but there are some problems:
Fillna first, then astype, can not handle columns of type int or float。

@Zeroto521
Copy link
Member

Zeroto521 commented Sep 13, 2022

import numpy as np
import pandas as pd


def fillna_astype(df, sep="-"):
    return df.fillna("").astype(str).agg(sep.join, axis=1)


def astype_fillna(df, sep="-"):
    return df.astype(str).replace(["NaT", "nan", "<NA>"], "").agg(sep.join, axis=1)

# normal case
# both of them passed, but `astype_fillna` need to replace `None` and `'NaN'`.
pd.DataFrame(
    {
        "a": ["string", 1, 1.5, np.nan],
        "b": ["another_string", 0, pd.NA, None],
    }
).pipe(fillna_astype)
# 0    string-another_string
# 1                      1-0
# 2                     1.5-
# 3                        -
# dtype: object

pd.DataFrame(
    {
        "a": ["string", 1, 1.5, np.nan],
        "b": ["another_string", 0, pd.NA, None],
    }
).pipe(astype_fillna)
# 0    string-another_string
# 1                      1-0
# 2                     1.5-
# 3                    -None
# dtype: object


# this one is a special case. `astype_fillna` is failed.
# we only want to fill na value.
pd.DataFrame(
    {
        "a": ["string", np.nan, pd.NA, None],
        "b": ["another_string", "nan", "<NA>", "None"],
    }
).pipe(fillna_astype)
# 0    string-another_string
# 1                     -nan
# 2                    -<NA>
# 3                    -None
# dtype: object

pd.DataFrame(
    {
        "a": ["string", np.nan, pd.NA, None],
        "b": ["another_string", "nan", "<NA>", "None"],
    }
).pipe(astype_fillna)
# 0    string-another_string
# 1                        -  # wrong
# 2                        -  # wrong
# 3                None-None  # wrong
# dtype: object

@Fu-Jie
Copy link
Author

Fu-Jie commented Sep 13, 2022

@Zeroto521

normal case

It could be a pandas(1.3.5) version issue,my env both None and np.nan astype for "nan" ,Should need to increase the na value.

def astype_fillna(df, sep="-"):
    return df.astype(str).replace(["NaT", "nan", "<NA>","None"], "").agg(sep.join, axis=1)

about fillna astype float or int issue

import pandas as pd 
def fillna_astype(df, sep="-"):
    return df.fillna("").astype(str).agg(sep.join, axis=1)
pd.DataFrame(
    {
        "b": [1, 0, pd.NA, 3],
    }
    ,dtype=pd.Float32Dtype()
).pipe(fillna_astype)
##
## TypeError: <U1 cannot be converted to a FloatingDtype

special case

In my opinion, what should be dealt with is the null value of the column, not the text that represents the empty meaning.

@Zeroto521
Copy link
Member

I think that isn't a good example for fillna_astype.
Once you point out the type of data, some of the methods you can't use. This is what's the dtype meaning.

import pandas as pd 
def fillna_astype(df, sep="-"):
    return df.fillna("").astype(str).agg(sep.join, axis=1)
pd.DataFrame(
    {
        "b": [1, 0, pd.NA, 3],
    }
    ,dtype=pd.Float32Dtype()
).pipe(fillna_astype)
##
## TypeError: <U1 cannot be converted to a FloatingDtype

@Fu-Jie
Copy link
Author

Fu-Jie commented Sep 13, 2022

I think that isn't a good example for fillna_astype. Once you point out the type of data, some of the methods you can't use. This is what's the dtype meaning.

import pandas as pd 
def fillna_astype(df, sep="-"):
    return df.fillna("").astype(str).agg(sep.join, axis=1)
pd.DataFrame(
    {
        "b": [1, 0, pd.NA, 3],
    }
    ,dtype=pd.Float32Dtype()
).pipe(fillna_astype)
##
## TypeError: <U1 cannot be converted to a FloatingDtype

This situation may appear in the read_sql int column(pandas dafaultl parse) or read_excel specify the dtype = 'int'
In addition to this problem, I also think it will be better to use Fillna first。

use astype("string")
@samukweku
Copy link
Collaborator

samukweku commented Sep 20, 2022

Any thoughts on the progress of this PR @thatlittleboy @ericmjl @Zeroto521 ? @Fu-Jie kindly rebase so that this PR is updated to the latest

Copy link
Contributor

@thatlittleboy thatlittleboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're close! just need to clear up some inconsistencies against the docstrings

janitor/functions/concatenate_columns.py Show resolved Hide resolved
@@ -28,7 +28,7 @@ def test_concatenate_columns_null_values(missingdata_df):
new_column_name="index",
ignore_empty=True,
)
expected_values = ["1.0-1", "2.0-2", "nan-3"] * 3
expected_values = ["1.0-1", "2.0-2", "3"] * 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also update the docstrings for the test.

@@ -28,7 +28,7 @@ def test_concatenate_columns_null_values(missingdata_df):
new_column_name="index",
ignore_empty=True,
)
expected_values = ["1.0-1", "2.0-2", "nan-3"] * 3
expected_values = ["1.0-1", "2.0-2", "3"] * 3
assert expected_values == df["index"].tolist()


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think it might be worth writing a test merging a custom dataframe with a float column (NaN), a datetime column (NaT) and a string column (None/NA?).

And assert the expected output accordingly.

Then, mention this PR or the attached issue in the test docstring as well, please.

@Zeroto521 Zeroto521 marked this pull request as draft November 28, 2022 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] concatenate_columns Ignoring null values is not available
5 participants