
feat(clean): add clean_df function #559

Merged
merged 1 commit into from Apr 21, 2021

Conversation

AndyWangSFU
Contributor

@AndyWangSFU AndyWangSFU commented Apr 7, 2021

Description

clean_df function: conduct a set of operations that would be useful for cleaning and standardizing a full Pandas DataFrame. Closes #503.

How Has This Been Tested?

I have tested this function using a few real-world datasets. I will also add my test function later.

Checklist:

  • My code follows the style guidelines of this project
  • I have already squashed the commits and made the commit message conform to the project standard.
  • I have already marked the commit with "BREAKING CHANGE" or "Fixes #" if needed.
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

@codecov

codecov bot commented Apr 9, 2021

Codecov Report

Merging #559 (1842308) into develop (0ca49e7) will decrease coverage by 1.49%.
The diff coverage is 0.00%.

❗ Current head 1842308 differs from pull request most recent head 5ffcdbc. Consider uploading reports for the commit 5ffcdbc to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #559      +/-   ##
===========================================
- Coverage    85.72%   84.22%   -1.50%     
===========================================
  Files           98       99       +1     
  Lines         8621     8768     +147     
===========================================
- Hits          7390     7385       -5     
- Misses        1231     1383     +152     
Impacted Files Coverage Δ
dataprep/clean/clean_df.py 0.00% <0.00%> (ø)
dataprep/eda/correlation/compute/overview.py 98.49% <0.00%> (-0.76%) ⬇️
dataprep/eda/missing/compute/common.py 84.21% <0.00%> (-0.41%) ⬇️
dataprep/eda/dtypes.py 84.89% <0.00%> (-0.32%) ⬇️
dataprep/eda/create_report/formatter.py 95.48% <0.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0ca49e7...5ffcdbc. Read the comment docs.

The desired way to check data types.
* If 'semantic', then perform a column-wise semantic and atomic type detection.
* If 'atomic', then return the best inferred atomic data type from Python default ones.
* If 'none', then no results will be returned.
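As an editorial aside, the three options in the docstring above could be dispatched roughly as follows. This is a hypothetical sketch: the function name, the report shape, and the semantic-detection placeholder are assumptions for illustration, not the actual dataprep implementation.

```python
import pandas as pd

def infer_data_type_df(df: pd.DataFrame, data_type_detection: str = "semantic") -> pd.DataFrame:
    """Return a one-row-per-column report of inferred types (illustrative sketch)."""
    if data_type_detection not in ("semantic", "atomic", "none"):
        raise ValueError(f"invalid data_type_detection: {data_type_detection!r}")
    if data_type_detection == "none":
        # No detection requested: return an empty report.
        return pd.DataFrame(columns=["column", "inferred_type"])
    rows = []
    for col in df.columns:
        if data_type_detection == "semantic":
            # Placeholder for semantic detection (e.g. email, phone, URL);
            # falls back to a coarse label here.
            inferred = "string" if df[col].dtype == object else str(df[col].dtype)
        else:  # "atomic"
            # pandas' own atomic inference over the column values.
            inferred = pd.api.types.infer_dtype(df[col])
        rows.append({"column": col, "inferred_type": inferred})
    return pd.DataFrame(rows)
```

Under this reading, 'none' short-circuits before any per-column work, which matches the interpretation discussed in the review thread below.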
Contributor

I think the input can't be 'none' here

Contributor Author

I guess "none" means that the users do not want to perform any data type detection?

Contributor

I guess "none" means that the users do not want to perform any data type detection?

Based on my understanding of the implementation here, the input parameter of _infer_data_type_df() will not be 'none'. If it's not 'semantic', you perform the atomic detection, which means it can only be 'semantic' or 'atomic'?

Contributor

@qidanrui qidanrui Apr 20, 2021

I guess "none" means that the users do not want to perform any data type detection?

Based on my understanding of the implementation here, the input parameter of _infer_data_type_df() will not be 'none'. If it's not 'semantic', you perform the atomic detection, which means it can only be 'semantic' or 'atomic'?

I think Andy has filtered the none parameter in the clean_df() function:
[screenshot]
However, I think there is a typo: the None should be none~

Contributor Author

Thanks Danrui! Greatly appreciated. I did not notice this typo before.

print(f"\tMemory reduced from {old_stat} to {nclnd}. New size: ({pclnd}%)")
else:
print("Downcast Memory not Performed.")
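The snippet above reports the memory saved by downcasting. A self-contained sketch of that idea (function name and report format are assumptions for illustration, not the actual dataprep code) could look like this:

```python
import pandas as pd

def downcast_report(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns and print the memory saving (illustrative)."""
    old_mem = df.memory_usage(deep=True).sum()
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        # Choose the downcast target from the column's current kind.
        kind = "integer" if pd.api.types.is_integer_dtype(out[col]) else "float"
        out[col] = pd.to_numeric(out[col], downcast=kind)
    new_mem = out.memory_usage(deep=True).sum()
    pct = round(new_mem / old_mem * 100, 2)
    print(f"\tMemory reduced from {old_mem} to {new_mem} bytes. New size: ({pct}%)")
    return out
```

For example, an int64 column of small values downcasts to int8, shrinking that column's footprint to one eighth.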

Contributor

Maybe it's better to add a validity check here, to avoid illegal calls.

Contributor Author

Thanks Yi. Sorry, I did not get it: what type of illegal call could it possibly be? Do you mean downcasted memory > old memory?

Contributor

What I mean is that the input option parameter can be an illegal string.

Contributor Author

I am not very sure whether it could happen, as the _create_report() function is only callable inside clean_df(), guarded by the if report: condition (line 131). But users might directly call this function accidentally and receive unexpected errors, so I am OK with adding a check condition inside this function.

By the way, this reminds me that users could confuse this _create_report() with the other report functions in utils.py. Is it necessary to rename this function to _create_report_clean_df()?

Contributor

Yes, what I'm referring to is potential accidental use by users. But I'm OK with either way, so just let me know your decision. For the function name, do you think create_report_df() or create_df_report() will work?

Contributor Author

We can do both. I feel like create_report_df() works (and then add illegal checks). We can also listen to Danrui's opinion!

Contributor Author

Fixed in the newest commit!

@yxie66 yxie66 requested a review from qidanrui April 14, 2021 23:52
@dovahcrow dovahcrow added this to the 0.3.0 milestone Apr 15, 2021
@qidanrui
Contributor

I think the pass in _standardize_missing_values_df() is redundant, because your if statements are enough.
[screenshot]
If you think there should still be a placeholder, how about changing it like this:
[screenshot]
I think it is more concrete~
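Danrui's point is that an `if` body that already does work needs no trailing `pass`. A minimal sketch of a null-standardizing helper without the placeholder (function name and null-value list are assumptions, not the actual dataprep code):

```python
import numpy as np
import pandas as pd

def standardize_missing_values(series: pd.Series) -> pd.Series:
    """Map common null-like strings to np.nan; no trailing `pass` needed."""
    null_like = {"", "na", "n/a", "null", "none", "-"}

    def _standardize(value):
        if isinstance(value, str) and value.strip().lower() in null_like:
            return np.nan
        # Non-null-like values pass through unchanged.
        return value

    return series.map(_standardize)
```

Each branch either returns a value or falls through to the final return, so no branch needs a `pass` statement.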

@qidanrui
Contributor

Sorry, I'm a little confused about why there are two very similar functions returning opposite values. Is there any possibility of combining them?
[screenshot]

@qidanrui
Contributor

I changed some return types according to Andy's logic:

  • Changing the return type of the clean_df() function from pd.DataFrame to Union[Tuple[pd.DataFrame, pd.DataFrame], pd.DataFrame]. That is because if the parameter data_type_detection = 'none', the function returns only one dataframe (the original dataframe). However, if the parameter data_type_detection != 'none', the function returns two dataframes. Thus I think it is better to use the Union return type.

[screenshot]

  • Changing the return type of `_infer_semantic_data_type()` from `Any` to `str`

[screenshot]

  • Changing the return type of `_infer_atomic_data_type()` from `Any` to `str`

[screenshot]

No more comments for this function @yxie66. I think after Andy's small refining we can merge it.
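For illustration, the Union return type Danrui describes could look like the following stripped-down sketch. The cleaning and detection steps are elided placeholders; only the signature and the branching on data_type_detection mirror the discussion, and the report shape is an assumption:

```python
from typing import Tuple, Union

import pandas as pd

def clean_df(
    df: pd.DataFrame, data_type_detection: str = "semantic"
) -> Union[Tuple[pd.DataFrame, pd.DataFrame], pd.DataFrame]:
    """Return (type_report, cleaned_df), or just cleaned_df when detection is off."""
    cleaned = df.copy()  # actual cleaning steps elided
    if data_type_detection == "none":
        # Only one dataframe comes back in this branch.
        return cleaned
    report = pd.DataFrame({"column": df.columns})  # actual detection elided
    return report, cleaned
```

Because the two branches return different shapes, Union[Tuple[...], pd.DataFrame] is the honest annotation; callers must check which form they received.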

@AndyWangSFU
Contributor Author

I think the pass in _standardize_missing_values_df() is redundant, because your if statements are enough.
[screenshot]
If you think there should still be a placeholder, how about changing it like this:
[screenshot]
I think it is more concrete~

This makes sense! I fixed it in the latest commit, using the second suggestion.

@AndyWangSFU
Contributor Author

I changed some return types according to Andy's logic:

  • Changing the return type of the clean_df() function from pd.DataFrame to Union[Tuple[pd.DataFrame, pd.DataFrame], pd.DataFrame]. That is because if the parameter data_type_detection = 'none', the function returns only one dataframe (the original dataframe). However, if the parameter data_type_detection != 'none', the function returns two dataframes. Thus I think it is better to use the Union return type.
[screenshot]
  • Changing the return type of _infer_semantic_data_type() from Any to str
[screenshot]
  • Changing the return type of _infer_atomic_data_type() from Any to str
[screenshot]

No more comments for this function @yxie66. I think after Andy's small refining we can merge it.

Point 1 totally makes sense and I modified it. Regarding the return types, judging from the pylint test, I guess the infer_type() function from Pandas does not always return str, and the default is Any. It might be better to keep this.

@AndyWangSFU
Contributor Author

Sorry, I'm a little confused about why there are two very similar functions returning opposite values. Is there any possibility of combining them?
[screenshot]

This is a very good question. I was hoping to combine them, and there is definitely a way to achieve it by adding one more parameter. For example, I tried check_values(data: str, valid: bool) to switch the check condition between valid and invalid.

However, code efficiency drops a lot when adding this check condition, as it is mapped to every string in every column. Also, I thought it would be clearer to keep two useful functions, because users can directly call either one. Checking "valid" and checking "null" are equally common scenarios, so adding optional parameters would be more confusing.

Based on these two reasons, I finally decided to keep both of them. Hope it makes sense!
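Andy's trade-off can be illustrated with a toy version of the two checkers (names and null-value list are hypothetical, not the actual dataprep functions): keeping two flag-free functions avoids testing a `valid` parameter on every call when the check is mapped over every cell of a column.

```python
NULL_VALUES = {"", "na", "n/a", "null", "none"}

def check_null(value: str) -> bool:
    """True when the string is a null-like value."""
    return value.strip().lower() in NULL_VALUES

def check_valid(value: str) -> bool:
    """True when the string is NOT null-like; the mirror image of check_null."""
    return value.strip().lower() not in NULL_VALUES

# A combined check_values(value, valid=...) would branch on the flag inside
# every per-cell call, which is the efficiency concern described above.
```

Two tiny functions also read better at the call site: `series.map(check_null)` states intent directly, with no boolean argument to decode.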

@qidanrui
Contributor

Sorry, I'm a little confused about why there are two very similar functions returning opposite values. Is there any possibility of combining them?
[screenshot]

This is a very good question. I was hoping to combine them, and there is definitely a way to achieve it by adding one more parameter. For example, I tried check_values(data: str, valid: bool) to switch the check condition between valid and invalid.

However, code efficiency drops a lot when adding this check condition, as it is mapped to every string in every column. Also, I thought it would be clearer to keep two useful functions, because users can directly call either one. Checking "valid" and checking "null" are equally common scenarios, so adding optional parameters would be more confusing.

Based on these two reasons, I finally decided to keep both of them. Hope it makes sense!

Thanks for your reply, Andy! Now I think they're reasonable~

@qidanrui
Contributor

I changed some return types according to Andy's logic:

  • Changing the return type of the clean_df() function from pd.DataFrame to Union[Tuple[pd.DataFrame, pd.DataFrame], pd.DataFrame]. That is because if the parameter data_type_detection = 'none', the function returns only one dataframe (the original dataframe). However, if the parameter data_type_detection != 'none', the function returns two dataframes. Thus I think it is better to use the Union return type.
[screenshot]
  • Changing the return type of _infer_semantic_data_type() from Any to str
[screenshot]
  • Changing the return type of _infer_atomic_data_type() from Any to str
[screenshot] No more comments for this function @yxie66. I think after Andy's small refining we can merge it.

Point 1 totally makes sense and I modified it. Regarding the return types, judging from the pylint test, I guess the infer_type() function from Pandas does not always return str, and the default is Any. It might be better to keep this.

Got it! Reasonable to me~ Thanks Andy!

qidanrui
qidanrui previously approved these changes Apr 20, 2021
@yxie66 yxie66 merged commit f89a172 into develop Apr 21, 2021
@yxie66 yxie66 deleted the clean/clean_df branch April 21, 2021 01:13
devinllu pushed a commit to devinllu/dataprep that referenced this pull request Nov 9, 2021
feat(clean): add clean_df function

Successfully merging this pull request may close these issues.

Feature Proposal: clean_df functionality in clean module