feat(clean): add clean_df function #559
Codecov Report
@@            Coverage Diff             @@
##           develop    #559      +/-   ##
===========================================
- Coverage    85.72%   84.22%   -1.50%
===========================================
  Files           98       99       +1
  Lines         8621     8768     +147
===========================================
- Hits          7390     7385       -5
- Misses        1231     1383     +152
The desired way to check data types.
* If 'semantic', then perform a column-wise semantic and atomic type detection.
* If 'atomic', then return the best inferred atomic data type from Python default ones.
* If 'none', then no results will be returned.
I think the input can't be 'none' here
I guess "none" means that the users do not want to perform any data type detection?
> I guess "none" means that the users do not want to perform any data type detection?
Based on my understanding of the implementation here, the input parameter of _infer_data_type_df() will not be 'none'. If it's not 'semantic', you will perform the atomic detection. Which means it can only be 'semantic' and 'atomic'?
> I guess "none" means that the users do not want to perform any data type detection?
> Based on my understanding of the implementation here, the input parameter of _infer_data_type_df() will not be 'none'. If it's not 'semantic', you will perform the atomic detection. Which means it can only be 'semantic' and 'atomic'?
I think Andy has already filtered out the none parameter in the clean_df() function:
However, I think there is a typo: the None should be none.
Thanks Danrui! Greatly appreciated. I did not find this typo before.
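The point settled in this thread can be illustrated with a minimal sketch, assuming the caller filters out the string 'none' before the inference helper runs, so the helper only ever sees 'semantic' or 'atomic'. Everything here is hypothetical and is not the PR's code: the names clean_columns, infer_data_types and the data_type_detection parameter are invented for illustration, a plain dict of lists stands in for a DataFrame, and the semantic branch is only a placeholder.

```python
from collections import Counter
from typing import Any, Dict, List, Optional, Tuple, Union

ColumnType = Union[str, Tuple[str, str]]


def _infer_atomic(values: List[Any]) -> str:
    """Return the most common Python type name among the non-null values."""
    names = [type(v).__name__ for v in values if v is not None]
    return Counter(names).most_common(1)[0][0] if names else "unknown"


def infer_data_types(columns: Dict[str, List[Any]], mode: str) -> Dict[str, ColumnType]:
    """Hypothetical helper: only 'semantic' and 'atomic' are legal here."""
    if mode not in ("semantic", "atomic"):
        raise ValueError(f"mode must be 'semantic' or 'atomic', got {mode!r}")
    result: Dict[str, ColumnType] = {}
    for name, values in columns.items():
        atomic = _infer_atomic(values)
        if mode == "semantic":
            # Placeholder: real semantic detection is out of scope for this sketch.
            result[name] = ("unknown", atomic)
        else:
            result[name] = atomic
    return result


def clean_columns(
    columns: Dict[str, List[Any]], data_type_detection: str = "semantic"
) -> Optional[Dict[str, ColumnType]]:
    """Hypothetical caller that filters 'none' before inference."""
    # Compare against the string "none", not the Python object None.
    if data_type_detection == "none":
        return None
    return infer_data_types(columns, data_type_detection)
```

With this split, comparing against the object None instead of the string "none" would let "none" fall through to the helper, which is the typo pointed out above.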
    print(f"\tMemory reduced from {old_stat} to {nclnd}. New size: ({pclnd}%)")
else:
    print("Downcast Memory not Performed.")
Maybe it's better to add a validity check here, to avoid illegal calls.
Thanks Yi. Sorry, I did not get it: what kind of illegal call could it be? Do you mean downcasted memory > old memory?
What I mean is that the input option parameter can be an illegal string.
I am not sure whether that could happen, as the _create_report() function is only called inside clean_df(), under the if report: condition (line 131). But users might accidentally call this function directly and receive unexpected errors, so I am OK with adding a check inside this function.
By the way, this reminds me that users could confuse this _create_report() with the other report functions in utils.py. Is it necessary to rename this function to _create_report_clean_df()?
Yes, what I'm referring to is a potential accidental call by users. But I'm OK either way, so just let me know your decision. For the function name, do you think create_report_df() or create_df_report() would work?
We can do both. I feel like create_report_df() works (and then add checks for illegal input). We can also ask for Danrui's opinion!
Fixed in the newest commit!
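For readers following the thread, the kind of guard discussed above might look like the sketch below. It is illustrative only and is not the code from the commit: the option names, the stats dictionary and its keys are assumptions, since the thread does not show which option strings the real function accepts, and the function returns the message instead of printing it so that it is easy to check.

```python
from typing import Dict

# Hypothetical option names; the real accepted values are not shown in the thread.
_VALID_REPORT_OPTIONS = ("data_type", "downcast_memory")


def create_report_df(option: str, stats: Dict[str, float]) -> str:
    """Build one report line, rejecting unknown option strings up front."""
    if option not in _VALID_REPORT_OPTIONS:
        raise ValueError(
            f"Unknown report option {option!r}; expected one of {_VALID_REPORT_OPTIONS}."
        )
    if option == "downcast_memory":
        old, new = stats["old_size"], stats["new_size"]
        pct = round(100 * new / old, 2)  # new size as a percentage of the old size
        return f"Memory reduced from {old} to {new}. New size: ({pct}%)"
    return f"Columns with inferred types: {stats['typed_columns']}"
```

Raising ValueError at the top means an accidental direct call with a misspelled option fails with a clear message instead of an unrelated error further down.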
I changed some return types according to Andy's logic:
No more comments for this function @yxie66. I think after Andy's minor refinements we can merge it.
Point 1 totally makes sense and I modified it. Regarding the return types, judging from the pylint test, I guess the
Got it! Reasonable to me. Thanks Andy!
1842308 to 5ffcdbc
5ffcdbc to b750284
feat(clean): add clean_df function
Description
clean_df function: conducts a set of operations that would be useful for cleaning and standardizing a full Pandas DataFrame. Closes #503.
How Has This Been Tested?
I have tested this function using a few real-world datasets. I will also add my test function later.
Checklist: