
feat(clean): add clean_df function #559

Merged
merged 1 commit into from Apr 21, 2021

Conversation

AndyWangSFU
Contributor

@AndyWangSFU AndyWangSFU commented Apr 7, 2021

Description

clean_df function: conduct a set of operations that would be useful for cleaning and standardizing a full Pandas DataFrame. Closes #503.

How Has This Been Tested?

I have tested this function using a few real-world datasets. I will also add my test function later.

Checklist:

  • My code follows the style guidelines of this project
  • I have already squashed the commits and made the commit message conform to the project standard.
  • I have already marked the commit with "BREAKING CHANGE" or "Fixes #" if needed.
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

@codecov

codecov bot commented Apr 9, 2021

Codecov Report

Merging #559 (1842308) into develop (0ca49e7) will decrease coverage by 1.49%.
The diff coverage is 0.00%.

❗ Current head 1842308 differs from pull request most recent head 5ffcdbc. Consider uploading reports for the commit 5ffcdbc to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #559      +/-   ##
===========================================
- Coverage    85.72%   84.22%   -1.50%     
===========================================
  Files           98       99       +1     
  Lines         8621     8768     +147     
===========================================
- Hits          7390     7385       -5     
- Misses        1231     1383     +152     
Impacted Files Coverage Δ
dataprep/clean/clean_df.py 0.00% <0.00%> (ø)
dataprep/eda/correlation/compute/overview.py 98.49% <0.00%> (-0.76%) ⬇️
dataprep/eda/missing/compute/common.py 84.21% <0.00%> (-0.41%) ⬇️
dataprep/eda/dtypes.py 84.89% <0.00%> (-0.32%) ⬇️
dataprep/eda/create_report/formatter.py 95.48% <0.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0ca49e7...5ffcdbc. Read the comment docs.

The desired way to check data types.
* If 'semantic', then perform a column-wise semantic and atomic type detection.
* If 'atomic', then return the best inferred atomic data type from Python default ones.
* If 'none', then no results will be returned.
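As an editorial aside, the three options in the docstring above could be dispatched roughly as follows. This is a hypothetical sketch: the function name, the report shape, and the semantic-detection placeholder are assumptions for illustration, not the actual dataprep implementation.

```python
import pandas as pd

def infer_data_type_df(df: pd.DataFrame, data_type_detection: str = "semantic") -> pd.DataFrame:
    """Return a one-row-per-column report of inferred types (illustrative sketch)."""
    if data_type_detection not in ("semantic", "atomic", "none"):
        raise ValueError(f"invalid data_type_detection: {data_type_detection!r}")
    if data_type_detection == "none":
        # No detection requested: return an empty report.
        return pd.DataFrame(columns=["column", "inferred_type"])
    rows = []
    for col in df.columns:
        if data_type_detection == "semantic":
            # Placeholder for semantic detection (e.g. email, phone, URL);
            # falls back to a coarse label here.
            inferred = "string" if df[col].dtype == object else str(df[col].dtype)
        else:  # "atomic"
            # pandas' own atomic inference over the column values.
            inferred = pd.api.types.infer_dtype(df[col])
        rows.append({"column": col, "inferred_type": inferred})
    return pd.DataFrame(rows)
```

Under this reading, 'none' short-circuits before any per-column work, which matches the interpretation discussed in the review thread below.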
Contributor

I think the input can't be 'none' here

Contributor Author

I guess "none" means that the users do not want to perform any data type detection?

Contributor

I guess "none" means that the users do not want to perform any data type detection?

Based on my understanding of the implementation here, the input parameter of _infer_data_type_df() will not be 'none'. If it's not 'semantic', you perform the atomic detection, which means it can only be 'semantic' or 'atomic'?

Contributor

@qidanrui qidanrui Apr 20, 2021

I guess "none" means that the users do not want to perform any data type detection?

Based on my understanding of the implementation here, the input parameter of _infer_data_type_df() will not be 'none'. If it's not 'semantic', you perform the atomic detection, which means it can only be 'semantic' or 'atomic'?

I think Andy has filtered the none parameter in the clean_df() function:
[screenshot]
However, I think there is a typo: the None should be none~

Contributor Author

Thanks Danrui! Greatly appreciated. I did not notice this typo before.

print(f"\tMemory reduced from {old_stat} to {nclnd}. New size: ({pclnd}%)")
else:
print("Downcast Memory not Performed.")
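The snippet above reports the memory saved by downcasting. A self-contained sketch of that idea (function name and report format are assumptions for illustration, not the actual dataprep code) could look like this:

```python
import pandas as pd

def downcast_report(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns and print the memory saving (illustrative)."""
    old_mem = df.memory_usage(deep=True).sum()
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        # Choose the downcast target from the column's current kind.
        kind = "integer" if pd.api.types.is_integer_dtype(out[col]) else "float"
        out[col] = pd.to_numeric(out[col], downcast=kind)
    new_mem = out.memory_usage(deep=True).sum()
    pct = round(new_mem / old_mem * 100, 2)
    print(f"\tMemory reduced from {old_mem} to {new_mem} bytes. New size: ({pct}%)")
    return out
```

For example, an int64 column of small values downcasts to int8, shrinking that column's footprint to one eighth.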

Contributor

Maybe it's better to add a validity check here, to avoid illegal calls.

Contributor Author

Thanks Yi. Sorry, I did not get it: what type of illegal call could it possibly be? Do you mean downcasted memory > old memory?

Contributor

What I mean is that the input option parameter can be an illegal string.

Contributor Author

I am not very sure whether it could happen, as the _create_report() function is only callable inside clean_df(), guarded by the if report: condition (line 131). But users might directly call this function accidentally and receive unexpected errors, so I am OK with adding a check condition inside this function.

By the way, this reminds me that users could confuse this _create_report() with the other report functions in utils.py. Is it necessary to rename this function to _create_report_clean_df()?

Contributor

Yes, what I'm referring to is potential accidental use by users. But I'm OK with either way, so just let me know your decision. For the function name, do you think create_report_df() or create_df_report() will work?

Contributor Author

We can do both. I feel like create_report_df() works (and then add illegal checks). We can also listen to Danrui's opinion!

Contributor Author

Fixed in the newest commit!

@yxie66 yxie66 requested a review from qidanrui April 14, 2021 23:52
@dovahcrow dovahcrow added this to the 0.3.0 milestone Apr 15, 2021
@qidanrui
Contributor

I think the pass in _standardize_missing_values_df() is redundant, because your if statements are enough.
[screenshot]
If you think there should still be a placeholder, how about changing it like this:
[screenshot]
I think it is more concrete~
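Danrui's point is that an `if` body that already does work needs no trailing `pass`. A minimal sketch of a null-standardizing helper without the placeholder (function name and null-value list are assumptions, not the actual dataprep code):

```python
import numpy as np
import pandas as pd

def standardize_missing_values(series: pd.Series) -> pd.Series:
    """Map common null-like strings to np.nan; no trailing `pass` needed."""
    null_like = {"", "na", "n/a", "null", "none", "-"}

    def _standardize(value):
        if isinstance(value, str) and value.strip().lower() in null_like:
            return np.nan
        # Non-null-like values pass through unchanged.
        return value

    return series.map(_standardize)
```

Each branch either returns a value or falls through to the final return, so no branch needs a `pass` statement.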

@qidanrui
Contributor

Sorry, I'm a little confused about why there are two very similar functions returning opposite values. Is there any possibility of combining them?
[screenshot]

@qidanrui
Contributor

I changed some return types according to Andy's logic:

  • Changing the return type of the clean_df() function from pd.DataFrame to Union[Tuple[pd.DataFrame, pd.DataFrame], pd.DataFrame]. That is because if the parameter data_type_detection = 'none', the function returns only one dataframe (the original dataframe). However, if the parameter data_type_detection != 'none', the function returns two dataframes. Thus I think it is better to use the Union return type.

[screenshot]

  • Changing the return type of `_infer_semantic_data_type()` from `Any` to `str`

[screenshot]

  • Changing the return type of `_infer_atomic_data_type()` from `Any` to `str`

[screenshot]

No more comments for this function @yxie66. I think after Andy's small refining we can merge it.
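For illustration, the Union return type Danrui describes could look like the following stripped-down sketch. The cleaning and detection steps are elided placeholders; only the signature and the branching on data_type_detection mirror the discussion, and the report shape is an assumption:

```python
from typing import Tuple, Union

import pandas as pd

def clean_df(
    df: pd.DataFrame, data_type_detection: str = "semantic"
) -> Union[Tuple[pd.DataFrame, pd.DataFrame], pd.DataFrame]:
    """Return (type_report, cleaned_df), or just cleaned_df when detection is off."""
    cleaned = df.copy()  # actual cleaning steps elided
    if data_type_detection == "none":
        # Only one dataframe comes back in this branch.
        return cleaned
    report = pd.DataFrame({"column": df.columns})  # actual detection elided
    return report, cleaned
```

Because the two branches return different shapes, Union[Tuple[...], pd.DataFrame] is the honest annotation; callers must check which form they received.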

@AndyWangSFU
Contributor Author

I think the pass in _standardize_missing_values_df() is redundant, because your if statements are enough.
[screenshot]
If you think there should still be a placeholder, how about changing it like this:
[screenshot]
I think it is more concrete~

This makes sense! I fixed it in the latest commit, using the second suggestion.

@AndyWangSFU
Contributor Author

I changed some return types according to Andy's logic:

  • Changing the return type of the clean_df() function from pd.DataFrame to Union[Tuple[pd.DataFrame, pd.DataFrame], pd.DataFrame]. That is because if the parameter data_type_detection = 'none', the function returns only one dataframe (the original dataframe). However, if the parameter data_type_detection != 'none', the function returns two dataframes. Thus I think it is better to use the Union return type.
[screenshot]
  • Changing the return type of _infer_semantic_data_type() from Any to str
[screenshot]
  • Changing the return type of _infer_atomic_data_type() from Any to str
[screenshot]

No more comments for this function @yxie66. I think after Andy's small refining we can merge it.

Point 1 totally makes sense and I modified it. Regarding the return types, judging from the pylint test, I guess the infer_type() function from Pandas does not always return str, and the default is Any. It might be better to keep this.

@AndyWangSFU
Contributor Author

Sorry, I'm a little confused about why there are two very similar functions returning opposite values. Is there any possibility of combining them?
[screenshot]

This is a very good question. I was hoping to combine them, and there is definitely a way to achieve it by adding one more parameter. For example, I tried check_values(data: str, valid: bool) to switch the check condition between valid and invalid.

However, code efficiency drops a lot when adding this check condition, as it is mapped to every string in every column. Also, I thought it would be clearer to keep two useful functions, because users can directly call either one. Checking "valid" and checking "null" are equally common scenarios, so adding optional parameters would be more confusing.

Based on these two reasons, I finally decided to keep both of them. Hope it makes sense!
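Andy's trade-off can be illustrated with a toy version of the two checkers (names and null-value list are hypothetical, not the actual dataprep functions): keeping two flag-free functions avoids testing a `valid` parameter on every call when the check is mapped over every cell of a column.

```python
NULL_VALUES = {"", "na", "n/a", "null", "none"}

def check_null(value: str) -> bool:
    """True when the string is a null-like value."""
    return value.strip().lower() in NULL_VALUES

def check_valid(value: str) -> bool:
    """True when the string is NOT null-like; the mirror image of check_null."""
    return value.strip().lower() not in NULL_VALUES

# A combined check_values(value, valid=...) would branch on the flag inside
# every per-cell call, which is the efficiency concern described above.
```

Two tiny functions also read better at the call site: `series.map(check_null)` states intent directly, with no boolean argument to decode.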

@qidanrui
Contributor

Sorry, I'm a little confused about why there are two very similar functions returning opposite values. Is there any possibility of combining them?
[screenshot]

This is a very good question. I was hoping to combine them, and there is definitely a way to achieve it by adding one more parameter. For example, I tried check_values(data: str, valid: bool) to switch the check condition between valid and invalid.

However, code efficiency drops a lot when adding this check condition, as it is mapped to every string in every column. Also, I thought it would be clearer to keep two useful functions, because users can directly call either one. Checking "valid" and checking "null" are equally common scenarios, so adding optional parameters would be more confusing.

Based on these two reasons, I finally decided to keep both of them. Hope it makes sense!

Thanks for your reply, Andy! Now I think they're reasonable~

@qidanrui
Contributor

I changed some return types according to Andy's logic:

  • Changing the return type of the clean_df() function from pd.DataFrame to Union[Tuple[pd.DataFrame, pd.DataFrame], pd.DataFrame]. That is because if the parameter data_type_detection = 'none', the function returns only one dataframe (the original dataframe). However, if the parameter data_type_detection != 'none', the function returns two dataframes. Thus I think it is better to use the Union return type.
[screenshot]
  • Changing the return type of _infer_semantic_data_type() from Any to str
[screenshot]
  • Changing the return type of _infer_atomic_data_type() from Any to str
[screenshot] No more comments for this function @yxie66. I think after Andy's small refining we can merge it.

Point 1 totally makes sense and I modified it. Regarding the return types, judging from the pylint test, I guess the infer_type() function from Pandas does not always return str, and the default is Any. It might be better to keep this.

Got it! Reasonable to me~ Thanks Andy!

qidanrui
qidanrui previously approved these changes Apr 20, 2021
@yxie66 yxie66 merged commit f89a172 into develop Apr 21, 2021
@yxie66 yxie66 deleted the clean/clean_df branch April 21, 2021 01:13
devinllu pushed a commit to devinllu/dataprep that referenced this pull request Nov 9, 2021
feat(clean): add clean_df function

Successfully merging this pull request may close these issues.

Feature Proposal: clean_df functionality in clean module