
Feature Proposal: clean_df functionality in clean module #503

Closed
6 of 15 tasks
AndyWangSFU opened this issue Feb 10, 2021 · 5 comments · Fixed by #559 or #568
Labels
type: enhancement New feature or request

Comments

@AndyWangSFU
Contributor

AndyWangSFU commented Feb 10, 2021

Summary

Design and implement a clean_df() function to conduct a set of operations that would be useful for cleaning and standardizing a full DataFrame.

Design-level Explanation Actions

  • Determine a list of appropriate operations and the order in which they should take place
  • Investigate previous solutions for this set of operations
  • Consider which parameters to support for each operation
  • Prioritize the tasks that need to be done throughout the semester

Design-level Explanation

The (tentative) proposed function design for clean_df() is:

from typing import Union

import dask.dataframe as dd
import pandas as pd


def clean_df(
    df: Union[pd.DataFrame, dd.DataFrame],
    clean_headers: bool = True,
    data_type_detection: str = "semantic",
    standardize_missing_values: str = "fill",
    downcast_memory: bool = True,
    remove_duplicate_entries: bool = False,
    report: bool = True,
    progress: bool = True,
) -> pd.DataFrame:
    """
    Clean the whole DataFrame with a sequence of general operations.

    Parameters
    ----------
    df
        Pandas or Dask DataFrame.
    clean_headers
        If True, call the clean_headers() function to clean the column names.

        (default: True)
    data_type_detection {'semantic', 'atomic', 'none'}
        * If 'semantic', perform a column-wise semantic type detection to help
          users call the other built-in functions, e.g. clean_phone().
        * If 'atomic', simply return the column data types from
          {'string', 'integer', 'decimal', 'boolean'}.
        * If 'none', no results will be returned.

        (default: 'semantic')
    standardize_missing_values {'fill', 'remove', 'ignore'}
        * If 'fill', all detected missing values will be set to np.nan or pd.NaT.
        * If 'remove', any rows with missing values will be deleted (returning a
          complete DataFrame).
        * If 'ignore', no action will be taken.

        (default: 'fill')
    downcast_memory
        If True, downcast the memory size of the DataFrame by using subtypes in
        the numerical columns; for categorical types, downcast from `object` to
        `category`. Return how much memory is reduced.

        (default: True)
    remove_duplicate_entries
        If True, remove duplicate data entries (rows) and report how many
        entries are removed.

        (default: False)
    report
        If True, output the summary report. Otherwise, no report is outputted.

        (default: True)
    progress
        If True, enable the progress bar.

        (default: True)
    """

Implementation-level Explanation

Rationale and Alternatives

I feel like the order of operations inside clean_df() matters a lot. My rationale is as follows:

  1. The first step is simply to clean the headers, since the column names are the first information we read and later steps refer to them.
  2. Next, we should perform a data type detection, which can make use of the headers we just cleaned.
    semantic works well here because it gives users more information and helps them call the other functions we build later on.
  3. After determining the data types, we can standardize missing values for users. For now, this supports "NaN" for numeric values and "None" for characters.
  4. Once the missing values are handled, the DataFrame is cleaner and we can downcast the memory.
  5. Optionally, we can remove the duplicate entries as the last step.
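The ordering above can be sketched with plain pandas. This is only a minimal illustration of the proposed pipeline, not the DataPrep implementation; `clean_df_sketch` and the placeholder list of missing-value strings are hypothetical:

```python
import pandas as pd

def clean_df_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Clean headers first, since later steps refer to the column names.
    df = df.rename(columns=lambda c: str(c).strip().lower().replace(" ", "_"))
    # 2. Atomic-style type detection: let pandas re-infer object columns.
    df = df.infer_objects()
    # 3. Standardize missing values: map common placeholder strings to NA.
    df = df.replace({"": pd.NA, "na": pd.NA, "n/a": pd.NA, "null": pd.NA})
    # 4. Downcast memory: use the smallest numeric subtypes that fit.
    for col in df.select_dtypes(include="integer").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include="float").columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    # 5. Optionally drop duplicate rows as the last step.
    return df.drop_duplicates()
```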

Prior Art

For downcasting memory:
Pandas - downcast: https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html
Downcast package: https://pypi.org/project/downcast/
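As a quick illustration of why the `object` to `category` downcast pays off for low-cardinality string columns (a standalone pandas example, not DataPrep code):

```python
import pandas as pd

# A column of repeated strings: object dtype stores a full Python string per
# row, while category dtype stores small integer codes plus one lookup table.
s = pd.Series(["red", "green", "blue"] * 10_000)
before = s.memory_usage(deep=True)
after = s.astype("category").memory_usage(deep=True)
print(f"object: {before} bytes, category: {after} bytes")
```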

For atomic data type detection:
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.api.types.infer_dtype.html In pandas, the function infer_dtype() returns a string summarizing the type of the values in the passed object. We are using it because it is widely used in pandas' internals and is designed to be efficient.
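For example, infer_dtype() returns strings close to the proposed {'string', 'integer', 'decimal', 'boolean'} set (note that pandas reports Python floats as 'floating', so mapping to the proposal's 'decimal' label would be our own choice):

```python
from pandas.api.types import infer_dtype

# infer_dtype inspects the values, not the container's declared dtype.
print(infer_dtype(["a", "b"]))     # → 'string'
print(infer_dtype([1, 2, 3]))      # → 'integer'
print(infer_dtype([1.0, 2.5]))     # → 'floating'
print(infer_dtype([True, False]))  # → 'boolean'
```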

For semantic data type detection:
There is no established prior work for this. The most relevant work is recent, from two papers, and based on deep learning:

  1. Sherlock: https://arxiv.org/pdf/1905.10688.pdf
  2. Sato: https://megagon.ai/blog/learning-to-detect-semantic-types-from-large-table-corpora/

Future Possibilities

For now, we standardize missing values by filling them with "NaN" or "None". There are many other options, for example:

  1. For missing numeric values, we can impute them by mean/mode. We can also fill gaps by propagating non-NA values forward or backward.
  2. We can ask the users to specify a string (for example, "missing") that they want to fill NAs in.
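Each of these options maps directly onto existing pandas operations (a standalone sketch; the "missing" placeholder string is just an example a user might pass):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

# 1a. Impute missing numeric values with the column mean.
print(s.fillna(s.mean()).tolist())      # → [1.0, 2.0, 3.0, 2.0]
# 1b. Or propagate the last non-NA value forward.
print(s.ffill().tolist())               # → [1.0, 1.0, 3.0, 3.0]

# 2. Fill NAs with a user-specified placeholder string.
text = pd.Series(["a", None])
print(text.fillna("missing").tolist())  # → ['a', 'missing']
```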

Implementation-level Actions

  • Implement the basic atomic data type detection
  • Try the existing libraries and implement the semantic data type detection
  • Implement standardizing missing values
  • Implement downcast columns to reduce the memory
  • Implement removal of duplicate entries

Additional Tasks

  • This task is put into the correct pipeline (Development Backlog or In Progress).
  • The label of this task is set correctly.
  • The issue is assigned to the correct person.
  • The issue is linked to the related Epic.
  • The documentation is changed accordingly.
  • Tests are added accordingly.
@AndyWangSFU AndyWangSFU added the type: enhancement New feature or request label Feb 10, 2021
@brandonlockhart

Nice feature proposal @AndyWangSFU! I just have a couple of comments/questions:

  1. I think we can downcast categorical data types from object to category if the number of distinct values in the column is small.
  2. I don't think we need the inplace parameter; we can just modify the given dataframe.
  3. How will the errors parameter be used? I.e., what operation can cause an error?
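Brandon's first point, converting to category only when the column has few distinct values, could look like this (a hypothetical helper, with an assumed cardinality-ratio threshold):

```python
import pandas as pd

def maybe_categorize(s: pd.Series, max_ratio: float = 0.5) -> pd.Series:
    # Convert object columns to category only when the number of distinct
    # values is small relative to the column length; otherwise the category
    # lookup table can cost more memory than it saves.
    if s.dtype == object and s.nunique(dropna=True) / max(len(s), 1) <= max_ratio:
        return s.astype("category")
    return s

low_card = pd.Series(["yes", "no"] * 1000)            # 2 distinct values
high_card = pd.Series([f"id_{i}" for i in range(2000)])  # all distinct
```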

@AndyWangSFU
Contributor Author

Thank you Brandon! I will modify my design accordingly. Keep in touch and stay healthy. 😊

@harshasridhar

Hi,
Can I work on this?
Might need some heads up before starting
Thanks!

@yxie66
Contributor

yxie66 commented Mar 10, 2021

> Hi,
> Can I work on this?
> Might need some heads up before starting
> Thanks!

Hi,

Thank you very much for your interest in DataPrep! I encourage you to read the Wiki (https://github.com/sfu-db/dataprep/wiki) and submit an issue for the feature you have in mind. Then we can discuss the implementation.

Best Regards,
Yi

@AndyWangSFU
Contributor Author

> Hi,
> Can I work on this?
> Might need some heads up before starting
> Thanks!

Hi Harsha,

Thank you very much for your initiative! We are still developing this clean_df() function. For now, please feel free to leave any suggestions on the design from a user's perspective (e.g., any new functionality we might add). Before actually coding, you can go through DataPrep's Wiki, as Yi suggested.

If you have any questions, please don't hesitate to let us know! Take care and keep in touch :)

Best,
Andy

@brandonlockhart brandonlockhart linked a pull request Apr 19, 2021 that will close this issue
10 tasks