
Feature Proposal: clean_df functionality in clean module #503

Closed
6 of 15 tasks
AndyWangSFU opened this issue Feb 10, 2021 · 5 comments · Fixed by #559 or #568
Labels
type: enhancement New feature or request

Comments

@AndyWangSFU
Contributor

AndyWangSFU commented Feb 10, 2021

Summary

Design and implement a clean_df() function to conduct a set of operations that would be useful for cleaning and standardizing a full DataFrame.

Design-level Explanation Actions

  • Determine a list of appropriate operations and the order in which they should take place
  • Investigate previous solutions for this set of operations
  • Consider which parameters to support for each operation
  • Prioritize the tasks that need to be done throughout the semester

Design-level Explanation

The (tentative) proposed function design for clean_df() is:

from typing import Union

import dask.dataframe as dd
import pandas as pd


def clean_df(
    df: Union[pd.DataFrame, dd.DataFrame],
    clean_headers: bool = True,
    data_type_detection: str = "semantic",
    standardize_missing_values: str = "fill",
    downcast_memory: bool = True,
    remove_duplicate_entries: bool = False,
    report: bool = True,
    progress: bool = True,
) -> pd.DataFrame:
    """
    Clean the whole DataFrame with a sequence of general operations.

    Parameters
    ----------
    df
        Pandas or Dask DataFrame.
    clean_headers
        If True, call the clean_headers() function to clean the column names.

        (default: True)
    data_type_detection {'semantic', 'atomic', 'none'}
        * If 'semantic', perform a column-wise semantic type detection to help
          users call the other built-in functions, e.g. clean_phone().
        * If 'atomic', simply return the column data types from
          {'string', 'integer', 'decimal', 'boolean'}.
        * If 'none', no results will be returned.

        (default: 'semantic')
    standardize_missing_values {'fill', 'remove', 'ignore'}
        * If 'fill', all detected missing values will be set to np.nan or pd.NaT.
        * If 'remove', any rows with missing values will be deleted (returning a
          complete DataFrame).
        * If 'ignore', no action will be taken.

        (default: 'fill')
    downcast_memory
        If True, downcast the memory size of the DataFrame by using subtypes in
        the numerical columns; for categorical types, downcast from `object` to
        `category`. Return how much memory is reduced.

        (default: True)
    remove_duplicate_entries
        If True, remove duplicate data entries (rows) and report how many
        entries are removed.

        (default: False)
    report
        If True, output the summary report. Otherwise, no report is outputted.

        (default: True)
    progress
        If True, enable the progress bar.

        (default: True)
    """

Implementation-level Explanation

Rationale and Alternatives

I feel like the order of operations inside clean_df() matters a lot. My rationale is as follows:

  1. The first step is simply to clean the headers, since the column names are the first information we read and later steps refer to them.
  2. Next, we should perform a data type detection, which can make use of the headers we just cleaned.
    semantic works well here because it gives users more information and helps them call the other functions we build later on.
  3. After determining the data types, we can standardize missing values for users. For now, this supports "NaN" for numeric values and "None" for characters.
  4. Once the missing values are handled, the DataFrame is cleaner and we can downcast the memory.
  5. Optionally, we can remove the duplicate entries as the last step.
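The ordering above can be sketched with plain pandas. This is only a minimal illustration of the proposed pipeline, not the DataPrep implementation; `clean_df_sketch` and the placeholder list of missing-value strings are hypothetical:

```python
import pandas as pd

def clean_df_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Clean headers first, since later steps refer to the column names.
    df = df.rename(columns=lambda c: str(c).strip().lower().replace(" ", "_"))
    # 2. Atomic-style type detection: let pandas re-infer object columns.
    df = df.infer_objects()
    # 3. Standardize missing values: map common placeholder strings to NA.
    df = df.replace({"": pd.NA, "na": pd.NA, "n/a": pd.NA, "null": pd.NA})
    # 4. Downcast memory: use the smallest numeric subtypes that fit.
    for col in df.select_dtypes(include="integer").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include="float").columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    # 5. Optionally drop duplicate rows as the last step.
    return df.drop_duplicates()
```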

Prior Art

For downcasting memory:
Pandas - downcast: https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html
Downcast package: https://pypi.org/project/downcast/
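As a quick illustration of why the `object` to `category` downcast pays off for low-cardinality string columns (a standalone pandas example, not DataPrep code):

```python
import pandas as pd

# A column of repeated strings: object dtype stores a full Python string per
# row, while category dtype stores small integer codes plus one lookup table.
s = pd.Series(["red", "green", "blue"] * 10_000)
before = s.memory_usage(deep=True)
after = s.astype("category").memory_usage(deep=True)
print(f"object: {before} bytes, category: {after} bytes")
```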

For atomic data type detection:
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.api.types.infer_dtype.html In pandas, the function infer_dtype() returns a string summarizing the type of the values in the passed object. We are using it because it is widely used in pandas' internals and is designed to be efficient.
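For example, infer_dtype() returns strings close to the proposed {'string', 'integer', 'decimal', 'boolean'} set (note that pandas reports Python floats as 'floating', so mapping to the proposal's 'decimal' label would be our own choice):

```python
from pandas.api.types import infer_dtype

# infer_dtype inspects the values, not the container's declared dtype.
print(infer_dtype(["a", "b"]))     # → 'string'
print(infer_dtype([1, 2, 3]))      # → 'integer'
print(infer_dtype([1.0, 2.5]))     # → 'floating'
print(infer_dtype([True, False]))  # → 'boolean'
```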

For semantic data type detection:
There is no established prior work for this. The most relevant work is recent, from two papers, and based on deep learning:

  1. Sherlock: https://arxiv.org/pdf/1905.10688.pdf
  2. Sato: https://megagon.ai/blog/learning-to-detect-semantic-types-from-large-table-corpora/

Future Possibilities

For now, we standardize missing values by filling them with "NaN" or "None". There are many other options, for example:

  1. For missing numeric values, we can impute them by mean/mode. We can also fill gaps by propagating non-NA values forward or backward.
  2. We can ask the users to specify a string (for example, "missing") that they want to fill NAs in.
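Each of these options maps directly onto existing pandas operations (a standalone sketch; the "missing" placeholder string is just an example a user might pass):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

# 1a. Impute missing numeric values with the column mean.
print(s.fillna(s.mean()).tolist())      # → [1.0, 2.0, 3.0, 2.0]
# 1b. Or propagate the last non-NA value forward.
print(s.ffill().tolist())               # → [1.0, 1.0, 3.0, 3.0]

# 2. Fill NAs with a user-specified placeholder string.
text = pd.Series(["a", None])
print(text.fillna("missing").tolist())  # → ['a', 'missing']
```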

Implementation-level Actions

  • Implement the basic atomic data type detection
  • Try the existing libraries and implement the semantic data type detection
  • Implement standardizing missing values
  • Implement downcast columns to reduce the memory
  • Implement removal of duplicate entries

Additional Tasks

  • This task is put into the correct pipeline (Development Backlog or In Progress).
  • The label of this task is set correctly.
  • The issue is assigned to the correct person.
  • The issue is linked to the related Epic.
  • The documentation is changed accordingly.
  • Tests are added accordingly.
@AndyWangSFU AndyWangSFU added the type: enhancement New feature or request label Feb 10, 2021
@brandonlockhart

Nice feature proposal @AndyWangSFU! I just have a couple of comments/questions:

  1. I think we can downcast categorical data types from object to category if the number of distinct values in the column is small.
  2. I don't think we need the inplace parameter; we can just modify the given dataframe.
  3. How will the errors parameter be used? I.e., what operation can cause an error?
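Brandon's first point, converting to category only when the column has few distinct values, could look like this (a hypothetical helper, with an assumed cardinality-ratio threshold):

```python
import pandas as pd

def maybe_categorize(s: pd.Series, max_ratio: float = 0.5) -> pd.Series:
    # Convert object columns to category only when the number of distinct
    # values is small relative to the column length; otherwise the category
    # lookup table can cost more memory than it saves.
    if s.dtype == object and s.nunique(dropna=True) / max(len(s), 1) <= max_ratio:
        return s.astype("category")
    return s

low_card = pd.Series(["yes", "no"] * 1000)            # 2 distinct values
high_card = pd.Series([f"id_{i}" for i in range(2000)])  # all distinct
```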

@AndyWangSFU
Contributor Author

Thank you Brandon! I will modify my design accordingly. Keep in touch and stay healthy. 😊

@harshasridhar

Hi,
Can I work on this?
Might need some heads up before starting
Thanks!

@yxie66
Contributor

yxie66 commented Mar 10, 2021

> Hi,
> Can I work on this?
> Might need some heads up before starting
> Thanks!

Hi,

Thank you very much for your interest in DataPrep! I encourage you to read the Wiki (https://github.com/sfu-db/dataprep/wiki) and submit an issue for the feature you have in mind. Then we can discuss the implementation.

Best Regards,
Yi

@AndyWangSFU
Contributor Author

> Hi,
> Can I work on this?
> Might need some heads up before starting
> Thanks!

Hi Harsha,

Thank you very much for your initiative! We are still developing this clean_df() function. For now, please feel free to leave any suggestions on the design from a user's perspective (e.g., any new functionality we might add). Before actually coding, you can go through DataPrep's Wiki, as Yi suggested.

If you have any questions, please don't hesitate to let us know! Take care and keep in touch :)

Best,
Andy

@brandonlockhart brandonlockhart linked a pull request Apr 19, 2021 that will close this issue
10 tasks