Feature Proposal: clean_df functionality in clean module #503
Comments
Nice feature proposal @AndyWangSFU! I just have a couple of comments/questions:
Thank you Brandon! I will modify my design accordingly. Keep in touch and stay healthy. 😊
Hi,
Hi, Thank you very much for your interest in Dataprep! I encourage you to read the Wiki (https://github.com/sfu-db/dataprep/wiki) and submit an issue describing the feature you have in mind. Then we can discuss the implementation. Best Regards,
Hi Harsha, Thank you very much for your initiative! We are still developing this. If you have any questions, please don't hesitate to let us know! Take care and keep in touch :) Best,
Summary
Design and implement a clean_df() function to conduct a set of operations that would be useful for cleaning and standardizing a full DataFrame.
Design-level Explanation Actions
Design-level Explanation
The (tentative) proposed function design for clean_df() is:
Implementation-level Explanation
Rationale and Alternatives
I feel like the order of operations in the clean_df() function matters a lot. My rationale is as follows:
semantic will be great here because it gives more information to users and helps users call the other functions we build later on.
Prior Art
For downcasting memory:
Pandas - downcast: https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html
Downcast package: https://pypi.org/project/downcast/
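As a rough illustration of the pandas downcasting approach referenced above, the following sketch shrinks numeric columns with `pd.to_numeric(..., downcast=...)`. The helper name `downcast_numeric` and the sample DataFrame are hypothetical and are not part of the proposed `clean_df()` design; this is only one way the memory step might look.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns downcast where possible.

    Hypothetical helper for illustration only, not the proposed clean_df() API.
    """
    out = df.copy()
    # Integer columns: try the smallest signed integer dtype that holds the values.
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    # Float columns: pandas downcasts to float32 at the smallest.
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Sample data (made up for this sketch).
df = pd.DataFrame(
    {
        "a": np.arange(1000, dtype="int64"),
        "b": np.linspace(0.0, 1.0, 1000),
        "c": ["x"] * 1000,
    }
)
small = downcast_numeric(df)

before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"memory: {before} bytes -> {after} bytes")
```

Non-numeric columns such as `c` are left untouched here; whether `clean_df()` should also convert low-cardinality string columns to categoricals is a separate design question.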
For atomic data type detection:
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.api.types.infer_dtype.html
In pandas, there's a function infer_dtype() which returns a string summarising the type of the values in the passed object. We are using it because it's widely used in pandas' internals and is designed to be efficient.
For semantic data type detection:
There is no established prior work for this. The most relevant work comes from two recent papers, both based on deep learning:
Future Possibilities
For now, we standardize missing values by filling them with "NaN" or "None". There are many other options, for example:
Implementation-level Actions
Additional Tasks