v0.1.0
Puripy v0.1.0: First Public Release
Welcome to the initial release of Puripy, a modular Python package for cleaning and preprocessing messy data in text, categorical, numerical, and datetime fields.
Features
Text Cleaner
- Contraction expansion, emoji/URL/HTML removal
- Stopword removal, stemming, lemmatization
- Spelling correction, profanity filtering, n-gram generation
- Auto column detection and parallel processing
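To make two of the steps above concrete, here is a minimal standard-library sketch of URL/HTML removal and n-gram generation. It is not Puripy's API: the function names `strip_markup` and `ngrams` and the regular expressions are assumptions made for illustration only.

```python
import html
import re

# Illustrative patterns (assumptions): simple URL and HTML-tag matchers.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(text: str) -> str:
    """Remove HTML tags and URLs, unescape entities, collapse whitespace."""
    text = TAG_RE.sub(" ", text)   # drop tags such as <p> and </p>
    text = html.unescape(text)     # turn &amp; into &
    text = URL_RE.sub(" ", text)   # drop http(s):// and www. links
    return " ".join(text.split())  # collapse runs of whitespace


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every contiguous window of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


cleaned = strip_markup("<p>Read the docs at https://example.com &amp; enjoy!</p>")
bigrams = ngrams(cleaned.lower().split(), 2)
print(cleaned)
print(bigrams[:2])
```

Whitespace tokenization is used here only to keep the sketch short; a real text cleaner would normally tokenize more carefully before building n-grams.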
Categorical Cleaner
- Fuzzy typo correction with thefuzz
- Rare category grouping
- OneHot, Ordinal, and Label encoding via sklearn
- Text normalization and full reporting
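The idea behind fuzzy typo correction and rare category grouping can be sketched as follows. This example deliberately swaps in the standard library's `difflib` for thefuzz so that it runs without extra dependencies, so the similarity scores differ from what thefuzz would report. The helper names, the `cutoff` of 0.8, and the `"Other"` label are hypothetical and not taken from Puripy.

```python
from collections import Counter
from difflib import get_close_matches


def correct_typos(values, canonical, cutoff=0.8):
    """Map each value to its closest canonical label, or keep it unchanged."""
    fixed = []
    for value in values:
        match = get_close_matches(value.strip().lower(), canonical, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else value)
    return fixed


def group_rare(values, min_count=2, other="Other"):
    """Replace categories seen fewer than min_count times with a catch-all label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


raw = ["apple", "aple", "banana", "bananna", "apple", "kiwi"]
corrected = correct_typos(raw, canonical=["apple", "banana", "kiwi"])
grouped = group_rare(corrected, min_count=2)
print(corrected)
print(grouped)
```

Correcting typos before grouping matters: otherwise each misspelling counts as its own rare category and is lumped into the catch-all label.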
Numerical Cleaner
- Missing value imputation (mean, median, mode)
- Outlier handling (IQR method)
- Type conversion and precision control
- Duplicate detection and domain rule enforcement
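As a rough illustration of median imputation and the IQR method, the sketch below uses only the `statistics` module. It assumes outliers are clipped to the fences `Q1 - 1.5*IQR` and `Q3 + 1.5*IQR`; whether Puripy clips, drops, or flags outliers is not stated in these notes, and the function names are hypothetical.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip_iqr(values, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


raw = [10, 12, None, 11, 13, 100]
filled = impute_median(raw)   # None becomes the median of the observed values
clipped = clip_iqr(filled)    # 100 is pulled down to the upper fence
print(filled)
print(clipped)
```

The median is used for imputation here because, unlike the mean, it is not dragged upward by the extreme value 100.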
Datetime Cleaner
- Flexible datetime parsing and fuzzy matching
- Timezone normalization
- Missing date imputation using STL decomposition
- Feature extraction (year, month, day, quarter, fiscal, etc.)
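Timezone normalization and feature extraction can be sketched with the standard `datetime` module. The assumptions here are illustrative, not Puripy's behavior: naive timestamps are treated as already being in UTC, the fiscal year starts in April, and it is labeled by the calendar year in which it ends. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: datetime) -> datetime:
    """Convert an aware datetime to UTC; naive values are assumed to be UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def date_features(ts: datetime, fiscal_start_month: int = 4) -> dict:
    """Extract calendar and fiscal features from a datetime."""
    quarter = (ts.month - 1) // 3 + 1
    # Assumed convention: a fiscal year is named after the year in which it ends.
    if fiscal_start_month > 1 and ts.month >= fiscal_start_month:
        fiscal_year = ts.year + 1
    else:
        fiscal_year = ts.year
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "quarter": quarter,
        "fiscal_year": fiscal_year,
    }


# 22:30 on 31 March at UTC-5 is already 1 April in UTC.
local = datetime(2024, 3, 31, 22, 30, tzinfo=timezone(timedelta(hours=-5)))
utc = to_utc(local)
features = date_features(utc)
print(utc.isoformat())
print(features)
```

Normalizing to UTC before extracting features avoids assigning the same instant to different days, quarters, or fiscal years depending on the source timezone.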
What's New in v0.1.0?
- Initial release with full support for text, categorical, numerical, and datetime cleaning.
- Built-in support for parallel processing and logging.
- Highly customizable pipelines using configuration dictionaries.
- Auto-generated cleaning reports for auditability.
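One way a configuration dictionary can drive a cleaning pipeline and produce a simple report is sketched below. Everything in it is hypothetical: the step names, the config keys, and the report format are invented for illustration and are not Puripy's actual configuration schema.

```python
# Hypothetical registry of named cleaning steps (not Puripy's real step names).
STEPS = {
    "strip": str.strip,
    "lower": str.lower,
    "collapse_spaces": lambda s: " ".join(s.split()),
}


def build_pipeline(config):
    """Return a function that applies the enabled steps in config order."""
    enabled = [name for name, on in config.items() if on]

    def run(value):
        report = []
        for name in enabled:
            cleaned = STEPS[name](value)
            # Record whether each step actually changed the value.
            report.append({"step": name, "changed": cleaned != value})
            value = cleaned
        return value, report

    return run


config = {"strip": True, "lower": False, "collapse_spaces": True}
clean = build_pipeline(config)
result, report = clean("  Hello   World ")
print(result)
print(report)
```

Keeping the steps in a registry keyed by name means a pipeline can be changed by editing the dictionary alone, and the per-step report gives a basic audit trail of what was applied.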
Tech Stack
pandas, numpy, nltk, textblob, sklearn, emoji, contractions, better_profanity, tqdm, joblib, pytz, statsmodels, and more.
Known Notes
- This is a pre-1.0 release, so APIs and behavior may change in future versions.
- Ideal for testing, experimentation, and feedback.
Contribute
Feedback, issues, and pull requests are welcome!
Star the repo and help shape Puripy into a go-to tool for data cleaning.