v0.1.10
CleansiPy
CleansiPy ๐งผ๐
Clean your data like a pro โ Text, Categorical, Numerical, and DateTime โ all in one package.
๐ Overview
CleansiPy is an all-in-one Python package designed to clean and preprocess messy datasets with ease and flexibility. It supports four major data types:
๐ Text โ tokenization, stemming, lemmatization, stopword removal, n-gram generation, profanity filtering, emoji & HTML cleaning, and more.
๐งฎ Numerical โ missing value handling, outlier detection, precision adjustment, type correction, and logging.
๐งพ Categorical โ typo correction, standardization, rare value grouping, encoding (OneHot, Label, Ordinal), and fuzzy matching.
๐ DateTime โ flexible parsing, timezone unification, feature extraction (day, month, weekday, etc.), imputation, and validation.
Itโs built for data scientists, ML engineers, and analysts working on real-world data pipelines.
๐ง Installation
bash
Copy
Edit
pip install puripy
๐ฆ Features
โ
Configurable, modular pipelines
โ
Works with pandas DataFrames
โ
Multi-core processing for speed
โ
NLTK/TextBlob integration for NLP
โ
sklearn support for encoding
โ
Detailed logs and cleaning reports
โ
Auto column detection
โ
Type-safe and test-friendly design
โก Quick Start
-
Set up your configuration:
After installing, run the following command to copy the default config.py to your project directory:
cleansipy-configThen edit
config.pyto set your input/output file paths and other options before running the application. -
Install requirements:
pip install -r requirements.txtOr, if you want to use the package mode:
pip install .
-
Run the application:
python -m puripy.appOr, if you installed as a package and set up entry points:
puripy
๐ผ๏ธ Logo
The official Puripy logo is included in the package at CleansiPy/assets/logo.png.
To access or display the logo programmatically:
from CleansiPy import get_logo_path, show_logo
print(get_logo_path())
show_logo()๐ฆ Package Structure
CleansiPy/
__init__.py
__main__.py
app.py
mainnum.py
maincat.py
maintext.py
maindt.py
logo.py
config.py
assets/
logo.png
README.txt
setup.py
requirements.txt
README.md
- All main code is inside the
CleansiPy/directory for packaging. - The logo is in
CleansiPy/assets/logo.pngand accessible viaget_logo_path()andshow_logo(). - To run the app: set up config, install requirements, then run
python -m CleansiPy.app.
in case you dont get the config.py then use this template :
# ==================== CATEGORICAL DATA CONFIG ====================
config2 = {
# File paths
"INPUT_FILE": r"testdata\xx.csv", # Raw data source
"OUTPUT_FILE": r"testdata\cleaned.csv", # Cleaned data destination
"TARGET_COLUMN": None, # Target variable for ML tasks
"FILE_PATH": r"testdata\cleaning_report.txt", # Cleaning audit log
# Column selection
"COLUMNS_TO_CLEAN": ["category", "other_cat_col"], # Specific columns to process
"EXCLUDE_COLUMNS": [], # Columns to skip
# Core cleaning features
"FIX_TYPOS": True, # Auto-correct spelling variations
"GROUP_RARE": True, # Consolidate infrequent categories
"RARE_THRESHOLD": 0.05, # Minimum frequency to keep as separate category
"SIMILARITY_THRESHOLD": 80, # Fuzzy matching sensitivity (0-100)
# Performance
"MEMORY_EFFICIENT": False, # Optimize for large datasets
"PARALLEL_JOBS": -1 # CPU cores to use (-1 = all)
}
# ==================== NUMERICAL DATA CONFIG ====================
DEFAULT_CONFIG = {
# File handling
'input_file': r'testdata\data.csv',
'output_file': r'testdata\cleaned_output.csv',
'report_file': r'testdata\textreport.txt',
# Data type handling
'type_conversion': {
'numeric_cols': ['Sales_Before', 'Sales_After'] # Columns to force-convert to numeric
},
# Missing data
'missing_values': {
'strategy': 'mean', # Imputation method (mean/median/mode)
'threshold': 0.5 # Max allowed missingness per column
},
# Data validation
'data_errors': {
'constraints': { # Value range rules
'Sales_Before': lambda x: (x >= 50) & (x <= 500)
},
'correction': 'median' # How to fix invalid values
},
# Outlier treatment
'outliers': {
'method': 'iqr', # Detection method (iqr/zscore)
'action': 'cap', # Handling (cap/remove)
'columns': ['Sales_Before', 'Sales_After'] # Columns to check
},
# Precision control
'precision': {
'Sales_Before': 2 # Decimal places to round
}
}
# ==================== DATE/TIME CONFIG ====================
config3 = {
# File setup
"INPUT_FILE": r"testdata\dates.csv",
"OUTPUT_FILE": r"testdata\cleaned.csv",
"REPORT_FILE": r"testdata\date_cleaning_report.txt",
# Date processing
"PARSE_DATES": True, # Convert string dates
"IMPUTE_MISSING": True, # Fill missing timestamps
"IMPUTATION_METHOD": "linear", # Filling strategy
"STANDARDIZE_TIMEZONE": False, # Convert to target timezone
# Feature generation
"EXTRACT_FEATURES": True, # Create calendar features
"CALENDAR_FEATURES": ["year", "month"], # Features to extract
"FISCAL_YEAR_START": 10 # Fiscal year starting month
}
# ==================== TEXT CLEANING CONFIG ====================
config = {
# Core text processing
'lowercase': True, # Convert to lowercase
'remove_punctuation': True, # Strip punctuation
'remove_stopwords': True, # Filter common words
'lemmatize': True, # Reduce to base form
# Advanced cleaning
'remove_urls': True, # Strip web addresses
'remove_emojis': True, # Filter emoji characters
'spelling_correction': False, # Fix spelling errors
# File handling
'input_file': r"testdata\test.csv",
'output_file': r"testdata\cleaned_text.csv",
'text_column': None # Auto-detect text column
}Author
Developed by Sambhranta Ghosh
Open to contributions, feedback, and improvements!
For more, see the in-code docstrings and comments
