Data Cleaning Master is a Python application that streamlines the cleaning of large datasets. It identifies and backs up duplicates, handles missing values, and saves the cleaned data, typically within seconds. It has been tested on several datasets containing over 10,000 rows and performed reliably on each. With an intuitive interface, Data Cleaning Master makes preprocessing easier for analysts and data scientists and leaves them with clean datasets ready for deeper analysis.
Key Features:
- Duplicate Management: Identifies and backs up duplicates before removing them.
- Missing Data Handling:
  - For numeric columns, missing values are replaced with the column's mean.
  - For non-numeric columns, rows with missing values are dropped.
- Format Support: Accepts datasets in CSV and Excel formats.
- High Performance: Efficiently processes datasets with tens of thousands of rows.
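The missing-data rule listed above can be sketched in pandas as follows. This is a minimal illustration rather than the application's actual code: the function name `handle_missing` is hypothetical, and the order (fill numeric columns first, then drop rows) is an assumption.

```python
import numpy as np
import pandas as pd


def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mean-fill numeric columns, drop rows with other gaps."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    if len(numeric_cols):
        # Series.mean() skips NaN by default, so the mean uses observed values only.
        out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    non_numeric_cols = [c for c in out.columns if c not in numeric_cols]
    if non_numeric_cols:
        # Drop any row that is missing a value in a non-numeric column.
        out = out.dropna(subset=non_numeric_cols)
    return out.reset_index(drop=True)


sample = pd.DataFrame(
    {"price": [10.0, np.nan, 30.0], "city": ["Austin", "Boston", None]}
)
cleaned = handle_missing(sample)
print(cleaned)
```

In this sample the missing price is filled with the mean of the two observed prices, and the row without a city is dropped.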
Functionality:
- Load datasets from CSV or Excel files.
- Detect and remove duplicate records, saving them separately.
- Handle missing values:
  - Replace missing values in numeric columns with the mean.
  - Remove rows with missing values in non-numeric columns.
- Save cleaned datasets while retaining backups of duplicates.
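As an illustration of the loading and duplicate-detection steps, here is a short pandas sketch. The helper names `load_dataset` and `split_duplicates` are hypothetical, and treating only the second and later occurrences of a row as duplicates is an assumption (it is the default behavior of `DataFrame.duplicated`).

```python
import os

import pandas as pd


def load_dataset(path: str) -> pd.DataFrame:
    """Hypothetical loader: choose the pandas reader from the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        return pd.read_csv(path)
    if ext in (".xlsx", ".xls"):
        # read_excel relies on openpyxl (.xlsx) or xlrd (.xls) being installed.
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {ext!r}")


def split_duplicates(df: pd.DataFrame):
    """Return (rows with duplicates removed, the removed duplicate rows)."""
    mask = df.duplicated()  # True for the second and later copies of a row
    return df[~mask].reset_index(drop=True), df[mask].reset_index(drop=True)


orders = pd.DataFrame({"id": [1, 2, 2, 3], "item": ["pen", "ink", "ink", "pad"]})
unique_rows, duplicate_rows = split_duplicates(orders)
print(unique_rows)
print(duplicate_rows)
```

Keeping the removed rows in their own frame is what makes a separate backup file possible before the duplicates are discarded.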
Requirements:
- Python 3.x
- pandas
- NumPy
- openpyxl
- xlrd
- os module (Python standard library)
- Jupyter Notebook (for testing and development)
How It Works:
- The user inputs the file path and dataset name.
- The application validates the input, ensuring the file is either in CSV or Excel format.
- Any duplicate rows are saved to a separate file {dataset_name}_duplicates.csv.
- Duplicates are then removed from the dataset.
- Numeric columns: Missing values are replaced with the column mean.
- Non-numeric columns: Rows with missing values are dropped entirely.
- The cleaned dataset is saved as {dataset_name}_Clean_data.csv.
- A success message is displayed after the cleaning process.
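The steps above can be sketched end to end as a single function. This is an assumption-laden outline, not the application's source: `clean_dataset` and its `out_dir` parameter are hypothetical, and only the two output file names come from the description above.

```python
import os

import pandas as pd


def clean_dataset(path: str, dataset_name: str, out_dir: str = "."):
    """Hypothetical pipeline: validate, back up duplicates, fill or drop, save."""
    # Validate the input format from the file extension.
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        df = pd.read_csv(path)
    elif ext in (".xlsx", ".xls"):
        df = pd.read_excel(path)
    else:
        raise ValueError("The file must be in CSV or Excel format")

    dup_path = os.path.join(out_dir, f"{dataset_name}_duplicates.csv")
    clean_path = os.path.join(out_dir, f"{dataset_name}_Clean_data.csv")

    # Back up duplicate rows, then remove them.
    # Assumption: the backup file is written even when it has no rows.
    dup_mask = df.duplicated()
    df[dup_mask].to_csv(dup_path, index=False)
    df = df[~dup_mask].copy()

    # Numeric columns: replace missing values with the column mean.
    numeric_cols = df.select_dtypes(include="number").columns
    if len(numeric_cols):
        df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

    # Non-numeric columns: drop rows that have missing values.
    other_cols = [c for c in df.columns if c not in numeric_cols]
    if other_cols:
        df = df.dropna(subset=other_cols)

    df.to_csv(clean_path, index=False)
    print(f"Cleaning complete. Cleaned data saved as: {clean_path}")
    return dup_path, clean_path
```

Under these assumptions, calling `clean_dataset("salesdata.xlsx", "sales_data")` would write `sales_data_duplicates.csv` and `sales_data_Clean_data.csv` to the current directory.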
Testing:
The application has been tested on several datasets containing over 10,000 rows and consistently cleaned them within seconds. It was also run in Jupyter Notebook, where it integrated smoothly with data analysis workflows.
Usage:
- Run the application in a Python environment.
- Input the file path and dataset name as prompted.
- The application cleans the dataset, backs up duplicates, and saves the cleaned data.
Example:
    python data_cleaning_application
    Welcome to Data Cleaning Master!
    Please enter dataset path: salesdata.xlsx
    Please enter dataset name: sales_data
Output:
- Duplicate records saved as: sales_data_duplicates.csv
- Cleaned data saved as: sales_data_Clean_data.csv
Data Cleaning Master is a powerful, easy-to-use tool that automates data preprocessing tasks. Its fast execution, careful handling of duplicates, and effective management of missing values make it ideal for preparing data for analysis, enabling smoother workflows for data professionals.