Data Cleaning - Python Application

Overview

Data Cleaning Master is a Python application that streamlines the cleaning of large datasets. It identifies and backs up duplicate rows, handles missing values, and saves the cleaned data, typically within seconds. It has been tested on several datasets of more than 10,000 rows. The prompt-driven interface asks only for a file path and a dataset name, which keeps preprocessing simple for analysts and data scientists and leaves a clean dataset ready for deeper analysis.

Key Features:

Duplicate Management: Identifies and backs up duplicates before removing them.
Missing Data Handling:
- For numeric columns, missing values are replaced with the column's mean.
- For non-numeric columns, rows with missing values are dropped.
Format Support: Accepts datasets in CSV and Excel formats.
High Performance: Cleans datasets of more than 10,000 rows within seconds in testing.

Objectives

Load datasets from CSV or Excel files.
Detect and remove duplicate records, saving them separately.
Handle missing values:
- Replace missing values in numeric columns with the mean.
- Remove rows with missing values in non-numeric columns.
Save cleaned datasets while retaining backups of duplicates.

Requirements and Libraries used:

Python 3.x
pandas
NumPy
openpyxl (used by pandas to read .xlsx files)
xlrd (used by pandas to read legacy .xls files)
os (Python standard library)
Jupyter Notebook (for testing and development)

Workflow

1. Dataset Loading

The user inputs the file path and dataset name.
The application validates the input, ensuring the file is either in CSV or Excel format.
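
The loading step can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the function name `load_dataset` and the exact set of accepted extensions are assumptions.

```python
from pathlib import Path

import pandas as pd

# Extensions treated as valid input (assumed; the application may accept a different set)
CSV_EXTENSIONS = {".csv"}
EXCEL_EXTENSIONS = {".xlsx", ".xls"}


def load_dataset(data_path: str) -> pd.DataFrame:
    """Load a CSV or Excel file into a DataFrame, rejecting other formats."""
    path = Path(data_path)
    if not path.is_file():
        raise FileNotFoundError(f"No file found at: {data_path}")

    suffix = path.suffix.lower()
    if suffix in CSV_EXTENSIONS:
        return pd.read_csv(path)
    if suffix in EXCEL_EXTENSIONS:
        # pandas delegates .xlsx to openpyxl and .xls to xlrd
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file format: {suffix or 'no extension'}")
```

Checking the extension before reading means an unsupported file fails with a clear message rather than a parser error.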

2. Duplicate Detection and Backup

Any duplicate rows are saved to a separate file {dataset_name}_duplicates.csv.
Duplicates are then removed from the dataset.
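
A minimal sketch of this step, assuming the first occurrence of each row is kept and that no backup file is written when there are no duplicates (the source specifies neither). The function name and the `output_dir` parameter are hypothetical.

```python
from pathlib import Path

import pandas as pd


def back_up_and_drop_duplicates(
    df: pd.DataFrame, dataset_name: str, output_dir: str = "."
) -> pd.DataFrame:
    """Save duplicate rows to {dataset_name}_duplicates.csv, then return the de-duplicated frame."""
    # duplicated() flags every repeat of a row after its first occurrence
    duplicates = df[df.duplicated()]

    if not duplicates.empty:
        backup_path = Path(output_dir) / f"{dataset_name}_duplicates.csv"
        duplicates.to_csv(backup_path, index=False)

    # Keep the first occurrence of each row (assumed behaviour)
    return df.drop_duplicates()
```

Writing the backup before dropping anything means the removed rows can still be inspected or restored later.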

3. Missing Value Handling

Numeric columns: Missing values are replaced with the column mean.
Non-numeric columns: Rows with missing values are dropped entirely.
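
The two rules above might be implemented along these lines. This is a sketch under the assumption that numeric columns are filled first and rows are dropped afterwards, so each mean is computed over all rows present at that point. The function name is hypothetical.

```python
import pandas as pd


def handle_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column mean; drop rows with gaps in other columns."""
    cleaned = df.copy()

    numeric_cols = cleaned.select_dtypes(include="number").columns
    other_cols = cleaned.columns.difference(numeric_cols)

    # mean() skips NaN by default, so each mean is taken over the observed values only
    for col in numeric_cols:
        cleaned[col] = cleaned[col].fillna(cleaned[col].mean())

    # Drop any row that is missing a value in a non-numeric column
    if len(other_cols) > 0:
        cleaned = cleaned.dropna(subset=list(other_cols))
    return cleaned
```

For example, a `price` column holding 10, a missing value, and 30 would have the gap filled with 20, while a row with a missing `city` would be removed entirely.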

4. Exporting Clean Data

The cleaned dataset is saved as {dataset_name}_Clean_data.csv.
A success message is displayed after the cleaning process.
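
A sketch of the export step, using the file name pattern given above. The function name, the `output_dir` parameter, and the wording of the success message are assumptions.

```python
from pathlib import Path

import pandas as pd


def save_clean_data(df: pd.DataFrame, dataset_name: str, output_dir: str = ".") -> Path:
    """Write the cleaned frame to {dataset_name}_Clean_data.csv and report success."""
    output_path = Path(output_dir) / f"{dataset_name}_Clean_data.csv"
    # index=False keeps the pandas row index out of the exported file
    df.to_csv(output_path, index=False)
    print(f"Cleaned data saved as: {output_path.name}")
    return output_path
```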

Performance & Testing

The application has been tested on several datasets containing over 10,000 rows and consistently cleaned them within seconds. It was also run from a Jupyter Notebook during testing and development, so it can be used inside notebook-based analysis workflows.
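
One way to run a similar timing check is sketched below. It uses a synthetic dataset with hypothetical columns, not the datasets used in the original tests, and it inlines the cleaning rules described in the Workflow section rather than calling the application itself.

```python
import time

import numpy as np
import pandas as pd

# Build a synthetic 10,000-row dataset (hypothetical columns) with gaps and repeated rows
rng = np.random.default_rng(seed=0)
n_rows = 10_000
df = pd.DataFrame({
    "order_id": np.arange(n_rows),
    "amount": rng.normal(100.0, 15.0, n_rows),
    "region": rng.choice(["north", "south", "east", "west"], n_rows),
})
df.loc[rng.choice(n_rows, 500, replace=False), "amount"] = np.nan
df.loc[rng.choice(n_rows, 300, replace=False), "region"] = None
df = pd.concat([df, df.head(200)], ignore_index=True)  # append 200 exact duplicates

start = time.perf_counter()
cleaned = df.drop_duplicates()                                   # remove repeated rows
cleaned = cleaned.fillna({"amount": cleaned["amount"].mean()})   # numeric: fill with mean
cleaned = cleaned.dropna(subset=["region"])                      # non-numeric: drop rows
elapsed = time.perf_counter() - start

print(f"Cleaned {len(df):,} rows down to {len(cleaned):,} in {elapsed:.3f} s")
```

The measured time depends on the machine and on file I/O, which this sketch leaves out.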

Usage

Run the application in a Python environment:
```
python data_cleaning_application
```
Input the file path and dataset name as prompted.
The application cleans the dataset, backs up duplicates, and saves the cleaned data.
Example of Execution
```
 Welcome to Data Cleaning Master!
 Please enter dataset path: salesdata.xlsx
 Please enter dataset name: sales_data
```
Output:
- Duplicate records saved as: sales_data_duplicates.csv
- Cleaned data saved as: sales_data_Clean_data.csv

Conclusion

Data Cleaning Master is a powerful, easy-to-use tool that automates data preprocessing tasks. Its fast execution, careful handling of duplicates, and effective management of missing values make it ideal for preparing data for analysis, enabling smoother workflows for data professionals.

vinayakdon/Data-Cleaning-Python-Application
