Skip to content

A fast and user-friendly Python tool for efficiently removing duplicates and handling missing values in large datasets.

Notifications You must be signed in to change notification settings

vinayakdon/Data-Cleaning-Python-Application

Repository files navigation

Data Cleaning - Python Application

Overview

Data Cleaning Master is a Python application built to streamline the cleaning of large datasets. It handles tasks such as identifying and backing up duplicates, addressing missing values, and saving cleaned data—all in a matter of seconds. This tool has been rigorously tested across various datasets, ensuring high performance and reliability, even with datasets containing thousands of rows. With an intuitive interface, Data Cleaning Master makes data preprocessing easier for analysts and data scientists, preparing clean datasets for deeper analysis.

Key Features:

  1. Duplicate Management: Identifies and backs up duplicates before removing them.
  2. Missing Data Handling:
    • For numeric columns, missing values are replaced with the column's mean.
    • For non-numeric columns, rows with missing values are dropped.
  3. Format Support: Accepts datasets in CSV and Excel formats.
  4. High Performance: Efficiently processes datasets with tens of thousands of rows.

Objectives

  1. Load datasets from CSV or Excel files.
  2. Detect and remove duplicate records, saving them separately.
  3. Handle missing values:
    • Replace missing values in numeric columns with the mean.
    • Remove rows with missing values in non-numeric columns.
  4. Save cleaned datasets while retaining backups of duplicates.

Requirements and Libraries used:

  • Python 3.x
  • Pandas
  • Numpy
  • Openpyxl
  • Xlrd
  • OS library
  • Jupyter Notebook (for testing and development)

Workflow

1. Dataset Loading

  • The user inputs the file path and dataset name.
  • The application validates the input, ensuring the file is either in CSV or Excel format.

2. Duplicate Detection and Backup

  • Any duplicate rows are saved to a separate file {dataset_name}_duplicates.csv.
  • Duplicates are then removed from the dataset.

3. Missing Value Handling

  • Numeric columns: Missing values are replaced with the column mean.
  • Non-numeric columns: Rows with missing values are dropped entirely.

4. Exporting Clean Data

  • The cleaned dataset is saved as {dataset_name}_Clean_data.csv.
  • A success message is displayed after the cleaning process.

Performance & Testing

The application has been tested on several datasets containing over 10,000 rows, consistently cleaning data within seconds. Testing in Jupyter Notebook confirmed seamless integration with data analysis workflows.

Usage

  • Run the application in a Python environment.

  • Input the file path and dataset name as prompted.

  • The application cleans the dataset, backs up duplicates, and saves the cleaned data.

       python data_cleaning_application
    

    Example of Execution

     Welcome to Data Cleaning Master!
     Please enter dataset path: salesdata.xlsx
     Please enter dataset name: sales_data
    

    Output:

    • Duplicate records saved as: sales_data_duplicates.csv
    • Cleaned data saved as: sales_data_Clean_data.csv
    • screenshots

    datacleaning

    datacleaning2

Conclusion

Data Cleaning Master is a powerful, easy-to-use tool that automates data preprocessing tasks. Its fast execution, careful handling of duplicates, and effective management of missing values make it ideal for preparing data for analysis, enabling smoother workflows for data professionals.

About

A fast and user-friendly Python tool for efficiently removing duplicates and handling missing values in large datasets.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages