# Project Title: Data Cleaning Project for an Uncleaned Dataset

# Objective: 
- The primary objective of this project is to clean and
  preprocess an unclean dataset whose fields contain special characters
  and merged values, so that the data is consistent enough for further analysis.

# Dataset Description

- Dataset: The dataset consists of employee records with the columns `Name`, `Date_of_Birth_EmailID`, `Job_Type`, and `Department`.
- The data is unclean: the date of birth and the email ID are merged into a single field, and names, job types, and departments contain special characters.

In [None]:
# Import Necessary Libraries
import pandas as pd

import re

In [None]:
data = pd.read_csv(r'unclean_data.csv')

In [None]:
unclean_data = data.copy()

In [None]:
unclean_data

# Data Cleaning Steps
1) Remove special characters from names, job types, and departments.

2) Separate dates of birth from email IDs.

3) Split names into first and last names.

4) Arrange the columns in a proper sequence.

In [None]:
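# Step - 1
# Create a function that removes special characters from one row
def to_clean_data(data):
    # In names, a special character is replaced by a space so the name can be split later
    data['Name'] = re.sub(r'[@$&!#%^]', ' ', data['Name'])
    # In job types and departments, special characters are simply dropped
    data['Job_Type'] = re.sub(r'[@$&!#%^]', '', data['Job_Type'])
    data['Department'] = re.sub(r'[@$&!#%^]', '', data['Department'])
    return data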

In [None]:
unclean_data = unclean_data.apply(to_clean_data,axis = 1)

In [None]:
unclean_data
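
The row-wise `apply` above works, but pandas string methods can do the same substitution one column at a time. The following is a minimal sketch of that alternative, not a replacement for the step above. The function name `clean_columns_vectorized` and the sample rows are hypothetical; the real values in `unclean_data.csv` may look different.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame({
    "Name": ["John@Doe", "Jane#Smith"],
    "Job_Type": ["Full!Time", "Part$Time"],
    "Department": ["Sa%les", "H^R"],
})

# Same character class as the one used in to_clean_data.
SPECIAL_CHARS = r"[@$&!#%^]"


def clean_columns_vectorized(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with special characters removed column by column."""
    out = df.copy()
    # In names the special character becomes a space, as in Step 1.
    out["Name"] = out["Name"].str.replace(SPECIAL_CHARS, " ", regex=True)
    # In the other columns the character is dropped.
    for col in ("Job_Type", "Department"):
        out[col] = out[col].str.replace(SPECIAL_CHARS, "", regex=True)
    return out


cleaned_sample = clean_columns_vectorized(sample)
print(cleaned_sample)
```

Unlike `re.sub`, the `.str` methods pass missing values through as missing instead of raising an error, which may matter if some fields are empty.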

In [None]:
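# Step - 2
# Create function to separate Email ID and DOB
def to_split_mail_dob(data):
    combined = data['Date_of_Birth_EmailID']
    # Look for a date in YYYY-MM-DD format inside the combined field
    dob = re.search(r'\d{4}-\d{2}-\d{2}', combined)
    # re.search returns None when no date is found, so fall back to an empty string
    date_of_birth = dob.group(0) if dob else ''
    data['Date_of_Birth'] = date_of_birth
    # Whatever remains after removing the date is treated as the email ID
    data['Email_ID'] = combined.replace(date_of_birth, '').strip()
    return data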

In [None]:
unclean_data = unclean_data.apply(to_split_mail_dob, axis = 1)

In [None]:
unclean_data
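
The same split can be sketched with pandas string methods instead of a row-wise function. This is only an illustration under assumptions: the function name `split_dob_email` and the sample values are hypothetical, and the actual layout of `Date_of_Birth_EmailID` in the file may differ. Note that this version removes every date-like pattern from the field, while the function above removes only the date it found.

```python
import pandas as pd

# Hypothetical combined values, for illustration only.
sample = pd.DataFrame({
    "Date_of_Birth_EmailID": [
        "1990-05-14john.doe@example.com",
        "jane.smith@example.com1985-11-02",
        "no.date@example.com",
    ]
})

# One capture group so that str.extract returns the date itself.
DATE_PATTERN = r"(\d{4}-\d{2}-\d{2})"


def split_dob_email(df: pd.DataFrame,
                    source: str = "Date_of_Birth_EmailID") -> pd.DataFrame:
    """Return a copy of df with Date_of_Birth and Email_ID columns added."""
    out = df.copy()
    # expand=False gives a Series; rows without a date become missing values.
    out["Date_of_Birth"] = out[source].str.extract(DATE_PATTERN, expand=False)
    # Remove the date pattern and trim leftover whitespace to get the email.
    out["Email_ID"] = (
        out[source].str.replace(DATE_PATTERN, "", regex=True).str.strip()
    )
    return out


split_sample = split_dob_email(sample)
print(split_sample[["Date_of_Birth", "Email_ID"]])
```

Rows without a date end up with a missing `Date_of_Birth`, which makes them easy to find afterwards with `isna()`.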

In [None]:
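# Step - 3
# Create function to split the name into first name and last name
def to_split_name(data):
    # split() without arguments also handles repeated spaces left over from Step 1
    parts = data['Name'].split()
    data['First_name'] = parts[0] if parts else ''
    # Single-word names have no second part, so the last name is left empty
    data['Last_name'] = parts[1] if len(parts) > 1 else ''
    return data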

In [None]:
unclean_data = unclean_data.apply(to_split_name, axis = 1)

In [None]:
unclean_data
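
Names do not always consist of exactly two words separated by one space. The sketch below shows one possible way to handle extra spaces, single-word names, and names with more than two parts using `str.partition`. The function name `split_name_vectorized` and the sample names are hypothetical, and putting everything after the first word into `Last_name` is a design choice made for this example, not something the dataset dictates.

```python
import pandas as pd

# Hypothetical names covering a few edge cases.
sample = pd.DataFrame(
    {"Name": ["John Doe", "Jane  Smith", "Cher", "Mary Ann Lee"]}
)


def split_name_vectorized(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with First_name and Last_name columns added."""
    out = df.copy()
    # Collapse runs of whitespace into single spaces and trim the ends.
    normalized = out["Name"].str.split().str.join(" ")
    # partition always yields three columns: before, separator, after.
    parts = normalized.str.partition(" ")
    out["First_name"] = parts[0]
    # Everything after the first space; empty string for single-word names.
    out["Last_name"] = parts[2]
    return out


named_sample = split_name_vectorized(sample)
print(named_sample)
```

Because `partition` always returns three columns, this avoids the `IndexError` that indexing the result of `split` can raise on single-word names.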

In [None]:
unclean_data = unclean_data.drop(columns=['Name','Date_of_Birth_EmailID'])

In [None]:
# Step - 4
unclean_data = unclean_data[['First_name','Last_name','Date_of_Birth','Email_ID',"Department","Job_Type"]]

In [None]:
unclean_data

In [None]:
# Export clean data-set as csv for EDA

In [None]:
# index=False keeps the row index out of the exported file
unclean_data.to_csv('clean_data.csv', index=False)

# Results
- Cleaned dataset: special characters have been removed from names, job types, and departments; dates of birth and email IDs are stored in separate columns; and names are split into first and last names.

- The resulting dataset is exported as `clean_data.csv` and is ready for analysis.
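
One way to back up this summary is a small set of sanity checks on the cleaned columns. The following is a sketch under assumptions: the function `validate_clean_data` and the sample rows are hypothetical, and the checks (no leftover special characters, dates in `YYYY-MM-DD` form, an `@` in every email) are only examples of what could be verified.

```python
import pandas as pd

SPECIAL_CHARS = r"[@$&!#%^]"      # same character class as in Step 1
DATE_FORMAT = r"\d{4}-\d{2}-\d{2}"  # same date format as in Step 2


def validate_clean_data(df: pd.DataFrame) -> dict:
    """Count rows that still look unclean, keyed by the kind of issue."""
    issues = {}
    for col in ("First_name", "Last_name", "Department", "Job_Type"):
        has_special = df[col].astype(str).str.contains(SPECIAL_CHARS)
        issues[f"{col}_special_chars"] = int(has_special.sum())
    valid_date = df["Date_of_Birth"].astype(str).str.fullmatch(DATE_FORMAT)
    issues["bad_dates"] = int((~valid_date).sum())
    has_at = df["Email_ID"].astype(str).str.contains("@", regex=False)
    issues["emails_without_at"] = int((~has_at).sum())
    return issues


# Hypothetical rows: the first is clean, the second has three problems.
sample = pd.DataFrame({
    "First_name": ["John", "Ja#ne"],
    "Last_name": ["Doe", "Smith"],
    "Date_of_Birth": ["1990-05-14", "14/05/1990"],
    "Email_ID": ["john.doe@example.com", "jane.smith.example.com"],
    "Department": ["Sales", "HR"],
    "Job_Type": ["Full Time", "Part Time"],
})

report = validate_clean_data(sample)
print(report)
```

On the real data, a report with all counts at zero would support the claim that the cleaning steps worked as intended.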

# Importance of Data Cleaning
Data cleaning is crucial in any data-related project for the following reasons:

1) Data Quality: Clean data ensures accuracy, consistency, and reliability of analysis results.

2) Improved Model Performance: Machine learning models generally perform better when trained on clean data, which tends to lead to more accurate predictions.

3) Efficiency: Clean data reduces the time spent on troubleshooting data-related issues during analysis or model training.

4) Insights and Decision-Making: High-quality data leads to more reliable insights, which are essential for informed decision-making.