# Project Title: Data Cleaning Project for an Unclean Dataset

# Objective
- The primary objective of this project is to clean and preprocess an unclean dataset,
  which contains stray special characters and fused data fields,
  so that the data is reliable enough for further analysis.

# Dataset Description

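- Dataset: Employee records with the columns `Name`, `Date_of_Birth_EmailID`, `Job_Type`, and `Department`.
- Issues: Date of birth and email ID are fused into a single column (in either order), and names, job types, and departments contain special characters such as `@`, `$`, `&`, `!`, `#`, `%`, and `^`.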

In [1]:
# Import Necessary Libraries
import pandas as pd

import re

In [2]:
data = pd.read_csv(r'unclean_data.csv')

In [3]:
unclean_data = data.copy()

In [4]:
unclean_data

Unnamed: 0,Name,Date_of_Birth_EmailID,Job_Type,Department
0,John@Doe,1990-01-15john.doe@example.com,Full$-Time,Enginee@ring
1,Jane$Smith,jane.smith@company.com1992-03-22,Part!Time,Market#ing
2,Michael&Jordan,1985-07-30michael.jordan@basketball.org,Con%tract,Sa$les
3,Emily!Brown,emily.brown@example.co.uk1988-11-10,Ful^l-Time,Human Reso@urces
4,Robert#Taylor,1995-05-17robert.taylor@example.net,Par^t-Time,Inform@ation Technolo%gy
5,Sophia$Walker,1993-09-12sophia.walker@fashion.com,Full-T^ime,De%sign
6,Chris!Evans,chris.evans@movies.com1982-06-13,Fre$elance,Produc!tion
7,Emma@Watson,1990-04-15emma.watson@hogwarts.edu,Inter^n,Educa@tion
8,Daniel#Craig,daniel.craig@007.com1968-03-02,Con!tract,Secur#ity
9,Olivia%Johnson,olivia.johnson@example.org1991-08-25,Ful$-Time,Opera#tions
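
Before cleaning, it can help to count how many values in each column contain stray symbols. The sketch below is not part of the original notebook: it rebuilds three rows of the table above in a small frame (`sample`, `SPECIAL`, and `dirty_counts` are illustrative names) and uses `Series.str.contains` with a character class assumed from the symbols visible in the sample.

```python
import pandas as pd

# Three rows copied from the raw table above; `sample` is an illustrative name.
sample = pd.DataFrame({
    "Name": ["John@Doe", "Jane$Smith", "Sophia$Walker"],
    "Job_Type": ["Full$-Time", "Part!Time", "Full-T^ime"],
    "Department": ["Enginee@ring", "Market#ing", "De%sign"],
})

# Character class assumed from the symbols visible in the sample data.
SPECIAL = r"[@$&!#%^]"

# Number of rows per column that contain at least one special character.
dirty_counts = {
    col: int(sample[col].str.contains(SPECIAL, regex=True).sum())
    for col in sample.columns
}
print(dirty_counts)
```

The combined date/email column is left out on purpose, because `@` is a legitimate part of every email address there.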


# Data Cleaning Steps
- 1) Remove special characters from names, job types, and departments.
- 2) Separate dates of birth from email IDs.
- 3) Split names into first and last names.
- 4) Arrange the columns in a logical order.

In [5]:
# Step - 1
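# Create a row-wise function that strips special characters
def to_clean_data(data):
    # In names the symbol stands where the space between first and last name belongs
    data['Name'] = re.sub(r'[@$&!#%^]', ' ', data['Name'])
    # In job types and departments the symbol is noise inside a word, so drop it
    data['Job_Type'] = re.sub(r'[@$&!#%^]', '', data['Job_Type'])
    data['Department'] = re.sub(r'[@$&!#%^]', '', data['Department'])
    return data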

In [6]:
unclean_data = unclean_data.apply(to_clean_data,axis = 1)

In [7]:
unclean_data

Unnamed: 0,Name,Date_of_Birth_EmailID,Job_Type,Department
0,John Doe,1990-01-15john.doe@example.com,Full-Time,Engineering
1,Jane Smith,jane.smith@company.com1992-03-22,PartTime,Marketing
2,Michael Jordan,1985-07-30michael.jordan@basketball.org,Contract,Sales
3,Emily Brown,emily.brown@example.co.uk1988-11-10,Full-Time,Human Resources
4,Robert Taylor,1995-05-17robert.taylor@example.net,Part-Time,Information Technology
5,Sophia Walker,1993-09-12sophia.walker@fashion.com,Full-Time,Design
6,Chris Evans,chris.evans@movies.com1982-06-13,Freelance,Production
7,Emma Watson,1990-04-15emma.watson@hogwarts.edu,Intern,Education
8,Daniel Craig,daniel.craig@007.com1968-03-02,Contract,Security
9,Olivia Johnson,olivia.johnson@example.org1991-08-25,Ful-Time,Operations
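
The row-wise `apply` used here works, but pandas also provides vectorized string methods. As an alternative sketch (not the notebook's own method), `Series.str.replace` with `regex=True` applies the same pattern to a whole column at once. The frame `sample` and the constant `PATTERN` are illustrative names, and the two rows are copied from the raw data.

```python
import pandas as pd

# Two rows copied from the raw data; `sample` is an illustrative name.
sample = pd.DataFrame({
    "Name": ["John@Doe", "Jane$Smith"],
    "Job_Type": ["Full$-Time", "Part!Time"],
    "Department": ["Enginee@ring", "Market#ing"],
})

PATTERN = r"[@$&!#%^]"

cleaned = sample.copy()
# In names the symbol replaces the space between first and last name.
cleaned["Name"] = cleaned["Name"].str.replace(PATTERN, " ", regex=True)
# In the other columns the symbol is noise inside a word, so remove it.
for col in ["Job_Type", "Department"]:
    cleaned[col] = cleaned[col].str.replace(PATTERN, "", regex=True)

print(cleaned)
```

Note that `Part!Time` becomes `PartTime`, not `Part-Time`: removing characters cannot restore a hyphen that was never there.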


In [8]:
# Step - 2
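# Create function to separate Email ID and DOB
def to_split_mail_dob(data):
    combined = data['Date_of_Birth_EmailID']
    # The date can come before or after the email, so search the whole string
    dob = re.search(r'\d{4}-\d{2}-\d{2}', combined)
    # Fall back to an empty string so rows without a date do not raise AttributeError
    data['Date_of_Birth'] = dob.group(0) if dob else ''
    # Removing the date (first occurrence only) leaves the email ID
    data['Email_ID'] = combined.replace(data['Date_of_Birth'], '', 1)
    return data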

In [9]:
unclean_data = unclean_data.apply(to_split_mail_dob, axis = 1)

In [10]:
unclean_data

Unnamed: 0,Name,Date_of_Birth_EmailID,Job_Type,Department,Date_of_Birth,Email_ID
0,John Doe,1990-01-15john.doe@example.com,Full-Time,Engineering,1990-01-15,john.doe@example.com
1,Jane Smith,jane.smith@company.com1992-03-22,PartTime,Marketing,1992-03-22,jane.smith@company.com
2,Michael Jordan,1985-07-30michael.jordan@basketball.org,Contract,Sales,1985-07-30,michael.jordan@basketball.org
3,Emily Brown,emily.brown@example.co.uk1988-11-10,Full-Time,Human Resources,1988-11-10,emily.brown@example.co.uk
4,Robert Taylor,1995-05-17robert.taylor@example.net,Part-Time,Information Technology,1995-05-17,robert.taylor@example.net
5,Sophia Walker,1993-09-12sophia.walker@fashion.com,Full-Time,Design,1993-09-12,sophia.walker@fashion.com
6,Chris Evans,chris.evans@movies.com1982-06-13,Freelance,Production,1982-06-13,chris.evans@movies.com
7,Emma Watson,1990-04-15emma.watson@hogwarts.edu,Intern,Education,1990-04-15,emma.watson@hogwarts.edu
8,Daniel Craig,daniel.craig@007.com1968-03-02,Contract,Security,1968-03-02,daniel.craig@007.com
9,Olivia Johnson,olivia.johnson@example.org1991-08-25,Ful-Time,Operations,1991-08-25,olivia.johnson@example.org
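
The same split can be sketched with vectorized string methods, which also makes it easy to convert the extracted text into real dates. This is an illustrative alternative rather than the notebook's method; `combined`, `DATE_PATTERN`, `dob_text`, `email`, and `dob` are assumed names, and the three values are copied from the table above.

```python
import pandas as pd

# Three values copied from the combined column; the date may lead or trail.
combined = pd.Series([
    "1990-01-15john.doe@example.com",
    "jane.smith@company.com1992-03-22",
    "daniel.craig@007.com1968-03-02",
], name="Date_of_Birth_EmailID")

DATE_PATTERN = r"(\d{4}-\d{2}-\d{2})"

# With a single capture group and expand=False, str.extract returns a Series.
dob_text = combined.str.extract(DATE_PATTERN, expand=False)
# Remove only the first date occurrence; what remains is the email ID.
email = combined.str.replace(DATE_PATTERN, "", n=1, regex=True)
# errors="coerce" turns unparseable values into NaT instead of raising.
dob = pd.to_datetime(dob_text, format="%Y-%m-%d", errors="coerce")

print(pd.DataFrame({"Date_of_Birth": dob, "Email_ID": email}))
```

Storing the date of birth as a datetime rather than a string allows later steps such as computing ages or sorting chronologically.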


In [11]:
# Step - 3
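# Create function to split the name into first and last name
def to_split_name(data):
    # partition splits on the first space only, so a single-word name
    # yields an empty Last_name instead of raising IndexError
    first, _, last = data['Name'].partition(' ')
    data['First_name'] = first
    data['Last_name'] = last
    return data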

In [12]:
unclean_data = unclean_data.apply(to_split_name, axis = 1)

In [13]:
unclean_data

Unnamed: 0,Name,Date_of_Birth_EmailID,Job_Type,Department,Date_of_Birth,Email_ID,First_name,Last_name
0,John Doe,1990-01-15john.doe@example.com,Full-Time,Engineering,1990-01-15,john.doe@example.com,John,Doe
1,Jane Smith,jane.smith@company.com1992-03-22,PartTime,Marketing,1992-03-22,jane.smith@company.com,Jane,Smith
2,Michael Jordan,1985-07-30michael.jordan@basketball.org,Contract,Sales,1985-07-30,michael.jordan@basketball.org,Michael,Jordan
3,Emily Brown,emily.brown@example.co.uk1988-11-10,Full-Time,Human Resources,1988-11-10,emily.brown@example.co.uk,Emily,Brown
4,Robert Taylor,1995-05-17robert.taylor@example.net,Part-Time,Information Technology,1995-05-17,robert.taylor@example.net,Robert,Taylor
5,Sophia Walker,1993-09-12sophia.walker@fashion.com,Full-Time,Design,1993-09-12,sophia.walker@fashion.com,Sophia,Walker
6,Chris Evans,chris.evans@movies.com1982-06-13,Freelance,Production,1982-06-13,chris.evans@movies.com,Chris,Evans
7,Emma Watson,1990-04-15emma.watson@hogwarts.edu,Intern,Education,1990-04-15,emma.watson@hogwarts.edu,Emma,Watson
8,Daniel Craig,daniel.craig@007.com1968-03-02,Contract,Security,1968-03-02,daniel.craig@007.com,Daniel,Craig
9,Olivia Johnson,olivia.johnson@example.org1991-08-25,Ful-Time,Operations,1991-08-25,olivia.johnson@example.org,Olivia,Johnson
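
Every name in this dataset has exactly two parts, but real data often does not. The sketch below shows how `Series.str.split` with `n=1` and `expand=True` behaves on such cases. The names `Cher` and `Mary Ann Lee` are hypothetical edge cases added for illustration and do not appear in the dataset; `names` and `parts` are assumed variable names.

```python
import pandas as pd

# The first two names come from the dataset; the last two are hypothetical
# edge cases (a single-word name and a three-word name).
names = pd.Series(["John Doe", "Emily Brown", "Cher", "Mary Ann Lee"], name="Name")

# n=1 splits on the first space only; expand=True returns one column per part.
parts = names.str.split(" ", n=1, expand=True)
parts.columns = ["First_name", "Last_name"]

print(parts)
```

With this approach a single-word name leaves `Last_name` missing, and everything after the first space is kept together as the last name. Whether that is the right rule for middle names depends on the data and is a design choice.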


In [14]:
unclean_data = unclean_data.drop(columns=['Name','Date_of_Birth_EmailID'])

In [15]:
# Step - 4
# Arrange the columns in the final order
unclean_data = unclean_data[['First_name', 'Last_name', 'Date_of_Birth', 'Email_ID', 'Department', 'Job_Type']]

In [16]:
unclean_data

Unnamed: 0,First_name,Last_name,Date_of_Birth,Email_ID,Department,Job_Type
0,John,Doe,1990-01-15,john.doe@example.com,Engineering,Full-Time
1,Jane,Smith,1992-03-22,jane.smith@company.com,Marketing,PartTime
2,Michael,Jordan,1985-07-30,michael.jordan@basketball.org,Sales,Contract
3,Emily,Brown,1988-11-10,emily.brown@example.co.uk,Human Resources,Full-Time
4,Robert,Taylor,1995-05-17,robert.taylor@example.net,Information Technology,Part-Time
5,Sophia,Walker,1993-09-12,sophia.walker@fashion.com,Design,Full-Time
6,Chris,Evans,1982-06-13,chris.evans@movies.com,Production,Freelance
7,Emma,Watson,1990-04-15,emma.watson@hogwarts.edu,Education,Intern
8,Daniel,Craig,1968-03-02,daniel.craig@007.com,Security,Contract
9,Olivia,Johnson,1991-08-25,olivia.johnson@example.org,Operations,Ful-Time


In [17]:
# Export the cleaned dataset as a CSV file for EDA

In [18]:
unclean_data.to_csv('clean_data.csv',index = True)
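
The `index` argument of `to_csv` decides whether the row index is written as an extra first column. The sketch below round-trips a tiny frame through an in-memory buffer to show the effect on the columns that `read_csv` sees afterwards. The helper `round_trip` and the frame `clean` are illustrative names, not part of the notebook.

```python
import io

import pandas as pd

# Two-row stand-in for the cleaned data; `clean` is an illustrative name.
clean = pd.DataFrame({
    "First_name": ["John", "Jane"],
    "Last_name": ["Doe", "Smith"],
})


def round_trip(df, index):
    """Write df to an in-memory CSV and read it back with default settings."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=index)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = round_trip(clean, index=True)
without_index = round_trip(clean, index=False)

print(list(with_index.columns))     # ['Unnamed: 0', 'First_name', 'Last_name']
print(list(without_index.columns))  # ['First_name', 'Last_name']
```

If the default integer index carries no meaning, `index=False` keeps the exported file free of the extra `Unnamed: 0` column. If the index is meaningful, reading the file back with `index_col=0` restores it.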

# Results
- Cleaned Dataset: Special characters have been removed from names, job types, and departments; dates of birth and email IDs now sit in separate columns; and names are split into first and last names.
- Remaining Issues: Two `Job_Type` values, `PartTime` and `Ful-Time`, are still spelled inconsistently and should be normalized before the dataset is used for analysis.
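
In the final table, `Job_Type` still contains `PartTime` and `Ful-Time`, because stripping symbols cannot repair a missing hyphen or letter. One possible follow-up, sketched under the assumption that `Part-Time` and `Full-Time` are the intended labels, is an explicit mapping applied with `Series.replace`. The mapping `JOB_TYPE_FIXES` is hypothetical and would need to be confirmed against the organisation's real categories.

```python
import pandas as pd

# Job types as they appear after the cleaning steps in this notebook.
job_type = pd.Series(
    ["Full-Time", "PartTime", "Contract", "Ful-Time", "Intern"],
    name="Job_Type",
)

# Assumed canonical spellings; this mapping is hypothetical.
JOB_TYPE_FIXES = {"PartTime": "Part-Time", "Ful-Time": "Full-Time"}

# A dict passed to Series.replace substitutes exact value matches only.
normalized = job_type.replace(JOB_TYPE_FIXES)

print(sorted(normalized.unique()))
```

An explicit mapping is deliberately conservative: values it does not list are left untouched, so unexpected categories stay visible instead of being silently merged.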

# Importance of Data Cleaning
Data cleaning is crucial in any data-related project for the following reasons:

1) Data Quality: Clean data ensures accuracy, consistency, and reliability of analysis results.

2) Improved Model Performance: Machine learning models generally perform better when trained on clean data, which leads to more accurate predictions.

3) Efficiency: Clean data reduces the time spent on troubleshooting data-related issues during analysis or model training.

4) Insights and Decision-Making: High-quality data leads to more reliable insights, which are essential for informed decision-making.