# Data Cleaning

This notebook focuses on cleaning and preparing the merged job dataset.
Steps include:
- Removing extra spaces
- Extracting experience years
- Converting data types
- Handling duplicates
- Cleaning skills column
- Cleaning location column
- Removing unnecessary columns
- Mapping Arabic `job_type` and `Experience_level` labels to English
- Exporting the cleaned dataset

In [81]:
import pandas as pd
import numpy as np

## Loading the Dataset
Load the merged dataset from the previous step for cleaning.


In [82]:
Data_jobs = pd.read_csv(r"D:\Projects\Year2-Term1\Project Data Science Methodology\Data inspection\Jobs")

## Stripping Extra Spaces From Text Columns
Removes leading and trailing spaces from all text-based columns.

In [83]:
text_cols = ['Title', 'company', 'location','job_type','work_mode','Experience_level','Experience_year','categories','department','skills']
for col in text_cols:
    Data_jobs[col] = Data_jobs[col].str.strip()
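
The same step can be tried on a small frame first. This is a minimal sketch, assuming a hypothetical frame `toy` and a hypothetical helper `strip_text_columns`; it also shows that missing values pass through `.str.strip()` unchanged.

```python
import numpy as np
import pandas as pd


def strip_text_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy with leading/trailing whitespace removed from the given columns."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].str.strip()
    return out


# Hypothetical sample rows, not taken from the real dataset.
toy = pd.DataFrame({
    "Title": ["  Data Analyst ", "Data Engineer"],
    "work_mode": [" On-site", np.nan],
})
clean = strip_text_columns(toy, ["Title", "work_mode"])
print(clean["Title"].tolist())
print(clean["work_mode"].isna().tolist())
```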

## Cleaning the Experience_year Column
- Keeps only the first number when the value is a range such as "1 - 3 years"
- Converts purely numeric strings to numbers (stored as floats such as 5.0 once missing values are present)
- Replaces non-numeric or empty values with "Unknown"

In [84]:
Data_jobs['Experience_year'] = Data_jobs['Experience_year'].str.split("-").str[0].str.strip()

In [85]:
Data_jobs['Experience_year']

0                               5
1                               3
2                               1
3                               1
4                               3
                   ...           
10834              Administration
10835              Administration
10836              Administration
10837        Business Development
10838    Customer Service/Support
Name: Experience_year, Length: 10839, dtype: object

In [86]:
Data_jobs['Experience_year'] = Data_jobs['Experience_year'].apply(lambda x: int(x) if str(x).isdigit() else np.nan)

In [87]:
Data_jobs['Experience_year'] = Data_jobs['Experience_year'].fillna("Unknown")
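
The cells above leave `Experience_year` as a mixed column of floats and the string "Unknown". The sketch below shows a different technique, not the one used in this notebook: a regular expression pulls out the first number, and pandas' nullable `Int64` dtype keeps whole numbers as integers. The sample values in `raw` are hypothetical.

```python
import pandas as pd

# Hypothetical raw entries, including a range, padding, and a non-numeric label.
raw = pd.Series(["1 - 3 years", "5", " 3 ", "Administration", None])

# Pull the first run of digits out of each entry; rows without digits become NaN.
first_number = raw.str.extract(r"(\d+)", expand=False)

# Nullable Int64 keeps whole numbers as integers and marks gaps as <NA>.
years = pd.to_numeric(first_number, errors="coerce").astype("Int64")
print(years.tolist())
```

Keeping the column numeric means later steps such as averaging or filtering by years still work. The gaps can be counted with `years.isna().sum()`.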

## Converting Columns to Categorical Types
Convert appropriate columns to categorical type to reduce memory usage.

In [88]:
Data_jobs['work_mode'] = Data_jobs['work_mode'].astype('category')
Data_jobs['job_type'] = Data_jobs['job_type'].astype('category')
Data_jobs['Experience_level'] = Data_jobs['Experience_level'].astype('category')
Data_jobs['categories'] = Data_jobs['categories'].astype('category')
Data_jobs['department'] = Data_jobs['department'].astype('category')

Data_jobs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10839 entries, 0 to 10838
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   Title             10839 non-null  object  
 1   company           10839 non-null  object  
 2   location          10839 non-null  object  
 3   job_type          10839 non-null  category
 4   work_mode         6684 non-null   category
 5   Experience_level  10839 non-null  category
 6   Experience_year   10839 non-null  object  
 7   categories        10839 non-null  category
 8   department        10839 non-null  category
 9   skills            10839 non-null  object  
dtypes: category(5), object(5)
memory usage: 528.7+ KB
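
To see why the conversion saves memory, the following sketch compares the same labels stored as `object` and as `category`. The toy column is hypothetical, and the exact byte counts depend on the pandas version and platform.

```python
import pandas as pd

# Toy column with a few distinct labels repeated many times (hypothetical data).
labels = pd.Series(["Full Time", "Part Time", "Internship"] * 1000, dtype="object")
as_category = labels.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The saving comes from storing each distinct label once and keeping a small integer code per row, so it is largest for columns with few distinct values.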


## Removing Duplicate Rows
Drops any duplicated rows to ensure data quality.


In [89]:
Data_jobs.drop_duplicates(inplace=True)

In [90]:
Data_jobs.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8234 entries, 0 to 10837
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   Title             8234 non-null   object  
 1   company           8234 non-null   object  
 2   location          8234 non-null   object  
 3   job_type          8234 non-null   category
 4   work_mode         5183 non-null   category
 5   Experience_level  8234 non-null   category
 6   Experience_year   8234 non-null   object  
 7   categories        8234 non-null   category
 8   department        8234 non-null   category
 9   skills            8234 non-null   object  
dtypes: category(5), object(5)
memory usage: 475.8+ KB


In [91]:
duplicates = Data_jobs.duplicated().sum()
print(f"Number of duplicates: {duplicates}")

Number of duplicates: 0
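
Dropping duplicates removed 2,605 of the 10,839 rows, and the `info()` output shows that the surviving rows keep their original labels (`Index: 8234 entries, 0 to 10837`). The sketch below, on a hypothetical toy frame, shows that behaviour and an optional `reset_index` step that this notebook does not apply.

```python
import pandas as pd

# Hypothetical rows: the first two are exact duplicates.
toy = pd.DataFrame({
    "Title": ["Data Analyst", "Data Analyst", "Data Engineer"],
    "company": ["Royal Herbs", "Royal Herbs", "Yodawy"],
})

# keep="first" (the default) retains the first copy of each duplicated row.
deduped = toy.drop_duplicates()
print(deduped.index.tolist())      # original labels survive, with gaps

# Optional: renumber rows 0..n-1 so positions and labels agree again.
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())
```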


## Cleaning the Skills Column
- Remove brackets and quotes
- Split skills into lists
- Create a new column containing the number of skills
- Remove the original skills column

In [92]:
Data_jobs['skills'] = Data_jobs['skills'].str.strip("[]")

In [93]:
Data_jobs['skills'] = Data_jobs['skills'].str.replace("'", "", regex=False)

In [94]:
Data_jobs['skills'] =  Data_jobs['skills'].str.split(', ')

In [95]:
Data_jobs['skills'][0]

['Database Management',
 'Python',
 'Data Analysis',
 'Machine Learning',
 'Data Visualization',
 'SQL',
 'ETL (Extract',
 'Transform',
 'Load)',
 'Big Data Technologies']

In [96]:
Data_jobs["Number_of_skills"] = Data_jobs["skills"].apply(len)

In [97]:
Data_jobs['Number_of_skills']

0        10
1         9
2         3
3         8
4         8
         ..
10824     7
10827     7
10833     7
10834     7
10837     6
Name: Number_of_skills, Length: 8234, dtype: int64
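
The list printed for row 0 splits what looks like a single skill, "ETL (Extract, Transform, Load)", into three entries. That row is therefore counted as 10 skills, where treating it as one skill would give 8. If the raw column holds Python-style list strings (an assumption, since the raw values are not shown here), `ast.literal_eval` can parse them without breaking on commas inside a skill name. The sample string below is hypothetical.

```python
import ast

import pandas as pd

# Hypothetical raw value, assuming skills were stored as a Python-style list string.
raw = pd.Series(["['Python', 'SQL', 'ETL (Extract, Transform, Load)']"])

# literal_eval parses the quoted items, so commas inside a skill name are kept.
parsed = raw.apply(ast.literal_eval)
skill_counts = parsed.str.len()

# The strip / replace / split sequence, for comparison.
naive = raw.str.strip("[]").str.replace("'", "", regex=False).str.split(", ")
print(skill_counts.iloc[0], naive.str.len().iloc[0])
```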

In [98]:
Data_jobs.drop(columns=["skills"], inplace=True)

In [99]:
Data_jobs.head()

| | Title | company | location | job_type | work_mode | Experience_level | Experience_year | categories | department | Number_of_skills |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Scientist with Database Expertise | Confidential | Riyadh, Saudi Arabia | Full Time | On-site | Experienced | 5.0 | IT/Software Development | Data Science | 10 |
| 1 | Data Analyst | Royal Herbs | Haram, Giza, Egypt | Full Time | On-site | Experienced | 3.0 | Analyst/Research | Analyst/Research | 9 |
| 2 | Data Entry Specialist | El-Dahan Company | Cairo, Egypt | Full Time | On-site | Entry Level | 1.0 | Administration | Administration | 3 |
| 3 | Junior Data Analyst | Yodawy | Mohandessin, Giza, Egypt | Full Time | On-site | Entry Level | 1.0 | Logistics/Supply Chain | Operations/Management | 8 |
| 4 | Data Analytics Specialist | MEAHCO - Saudi German Health | Katameya, Cairo, Egypt | Full Time | On-site | Experienced | 3.0 | Medical/Healthcare | Quality | 8 |


## Cleaning the Location Column
- Remove any surrounding double quotes
- Split on ", " and keep only the first part, which is the most specific place (for example "Riyadh" or "Mohandessin")
- Store the result as a single string per row

In [100]:
Data_jobs['location'] = Data_jobs['location'].str.strip('"')

In [101]:
Data_jobs['location'] =  Data_jobs['location'].str.split(', ')

In [102]:
Data_jobs['location'] =  Data_jobs['location'].str[0]

In [103]:
Data_jobs['location']

0             Riyadh
1              Haram
2              Cairo
3        Mohandessin
4           Katameya
            ...     
10824          Dubai
10827         Riyadh
10833         Riyadh
10834         Riyadh
10837         Riyadh
Name: location, Length: 8234, dtype: object
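
Keeping only the first part discards the country, so "Mohandessin" and "Riyadh" end up in the same column at different levels of detail. As a sketch of one alternative, assuming the country is always the last comma-separated part, the first and last parts can be kept in separate series. The names `area` and `country` are hypothetical; the sample values come from the `head()` output above.

```python
import pandas as pd

loc = pd.Series(["Riyadh, Saudi Arabia", "Mohandessin, Giza, Egypt", "Cairo, Egypt"])
parts = loc.str.split(", ")

# First element: the most specific place; last element: the country.
area = parts.str[0]
country = parts.str[-1]
print(area.tolist())
print(country.tolist())
```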

## Checking for Empty Work Mode Values
Count how many rows have empty or invalid work mode entries.


In [104]:
Data_jobs['work_mode'].unique()

['On-site', 'Hybrid', 'Remote', NaN]
Categories (3, object): ['Hybrid', 'On-site', 'Remote']

In [105]:
empty_count = Data_jobs['work_mode'].astype(str).str.strip().isin(["", "nan", "None"]).sum()
print(empty_count)

3051
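
Because `unique()` shows that the only non-label value is `NaN`, the count can also be taken directly with `isna()`, which avoids the round trip through strings. A minimal sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
work_mode = pd.Series(["On-site", np.nan, "Remote", np.nan, "Hybrid"], dtype="category")

missing = work_mode.isna().sum()      # number of missing entries
share = work_mode.isna().mean()       # fraction of rows that are missing
print(missing, round(share, 2))
```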


In [106]:
Data_jobs.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8234 entries, 0 to 10837
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   Title             8234 non-null   object  
 1   company           8234 non-null   object  
 2   location          8234 non-null   object  
 3   job_type          8234 non-null   category
 4   work_mode         5183 non-null   category
 5   Experience_level  8234 non-null   category
 6   Experience_year   8234 non-null   object  
 7   categories        8234 non-null   category
 8   department        8234 non-null   category
 9   Number_of_skills  8234 non-null   int64   
dtypes: category(5), int64(1), object(4)
memory usage: 733.9+ KB


## Removing the work_mode Column
`work_mode` is missing in 3,051 of the 8,234 remaining rows (about 37%), so the column is removed.

In [107]:
Data_jobs.drop(columns=["work_mode"], inplace=True)

In [108]:
Data_jobs.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8234 entries, 0 to 10837
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   Title             8234 non-null   object  
 1   company           8234 non-null   object  
 2   location          8234 non-null   object  
 3   job_type          8234 non-null   category
 4   Experience_level  8234 non-null   category
 5   Experience_year   8234 non-null   object  
 6   categories        8234 non-null   category
 7   department        8234 non-null   category
 8   Number_of_skills  8234 non-null   int64   
dtypes: category(4), int64(1), object(4)
memory usage: 725.7+ KB


## Displaying Unique Values for Each Column
This code iterates through all columns in the `Data_jobs` DataFrame and prints all unique values for each column. A separator line is added for readability.

In [109]:
for col in Data_jobs.columns:
    unique_vals = Data_jobs[col].unique()
    print(f"{col}:")
    print(unique_vals)
    print("-" * 50) 


Title:
['Data Scientist with Database Expertise' 'Data Analyst'
 'Data Entry Specialist' ... 'Agent-Purchasing'
 'Director of Rooms - Ritz Carlton Amaala, Saudi Arabia' 'AsstDir-Sales I']
--------------------------------------------------
company:
['Confidential' 'Royal Herbs' 'El-Dahan Company' ... 'Groupe Clarins'
 'Paradox EN' 'Bechtel Corporation']
--------------------------------------------------
location:
['Riyadh' 'Haram' 'Cairo' 'Mohandessin' 'Katameya' 'Dokki' 'Heliopolis'
 'Abu Rawash' 'Nasr City' 'New Cairo' 'Giza' 'Sheraton' 'Sheikh Zayed'
 'Maadi' '6th of October' 'Sidi Gaber' 'Alexandria' 'Abu Dhabi'
 '10th of Ramadan City' 'Obour City' 'New Nozha' 'Mallawi' 'Badr City'
 'New Capital' 'Hadayek Alahram' '15th May City' 'Menia' 'Madinaty'
 'Bourj Alarab' 'Smouha' 'Ameria' 'Vancouver' 'Alsadat City' 'Al Ahmadi'
 'Dkhaila' 'Agouza' 'Mokattam' 'Luxor' 'Manakh' 'Ataqah' 'Mahalla Kubra'
 'Batang' 'London' 'Shorouk City' 'Dabaa' 'Helwan' 'Bahariya Oasis'
 'Smart Village' 'Berkel
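
Printing every unique value gets long for columns such as `Title` or `location`. A shorter summary, sketched here on a hypothetical toy frame, is to count distinct values per column with `nunique()` and list only the most frequent ones with `value_counts()`.

```python
import pandas as pd

# Hypothetical sample rows.
toy = pd.DataFrame({
    "location": ["Riyadh", "Cairo", "Cairo", "Giza"],
    "job_type": ["Full Time", "Full Time", "Part Time", "Full Time"],
})

# One number per column: how many distinct values it holds.
distinct = toy.nunique()
print(distinct)

# Most frequent values of one column, with their row counts.
top_locations = toy["location"].value_counts().head(3)
print(top_locations)
```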

## Mapping `job_type` Values to English
The `job_type` column mixes English labels with an Arabic one. This code maps every entry to a consistent English label using a dictionary and updates the column accordingly.

In [110]:
job_type_map = {
    "دوام كامل": 'Full Time',
    'Freelance / Project': 'Freelance / Project',
    'Full Time': 'Full Time',
    'Internship': 'Internship',
    'Part Time': 'Part Time',
    'Shift Based': 'Shift Based',
    'Volunteering': 'Volunteering'
}
Data_jobs['job_type'] = Data_jobs['job_type'].astype(str).map(job_type_map)
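
One thing to keep in mind with `Series.map` and a dictionary is that any value missing from the dictionary becomes `NaN`. The sketch below uses a made-up unmapped value, "Contract", to show how such values can be listed before the column is overwritten, and how `replace` leaves them untouched instead.

```python
import pandas as pd

# "Contract" is a made-up value that the mapping does not cover.
job_type = pd.Series(["Full Time", "دوام كامل", "Contract"])
mapping = {"دوام كامل": "Full Time", "Full Time": "Full Time"}

mapped = job_type.map(mapping)
# Values missing from the dictionary come back as NaN; list them before overwriting.
unmapped = job_type[mapped.isna()].unique().tolist()
print(unmapped)

# replace() only touches keys present in the dictionary and leaves the rest as-is.
replaced = job_type.replace({"دوام كامل": "Full Time"})
print(replaced.tolist())
```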

In [111]:
unique_vals = Data_jobs['Experience_level'].unique()
print("Experience_level:")
print(unique_vals)


Experience_level:
['Experienced', 'Entry Level', 'Manager', 'Senior Management', 'Student', 'Not specified', 'ذو خبرة', 'مستوى مبتدئ', 'مدير']
Categories (9, object): ['Entry Level', 'Experienced', 'Manager', 'Not specified', ..., 'Student', 'ذو خبرة', 'مدير', 'مستوى مبتدئ']


## Mapping `Experience_level` Values to English
The `Experience_level` column contains both English and Arabic labels (for example 'ذو خبرة' alongside 'Experienced'). This code maps every entry to a consistent English label using a dictionary and updates the column accordingly.

In [112]:
Experience_level_map = {
    "ذو خبرة" : 'Experienced',
    'مستوى مبتدئ' : 'Entry Level',
    'مدير' : 'Manager', 
    'Experienced' : 'Experienced', 
    'Entry Level' : 'Entry Level', 
    'Manager' : 'Manager', 
    'Senior Management' : 'Senior Management',
    'Student' : 'Student',
    'Not specified' : 'Not specified'
}
Data_jobs['Experience_level'] = Data_jobs['Experience_level'].astype(str).map(Experience_level_map)

In [113]:
unique_vals = Data_jobs['Experience_level'].unique()
print("Experience_level:")
print(unique_vals)

Experience_level:
['Experienced' 'Entry Level' 'Manager' 'Senior Management' 'Student'
 'Not specified']
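
The output above is a plain array with no `Categories` line, which suggests the column is no longer categorical after `astype(str).map(...)`. If the categorical dtype is still wanted, one option is to convert back after mapping. This is a sketch on hypothetical values with a shortened, hypothetical dictionary `to_english`.

```python
import pandas as pd

# Hypothetical categorical column mixing English and Arabic labels.
level = pd.Series(["Experienced", "ذو خبرة", "مدير", "Manager"], dtype="category")
to_english = {
    "ذو خبرة": "Experienced",
    "مدير": "Manager",
    "Experienced": "Experienced",
    "Manager": "Manager",
}

# Map via strings, then convert back so the column is categorical again.
english = level.astype(str).map(to_english).astype("category")
print(english.cat.categories.tolist())
```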


In [114]:
for col in Data_jobs.columns:
    unique_vals = Data_jobs[col].unique()
    print(f"{col}:")
    print(unique_vals)
    print("-" * 50) 

Title:
['Data Scientist with Database Expertise' 'Data Analyst'
 'Data Entry Specialist' ... 'Agent-Purchasing'
 'Director of Rooms - Ritz Carlton Amaala, Saudi Arabia' 'AsstDir-Sales I']
--------------------------------------------------
company:
['Confidential' 'Royal Herbs' 'El-Dahan Company' ... 'Groupe Clarins'
 'Paradox EN' 'Bechtel Corporation']
--------------------------------------------------
location:
['Riyadh' 'Haram' 'Cairo' 'Mohandessin' 'Katameya' 'Dokki' 'Heliopolis'
 'Abu Rawash' 'Nasr City' 'New Cairo' 'Giza' 'Sheraton' 'Sheikh Zayed'
 'Maadi' '6th of October' 'Sidi Gaber' 'Alexandria' 'Abu Dhabi'
 '10th of Ramadan City' 'Obour City' 'New Nozha' 'Mallawi' 'Badr City'
 'New Capital' 'Hadayek Alahram' '15th May City' 'Menia' 'Madinaty'
 'Bourj Alarab' 'Smouha' 'Ameria' 'Vancouver' 'Alsadat City' 'Al Ahmadi'
 'Dkhaila' 'Agouza' 'Mokattam' 'Luxor' 'Manakh' 'Ataqah' 'Mahalla Kubra'
 'Batang' 'London' 'Shorouk City' 'Dabaa' 'Helwan' 'Bahariya Oasis'
 'Smart Village' 'Berkel

## Saving the Cleaned Dataset
Export the final cleaned dataset to a new CSV file.

In [115]:
Data_jobs.to_csv("cleaned_jobs.csv", index=False)
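
CSV stores plain text, so categorical dtypes are not preserved in the file. The sketch below writes a hypothetical toy frame to a temporary file and requests the `category` dtype again on reload. The file name `cleaned_jobs_demo.csv` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows with one categorical column.
toy = pd.DataFrame({
    "Title": ["Data Analyst", "Data Engineer"],
    "job_type": pd.Categorical(["Full Time", "Part Time"]),
})

# Write to a temporary file so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "cleaned_jobs_demo.csv")
toy.to_csv(path, index=False)

# The category dtype has to be requested explicitly when reading the file back.
reloaded = pd.read_csv(path, dtype={"job_type": "category"})
print(reloaded.shape, reloaded["job_type"].dtype)
```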