# Data cleaning for Job Posting Data:

_Author: Matěj Srna_
***
<img src="https://static.wixstatic.com/media/2d21b4_568fb2e8e16e4a3b8dc39ddbd8802708~mv2.png" style="width:350px; height:350px; margin-left: auto; margin-right: auto" />
 
> _**Disclaimer: This data cleaning is for educational purposes only.**_

#### Content:
1. [Module imports](#Module-imports:)
2. [Import of the data from CSV](#Import-of-the-data-from-CSV:)
3. [Jobs filtering based on the key words](#Jobs-filtering-based-on-the-key-words:)
4. [Deleting all unnecessary job postings from the dataset](#Deleting-all-unnecessary-job-postings-from-the-dataset:)
5. [Renaming of all job postings with uncommon names](#Renaming-of-all-job-postings-with-uncommon-names:)
6. [Saving the clean dataset to CSV for further analysis](#Saving-the-clean-dataset-to-CSV-for-further-analysis:)

### Module imports:

***
For this analysis, I imported several libraries. I mainly used pandas to work easily with the data in DataFrames.

I also used NumPy, specifically to convert the array of unique job names into a list.

***

In [73]:
import pandas as pd
from datetime import date
from time import sleep
import glob
import numpy as np

### Import of the data from CSV:

***
Here the data is loaded into a pandas DataFrame. The cleaned and summarized data was prepared in the web-scraping tool, where the datasets from each day were merged into one summary dataset. To get an overview of the size of the table, I print the shape of the DataFrame.
***

In [74]:
path =r"DATA/summary_data.csv"

raw_data = pd.read_csv(path)

In [75]:
print(f"Size of the table is: {raw_data.shape}")

Size of the table is: (2380, 8)
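
The merge itself happened in the web-scraping tool and is not shown in this notebook. As a rough sketch of what such a step could look like, the example below stacks daily CSV files with `glob` and `pd.concat`. The folder, the `jobs_*.csv` file names, and the two columns are assumptions made up for illustration, not the real layout of the scraped data.

```python
import glob
import os
import tempfile

import pandas as pd

# Create two tiny daily files in a temporary folder.
# File names and columns are hypothetical, chosen only for this sketch.
folder = tempfile.mkdtemp()
pd.DataFrame({"name": ["Data Scientist"], "date": ["2022-05-15"]}).to_csv(
    os.path.join(folder, "jobs_2022-05-15.csv"), index=False
)
pd.DataFrame({"name": ["Data Analyst"], "date": ["2022-05-16"]}).to_csv(
    os.path.join(folder, "jobs_2022-05-16.csv"), index=False
)

# Read every daily file and stack the rows into one summary table.
daily_files = sorted(glob.glob(os.path.join(folder, "jobs_*.csv")))
summary = pd.concat([pd.read_csv(f) for f in daily_files], ignore_index=True)

print(f"Size of the table is: {summary.shape}")
```

`ignore_index=True` gives the merged table a fresh 0..n-1 index instead of repeating each file's own row numbers.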


### Jobs filtering based on the key words:

***
This part of the code finds the job posting names that do not contain any of the keywords "Data", "Analyst", "Machine", "Business", "Science", or "AI". These postings will be deleted because they are not relevant for this analysis.

***

In [76]:
# Job titles containing none of the keywords are not relevant for this analysis.
# A plain alternation (no capture groups) avoids the pandas "match groups" warning.
to_be_deleted_raw = raw_data.loc[~raw_data["name"].str.contains(r"Data|Analyst|Machine|Business|Science|AI"), "name"].unique()
# Convert the NumPy array of unique names into a plain Python list.
to_be_deleted = np.ndarray.tolist(to_be_deleted_raw)

#print(to_be_deleted)


<div class="alert alert-block alert-info">
<b>Tip:</b> To delete all unnecessary data easily and automatically, I put all the names into a list so that I can iterate through it later on.</div>
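
To make the keyword filter easier to follow, here is a minimal sketch on a small, made-up table. The sample titles and the variable names `sample`, `is_relevant`, and `irrelevant_names` are assumptions for illustration only; the keywords are the ones listed in the text.

```python
import pandas as pd

# Hypothetical sample titles, used only for illustration.
sample = pd.DataFrame(
    {"name": ["Data Scientist", "Truck Driver", "AI Engineer", "Office Manager"]}
)

# Alternation of the keywords from the text (matching is case-sensitive).
keywords = r"Data|Analyst|Machine|Business|Science|AI"

# True where the title contains at least one keyword.
is_relevant = sample["name"].str.contains(keywords)

# Unique titles without any keyword, as a plain Python list.
irrelevant_names = sample.loc[~is_relevant, "name"].unique().tolist()
print(irrelevant_names)
```

Note that `str.contains` matches substrings, so a short keyword such as "AI" also matches inside longer upper-case words. Depending on the data, a stricter pattern may be worth considering.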

### Deleting all unnecessary job postings from the dataset:

***
Based on the list created above, I iterate over the names and drop every matching row from the DataFrame. The dataset is then free of all job postings not related to this analysis.
***

In [77]:
# Drop every row whose name is in the list of irrelevant job titles.
for name in to_be_deleted:
    #print(name)
    raw_data.drop(raw_data.index[raw_data["name"] == name], inplace=True)

<div class="alert alert-block alert-success">
<b>Success:</b> To delete all the unnecessary data, I iterate through the list and call the pandas drop function for each name.
</div>
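
The same deletion can also be expressed without a loop, using a boolean mask built with `isin`. This is only an alternative sketch on a made-up table; the sample rows and the name `filtered` are assumptions, and the notebook itself uses the loop shown above.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
sample = pd.DataFrame(
    {
        "name": ["Data Scientist", "Truck Driver", "Data Analyst", "Truck Driver"],
        "company_name": ["A", "B", "C", "D"],
    }
)
to_be_deleted = ["Truck Driver"]

# Keep only rows whose name is not in the deletion list, then renumber the index.
filtered = sample[~sample["name"].isin(to_be_deleted)].reset_index(drop=True)
print(filtered.shape)
```

A single mask scans the column once, while the loop scans it once per name in the list. This difference tends to matter more as the list grows.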

In [78]:
raw_data = raw_data.reset_index(drop=True)
clean_data = raw_data.drop_duplicates()
#print(raw_data)
clean_data.shape

(1725, 8)

In [90]:
display(clean_data["name"].value_counts())

Data Scientist                                             1298
Senior Data Scientist                                       156
Big Data Engineer                                           107
Junior Data Scientist                                        57
Lead Data Scientist                                          54
Data Analyst                                                 24
AI Engineer                                                  12
Staff Data Scientist                                          4
Senior Data Engineer                                          4
Data Engineer                                                 3
Senior Data Analyst                                           1
Business Intelligence Developers                              1
Data Crawling Engineer                                        1
Lead Data Engineer                                            1
Staff Data Engineer                                           1
Sap Advanced Business Application Progra

### Renaming of all job postings with uncommon names:

***
In this part of the code, the names of the job postings are converted into more common names. Recruiters tend to vary their job titles to attract more applicants, so names with added specifications are mapped to their common equivalents.

***

In [80]:
#clear for all data scientist
data_science_rem = ["Machine Learning Engineer", "Embedded SW Engineer - Machine Learning", "Machine Learning SW Engineer", "Data Science Engineer - ML Projects", "Data Scientist, TikTok Creation and Consumption",
                    "Data Scientist - TikTok Account", "Data Scientist - Search", "Data Scientist, Ads Analytics", "Associate, Data Science", "Data Scientist - Relationship discovery", "Data Scientist (SAS)",
                    "Data Scientist, Special Projects", "Data Scientist - Hybrid", "Data Scientist I", "Data Scientist - Risk Data Mining", "Data Scientist - Hybrid - Urgent", 
                    "Multi Asset  Quant/Data Scientist role", "Data Scientist, Tiktok Ads- Growth Marketing", "NLP Researcher (Data Scientist)", "Data Scientist, Consultant", "Data Science Specialist",
                    "Data Scientist - Tiktok Ads, Ads Measurement", "Data Scientist- W2 ONLY", "Data Scientist - Marketing Analytics, Lakeland", "Data Scientist - NIH", 
                    "Data Scientist, User Growth - TikTok US - Tech Services", "Data Scientist - Pricing", "Staff Data Scientist - AI Conversational", "Data Scientist/Biostatician- RWE",
                    "Customer Data Scientist", "Data Scientist - North America", "Data Scientist, Tiktok Experience", "Data Scientist/Statistician", "Data Scientist, Tiktok Ads Interfaces",
                    "Marketing Data Scientist", "Staff Data Scientist - Strategy & Insights", "Data Scientist Analyst", "Data Scientist Intern", "Data Scientist, Analytics (Core Product)",
                    "Associate Data Scientist", "Data Scientist/Python Programmer (Hybrid)", "Data Scientist / AI Engineer", "Entry Level Data Scientist", "Director, Data & Technology Scientist", 
                    "Data Scientist - Rare Diseases", 
                   ]

for n in data_science_rem:
    clean_data.loc[clean_data["name"] == n, "name"] = "Data Scientist"

    
#clear for all senior data scientist
data_science_s_rem = ["Senior Data analyst/Scientist", "Senior Data Scientist for Time Series projects", "Senior Data Engineer (Store No8 | Health & Wellness)", 
                      "Senior Machine Learning Engineer, Recommendation - US Tech Services", "Senior Software Engineer, Machine Learning Platform", "Python, Data Science and Machine Learning",
                      "Data Scientist II, Analytics", "Sr. Data Scientist", "Senior Data Scientist (Freelance)", "Data Scientist II, Product Analytics", "Senior Data Scientist for Dataclair AI Centre (m/f)",
                      "Senior Data Scientist - Banking", "Senior Machine Learning Engineer", "Data Scientist III, Analytics", "Machine Learning Researcher", "Data Scientist, Analytics II", 
                      "Data Scientist, Machine Learning", "Senior Data Scientist – Healthcare (MCMC)", 
                     ]

for n in data_science_s_rem:
    clean_data.loc[clean_data["name"] == n, "name"] = "Senior Data Scientist"


#clear for all lead data scientist
data_science_l_rem = ["Software developer in the field of Machine Vision", "Tech Lead, Machine Learning Engineer, Recommendation & Algorithm", "Senior Staff Machine Learning Engineer", 
                      "Lead Data Scientist - Vision Care", "Data Science Lead",
    
                     ]

for n in data_science_l_rem:
    clean_data.loc[clean_data["name"] == n, "name"] = "Lead Data Scientist"


#clear for all data analyst
data_analyst_rem = ["Employee Benefits Underwriter - Data Analyst", "Data Analyst – Great opportunity in a world of Telecommunications", "Fleet Project & Process & Data Analyst", 
                    "Datový analytik - oblast Data Governance", "Data analytik", "Data Analytics\/Integration Developer", "Entry Level Data Analyst", 
                    "Qualitative Data Analyst, Vaccine Equity Partner Engagement",
    
                   ]
for n in data_analyst_rem:
    clean_data.loc[clean_data["name"] == n, "name"] = "Data Analyst"

    
#clean for all Big data engineer
big_data_rem = ["Junior Java Software Engineer (Big Data processing and Data Mining)", "Full Stack Software Engineer (.NET and react.js), Data & Services, Locations Program", 
                "Data Engineer - Big Data + Digital Marketing", "Cloud Data Engineer | Renewable Energy Trading Firm | Boston", "Sr. Data Engineer", "Data Engineer II", 
                "Java Software Engineer – Big data / Datalake",
                    
               ]
for n in big_data_rem:
    clean_data.loc[clean_data["name"] == n, "name"] = "Big Data Engineer"
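
The renaming above can also be written as one lookup table from uncommon name to common name. The sketch below uses a small excerpt of the lists above; the names `rename_map` and `sample` are assumptions for illustration, and the notebook itself uses the loops.

```python
import pandas as pd

# Excerpt of the rename lists above, expressed as one mapping
# (uncommon name -> common name).
rename_map = {
    "Sr. Data Scientist": "Senior Data Scientist",
    "Data Scientist I": "Data Scientist",
    "Data Science Lead": "Lead Data Scientist",
    "Entry Level Data Analyst": "Data Analyst",
}

# Hypothetical sample data, used only for illustration.
sample = pd.DataFrame(
    {"name": ["Sr. Data Scientist", "Data Scientist I", "AI Engineer"]}
)

# Series.replace with a dict swaps exact matches and leaves other values untouched.
sample["name"] = sample["name"].replace(rename_map)
print(sample["name"].tolist())
```

Keeping all pairs in one dictionary makes it easy to see which target each title maps to, and to spot a title that was accidentally listed under two targets.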


***
Here you can see the job posting names after renaming, ready for further analysis.

***

In [93]:
clean_data["name"].value_counts()

Data Scientist                                             1298
Senior Data Scientist                                       156
Big Data Engineer                                           107
Junior Data Scientist                                        57
Lead Data Scientist                                          54
Data Analyst                                                 24
AI Engineer                                                  12
Staff Data Scientist                                          4
Senior Data Engineer                                          4
Data Engineer                                                 3
Senior Data Analyst                                           1
Business Intelligence Developers                              1
Data Crawling Engineer                                        1
Lead Data Engineer                                            1
Staff Data Engineer                                           1
Sap Advanced Business Application Progra

In [82]:
# additional drop
print(f"Size of the table is: {clean_data.shape}")

Size of the table is: (1725, 8)


***
Renaming can make previously distinct rows identical, so in this part of the code I drop duplicates again to make sure that every row in the dataset is unique.

***

In [83]:
clean_data = clean_data.drop_duplicates()
clean_data.shape

(1725, 8)

In [94]:
display(clean_data.isnull().value_counts())

name   company_name  ago    contract  location  date   description  link 
False  False         False  False     False     False  False        False    1725
dtype: int64
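
The check above counts each combination of missing-value flags across the rows. A per-column count can be easier to read when some values are missing. A minimal sketch on made-up data (the sample table and the name `missing_per_column` are assumptions):

```python
import pandas as pd

# Hypothetical sample data with one missing location, used only for illustration.
sample = pd.DataFrame(
    {
        "name": ["Data Scientist", "Data Analyst"],
        "location": ["Brno", None],
    }
)

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```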

In [85]:
clean_data["company_name"].value_counts()

HP                           130
LHH                           98
TikTok                        97
NXP Semiconductors            81
Ataccama                      68
                            ... 
Experis IT Czech Republic      1
DoDo                           1
MONETA Money Bank              1
ITAB Group                     1
PTC                            1
Name: company_name, Length: 216, dtype: int64

In [86]:
clean_data.loc[clean_data["location"] == "Praha", "location"] = "Hlavní město Praha, Česko"

clean_data.loc[clean_data["location"] == "New York, NY", "location"] = "New York, United States"  
clean_data.loc[clean_data["location"] == "New York City Metropolitan Area", "location"] = "New York, United States"

In [87]:
clean_data["location"].value_counts()

Hlavní město Praha, Česko    443
Brno                         110
Austin, TX                   100
New York, United States       75
Mountain View, CA             65
                            ... 
Schaumburg, IL                 1
Zlín a okolí                   1
Hlavní město Praha             1
Jacksonville, FL               1
Denver Metropolitan Area       1
Name: location, Length: 114, dtype: int64

***
Before saving the dataset to a CSV file, I display the number of unique jobs for each scraping date, sorted chronologically.

***

In [88]:
clean_data["date"].value_counts().sort_index()

2022-05-15    66
2022-05-16    65
2022-05-17    90
2022-05-18    78
2022-05-19    97
2022-05-20    65
2022-05-21    79
2022-05-22    56
2022-05-23    51
2022-05-24    95
2022-05-25    75
2022-05-26    85
2022-05-27    73
2022-05-28    61
2022-05-29    54
2022-05-30    60
2022-05-31    97
2022-06-01    69
2022-06-02    57
2022-06-03    39
2022-06-04    78
2022-06-05    82
2022-06-06    62
2022-06-07    38
2022-06-08    53
Name: date, dtype: int64
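
As a small illustration of how the per-day counts are ordered, the sketch below runs the same `value_counts().sort_index()` call on made-up dates and also shows the descending variant. The sample table and the names `per_day` and `newest_first` are assumptions.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
sample = pd.DataFrame({"date": ["2022-05-16", "2022-05-15", "2022-05-16"]})

# Count the rows per date and sort by date, oldest first.
per_day = sample["date"].value_counts().sort_index()

# The same counts with the newest date first.
newest_first = per_day.sort_index(ascending=False)

print(per_day)
print(newest_first)
```

Because the dates are ISO-formatted strings (`YYYY-MM-DD`), sorting them as text gives the same order as sorting them as dates.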

### Saving the clean dataset to CSV for further analysis:

In [89]:
clean_data.to_csv("DATA_IMPORT/clean_data.csv", index=False, encoding="utf-8")
print("Done")

Done


<div class="alert alert-block alert-success">
<b>Success:</b> All the data was saved to a CSV file.
</div>
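
`to_csv` does not create missing folders, so the target directory has to exist before saving. The sketch below creates it first and then reads the file back to confirm that non-ASCII text survives the round trip. It writes to a temporary location, and the sample row is made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data with non-ASCII characters, used only for illustration.
sample = pd.DataFrame(
    {"name": ["Data Scientist"], "location": ["Hlavní město Praha, Česko"]}
)

# Create the output folder (inside a temporary directory for this sketch).
out_dir = os.path.join(tempfile.mkdtemp(), "DATA_IMPORT")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "clean_data.csv")

# Save without the index column, then read the file back.
sample.to_csv(out_path, index=False, encoding="utf-8")
restored = pd.read_csv(out_path, encoding="utf-8")
print(restored.shape)
```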

<br>
<br>
<span style="color:green"><p style="text-align:right; font-style: italic"><b>Matěj Srna</b></p></span>
<hr></hr>
<em>My links:</em>
<br>
<table style="float:left; width: 250px; border-collapse: separate;">
<thead>
<tr><th><a href="https://www.linkedin.com/in/matejsrna" target="_blank" rel="noopener noreferrer"><img src="https://static.wixstatic.com/media/2d21b4_80567ee7301a4a50ada13620eaa1028d~mv2.png" style="width:48px; height:48px"/></a></th><th><a href="https://github.com/srnamaty" target="_blank" rel="noopener noreferrer"><img src="https://static.wixstatic.com/media/2d21b4_1d247a3f36384cd8b0eecf23b2010fae~mv2.png" style="width:58px; height:58px" /></a></th><th><a href="https://www.srnamatej.com" target="_blank" rel="noopener noreferrer"><img src="https://static.wixstatic.com/media/2d21b4_17665bc36b10443c8446ea225ce8f748~mv2.png" style="width:58px; height:58px" /></a></th></tr>
</thead>
</table>