Python 3.9.13
OS: Windows 

# HOW TO CLEAN DIRTY DATA

### Content

The dataset used for this exercise was taken from Kaggle and has the following fields:

* `country`
* `country_code`
* `date_added`
* `has_expired` - Always `false`.
* `job_description` - The primary field for this dataset, containing the bulk of the information on what the job is about.
* `job_title`
* `job_type` - The type of tasks and skills involved in the job. For example, "management".
* `location`
* `organization`
* `page_url`
* `salary`
* `sector` - The industry sector the job is in. For example, "Medical services".

### I generally classify dirty data into 2 categories: Structure Dirty and Content Dirty

In [1]:
# First step: load the libraries
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import csv, parquet

### Let's convert the CSV to Parquet, because it is one of the most widely used formats in big data projects and it is good to get used to it

In [2]:

table = csv.read_csv("Data/monster_com-job_sample.csv")
parquet.write_table(table, "Data/monster_com-job_sample.parquet")

In [3]:
df = pd.read_parquet("Data/monster_com-job_sample.parquet")

In [4]:
df.shape

(22000, 14)

In [5]:
df.info()
# Judging by the non-null counts, there seem to be no null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22000 entries, 0 to 21999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   country          22000 non-null  object
 1   country_code     22000 non-null  object
 2   date_added       22000 non-null  object
 3   has_expired      22000 non-null  object
 4   job_board        22000 non-null  object
 5   job_description  22000 non-null  object
 6   job_title        22000 non-null  object
 7   job_type         22000 non-null  object
 8   location         22000 non-null  object
 9   organization     22000 non-null  object
 10  page_url         22000 non-null  object
 11  salary           22000 non-null  object
 12  sector           22000 non-null  object
 13  uniq_id          22000 non-null  object
dtypes: object(14)
memory usage: 2.3+ MB


In [6]:
df.head()
# However, if we look at the data directly, we can see empty cells, for instance in date_added and organization
# So how do we identify null values?

Unnamed: 0,country,country_code,date_added,has_expired,job_board,job_description,job_title,job_type,location,organization,page_url,salary,sector,uniq_id
0,United States of America,US,,No,jobs.monster.com,TeamSoft is seeing an IT Support Specialist to...,IT Support Technician Job in Madison,Full Time Employee,"Madison, WI 53702",,http://jobview.monster.com/it-support-technici...,,IT/Software Development,11d599f229a80023d2f40e7c52cd941e
1,United States of America,US,,No,jobs.monster.com,The Wisconsin State Journal is seeking a flexi...,Business Reporter/Editor Job in Madison,Full Time,"Madison, WI 53708",Printing and Publishing,http://jobview.monster.com/business-reporter-e...,,,e4cbb126dabf22159aff90223243ff2a
2,United States of America,US,,No,jobs.monster.com,Report this job About the Job DePuy Synthes Co...,Johnson & Johnson Family of Companies Job Appl...,"Full Time, Employee",DePuy Synthes Companies is a member of Johnson...,Personal and Household Services,http://jobview.monster.com/senior-training-lea...,,,839106b353877fa3d896ffb9c1fe01c0
3,United States of America,US,,No,jobs.monster.com,Why Join Altec? If you’re considering a career...,Engineer - Quality Job in Dixon,Full Time,"Dixon, CA",Altec Industries,http://jobview.monster.com/engineer-quality-jo...,,Experienced (Non-Manager),58435fcab804439efdcaa7ecca0fd783
4,United States of America,US,,No,jobs.monster.com,Position ID# 76162 # Positions 1 State CT C...,Shift Supervisor - Part-Time Job in Camphill,Full Time Employee,"Camphill, PA",Retail,http://jobview.monster.com/shift-supervisor-pa...,,Project/Program Management,64d0272dc8496abfd9523a8df63c184c


In [7]:
df.eq('').sum()
# This counts the empty strings in each column
# According to the earlier shape call, there are 22000 rows
# Let's find out the percentage of empty values

country                0
country_code           0
date_added         21878
has_expired            0
job_board              0
job_description        0
job_title              0
job_type            1628
location               0
organization        6867
page_url               0
salary             18554
sector              5194
uniq_id                0
dtype: int64
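
An alternative to counting empty strings by hand is to turn them into real missing values, so that `info()`, `isna()` and friends report them directly. This is a minimal sketch on a made-up three-row frame. The regex also catches whitespace-only cells, which `df.eq('')` would miss. Whether such cells exist in the real data is an assumption I have not checked.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real dataset.
sample = pd.DataFrame(
    {
        "job_title": ["Engineer", "Teacher", "Accountant"],
        "organization": ["", "Education", "   "],
        "salary": ["", "", "25.00 - 28.00 $ /hour"],
    }
)

# Replace empty and whitespace-only strings with NaN.
cleaned = sample.replace(r"^\s*$", np.nan, regex=True)

# Now the usual pandas null tooling works.
null_counts = cleaned.isna().sum()
print(null_counts)
```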

In [8]:
perc = df.eq('').sum()/22000
perc*100

# A reasonable rule of thumb is that columns with more than 30% empty values are not worth studying. Nevertheless, the salary column is important,
# since it could serve as the label for an ML model
# What can I do to solve this situation?
# The first thing that comes to mind: how can I fill the empty values in the salary column?

country             0.000000
country_code        0.000000
date_added         99.445455
has_expired         0.000000
job_board           0.000000
job_description     0.000000
job_title           0.000000
job_type            7.400000
location            0.000000
organization       31.213636
page_url            0.000000
salary             84.336364
sector             23.609091
uniq_id             0.000000
dtype: float64
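
The 30% rule of thumb can be sketched in code. The example below uses a made-up four-row frame, and the names `THRESHOLD` and `KEEP_ANYWAY` are hypothetical helpers I introduce only for illustration. It computes the percentage without hard-coding the row count and drops the sparse columns, except the ones we decide to keep.

```python
import pandas as pd

# Made-up sample, not the real dataset.
sample = pd.DataFrame(
    {
        "country": ["US", "US", "US", "US"],
        "date_added": ["", "", "", "5/10/2016"],
        "salary": ["", "", "", "25.00 - 28.00 $ /hour"],
        "sector": ["", "Retail", "Engineering", "Education"],
    }
)

# The mean of a boolean frame is the fraction of True values per column.
empty_pct = sample.eq("").mean() * 100

THRESHOLD = 30            # percent of empty values we tolerate
KEEP_ANYWAY = {"salary"}  # columns kept even when they are sparse

to_drop = [
    col
    for col, pct in empty_pct.items()
    if pct > THRESHOLD and col not in KEEP_ANYWAY
]
reduced = sample.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

Using `mean()` instead of dividing by 22000 keeps the calculation correct if rows are later added or removed.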

# Structure Dirty

In [11]:
df[df['salary']!='']
# To solve the salary column issue, we have to start cleaning the data
# salary holds a range between two values plus some string characters
# Let's split them and give them their own schema

Unnamed: 0,country,country_code,date_added,has_expired,job_board,job_description,job_title,job_type,location,organization,page_url,salary,sector,uniq_id
13,United States of America,US,,No,jobs.monster.com,Launch your teaching career with the Leader in...,Primrose Private Preschool Teacher Job in Houston,Full Time,"Houston, TX 77098",Education,http://jobview.monster.com/Primrose-Private-Pr...,9.00 - 13.00 $ /hour,Entry Level,b43c077756d5a326c4854e1399fd2464
14,United States of America,US,,No,jobs.monster.com,Construction Professional For more than 15 yea...,Superintendent Job in Houston,Full Time Employee,"Houston, TX",Construction - Industrial Facilities and Infra...,http://jobview.monster.com/Superintendent-Job-...,"80,000.00 - 95,000.00 $ /year",Building Construction/Skilled Trades,d8491fcefe14d1398de419984dccf427
19,United States of America,US,,No,jobs.monster.com,"Competitive compensation package, excellent co...",Technician - Robot & Multi-Axis CNC Field Serv...,Full Time,"Carter Lake, IA 51510",,http://jobview.monster.com/Technician-Robot-Mu...,"60,000.00 - 72,000.00 $ /year",Experienced (Non-Manager),3bef462fc38d743c7fbce17cf50ee7d5
23,United States of America,US,,No,jobs.monster.com,"Well respected, rapidly growing, and expandin...",Estimator - Construction Job in Denver,Full Time,"Denver, CO 80215",,http://jobview.monster.com/Estimator-Construct...,Excellent Pay and Incentives,,c552f63b5497f720942aaf943d629b1c
29,United States of America,US,,No,jobs.monster.com,Experis is working with a Pharmaceutical start...,Sr. Process Engineer,Full Time Employee,"Sr. Process Engineer, Manufacturing","Chicago, IL",http://jobview.monster.com/Sr-Process-Engineer...,"70,000.00 - 100,000.00 $ /year",Engineering,779bb4c9bf038b7fb775134736d36fd4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21987,United States of America,US,,No,jobs.monster.com,Own Your TerritoryAsk yourself these questions...,Experienced Territory Sales,Full Time,"Chicago, IL 60602",,http://jobview.monster.com/Experienced-Territo...,"$80,000.00+ /year",Experienced (Non-Manager),a696d26d6db9c5abe572117b9483e847
21995,United States of America,US,,No,jobs.monster.com,This is a major premier Cincinnati based finan...,Assistant Vice President - Controller Job in C...,Full Time,"Cincinnati, OH",,http://jobview.monster.com/Assistant-Vice-Pres...,"120,000.00 - 160,000.00 $ /yearbonus",,a80bc8cc3a90c17eef418963803bc640
21996,United States of America,US,,No,jobs.monster.com,Luxury homebuilder in Cincinnati seeking multi...,Accountant Job in Cincinnati,Full Time,"Cincinnati, OH 45236",Construction - Residential & Commercial/Office,http://jobview.monster.com/Accountant-Job-Cinc...,"45,000.00 - 60,000.00 $ /year",Manager (Manager/Supervisor of Staff),419a3714be2b30a10f628de207d041de
21998,United States of America,US,,No,jobs.monster.com,Jernberg Industries was established in 1937 an...,Electrician - Experienced Forging Electrician ...,Full Time Employee,"Chicago, IL 60609","Jernberg Industries, Inc.",http://jobview.monster.com/Electrician-Experie...,25.00 - 28.00 $ /hour,Installation/Maintenance/Repair,40161cf61c283af9dc2b0a62947a5f1b
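
One possible way to split the salary text is a regular expression with named groups. The sketch below only handles the `min - max $ /period` shape seen in the rows above. Free text such as "Excellent Pay and Incentives" and single values such as "$80,000.00+ /year" are left as missing. The column names `salary_min`, `salary_max` and `salary_period` are my own choice, not part of the dataset.

```python
import pandas as pd

# Salary strings copied from the output above, plus an empty one.
salary = pd.Series(
    [
        "9.00 - 13.00 $ /hour",
        "80,000.00 - 95,000.00 $ /year",
        "Excellent Pay and Incentives",
        "$80,000.00+ /year",
        "120,000.00 - 160,000.00 $ /yearbonus",
        "",
    ],
    name="salary",
)

# Two numbers separated by a dash, then "$ /hour" or "$ /year".
RANGE_PATTERN = (
    r"(?P<salary_min>[\d,]+\.\d+)\s*-\s*"
    r"(?P<salary_max>[\d,]+\.\d+)\s*\$\s*/"
    r"(?P<salary_period>hour|year)"
)

# Rows that do not match the pattern become NaN in every new column.
parsed = salary.str.extract(RANGE_PATTERN)

# Remove the thousands separators and convert the bounds to numbers.
for col in ("salary_min", "salary_max"):
    parsed[col] = pd.to_numeric(parsed[col].str.replace(",", "", regex=False))

print(parsed)
```

Keeping the period in its own column matters because hourly and yearly figures are not comparable until they are put on the same scale.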


Correlation

In [9]:
df.dtypes

country            object
country_code       object
date_added         object
has_expired        object
job_board          object
job_description    object
job_title          object
job_type           object
location           object
organization       object
page_url           object
salary             object
sector             object
uniq_id            object
dtype: object