Python 3.9.13
OS: Windows 

# HOW TO CLEAN DIRTY DATA

### Content

The dataset used for this exercise was taken from Kaggle and has the following fields:

* `country`
* `country_code`
* `date_added`
* `has_expired` - Always `false`.
* `job_description` - The primary field for this dataset, containing the bulk of the information on what the job is about.
* `job_title`
* `job_type` - The type of tasks and skills involved in the job. For example, "management".
* `location`
* `organization`
* `page_url`
* `salary`
* `sector` - The industry sector the job is in. For example, "Medical services".

### I generally classify dirty data into 2 categories: Structure Dirty and Content Dirty

In [497]:
# First step: load the libraries
import pandas as pd
import pyarrow as pa
import numpy as np
import pyarrow.parquet as pq
from pyarrow import csv, parquet

### Let's load the CSV file to work with it

In [498]:

df = pd.read_csv("Data/monster_com-job_sample.csv")

In [499]:
type(df)
# Let's check the type of the dataset; of course it has to be a pandas DataFrame

pandas.core.frame.DataFrame

In [500]:
shape_=df.shape
# The dataset has 22000 rows and 14 columns. This is useful to know when cleaning data because it lets you check the effect of each change

In [501]:
df.isnull().sum()/shape_[0]*100
# isnull().sum() counts the null values in each column
# According to the shape above, there are 22000 rows,
# so dividing by the row count gives the percentage of empty values

country             0.000000
country_code        0.000000
date_added         99.445455
has_expired         0.000000
job_board           0.000000
job_description     0.000000
job_title           0.000000
job_type            7.400000
location            0.000000
organization       31.213636
page_url            0.000000
salary             84.336364
sector             23.609091
uniq_id             0.000000
dtype: float64
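
The same percentages can be obtained without storing the shape first, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a small made-up frame (the `toy` data is hypothetical, not the Monster.com sample):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the job postings
toy = pd.DataFrame({
    "job_title": ["Engineer", "Nurse", "Accountant", "Chef"],
    "salary": [np.nan, "25.00 - 28.00 $ /hour", np.nan, np.nan],
})

# isna() gives a boolean mask; its column-wise mean is the share of nulls
null_pct = toy.isna().mean() * 100
print(null_pct)
```

Sorting the result with `sort_values(ascending=False)` puts the emptiest columns first, which can make drop candidates easier to spot.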

In [502]:
df = df.set_index('uniq_id')
df.head(5)
# Before doing anything else, we set the 'uniq_id' column as the index; this way we avoid integrity issues when doing a join or merge

uniq_id,country,country_code,date_added,has_expired,job_board,job_description,job_title,job_type,location,organization,page_url,salary,sector
11d599f229a80023d2f40e7c52cd941e,United States of America,US,,No,jobs.monster.com,TeamSoft is seeing an IT Support Specialist to...,IT Support Technician Job in Madison,Full Time Employee,"Madison, WI 53702",,http://jobview.monster.com/it-support-technici...,,IT/Software Development
e4cbb126dabf22159aff90223243ff2a,United States of America,US,,No,jobs.monster.com,The Wisconsin State Journal is seeking a flexi...,Business Reporter/Editor Job in Madison,Full Time,"Madison, WI 53708",Printing and Publishing,http://jobview.monster.com/business-reporter-e...,,
839106b353877fa3d896ffb9c1fe01c0,United States of America,US,,No,jobs.monster.com,Report this job About the Job DePuy Synthes Co...,Johnson & Johnson Family of Companies Job Appl...,"Full Time, Employee",DePuy Synthes Companies is a member of Johnson...,Personal and Household Services,http://jobview.monster.com/senior-training-lea...,,
58435fcab804439efdcaa7ecca0fd783,United States of America,US,,No,jobs.monster.com,Why Join Altec? If you’re considering a career...,Engineer - Quality Job in Dixon,Full Time,"Dixon, CA",Altec Industries,http://jobview.monster.com/engineer-quality-jo...,,Experienced (Non-Manager)
64d0272dc8496abfd9523a8df63c184c,United States of America,US,,No,jobs.monster.com,Position ID# 76162 # Positions 1 State CT C...,Shift Supervisor - Part-Time Job in Camphill,Full Time Employee,"Camphill, PA",Retail,http://jobview.monster.com/shift-supervisor-pa...,,Project/Program Management


### We are certainly not going to need the date_added column, since more than 99% of its values are null
- One thing to keep in mind is that salary is a label column, so despite its many null values we have to find a way to use it by cleaning its structure and data

#### Let's drop the columns that carry no information and are not really necessary. The salary column is not going to be removed despite its high percentage of missing data, because it can be used later as a label column

- The 'date_added' column is 99.45% null, so it needs to be dropped

In [503]:
# Each of these columns holds the same information in every row, so we can assume they are not useful; dropping them can also improve performance.

df['country_code'].unique() # The only value for country_code is US

df['job_board'].unique() # The only value for job_board is jobs.monster.com

df['has_expired'].unique() # The only value for has_expired is No

df['page_url'].unique() # page_url carries the same information as job_title

df['country'].unique() # The only value for country is United States of America

array(['United States of America'], dtype=object)
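
Instead of inspecting each column by hand, constant columns can also be listed in one pass. A minimal sketch, assuming a small hypothetical frame (`toy` and `constant_cols` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy frame: two constant columns and one informative column
toy = pd.DataFrame({
    "country": ["United States of America"] * 3,
    "has_expired": ["No"] * 3,
    "job_title": ["Engineer", "Nurse", "Chef"],
})

# nunique() counts the distinct non-null values in each column
distinct = toy.nunique()
constant_cols = distinct[distinct <= 1].index.tolist()
print(constant_cols)
```

A column such as `page_url` would not be caught this way, since it is redundant rather than constant, so a manual check is still needed for that kind of column.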

In [504]:
df = df.drop(['date_added','page_url','country_code','has_expired','country','job_board'], axis = 1)
df.tail(10)

# Now let's drop the columns identified above

uniq_id,job_description,job_title,job_type,location,organization,salary,sector
1a48a8576057b15e2930281ad00fba7f,To apply to this Finance Manager role in Chica...,Finance Manager Job in Chicago,Full Time Employee,"Chicago, IL 60603",,,Business/Strategic Management
9796e104240789dd33cc436f6c383892,"Full-Time Amber Park Cincinnati, OH 3801 East ...",Licensed Practical Nurse LPN Job in Cincinnati,Full Time Employee,"Cincinnati, OH 45236",Healthcare Services Other/Not Classified,,Medical/Health
0d8f312b55b917cc81b8cf1342d6f033,Sales Management TraineeLooking to use your sk...,Customer Service Professionals Consider a Care...,Full Time Employee,"Cincinnati, OH 45237",,,Field SalesGeneral/Other: Sales/Business Devel...
c1bb7e57dd8f5fb4fecc8e9d7bb69a19,We are looking to recruit a personal individua...,Patient Access Representative Job in Chicago,Full Time Temporary/Contract/Project,"Chicago, IL 60603",,,Customer Support/Client Care
abd9ad3e0ec3c934b5a59f3776012865,What The Job Is AboutSales Support Representat...,Immediate Customer Service Position Job in Cin...,Part Time,"Cincinnati, OH 45202",All,,Entry Level
a80bc8cc3a90c17eef418963803bc640,This is a major premier Cincinnati based finan...,Assistant Vice President - Controller Job in C...,Full Time,"Cincinnati, OH",,"120,000.00 - 160,000.00 $ /yearbonus",
419a3714be2b30a10f628de207d041de,Luxury homebuilder in Cincinnati seeking multi...,Accountant Job in Cincinnati,Full Time,"Cincinnati, OH 45236",Construction - Residential & Commercial/Office,"45,000.00 - 60,000.00 $ /year",Manager (Manager/Supervisor of Staff)
5a590350b73b2cec46b05750a208e345,RE: Adobe AEM- Client - Loca...,AEM/CQ developer Job in Chicago,Full Time,"Chicago, IL 60602",,,
40161cf61c283af9dc2b0a62947a5f1b,Jernberg Industries was established in 1937 an...,Electrician - Experienced Forging Electrician ...,Full Time Employee,"Chicago, IL 60609","Jernberg Industries, Inc.",25.00 - 28.00 $ /hour,Installation/Maintenance/Repair
cb49f16ad72627b109e434e0cac97f7a,Contract AdministratorCan you be the point per...,Contract Administrator Job in Cincinnati,Full Time,"Cincinnati, OH",,"40,000.00 - 46,000.00 $ /year+ annual bonus (u...",Experienced (Non-Manager)


In [505]:
df = df.dropna(subset=['job_type','organization','sector'], how='any')
df.tail(10)
# Now let's drop the rows with NA values in any of these three columns

uniq_id,job_description,job_title,job_type,location,organization,salary,sector
8e9f8638556bc1fd671bb99f4f01ac4d,CULINARY CAREER WEST CHESTER OHIO!EXECUTIVE CH...,EXECUTIVE CHEF WEST CHESTER OHIO $K-$K PLUS! B...,Full Time,"West Chester, OH",All,"75,000.00 - 85,000.00 $ /yearHighly Competitiv...",Manager (Manager/Supervisor of Staff)
ebce61a714f4dd7d15b0263fab42751e,"McCormick & Company, Incorporated, a global le...",Customer Business Manager Job in Cincinnati,"Full Time, Employee","Cincinnati, OH 45202",Food and Beverage Production,,Marketing/Product
8a36252a31d7b06e901be0596bb6501a,About the JobWhat is more secure than the repl...,Outside Sales Representative Job in Cincinnati,Full Time Employee,"Cincinnati, OH 45202",AllEnergy and UtilitiesBusiness Services - Other,,Experienced (Non-Manager)
a53be963aac0a938a50a4d4cf7bc3ca3,Company DescriptionProSource—a total office so...,Help Desk Support Engineer Job in West Chester,Full Time,"West Chester, OH",All,,Experienced (Non-Manager)
1d37a888ca65fd919e459147a4c33457,"About Us: Viox Services, a wholly owned subsid...",Custodian Lead Job in Cincinnati,Full Time Employee,Location:,Real Estate/Property Management,,Installation/Maintenance/Repair
7502ee8f0d324f86334c531fa8bcf663,RESPONSIBILITIES: ...,Accountant Job in Cincinnati,Full Time,"Cincinnati, OH 45249",Healthcare Services,,Entry Level
9796e104240789dd33cc436f6c383892,"Full-Time Amber Park Cincinnati, OH 3801 East ...",Licensed Practical Nurse LPN Job in Cincinnati,Full Time Employee,"Cincinnati, OH 45236",Healthcare Services Other/Not Classified,,Medical/Health
abd9ad3e0ec3c934b5a59f3776012865,What The Job Is AboutSales Support Representat...,Immediate Customer Service Position Job in Cin...,Part Time,"Cincinnati, OH 45202",All,,Entry Level
419a3714be2b30a10f628de207d041de,Luxury homebuilder in Cincinnati seeking multi...,Accountant Job in Cincinnati,Full Time,"Cincinnati, OH 45236",Construction - Residential & Commercial/Office,"45,000.00 - 60,000.00 $ /year",Manager (Manager/Supervisor of Staff)
40161cf61c283af9dc2b0a62947a5f1b,Jernberg Industries was established in 1937 an...,Electrician - Experienced Forging Electrician ...,Full Time Employee,"Chicago, IL 60609","Jernberg Industries, Inc.",25.00 - 28.00 $ /hour,Installation/Maintenance/Repair



#### The first shape command gave 22000 rows; after dropping the rows with NA values in some of the columns we have 11847 rows. The only column that still has NA values is salary, with about 84% null values, which we keep for future purposes

In [506]:
shape_ = df.shape
df.isnull().sum()/shape_[0]*100

job_description     0.000000
job_title           0.000000
job_type            0.000000
location            0.000000
organization        0.000000
salary             83.801806
sector              0.000000
dtype: float64
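
Comparing row counts before and after a drop is a cheap way to confirm what a cleaning step did. A minimal sketch on hypothetical data (`toy`, `rows_before` and `rows_after` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two required columns and in salary
toy = pd.DataFrame({
    "job_type": ["Full Time", np.nan, "Part Time", "Full Time"],
    "sector": ["IT", "Retail", np.nan, "Medical/Health"],
    "salary": [np.nan, np.nan, np.nan, "25.00 - 28.00 $ /hour"],
})

rows_before = len(toy)
# Drop rows missing any of the listed columns; salary is allowed to stay null
toy = toy.dropna(subset=["job_type", "sector"])
rows_after = len(toy)
print(f"dropped {rows_before - rows_after} of {rows_before} rows")
```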

# Structure Dirty

The salary column has a lot of information that can be useful, but we have to repair its structure.

 - This column holds a range between two values, so we have to split them
 - Let's split them and give them their own schema (first_sl float, second_sl float, mean_sl float, period_sl string)
 - We have to remove some characters such as '-' and '$'
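
The target schema can be illustrated on single strings before touching the DataFrame. This is only a sketch under assumptions: the regular expression and the `parse_salary` helper are hypothetical, and the pattern covers just the `low - high $ /period` form with the periods hour, week and year.

```python
import re
from typing import Optional, Tuple

# Assumed pattern for strings like '45,000.00 - 60,000.00 $ /year'
RANGE_RE = re.compile(
    r"^\s*([\d,]+\.?\d*)\s*-\s*([\d,]+\.?\d*)\s*\$\s*/\s*(hour|week|year)"
)

def parse_salary(text: str) -> Optional[Tuple[float, float, float, str]]:
    """Return (first_sl, second_sl, mean_sl, period_sl), or None if no range is found."""
    match = RANGE_RE.match(text)
    if match is None:
        return None
    first_sl = float(match.group(1).replace(",", ""))
    second_sl = float(match.group(2).replace(",", ""))
    return first_sl, second_sl, (first_sl + second_sl) / 2, match.group(3)

print(parse_salary("45,000.00 - 60,000.00 $ /year"))
print(parse_salary("25.00 - 28.00 $ /hour"))
print(parse_salary("Competitive Wages"))
```

Entries that do not follow the range pattern return `None`, so free-text values stay visibly unparsed instead of producing wrong numbers.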


### Salary clean Structure

In [507]:
df['salary'].unique()
# We can see that there are numbers, '-', strings, and '$'

array([nan, '9.00 - 13.00 $ /hour', '80,000.00 - 95,000.00 $ /year', ...,
       '55,000.00 - 60,000.00 $ /yearFull Benefits Package, Life Insurance, 401K, Relocation Support',
       '75,000.00 - 85,000.00 $ /yearHighly Competitive Base Salary PLUS Lucrative Bonus Plan in a Highly Diverse, Dynamic, Successful Company! Performance Based Upward Mobility Assured.\u200b Do Not Hesitate, Apply Today and Grow with this Nationally Present Award Winning Restaurant Group!',
       '25.00 - 28.00 $ /hour'], dtype=object)

In [508]:
df_sal_split = df['salary']
df_sal_split.tail(10)

uniq_id
8e9f8638556bc1fd671bb99f4f01ac4d    75,000.00 - 85,000.00 $ /yearHighly Competitiv...
ebce61a714f4dd7d15b0263fab42751e                                                  NaN
8a36252a31d7b06e901be0596bb6501a                                                  NaN
a53be963aac0a938a50a4d4cf7bc3ca3                                                  NaN
1d37a888ca65fd919e459147a4c33457                                                  NaN
7502ee8f0d324f86334c531fa8bcf663                                                  NaN
9796e104240789dd33cc436f6c383892                                                  NaN
abd9ad3e0ec3c934b5a59f3776012865                                                  NaN
419a3714be2b30a10f628de207d041de                        45,000.00 - 60,000.00 $ /year
40161cf61c283af9dc2b0a62947a5f1b                                25.00 - 28.00 $ /hour
Name: salary, dtype: object

In [509]:
df_sal_split= df_sal_split.str.split('/',expand=True)
df_sal_split
# We split the salary column on '/' and expand the results into columns


uniq_id,0,1,2,3,4,5
58435fcab804439efdcaa7ecca0fd783,,,,,,
64d0272dc8496abfd9523a8df63c184c,,,,,,
1e2637cb5f7a2c4615a99a26c0566c66,,,,,,
a6a2b5e825b8ce1c3b517adb2497c5ed,,,,,,
2f8bdf60db4d85627ab8f040e67aa78d,,,,,,
...,...,...,...,...,...,...
7502ee8f0d324f86334c531fa8bcf663,,,,,,
9796e104240789dd33cc436f6c383892,,,,,,
abd9ad3e0ec3c934b5a59f3776012865,,,,,,
419a3714be2b30a10f628de207d041de,"45,000.00 - 60,000.00 $",year,,,,


In [510]:
df_sal_split[0].unique()
# Some entries hold text instead of a salary range, because those companies don't want to disclose the pay up front

array([nan, '9.00 - 13.00 $ ', '80,000.00 - 95,000.00 $ ',
       '70,000.00 - 100,000.00 $ ', '68,000.00 - 72,000.00 $ ',
       '58,000.00 - 65,000.00 $ ', 'Up to $32000.00',
       'Salary, plus commission', '45,000.00 - 100,000.00 $ ',
       '40,000.00 - 50,000.00 $ ', '13.75 - 16.75 $ ',
       '35,000.00 - 45,000.00 $ ',
       'bonus, 401K matching, medical, vacation',
       '31,000.00 - 33,000.00 $ ', '17.00 - 22.00 $ ',
       '56,000.00 - 64,000.00 $ ', '45.00 - 50.00 $ ',
       '75,000.00 - 130,000.00 $ ', 'Up to $45000.00',
       '0.00 - 85,000.00 $ ', 'Negotiable based on experience',
       '60,000.00 - 110,000.00 $ ', 'Competitive Wages',
       '50,000.00 - 100,000.00 $ ',
       'Burg Simpson offers excellent benefits and compensation commensurate with experience.',
       '$40,000.00+ ', 'Excellent compensation and benefits',
       '69,000.00 - 101,000.00 $ ', '15.00 - 19.00 $ ',
       '15.00 - 21.00 $ ', '13.00 - 16.00 $ ', '25,000.00 - 57,000.00 $ ',
       '1

In [511]:
df_sal_split[1].unique()
# These are the additional conditions and benefits that a person can get in their contract

array([nan, 'hour', 'year', None, 'yearBonus, Benefits, 401k',
       'yearsalary', 'hourYear End Bonus',
       'yearHighly Competitive Base Salary Plus Lucrative Bonus Plan, Benefit Package, in a Highly Diverse, Very Successful Company! Performance Based Upward Mobility Assured.\u200b Do Not Hesitate, Apply Today.\u200b Grow with this Nationally Present, Dynamic Restaurant Group',
       'yearGenerous Commission plan', 'hourPlus benefits',
       'hourBenefits Package',
       'yearBase + Uncapped Commissions + Benefits',
       'hourBenefits + Annual Bonus', 'week',
       'Dental Benefits; 401k Employer Match',
       'yearbase salary plus bonus', 'year+ profit sharing',
       'yearpackage', 'yearPACKAGE', 'yearbonus, tips and comp time',
       'year5', 'hourgenerous benefit package',
       'year$8,000 Recruitment Incentive Pay may be available.',
       'yearPTO, 401K',
       'hourBenefits (401K, Health Insurance) and bonuses are potentials',
       'yearHealth Benefits', 'yea

In [512]:
df_sal_split[2].unique()
# From here we can still get some benefits; perhaps we can join columns 1 to 5

array([nan, None, 'life insurance, 401K', 'Bonus', '401K',
       ' monthly performance bonus & commission', ' Incentives',
       'paid time off', 'Outlet', ' Bonus Structure', 'dental',
       'experience', 'Dental'], dtype=object)

In [513]:
df_sal_split['conditions'] = df_sal_split[1].astype(str)+' '+df_sal_split[2].astype(str)+' '+df_sal_split[3].astype(str)+' '+df_sal_split[4].astype(str)+' '+df_sal_split[5].astype(str)
df_sal_split.tail(10)

# We join the columns split earlier into a single one, to manage the condition strings more easily


uniq_id,0,1,2,3,4,5,conditions
8e9f8638556bc1fd671bb99f4f01ac4d,"75,000.00 - 85,000.00 $",yearHighly Competitive Base Salary PLUS Lucrat...,,,,,yearHighly Competitive Base Salary PLUS Lucrat...
ebce61a714f4dd7d15b0263fab42751e,,,,,,,nan nan nan nan nan
8a36252a31d7b06e901be0596bb6501a,,,,,,,nan nan nan nan nan
a53be963aac0a938a50a4d4cf7bc3ca3,,,,,,,nan nan nan nan nan
1d37a888ca65fd919e459147a4c33457,,,,,,,nan nan nan nan nan
7502ee8f0d324f86334c531fa8bcf663,,,,,,,nan nan nan nan nan
9796e104240789dd33cc436f6c383892,,,,,,,nan nan nan nan nan
abd9ad3e0ec3c934b5a59f3776012865,,,,,,,nan nan nan nan nan
419a3714be2b30a10f628de207d041de,"45,000.00 - 60,000.00 $",year,,,,,year None None None None
40161cf61c283af9dc2b0a62947a5f1b,25.00 - 28.00 $,hour,,,,,hour None None None None
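
Casting with `astype(str)` turns missing pieces into the literal strings 'nan' and 'None', which is why they show up in `conditions`. One possible alternative, shown here only as a sketch on hypothetical data (`parts` is a made-up stand-in for the split result), is to fill the gaps with empty strings before joining:

```python
import numpy as np
import pandas as pd

# Hypothetical split result: column 0 is the range, the rest are conditions
parts = pd.DataFrame({
    0: ["45,000.00 - 60,000.00 $ ", np.nan, "9.00 - 13.00 $ "],
    1: ["year", np.nan, "hourBenefits"],
    2: [None, np.nan, "401K"],
})

# Replace missing pieces with '' and join the remaining pieces with a space
conditions = (
    parts[[1, 2]]
    .fillna("")
    .apply(lambda row: " ".join(row), axis=1)
    .str.strip()
)
print(conditions.tolist())
```

With this variant, rows without any condition end up as empty strings, so no later step is needed to strip 'nan' or 'None'.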


In [514]:
df_sal_split['conditions'].unique()

array(['nan nan nan nan nan', 'hour None None None None',
       'year None None None None', 'None None None None None',
       'yearBonus, Benefits, 401k None None None None',
       'yearsalary None None None None',
       'hourYear End Bonus None None None None',
       'yearHighly Competitive Base Salary Plus Lucrative Bonus Plan, Benefit Package, in a Highly Diverse, Very Successful Company! Performance Based Upward Mobility Assured.\u200b Do Not Hesitate, Apply Today.\u200b Grow with this Nationally Present, Dynamic Restaurant Group None None None None',
       'yearGenerous Commission plan None None None None',
       'hourPlus benefits None None None None',
       'hourBenefits Package None None None None',
       'yearBase + Uncapped Commissions + Benefits None None None None',
       'hourBenefits + Annual Bonus None None None None',
       'week None None None None',
       'Dental Benefits; 401k Employer Match None None None None',
       'yearbase salary plus bonus None No

In [515]:
df_sal_split_= df_sal_split['conditions']
df_sal_split_

# Now let's take just the column that collects all the information on conditions and benefits given by the companies

uniq_id
58435fcab804439efdcaa7ecca0fd783         nan nan nan nan nan
64d0272dc8496abfd9523a8df63c184c         nan nan nan nan nan
1e2637cb5f7a2c4615a99a26c0566c66         nan nan nan nan nan
a6a2b5e825b8ce1c3b517adb2497c5ed         nan nan nan nan nan
2f8bdf60db4d85627ab8f040e67aa78d         nan nan nan nan nan
                                              ...           
7502ee8f0d324f86334c531fa8bcf663         nan nan nan nan nan
9796e104240789dd33cc436f6c383892         nan nan nan nan nan
abd9ad3e0ec3c934b5a59f3776012865         nan nan nan nan nan
419a3714be2b30a10f628de207d041de    year None None None None
40161cf61c283af9dc2b0a62947a5f1b    hour None None None None
Name: conditions, Length: 11847, dtype: object

In [516]:
df_sal_split_.shape

(11847,)

In [517]:
df_sal_range= df_sal_split[0].str.split('-',expand=True)
df_sal_range
# Now we focus on column zero, with the salary ranges; later on we will go back to the conditions column
# This split again gives us 4 columns, but only columns 0 and 1 are used below

uniq_id,0,1,2,3
58435fcab804439efdcaa7ecca0fd783,,,,
64d0272dc8496abfd9523a8df63c184c,,,,
1e2637cb5f7a2c4615a99a26c0566c66,,,,
a6a2b5e825b8ce1c3b517adb2497c5ed,,,,
2f8bdf60db4d85627ab8f040e67aa78d,,,,
...,...,...,...,...
7502ee8f0d324f86334c531fa8bcf663,,,,
9796e104240789dd33cc436f6c383892,,,,
abd9ad3e0ec3c934b5a59f3776012865,,,,
419a3714be2b30a10f628de207d041de,45000.00,"60,000.00 $",,


In [518]:
df_sal_range[0].unique()
# Here we manage to get the initial salary, but it is mixed with text from companies that don't state the salary for the position

array([nan, '9.00 ', '80,000.00 ', '70,000.00 ', '68,000.00 ',
       '58,000.00 ', 'Up to $32000.00', 'Salary, plus commission',
       '45,000.00 ', '40,000.00 ', '13.75 ', '35,000.00 ',
       'bonus, 401K matching, medical, vacation', '31,000.00 ', '17.00 ',
       '56,000.00 ', '45.00 ', '75,000.00 ', 'Up to $45000.00', '0.00 ',
       'Negotiable based on experience', '60,000.00 ',
       'Competitive Wages', '50,000.00 ',
       'Burg Simpson offers excellent benefits and compensation commensurate with experience.',
       '$40,000.00+ ', 'Excellent compensation and benefits',
       '69,000.00 ', '15.00 ', '13.00 ', '25,000.00 ', '85,000.00 ',
       '10.00 ', '12.00 ', '11.85 ', '14.00 ',
       'Salary based on experience.', '$10.87+ ', '$12.00+ ', '30.93 ',
       '45,000.00+ ', '20.00 ', '13.45 ', '100,000.00 ', '55.00 ',
       '11.00 ', '16.00 ', '$16.40+ ', '90,000.00 ', '55,000.00 ',
       '$120,000.00+ ', '26.92 ', '$13.50+ ', '9.50 ', '19.50 ',
       '54,058.00 ', '

In [519]:
df_sal_range[1].unique()
# In column 1 we find the final salary offered for the position
# There is also a '$' character that is not useful in this column

array([nan, ' 13.00 $ ', ' 95,000.00 $ ', ' 100,000.00 $ ',
       ' 72,000.00 $ ', ' 65,000.00 $ ', None, ' 50,000.00 $ ',
       ' 16.75 $ ', ' 45,000.00 $ ', ' 33,000.00 $ ', ' 22.00 $ ',
       ' 64,000.00 $ ', ' 50.00 $ ', ' 130,000.00 $ ', ' 85,000.00 $ ',
       ' 110,000.00 $ ', ' 101,000.00 $ ', ' 19.00 $ ', ' 21.00 $ ',
       ' 16.00 $ ', ' 57,000.00 $ ', ' 17.00 $ ', ' 80,000.00 $ ',
       ' 60,000.00 $ ', ' 11.00 $ ', ' 90,000.00 $ ', ' 14.00 $ ',
       ' 11.85 $ ', ' 30.93 $ ', ' 23.00 $ ', ' 12.25 $ ', ' 13.45 $ ',
       ' 130,000.00 ', ' 50,000.00 ', ' 65.00 $ ', ' 17.00 ',
       ' 85,000.00 ', ' 15.00 $ ', ' 105,000.00 ', ' 26.92 $ ',
       ' 21.90 $ ', ' 75,000.00 $ ', ' 18.00 $ ', ' 100,000.00 ',
       ' 19.50 $ ', ' 79,174.00 $ ', ' 19.00 ', ' 55,000.00 $ ',
       'On Bonus', ' 66,000.00 $ ', ' 15.00 ', ' 35.00 $ ', ' 120.00 $ ',
       ' 200,000.00 $ ', ' 20.00 $ ', ' 100.00 $ ', ' 75.00 $ ',
       ' 80.00 $ ', ' 220,000.00 $ ', ' 32.00 $ ', ' 12.00 $ ',
  

In [520]:
#df_sal_range_ = pd.DataFrame(columns=['initial_sal','final_sal'])
#df_sal_range_

df_sal_range_ = df_sal_range.loc[:,0:1].rename(columns={0:'initial_sal',1:'final_sal'})
df_sal_range_
# Here we just take the columns we consider important for the dataset to be analysed later

uniq_id,initial_sal,final_sal
58435fcab804439efdcaa7ecca0fd783,,
64d0272dc8496abfd9523a8df63c184c,,
1e2637cb5f7a2c4615a99a26c0566c66,,
a6a2b5e825b8ce1c3b517adb2497c5ed,,
2f8bdf60db4d85627ab8f040e67aa78d,,
...,...,...
7502ee8f0d324f86334c531fa8bcf663,,
9796e104240789dd33cc436f6c383892,,
abd9ad3e0ec3c934b5a59f3776012865,,
419a3714be2b30a10f628de207d041de,45000.00,"60,000.00 $"


In [521]:
df2 = pd.DataFrame(df_sal_split_)
df3 = pd.DataFrame(df_sal_range_)

# We convert the Series objects created earlier into pandas DataFrames

df_salaries = df2.join(df3, how='left')
df_salaries
# Now we join into df_salaries the two previous datasets, which were split in order to clean the structure of the salary column
# As we can see, we created:
# conditions = conditions of the salary
# initial_sal = initial salary
# final_sal = final salary

# But... there is still something wrong with this chunk of the dataset in the final_sal column. Can you see it?

uniq_id,conditions,initial_sal,final_sal
58435fcab804439efdcaa7ecca0fd783,nan nan nan nan nan,,
64d0272dc8496abfd9523a8df63c184c,nan nan nan nan nan,,
1e2637cb5f7a2c4615a99a26c0566c66,nan nan nan nan nan,,
a6a2b5e825b8ce1c3b517adb2497c5ed,nan nan nan nan nan,,
2f8bdf60db4d85627ab8f040e67aa78d,nan nan nan nan nan,,
...,...,...,...
7502ee8f0d324f86334c531fa8bcf663,nan nan nan nan nan,,
9796e104240789dd33cc436f6c383892,nan nan nan nan nan,,
abd9ad3e0ec3c934b5a59f3776012865,nan nan nan nan nan,,
419a3714be2b30a10f628de207d041de,year None None None None,45000.00,"60,000.00 $"


In [522]:
df_salaries['initial_sal'].unique()
# Now let's clean the initial_sal column by taking out the strings

array([nan, '9.00 ', '80,000.00 ', '70,000.00 ', '68,000.00 ',
       '58,000.00 ', 'Up to $32000.00', 'Salary, plus commission',
       '45,000.00 ', '40,000.00 ', '13.75 ', '35,000.00 ',
       'bonus, 401K matching, medical, vacation', '31,000.00 ', '17.00 ',
       '56,000.00 ', '45.00 ', '75,000.00 ', 'Up to $45000.00', '0.00 ',
       'Negotiable based on experience', '60,000.00 ',
       'Competitive Wages', '50,000.00 ',
       'Burg Simpson offers excellent benefits and compensation commensurate with experience.',
       '$40,000.00+ ', 'Excellent compensation and benefits',
       '69,000.00 ', '15.00 ', '13.00 ', '25,000.00 ', '85,000.00 ',
       '10.00 ', '12.00 ', '11.85 ', '14.00 ',
       'Salary based on experience.', '$10.87+ ', '$12.00+ ', '30.93 ',
       '45,000.00+ ', '20.00 ', '13.45 ', '100,000.00 ', '55.00 ',
       '11.00 ', '16.00 ', '$16.40+ ', '90,000.00 ', '55,000.00 ',
       '$120,000.00+ ', '26.92 ', '$13.50+ ', '9.50 ', '19.50 ',
       '54,058.00 ', '

In [523]:
df_salaries

uniq_id,conditions,initial_sal,final_sal
58435fcab804439efdcaa7ecca0fd783,nan nan nan nan nan,,
64d0272dc8496abfd9523a8df63c184c,nan nan nan nan nan,,
1e2637cb5f7a2c4615a99a26c0566c66,nan nan nan nan nan,,
a6a2b5e825b8ce1c3b517adb2497c5ed,nan nan nan nan nan,,
2f8bdf60db4d85627ab8f040e67aa78d,nan nan nan nan nan,,
...,...,...,...
7502ee8f0d324f86334c531fa8bcf663,nan nan nan nan nan,,
9796e104240789dd33cc436f6c383892,nan nan nan nan nan,,
abd9ad3e0ec3c934b5a59f3776012865,nan nan nan nan nan,,
419a3714be2b30a10f628de207d041de,year None None None None,45000.00,"60,000.00 $"


### Salary clean data
We have just cleaned the structure of the salary column; now we are going to clean the data itself

In [524]:
df_salaries['initial_sal_'] = df_salaries['initial_sal'].str.extract(r'(\d.+)')
df_salaries
# Here we remove the leading text from the 'initial_sal' column, keeping everything from the first digit onward, so that only the numeric part remains

uniq_id,conditions,initial_sal,final_sal,initial_sal_
58435fcab804439efdcaa7ecca0fd783,nan nan nan nan nan,,,
64d0272dc8496abfd9523a8df63c184c,nan nan nan nan nan,,,
1e2637cb5f7a2c4615a99a26c0566c66,nan nan nan nan nan,,,
a6a2b5e825b8ce1c3b517adb2497c5ed,nan nan nan nan nan,,,
2f8bdf60db4d85627ab8f040e67aa78d,nan nan nan nan nan,,,
...,...,...,...,...
7502ee8f0d324f86334c531fa8bcf663,nan nan nan nan nan,,,
9796e104240789dd33cc436f6c383892,nan nan nan nan nan,,,
abd9ad3e0ec3c934b5a59f3776012865,nan nan nan nan nan,,,
419a3714be2b30a10f628de207d041de,year None None None None,45000.00,"60,000.00 $",45000.00
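
It helps to see exactly what the pattern `(\d.+)` captures: it keeps the first digit and everything after it, including trailing characters such as '+' or spaces. The sketch below compares it with a stricter pattern; the stricter pattern and the `samples` values are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Sample strings in the style of the initial_sal values
samples = pd.Series(
    ["45,000.00 ", "Up to $32000.00", "$40,000.00+ ", "Competitive Wages"]
)

# Pattern used above: the first digit plus everything after it
loose = samples.str.extract(r"(\d.+)", expand=False)
# Assumed stricter alternative: digits, thousands separators, optional decimals
strict = samples.str.extract(r"(\d[\d,]*(?:\.\d+)?)", expand=False)
print(loose.tolist())
print(strict.tolist())
```

Both patterns leave text-only entries as missing values; the stricter one simply stops at the end of the number.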


In [525]:
df_salaries=df_salaries.drop(['initial_sal'], axis = 1)
df_salaries

uniq_id,conditions,final_sal,initial_sal_
58435fcab804439efdcaa7ecca0fd783,nan nan nan nan nan,,
64d0272dc8496abfd9523a8df63c184c,nan nan nan nan nan,,
1e2637cb5f7a2c4615a99a26c0566c66,nan nan nan nan nan,,
a6a2b5e825b8ce1c3b517adb2497c5ed,nan nan nan nan nan,,
2f8bdf60db4d85627ab8f040e67aa78d,nan nan nan nan nan,,
...,...,...,...
7502ee8f0d324f86334c531fa8bcf663,nan nan nan nan nan,,
9796e104240789dd33cc436f6c383892,nan nan nan nan nan,,
abd9ad3e0ec3c934b5a59f3776012865,nan nan nan nan nan,,
419a3714be2b30a10f628de207d041de,year None None None None,"60,000.00 $",45000.00


In [526]:
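# Remove the literal '$' character (regex=False so that '$' is not read as an end-of-string anchor)
df_salaries['final_sal'] = df_salaries['final_sal'].str.replace('$', '', regex=False)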
df_salaries


uniq_id,conditions,final_sal,initial_sal_
58435fcab804439efdcaa7ecca0fd783,nan nan nan nan nan,,
64d0272dc8496abfd9523a8df63c184c,nan nan nan nan nan,,
1e2637cb5f7a2c4615a99a26c0566c66,nan nan nan nan nan,,
a6a2b5e825b8ce1c3b517adb2497c5ed,nan nan nan nan nan,,
2f8bdf60db4d85627ab8f040e67aa78d,nan nan nan nan nan,,
...,...,...,...
7502ee8f0d324f86334c531fa8bcf663,nan nan nan nan nan,,
9796e104240789dd33cc436f6c383892,nan nan nan nan nan,,
abd9ad3e0ec3c934b5a59f3776012865,nan nan nan nan nan,,
419a3714be2b30a10f628de207d041de,year None None None None,60000.00,45000.00
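
At this point both salary columns still hold strings. A possible next step, sketched here on hypothetical values (`raw` is made up), is to strip the non-numeric characters and convert with `pd.to_numeric`. Note the assumption: this is only safe for values known to be amounts, since text such as '401K' would leave stray digits behind.

```python
import pandas as pd

# Hypothetical values in the style of initial_sal_ and final_sal
raw = pd.Series(["45,000.00 ", " 60,000.00 $ ", "40,000.00+ ", "On Bonus", None])

# Keep only digits and the decimal point, then convert;
# errors='coerce' turns anything unparseable into NaN
cleaned = raw.str.replace(r"[^\d.]", "", regex=True)
numbers = pd.to_numeric(cleaned, errors="coerce")
print(numbers.tolist())
```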


In [527]:
df_ = df_salaries.join(df, how='left').drop(['salary'], axis = 1)
df_
# Almost the last step: our dataset is now partly clean

uniq_id,conditions,final_sal,initial_sal_,job_description,job_title,job_type,location,organization,sector
58435fcab804439efdcaa7ecca0fd783,nan nan nan nan nan,,,Why Join Altec? If you’re considering a career...,Engineer - Quality Job in Dixon,Full Time,"Dixon, CA",Altec Industries,Experienced (Non-Manager)
64d0272dc8496abfd9523a8df63c184c,nan nan nan nan nan,,,Position ID# 76162 # Positions 1 State CT C...,Shift Supervisor - Part-Time Job in Camphill,Full Time Employee,"Camphill, PA",Retail,Project/Program Management
1e2637cb5f7a2c4615a99a26c0566c66,nan nan nan nan nan,,,Job Description Job #: 720298Apex Systems has...,Construction PM - Charlottesville Job in Charl...,Full Time Employee,"Charlottesville, VA",Computer/IT Services,Experienced (Non-Manager)
a6a2b5e825b8ce1c3b517adb2497c5ed,nan nan nan nan nan,,,"Part-Time, 4:30 pm - 9:30 pm, Mon - Fri Brookd...",Housekeeper Job in Austin,Part Time Employee,"Austin, TX 78746",Hotels and Lodging Personal and Household Serv...,Customer Support/Client Care
2f8bdf60db4d85627ab8f040e67aa78d,nan nan nan nan nan,,,Aflac Insurance Sales Agent While a career in ...,Aflac Insurance Sales Agent Job in Berryville,Full Time,"Berryville, VA 22611",Insurance,Customer Support/Client Care
...,...,...,...,...,...,...,...,...,...
7502ee8f0d324f86334c531fa8bcf663,nan nan nan nan nan,,,RESPONSIBILITIES: ...,Accountant Job in Cincinnati,Full Time,"Cincinnati, OH 45249",Healthcare Services,Entry Level
9796e104240789dd33cc436f6c383892,nan nan nan nan nan,,,"Full-Time Amber Park Cincinnati, OH 3801 East ...",Licensed Practical Nurse LPN Job in Cincinnati,Full Time Employee,"Cincinnati, OH 45236",Healthcare Services Other/Not Classified,Medical/Health
abd9ad3e0ec3c934b5a59f3776012865,nan nan nan nan nan,,,What The Job Is AboutSales Support Representat...,Immediate Customer Service Position Job in Cin...,Part Time,"Cincinnati, OH 45202",All,Entry Level
419a3714be2b30a10f628de207d041de,year None None None None,60000.00,45000.00,Luxury homebuilder in Cincinnati seeking multi...,Accountant Job in Cincinnati,Full Time,"Cincinnati, OH 45236",Construction - Residential & Commercial/Office,Manager (Manager/Supervisor of Staff)


In [528]:
shape=df_.shape
df_.isnull().sum()/shape[0]*100

conditions          0.000000
final_sal          87.802819
initial_sal_       86.021778
job_description     0.000000
job_title           0.000000
job_type            0.000000
location            0.000000
organization        0.000000
sector              0.000000
dtype: float64

### Clean Structure

#### To recover the missing salaries we have to find a way to fill the empty cells in this column. What I plan to do here is to use the correlation between non-empty salary cells and the other columns
- From this part on we are going to clean the structure by identifying job conditions: yearly, weekly, commissions ...
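
One way to start identifying the job conditions is to pull the pay period off the front of each `conditions` string and flag keywords in the rest. A minimal sketch, assuming the string format shown earlier; the sample values and the `period` and `has_commission` names are hypothetical:

```python
import pandas as pd

# Hypothetical conditions strings in the format produced earlier
conditions = pd.Series([
    "year None None None None",
    "hourBenefits Package None None None None",
    "week7-9% commisson depending on sales volume",
    "nan nan nan nan nan",
])

# The pay period is the leading token; anything after it is free text
period = conditions.str.extract(r"^(hour|week|year)", expand=False)
# Flag rows that mention a commission (the prefix also matches misspellings)
has_commission = conditions.str.contains("commis", case=False)
print(period.tolist())
print(has_commission.tolist())
```

Rows without a recognisable period come out as missing values, which keeps them easy to separate from the rows that can be used for prediction.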

In [529]:
df_noempty = df_[~df_['initial_sal_'].isna()]
#df_noempty = df_[df_['initial_sal_']!= 'NaN']
df_noempty_cond = df_noempty['conditions']
df_noempty_cond.tail(20)


# First we select the rows whose salary column is not empty, to obtain data that helps us predict the missing salaries

uniq_id
c87996f53febe5831502936a023422af                             year None None None None
23d59602ade1104ca16fec033fb6235d                             year None None None None
fee44da1aa910d621f179d89be53057c    hourpotential commissions with experience None...
bcc9909fa0e41d30a9bddc1a7ab4cc17                             hour None None None None
5310eafc4a142affdd10ab50de2c1cca        hourOvertime and Benefits None None None None
905e9e0b827bde830fbb401707dc6af4                             year None None None None
28f7367f5f9be4e6bdcf68bb7c602c08                             year None None None None
b5838e6822ee1ffd9174d41ba9ae7b9b                             hour None None None None
4a65f0199487728cc762abd524237360                    hourMedical Dental None None None
9743d9bdcaea4c3a827df5053c74a890                             year None None None None
89f0aa8bbc99eb1d0254882832e3c490                             year None None None None
292810c8b37005b3b49f84c790110594              

In [530]:
#df_noempty_cond_ = df_noempty_cond.replace('nan', '', regex=True)
df_noempty_cond_2 = df_noempty_cond.replace('None', '', regex=True)
df_noempty_cond_2
# The goal is to transpose these values into columns

uniq_id
b43c077756d5a326c4854e1399fd2464                                             hour    
d8491fcefe14d1398de419984dccf427                                             year    
779bb4c9bf038b7fb775134736d36fd4                                             year    
ceb44cca7cd280adcb0c84c20f3c6c21                                             year    
eea9b50afc4fece9f9d6ff0dbf659784                                             year    
                                                          ...                        
4e2b0a3e9fe5f8721f6ab4692823d9a9     week7-9% commisson depending on sales volume    
ac3c743eb7612d90bfe80a3a30d3d8be    yearFull Benefits Package, Life Insurance, 401...
8e9f8638556bc1fd671bb99f4f01ac4d    yearHighly Competitive Base Salary PLUS Lucrat...
419a3714be2b30a10f628de207d041de                                             year    
40161cf61c283af9dc2b0a62947a5f1b                                             hour    
Name: conditions, Length: 1656, dtype: object

In [531]:
df_noempty_cond_2= df_noempty_cond_2.str.lower()
df_noempty_cond_2_ = df_noempty_cond_2.to_frame()
df_noempty_cond_2_ 

Unnamed: 0_level_0,conditions
uniq_id,Unnamed: 1_level_1
b43c077756d5a326c4854e1399fd2464,hour
d8491fcefe14d1398de419984dccf427,year
779bb4c9bf038b7fb775134736d36fd4,year
ceb44cca7cd280adcb0c84c20f3c6c21,year
eea9b50afc4fece9f9d6ff0dbf659784,year
...,...
4e2b0a3e9fe5f8721f6ab4692823d9a9,week7-9% commisson depending on sales volume
ac3c743eb7612d90bfe80a3a30d3d8be,"yearfull benefits package, life insurance, 401..."
8e9f8638556bc1fd671bb99f4f01ac4d,yearhighly competitive base salary plus lucrat...
419a3714be2b30a10f628de207d041de,year


In [532]:
df_noempty_cond_split = df_noempty_cond_2_['conditions'].str.split(' ',n=1 ,expand=True)
# Rows without a second part come back as None after the split; fill them with empty strings
df_noempty_cond_3 = df_noempty_cond_split.fillna('')
df_noempty_cond_3.head(20)

Unnamed: 0_level_0,0,1
uniq_id,Unnamed: 1_level_1,Unnamed: 2_level_1
b43c077756d5a326c4854e1399fd2464,hour,
d8491fcefe14d1398de419984dccf427,year,
779bb4c9bf038b7fb775134736d36fd4,year,
ceb44cca7cd280adcb0c84c20f3c6c21,year,
eea9b50afc4fece9f9d6ff0dbf659784,year,
1f2da47e60173c6667395f081c048713,,
f15dfb5ad12ddb6acb8a26eb04f2220f,"yearbonus,","benefits, 401k"
ea722c991c6d3816965702948e320cb6,yearsalary,
3c2f7c555173e04db07a96b23c1be974,houryear,end bonus
84422981a6356a19a6a4b53aa460e028,year,


In [533]:
# The first four characters hold the pay period (hour, year, week)
df_noempty_cond_3['period'] = df_noempty_cond_3[0].astype(str).str[0:4]
# Everything after the period; slicing with [4:-1] would drop the last character
df_noempty_cond_3['period_cond'] = df_noempty_cond_3[0].astype(str).str[4:]
df_noempty_period = df_noempty_cond_3.drop(0, axis = 1)
df_noempty_period_ = df_noempty_period.rename(columns= {1: 'benefits'})
df_noempty_period_['benefits'] = df_noempty_period_['benefits'].str.strip()
df_noempty_period_

Unnamed: 0_level_0,benefits,period,period_cond
uniq_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
b43c077756d5a326c4854e1399fd2464,,hour,
d8491fcefe14d1398de419984dccf427,,year,
779bb4c9bf038b7fb775134736d36fd4,,year,
ceb44cca7cd280adcb0c84c20f3c6c21,,year,
eea9b50afc4fece9f9d6ff0dbf659784,,year,
...,...,...,...
4e2b0a3e9fe5f8721f6ab4692823d9a9,commisson depending on sales volume,week,7-9%
ac3c743eb7612d90bfe80a3a30d3d8be,"benefits package, life insurance, 401k, reloca...",year,full
8e9f8638556bc1fd671bb99f4f01ac4d,competitive base salary plus lucrative bonus p...,year,highly
419a3714be2b30a10f628de207d041de,,year,
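
Slicing by position works because every period word seen so far has four characters. As an alternative, here is a minimal sketch that pulls the period out with a regular expression instead. The sample strings are made up to look like the `conditions` values above, and the list of period words (including `month`) is my assumption, not something checked against the full dataset.

```python
import pandas as pd

# Toy sample that mimics the lower-cased `conditions` strings shown above.
sample = pd.Series(
    [
        "hour",
        "year",
        "week7-9% commisson depending on sales volume",
        "yearfull benefits package, life insurance",
        "",
    ],
    name="conditions",
)

# Named groups become column names: an optional known period word at the
# start of the string, followed by whatever text comes after it.
# The list of period words is an assumption for this sketch.
pattern = r"^(?P<period>hour|week|month|year)?(?P<details>.*)$"
parts = sample.str.extract(pattern)

# Rows without a recognised period get NaN in `period`.
parts["details"] = parts["details"].str.strip()
print(parts)
```

With this approach the period no longer depends on a fixed length, and the rest of the text stays in one piece instead of being cut at the first space.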


In [534]:
df_2 = df_.join(df_noempty_period_, how='left').drop(columns='conditions')
#df_2= df_2
df_2.head(5)

Unnamed: 0_level_0,final_sal,initial_sal_,job_description,job_title,job_type,location,organization,sector,benefits,period,period_cond
uniq_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
58435fcab804439efdcaa7ecca0fd783,,,Why Join Altec? If you’re considering a career...,Engineer - Quality Job in Dixon,Full Time,"Dixon, CA",Altec Industries,Experienced (Non-Manager),,,
64d0272dc8496abfd9523a8df63c184c,,,Position ID# 76162 # Positions 1 State CT C...,Shift Supervisor - Part-Time Job in Camphill,Full Time Employee,"Camphill, PA",Retail,Project/Program Management,,,
1e2637cb5f7a2c4615a99a26c0566c66,,,Job Description Job #: 720298Apex Systems has...,Construction PM - Charlottesville Job in Charl...,Full Time Employee,"Charlottesville, VA",Computer/IT Services,Experienced (Non-Manager),,,
a6a2b5e825b8ce1c3b517adb2497c5ed,,,"Part-Time, 4:30 pm - 9:30 pm, Mon - Fri Brookd...",Housekeeper Job in Austin,Part Time Employee,"Austin, TX 78746",Hotels and Lodging Personal and Household Serv...,Customer Support/Client Care,,,
2f8bdf60db4d85627ab8f040e67aa78d,,,Aflac Insurance Sales Agent While a career in ...,Aflac Insurance Sales Agent Job in Berryville,Full Time,"Berryville, VA 22611",Insurance,Customer Support/Client Care,,,


## job_title


### The job title describes the position, but it also repeats information about job_type and location that already exists in other columns, so the only part of this column that really matters is the position.

In [535]:
df_job_title = df_['job_title']
df_job_title_= df_job_title.to_frame()
df_job_title_in = df_job_title_['job_title'].str.split('Job in', n=1, expand=True)
df_job_title1 = df_job_title_in[0].str.split('-', n=1, expand=True)
df_job_title1[0] = df_job_title1[0].str.lower()
df_job_title1 


Unnamed: 0_level_0,0,1
uniq_id,Unnamed: 1_level_1,Unnamed: 2_level_1
58435fcab804439efdcaa7ecca0fd783,engineer,Quality
64d0272dc8496abfd9523a8df63c184c,shift supervisor,Part-Time
1e2637cb5f7a2c4615a99a26c0566c66,construction pm,Charlottesville
a6a2b5e825b8ce1c3b517adb2497c5ed,housekeeper,
2f8bdf60db4d85627ab8f040e67aa78d,aflac insurance sales agent,
...,...,...
7502ee8f0d324f86334c531fa8bcf663,accountant,
9796e104240789dd33cc436f6c383892,licensed practical nurse lpn,
abd9ad3e0ec3c934b5a59f3776012865,immediate customer service position,
419a3714be2b30a10f628de207d041de,accountant,


In [536]:
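df_3 = df_2.join(df_job_title1, how= 'left').rename(columns={0:'job'}).drop(columns = ['job_title',1])
# Remove the word 'Employee' and any commas inside the text, then trim and lower-case it
df_3['job_type'] = df_3['job_type'].str.replace('Employee', '', regex=False).str.replace(',', '', regex=False).str.strip().str.lower()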
df_3


Unnamed: 0_level_0,final_sal,initial_sal_,job_description,job_type,location,organization,sector,benefits,period,period_cond,job
uniq_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
58435fcab804439efdcaa7ecca0fd783,,,Why Join Altec? If you’re considering a career...,full time,"Dixon, CA",Altec Industries,Experienced (Non-Manager),,,,engineer
64d0272dc8496abfd9523a8df63c184c,,,Position ID# 76162 # Positions 1 State CT C...,full time,"Camphill, PA",Retail,Project/Program Management,,,,shift supervisor
1e2637cb5f7a2c4615a99a26c0566c66,,,Job Description Job #: 720298Apex Systems has...,full time,"Charlottesville, VA",Computer/IT Services,Experienced (Non-Manager),,,,construction pm
a6a2b5e825b8ce1c3b517adb2497c5ed,,,"Part-Time, 4:30 pm - 9:30 pm, Mon - Fri Brookd...",part time,"Austin, TX 78746",Hotels and Lodging Personal and Household Serv...,Customer Support/Client Care,,,,housekeeper
2f8bdf60db4d85627ab8f040e67aa78d,,,Aflac Insurance Sales Agent While a career in ...,full time,"Berryville, VA 22611",Insurance,Customer Support/Client Care,,,,aflac insurance sales agent
...,...,...,...,...,...,...,...,...,...,...,...
7502ee8f0d324f86334c531fa8bcf663,,,RESPONSIBILITIES: ...,full time,"Cincinnati, OH 45249",Healthcare Services,Entry Level,,,,accountant
9796e104240789dd33cc436f6c383892,,,"Full-Time Amber Park Cincinnati, OH 3801 East ...",full time,"Cincinnati, OH 45236",Healthcare Services Other/Not Classified,Medical/Health,,,,licensed practical nurse lpn
abd9ad3e0ec3c934b5a59f3776012865,,,What The Job Is AboutSales Support Representat...,part time,"Cincinnati, OH 45202",All,Entry Level,,,,immediate customer service position
419a3714be2b30a10f628de207d041de,60000.00,45000.00,Luxury homebuilder in Cincinnati seeking multi...,full time,"Cincinnati, OH 45236",Construction - Residential & Commercial/Office,Manager (Manager/Supervisor of Staff),,year,,accountant


## Job_type
### Here we keep just the type of job and erase additional information that might not be useful

In [537]:
df_4= df_3['job_type']
df_4 = df_4.str.split('/', expand=True)
df_5 = df_3.join(df_4[0], how='left').drop(columns='job_type').rename(columns= {0:'job_type'})
df_5.head()

Unnamed: 0_level_0,final_sal,initial_sal_,job_description,location,organization,sector,benefits,period,period_cond,job,job_type
uniq_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
58435fcab804439efdcaa7ecca0fd783,,,Why Join Altec? If you’re considering a career...,"Dixon, CA",Altec Industries,Experienced (Non-Manager),,,,engineer,full time
64d0272dc8496abfd9523a8df63c184c,,,Position ID# 76162 # Positions 1 State CT C...,"Camphill, PA",Retail,Project/Program Management,,,,shift supervisor,full time
1e2637cb5f7a2c4615a99a26c0566c66,,,Job Description Job #: 720298Apex Systems has...,"Charlottesville, VA",Computer/IT Services,Experienced (Non-Manager),,,,construction pm,full time
a6a2b5e825b8ce1c3b517adb2497c5ed,,,"Part-Time, 4:30 pm - 9:30 pm, Mon - Fri Brookd...","Austin, TX 78746",Hotels and Lodging Personal and Household Serv...,Customer Support/Client Care,,,,housekeeper,part time
2f8bdf60db4d85627ab8f040e67aa78d,,,Aflac Insurance Sales Agent While a career in ...,"Berryville, VA 22611",Insurance,Customer Support/Client Care,,,,aflac insurance sales agent,full time


## location
#### This column holds the city, the state, and other information that is not useful here
#### For this exercise I'm only interested in city and state

In [553]:
df_6 = df_5['location']
df_6 = df_6.str.split(',', n=2 , expand=True)
df_6[['city','state']] = df_6[[0,1]]
df_6 = df_6.drop(columns=[0,1,2])
df_6['city'] = df_6['city'].str.lower()
df_6['state'] = df_6['state'].str.lower()
df_6


Unnamed: 0_level_0,city,state
uniq_id,Unnamed: 1_level_1,Unnamed: 2_level_1
58435fcab804439efdcaa7ecca0fd783,dixon,ca
64d0272dc8496abfd9523a8df63c184c,camphill,pa
1e2637cb5f7a2c4615a99a26c0566c66,charlottesville,va
a6a2b5e825b8ce1c3b517adb2497c5ed,austin,tx 78746
2f8bdf60db4d85627ab8f040e67aa78d,berryville,va 22611
...,...,...
7502ee8f0d324f86334c531fa8bcf663,cincinnati,oh 45249
9796e104240789dd33cc436f6c383892,cincinnati,oh 45236
abd9ad3e0ec3c934b5a59f3776012865,cincinnati,oh 45202
419a3714be2b30a10f628de207d041de,cincinnati,oh 45236
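
As the output shows, the `state` column still carries the ZIP code when the location has one. A possible way to separate them is sketched below with `str.extract`. The sample values are copied from the rows above, while the pattern (two-letter state, optional five-digit ZIP) and the `zip` column name are my assumptions and may not cover every location in the dataset.

```python
import pandas as pd

# A few location strings taken from the rows shown above.
location = pd.Series(
    ["Dixon, CA", "Austin, TX 78746", "Cincinnati, OH 45236"],
    name="location",
)

# City is everything before the first comma, state is a two-letter code,
# and the ZIP code is optional. This pattern is an assumption for the sketch.
pattern = r"^\s*(?P<city>[^,]+),\s*(?P<state>[A-Za-z]{2})\b\s*(?P<zip>\d{5})?"
loc_parts = location.str.extract(pattern)

# Normalise text the same way as the rest of the notebook.
loc_parts["city"] = loc_parts["city"].str.strip().str.lower()
loc_parts["state"] = loc_parts["state"].str.lower()
print(loc_parts)
```

Locations that do not match the pattern come back as NaN in all three columns, which makes them easy to find and inspect by hand.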


In [554]:
df_7 = df_5.join(df_6, how='left')
df_7

Unnamed: 0_level_0,final_sal,initial_sal_,job_description,location,organization,sector,benefits,period,period_cond,job,job_type,city,state
uniq_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
58435fcab804439efdcaa7ecca0fd783,,,Why Join Altec? If you’re considering a career...,"Dixon, CA",Altec Industries,Experienced (Non-Manager),,,,engineer,full time,dixon,ca
64d0272dc8496abfd9523a8df63c184c,,,Position ID# 76162 # Positions 1 State CT C...,"Camphill, PA",Retail,Project/Program Management,,,,shift supervisor,full time,camphill,pa
1e2637cb5f7a2c4615a99a26c0566c66,,,Job Description Job #: 720298Apex Systems has...,"Charlottesville, VA",Computer/IT Services,Experienced (Non-Manager),,,,construction pm,full time,charlottesville,va
a6a2b5e825b8ce1c3b517adb2497c5ed,,,"Part-Time, 4:30 pm - 9:30 pm, Mon - Fri Brookd...","Austin, TX 78746",Hotels and Lodging Personal and Household Serv...,Customer Support/Client Care,,,,housekeeper,part time,austin,tx 78746
2f8bdf60db4d85627ab8f040e67aa78d,,,Aflac Insurance Sales Agent While a career in ...,"Berryville, VA 22611",Insurance,Customer Support/Client Care,,,,aflac insurance sales agent,full time,berryville,va 22611
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7502ee8f0d324f86334c531fa8bcf663,,,RESPONSIBILITIES: ...,"Cincinnati, OH 45249",Healthcare Services,Entry Level,,,,accountant,full time,cincinnati,oh 45249
9796e104240789dd33cc436f6c383892,,,"Full-Time Amber Park Cincinnati, OH 3801 East ...","Cincinnati, OH 45236",Healthcare Services Other/Not Classified,Medical/Health,,,,licensed practical nurse lpn,full time,cincinnati,oh 45236
abd9ad3e0ec3c934b5a59f3776012865,,,What The Job Is AboutSales Support Representat...,"Cincinnati, OH 45202",All,Entry Level,,,,immediate customer service position,part time,cincinnati,oh 45202
419a3714be2b30a10f628de207d041de,60000.00,45000.00,Luxury homebuilder in Cincinnati seeking multi...,"Cincinnati, OH 45236",Construction - Residential & Commercial/Office,Manager (Manager/Supervisor of Staff),,year,,accountant,full time,cincinnati,oh 45236


In [None]:

lista = df_noempty_period_['benefits'].str.split(',', expand=True).stack().value_counts()
df_lista = lista.to_frame()
df_lista


Unnamed: 0,count
,1402
benefits,16
bonus,14
401k,8
benefits package,8
...,...
+ ot with bonus potential,1
will be based on experience,1
negoitable,1
rate,1


In [None]:
conteo = df_lista[df_lista['count']>30]#.to_csv('lista.csv')
conteo

Unnamed: 0,count
,1402
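
The only value above the threshold is the empty string, because rows without benefits are counted too and tokens keep their surrounding spaces. A small sketch of how the count could be made more informative, assuming we strip each token and drop the empty ones before counting (the sample data is invented for illustration):

```python
import pandas as pd

# Invented sample in the same shape as the `benefits` column.
benefits = pd.Series(["", "benefits, 401k", "bonus, benefits ", "", " 401k"])

# One token per row, with surrounding whitespace removed.
tokens = benefits.str.split(",").explode().str.strip()

# Empty strings only say "no benefit listed", so leave them out of the count.
tokens = tokens[tokens != ""]
benefit_counts = tokens.value_counts()
print(benefit_counts)
```

After this cleanup, variants such as `" 401k"` and `"401k"` are counted together, so a frequency threshold becomes meaningful.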


In [None]:
empty_word=[' on ',' in ',' this ', ' plus ', ' a ', ' an ', ' end ', ' very ', ' do ' , ' not ' ,' upward ', ' are ', ' and ']

# Replace each filler word (with its surrounding spaces) by a single space
for word in empty_word:
    df_noempty_cond_2_["conditions"] = df_noempty_cond_2_["conditions"].str.replace(word, " ", regex=False)

type(df_noempty_cond_2_)

pandas.core.frame.DataFrame
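
Matching the words with a space on each side misses them at the start or end of a string. The sketch below is one way around that, using a single regular expression with word boundaries. The word list is the same as above, and the sample strings are invented for illustration.

```python
import re
import pandas as pd

# Same filler words as above, without the surrounding spaces.
stop_words = ["on", "in", "this", "plus", "a", "an", "end", "very",
              "do", "not", "upward", "are", "and"]

# \b marks a word boundary, so "in" is removed but "insurance" is untouched.
pattern = r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b"

# Invented sample strings.
conditions = pd.Series(
    ["year plus bonus and benefits", "hour on call", "and commission"]
)

cleaned = (
    conditions.str.replace(pattern, " ", regex=True)  # drop the filler words
    .str.replace(r"\s+", " ", regex=True)             # collapse repeated spaces
    .str.strip()
)
print(cleaned.tolist())
```

This also avoids looping over the column once per word, since all the words are handled in one pass.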

In [None]:

# Let's start with the conditions column


In [None]:


# Lets split conditions columns so 

In [None]:
dif = df_con_split[df_con_split[0]!='']
dif[1].unique()

NameError: name 'df_con_split' is not defined

# Change data types

# Clean structures again

In [None]:
# page_url has the same information as job_title

# Imputation — the process of replacing missing data with substituted values.

page_url has the same information as job_title and location
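
To make the imputation idea concrete, here is a minimal sketch of one simple option: filling a missing salary with the median salary of the same `job_type`. This is a stand-in technique chosen for illustration, not the method this notebook settles on, and the toy values are invented. On the real data the grouping would probably also need `period`, since hourly and yearly amounts cannot be mixed.

```python
import numpy as np
import pandas as pd

# Invented toy data using the notebook's column names.
jobs = pd.DataFrame({
    "job_type": ["full time", "full time", "full time", "part time", "part time"],
    "initial_sal_": [45000.0, np.nan, 55000.0, 15.0, np.nan],
})

# Median salary of each job_type, aligned row by row with the original frame.
group_median = jobs.groupby("job_type")["initial_sal_"].transform("median")

# Keep the original column and write the filled values to a new one,
# plus a flag so imputed rows can always be told apart from real ones.
jobs["initial_sal_imputed"] = jobs["initial_sal_"].fillna(group_median)
jobs["was_imputed"] = jobs["initial_sal_"].isna()
print(jobs)
```

Keeping the flag column matters here because more than 85% of the salaries are missing, so most values in the filled column would be estimates.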