# Part 1 - Asking Questions & Data Cleaning

## Asking Questions

Before carrying out any data preprocessing, we want to state the data science and machine learning questions that this project intends to answer. Here, we define 4 simple questions that are the main motivation for the project.

1. **Which structured features have the strongest correlation with, or influence on, the fraudulent target variable?**
2. **Which balancing method best improves the prediction results for this imbalanced dataset?**
3. **What type of features gives the best prediction results for the fraudulent target variable?**
4. **Do deep learning models trained on mixed data produce better precision/recall/F1-score than models trained on each type of data individually?**

### Importing Libraries

Let's import some libraries for data preprocessing and visualisation.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re
import string
import random
import contractions

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from num2word import word

from country_list import available_languages
from country_list import countries_for_language
from countryinfo import CountryInfo
import geonamescache

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
# Set the maximum column width and the maximum number of columns to display
pd.set_option('display.max_colwidth', 100)
pd.set_option("display.max_columns", 100)

## Data Cleaning
### Initial Steps of Data Cleaning - Exploration

We want to do a brief exploration of the dataset before we jump into data cleaning. This gives us some understanding of, and expectations about, the features and properties of the dataset, which will help us with data cleaning later.

In [3]:
# Read the data into df
df = pd.read_csv("D:/Documents/Data Science Learning/My Project/Recruitment Scam/02-data/recruitment.csv")

# Take a look at the first 10 rows of the dataset
df.head(10)

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset
0,Marketing Intern,"US, NY, New York",Marketing,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,f,t,f,Other,Internship,,,Marketing,f,f
1,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,f,t,f,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,f,f
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p>Our client, located in Houston, is actively seeking an experienced Commissioning Machinery As...",<ul>\r\n<li>Implement pre-commissioning and commissioning procedures for rotary equipment.</li>\...,,f,t,f,,,,,,f,f
3,Account Executive - Washington DC,"US, DC, Washington",Sales,,<p>Our passion for improving quality of life through geography is at the heart of everything we ...,<p><b>THE COMPANY: ESRI – Environmental Systems Research Institute</b></p>\r\n<p>Our passion for...,"<ul>\r\n<li>\r\n<b>EDUCATION: </b>Bachelor’s or Master’s in GIS, business administration, or a r...","<p>Our culture is anything but corporate—we have a collaborative, creative environment; phone di...",f,t,f,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,f,f
4,Bill Review Manager,"US, FL, Fort Worth",,,<p>SpotSource Solutions LLC is a Global Human Capital Management Consulting firm headquartered i...,"<p><b>JOB TITLE:</b> Itemization Review Manager</p>\r\n<p><b>LOCATION:</b> Fort Worth, TX<b> ...",<p><b>QUALIFICATIONS:</b></p>\r\n<ul>\r\n<li>RN license in the State of Texas</li>\r\n<li>Diplom...,<p>Full Benefits Offered</p>,f,t,t,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,f,f
5,Accounting Clerk,"US, MD,",,,,<p><b>Job Overview</b></p>\r\n<p>Apex is an environmental consulting firm that offers stable lea...,,,f,f,f,,,,,,f,f
6,Head of Content (m/f),"DE, BE, Berlin",ANDROIDPIT,20000-28000,"<p>Founded in 2009, the <b>Fonpit AG</b> rose with its international web portal <b>ANDROIDPIT</b...",<p><b>Your Responsibilities:</b></p>\r\n<p> </p>\r\n<ul>\r\n<li>Manage the English-speaking edit...,<p><b>Your Know-How:</b></p>\r\n<p><b> ...,<p><b>Your Benefits:</b></p>\r\n<p> </p>\r\n<ul>\r\n<li>Being part of a fast-growing company in ...,f,t,t,Full-time,Mid-Senior level,Master's Degree,Online Media,Management,f,f
7,Lead Guest Service Specialist,"US, CA, San Francisco",,,<p>Airenvy’s mission is to provide lucrative yet hassle free full service short term property ma...,<h3>Who is Airenvy?</h3>\r\n<p>Hey there! We are seasoned entrepreneurs in the heart of San Fran...,"<ul>\r\n<li>Experience with CRM software, live chat, and phones, including one year minimum of c...",<p><b>Competitive Pay.</b> You'll be able to eat steak everyday if you choose to. </p>\r\n<p><b...,f,t,t,,,,,,f,f
8,HP BSM SME,"US, FL, Pensacola",,,<p>Solutions3 is a <b>woman-owned small business </b>whose focus is IT Service Management using ...,<p></p>\r\n<p></p>\r\n<p>Implementation/Configuration/Testing/Training on:</p>\r\n<p>HP Service ...,<p><b>MUST BE A US CITIZEN.</b></p>\r\n<p><b>An active TS/SCI clearance will be required.</b></p...,,f,t,t,Full-time,Associate,,Information Technology and Services,,f,f
9,Customer Service Associate - Part Time,"US, AZ, Phoenix",,,"<p>Novitex Enterprise Solutions, formerly Pitney Bowes Management Services, delivers innovative ...","<p>The Customer Service Associate will be based in Phoenix, AZ. The right candidate will be an i...",<p><b>Minimum Requirements:</b></p>\r\n<ul>\r\n<li>Minimum of 6 months customer service related ...,,f,t,f,Part-time,Entry level,High School or equivalent,Financial Services,Customer Service,f,f


Looking at the first few rows, we note that there are 4 textual columns, with the remaining 14 being structured metadata columns. This mix lets us try some creative approaches in the later machine learning sections.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   title                17880 non-null  object
 1   location             17534 non-null  object
 2   department           6333 non-null   object
 3   salary_range         2868 non-null   object
 4   company_profile      14572 non-null  object
 5   description          17880 non-null  object
 6   requirements         15191 non-null  object
 7   benefits             10684 non-null  object
 8   telecommuting        17880 non-null  object
 9   has_company_logo     17880 non-null  object
 10  has_questions        17880 non-null  object
 11  employment_type      14409 non-null  object
 12  required_experience  10830 non-null  object
 13  required_education   9775 non-null   object
 14  industry             12977 non-null  object
 15  function             11425 non-null  object
 16  frau

All columns have the object data type, which is quite different from what we usually see in other projects.

In [5]:
df.shape

(17880, 18)

Let's also take a look at the distribution of our target variable.

In [6]:
df['fraudulent'].value_counts()

f    17014
t      866
Name: fraudulent, dtype: int64

In [7]:
df['fraudulent'].value_counts() / df.shape[0]

f    0.951566
t    0.048434
Name: fraudulent, dtype: float64

From these counts alone, we know that this is a highly imbalanced dataset, with about 95% negative observations and less than 5% positives. From previous experience, class imbalance is a prominent issue because it can strongly affect the metrics and prediction performance. Balancing methods will be needed to handle this problem.
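To make the imbalance concrete, here is a small sketch that computes the class ratio and draws a naive random undersample of the majority class. It runs on a toy frame, and the `toy` data, the `random_state` and the 1:1 target ratio are assumptions for illustration only; the balancing methods compared later in the project may well differ.

```python
import pandas as pd

# Toy stand-in for the real data: 95 legitimate ('f') and 5 fraudulent ('t') rows
toy = pd.DataFrame({"fraudulent": ["f"] * 95 + ["t"] * 5})

counts = toy["fraudulent"].value_counts()
imbalance_ratio = counts["f"] / counts["t"]  # majority rows per minority row

# Naive random undersampling: keep every minority row and an equally sized
# random sample of the majority rows
minority = toy[toy["fraudulent"] == "t"]
majority = toy[toy["fraudulent"] == "f"].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority]).reset_index(drop=True)

print(imbalance_ratio)
print(balanced["fraudulent"].value_counts())
```

Undersampling throws away most of the majority class, so it is only one of several options worth comparing.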

In [8]:
df.isnull().sum()

title                      0
location                 346
department             11547
salary_range           15012
company_profile         3308
description                0
requirements            2689
benefits                7196
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3471
required_experience     7050
required_education      8105
industry                4903
function                6455
fraudulent                 0
in_balanced_dataset        0
dtype: int64

Quite a lot of columns have missing values to deal with. The missing entries do not show an obvious pattern, so we have to investigate the columns one by one with caution, as we can't just drop them right away.
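Raw counts are easier to compare as percentages. A minimal sketch on a toy frame (the `toy` data and the `missing_pct` name are illustrative, not part of the project) that ranks columns by their share of missing entries:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "location": ["US", None, "GB", "NZ"],
    "salary_range": [None, None, None, "0-0"],
})

# isnull().mean() is the fraction of missing entries per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Because `mean()` divides by the frame's own row count, the percentages always refer to the frame being inspected.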

Let's look at some summary statistics as a whole.

In [9]:
df.describe()

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset
count,17880,17534,6333,2868,14572,17880,15191,10684,17880,17880,17880,14409,10830,9775,12977,11425,17880,17880
unique,11231,3105,1337,874,1710,15095,12119,6510,2,2,2,5,7,13,131,37,2,2
top,English Teacher Abroad,"GB, LND, London",Sales,0-0,<p>We help teachers get safe &amp; secure jobs abroad :)</p>,"<p>Play with kids, get paid for it </p>\r\n<p>Love travel? Jobs in Asia</p>\r\n<p>$1,500+ USD mo...",<p>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not nec...,<p>See job description</p>,f,t,f,Full-time,Mid-Senior level,Bachelor's Degree,Information Technology and Services,Information Technology,f,f
freq,311,718,551,142,726,376,410,726,17113,14220,9088,11620,3809,5145,1734,1749,17014,16980


It is pretty interesting to see the summaries here. Every single column is treated as a categorical variable, indicating that for the structured data section we are dealing only with categorical variables.
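Since the flag columns hold the strings `t` and `f`, a later modelling step will likely need them as numbers. A minimal sketch, assuming a simple 1/0 encoding is wanted (the `flag_map` and `flag_cols` names and the toy frame are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "telecommuting": ["f", "t", "f"],
    "has_company_logo": ["t", "t", "f"],
    "fraudulent": ["f", "f", "t"],
})

flag_map = {"t": 1, "f": 0}
flag_cols = ["telecommuting", "has_company_logo", "fraudulent"]

# Map each 't'/'f' string to 1/0; any unexpected value would become NaN
encoded = toy[flag_cols].apply(lambda col: col.map(flag_map))
print(encoded.dtypes)
```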

Now that we have a brief understanding of the data, it is time to draft the actual data cleaning steps.

## Data Cleaning (Structured Data)

Due to the nature of this dataset, we have to divide the data cleaning/preprocessing into 2 sections: one for structured and another for unstructured data. Obviously, we'll deal with the structured data first. In this section, we are only processing structured information.

In [10]:
df.head()

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset
0,Marketing Intern,"US, NY, New York",Marketing,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,f,t,f,Other,Internship,,,Marketing,f,f
1,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,f,t,f,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,f,f
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p>Our client, located in Houston, is actively seeking an experienced Commissioning Machinery As...",<ul>\r\n<li>Implement pre-commissioning and commissioning procedures for rotary equipment.</li>\...,,f,t,f,,,,,,f,f
3,Account Executive - Washington DC,"US, DC, Washington",Sales,,<p>Our passion for improving quality of life through geography is at the heart of everything we ...,<p><b>THE COMPANY: ESRI – Environmental Systems Research Institute</b></p>\r\n<p>Our passion for...,"<ul>\r\n<li>\r\n<b>EDUCATION: </b>Bachelor’s or Master’s in GIS, business administration, or a r...","<p>Our culture is anything but corporate—we have a collaborative, creative environment; phone di...",f,t,f,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,f,f
4,Bill Review Manager,"US, FL, Fort Worth",,,<p>SpotSource Solutions LLC is a Global Human Capital Management Consulting firm headquartered i...,"<p><b>JOB TITLE:</b> Itemization Review Manager</p>\r\n<p><b>LOCATION:</b> Fort Worth, TX<b> ...",<p><b>QUALIFICATIONS:</b></p>\r\n<ul>\r\n<li>RN license in the State of Texas</li>\r\n<li>Diplom...,<p>Full Benefits Offered</p>,f,t,t,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,f,f


### 1. Removing Duplicates

The very first step of data cleaning is always checking for duplicate entries. Let's check them out.

In [11]:
df[df.duplicated()]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset
146,Customer Service Associate,"US, TX, Dallas",,,"<p>Novitex Enterprise Solutions, formerly Pitney Bowes Management Services, delivers innovative ...","<p>The Customer Service Associate will be based in Dallas, TX. The right candidate will be an in...",<p><b>Qualifications</b></p>\r\n<ul>\r\n<li>Minimum of 6 months customer service related experie...,,f,t,f,Full-time,Entry level,High School or equivalent,Telecommunications,Customer Service,f,f
402,Inside Sales Professional-Omaha,"US, NE, Omaha",,,"<p>ABC Supply Co., Inc. is the nation’s largest wholesale distributor of roofing and one of the ...","<p>As a Sales Representative, you will provide assistance to our customers as they purchase the ...","<p>As a Sales Representative, you must have the ability to provide superior customer service and...",<p><br>Your benefits package as a Sales Representative may include:<br><br></p>\r\n<ul>\r\n<li>H...,f,t,f,Full-time,,,Building Materials,Sales,f,f
495,Customer Service Associate - Part Time,"US, IL, Warrenville",,,"<p>Novitex Enterprise Solutions, formerly Pitney Bowes Management Services, delivers innovative ...","<p>The Customer Service Associate will be based in Warrenville, IL. The right candidate will be ...",<p><b>Minimum Requirements:</b></p>\r\n<ul>\r\n<li>Minimum of 6 months customer service related ...,,f,t,f,Full-time,Entry level,High School or equivalent,Insurance,Administrative,f,f
1327,Recruiter/Recruiting Assistant,"US, CA, Inglewood",,,,<p><i>“We believe our best investment is in our people.”</i> – Healthy Spot Core Value #8</p>\r\...,,,f,f,f,,,,,,f,f
1572,Telemarketing professional,"US, MA, Wilmington",BDC,30000-70000,<p>We are a family run business that has been in operation for nearly 40 years. We value long t...,"<p>Bill Dube Hyundai in Wilmington MA just outside of Boston, is a growing Hyundai dealer that i...",<p>Come grow with our exploding Internet Sales Deptartment. Be part of a cutting edge team of p...,<p><br><b><i>Compensation:</i></b></p>\r\n<ul>\r\n<li>$10.00 per hour plus commision. Commission...,f,t,f,Full-time,Not Applicable,Unspecified,Automotive,Customer Service,f,f
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17591,Home Based Payroll Typist/Data Entry Clerks Positions Available,"US, MT, Absarokee",Clerical,,,<p>We have several openings available in this area earning $1000.00-$2500.00 per week. </p>\r\n<...,"<p>Basic computer and typing skills, ability to spell and print neatly, ability to follow direct...",<p>All you need is access to the Internet and you can participate. This is an entry level positi...,f,f,f,,,,,,t,t
17612,Urgent Jobs (Part Time Workers Needed),"AU, NSW, Sydney",,,,"<p>Urgent Jobs (Part Time Workers Needed)<br>You can do it all from home, in your free time, at ...",<p>No any experience required.</p>,,f,f,f,Part-time,,,,,t,t
17620,Data Entry Admin/Clerical Positions - Work From Home,"US, NE, Omaha",,,,"<p>ACCEPTING ONLINE APPLICATIONS ONLY</p>\r\n<p><a href=""#URL_355834be989baf102e6108409c8919edca...",,,f,f,f,,,,,,t,t
17742,Data Entry Admin/Clerical Positions - Work From Home,"US, NE, Omaha",,,,"<p>ACCEPTING ONLINE APPLICATIONS ONLY</p>\r\n<p><a href=""#URL_355834be989baf102e6108409c8919edca...",,,f,f,f,,,,,,t,t


Unexpectedly, we have quite a lot of duplicates: 235 rows are flagged, which means 235 rows will be removed. We can use the drop_duplicates function and see how it works.

In [12]:
df[(df['title'] == "Data Entry Admin/Clerical Positions - Work From Home") & (df['location'] == "US, NE, Omaha")]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset
17531,Data Entry Admin/Clerical Positions - Work From Home,"US, NE, Omaha",,,,"<p>ACCEPTING ONLINE APPLICATIONS ONLY</p>\r\n<p><a href=""#URL_355834be989baf102e6108409c8919edca...",,,f,f,f,,,,,,t,t
17620,Data Entry Admin/Clerical Positions - Work From Home,"US, NE, Omaha",,,,"<p>ACCEPTING ONLINE APPLICATIONS ONLY</p>\r\n<p><a href=""#URL_355834be989baf102e6108409c8919edca...",,,f,f,f,,,,,,t,t
17742,Data Entry Admin/Clerical Positions - Work From Home,"US, NE, Omaha",,,,"<p>ACCEPTING ONLINE APPLICATIONS ONLY</p>\r\n<p><a href=""#URL_355834be989baf102e6108409c8919edca...",,,f,f,f,,,,,,t,t
17791,Data Entry Admin/Clerical Positions - Work From Home,"US, NE, Omaha",,,,"<p>ACCEPTING ONLINE APPLICATIONS ONLY</p>\r\n<p><a href=""#URL_355834be989baf102e6108409c8919edca...",,,f,f,f,,,,,,t,t


The output above confirms that this job title has 4 identical rows, and the duplicated() function indeed identified 3 of them as duplicates.
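The reason only 3 of the 4 rows are flagged is the default `keep='first'` behaviour of `duplicated()`. A small sketch on toy data (the `toy` frame is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Clerk", "Clerk", "Clerk", "Clerk", "Intern"],
    "location": ["US, NE, Omaha"] * 4 + ["US, NY, New York"],
})

# Default keep='first': the first occurrence is NOT flagged, later copies are
later_copies = toy.duplicated()

# keep=False flags every member of a duplicated group, including the first
all_copies = toy.duplicated(keep=False)

print(later_copies.sum(), all_copies.sum())  # 3 4
```

`drop_duplicates()` uses the same default, so it keeps the first row of each group and removes the rest.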

In [13]:
# Drop the duplicated rows and save to df2
df2 = df.drop_duplicates()

# Take a look at the shape of df2
df.shape, df2.shape

((17880, 18), (17645, 18))

Now, let's view the job title that we just looked at and see what happened.

In [14]:
df2[(df2['title'] == "Data Entry Admin/Clerical Positions - Work From Home") & (df2['location'] == "US, NE, Omaha")]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset
17531,Data Entry Admin/Clerical Positions - Work From Home,"US, NE, Omaha",,,,"<p>ACCEPTING ONLINE APPLICATIONS ONLY</p>\r\n<p><a href=""#URL_355834be989baf102e6108409c8919edca...",,,f,f,f,,,,,,t,t


As expected, we now have only one entry for this job title; the duplicates have been dropped. We can confirm this again using the duplicated() function.

In [15]:
df2[df2.duplicated()]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset


Now our new df2 has no duplicates; we have successfully dropped the unwanted rows.

### 2. Dealing with Missing Values

There are 11 columns that contain missing entries, and we want to deal with them before we process the text data. Since we do not have much information about the missing values, we should proceed with caution, one column at a time. Let's take a look at the counts again.

In [16]:
df2.isnull().sum()

title                      0
location                 343
department             11362
salary_range           14813
company_profile         3287
description                0
requirements            2650
benefits                7103
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3435
required_experience     6976
required_education      8028
industry                4849
function                6378
fraudulent                 0
in_balanced_dataset        0
dtype: int64

In [17]:
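# Percentage of missing values per column
# Note: the denominator is the row count of the original df (17880), not of df2 (17645)
df2.isnull().sum() / df.shape[0] * 100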

title                   0.000000
location                1.918345
department             63.545861
salary_range           82.846756
company_profile        18.383669
description             0.000000
requirements           14.821029
benefits               39.725951
telecommuting           0.000000
has_company_logo        0.000000
has_questions           0.000000
employment_type        19.211409
required_experience    39.015660
required_education     44.899329
industry               27.119687
function               35.671141
fraudulent              0.000000
in_balanced_dataset     0.000000
dtype: float64

First of all, we need to clearly define which columns are going to be cleaned and which columns will be used for machine learning. Several columns are messy enough that cleaning them properly would take far too much time, which is not worth the effort for a project of this scale.

Meta Columns that we are going to clean and fill up:

- **country**
- **state**
- **required_experience**   
- **required_education**      
- **function**

Meta Columns that we are going to use for machine learning:

- **title**
- **telecommuting**
- **has_company_logo**
- **has_questions**
- **required_experience**
- **required_education**
- **function**

To be considered: Country and State
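The `country` and `state` columns listed above are not in the raw data; presumably they come from splitting `location`, whose values look like `US, NY, New York`. Here is a sketch of one way to do that split, assuming a comma-separated country, state, city layout (the `city` column, the `parts` name and the toy frame are illustrative, not the project's actual code):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"location": ["US, NY, New York", "NZ, , Auckland", "US, MD,", None]}
)

# Split into at most three parts so any extra commas stay in the last part
parts = toy["location"].str.split(",", n=2, expand=True)
parts.columns = ["country", "state", "city"]

# Trim whitespace and treat empty strings as missing
parts = parts.apply(lambda col: col.str.strip()).replace("", np.nan)
print(parts)
```

Rows with a missing `location` simply stay missing in all three derived columns, so they still need separate handling.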

#### i. Location

The first column whose missing values we clean up is location. It has 343 missing entries, which account for about 1.9% of the total observations. We should examine them.

In [18]:
df2[df2['location'].isnull()]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset
144,Forward Cap.,,,,,<p>The group has raised a fund for the purchase of homes in the Southeast. The student on this p...,,,f,f,f,,,,,,t,t
204,Junior Python Developer,,Line-Up,,<p>Playfair Capital is an early stage technology investment fund based in London. </p>,<p><b>Who we’re looking for</b><br><i>Maker Mentality</i>Are you focused on the ‘doing’; the cre...,<p><b>Skills and experience</b></p>\r\n<ul>\r\n<li>Degree in Computer Science or equivalent</li>...,,f,t,f,,,,,,f,f
234,Postgraduate Certificate in Social Innovation Management Kenya - March 2015,,,,<p>The Amani Institute is about developing whole individuals who have the knowledge and practica...,"<p>This unique, field-based, full-time program brings together 25 individuals from different cou...",<p>What do we look for in a program participant?</p>\r\n<p>If you meet the majority of the requi...,<p>Sign up for:</p>\r\n<ul>\r\n<li>25 classmates from around the world</li>\r\n<li>Facilitated a...,f,t,t,,,,,,f,f
325,Head of Quality Assurance,,,,<p>Gelato Group is a SaaS company. We've developed a global print engine integrated with the pri...,<p>Following our global expansion we are seeking to add an experienced world-class head of Quali...,<ul>\r\n<li>A minimum of B.S. degree in Information Technology or Computer Science</li>\r\n<li>3...,,f,t,f,,,,,,f,f
349,Embedded Systems / Telematics Security Consultant,,Professional Services,,"<p>Cylance is a global cybersecurity products and service company, specializeing in advanced thr...",<p><b>Summary</b></p>\r\n<ul>\r\n<li>Immediate requirement for an advanced telematics/embedded s...,<p><b>Qualifications</b></p>\r\n<ul>\r\n<li>Bachelor degree in Information Technology/Computer S...,,f,t,f,,,,,,f,f
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17792,Rooms Division Manager,,,,<p>Awarded by <i><b>Expatriate Lifestyle Magazine</b></i> with <b>2013 Best Business Hotel Excel...,<p>The Rooms Division Manager is responsible for Executive Housekeeping and Front<br>Office. He/...,<p>High school or equivalent education required. Bachelor's degree and Master's degree<br>in rel...,,f,t,f,,,,,,t,t
17809,Data Entry / Administrative Assitstant / Admin Clerk / Office Assistant / Customer Service Rep,,,,,"<p>As a Data Entry / Administrative Assitstant / Admin Clerk Associate, your duties will inclu...",,,t,f,f,Full-time,Entry level,Unspecified,Telecommunications,Administrative,t,t
17821,Webcam Model,,,,,<p>Internet Modeling is a premier adult modeling agency recruiting and hiring webcam models for ...,"<p><b>In order to be considered for a webcam model position, you MUST:</b> - Be an attractive fe...",<p><b>We provide the following benefits to all our webcam models:</b></p>\r\n<p>- Earn from $0.8...,f,f,t,,,,,,t,t
17822,5 Guys,,,,,<p>Analyze the excel books of the franchise and then post them online for him to use.</p>,,,f,f,f,,,,,,t,t


Let's do the pre-processing for the missing values. There's a lot of work to do.

In [19]:
# countries_for_language returns a list of tuples now, might be changed to an OrderedDict
countries = dict(countries_for_language('en'))

# Take a look
countries

{'AF': 'Afghanistan',
 'AX': 'Åland Islands',
 'AL': 'Albania',
 'DZ': 'Algeria',
 'AS': 'American Samoa',
 'AD': 'Andorra',
 'AO': 'Angola',
 'AI': 'Anguilla',
 'AQ': 'Antarctica',
 'AG': 'Antigua & Barbuda',
 'AR': 'Argentina',
 'AM': 'Armenia',
 'AW': 'Aruba',
 'AU': 'Australia',
 'AT': 'Austria',
 'AZ': 'Azerbaijan',
 'BS': 'Bahamas',
 'BH': 'Bahrain',
 'BD': 'Bangladesh',
 'BB': 'Barbados',
 'BY': 'Belarus',
 'BE': 'Belgium',
 'BZ': 'Belize',
 'BJ': 'Benin',
 'BM': 'Bermuda',
 'BT': 'Bhutan',
 'BO': 'Bolivia',
 'BA': 'Bosnia & Herzegovina',
 'BW': 'Botswana',
 'BV': 'Bouvet Island',
 'BR': 'Brazil',
 'IO': 'British Indian Ocean Territory',
 'VG': 'British Virgin Islands',
 'BN': 'Brunei',
 'BG': 'Bulgaria',
 'BF': 'Burkina Faso',
 'BI': 'Burundi',
 'KH': 'Cambodia',
 'CM': 'Cameroon',
 'CA': 'Canada',
 'CV': 'Cape Verde',
 'BQ': 'Caribbean Netherlands',
 'KY': 'Cayman Islands',
 'CF': 'Central African Republic',
 'TD': 'Chad',
 'CL': 'Chile',
 'CN': 'China',
 'CX': 'Christmas Isla

In [20]:
# Try and see the info for United Kingdom
CountryInfo("United Kingdom").info()

{'name': 'United Kingdom',
 'altSpellings': ['GB', 'UK', 'Great Britain'],
 'area': 242900,
 'borders': ['IRL'],
 'callingCodes': ['44'],
 'capital': 'London',
 'capital_latlng': [51.507322, -0.127647],
 'currencies': ['GBP'],
 'demonym': 'British',
 'flag': '',
 'geoJSON': {'type': 'FeatureCollection',
  'features': [{'type': 'Feature',
    'id': 'GBR',
    'properties': {'name': 'United Kingdom'},
    'geometry': {'type': 'MultiPolygon',
     'coordinates': [[[[-5.661949, 54.554603],
        [-6.197885, 53.867565],
        [-6.95373, 54.073702],
        [-7.572168, 54.059956],
        [-7.366031, 54.595841],
        [-7.572168, 55.131622],
        [-6.733847, 55.17286],
        [-5.661949, 54.554603]]],
      [[[-3.005005, 58.635],
        [-4.073828, 57.553025],
        [-3.055002, 57.690019],
        [-1.959281, 57.6848],
        [-2.219988, 56.870017],
        [-3.119003, 55.973793],
        [-2.085009, 55.909998],
        [-2.005676, 55.804903],
        [-1.114991, 54.624986],
  

In [21]:
# Try another, see the info for Malaysia
CountryInfo("Malaysia").info()

{'name': 'Malaysia',
 'altSpellings': ['MY'],
 'area': 330803,
 'borders': ['BRN', 'IDN', 'THA'],
 'callingCodes': ['60'],
 'capital': 'Kuala Lumpur',
 'capital_latlng': [3.151696, 101.694237],
 'currencies': ['MYR'],
 'demonym': 'Malaysian',
 'flag': '',
 'geoJSON': {'type': 'FeatureCollection',
  'features': [{'type': 'Feature',
    'id': 'MYS',
    'properties': {'name': 'Malaysia'},
    'geometry': {'type': 'MultiPolygon',
     'coordinates': [[[[101.075516, 6.204867],
        [101.154219, 5.691384],
        [101.814282, 5.810808],
        [102.141187, 6.221636],
        [102.371147, 6.128205],
        [102.961705, 5.524495],
        [103.381215, 4.855001],
        [103.438575, 4.181606],
        [103.332122, 3.726698],
        [103.429429, 3.382869],
        [103.502448, 2.791019],
        [103.854674, 2.515454],
        [104.247932, 1.631141],
        [104.228811, 1.293048],
        [103.519707, 1.226334],
        [102.573615, 1.967115],
        [101.390638, 2.760814],
        [1

In [22]:
CountryInfo("MY").capital()

'Kuala Lumpur'

In [23]:
CountryInfo("United Kingdom").capital()

'London'

In [24]:
CountryInfo("United Kingdom").provinces()

['Barking and Dagenham',
 'Barnet',
 'Barnsley',
 'Bath and North East Somerset',
 'Bedfordshire',
 'Bexley',
 'Birmingham',
 'Blackburn with Darwen',
 'Blackpool',
 'Bolton',
 'Bournemouth',
 'Bracknell Forest',
 'Bradford',
 'Brent',
 'Brighton and Hove',
 'Bromley',
 'Buckinghamshire',
 'Bury',
 'Calderdale',
 'Cambridgeshire',
 'Camden',
 'Cheshire',
 'City of Bristol',
 'City of Kingston upon Hull',
 'City of London',
 'Cornwall',
 'Coventry',
 'Croydon',
 'Cumbria',
 'Darlington',
 'Derby',
 'Derbyshire',
 'Devon',
 'Doncaster',
 'Dorset',
 'Dudley',
 'Durham',
 'Ealing',
 'East Riding of Yorkshire',
 'East Sussex',
 'Enfield',
 'Essex',
 'Gateshead',
 'Gloucestershire',
 'Greenwich',
 'Hackney',
 'Halton',
 'Hammersmith and Fulham',
 'Hampshire',
 'Haringey',
 'Harrow',
 'Hartlepool',
 'Havering',
 'Herefordshire',
 'Hertfordshire',
 'Hillingdon',
 'Hounslow',
 'Isle of Wight',
 'Islington',
 'Kensington and Chelsea',
 'Kent',
 'Kingston upon Thames',
 'Kirklees',
 'Knowsley',
 

In [25]:
[countries["AD"], countries["AX"], countries["AQ"], countries["BV"], countries["VG"], countries["BQ"], countries["CW"], 
 countries["ME"], countries["MM"], countries["PS"], countries["RS"], countries["SX"], countries["BL"], countries['MF'], 
 countries["TC"], countries["UM"], countries["VI"], countries["VA"]]

['Andorra',
 'Åland Islands',
 'Antarctica',
 'Bouvet Island',
 'British Virgin Islands',
 'Caribbean Netherlands',
 'Curaçao',
 'Montenegro',
 'Myanmar (Burma)',
 'Palestinian Territories',
 'Serbia',
 'Sint Maarten',
 'St. Barthélemy',
 'St. Martin',
 'Turks & Caicos Islands',
 'U.S. Outlying Islands',
 'U.S. Virgin Islands',
 'Vatican City']

In [26]:
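# Start from every country code, then exclude the 18 codes listed in the previous cell
codes = list(countries.keys())
codes = [code for code in codes if code not in ["AX", "AD", "AQ", "BV", "VG", "BQ", "CW", "ME", "MM", "PS", "RS", 
                                                "SX", "BL", "MF", "TC", "UM", "VI", "VA"]]
# Serbia is added back by name rather than by its code ("RS")
codes.append("Serbia")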
codes

['AF',
 'AL',
 'DZ',
 'AS',
 'AO',
 'AI',
 'AG',
 'AR',
 'AM',
 'AW',
 'AU',
 'AT',
 'AZ',
 'BS',
 'BH',
 'BD',
 'BB',
 'BY',
 'BE',
 'BZ',
 'BJ',
 'BM',
 'BT',
 'BO',
 'BA',
 'BW',
 'BR',
 'IO',
 'BN',
 'BG',
 'BF',
 'BI',
 'KH',
 'CM',
 'CA',
 'CV',
 'KY',
 'CF',
 'TD',
 'CL',
 'CN',
 'CX',
 'CC',
 'CO',
 'KM',
 'CG',
 'CD',
 'CK',
 'CR',
 'CI',
 'HR',
 'CU',
 'CY',
 'CZ',
 'DK',
 'DJ',
 'DM',
 'DO',
 'EC',
 'EG',
 'SV',
 'GQ',
 'ER',
 'EE',
 'SZ',
 'ET',
 'FK',
 'FO',
 'FJ',
 'FI',
 'FR',
 'GF',
 'PF',
 'TF',
 'GA',
 'GM',
 'GE',
 'DE',
 'GH',
 'GI',
 'GR',
 'GL',
 'GD',
 'GP',
 'GU',
 'GT',
 'GG',
 'GN',
 'GW',
 'GY',
 'HT',
 'HM',
 'HN',
 'HK',
 'HU',
 'IS',
 'IN',
 'ID',
 'IR',
 'IQ',
 'IE',
 'IM',
 'IL',
 'IT',
 'JM',
 'JP',
 'JE',
 'JO',
 'KZ',
 'KE',
 'KI',
 'KW',
 'KG',
 'LA',
 'LV',
 'LB',
 'LS',
 'LR',
 'LY',
 'LI',
 'LT',
 'LU',
 'MO',
 'MG',
 'MW',
 'MY',
 'MV',
 'ML',
 'MT',
 'MH',
 'MQ',
 'MR',
 'MU',
 'YT',
 'MX',
 'FM',
 'MD',
 'MC',
 'MN',
 'MS',
 'MA',
 'MZ',
 'NA',

In [27]:
capitals = [CountryInfo(code).capital() for code in codes]

# Remove the empty strings
capitals = [capital for capital in capitals if capital != ""]

In [28]:
#states = [val['name'] for val in gc.get_us_states().values()]
states = CountryInfo("United States").provinces()
states

['Alabama',
 'Alaska',
 'Arizona',
 'Arkansas',
 'California',
 'Colorado',
 'Connecticut',
 'Delaware',
 'District of Columbia',
 'Florida',
 'Georgia',
 'Hawaii',
 'Idaho',
 'Illinois',
 'Indiana',
 'Iowa',
 'Kansas',
 'Kentucky',
 'Louisiana',
 'Maine',
 'Maryland',
 'Massachusetts',
 'Michigan',
 'Minnesota',
 'Mississippi',
 'Missouri',
 'Montana',
 'Nebraska',
 'Nevada',
 'New Hampshire',
 'New Jersey',
 'New Mexico',
 'New York',
 'North Carolina',
 'North Dakota',
 'Ohio',
 'Oklahoma',
 'Oregon',
 'Pennsylvania',
 'Rhode Island',
 'South Carolina',
 'South Dakota',
 'Tennessee',
 'Texas',
 'Utah',
 'Vermont',
 'Virginia',
 'Washington',
 'West Virginia',
 'Wisconsin',
 'Wyoming']

In [29]:
province_dict = {code: CountryInfo(code).provinces() for code in codes}
province_dict

{'AF': ['Badakhshan',
  'Badghis',
  'Baghlan',
  'Balkh',
  'Bamian',
  'Farah',
  'Faryab',
  'Ghazni',
  'Ghowr',
  'Helmand',
  'Herat',
  'Jowzjan',
  'Kabol',
  'Kandahar',
  'Kapisa',
  'Konar',
  'Kondoz',
  'Laghman',
  'Lowgar',
  'Nangarhar',
  'Nimruz',
  'Oruzgan',
  'Paktia',
  'Paktika',
  'Parvan',
  'Samangan',
  'Sar-e Pol',
  'Takhar',
  'Vardak',
  'Zabol'],
 'AL': ['Berat',
  'Bulqize',
  'Delvine',
  'Devoll (Bilisht)',
  'Diber (Peshkopi)',
  'Durres',
  'Elbasan',
  'Fier',
  'Gjirokaster',
  'Gramsh',
  'Has (Krume)',
  'Kavaje',
  'Kolonje (Erseke)',
  'Korce',
  'Kruje',
  'Kucove',
  'Kukes',
  'Kurbin',
  'Lezhe',
  'Librazhd',
  'Lushnje',
  'Malesi e Madhe (Koplik)',
  'Mallakaster (Ballsh)',
  'Mat (Burrel)',
  'Mirdite (Rreshen)',
  'Peqin',
  'Permet',
  'Pogradec',
  'Puke',
  'Sarande',
  'Shkoder',
  'Skrapar (Corovode)',
  'Tepelene',
  'Tirane (Tirana)',
  'Tirane (Tirana)',
  'Tropoje (Bajram Curri)',
  'Vlore'],
 'DZ': ['Adrar',
  'Ain Defla',
 

In [30]:
provinces = []
for province_list in province_dict.values():
    provinces += province_list
provinces

['Badakhshan',
 'Badghis',
 'Baghlan',
 'Balkh',
 'Bamian',
 'Farah',
 'Faryab',
 'Ghazni',
 'Ghowr',
 'Helmand',
 'Herat',
 'Jowzjan',
 'Kabol',
 'Kandahar',
 'Kapisa',
 'Konar',
 'Kondoz',
 'Laghman',
 'Lowgar',
 'Nangarhar',
 'Nimruz',
 'Oruzgan',
 'Paktia',
 'Paktika',
 'Parvan',
 'Samangan',
 'Sar-e Pol',
 'Takhar',
 'Vardak',
 'Zabol',
 'Berat',
 'Bulqize',
 'Delvine',
 'Devoll (Bilisht)',
 'Diber (Peshkopi)',
 'Durres',
 'Elbasan',
 'Fier',
 'Gjirokaster',
 'Gramsh',
 'Has (Krume)',
 'Kavaje',
 'Kolonje (Erseke)',
 'Korce',
 'Kruje',
 'Kucove',
 'Kukes',
 'Kurbin',
 'Lezhe',
 'Librazhd',
 'Lushnje',
 'Malesi e Madhe (Koplik)',
 'Mallakaster (Ballsh)',
 'Mat (Burrel)',
 'Mirdite (Rreshen)',
 'Peqin',
 'Permet',
 'Pogradec',
 'Puke',
 'Sarande',
 'Shkoder',
 'Skrapar (Corovode)',
 'Tepelene',
 'Tirane (Tirana)',
 'Tirane (Tirana)',
 'Tropoje (Bajram Curri)',
 'Vlore',
 'Adrar',
 'Ain Defla',
 'Ain Temouchent',
 'Alger',
 'Annaba',
 'Batna',
 'Bechar',
 'Bejaia',
 'Biskra',
 'Blida

In [31]:
# Remove generic region names (e.g. "Northern", "Eastern"): they do not identify a specific location
# Remove "Centre": the extracted "Centre" does not refer to the place in Cameroon
# The same applies to "Store" and "Lib", and to the other short names that match ordinary words
provinces = [prov for prov in provinces if prov not in ["Northern", "Eastern", "North-East", "Southern", "Centre", "Mus", 
                                                        "Ha", "Man", "Tak", "Sal", "Bor", "Grey", "Store", "Makin",  
                                                        "Lib", "Ig", "Para", "Enga", "Bac", "Est", "Canar", "Paul", 
                                                        "Bay", "Coast", "Valle", "Bie", "Apac", "Cook"]]

In [32]:
len(provinces)

4268
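The short names dropped above are a problem because a plain alternation pattern matches them inside longer words. The following is a minimal sketch of that effect, using only the standard `re` module; the sample sentence and the variable names (`text`, `short_names`, `plain`, `bounded`) are made up for illustration and are not part of the notebook.

```python
import re

# Hypothetical job-ad snippet, used only for illustration.
text = "Rooms Division Manager wanted for our Store in Kuala Lumpur"

# A few of the short province names filtered out in the cell above.
short_names = ["Man", "Store", "Lib"]

# Plain alternation, as produced by '|'.join(...): matches inside longer words.
plain = re.compile("|".join(short_names))
print(plain.search(text).group())    # 'Man', taken from "Manager"

# The same names wrapped in word boundaries: only whole words match.
bounded = re.compile(r"\b(?:" + "|".join(map(re.escape, short_names)) + r")\b")
print(bounded.search(text).group())  # 'Store'
```

Word boundaries remove the partial match on "Manager", but "Store" still matches as an ordinary English word, which is why such names still need to be removed from the list by hand.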

In [33]:
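# Keep only the rows with a missing location; copy after filtering so that
# later column assignments do not act on a view of df2
df_new = df2[df2['location'].isnull()].copy()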

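Extract the country, capital, us_state and province from the description column. Each pattern is an alternation of all known names, and only the first match in each text is kept.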

In [34]:
df_new['country_des'] = df_new["description"].str.extract(r'({})'.format('|'.join([re.sub(r"[()]", "", item) for item in countries.values()])))
df_new['capital_des'] = df_new["description"].str.extract(r'({})'.format('|'.join(capitals)))
df_new['us_state_des'] = df_new["description"].str.extract(r'({})'.format('|'.join(states)))
df_new['province_des'] = df_new["description"].str.extract(r'({})'.format('|'.join([re.sub(r"[()]", "", prov) for prov in provinces])))
df_new

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country_des,capital_des,us_state_des,province_des
144,Forward Cap.,,,,,<p>The group has raised a fund for the purchase of homes in the Southeast. The student on this p...,,,f,f,f,,,,,,t,t,,,,
204,Junior Python Developer,,Line-Up,,<p>Playfair Capital is an early stage technology investment fund based in London. </p>,<p><b>Who we’re looking for</b><br><i>Maker Mentality</i>Are you focused on the ‘doing’; the cre...,<p><b>Skills and experience</b></p>\r\n<ul>\r\n<li>Degree in Computer Science or equivalent</li>...,,f,t,f,,,,,,f,f,,London,,Manchester
234,Postgraduate Certificate in Social Innovation Management Kenya - March 2015,,,,<p>The Amani Institute is about developing whole individuals who have the knowledge and practica...,"<p>This unique, field-based, full-time program brings together 25 individuals from different cou...",<p>What do we look for in a program participant?</p>\r\n<p>If you meet the majority of the requi...,<p>Sign up for:</p>\r\n<ul>\r\n<li>25 classmates from around the world</li>\r\n<li>Facilitated a...,f,t,t,,,,,,f,f,,,,
325,Head of Quality Assurance,,,,<p>Gelato Group is a SaaS company. We've developed a global print engine integrated with the pri...,<p>Following our global expansion we are seeking to add an experienced world-class head of Quali...,<ul>\r\n<li>A minimum of B.S. degree in Information Technology or Computer Science</li>\r\n<li>3...,,f,t,f,,,,,,f,f,,,,
349,Embedded Systems / Telematics Security Consultant,,Professional Services,,"<p>Cylance is a global cybersecurity products and service company, specializeing in advanced thr...",<p><b>Summary</b></p>\r\n<ul>\r\n<li>Immediate requirement for an advanced telematics/embedded s...,<p><b>Qualifications</b></p>\r\n<ul>\r\n<li>Bachelor degree in Information Technology/Computer S...,,f,t,f,,,,,,f,f,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17792,Rooms Division Manager,,,,<p>Awarded by <i><b>Expatriate Lifestyle Magazine</b></i> with <b>2013 Best Business Hotel Excel...,<p>The Rooms Division Manager is responsible for Executive Housekeeping and Front<br>Office. He/...,<p>High school or equivalent education required. Bachelor's degree and Master's degree<br>in rel...,,f,t,f,,,,,,t,t,,,,
17809,Data Entry / Administrative Assitstant / Admin Clerk / Office Assistant / Customer Service Rep,,,,,"<p>As a Data Entry / Administrative Assitstant / Admin Clerk Associate, your duties will inclu...",,,t,f,f,Full-time,Entry level,Unspecified,Telecommunications,Administrative,t,t,,,,
17821,Webcam Model,,,,,<p>Internet Modeling is a premier adult modeling agency recruiting and hiring webcam models for ...,"<p><b>In order to be considered for a webcam model position, you MUST:</b> - Be an attractive fe...",<p><b>We provide the following benefits to all our webcam models:</b></p>\r\n<p>- Earn from $0.8...,f,f,t,,,,,,t,t,,,,
17822,5 Guys,,,,,<p>Analyze the excel books of the franchise and then post them online for him to use.</p>,,,f,f,f,,,,,,t,t,,,,
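The patterns above strip parentheses from the names instead of escaping them, and they can match inside longer words. As a hedged alternative, here is a small sketch that escapes every name with `re.escape` and only accepts matches that are not glued to other word characters. The helper `build_pattern` and the toy data are assumptions made for this example; they do not appear in the notebook, and swapping this in would change the extraction results shown above.

```python
import re
import pandas as pd


def build_pattern(names):
    """Return a one-group regex matching any of `names` as a whole phrase."""
    # Longest names first, so a longer name wins over a shorter prefix of it.
    ordered = sorted(set(names), key=len, reverse=True)
    body = "|".join(re.escape(name) for name in ordered)
    # Lookarounds instead of \b, because some names end in ')' and
    # \b would fail between ')' and a following comma or space.
    return r"(?<!\w)(" + body + r")(?!\w)"


# Toy data standing in for the description column (hypothetical).
toy = pd.DataFrame({"description": [
    "Based in Tirane (Tirana), hiring now.",
    "Our Manager is remote.",
    None,
]})
toy_names = ["Tirane (Tirana)", "Man", "London"]

# expand=False returns a Series because the pattern has a single group.
toy["place"] = toy["description"].str.extract(build_pattern(toy_names), expand=False)
print(toy["place"].tolist())
```

In this toy frame only the first row gets a value: "Man" no longer matches inside "Manager", and the missing description stays missing.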


Extract the country, capital, us_state and province from the company_profile column in the same way.

In [35]:
df_new['country_cp'] = df_new["company_profile"].str.extract(r'({})'.format('|'.join([re.sub(r"[()]", "", item) for item in countries.values()])))
df_new['capital_cp'] = df_new["company_profile"].str.extract(r'({})'.format('|'.join(capitals)))
df_new['us_state_cp'] = df_new["company_profile"].str.extract(r'({})'.format('|'.join(states)))
df_new['province_cp'] = df_new["company_profile"].str.extract(r'({})'.format('|'.join([re.sub(r"[()]", "", prov) for prov in provinces])))
df_new

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country_des,capital_des,us_state_des,province_des,country_cp,capital_cp,us_state_cp,province_cp
144,Forward Cap.,,,,,<p>The group has raised a fund for the purchase of homes in the Southeast. The student on this p...,,,f,f,f,,,,,,t,t,,,,,,,,
204,Junior Python Developer,,Line-Up,,<p>Playfair Capital is an early stage technology investment fund based in London. </p>,<p><b>Who we’re looking for</b><br><i>Maker Mentality</i>Are you focused on the ‘doing’; the cre...,<p><b>Skills and experience</b></p>\r\n<ul>\r\n<li>Degree in Computer Science or equivalent</li>...,,f,t,f,,,,,,f,f,,London,,Manchester,,London,,
234,Postgraduate Certificate in Social Innovation Management Kenya - March 2015,,,,<p>The Amani Institute is about developing whole individuals who have the knowledge and practica...,"<p>This unique, field-based, full-time program brings together 25 individuals from different cou...",<p>What do we look for in a program participant?</p>\r\n<p>If you meet the majority of the requi...,<p>Sign up for:</p>\r\n<ul>\r\n<li>25 classmates from around the world</li>\r\n<li>Facilitated a...,f,t,t,,,,,,f,f,,,,,,,,
325,Head of Quality Assurance,,,,<p>Gelato Group is a SaaS company. We've developed a global print engine integrated with the pri...,<p>Following our global expansion we are seeking to add an experienced world-class head of Quali...,<ul>\r\n<li>A minimum of B.S. degree in Information Technology or Computer Science</li>\r\n<li>3...,,f,t,f,,,,,,f,f,,,,,,,,
349,Embedded Systems / Telematics Security Consultant,,Professional Services,,"<p>Cylance is a global cybersecurity products and service company, specializeing in advanced thr...",<p><b>Summary</b></p>\r\n<ul>\r\n<li>Immediate requirement for an advanced telematics/embedded s...,<p><b>Qualifications</b></p>\r\n<ul>\r\n<li>Bachelor degree in Information Technology/Computer S...,,f,t,f,,,,,,f,f,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17792,Rooms Division Manager,,,,<p>Awarded by <i><b>Expatriate Lifestyle Magazine</b></i> with <b>2013 Best Business Hotel Excel...,<p>The Rooms Division Manager is responsible for Executive Housekeeping and Front<br>Office. He/...,<p>High school or equivalent education required. Bachelor's degree and Master's degree<br>in rel...,,f,t,f,,,,,,t,t,,,,,Malaysia,Kuala Lumpur,,
17809,Data Entry / Administrative Assitstant / Admin Clerk / Office Assistant / Customer Service Rep,,,,,"<p>As a Data Entry / Administrative Assitstant / Admin Clerk Associate, your duties will inclu...",,,t,f,f,Full-time,Entry level,Unspecified,Telecommunications,Administrative,t,t,,,,,,,,
17821,Webcam Model,,,,,<p>Internet Modeling is a premier adult modeling agency recruiting and hiring webcam models for ...,"<p><b>In order to be considered for a webcam model position, you MUST:</b> - Be an attractive fe...",<p><b>We provide the following benefits to all our webcam models:</b></p>\r\n<p>- Earn from $0.8...,f,f,t,,,,,,t,t,,,,,,,,
17822,5 Guys,,,,,<p>Analyze the excel books of the franchise and then post them online for him to use.</p>,,,f,f,f,,,,,,t,t,,,,,,,,


In [36]:
# Initialise with None (object dtype) so that string labels can be assigned later
df_new['country'] = None
df_new["state"] = None
df_new['city'] = None
df_new['imputed'] = 0

In [37]:
df_new.head(8)

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country_des,capital_des,us_state_des,province_des,country_cp,capital_cp,us_state_cp,province_cp,country,state,city,imputed
144,Forward Cap.,,,,,<p>The group has raised a fund for the purchase of homes in the Southeast. The student on this p...,,,f,f,f,,,,,,t,t,,,,,,,,,,,,0
204,Junior Python Developer,,Line-Up,,<p>Playfair Capital is an early stage technology investment fund based in London. </p>,<p><b>Who we’re looking for</b><br><i>Maker Mentality</i>Are you focused on the ‘doing’; the cre...,<p><b>Skills and experience</b></p>\r\n<ul>\r\n<li>Degree in Computer Science or equivalent</li>...,,f,t,f,,,,,,f,f,,London,,Manchester,,London,,,,,,0
234,Postgraduate Certificate in Social Innovation Management Kenya - March 2015,,,,<p>The Amani Institute is about developing whole individuals who have the knowledge and practica...,"<p>This unique, field-based, full-time program brings together 25 individuals from different cou...",<p>What do we look for in a program participant?</p>\r\n<p>If you meet the majority of the requi...,<p>Sign up for:</p>\r\n<ul>\r\n<li>25 classmates from around the world</li>\r\n<li>Facilitated a...,f,t,t,,,,,,f,f,,,,,,,,,,,,0
325,Head of Quality Assurance,,,,<p>Gelato Group is a SaaS company. We've developed a global print engine integrated with the pri...,<p>Following our global expansion we are seeking to add an experienced world-class head of Quali...,<ul>\r\n<li>A minimum of B.S. degree in Information Technology or Computer Science</li>\r\n<li>3...,,f,t,f,,,,,,f,f,,,,,,,,,,,,0
349,Embedded Systems / Telematics Security Consultant,,Professional Services,,"<p>Cylance is a global cybersecurity products and service company, specializeing in advanced thr...",<p><b>Summary</b></p>\r\n<ul>\r\n<li>Immediate requirement for an advanced telematics/embedded s...,<p><b>Qualifications</b></p>\r\n<ul>\r\n<li>Bachelor degree in Information Technology/Computer S...,,f,t,f,,,,,,f,f,,,,,,,,,,,,0
406,Paid Internship for Africa Program,,Africa Program,,<p><b>Applied Memetics LLC</b> is a professional services company dedicated to integrating and d...,<p>Applied Memetics LLC (AM LLC) is looking for an intern to support the operations of the compa...,"<p>- Knowledge of Africa, specifically of the Sahel region. Preference given to those candidates...",,f,t,f,Other,Internship,Bachelor's Degree,Media Production,,f,f,,,,,,,,,,,,0
411,Art Director,,,,<p><b>WHO WE ARE</b><br>In2media has ever since the early start in 1994 grown into being a full ...,"<p>Please apply for the position as Art Director at In2media by clicking the ""Apply for this job...",,,f,t,f,,,,,,f,f,,,,,,,,,,,,0
483,Senior Frontend Developers,,,,<p>Gelato Group is a SaaS company. We've developed a global print engine integrated with the pri...,<p>Following our global expansion we are seeking to add experienced world-class senior frontend ...,"<ul>\r\n<li>Oral and written fluency in English</li>\r\n<li>HTML knowledge, you use tags efficie...",,f,t,f,,,,,,f,f,,,,,,,,,,,,0


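We define a few values for imputation. For country, the values are: US (when the United States is named, or when only a US state is found), US+other (when a US state is found together with another country), Non-US (when another country is found without a US state), and None (for rows that are still NaN).

For state, the values are: the state abbreviation such as NY or CA (when the us_state and province extractions agree), US+other_state, Non-US_state, and None (for rows that are still NaN).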
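The country rules can be sketched as a small function over the four extracted columns. This is an illustrative sketch only: `label_country` and the toy frame are hypothetical names introduced here, and the function covers just the US / US+other / Non-US split, not the capital-based special cases.

```python
import pandas as pd

US = "United States"


def label_country(df):
    """Return 'US', 'US+other', 'Non-US' or None for each row of df."""
    def flags(country_col, state_col):
        named = df[country_col].notna()
        foreign = named & (df[country_col] != US)
        has_state = df[state_col].notna()
        us = (df[country_col] == US) | (~named & has_state)
        return us, foreign & has_state, foreign & ~has_state

    us_d, mix_d, non_d = flags("country_des", "us_state_des")
    us_c, mix_c, non_c = flags("country_cp", "us_state_cp")

    out = pd.Series([None] * len(df), index=df.index, dtype=object)
    # Applied in order, so a later label overwrites an earlier one.
    for mask, label in [(us_d | us_c, "US"),
                        (mix_d | mix_c, "US+other"),
                        (non_d | non_c, "Non-US")]:
        out.loc[mask] = label
    return out


# Toy rows (hypothetical): state only, country plus state, country only, nothing.
toy = pd.DataFrame({
    "country_des":  [None, "France", "Greece", None],
    "us_state_des": ["New York", "New York", None, None],
    "country_cp":   [None, None, None, None],
    "us_state_cp":  [None, None, None, None],
})
print(label_country(toy).tolist())  # ['US', 'US+other', 'Non-US', None]
```

Because the labels are applied in sequence, a row that qualifies for more than one label through its two text sources ends up with the last one applied.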

In [38]:
df_new.loc[(df_new['country_des'] == "United States") | ((df_new['country_des'].isnull()) & (df_new['us_state_des'].notnull())) | \
           (df_new['country_cp'] == "United States") | ((df_new['country_cp'].isnull()) & (df_new['us_state_cp'].notnull())), :]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country_des,capital_des,us_state_des,province_des,country_cp,capital_cp,us_state_cp,province_cp,country,state,city,imputed
525,Sr. Ruby,,,13000-16000,,<p>Plated is looking for a full stack ruby on rails engineer to join our team.</p>\r\n<p>Plated ...,<p>Plated is looking for a full stack ruby on rails engineer to join our team.</p>\r\n<p>Plated ...,<p>Plated is looking for a full stack ruby on rails engineer to join our team.</p>\r\n<p>Plated ...,t,f,f,Full-time,Internship,Bachelor's Degree,Banking,,f,f,,,New York,New York,,,,,,,,0
938,Provisions eCommerce Intern,,Provisions,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...",<p>Do you obsess over great products -- both stylish and delicious? Are you the first among your...,<p><b>You may be a good fit for this position if you:</b></p>\r\n<ul>\r\n<li>have a keen eye (an...,,f,t,t,,Internship,,,General Business,f,f,,,,,,,New York,New York,,,,0
2211,Editorial Intern,,Editorial,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Writing experience (published clips and/or personal blog)</li>\r\n<li>Experience wit...,,f,t,t,,Internship,,,Writing/Editing,f,f,,,New York,New York,,,New York,New York,,,,0
2808,Senior Application Developer,,digital,,<p><b>Since 1978</b></p>\r\n<p>Our goal has been to create engaging brand experiences in the mos...,<p>We are seeking a highly skilled Sr. #URL_01a736d89d2f0b19de700923d2c312837e180465650804d0f841...,<ul>\r\n<li>#URL_01a736d89d2f0b19de700923d2c312837e180465650804d0f84105352812bf9a# 2.0 to 4.0 We...,"<ul>\r\n<li>Energetic, creative team and workspace</li>\r\n<li>Warehouse office, downtown Boise<...",f,t,t,Full-time,,,Marketing and Advertising,,f,f,,,,,,,Idaho,Idaho,,,,0
2872,Javascript Engineer (Mobile),,,,<h3>We’re always looking for highly motivated “founder-types” to join us as we grow. Here’s what...,"<p>The Mobile Majority is a rapidly growing ad tech startup based in Santa Monica, CA, with offi...",<p>Required Skills:</p>\r\n<p> * Completed Several Successful Projects</p>\r\n<p> * &gt;2 Year...,"<p>It’s no secret that we work hard, but we also strive to create an office environment where th...",f,t,f,Full-time,Mid-Senior level,,Internet,,f,f,,,New York,New York,,,,,,,,0
3165,Editorial Intern,,Editorial,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Writing experience (published clips and/or personal blog)</li>\r\n<li>Experience wit...,,f,t,t,,Internship,,,Writing/Editing,f,f,,,New York,New York,,,New York,New York,,,,0
3940,Director of Growth Marketing,,Marketing,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>We're Food52, a community for people who love food and cooking, and we're looking for a Direc...",<ul>\r\n<li>5+ years experience in product management and/or marketing role focused on growth</l...,,f,t,t,Full-time,,,,Marketing,f,f,,,,,,,New York,New York,,,,0
4180,Marketing Team Intern,,,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<p>REQUIREMENTS</p>\r\n<ul>\r\n<li>Experience with content management systems a major plus (any ...,,f,t,f,Part-time,,,,,f,f,,,New York,New York,,,New York,New York,,,,0
4235,Cookbook Project Intern,,Editorial,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52 is looking for a part-time, unpaid intern to work closely with our editorial staff on ...","<ul>\r\n<li>Loves cooking, loves talking about it</li>\r\n<li>Cooking experience a major plus (c...",,f,t,t,Part-time,Internship,,,,f,f,,,,,,,New York,New York,,,,0
4698,Operations/Customer Care Intern,,Provisions,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>We're Food52, a community for people who love food and cooking, and we're looking for a Custo...",<ul>\r\n<li>Experience with customer care customer care programs (brownie points if you’ve been ...,,f,t,t,,,,,,f,f,,,,,,,New York,New York,,,,0


In [39]:
df_new.loc[(((df_new['country_des'] != "United States") & (df_new['country_des'].notnull())) & (df_new['us_state_des'].notnull())) | \
           (((df_new['country_cp'] != "United States") & (df_new['country_cp'].notnull())) & (df_new['us_state_cp'].notnull())), :]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country_des,capital_des,us_state_des,province_des,country_cp,capital_cp,us_state_cp,province_cp,country,state,city,imputed
3880,I want to work @involvio!,,,,<p>We launched Involvio as students at Drexel University in Philadelphia out of our frustration ...,"<p>If you'd like to be part of what we do at Involvio, but you don't see an opening specific to ...",,<ul>\r\n<li>Cool midtown office close to Grand Central</li>\r\n<li>Brand new Mac rig of your cho...,f,t,t,,,,,,f,f,,,,,Canada,,New York,New York,,,,0
6069,General Application (US/Canada),,General - US/CAD,,<p>The Heafey Group is a private real estate investment and management conglomerate founded more...,<p>It seems like we have no opennings at the moment but feel free to apply with your resume alon...,,,f,t,f,,,,,,f,f,,,,,Canada,,Florida,Quebec,,,,0
9593,Internship Program,,"New York City or Paris, France",,<p>AREA 17 is an interactive agency. We take an interdisciplinary approach — blending the practi...,"<p>We are seeking interns in our NYC or Paris office that are talented. curious, dedicated, posi...","<p><b>Tell us the following:</b></p>\r\n<ul>\r\n<li>What You Do, Work History (if any)</li>\r\n<...","<ul>\r\n<li>Full access to all staff, including partners</li>\r\n<li>Hands-on experience, not ju...",f,t,t,Other,Internship,Some College Coursework Completed,Internet,,f,f,,Paris,,,France,Paris,New York,New York,,,,0
10323,Head of Human Resources,,,,<p>Babbel enables anyone to learn languages in an easy and interactive way. The learning system ...,<p>The Head of Human Resources participates in setting strategic directives for Babbel and is re...,"<p>Demonstrable track record of success leading the HR function within a high growth, technology...",,f,t,f,,,,,,f,f,,,,,Indonesia,Berlin,New York,Berlin,,,,0
12002,Business Technology Intern,,,,"<p>Street Solutions, Inc. (SSI) develops software solutions for the secondary loan market. Our ...",<p>Internship Program</p>\r\n<p>SSI’s summer internship program takes place over a 10-week perio...,<ul>\r\n<li>\r\nAbility to establish priorities and manage projects with minimal day-to-day supe...,,f,t,t,,,,,,f,f,Jersey,,New Jersey,Jersey,,,,,,,,0
12175,Senior Developer - Contract,,,,<p>Essence is a global digital agency and the world’s largest independent buyer of digital media...,<p></p>\r\n<p><b>The Role:</b></p>\r\n<p>We are looking for a Senior Developer who can write cle...,,,f,t,f,Contract,Mid-Senior level,,Online Media,,f,f,Singapore,London,New York,New York,Singapore,London,New York,New York,,,,0
13317,"None of your openings fit me, hire me anyway!",,"New York City or Paris, France",,<p>AREA 17 is an interactive agency. We take an interdisciplinary approach — blending the practi...,<p>Interested in working for AREA 17 but don't see a job opening for your position? We are alway...,"<p><b>Tell us the following:</b></p>\r\n<ul>\r\n<li>What You Do, Work History</li>\r\n<li>Exampl...",<p><b>Full-time Employment (NYC):</b></p>\r\n<ul>\r\n<li>Generous Health and Dental Package</li>...,f,t,t,,,,,,f,f,,,,,France,Paris,New York,New York,,,,0
13456,Open Applications,,,,<p>MediaMonks is the biggest creative digital production company on the <i>planet</i>. We specia...,<p>Nothing that matches your skill set? We have awesome jobs for awesome people. Tell us about y...,,,f,t,f,,,,,,f,f,,,,,Singapore,Amsterdam,New York,New York,,,,0
14462,HR Coordinator & Office Admin,,,,<p>ServiceTitan is the world's leading CRM software for home services businesses. It powers the ...,"<p>Company Description</p>\r\n<p></p>\r\n<p>ServiceTitan is the world’s leading cloud based, cus...",<ul>\r\n<li>\r\n2+ years similar experience preferred\r\n</li>\r\n<li>\r\nExcellent people skill...,<p></p>\r\n<ul>\r\n<li>\r\nEquity in one of the fastest-growing start-ups in Los Angeles\r\n</li...,f,t,f,Full-time,Entry level,Bachelor's Degree,Information Technology and Services,Human Resources,f,f,Brazil,,California,California,Brazil,,California,California,,,,0


In [40]:
df_new.loc[(((df_new['country_des'] != "United States") & (df_new['country_des'].notnull())) & (df_new['us_state_des'].isnull())) | \
           (((df_new['country_cp'] != "United States") & (df_new['country_cp'].notnull())) & (df_new['us_state_cp'].isnull())), :]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country_des,capital_des,us_state_des,province_des,country_cp,capital_cp,us_state_cp,province_cp,country,state,city,imputed
749,"Account Director, Affiliate Network",,,50000-60000,"<p>Formed in 2006, <b>Saul&amp;Partners </b>is an executive search consulting firm specialising ...",<p><b>Company overview:</b></p>\r\n<p>The client is the leading international performance market...,,,f,t,f,Full-time,Mid-Senior level,Bachelor's Degree,Marketing and Advertising,Marketing,f,f,France,London,,,,,,,,,,0
1952,Growth Analyst Intern,,,,"<p><b>Hi, we are dopios</b></p>\r\n<p><i>“We are here to make any location <b>accessible and ope...",<p>At dopios we are rethinking the way we interact with unknown locations and our goal is to mak...,,,f,t,f,,,,,,f,f,Greece,,,,,,,,,,,0
2202,Key Account Manager - Spain,,Sales,,<p>incrediblue is busting the myth that boating is only for the rich and famous by enabling any ...,"<div>\r\n<h1>As seen on Wired &amp; TechCrunch,\r\nincrediblue is changing how people\r\nexperie...",,,f,t,t,,,,,,f,f,Spain,,,,,,,,,,,0
2542,Key Account Manager - Turkey,,Sales,,<p>incrediblue is busting the myth that boating is only for the rich and famous by enabling any ...,"<div>\r\n<h1>As seen on Wired &amp; TechCrunch,\r\nincrediblue is changing how people\r\nexperie...",,,f,t,t,,,,,,f,f,Turkey,,,,,,,,,,,0
2687,Voluntourist in Kenya,,,,,"<p><a></a> <b>Volunteer for IHF in Nakuru, Kenya</b></p>\r\n<p><b>The international Humanity Fou...","<p>Requirements:</p>\r\n<p></p>\r\n<ul>\r\n<li>Cost: $150 per week, no application fee</li>\r\n<...","<p>We are sure that it will be one of the most rewarding experiences of your life, we work hard ...",f,f,f,Part-time,,,,,f,f,Kenya,Nairobi,,,,,,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16711,Web Development Intern,,,,"<p><b>Ideas2Life</b> is a startup team of people in Cyprus, who are passionate about exploring, ...","<p><b>About us</b></p>\r\n<p><b>Ideas2Life</b>&nbsp;is a startup team of people in Cyprus, who a...",<ul>\r\n<li>Current or Recent Graduate/PostGraduate Student in Computer Science related Studies<...,<p>A great opportunity to build Experience!</p>\r\n<p>This will be an unpaid internship.</p>,f,t,t,,,,,,f,f,Cyprus,,,,Cyprus,,,,,,,0
16788,Senior Web Developer,,,,"<p><b>Ideas2Life</b> is a startup team of people in Cyprus, who are passionate about exploring, ...","<p>Ideas2Life, a startup team passionate about exploring, conceptualizing and developing new ide...",,,f,t,f,,,,,,f,f,,,,,Cyprus,,,,,,,0
17092,JS Developer - Platform (w/in 6 months),,,,<p>SilverStripe CMS &amp; Framework is an open source platform of web development tools. The pla...,"<p>The guys would like someone very stong on JavaScript. Ideally, would like to get either Rob C...",,,f,t,f,,,,,,f,t,,,,,New Zealand,,,,,,,0
17187,Market development managers / business development managers,,,,<p>#URL_50cc89ecbf2d4ceda36598af3573463d57b5ad2c45a628f06cf5c12851136fdb# is the crowdfunding pl...,<p>We are searching for market development managers or partners to develop our services and offe...,"<ul>\r\n<li>You have an entrepreneurial mindset</li>\r\n<li>We have a great platform, brand and ...",<p>#URL_50cc89ecbf2d4ceda36598af3573463d57b5ad2c45a628f06cf5c12851136fdb# is a pioneer in terms ...,f,t,f,,,,,,f,t,Belgium,Brussels,,Volta,,,,,,,,0


In [41]:
# For country
df_new.loc[(df_new['country_des'] == "United States") | ((df_new['country_des'].isnull()) & (df_new['us_state_des'].notnull())) | \
           (df_new['country_cp'] == "United States") | ((df_new['country_cp'].isnull()) & (df_new['us_state_cp'].notnull())), 
           "country"] = "US"
df_new.loc[(((df_new['country_des'] != "United States") & (df_new['country_des'].notnull())) & (df_new['us_state_des'].notnull())) | \
           (((df_new['country_cp'] != "United States") & (df_new['country_cp'].notnull())) & (df_new['us_state_cp'].notnull())), 
           "country"] = "US+other"

df_new.loc[(((df_new['country_des'] != "United States") & (df_new['country_des'].notnull())) & (df_new['us_state_des'].isnull())) | \
           (((df_new['country_cp'] != "United States") & (df_new['country_cp'].notnull())) & (df_new['us_state_cp'].isnull())), 
           "country"] = "Non-US"

df_new.loc[(df_new['capital_des'] == "Hamilton") & (df_new['us_state_des'] == "Washington"), "country"] = "Non-US"


# For state
df_new.loc[((df_new['us_state_des'] == "New York") & (df_new['province_des'] == "New York")) | \
           ((df_new['us_state_cp'] == "New York") & (df_new['province_cp'] == "New York")), "state"] = "NY"
df_new.loc[((df_new['us_state_des'] == "Washington") & (df_new['province_des'] == "Washington")) | \
           ((df_new['us_state_cp'] == "Washington") & (df_new['province_cp'] == "Washington")), "state"] = "WA"
df_new.loc[((df_new['us_state_des'] == "California") & (df_new['province_des'] == "California")) | \
           ((df_new['us_state_cp'] == "California") & (df_new['province_cp'] == "California")), "state"] = "CA"
df_new.loc[((df_new['us_state_des'] == "Arizona") & (df_new['province_des'] == "Arizona")) | \
           ((df_new['us_state_cp'] == "Arizona") & (df_new['province_cp'] == "Arizona")), "state"] = "AZ"
df_new.loc[((df_new['us_state_des'] == "Idaho") & (df_new['province_des'] == "Idaho")) | \
           ((df_new['us_state_cp'] == "Idaho") & (df_new['province_cp'] == "Idaho")), "state"] = "ID"
df_new.loc[((df_new['us_state_des'] == "Oklahoma") & (df_new['province_des'] == "Oklahoma")) | \
           ((df_new['us_state_cp'] == "Oklahoma") & (df_new['province_cp'] == "Oklahoma")), "state"] = "OK"
df_new.loc[((df_new['us_state_des'] == "Nevada") & (df_new['province_des'] == "Nevada")) | \
           ((df_new['us_state_cp'] == "Nevada") & (df_new['province_cp'] == "Nevada")), "state"] = "NV"
df_new.loc[((df_new['us_state_des'] == "Utah") & (df_new['province_des'] == "Utah")) | \
           ((df_new['us_state_cp'] == "Utah") & (df_new['province_cp'] == "Utah")), "state"] = "UT"
df_new.loc[df_new['province_des'] == "Angeles", "state"] = "CA"

df_new.loc[(((df_new['country_des'] != "United States") & (df_new['country_des'].notnull())) & (df_new['us_state_des'].notnull())) | \
           (((df_new['country_cp'] != "United States") & (df_new['country_cp'].notnull())) & (df_new['us_state_cp'].notnull())), 
           "state"] = "US+other_state"

df_new.loc[(((df_new['country_des'] != "United States") & (df_new['country_des'].notnull())) & (df_new['us_state_des'].isnull())) | \
           (((df_new['country_cp'] != "United States") & (df_new['country_cp'].notnull())) & (df_new['us_state_cp'].isnull())), 
           "state"] = "Non-US_state"

df_new.loc[(df_new['capital_des'] == "Hamilton") & (df_new['us_state_des'] == "Washington"), "state"] = "Non-US_state"

df_new.loc[13832, "state"] = "US_state"


# For state/country based on capital entries (Remember London and Berlin)
df_new.loc[((df_new['country'].isnull()) | (df_new['state'].isnull())) & \
           (df_new[['country_des', 'capital_des', 'us_state_des', 'province_des', 'country_cp', 
                    'capital_cp', 'us_state_cp', 'province_cp']].notnull().any(axis=1)), "country"] = "Non-US"
df_new.loc[((df_new['country'].isnull()) | (df_new['state'].isnull())) & \
           (df_new[['country_des', 'capital_des', 'us_state_des', 'province_des', 'country_cp', 
                    'capital_cp', 'us_state_cp', 'province_cp']].notnull().any(axis=1)), "state"] = "Non-US_state"

df_new.loc[(df_new['capital_des'] == "London") | (df_new['capital_cp'] == "London"), "country"] = "GB"
df_new.loc[(df_new['capital_des'] == "London") | (df_new['capital_cp'] == "London"), "state"] = "LND"

df_new.loc[(df_new['capital_cp'] == "Berlin") | (df_new['province_cp'] == "Berlin"), "country"] = "DE"
df_new.loc[(df_new['capital_cp'] == "Berlin") | (df_new['province_cp'] == "Berlin"), "state"] = "BE"

Let's check the resulting value counts before we handle the remaining rows.

In [42]:
df_new['country'].value_counts()

Non-US      75
US          29
GB          13
DE           8
US+other     7
Name: country, dtype: int64

In [43]:
df_new['state'].value_counts()

Non-US_state      75
NY                18
LND               13
BE                 8
US+other_state     7
WA                 3
CA                 2
NV                 1
UT                 1
OK                 1
ID                 1
AZ                 1
US_state           1
Name: state, dtype: int64

In [44]:
df_new.shape

(343, 30)

In [45]:
df_new['country'].notnull().value_counts()

False    211
True     132
Name: country, dtype: int64

In [46]:
df_new['state'].notnull().value_counts()

False    211
True     132
Name: state, dtype: int64

In [47]:
len(df_new[(df_new['country'].notnull()) & (df_new['state'].isnull())])

0

In [48]:
df_new[(df_new[['country_des', 'capital_des', 'us_state_des', 'province_des', 'country_cp', 'capital_cp', 
                'us_state_cp', 'province_cp']].isnull().all(axis=1))]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country_des,capital_des,us_state_des,province_des,country_cp,capital_cp,us_state_cp,province_cp,country,state,city,imputed
144,Forward Cap.,,,,,<p>The group has raised a fund for the purchase of homes in the Southeast. The student on this p...,,,f,f,f,,,,,,t,t,,,,,,,,,,,,0
234,Postgraduate Certificate in Social Innovation Management Kenya - March 2015,,,,<p>The Amani Institute is about developing whole individuals who have the knowledge and practica...,"<p>This unique, field-based, full-time program brings together 25 individuals from different cou...",<p>What do we look for in a program participant?</p>\r\n<p>If you meet the majority of the requi...,<p>Sign up for:</p>\r\n<ul>\r\n<li>25 classmates from around the world</li>\r\n<li>Facilitated a...,f,t,t,,,,,,f,f,,,,,,,,,,,,0
325,Head of Quality Assurance,,,,<p>Gelato Group is a SaaS company. We've developed a global print engine integrated with the pri...,<p>Following our global expansion we are seeking to add an experienced world-class head of Quali...,<ul>\r\n<li>A minimum of B.S. degree in Information Technology or Computer Science</li>\r\n<li>3...,,f,t,f,,,,,,f,f,,,,,,,,,,,,0
349,Embedded Systems / Telematics Security Consultant,,Professional Services,,"<p>Cylance is a global cybersecurity products and service company, specializeing in advanced thr...",<p><b>Summary</b></p>\r\n<ul>\r\n<li>Immediate requirement for an advanced telematics/embedded s...,<p><b>Qualifications</b></p>\r\n<ul>\r\n<li>Bachelor degree in Information Technology/Computer S...,,f,t,f,,,,,,f,f,,,,,,,,,,,,0
406,Paid Internship for Africa Program,,Africa Program,,<p><b>Applied Memetics LLC</b> is a professional services company dedicated to integrating and d...,<p>Applied Memetics LLC (AM LLC) is looking for an intern to support the operations of the compa...,"<p>- Knowledge of Africa, specifically of the Sahel region. Preference given to those candidates...",,f,t,f,Other,Internship,Bachelor's Degree,Media Production,,f,f,,,,,,,,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17771,Ninestone,,,,,<p>This group will be focused on two parts: helping to increase the sites social media strategy ...,,,f,f,f,,,,,,t,t,,,,,,,,,,,,0
17809,Data Entry / Administrative Assitstant / Admin Clerk / Office Assistant / Customer Service Rep,,,,,"<p>As a Data Entry / Administrative Assitstant / Admin Clerk Associate, your duties will inclu...",,,t,f,f,Full-time,Entry level,Unspecified,Telecommunications,Administrative,t,t,,,,,,,,,,,,0
17821,Webcam Model,,,,,<p>Internet Modeling is a premier adult modeling agency recruiting and hiring webcam models for ...,"<p><b>In order to be considered for a webcam model position, you MUST:</b> - Be an attractive fe...",<p><b>We provide the following benefits to all our webcam models:</b></p>\r\n<p>- Earn from $0.8...,f,f,t,,,,,,t,t,,,,,,,,,,,,0
17822,5 Guys,,,,,<p>Analyze the excel books of the franchise and then post them online for him to use.</p>,,,f,f,f,,,,,,t,t,,,,,,,,,,,,0


In [49]:
len(df_new[((df_new['country'].isnull()) | (df_new['state'].isnull())) & \
       (df_new[['country_des', 'capital_des', 'us_state_des', 'province_des', 'country_cp', 'capital_cp', 
                'us_state_cp', 'province_cp']].notnull().any(axis=1))])

0

In [50]:
locations_df = df2['location'].str.extract(r"(?P<location>(?P<country>\w+), (?P<state>[\w+\s+]*), (?P<city>[A-Za-z\s+.,'/]*))")
locations_df

Unnamed: 0,location,country,state,city
0,"US, NY, New York",US,NY,New York
1,"NZ, , Auckland",NZ,,Auckland
2,"US, IA, Wever",US,IA,Wever
3,"US, DC, Washington",US,DC,Washington
4,"US, FL, Fort Worth",US,FL,Fort Worth
...,...,...,...,...
17875,"CA, ON, Toronto",CA,ON,Toronto
17876,"US, PA, Philadelphia",US,PA,Philadelphia
17877,"US, TX, Houston",US,TX,Houston
17878,"NG, LA, Lagos",NG,LA,Lagos


In [51]:
(locations_df == '').sum()

location       0
country        0
state       2109
city        1659
dtype: int64

In [52]:
locations_df.isnull().sum()

location    434
country     434
state       434
city        434
dtype: int64

Let's check how many location entries have been successfully extracted.

In [53]:
locations_df['location'].isnull().value_counts()

False    17211
True       434
Name: location, dtype: int64

In [54]:
locations_df[locations_df['location'].notnull()]

Unnamed: 0,location,country,state,city
0,"US, NY, New York",US,NY,New York
1,"NZ, , Auckland",NZ,,Auckland
2,"US, IA, Wever",US,IA,Wever
3,"US, DC, Washington",US,DC,Washington
4,"US, FL, Fort Worth",US,FL,Fort Worth
...,...,...,...,...
17875,"CA, ON, Toronto",CA,ON,Toronto
17876,"US, PA, Philadelphia",US,PA,Philadelphia
17877,"US, TX, Houston",US,TX,Houston
17878,"NG, LA, Lagos",NG,LA,Lagos


We'll create an indicator column, "extracted", that records whether a location was extracted (1) or not (0). For now, the 434 rows where the extraction returned null keep "extracted" at 0.

In [55]:
locations_df['extracted'] = 0
locations_df.loc[locations_df['location'].notnull(), 'extracted'] = 1
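The same indicator can also be built in one step. Below is a minimal sketch on toy data, where `toy` is a hypothetical stand-in for locations_df and not part of the project data.

```python
import pandas as pd

# Toy stand-in for locations_df; None marks a row where str.extract found no match.
toy = pd.DataFrame({"location": ["US, NY, New York", None, "NZ, , Auckland"]})

# notnull() yields booleans; astype(int) turns True/False into 1/0.
toy["extracted"] = toy["location"].notnull().astype(int)
print(toy)
```

This avoids the two-step "initialise to 0, then overwrite with 1" pattern, although both approaches should give the same column.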

A second extraction is necessary because the first pattern only matches complete location formats (country, state and city), so entries that contain only a country are not extracted. We need a new str.extract pattern for these.

In [56]:
# 2nd extraction for entries with only country in location column
df2.loc[locations_df[locations_df['extracted'] == 0].index, 'location'].str.extract(r"(?P<location>(?P<country>\w{2}))")

Unnamed: 0,location,country
42,US,US
144,,
173,US,US
204,,
230,US,US
...,...,...
17809,,
17816,US,US
17821,,
17822,,
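To see why two passes are needed, here is a small sketch on toy data. The Series `toy` and the anchored pattern `country_only` are assumptions made for illustration: the notebook itself uses the unanchored pattern `\w{2}`, which would also pick up the first two word characters of any longer string.

```python
import pandas as pd

# Toy stand-in for df2['location']: one full entry, one country-only entry, one missing.
toy = pd.Series(["US, NY, New York", "US", None])

# First pass: requires "country, state, city" separated by ", ".
full = r"(?P<country>\w+), (?P<state>[\w\s]*), (?P<city>[A-Za-z\s.,'/]*)"
first = toy.str.extract(full)

# Second pass, only on rows the first pass missed.
# Anchored variant (an assumption): the whole string must be a two-letter code.
country_only = r"^\s*(?P<country>[A-Z]{2})\s*$"
second = toy[first["country"].isnull()].str.extract(country_only)

print(first)
print(second)
```

Row 1 ("US") fails the first pattern because it has no ", " separators, and is only recovered by the second pass. The missing entry stays null in both.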


In [57]:
# Save in new df, locations_df2
locations_df2 = df2.loc[locations_df[locations_df['extracted'] == 0].index, 'location'].str.extract(r"(?P<location>(?P<country>\w{2}))")

In [58]:
locations_df2['location'].isnull().value_counts()

True     343
False     91
Name: location, dtype: int64

In [59]:
locations_df2['state'] = np.nan
locations_df2['city'] = np.nan
locations_df2.loc[locations_df2['country'].notnull(), ['state', 'city']] = ""

In [60]:
for i in locations_df2[locations_df2['location'].notnull()].index:
    locations_df.loc[i, 'location'] = locations_df2.loc[i, 'location']
    locations_df.loc[i, 'country'] = locations_df2.loc[i, 'country']
    locations_df.loc[i, 'state'] = locations_df2.loc[i, 'state']
    locations_df.loc[i, 'city'] = locations_df2.loc[i, 'city']
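The row-by-row loop works, but the same copy can be written as a single label-aligned assignment. This is a sketch on toy frames: `main` and `second` are hypothetical stand-ins for locations_df and locations_df2.

```python
import numpy as np
import pandas as pd

cols = ["location", "country", "state", "city"]

# Stand-in for locations_df: row 0 was extracted, rows 1 and 2 were not.
main = pd.DataFrame(
    {"location": ["US, NY, New York", np.nan, np.nan],
     "country": ["US", np.nan, np.nan],
     "state": ["NY", np.nan, np.nan],
     "city": ["New York", np.nan, np.nan]}
)

# Stand-in for locations_df2: only row 1 yielded a country in the second pass.
second = pd.DataFrame(
    {"location": ["US", np.nan], "country": ["US", np.nan],
     "state": ["", np.nan], "city": ["", np.nan]},
    index=[1, 2],
)

# Keep only the rows the second pass recovered, then copy them over in one step.
found = second[second["location"].notnull()]
main.loc[found.index, cols] = found[cols]
print(main)
```

Assigning a DataFrame through `.loc` aligns on index and column labels, so only the recovered rows are overwritten and the remaining rows stay null.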

Let's fill the three columns (country, state and city) using both DataFrames that we created.

In [61]:
# For rows that have non-null locations (locations_df)
df3 = df2.copy()
df3['country'] = locations_df.loc[:, 'country']
df3['state'] = locations_df.loc[:, 'state']
df3['city'] = locations_df.loc[:, 'city']

In [62]:
# For rows that have null locations (df_new)
for i in df_new.index:
    df3.loc[i, 'country'] = df_new.loc[i, 'country']
    df3.loc[i, 'state'] = df_new.loc[i, 'state']
    df3.loc[i, 'city'] = df_new.loc[i, 'city']

In [63]:
df3.head()

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
0,Marketing Intern,"US, NY, New York",Marketing,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,f,t,f,Other,Internship,,,Marketing,f,f,US,NY,New York
1,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,f,t,f,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,f,f,NZ,,Auckland
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p>Our client, located in Houston, is actively seeking an experienced Commissioning Machinery As...",<ul>\r\n<li>Implement pre-commissioning and commissioning procedures for rotary equipment.</li>\...,,f,t,f,,,,,,f,f,US,IA,Wever
3,Account Executive - Washington DC,"US, DC, Washington",Sales,,<p>Our passion for improving quality of life through geography is at the heart of everything we ...,<p><b>THE COMPANY: ESRI – Environmental Systems Research Institute</b></p>\r\n<p>Our passion for...,"<ul>\r\n<li>\r\n<b>EDUCATION: </b>Bachelor’s or Master’s in GIS, business administration, or a r...","<p>Our culture is anything but corporate—we have a collaborative, creative environment; phone di...",f,t,f,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,f,f,US,DC,Washington
4,Bill Review Manager,"US, FL, Fort Worth",,,<p>SpotSource Solutions LLC is a Global Human Capital Management Consulting firm headquartered i...,"<p><b>JOB TITLE:</b> Itemization Review Manager</p>\r\n<p><b>LOCATION:</b> Fort Worth, TX<b> ...",<p><b>QUALIFICATIONS:</b></p>\r\n<ul>\r\n<li>RN license in the State of Texas</li>\r\n<li>Diplom...,<p>Full Benefits Offered</p>,f,t,t,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,f,f,US,FL,Fort Worth


In [64]:
df3['country'].isnull().value_counts()

False    17434
True       211
Name: country, dtype: int64

In [65]:
df3['state'].isnull().value_counts()

False    17434
True       211
Name: state, dtype: int64

In [66]:
df3['city'].isnull().value_counts()

False    17302
True       343
Name: city, dtype: int64

Let us patch the remaining NaN entries with "Undefined".

In [67]:
df3.loc[df3['country'].isnull(), "country"] = "Undefined"
df3.loc[df3['state'].isnull(), "state"] = "Undefined"
df3.loc[df3['city'].isnull(), "city"] = "Undefined"
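As a side note, `fillna` can patch all three columns at once. The sketch below uses a hypothetical toy frame. It also shows that empty strings, such as the blank state and city values left by the extraction, are not treated as missing and therefore stay untouched.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df3: row 0 has an empty city string, row 1 is fully missing.
toy = pd.DataFrame(
    {"country": ["US", np.nan], "state": ["NY", np.nan], "city": ["", np.nan]}
)

cols = ["country", "state", "city"]
toy[cols] = toy[cols].fillna("Undefined")  # only NaN is replaced, "" is kept
print(toy)
```

If blank strings should also count as undefined, they would need to be handled separately, for example by replacing "" with NaN before the fill.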

In [68]:
len(df3[df3['country'].isnull()]), len(df3[df3['state'].isnull()]), len(df3[df3['city'].isnull()])

(0, 0, 0)

In [69]:
df3['country'].value_counts().head(15)

US           10524
GB            2349
GR             939
CA             450
DE             390
NZ             330
IN             274
AU             213
Undefined      211
PH             132
NL             126
BE             117
IE             112
SG              80
HK              77
Name: country, dtype: int64

#### ii. Required Experience

There are 6976 missing entries in this column. As with location, we will search the text columns for clues, this time focusing mainly on requirements and, to a lesser extent, description.

In [70]:
df3['required_experience'].value_counts()

Mid-Senior level    3774
Entry level         2645
Associate           2274
Not Applicable      1077
Director             385
Internship           374
Executive            140
Name: required_experience, dtype: int64

In [71]:
df3[df3['required_experience'].isnull()]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p>Our client, located in Houston, is actively seeking an experienced Commissioning Machinery As...",<ul>\r\n<li>Implement pre-commissioning and commissioning procedures for rotary equipment.</li>\...,,f,t,f,,,,,,f,f,US,IA,Wever
5,Accounting Clerk,"US, MD,",,,,<p><b>Job Overview</b></p>\r\n<p>Apex is an environmental consulting firm that offers stable lea...,,,f,f,f,,,,,,f,f,US,MD,
7,Lead Guest Service Specialist,"US, CA, San Francisco",,,<p>Airenvy’s mission is to provide lucrative yet hassle free full service short term property ma...,<h3>Who is Airenvy?</h3>\r\n<p>Hey there! We are seasoned entrepreneurs in the heart of San Fran...,"<ul>\r\n<li>Experience with CRM software, live chat, and phones, including one year minimum of c...",<p><b>Competitive Pay.</b> You'll be able to eat steak everyday if you choose to. </p>\r\n<p><b...,f,t,t,,,,,,f,f,US,CA,San Francisco
11,Talent Sourcer (6 months fixed-term contract),"GB, LND, London",HR,,<p><b>Want to build a 21st century financial service?</b></p>\r\n<p>We're convinced that that th...,<p>TransferWise is the clever new way to move money between countries. Co-founded by Skype’s fir...,<p><b>We’re looking for someone who:</b></p>\r\n<ul>\r\n<li>Proven track record in sourcing acro...,<p>You will join one of Europe’s most hotly tipped startups with plenty of opportunities to grow...,f,t,f,,,,,,f,f,GB,LND,London
17,Southend-on-Sea Traineeships Under NAS 16-18 Year Olds Only,"GB, SOS, Southend-on-Sea",,,<p>Established on the principles that full time education is not for everyone Spectrum Learning ...,<p>Government funding is only available for 16-18 year olds.</p>\r\n<p>We have 10 vacancies for ...,<p>16-18 year olds only due to government funding.</p>\r\n<p>Career prospects</p>,<p>Career prospects.</p>,f,t,t,,,,,,f,f,GB,SOS,Southend
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17869,Sr Technical Lead LIMS,"US, DE, Wilmington",,,,<p><b>Job Title: Sr Technical Lead</b></p>\r\n<p><b>Salary: Open</b></p>\r\n<p><b>Duration: Ful...,<p>Responsibilities:</p>\r\n<p> </p>\r\n<ul>\r\n<li>He should be extensive knowledge of Sample M...,,f,f,f,Full-time,,,Pharmaceuticals,,f,f,US,DE,Wilmington
17871,Water Truck Driver,"US, PA, Waynesburg",,,<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,<ul>\r\n<li>Requires skilled work in operating commercial trucks to load and unload fluids from ...,<ul>\r\n<li>GED or diploma required.</li>\r\n<li>Requires minimum of one year experience with ta...,,f,t,t,Full-time,,,Oil & Energy,,f,f,US,PA,Waynesburg
17872,Product Manager,"US, CA, San Francisco",Product Development,,<p>Flite delivers ad innovation at scale to the world's top publishers and brands. Marketers use...,<p>Flite's SaaS display ad platform fuels the world's top publishers and brands by reducing the ...,<ul>\r\n<li>\r\nBA/BS in Computer Science or a related technical field\r\n</li>\r\n<li>\r\nAt le...,<ul>\r\n<li>Competitive base</li>\r\n<li>Attractive stock option plan</li>\r\n<li>Medical/Dental...,f,t,f,Full-time,,,Internet,Product Management,f,f,US,CA,San Francisco
17873,Recruiting Coordinator,"US, NC, Charlotte",,,,<p><b>RESPONSIBILITIES:</b></p>\r\n<ul>\r\n<li>Will facilitate the recruiting and hiring process...,<p><b>REQUIRED SKILLS:</b></p>\r\n<ul>\r\n<li>Associates Degree or a combination of education pl...,,f,t,f,Contract,,,Utilities,,f,f,US,NC,Charlotte


In [72]:
df3[df3['required_experience'].notnull()]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
0,Marketing Intern,"US, NY, New York",Marketing,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,f,t,f,Other,Internship,,,Marketing,f,f,US,NY,New York
1,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,f,t,f,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,f,f,NZ,,Auckland
3,Account Executive - Washington DC,"US, DC, Washington",Sales,,<p>Our passion for improving quality of life through geography is at the heart of everything we ...,<p><b>THE COMPANY: ESRI – Environmental Systems Research Institute</b></p>\r\n<p>Our passion for...,"<ul>\r\n<li>\r\n<b>EDUCATION: </b>Bachelor’s or Master’s in GIS, business administration, or a r...","<p>Our culture is anything but corporate—we have a collaborative, creative environment; phone di...",f,t,f,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,f,f,US,DC,Washington
4,Bill Review Manager,"US, FL, Fort Worth",,,<p>SpotSource Solutions LLC is a Global Human Capital Management Consulting firm headquartered i...,"<p><b>JOB TITLE:</b> Itemization Review Manager</p>\r\n<p><b>LOCATION:</b> Fort Worth, TX<b> ...",<p><b>QUALIFICATIONS:</b></p>\r\n<ul>\r\n<li>RN license in the State of Texas</li>\r\n<li>Diplom...,<p>Full Benefits Offered</p>,f,t,t,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,f,f,US,FL,Fort Worth
6,Head of Content (m/f),"DE, BE, Berlin",ANDROIDPIT,20000-28000,"<p>Founded in 2009, the <b>Fonpit AG</b> rose with its international web portal <b>ANDROIDPIT</b...",<p><b>Your Responsibilities:</b></p>\r\n<p> </p>\r\n<ul>\r\n<li>Manage the English-speaking edit...,<p><b>Your Know-How:</b></p>\r\n<p><b> ...,<p><b>Your Benefits:</b></p>\r\n<p> </p>\r\n<ul>\r\n<li>Being part of a fast-growing company in ...,f,t,t,Full-time,Mid-Senior level,Master's Degree,Online Media,Management,f,f,DE,BE,Berlin
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17874,JavaScript Developer,"US, ,",,80000-100000,,"<p>Sr, JavaScript Developer<br> Experience : 4-10 years<br> Location : New York</p>\r\n<p>Experi...",,,f,f,f,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Information Technology,f,f,US,,
17875,Account Director - Distribution,"CA, ON, Toronto",Sales,,<p>Vend is looking for some awesome new talent to come join us. You'll be working in an awesome ...,<p>Just in case this is the first time you’ve visited our website Vend is an award winning web b...,<p>To ace this role you:</p>\r\n<ul>\r\n<li>Will eat comprehensive Statements of Work for breakf...,<p><b>What can you expect from us?</b></p>\r\n<p>We have an open culture where we openly share o...,f,t,t,Full-time,Mid-Senior level,,Computer Software,Sales,f,f,CA,ON,Toronto
17876,Payroll Accountant,"US, PA, Philadelphia",Accounting,,<p>WebLinc is the e-commerce platform and services provider for the fastest growing online retai...,<p></p>\r\n<p>The Payroll Accountant will focus primarily on payroll functions for approximately...,<p></p>\r\n<p>- B.A. or B.S. in Accounting</p>\r\n<p>- <b>Desire to have fun while doing what yo...,<p></p>\r\n<h3>Health &amp; Wellness</h3>\r\n<ul>\r\n<li>Medical plan</li>\r\n<li>Prescription d...,f,t,t,Full-time,Mid-Senior level,Bachelor's Degree,Internet,Accounting/Auditing,f,f,US,PA,Philadelphia
17878,Graphic Designer,"NG, LA, Lagos",,,,<p>Nemsia Studios is looking for an experienced visual/graphic designer to join our Lagos office...,"<p>1. Must be fluent in the latest versions of Corel &amp; Adobe CC (Esp. Photoshop, Illustrator...",<p>Competitive salary (compensation will be based on experience) <br>Casual attire <br>At Nemsia...,f,f,t,Contract,Not Applicable,Professional,Graphic Design,Design,f,f,NG,LA,Lagos


As usual, we will do a test run of our code to see how much signal we can detect in the text column.

In [73]:
df3.loc[df3['required_experience'].isnull(), 'requirements'].str.contains("experience")

2        False
5          NaN
7         True
11        True
17       False
         ...  
17869    False
17871     True
17872     True
17873     True
17877     True
Name: requirements, Length: 6976, dtype: object

In [74]:
df3.loc[df3['required_experience'].isnull(), 'requirements'].str.contains("experience").value_counts()

True     3688
False    1421
Name: requirements, dtype: int64

In [81]:
df3['required_experience'].isnull().value_counts()

False    10669
True      6976
Name: required_experience, dtype: int64

Before we use functions and loops to impute experience, we manually impute some entries based on the job title to keep things simpler.

In [82]:
df4 = df3.copy()

# Imputing Director
df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains("Director")), 'required_experience'] = "Director"

# Imputing Executive
df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains("VP")), 'required_experience'] = "Executive"
df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains("Vice President")), 
        'required_experience'] = "Executive"
df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains("Chief")) & \
        (~df4['title'].str.contains("Assistant")), 'required_experience'] = "Executive"

# Imputing Internship
df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains("Intern")), 'required_experience'] = "Internship"
#df3.loc[(df3['required_experience'].isnull()) & (df3['description'].str.contains("Intern")), 
#        'required_experience'] = "Internship"

# One entry that asks for 20 years of experience, set manually
df4.loc[6386, "required_experience"] = "Mid-Senior level"

In [83]:
df4['required_experience'].isnull().value_counts()

False    11032
True      6613
Name: required_experience, dtype: int64

This manual step imputed 6976 - 6613 = 363 entries.
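One caveat about the title-based rules above: `str.contains("Intern")` is a plain substring test, so it also matches titles such as "International Sales Manager" or "Internal Auditor". The sketch below compares it with a word-boundary pattern on made-up titles. The pattern is a suggestion only, and adopting it would change the counts shown above.

```python
import pandas as pd

# Hypothetical titles, not taken from the dataset.
titles = pd.Series(
    ["Marketing Intern", "International Sales Manager",
     "Internal Auditor", "Summer Internship"]
)

# Substring test, as used in the imputation cell.
substring = titles.str.contains("Intern")

# Word-boundary test: matches Intern, Interns, Internship, Internships only.
whole_word = titles.str.contains(r"\bIntern(?:ship)?s?\b")

print(pd.DataFrame({"title": titles, "substring": substring, "whole_word": whole_word}))
```

The non-capturing group `(?:...)` keeps pandas from warning about unused capture groups in `str.contains`.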

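We'll create a function that imputes experience from the text. For a given number of years i, it looks at rows that already have a required_experience label and whose requirements mention that many years (for example "5 years experience", "5+ years" or "5 yrs"), takes the most frequent label among them, and assigns that label to unlabelled rows whose chosen text column (col) contains the same phrases.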
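Before the full function, here is a compact sketch of the same idea on toy data. It is a simplification and not the notebook's implementation: `impute_by_years` is a hypothetical helper that uses a single regex in place of the separate phrase checks, counts each labelled row once, and ignores spelled-out numbers such as "five years".

```python
import pandas as pd

toy = pd.DataFrame({
    "requirements": [
        "5 years experience in sales",
        "At least 5 years of experience",
        "5+ years in a similar role",
        "Minimum 5 yrs in accounting",
        "No experience needed",
    ],
    "required_experience": ["Mid-Senior level", "Mid-Senior level", "Associate", None, None],
})

def impute_by_years(data, years, col="requirements"):
    """Fill missing labels with the most common label among rows mentioning `years`."""
    df = data.copy()
    # Covers "5 years", "5+ years", "5 yrs", "5+ yrs"; \b stops "15 years" matching 5.
    pattern = rf"\b{years}\+? (?:years|yrs)\b"
    labelled = df["required_experience"].notnull()

    # Vote: labels of rows whose requirements mention this many years.
    vote_hit = df["requirements"].str.contains(pattern, na=False)
    counts = df.loc[vote_hit & labelled, "required_experience"].value_counts()
    if counts.empty:
        return df  # nothing to learn from, so no imputation

    # Fill: unlabelled rows whose `col` text mentions the same phrase.
    fill_hit = df[col].str.contains(pattern, na=False)
    df.loc[fill_hit & ~labelled, "required_experience"] = counts.idxmax()
    return df

result = impute_by_years(toy, 5)
print(result)
```

The word boundary in the pattern plays a role similar to the "21 " and "18 " exclusions in the function that follows, which appear to guard against "1 year" matching inside "21 year" and "8 years" inside "18 years".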

In [84]:
def impute_exp(data, i, col):
    df = data.copy()
    if (i == 1) | (i == 8):
        if i == 1:
            str1 = "year"
            str2 = "yr"
            str3 = "21 "
        elif i == 8:
            str1 = "years"
            str2 = "yrs"
            str3 = "18 "
            
        s1 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(str(i) + " " + str1 + " experience")) & \
                    (~df['requirements'].str.contains(str3, na=False)), "required_experience"].value_counts()
        s2 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(str(i) + " " + str1 + " of experience")) & \
                    (~df['requirements'].str.contains(str3, na=False)), "required_experience"].value_counts()
        s3 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(word(i).lower() + " " + str1 + " experience")) & \
                    (~df['requirements'].str.contains(str3, na=False)), "required_experience"].value_counts()
        s4 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(word(i).lower() + " " + str1 + " of experience")) & \
                    (~df['requirements'].str.contains(str3, na=False)), "required_experience"].value_counts()
        s5 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(word(i).lower() + " " + str1, regex=False)) & \
                    (~df['requirements'].str.contains(str3, na=False)), "required_experience"].value_counts()
        s6 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(str(i) + " " + str1, regex=False)) & \
                    (~df['requirements'].str.contains(str3, na=False)), "required_experience"].value_counts()
        s7 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(str(i) + "+ " + str1, regex=False)) & \
                    (~df['requirements'].str.contains(str3, na=False)), "required_experience"].value_counts()
        s8 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(str(i) + " " + str2, regex=False)) & \
                    (~df['requirements'].str.contains(str3, na=False)), "required_experience"].value_counts()
        s9 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(str(i) + "+ " + str2, regex=False)) & \
                    (~df['requirements'].str.contains(str3, na=False)), "required_experience"].value_counts()

        count_series = s1.add(s2, fill_value=0).add(s3, fill_value=0).add(s4, fill_value=0).add(s5, fill_value=0)\
                        .add(s6, fill_value=0).add(s7, fill_value=0).add(s8, fill_value=0).add(s9, fill_value=0)
        if len(count_series) > 0:  
            print(count_series.idxmax())

            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(str(i) + " " + str1 + " experience")) & \
                   (~df[col].str.contains(str3, na=False)), "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(str(i) + " " + str1 + " of experience")) & \
                   (~df[col].str.contains(str3, na=False)), "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(word(i).lower() + " " + str1 + " experience")) & \
                   (~df[col].str.contains(str3, na=False)), "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(word(i).lower() + " " + str1 + " of experience")) & \
                   (~df[col].str.contains(str3, na=False)), "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(word(i).lower() + " " + str1, regex=False)) & \
                   (~df[col].str.contains(str3, na=False)), "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(str(i) + " " + str1, regex=False)) & \
                   (~df[col].str.contains(str3, na=False)), "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(str(i) + "+ " + str1, regex=False)) & \
                   (~df[col].str.contains(str3, na=False)), "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(str(i) + " " + str2, regex=False)) & \
                   (~df[col].str.contains(str3, na=False)), "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(str(i) + "+ " + str2, regex=False)) & \
                   (~df[col].str.contains(str3, na=False)), "required_experience"] = count_series.idxmax()
        else:
            print("Series is empty so there's no imputation")
            
    else:
        str1 = "years"
        str2 = "yrs"
        
        s1 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(str(i) + " " + str1 + " experience")), 
                    "required_experience"].value_counts()
        s2 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(str(i) + " " + str1 + " of experience")), 
                    "required_experience"].value_counts()
        s3 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(word(i).lower() + " " + str1 + " experience")), 
                    "required_experience"].value_counts()
        s4 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(word(i).lower() + " " + str1 + " of experience")), 
                    "required_experience"].value_counts()
        s5 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(word(i).lower() + " " + str1, regex=False)), 
                    "required_experience"].value_counts()
        s6 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(str(i) + " " + str1, regex=False)), 
                    "required_experience"].value_counts()
        s7 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(str(i) + "+ " + str1, regex=False)), 
                    "required_experience"].value_counts()
        s8 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(str(i) + " " + str2, regex=False)), 
                    "required_experience"].value_counts()
        s9 = df.loc[(df['required_experience'].notnull()) & (df['requirements'].str.contains(str(i) + "+ " + str2, regex=False)), 
                    "required_experience"].value_counts()

        count_series = s1.add(s2, fill_value=0).add(s3, fill_value=0).add(s4, fill_value=0).add(s5, fill_value=0)\
                        .add(s6, fill_value=0).add(s7, fill_value=0).add(s8, fill_value=0).add(s9, fill_value=0)
        if len(count_series) > 0:
            print(count_series.idxmax())

            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(str(i) + " " + str1 + " experience")), 
                   "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(str(i) + " " + str1 + " of experience")), 
                   "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(word(i).lower() + " " + str1 + " experience")), 
                   "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(word(i).lower() + " " + str1 + " of experience")), 
                   "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(word(i).lower() + " " + str1, regex=False)), 
                   "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(str(i) + " " + str1, regex=False)), 
                   "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(str(i) + "+ " + str1, regex=False)), 
                   "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(str(i) + " " + str2, regex=False)), 
                   "required_experience"] = count_series.idxmax()
            df.loc[(df['required_experience'].isnull()) & (df[col].str.contains(str(i) + "+ " + str2, regex=False)), 
                   "required_experience"] = count_series.idxmax()
        else:
            print("Series is empty so there's no imputation")
        
    return df

A helper function, `loop_exp`, is also defined. It calls the imputation function `impute_exp` once for every year count in a given range.

In [85]:
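def loop_exp(data, a, b, col):
    """Run impute_exp on column `col` for every year count i in range(a, b)."""
    df = data.copy()
    for i in range(a, b):
        print(i)
        if i == 0:
            # Zero years: entries that mention "no experience" are treated as entry level.
            # Note that this branch always searches 'description', whatever `col` is.
            df.loc[(df['required_experience'].isnull()) & (df['description'].str.contains("no experience")), 
                    "required_experience"] = "Entry level"
        else:
            df = impute_exp(df, i, col)
        print("Loop number " + str(i) + " done")
    
    print("DONE!")
    return df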
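
The separate `str.contains` calls inside `impute_exp` (one per phrasing such as "5 years", "5+ yrs" or "five years") could also be folded into a single regular expression. The sketch below is only an illustration of that idea, not the notebook's method: `experience_pattern` and `NUMBER_WORDS` are hypothetical names, and the small dictionary stands in for `num2word.word`. The word boundary `\b` keeps "5 years" from matching inside "15 years".

```python
import re
import pandas as pd

# Hypothetical stand-in for num2word.word(); only a few numbers are covered.
NUMBER_WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}

def experience_pattern(i):
    """Build one regex covering the year phrasings searched for above."""
    number = re.escape(str(i))
    spelled = NUMBER_WORDS[i]
    # Matches "5 years", "5+ years", "5 yrs", "5+ yrs" and "five years"
    return rf"(?:\b{number}\+? (?:years|yrs)|\b{spelled} years)"

toy = pd.DataFrame({
    "requirements": [
        "At least 5 years of experience",
        "five years experience in sales",
        "5+ yrs in retail",
        "15 years in finance",
        None,
    ]
})

# na=False treats missing text as "no match"
mask = toy["requirements"].str.contains(experience_pattern(5), na=False)
print(mask.tolist())  # [True, True, True, False, False]
```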

In [86]:
df4 = loop_exp(df4, 11, 16, "requirements")
df4 = loop_exp(df4, 0, 11, "requirements")

11
Mid-Senior level
Loop number 11 done
12
Mid-Senior level
Loop number 12 done
13
Mid-Senior level
Loop number 13 done
14
Mid-Senior level
Loop number 14 done
15
Mid-Senior level
Loop number 15 done
DONE!
0
Loop number 0 done
1
Entry level
Loop number 1 done
2
Associate
Loop number 2 done
3
Mid-Senior level
Loop number 3 done
4
Mid-Senior level
Loop number 4 done
5
Mid-Senior level
Loop number 5 done
6
Mid-Senior level
Loop number 6 done
7
Mid-Senior level
Loop number 7 done
8
Mid-Senior level
Loop number 8 done
9
Mid-Senior level
Loop number 9 done
10
Mid-Senior level
Loop number 10 done
DONE!


In [87]:
df4['required_experience'].isnull().value_counts()

False    12713
True      4932
Name: required_experience, dtype: int64

From the baseline of 6976 missing values, the earlier attempt imputed 6976 - 5739 = 1237 entries. This version imputes 6976 - 4932 = 2044, which is 807 more (not bad!).
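
These running tallies can be computed instead of being typed by hand. The helper below is a small sketch on toy data; `imputation_progress` is a hypothetical name and the baseline of 4 is made up for the example.

```python
import pandas as pd

def imputation_progress(series, initial_missing):
    """Return (still_missing, imputed_so_far) for a column being filled in."""
    still_missing = int(series.isnull().sum())
    return still_missing, initial_missing - still_missing

toy = pd.Series(["Entry level", None, "Associate", None, None])
initial = 4  # pretend four values were missing before imputation started

print(imputation_progress(toy, initial))  # (3, 1)
```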

In [88]:
df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains("Manager", na=False)), 
        "required_experience"] = "Mid-Senior level"
df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains("18 ", na=False)), 
        "required_experience"] = "Entry level"

df4.loc[(df4['required_experience'].isnull()) & (df4['requirements'].str.contains("No minimum experience", na=False)), 
        "required_experience"] = "Entry level"
df4.loc[(df4['required_experience'].isnull()) & (df4['requirements'].str.contains("No experience", na=False)), 
        "required_experience"] = "Entry level"
df4.loc[(df4['required_experience'].isnull()) & (df4['requirements'].str.contains("No any experience", na=False)), 
        "required_experience"] = "Entry level"

In [89]:
df4['required_experience'].isnull().value_counts()

False    13305
True      4340
Name: required_experience, dtype: int64

6976 - 4340 = 2636 entries have been imputed so far (an earlier run left 4407 missing, i.e. 6976 - 4407 = 2569 imputed).

In [90]:
df4['required_experience'].value_counts()

Mid-Senior level    5399
Entry level         2894
Associate           2674
Not Applicable      1077
Internship           560
Director             533
Executive            168
Name: required_experience, dtype: int64

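The first loop below takes each phrase in turn, counts the `required_experience` values of the labelled entries whose `requirements` contain that phrase, and imputes the most frequent value (the mode) into the unlabelled entries that contain the same phrase in `requirements` or `description`. The second loop does the same for keywords in `title`.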
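
The mode-based step can be seen in isolation on a toy frame. This is a minimal sketch with made-up rows, using a single phrase rather than the full list:

```python
import pandas as pd

toy = pd.DataFrame({
    "requirements": [
        "proven experience in sales",
        "proven experience required",
        "proven experience with SQL",
        "no degree needed",
        "proven experience preferred",
    ],
    "required_experience": [
        "Mid-Senior level", "Mid-Senior level", "Associate", "Entry level", None,
    ],
})

phrase = "proven experience"
has_phrase = toy["requirements"].str.contains(phrase, regex=False, na=False)
labelled = toy["required_experience"].notnull()

# Count the labels of rows that mention the phrase, then impute the mode
counts = toy.loc[labelled & has_phrase, "required_experience"].value_counts()
if len(counts) > 0:
    toy.loc[~labelled & has_phrase, "required_experience"] = counts.idxmax()

print(toy["required_experience"].tolist())
```

The last row is the only unlabelled one, and it receives "Mid-Senior level" because that label occurs twice among the matching rows while "Associate" occurs once.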

In [91]:
# The loop variables are named `phrase` and `keyword` so that they do not
# shadow the `word` function imported from num2word.
for phrase in ["work experience", "some experience", "experience with", "experience working", "proven experience", 
               "Some experience", "working experience", "experienced in", "experience in", "Experience in", 
               "Experienced in", "Experience working", "Experience with", "Proven experience", "Work experience"]:
    count_series = df4.loc[(df4['required_experience'].notnull()) & (df4['requirements'].str.contains(phrase)), 
                              "required_experience"].value_counts()
    if len(count_series) > 0:
        df4.loc[(df4['required_experience'].isnull()) & (df4['requirements'].str.contains(phrase)), 
                   "required_experience"] = count_series.idxmax()
        df4.loc[(df4['required_experience'].isnull()) & (df4['description'].str.contains(phrase)), 
                   "required_experience"] = count_series.idxmax()
    else:
        print("Series is empty so there's no imputation")

for keyword in ["English Teacher", "Data Entry", "Senior", "Cruise Staff", "Part Time"]:
    count_series = df4.loc[(df4['required_experience'].notnull()) & (df4['title'].str.contains(keyword)), 
                              "required_experience"].value_counts()
    if len(count_series) > 0:
        df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains(keyword)), 
                   "required_experience"] = count_series.idxmax()
    else:
        print("Series is empty so there's no imputation")

In [92]:
from num2word import word
df4 = loop_exp(df4, 11, 16, "description")
df4 = loop_exp(df4, 1, 11, "description")

11
Mid-Senior level
Loop number 11 done
12
Mid-Senior level
Loop number 12 done
13
Mid-Senior level
Loop number 13 done
14
Mid-Senior level
Loop number 14 done
15
Mid-Senior level
Loop number 15 done
DONE!
1
Entry level
Loop number 1 done
2
Associate
Loop number 2 done
3
Mid-Senior level
Loop number 3 done
4
Mid-Senior level
Loop number 4 done
5
Mid-Senior level
Loop number 5 done
6
Mid-Senior level
Loop number 6 done
7
Mid-Senior level
Loop number 7 done
8
Mid-Senior level
Loop number 8 done
9
Mid-Senior level
Loop number 9 done
10
Mid-Senior level
Loop number 10 done
DONE!


In [93]:
df4['required_experience'].isnull().value_counts()

False    15704
True      1941
Name: required_experience, dtype: int64

Over successive runs, the number of missing values went from 3365 to 2796; adding the "description" pass brought it down to 2215 and then to 1941!

In [94]:
df4['required_experience'].value_counts()

Mid-Senior level    7103
Entry level         3526
Associate           2737
Not Applicable      1077
Internship           560
Director             533
Executive            168
Name: required_experience, dtype: int64

In [95]:
df4.loc[df4['required_experience'].isnull(), 'title'].value_counts()

Beauty & Fragrance consultants needed                   55
Electrical Maintenance Technician                       16
Software Engineer                                       15
Mechanical Engineer                                     14
Process Engineer                                        11
                                                        ..
Web/Mobile Front End developer                           1
I want to work at Vend in WELLINGTON                     1
OUD: Business Controller 3.                              1
Brand Partner                                            1
Library Page - North Regional Library, Holly Springs     1
Name: title, Length: 1470, dtype: int64

In [96]:
df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains("Beauty & Fragrance", na=False)), 
        "required_experience"] = "Associate"

In [97]:
df4['required_experience'].isnull().value_counts()

False    15761
True      1884
Name: required_experience, dtype: int64

The next task is to impute every remaining fraudulent entry with hand-written rules. After that, the rest of the column is settled by matching keywords in `title` with `str.contains`.

In [98]:
df4.loc[df4['fraudulent'] == "t", 'required_experience'].isnull().value_counts()

False    718
True     140
Name: required_experience, dtype: int64

In [99]:
# Manual mid senior conditions
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & \
        (df4['requirements'].str.contains("5 y e a r' s")), "required_experience"] = "Mid-Senior level"
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['requirements'].str.contains("qualified")), 
        "required_experience"] = "Mid-Senior level"

# experience
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['requirements'].str.contains("experience")) & \
        (~df4['requirements'].str.contains("No any", na=False)) & (~df4['requirements'].str.contains("No experience", na=False)), 
        "required_experience"] = "Associate"

# entry level
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['requirements'].str.contains("entry level")), 
        "required_experience"] = "Entry level"
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['benefits'].str.contains("entry level")), 
        "required_experience"] = "Entry level"
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['requirements'].str.contains("High school")), 
        "required_experience"] = "Entry level"
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['requirements'].str.contains("High School")), 
        "required_experience"] = "Entry level"
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['requirements'].str.contains("No Experience")), 
        "required_experience"] = "Entry level"

# Not Applicable
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("Part-Time")), 
        "required_experience"] = "Not Applicable"
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("PART-TIME")), 
        "required_experience"] = "Not Applicable"
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("Easy")), 
        "required_experience"] = "Not Applicable"

# with Assistant
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("Assistant Accountant")), 
        "required_experience"] = "Associate"
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("Assistant")), 
        "required_experience"] = "Entry level"

In [100]:
df4['required_experience'].isnull().value_counts()

False    15808
True      1837
Name: required_experience, dtype: int64
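
In the rule cell above, the order of the statements matters: each rule only touches rows that are still null, so "Assistant Accountant" has to be handled before the more general "Assistant". The same first-match-wins behaviour can be sketched with a list of rules. `RULES` and `apply_rules` are hypothetical names, the data is made up, and the `fraudulent` filter is left out for brevity.

```python
import pandas as pd

# (column, substring, label) triples; earlier rules take priority.
RULES = [
    ("title", "Assistant Accountant", "Associate"),
    ("title", "Assistant", "Entry level"),
    ("requirements", "entry level", "Entry level"),
]

def apply_rules(data, rules, target="required_experience"):
    """Fill nulls in `target` using the first rule that matches each row."""
    df = data.copy()
    for column, text, label in rules:
        # Only rows that are still null can be changed by this rule
        mask = df[target].isnull() & df[column].str.contains(text, regex=False, na=False)
        df.loc[mask, target] = label
    return df

toy = pd.DataFrame({
    "title": ["Assistant Accountant", "Office Assistant", "Clerk", "Director"],
    "requirements": ["2 years experience", None, "entry level role", "10 years"],
    "required_experience": [None, None, None, "Director"],
})

result = apply_rules(toy, RULES)
print(result["required_experience"].tolist())
# ['Associate', 'Entry level', 'Entry level', 'Director']
```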

In [101]:
df4.loc[df4['fraudulent'] == "t", 'required_experience'].isnull().value_counts()

False    765
True      93
Name: required_experience, dtype: int64

In [102]:
df4.loc[df4['fraudulent'] == "t", 'required_experience'].value_counts()

Entry level         309
Mid-Senior level    272
Associate            70
Not Applicable       65
Director             29
Executive            10
Internship           10
Name: required_experience, dtype: int64

In [103]:
len(df4.loc[(df4['fraudulent'] == "t") & (df4['required_experience'].isnull()) & (df4['description'].str.contains("experience")), :])

15

The next cell imputes further fraudulent entries, mostly from keywords in `description` and a few from `title`.

In [104]:
df4.loc[(df4['fraudulent'] == "t") & (df4['required_experience'].isnull()) & \
        (df4['description'].str.contains("no previous experience")), "required_experience"] = "Entry level"

df4.loc[(df4['fraudulent'] == "t") & (df4['required_experience'].isnull()) & (df4['description'].str.contains("Earn")), 
        "required_experience"] = "Not Applicable"
df4.loc[(df4['fraudulent'] == "t") & (df4['required_experience'].isnull()) & (df4['description'].str.contains("Cash")), 
        "required_experience"] = "Not Applicable"
df4.loc[(df4['fraudulent'] == "t") & (df4['required_experience'].isnull()) & (df4['description'].str.contains("Work From Home")), 
        "required_experience"] = "Not Applicable"
df4.loc[(df4['fraudulent'] == "t") & (df4['required_experience'].isnull()) & (df4['title'].str.contains("Webcam Model")), 
        "required_experience"] = "Not Applicable"

df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("intern")), 
        "required_experience"] = "Internship"
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("Junior")), 
        "required_experience"] = "Entry level"

df4.loc[(df4['fraudulent'] == "t") & (df4['required_experience'].isnull()) & (df4['description'].str.contains("experience")), 
        "required_experience"] = "Associate"

In [105]:
df4['required_experience'].isnull().value_counts()

False    15839
True      1806
Name: required_experience, dtype: int64

In [106]:
df4.loc[df4['fraudulent'] == "t", 'required_experience'].isnull().value_counts()

False    796
True      62
Name: required_experience, dtype: int64

In [107]:
df4.loc[df4['fraudulent'] == "t", 'required_experience'].value_counts()

Entry level         311
Mid-Senior level    272
Not Applicable       84
Associate            79
Director             29
Internship           11
Executive            10
Name: required_experience, dtype: int64

In [108]:
len(df4.loc[(df4['title'].str.contains("customer")) & (df4['fraudulent'] == "t") & (df4['required_experience'].isnull()), :])

2

In [109]:
df4.loc[(df4['title'].str.contains("Customer")), 'required_experience'].value_counts()

Entry level         554
Associate            93
Mid-Senior level     92
Not Applicable       76
Internship           21
Director             10
Executive             1
Name: required_experience, dtype: int64

In [110]:
df4.loc[(df4['title'].str.contains("Sales")), 'required_experience'].value_counts()

Mid-Senior level    382
Associate           312
Entry level         279
Not Applicable       95
Director             73
Executive            23
Internship           20
Name: required_experience, dtype: int64

In [111]:
df4.loc[(df4['title'].str.contains("Admin")), 'required_experience'].value_counts()

Entry level         151
Mid-Senior level    135
Associate           108
Not Applicable       99
Internship           12
Director              2
Executive             2
Name: required_experience, dtype: int64

Unfortunately, some of the entries below have to be imputed by hand, because their titles are too irregular to capture with a general rule.
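
Some of the titles matched in the next cell contain characters such as `.`, which `str.contains` interprets as regular-expression syntax unless `regex=False` is passed. The difference is visible on a toy Series (the second title is made up for the comparison):

```python
import re
import pandas as pd

titles = pd.Series(["Success is Knocking...", "Success is Knocking now", "Forward Cap."])
pattern = "Success is Knocking..."

as_regex = titles.str.contains(pattern)               # each '.' matches any character
as_literal = titles.str.contains(pattern, regex=False)  # dots are matched literally
escaped = titles.str.contains(re.escape(pattern))     # escaping gives the literal match too

print(as_regex.tolist())    # [True, True, False]
print(as_literal.tolist())  # [True, False, False]
print(escaped.tolist())     # [True, False, False]
```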

In [112]:
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("Customer")), 
        "required_experience"] = "Entry level"
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("customer")), 
        "required_experience"] = "Entry level"
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("CSR")), 
        "required_experience"] = "Entry level"
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("Admin")) & \
        (~df4['title'].str.contains("Linux")) & (~df4['title'].str.contains("System")), "required_experience"] = "Entry level"
df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("Representative")), 
        "required_experience"] = "Entry level"

df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("work from home")), 
        "required_experience"] = "Not Applicable"

# Manual imputation for titles that do not fit any general rule.
# regex=False makes characters such as "." match literally.
manual_titles = ["Nurse", "5 Guys", "Forward Cap.", "Boys & Girls Club", "Brand & Logo Design Contest", 
                 "Fidelity", "Success is Knocking...", "KMC ", "Ninestone", 
                 "Use Your Spare Time to Start Earning More", "Northwestern Hospital", "Brand Partner", 
                 "Real Services", "RNFA - St. Joseph Medical Center ", "RN - Operating Room", 
                 "OR Specialty Coordinator", "Network Marketing", "Vacancies At The Cafe Royal Hotel London"]

for manual_title in manual_titles:
    df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & 
            (df4['title'].str.contains(manual_title, regex=False)), "required_experience"] = "Not Applicable"

df4.loc[(df4['required_experience'].isnull()) & (df4['fraudulent'] == "t") & (df4['title'].str.contains("Furniture mover")), 
        "required_experience"] = "Entry level"

The remaining fraudulent entries are assumed to require Associate-level experience (around 2-3 years).

In [113]:
df4.loc[(df4['fraudulent'] == "t") & (df4['required_experience'].isnull()), 'required_experience'] = "Associate"

In [114]:
df4['required_experience'].isnull().value_counts()

False    15901
True      1744
Name: required_experience, dtype: int64

In [115]:
df4.loc[df4['fraudulent'] == "t", 'required_experience'].isnull().value_counts()

False    858
Name: required_experience, dtype: int64

In [116]:
df4.loc[df4['fraudulent'] == "t", 'required_experience'].value_counts()

Entry level         327
Mid-Senior level    272
Not Applicable      107
Associate           102
Director             29
Internship           11
Executive            10
Name: required_experience, dtype: int64

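We arrive at the last task (hopefully): for each keyword, `str.contains` finds the titles that still lack a value, and `random.choices` fills them by drawing from the two most common `required_experience` values among labelled titles with the same keyword, weighted by their counts. A few unambiguous keywords are assigned a fixed level instead.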

In [117]:
df4.loc[(df4['required_experience'].isnull()), 'title'].value_counts().to_dict()

{'Electrical Maintenance Technician': 16,
 'Software Engineer': 15,
 'Mechanical Engineer': 14,
 'Process Engineer': 11,
 'Sr. Design Engineer Mechanical - 3D CAD': 11,
 'IT Security Analyst': 11,
 'Local Representative': 11,
 'Buyer': 10,
 'Food Quality': 10,
 'Electrical Maintenance Technician - Major States': 10,
 'Luxury fragrance consultants needed for Xmas!': 9,
 'Sales Rep for AT&T Solutions Provider - Management Training': 9,
 'Sales Representative with Management Training - AT&T': 9,
 'Data Scientist': 7,
 'Home Improvement Marketing': 7,
 'Web Designer': 6,
 'Marketing and Sales Representative- Full Time Position': 6,
 'Teaching English': 6,
 'Web Developer': 6,
 'Production Supervisor': 6,
 'Client Service Professional': 6,
 'Promotions / Marketing Assistant': 6,
 'Sales Lead Generator': 6,
 'Customer Service Positions ($18-$22 an hour)': 5,
 'EXPERIENCED CAREGIVERS NEEDED TODAY!THE BEST PAY & AWESOME BENEFITS!!': 5,
 'General Application': 5,
 'Office Assistant': 5,
 'Graph

In [118]:
import random

keywords = ["SAP", "Sr", "Junior", "Jr", ["Developer", "developer", "DEVELOPER"], ["Customer", "CUSTOMER"], ["Assistant", "assistant"], "Entry Level", 
            "Associate", "intern", ["Technician", "TECHNICIAN"], ["Analyst", "ANALYST"], ["Representative", "REPRESENTATIVE"], "Lead", "Oil", "Cleaner", "Janitor",
            "Clerk", ["Design", "DESIGNER", "design"], ["Sales", "sales"], "Editor", ["Specialist", "specialist", "SPECIALIST"], "Reception", "PHP", 
            "Teacher", "Retail", "manager", "Trainee", "Fresher", ["Caregiver", "CAREGIVER"], "Care", "Home Health", "Artist", "Consultant", "Driver", "DRIVER", "driver", 
            "Executive", "Operative", "fragrance", "beauty", "Head", "Software", "Mechanical Engineer", "Dish Washer", "Dishwasher",
            ["Process Engineer", "PROCESS ENGINEER"], ["Data Scientist", "Data scientist"], "iOS Engineer", "Android Engineer", "Java", "Applications", "Translator", 
            ["Data Analyst", "DATA ANALYST", "data analyst", "Data Analysis"], "Data Engineer", "BI", "Coordinator", "Leader", ["Operator", "operator"], 
            "Real Estate", "Insurance", "Oracle", "Maintenance", "Accountant", "Web", ["Machine Learning", "Machine learning", "Deep learning"], "Decision", "Supervisor", 
            "Automation", "Testing", "Android", "HTML", ["Optician", "Opto"], "Broadcast", "Civil Engineer", ["Installer", "INSTALLER"], 
            ["Writer", "writer"], "Database", "Cook", "Advisor", "Bartender", ["Producer", "PRODUCER"], "Support", ["System", "SYSTEM"], "Controller", 
            ["Front End", "FrontEnd", "Frontend", "Front-End", "Front end"], "Controls", "IT", "Marketing", "Cloud", ["Client", "CLIENT"], "Setter", 
            "Backend", ["Data Entry", "Data entry"], "SEO", ["QA", "Quality Assurance"], "Call Center", "Social Media", "Abroad", 
            "Electronic", "Fragrance", "DBA", "Coach", "Content", "Event", ["Solution", "SOLUTION"], "Nurse", ["Warehouse", "warehouse"], "Mechanic", 
            "Photographer", ["Cashier", "cashier"], "Recruit", "Attorney", ["Teach", "teach"], "Linux", "Business Development", 
            "Food", "Inventory", "UX", "Programmer", "Chef", "Administrator", "Manufacturing", "Architect", "Buyer", "Freelance", 
            "Crew", "Part-Time", "Carpenter", "videographer", "Integration", "Housekeeper", "Therapist", "Trainer", "Instructor", 
            "Resevoir Engineer", "Dermatologist", "Material Handler", "GV", "Book", "Seamstress", "Technologist", "Help Desk", "Counselor", 
            "Implementation", "Compliance", "Regnskabsassistent", "RF Engineer", ["Reporter", "REPORTER"], "Ambas", "Full Stack", 
            "Auditor", "Pathologist", "Operations", "Production Engineer", "Stylist", "Negotiator", "House Keeping", "counter staff", 
            "Implementer", "Typist", "SAS Admin", "Financ", "Ruby", "Server", "Telemarket", "Product Owner", "Plumber", "Fundraising", 
            "Host", "Relations", "Brand", "lead", "Merchandis", "Models", "Attendant", "Office", "Sharepoint", "Marketeer", 
            "Site Reliability", "Secretary", "Department store", "HOSTING ENGINEER"]

for keyword in keywords:
    print(keyword)
    # Entry level
    if keyword in ["Junior", "Jr", "Fresher", "Clerk", "Trainee", "Caregiver", "Care", "Driver", "DRIVER", "driver", 
                   "Cleaner", "Janitor", "Dish Washer", "Dishwasher", "Entry Level", "Abroad", "Crew", "Part-Time", 
                   "Housekeeper", "House Keeping", "Material Handler", "Seamstress", "GV", "Ambas", "counter staff",
                   "Typist", "Plumber", "Fundraising", "Models", "Attendant", "Marketeer", "Department store"]:
        df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains(keyword)), 
                "required_experience"] = "Entry level"
    
    # Internship
    elif keyword in ["intern"]:
        df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains(keyword)), 
                "required_experience"] = "Internship"
    
    # Associate
    elif keyword in ["Associate", "Fragrance", "videographer", "Dermatologist", "Book", "Help Desk", "Counselor", 
                     "Regnskabsassistent", "RF Engineer", "Stylist", "Implementer", "Relations", "HOSTING ENGINEER"]:
        df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains(keyword)), 
                "required_experience"] = "Associate"
    
    # Mid-Senior level
    elif keyword in ["Sr", "manager", "Lead", "lead", "LEAD", "Head", "Resevoir Engineer", "Network engineer", "Negotiator", 
                     "SAS Admin", "Site Reliability", "Secretary"]:
        df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains(keyword)), 
                "required_experience"] = "Mid-Senior level"
        
    else:
        if isinstance(keyword, list):
            items = []
            weights = []
            for i, key in enumerate(keyword):
                print(i, key)
                if i == 0:
                    items = list(df4.loc[(df4['required_experience'].notnull()) & (df4['title'].str.contains(key)), 
                                         "required_experience"].value_counts()[0:2].index)
                    total = df4.loc[(df4['required_experience'].notnull()) & (df4['title'].str.contains(key)), 
                                    "required_experience"].value_counts()[0:2].values.sum()
                    weights = list(df4.loc[(df4['required_experience'].notnull()) & (df4['title'].str.contains(key)), 
                                           "required_experience"].value_counts()[0:2].values/total*100)
        
                    # Get the values in random fashion
                    k = len(df4[(df4['required_experience'].isnull()) & (df4['title'].str.contains(key))])
                    rand_exps = random.choices(items, weights=weights, k=k)
                    df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains(key)), 
                            "required_experience"] = rand_exps
                else:
                    # Get the values in random fashion
                    k = len(df4[(df4['required_experience'].isnull()) & (df4['title'].str.contains(key))])
                    rand_exps = random.choices(items, weights=weights, k=k)
                    df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains(key)), 
                            "required_experience"] = rand_exps
                    
        else:
            items = list(df4.loc[(df4['required_experience'].notnull()) & (df4['title'].str.contains(keyword)), 
                                 "required_experience"].value_counts()[0:2].index)
            total = df4.loc[(df4['required_experience'].notnull()) & (df4['title'].str.contains(keyword)), 
                            "required_experience"].value_counts()[0:2].values.sum()
            weights = list(df4.loc[(df4['required_experience'].notnull()) & (df4['title'].str.contains(keyword)), 
                                   "required_experience"].value_counts()[0:2].values/total*100)
        
            # Get the values in random fashion
            k = len(df4[(df4['required_experience'].isnull()) & (df4['title'].str.contains(keyword))])
            rand_exps = random.choices(items, weights=weights, k=k)
            df4.loc[(df4['required_experience'].isnull()) & (df4['title'].str.contains(keyword)), 
                    "required_experience"] = rand_exps

SAP
Sr
Junior
Jr
['Developer', 'developer', 'DEVELOPER']
0 Developer
1 developer
2 DEVELOPER
['Customer', 'CUSTOMER']
0 Customer
1 CUSTOMER
['Assistant', 'assistant']
0 Assistant
1 assistant
Entry Level
Associate
intern
['Technician', 'TECHNICIAN']
0 Technician
1 TECHNICIAN
['Analyst', 'ANALYST']
0 Analyst
1 ANALYST
['Representative', 'REPRESENTATIVE']
0 Representative
1 REPRESENTATIVE
Lead
Oil
Cleaner
Janitor
Clerk
['Design', 'DESIGNER', 'design']
0 Design
1 DESIGNER
2 design
['Sales', 'sales']
0 Sales
1 sales
Editor
['Specialist', 'specialist', 'SPECIALIST']
0 Specialist
1 specialist
2 SPECIALIST
Reception
PHP
Teacher
Retail
manager
Trainee
Fresher
['Caregiver', 'CAREGIVER']
0 Caregiver
1 CAREGIVER
Care
Home Health
Artist
Consultant
Driver
DRIVER
driver
Executive
Operative
fragrance
beauty
Head
Software
Mechanical Engineer
Dish Washer
Dishwasher
['Process Engineer', 'PROCESS ENGINEER']
0 Process Engineer
1 PROCESS ENGINEER
['Data Scientist', 'Data scientist']
0 Data Scientist
1 Data 
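
The weighted draw used in the loop above can be isolated in a small function. This is a sketch on toy data rather than the notebook's exact code: `weighted_fill` is a hypothetical name, a seeded `random.Random` is an added assumption so that reruns give the same result, and the function returns early when there is nothing to learn from or nothing to fill. `random.choices` accepts relative weights, so the raw counts can be passed without converting them to percentages.

```python
import random
import pandas as pd

def weighted_fill(data, keyword, target="required_experience", top_n=2, seed=None):
    """Fill nulls for titles containing `keyword`, sampling from the top_n observed labels."""
    df = data.copy()
    has_kw = df["title"].str.contains(keyword, regex=False, na=False)
    counts = df.loc[df[target].notnull() & has_kw, target].value_counts().head(top_n)
    missing = df[target].isnull() & has_kw
    k = int(missing.sum())
    if len(counts) == 0 or k == 0:
        return df  # no labelled examples, or no gaps to fill
    rng = random.Random(seed)
    df.loc[missing, target] = rng.choices(list(counts.index), weights=counts.tolist(), k=k)
    return df

toy = pd.DataFrame({
    "title": ["Sales Rep", "Sales Lead", "Sales Intern", "Sales Agent", "Sales Clerk", "Chef"],
    "required_experience": ["Associate", "Associate", "Entry level", None, None, None],
})

filled = weighted_fill(toy, "Sales", seed=0)
print(filled["required_experience"].tolist())
```

The two missing "Sales" rows receive either "Associate" or "Entry level", with "Associate" twice as likely, while "Chef" stays null because it does not contain the keyword.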

In [119]:
df4['required_experience'].isnull().value_counts()

False    17458
True       187
Name: required_experience, dtype: int64

Over successive attempts, the number of remaining nulls fell from 612 on the first attempt to 546, 465, 424, 324, 268, 209 and finally 187.

In [120]:
df4['required_experience'].value_counts()

Mid-Senior level    7863
Entry level         4004
Associate           3172
Not Applicable      1146
Internship           570
Director             534
Executive            169
Name: required_experience, dtype: int64

In [121]:
df4.loc[(df4['required_experience'].isnull()), 'title'].value_counts().to_dict()

{'General Application': 5,
 'VIRTUAL ASSISTANT': 2,
 'Product': 2,
 'H1B SPONSOR FOR L1/L2/OPT': 2,
 'Work at Home - Business Owner': 2,
 'Work with us': 2,
 'General submissions - NYC': 1,
 'Working from Home': 1,
 'MN Domestic Violence Advocate (Part-time)': 1,
 'Cabling Techs': 1,
 'Self Employed - Work from Home': 1,
 'Otak2 2014 Cohort (Round 1)': 1,
 'Want to work at Franq?': 1,
 'Ejecutivo de Cuentas/Ejecutivos Comercial ': 1,
 'Hiring for all FOH and BOH Positions!': 1,
 'Want to work at Global Beach?': 1,
 'Job Fair ': 1,
 'Employee for incoming department': 1,
 'Jobs in Brazil': 1,
 'Skilled CNC Millers': 1,
 'Judicatory Proctor': 1,
 'Scrum Master ': 1,
 'Patient Advocate': 1,
 'I want to work at Vend in WELLINGTON': 1,
 'Advanced Semiconductor Power Device': 1,
 'Commercial Lender-Milwaukee, WI': 1,
 'Flyer distributor': 1,
 'Street Team': 1,
 "John's Talent Network": 1,
 'Cold Call Applicants': 1,
 'Licensed Social Worker': 1,
 'Domain Expert': 1,
 'Application Form': 1,
 

In [122]:
df4['required_experience'].isnull().value_counts()

False    17458
True       187
Name: required_experience, dtype: int64

The remaining 1.1% of missing observations will be given the value "Not Applicable". Most of these titles are too vague, and the description/requirements columns do not provide enough information to infer an experience level.

In [123]:
# Impute the remaining to end this column
df4.loc[(df4['required_experience'].isnull()), "required_experience"] = "Not Applicable"

In [124]:
df4['required_experience'].isnull().sum()

0
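The missing share quoted above can be checked directly, and `fillna` offers an equivalent way to write the final imputation. The snippet below is only a minimal sketch on a toy frame: `toy` and its values are made up for illustration and stand in for `df4`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df4 (values are illustrative only)
toy = pd.DataFrame(
    {"required_experience": ["Entry level", np.nan, "Associate", np.nan]}
)

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing observations
missing_share = toy["required_experience"].isnull().mean()

# Equivalent to the .loc assignment: replace every remaining NaN
toy["required_experience"] = toy["required_experience"].fillna("Not Applicable")
remaining = int(toy["required_experience"].isnull().sum())
```

With the counts shown above, the same calculation gives 187 / (17458 + 187), which is about 1.1%.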

#### iii. Required Education

There are 8028 missing values in this column, which makes it even more challenging than the previous one. However, the imputation process should be similar to the one used for required experience, relying on string manipulation methods.

In [125]:
df4['required_education'].value_counts()

Bachelor's Degree                    5107
High School or equivalent            2002
Unspecified                          1375
Master's Degree                       416
Associate Degree                      264
Certification                         165
Some College Coursework Completed     100
Professional                           73
Vocational                             47
Some High School Coursework            27
Doctorate                              26
Vocational - HS Diploma                 9
Vocational - Degree                     6
Name: required_education, dtype: int64

From here, we know which values deserve the most attention: Bachelor's Degree, Master's Degree, Associate Degree, High School or equivalent, Certification/Professional and Doctorate. This gives us around 6 to 7 groups of keywords to search for.
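One way to keep these keyword groups in a single place is a mapping from target label to search pattern. The sketch below is an assumption, not the notebook's actual method: `EDUCATION_PATTERNS` and `guess_education` are hypothetical names, the patterns are deliberately simple, and the dictionary order (which decides ties when several groups match) is an arbitrary choice that would need checking against real rows.

```python
import re

# Hypothetical mapping: labels come from the value counts above,
# the patterns themselves are assumptions to be verified on real rows.
EDUCATION_PATTERNS = {
    "Doctorate": r"\b(?:phd|doctorate)\b",
    "Master's Degree": r"\bmaster['’]?s\b",
    "Bachelor's Degree": r"\bbachelor",
    "Associate Degree": r"\bassociate['’]?s? degree\b",
    "High School or equivalent": r"\bhigh[ -]school\b",
    "Certification": r"\bcertifi",
}


def guess_education(text):
    """Return the first label whose pattern matches the text, else None."""
    for label, pattern in EDUCATION_PATTERNS.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return label
    return None


print(guess_education("Bachelor’s degree in Mathematics"))
print(guess_education("Scrum Master with 5 years of experience"))
```

Requiring the possessive or plural form for "master" is one way to avoid matching job titles such as "Scrum Master", at the cost of missing phrasings like "Master degree".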

In [126]:
df4[df4['required_education'].isnull()]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
0,Marketing Intern,"US, NY, New York",Marketing,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,f,t,f,Other,Internship,,,Marketing,f,f,US,NY,New York
1,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,f,t,f,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,f,f,NZ,,Auckland
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p>Our client, located in Houston, is actively seeking an experienced Commissioning Machinery As...",<ul>\r\n<li>Implement pre-commissioning and commissioning procedures for rotary equipment.</li>\...,,f,t,f,,Mid-Senior level,,,,f,f,US,IA,Wever
5,Accounting Clerk,"US, MD,",,,,<p><b>Job Overview</b></p>\r\n<p>Apex is an environmental consulting firm that offers stable lea...,,,f,f,f,,Mid-Senior level,,,,f,f,US,MD,
7,Lead Guest Service Specialist,"US, CA, San Francisco",,,<p>Airenvy’s mission is to provide lucrative yet hassle free full service short term property ma...,<h3>Who is Airenvy?</h3>\r\n<p>Hey there! We are seasoned entrepreneurs in the heart of San Fran...,"<ul>\r\n<li>Experience with CRM software, live chat, and phones, including one year minimum of c...",<p><b>Competitive Pay.</b> You'll be able to eat steak everyday if you choose to. </p>\r\n<p><b...,f,t,t,,Entry level,,,,f,f,US,CA,San Francisco
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17872,Product Manager,"US, CA, San Francisco",Product Development,,<p>Flite delivers ad innovation at scale to the world's top publishers and brands. Marketers use...,<p>Flite's SaaS display ad platform fuels the world's top publishers and brands by reducing the ...,<ul>\r\n<li>\r\nBA/BS in Computer Science or a related technical field\r\n</li>\r\n<li>\r\nAt le...,<ul>\r\n<li>Competitive base</li>\r\n<li>Attractive stock option plan</li>\r\n<li>Medical/Dental...,f,t,f,Full-time,Mid-Senior level,,Internet,Product Management,f,f,US,CA,San Francisco
17873,Recruiting Coordinator,"US, NC, Charlotte",,,,<p><b>RESPONSIBILITIES:</b></p>\r\n<ul>\r\n<li>Will facilitate the recruiting and hiring process...,<p><b>REQUIRED SKILLS:</b></p>\r\n<ul>\r\n<li>Associates Degree or a combination of education pl...,,f,t,f,Contract,Entry level,,Utilities,,f,f,US,NC,Charlotte
17875,Account Director - Distribution,"CA, ON, Toronto",Sales,,<p>Vend is looking for some awesome new talent to come join us. You'll be working in an awesome ...,<p>Just in case this is the first time you’ve visited our website Vend is an award winning web b...,<p>To ace this role you:</p>\r\n<ul>\r\n<li>Will eat comprehensive Statements of Work for breakf...,<p><b>What can you expect from us?</b></p>\r\n<p>We have an open culture where we openly share o...,f,t,t,Full-time,Mid-Senior level,,Computer Software,Sales,f,f,CA,ON,Toronto
17877,Project Cost Control Staff Engineer - Cost Control Exp - TX,"US, TX, Houston",,,<p>We Provide Full Time Permanent Positions for many medium to large US companies. We are intere...,<p>Experienced Project Cost Control Staff Engineer is required having responsibility to provide ...,<ul>\r\n<li>At least 12 years professional experience.</li>\r\n<li>Ability to work in a diverse ...,,f,f,f,Full-time,Mid-Senior level,,,,f,f,US,TX,Houston


As before, we first test how well keywords can be detected before using them for imputation.

In [127]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("Bachelor").value_counts()

False    5370
True      559
Name: requirements, dtype: int64

In [128]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("bachelor").value_counts()

False    5883
True       46
Name: requirements, dtype: int64
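The two checks above differ only in capitalisation. As a sketch, `str.contains` accepts `case=False` to fold both spellings into one test, and `na=False` to treat missing text as "no match". The `toy` series is illustrative and stands in for the `requirements` column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df4['requirements'] (values are illustrative only)
toy = pd.Series(
    [
        "Bachelor's degree required",
        "bachelor degree",
        "BACHELOR of Science",
        "No degree needed",
        np.nan,
    ]
)

# case=False matches "Bachelor", "bachelor" and "BACHELOR" in one pass;
# na=False counts missing text as a non-match instead of returning NaN
has_bachelor = toy.str.contains("bachelor", case=False, na=False)
n_matches = int(has_bachelor.sum())
```

Note that with `na=False` the rows with missing text are counted as `False`, whereas the cells above drop them from `value_counts()`.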

In [129]:
df4[(df4['requirements'].str.contains("Bachelor", na=False)) & (df4['required_education'].isnull())].head()

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
19,Process Controls Engineer - DCS PLC MS Office - PA,"US, PA, USA Northeast",,,<p>We Provide Full Time Permanent Positions for many medium to large US companies. We are intere...,<p>Experienced Process Controls Engineer is required having responsibility to monitor the facili...,"<ul>\r\n<li>Must have 5 or more years of experience with DCS programming, troubleshooting, and m...",,f,f,f,Full-time,Mid-Senior level,,,,f,f,US,PA,USA Northeast
64,SENIOR FINANCE SOFTWARE RESEARCHER AND ENGINEER,"US, ,",,,,"<p>DUTIES: Conduct research for building technical, statistical, algorithmic and math models</p>...","<p>REQUIREMENTS: Bachelor’s degree in Mathematics, statistics, computer software</p>\r\n<p>engin...",,f,f,f,,Mid-Senior level,,,,f,f,US,,
121,Smart-Meter Expert,"DE, BY, Wiepoldsried",tech,,<p>hello world</p>\r\n<p>talents23_ drives the change in digital recruitment and develops the be...,<p>We have extensive experience in battery storage technologies and renewable energies. As a med...,<ul>\r\n<li>Expert in best of class metering solutions for US solar &amp; storage designs.</li>\...,"<p>Want to be part of a fast growing, high energetic and motivated team?</p>\r\n<p>We afford a i...",f,t,t,Full-time,Mid-Senior level,,,,f,f,DE,BY,Wiepoldsried
170,Micro-grid Systems Engineer,"DE, BY, Wiepoldsried",tech,,<p>hello world</p>\r\n<p>talents23_ drives the change in digital recruitment and develops the be...,<p>We have extensive experience in battery storage technologies and renewable energies. As a med...,<ul>\r\n<li>Experience with utility interactive micro-grid design and standalone backup design</...,"<p>Want to be part of a fast growing, high energetic and motivated team?</p>\r\n<p>We afford a i...",f,t,t,Full-time,Mid-Senior level,,,,f,f,DE,BY,Wiepoldsried
182,Facilities Engineer,"US, TX, Houston",,,<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p><b>SUMMARY</b></p>\r\n<p>Provide engineering support to execute the scope, technical evaluati...",<p><b>EDUCATION and/or EXPERIENCE</b></p>\r\n<ul>\r\n<li>Bachelor’s degree in Chemical or Mechan...,,f,t,t,Full-time,Mid-Senior level,,Oil & Energy,Engineering,f,f,US,TX,Houston


In [130]:
df4.loc[(df4['required_education'].isnull()) & (~df4['requirements'].str.contains("Bachelor", na=False)), 
        'requirements'].str.contains("Degree").value_counts()

False    5137
True      233
Name: requirements, dtype: int64

In [131]:
# Verify at least 10 matching rows by hand before using this condition for imputation!
df4.loc[(df4['required_education'].isnull()) & (~df4['requirements'].str.contains("Bachelor", na=False)), 
        'requirements'].str.contains("degree").value_counts()

False    4736
True      634
Name: requirements, dtype: int64

In [132]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("Master").value_counts()

False    5756
True      173
Name: requirements, dtype: int64

In [133]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("High School").value_counts()

False    5816
True      113
Name: requirements, dtype: int64

In [134]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("High school").value_counts()

False    5795
True      134
Name: requirements, dtype: int64

In [135]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("high school").value_counts()

False    5891
True       38
Name: requirements, dtype: int64

In [136]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("PhD").value_counts()

False    5900
True       29
Name: requirements, dtype: int64

In [137]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("Doctorate").value_counts()

False    5928
True        1
Name: requirements, dtype: int64

In [138]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("Certifi").value_counts()

False    5755
True      174
Name: requirements, dtype: int64

In [139]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("graduate degree").value_counts()

False    5908
True       21
Name: requirements, dtype: int64

In [140]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("Associate's degree").value_counts()

False    5928
True        1
Name: requirements, dtype: int64

In [141]:
df4.loc[(df4['required_education'].isnull()) & (~df4['requirements'].str.contains("Bachelor", na=False)) & (~df4['requirements'].str.contains("Degree", na=False)), 
        'requirements'].str.contains("BA").value_counts()

False    4962
True      175
Name: requirements, dtype: int64

In [142]:
df4.loc[(df4['required_education'].isnull()) & (~df4['requirements'].str.contains("Bachelor", na=False)) & (~df4['requirements'].str.contains("Degree", na=False)), 
        'requirements'].str.contains("BS").value_counts()

False    4904
True      233
Name: requirements, dtype: int64

In [143]:
df4.loc[(df4['required_education'].isnull()) & (~df4['requirements'].str.contains("Bachelor", na=False)) & (~df4['requirements'].str.contains("Degree", na=False)), 
        'requirements'].str.contains("B.A").value_counts()

False    5111
True       26
Name: requirements, dtype: int64

In [144]:
df4.loc[(df4['required_education'].isnull()) & (~df4['requirements'].str.contains("Bachelor", na=False)) & (~df4['requirements'].str.contains("Degree", na=False)), 
        'requirements'].str.contains("B.S").value_counts()

False    5051
True       86
Name: requirements, dtype: int64
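A caveat on the abbreviation checks: `str.contains` treats its pattern as a regular expression by default, so the `.` in "B.A" or "B.S" matches any character, and plain "BS" or "BA" also match inside longer words. The sketch below uses made-up strings to show the difference; the word-boundary pattern is one possible tightening, not something the notebook itself uses.

```python
import pandas as pd

# Illustrative strings only
toy = pd.Series(
    [
        "B.S. in Computer Science",
        "BASIC computer skills",
        "BUSINESS acumen",
        "BS or BA required",
        "JOBS available",
    ]
)

# Default regex=True: "." is a wildcard, so "BAS" and "BUS" also match
loose = toy.str.contains("B.S")

# regex=False searches for the literal three characters "B.S"
literal = toy.str.contains("B.S", regex=False)

# Plain "BS" also matches inside "JOBS"
plain_bs = toy.str.contains("BS")

# Word boundaries keep only the standalone abbreviation
bounded_bs = toy.str.contains(r"\bBS\b")
```

Counts obtained with the loose patterns are therefore upper bounds and are worth spot-checking before imputing from them.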

In [145]:
df4.loc[(df4['required_education'].isnull()) & (~df4['requirements'].str.contains("Bachelor", na=False)) & (~df4['requirements'].str.contains("Degree", na=False)), 
        'requirements'].str.contains("BEng").value_counts()

False    5135
True        2
Name: requirements, dtype: int64

In [146]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("Associates Degree").value_counts()

False    5922
True        7
Name: requirements, dtype: int64

In [147]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("Associate's Degree").value_counts()

False    5923
True        6
Name: requirements, dtype: int64

In [148]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("Associate Degree").value_counts()

False    5924
True        5
Name: requirements, dtype: int64

In [149]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("Associates degree").value_counts()

False    5924
True        5
Name: requirements, dtype: int64

In [150]:
df4.loc[(df4['required_education'].isnull()) & (~df4['requirements'].str.contains("High School", na=False)) & (~df4['requirements'].str.contains("high school", na=False)), 
        'requirements'].str.contains("Diploma").value_counts()

False    5729
True       49
Name: requirements, dtype: int64

In [151]:
df4.loc[(df4['required_education'].isnull()) & (~df4['requirements'].str.contains("High School", na=False)) & (~df4['requirements'].str.contains("high school", na=False)), 
        'requirements'].str.contains("diploma").value_counts()

False    5599
True      179
Name: requirements, dtype: int64

Running total so far: roughly 2928 possible imputations.
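If a total like this is obtained by adding up per-keyword counts, rows that match several keywords are counted more than once. A minimal sketch of the difference, on a made-up frame standing in for `df4` (the keyword list is also just an example):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df4 (values are illustrative only)
toy = pd.DataFrame(
    {
        "required_education": [np.nan, np.nan, np.nan, "Doctorate"],
        "requirements": [
            "Bachelor's Degree or High School diploma",
            "PhD required",
            np.nan,
            "PhD",
        ],
    }
)
keywords = ["Bachelor", "Degree", "High School", "PhD"]

missing = toy["required_education"].isnull()

# One boolean column per keyword
hits = pd.DataFrame(
    {kw: toy["requirements"].str.contains(kw, regex=False, na=False) for kw in keywords}
)

# Summing per-keyword counts counts a row once per matching keyword
summed = int(hits[missing].sum().sum())

# any(axis=1) counts each row at most once
unique_rows = int(hits[missing].any(axis=1).sum())
```

Here the first row matches three keywords, so the summed figure is 4 while only 2 distinct rows are actually imputable.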

In [152]:
df4.loc[(df4['required_education'].isnull()) & (~df4['requirements'].str.contains("Bachelor", na=False)) & (~df4['requirements'].str.contains("Degree", na=False)), 
        'requirements'].str.contains("Engineer").value_counts()

False    4967
True      170
Name: requirements, dtype: int64

In [153]:
df4.loc[(df4['required_education'].isnull()) & (~df4['requirements'].str.contains("Bachelor", na=False)) & (~df4['requirements'].str.contains("BS", na=False)) & (df4['requirements'].str.contains("master's", na=False)), 
        :]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
3034,Certificate in Social Innovation Management,"KE, , Nai",,,<p>The Amani Institute is about developing whole individuals who have the knowledge and practica...,"<p>This unique, field-based program brings together a group of competitively selected, highly ta...",<p>The program is open to anyone in the world with a college degree (or at least 2 years of work...,,f,t,t,,Associate,,,,f,f,KE,,Nai
15497,Shapeways Operations Intern,"NL, NB, Eindhoven",,,"<p>Shapeways is the leading 3D printing marketplace and community, empowering designers to bring...",<p>Love all things 3D Printing? Have some fresh ideas? Want to join a successful team? Shapeways...,"<ul>\r\n<li>At your best in an international, fast-growing start-up – or similarly demanding env...",<h3>Why join our team?</h3>\r\n<p>Shapeways is breaking new ground in the field of 3D printing. ...,f,t,f,Full-time,Internship,,Information Technology and Services,Manufacturing,f,f,NL,NB,Eindhoven
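A keyword hit does not always describe a requirement. In row 3034, for example, "master's" appears in a sentence about a prospective applicant rather than in an education requirement. One way to speed up manual verification is to print only the text around each hit. The helper below is a hypothetical sketch (`keyword_context` is not part of the notebook), and the sample sentence is shortened from that row.

```python
import re


def keyword_context(text, keyword, width=40):
    """Return the text surrounding each literal occurrence of keyword."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end])
    return snippets


sample = (
    "Natasha has been looking for a way to improve her skills but "
    "hasn't found a master's program that provides what she needs"
)
snippets = keyword_context(sample, "master's", width=20)
for snippet in snippets:
    print("..." + snippet + "...")
```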


In [154]:
df4.loc[3034, "requirements"]

"<p>The program is open to anyone in the world with a college degree (or at least 2 years of work experience) who is eager to train intensively to build his/her skills at leading social change. We are especially looking for people like:</p>\r\n<p>\xa0</p>\r\n<ul>\r\n<li>Andrew, a recent graduate determined to create a life and career of addressing the world's biggest social challenges but isn't sure where to start</li>\r\n<li>Natasha, who works for an international NGO but has been looking for a way to improve her practical skills and leadership capacity but hasn't found a master's program that provides what she needs</li>\r\n<li>Joseph, who has been working in an investment bank but feels that his passion for making a difference would find better use with organizations solving problems like rural development or access to health care</li>\r\n</ul>"

Now repeat the same keyword checks for the description column.

In [155]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("Bachelor").value_counts()

False    7785
True      243
Name: description, dtype: int64

In [156]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("bachelor").value_counts()

False    8005
True       23
Name: description, dtype: int64

In [157]:
df4.loc[(df4['required_education'].isnull()) & (~df4['description'].str.contains("Bachelor", na=False)), 
        'description'].str.contains("Degree").value_counts()

False    7730
True       55
Name: description, dtype: int64

In [158]:
# Verify at least 10 matching rows by hand before using this condition for imputation!
df4.loc[(df4['required_education'].isnull()) & (~df4['description'].str.contains("Bachelor", na=False)), 
        'description'].str.contains("degree").value_counts()

False    7473
True      312
Name: description, dtype: int64

In [159]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("Master's degree").value_counts()

False    8014
True       14
Name: description, dtype: int64

In [160]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("Master").value_counts()

False    7938
True       90
Name: description, dtype: int64

In [161]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("master's degree").value_counts()

False    8027
True        1
Name: description, dtype: int64

In [162]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("High School").value_counts()

False    7964
True       64
Name: description, dtype: int64

In [163]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("High school").value_counts()

False    8013
True       15
Name: description, dtype: int64

In [164]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("high school").value_counts()

False    8006
True       22
Name: description, dtype: int64

In [165]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("PhD").value_counts()

False    8020
True        8
Name: description, dtype: int64

In [166]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("Doctorate").value_counts()

False    8028
Name: description, dtype: int64

In [167]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("Certifi").value_counts()

False    7949
True       79
Name: description, dtype: int64

In [168]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("graduate degree").value_counts()

False    8004
True       24
Name: description, dtype: int64

In [169]:
df4.loc[(df4['required_education'].isnull()) & (~df4['description'].str.contains("Bachelor", na=False)) & (~df4['description'].str.contains("Degree", na=False)), 
        'description'].str.contains("BA").value_counts()

False    7597
True      133
Name: description, dtype: int64

In [170]:
df4.loc[(df4['required_education'].isnull()) & (~df4['description'].str.contains("Bachelor", na=False)) & (~df4['description'].str.contains("Degree", na=False)), 
        'description'].str.contains("BS").value_counts()

False    7602
True      128
Name: description, dtype: int64

In [171]:
df4.loc[(df4['required_education'].isnull()) & (~df4['description'].str.contains("Bachelor", na=False)) & (~df4['description'].str.contains("Degree", na=False)), 
        'description'].str.contains("B.A").value_counts()

False    7691
True       39
Name: description, dtype: int64

In [172]:
df4.loc[(df4['required_education'].isnull()) & (~df4['description'].str.contains("Bachelor", na=False)) & (~df4['description'].str.contains("Degree", na=False)), 
        'description'].str.contains("B.S").value_counts()

False    7580
True      150
Name: description, dtype: int64

In [173]:
df4.loc[(df4['required_education'].isnull()) & (~df4['description'].str.contains("Bachelor", na=False)) & (~df4['description'].str.contains("Degree", na=False)), 
        'description'].str.contains("BEng").value_counts()

False    7729
True        1
Name: description, dtype: int64

In [174]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("Associates Degree").value_counts()

False    8017
True       11
Name: description, dtype: int64

In [175]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("Associate's Degree").value_counts()

False    8028
Name: description, dtype: int64

In [176]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("Associate Degree").value_counts()

False    8024
True        4
Name: description, dtype: int64

In [177]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("Associates degree").value_counts()

False    8024
True        4
Name: description, dtype: int64

In [178]:
df4.loc[(df4['required_education'].isnull()) & (~df4['description'].str.contains("High School", na=False)) & (~df4['description'].str.contains("high school", na=False)), 
        'description'].str.contains("Diploma").value_counts()

False    7927
True       16
Name: description, dtype: int64

In [179]:
df4.loc[(df4['required_education'].isnull()) & (~df4['description'].str.contains("High School", na=False)) & (~df4['description'].str.contains("high school", na=False)), 
        'description'].str.contains("diploma").value_counts()

False    7922
True       21
Name: description, dtype: int64
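The repeated one-keyword cells can also be condensed into a loop that builds a single summary table. The sketch below is an assumption about how that could look: `keyword_hits` is a hypothetical helper, `toy` stands in for `df4`, and `regex=False` means patterns such as "B.A" are matched literally, so counts may differ from the cells above.

```python
import numpy as np
import pandas as pd


def keyword_hits(frame, column, keywords, target="required_education"):
    """Count rows with a missing target whose text column contains each keyword."""
    text = frame.loc[frame[target].isnull(), column]
    return pd.Series(
        {kw: int(text.str.contains(kw, regex=False, na=False).sum()) for kw in keywords},
        name=column,
    )


# Toy stand-in for df4 (values are illustrative only)
toy = pd.DataFrame(
    {
        "required_education": [np.nan, np.nan, "Doctorate"],
        "requirements": ["Bachelor degree", np.nan, "PhD"],
        "description": ["High School diploma", "Bachelor of Arts", "PhD"],
    }
)
keywords = ["Bachelor", "High School", "PhD"]

# One column of counts per text column, one row per keyword
summary = pd.concat(
    [keyword_hits(toy, col, keywords) for col in ["requirements", "description"]],
    axis=1,
)
print(summary)
```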

In [180]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("vocational").value_counts()

False    8015
True       13
Name: description, dtype: int64

In [181]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("vocational").value_counts()

False    5925
True        4
Name: requirements, dtype: int64

In [182]:
df4.loc[df4['required_education'].isnull(), 'description'].str.contains("vocational education").value_counts()

False    8018
True       10
Name: description, dtype: int64

In [183]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("vocational education").value_counts()

False    5929
Name: requirements, dtype: int64

In [184]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("vocational degree").value_counts()

False    5929
Name: requirements, dtype: int64

In [185]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("university degree").value_counts()

False    5913
True       16
Name: requirements, dtype: int64

In [186]:
df4.loc[(df4['required_education'].isnull()) & \
        (~df4['requirements'].str.contains('degree', na=False)), 'requirements'].str.contains("undergraduate").value_counts()

False    4909
True       17
Name: requirements, dtype: int64

In [187]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("HS diploma").value_counts()

False    5928
True        1
Name: requirements, dtype: int64

In [188]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("High-school").value_counts()

False    5927
True        2
Name: requirements, dtype: int64

In [189]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("high-school").value_counts()

False    5929
Name: requirements, dtype: int64

In [190]:
df4.loc[(df4['required_education'].isnull()) & \
        (~df4['requirements'].str.contains('degree', na=False)), 'requirements'].str.contains("Undergraduate").value_counts()

False    4923
True        3
Name: requirements, dtype: int64

In [191]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("Bachelors").value_counts()

False    5844
True       85
Name: requirements, dtype: int64

In [192]:
df4.loc[df4['required_education'].isnull(), 'requirements'].str.contains("Bachelor").value_counts()

False    5370
True      559
Name: requirements, dtype: int64

In [193]:
df4.loc[(df4['required_education'].isnull()) & \
        (~df4['requirements'].str.contains('degree', na=False)) & \
        (~df4['requirements'].str.contains('Degree', na=False)) & \
        (~df4['requirements'].str.contains('diploma', na=False)) & \
        (~df4['requirements'].str.contains('Diploma', na=False)), 'requirements'].str.contains("university").value_counts()

False    4335
True       20
Name: requirements, dtype: int64

In [194]:
df4.loc[(df4['required_education'].isnull()) & \
        (~df4['requirements'].str.contains('degree', na=False)) & \
        (~df4['requirements'].str.contains('Degree', na=False)) & \
        (~df4['requirements'].str.contains('diploma', na=False)) & \
        (~df4['requirements'].str.contains('Diploma', na=False)), 'requirements'].str.contains("University").value_counts()

False    4342
True       13
Name: requirements, dtype: int64

For lowercase "university" there is no need to include the diploma conditions, but for capitalised "University" they are needed!
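The chained exclusion conditions can be written more compactly with one case-insensitive alternation. This is only a sketch on a made-up frame, and it treats both spellings of "university" identically. If the two spellings end up needing different exclusions, as the note above suggests, separate masks would still be required.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df4 (values are illustrative only)
toy = pd.DataFrame(
    {
        "required_education": [np.nan, np.nan, np.nan, "Bachelor's Degree"],
        "requirements": [
            "Currently enrolled in University",
            "University diploma (B.Sc.) in Marketing",
            "university degree required",
            "University graduate",
        ],
    }
)

req = toy["requirements"]
missing = toy["required_education"].isnull()

# One alternation replaces the four chained ~contains(...) conditions
already_covered = req.str.contains(r"degree|diploma", case=False, na=False)
mentions_uni = req.str.contains("university", case=False, na=False)

# Rows still unexplained by the degree/diploma keywords
leftover = toy[missing & mentions_uni & ~already_covered]
```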

In [195]:
df4.loc[(df4['required_education'].isnull()) & (df4['requirements'].str.contains("university", na=False)) & \
        (~df4['requirements'].str.contains('degree', na=False)) & (~df4['requirements'].str.contains('Degree', na=False)), 
        ['title', 'description', 'requirements']]

Unnamed: 0,title,description,requirements
788,Super Marketing Specialist,<p>“What do you want to be when you grow up?” your dad asked.</p>\r\n<p>“Marine biologist? Astro...,<p>1. You have a killer marketing instinct</p>\r\n<p>2. You know how to sell</p>\r\n<p>3. You un...
3415,Brand Ambassador,"<p>TeeSmile is looking for passionate, motivated and enthusiastic people to help us grow.</p>\r\...","<ul>\r\n<li>Actively invloved with an organization (in case of a student, must be enrolled in th..."
3419,Seedcamp Summer Intern,"<h1>Seedcamp in a nutshell</h1>\r\n<p>2013 is an exciting year for us, we&rsquo;ve set ourselves...",<p><b><b>We&rsquo;re looking for people who are:<br /></b></b></p>\r\n<ul>\r\n<li>Passionate abo...
3879,Senior Java Developer,"<p>As a Senior Developer at Hoiio, you will be critical to the success of our company, as a key ...",<ul>\r\n<li>- Experience (from university or work) is required. </li>\r\n<li>- Fresh graduates a...
4323,Sales Manager,<h1><b><i>Asapy is inviting IT Sales Manager</i></b></h1>\r\n<p><b><i></i></b></p>\r\n<p>Respons...,<ul>\r\n<li>Technical background (including IT department in the university)</li>\r\n<li>experie...
4436,Marketing Manager,<p>Emergence Capital Partners is a leading venture capital firm focused on early and growth-stag...,"<p>Desired background, qualities, interests:</p>\r\n<p>- 2+ years of business/marketing/j..."
4447,I want to be a Postgrad Intern at Vend!,"<p>It used to be that to set up a retail store, you’d have to spend mega bucks buying receipt pr...",<ul>\r\n<li>Have a minimum of 5 years experience writing code for web- or iOS­-based projects</l...
5147,Intern (Summer or Immediate) - Global Banking Expansion,"<p>TransferWise is a VC-backed, international money transfer start-up co-founded by Skype’s firs...",<p><b>We're looking for somebody to:</b></p>\r\n<ul>\r\n<li>\r\nUse personal knowledge and resea...
6086,Community Manager - New York City,"<p>At WannaYum, amazing service is our core. We're looking for a Marketing &amp; Community Manag...","<p>Experience </p>\r\n<ul>\r\n<li>2-5 years’ experience in marketing, brand management, communit..."
7634,Growth Hacker,<p><b>Explovia Overview:</b> </p>\r\n<p>Explovia is a London start-up on an ambitious mis...,<p><b>Overall Responsibility: </b></p>\r\n<p>The Growth Hacker will be tasked with demand genera...


In [196]:
df4.loc[(df4['required_education'].isnull()) & (df4['requirements'].str.contains("University", na=False)) & \
        (~df4['requirements'].str.contains('degree', na=False)) & (~df4['requirements'].str.contains('Degree', na=False)), 
        ['title', 'description', 'requirements']]

Unnamed: 0,title,description,requirements
1154,Chief Technology Officer/Co-Founder,<p>If you are not entrepreneurial and are not ready to lead a global technology team then this p...,<p>Prior experience in leading a team of developers.</p>\r\n<p>Startup experience is preferred.<...
1640,Jobs in Brazil,<p>We are looking for passionate professionals who want to work in Brazil. We are acting in vari...,<p>The requirements depend a lot on the offer but they should include:</p>\r\n<ul>\r\n<li>\r\n<p...
2442,Student MATADORS for Bulls Eye Communications,<ul>\r\n<li>Recruitment of Brand Ambassadors for various clients (reputable MNCs) for different ...,<ul>\r\n<li>Strategic and analytical thinking</li>\r\n<li>Currently Enrolled in University</li>\...
3014,SEO/Content Marketing Intern,<p>Vend is an award winning web based point of sale software for retail. We’re chucking out cru...,<p>You should be/have:</p>\r\n<ul>\r\n<li>Interest in SEO and content marketing – experience is ...
3052,Senior Analytics Consultant,"<p>The company, a global management consulting firm serving clients in more than 100 countries, ...",<p></p>\r\n<ul>\r\n<li>BSc and MSc or Phd in Statistics or related science field from a well-est...
3833,Marketing & Public Relation Manager,"<p>The candidate has the main responsibility:</p>\r\n<ul>\r\n<li>To initiate contact, organise a...","<p>University diploma (B.Sc.) in Marketing and Public Relations or Communication and Marketing, ..."
4800,Specialists Required New Zealand.,<p>North and South Island locations.</p>\r\n<p></p>\r\n<p>We require specialist Doctors in a wid...,<p>Our client based in the far North of NZ is seeking a Gastroenterologist.<br><br><br></p>\r\n<...
4917,Group Accountant (Head of Accounting),"<p>ince 2005, M-BIZ Global’s innovative business model has proven to be profitable and has led t...",<p>- University graduate with BA/BS in business with emphasis in Accounting or Finance (ACCA/CIM...
5218,Web designer- Internship position,<p>Moneymarket s.a is the leading provider of online marketplaces and financial services in Gree...,<ul>\r\n<li>Applicants should be undergraduates of a Greek University and meet all the requireme...
6729,Software Developer,"<p>We are seeking a talented, hardworking, and very driven software developer for our office in ...","<h3>Required skills to have:</h3>\r\n<ul>\r\n<li>Java (excellent knowledge, or ability and willi..."
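The truncated previews above are hard to read because the text still contains HTML tags and entities. For manual verification, a small helper that strips the markup can help. The function below is a hypothetical sketch using only the standard library, and the `raw` string is made up in the style of the data.

```python
import html
import re


def strip_html(text):
    """Remove tags, decode entities and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)   # drop anything that looks like a tag
    unescaped = html.unescape(no_tags)        # e.g. &amp; becomes &
    return re.sub(r"\s+", " ", unescaped).strip()


# Illustrative input only
raw = "<ul>\r\n<li>BSc and MSc or Phd in Statistics</li>\r\n<li>Java &amp; SQL</li>\r\n</ul>"
clean = strip_html(raw)
print(clean)
```

A regular expression is a rough tool for HTML and can misfire on unusual markup, so this is meant for eyeballing rows rather than as a definitive cleaning step.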


In [197]:
df4.loc[(df4['required_education'].isnull()) & (df4['description'].str.contains("vocational", na=False)), :]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
3816,Financial Advisor,"CA, ON, Mississauga",,,,<p>JOB DESCRIPTION<br />If you are considering a new path working with a growing company after m...,,,f,f,f,,Mid-Senior level,,,,f,f,CA,ON,Mississauga
4450,CNC Machinist,"US, CA, Los Angeles",,,<p>We Provide Full Time Permanent Positions for many medium to large US companies. We are intere...,<p><b><i>(We have more than 1500+ Job openings in our website and some of them are relevant to t...,,,f,f,f,Full-time,Mid-Senior level,,,,f,f,US,CA,Los Angeles
6866,Residential Aide,"US, PA, Berwyn",,,,"<p>Seeking a rewarding career, helping others with room for personal and professional growth?</p...",<p><b>Basic Qualifications </b></p>\r\n<ul>\r\n<li>\r\n<b>Education/Training </b>\r\n<ul>\r\n<li...,,f,f,t,Full-time,Associate,,,,f,f,US,PA,Berwyn
9619,CNC Machinist,"US, MI, Detroit",,,<p>We Provide Full Time Permanent Positions for many medium to large US companies. We are intere...,<p><b><i>(We have more than 1500+ Job openings in our website and some of them are relevant to t...,,,f,f,f,Full-time,Mid-Senior level,,,,f,f,US,MI,Detroit
11289,Program Assistant QIDP Assistant Day Program,"US, OH, Seville",,,<p>SHC/The Arc of Medina County is a great place to start or continue your career. Whether you’r...,<p><b>Job Title: Program Assistant QIDP Assistant - Links</b></p>\r\n<p><b> </b></p>\r\n<p><b>P...,<p><b> Bona-fide Occupationally Required Competencies and Credentials:</b></p>\r\n<ul>\r\n<li>...,<p>Non-Exempt Full Time 40 hours/week</p>\r\n<p>Retirement Plan</p>\r\n<p>Health Benefits</p>...,f,t,f,,Mid-Senior level,,,,f,f,US,OH,Seville
11526,CNC Machinist,"US, MN, Minneapolis",,,<p>We Provide Full Time Permanent Positions for many medium to large US companies. We are intere...,<p><b><i>(We have more than 1500+ Job openings in our website and some of them are relevant to t...,,,f,f,f,Full-time,Mid-Senior level,,,,f,f,US,MN,Minneapolis
14322,CNC Machinist,"US, MI, Detroit",,,<p>We Provide Full Time Permanent Positions for many medium to large US companies. We are intere...,<p><b><i>(We have more than 1500+ Job openings in our website and some of them are relevant to t...,,,f,f,f,Full-time,Mid-Senior level,,,,f,f,US,MI,Detroit
14998,CNC Machinist,"US, MO, St. Louis",,,<p>We Provide Full Time Permanent Positions for many medium to large US companies. We are intere...,<p><b><i>(We have more than 1500+ Job openings in our website and some of them are relevant to t...,,,f,f,f,Full-time,Mid-Senior level,,,,f,f,US,MO,St. Louis
15043,CNC Machinist,"US, MA, Boston",,,<p>We Provide Full Time Permanent Positions for many medium to large US companies. We are intere...,<p><b><i>(We have more than 1500+ Job openings in our website and some of them are relevant to t...,,,f,f,f,Full-time,Mid-Senior level,,,,f,f,US,MA,Boston
15895,CNC Machinist,"US, OH, Cleveland",,,<p>We Provide Full Time Permanent Positions for many medium to large US companies. We are intere...,<p><b><i>(We have more than 1500+ Job openings in our website and some of them are relevant to t...,,,f,f,f,Full-time,Mid-Senior level,,,,f,f,US,OH,Cleveland


In [198]:
df4.loc[(df4['required_education'].isnull()) & (df4['requirements'].str.contains("professional")), ['title', 'description', 'requirements']]

Unnamed: 0,title,description,requirements
172,Registrar's in Psychiatry,"<p>We are seeking Registrar's in Psychiatry for a variety of locations throughout QLD,Vic,NSW,SA...",<p>At least 12 months experience at Registrar Level in Psychiatry.</p>\r\n<p></p>\r\n<p>Registra...
234,Postgraduate Certificate in Social Innovation Management Kenya - March 2015,"<p>This unique, field-based, full-time program brings together 25 individuals from different cou...",<p>What do we look for in a program participant?</p>\r\n<p>If you meet the majority of the requi...
381,Junior Front End Developer - Javascript,<p>Advisor Websites is looking for a talented and self-motivated front-end developer to join our...,<ul>\r\n<li>Self-reliant learner and creative problem-solver</li>\r\n<li>Bachelor's degree in So...
422,Senior Frontend Developer,<p><b>About the Company</b><br>We are ticketscript - the European market leader in digital self-...,<p><b>Your profile</b><br>Would you like to work in a professional environment with young and mo...
423,Agent-Inbound Sales Position,<p>Are you ready to start your sales career with a growing organization in a call center sales a...,<p>As a Customer Service Sales Representative you should be driven to succeed and exceed custome...
...,...,...,...
17802,Front Office Manager/reception,<p>Our company is looking for a full time employee to manage our front desk. Experience in phys...,"<p>Mange &amp; train front desk staff providing services to guests in a friendly, efficient &amp..."
17806,Front Office Manager/Reception,<p>Our company is looking for a full time employee to manage our front desk. Experience in physi...,"<p>Mange &amp; train front desk staff providing services to guests in a friendly, efficient &amp..."
17828,Sales Associate,<p><b>LEARN TO EARN AN EXECUTIVE LEVEL INCOME</b></p>\r\n<p><b>FULL TRAINING AND SUPPORT FROM EX...,<p><b>What You Can Do.</b></p>\r\n<p><b> </b></p>\r\n<p>• Have the potential to earn an executiv...
17873,Recruiting Coordinator,<p><b>RESPONSIBILITIES:</b></p>\r\n<ul>\r\n<li>Will facilitate the recruiting and hiring process...,<p><b>REQUIRED SKILLS:</b></p>\r\n<ul>\r\n<li>Associates Degree or a combination of education pl...


In [199]:
df4.loc[234, "requirements"]

'<p>What do we look for in a program participant?</p>\r\n<p>If you meet the majority of the requirements below, we would love to receive your application.</p>\r\n<ul>\r\n<li>A university degree (undergraduate)</li>\r\n<li>Ideally two years of practical experience (either working or volunteering)</li>\r\n<li>Evidence of commitment to social change through your personal and/or professional life</li>\r\n<li>Strong desire to develop yourself further both professionally and personally</li>\r\n</ul>'

In [200]:
df4['required_education'].value_counts()

Bachelor's Degree                    5107
High School or equivalent            2002
Unspecified                          1375
Master's Degree                       416
Associate Degree                      264
Certification                         165
Some College Coursework Completed     100
Professional                           73
Vocational                             47
Some High School Coursework            27
Doctorate                              26
Vocational - HS Diploma                 9
Vocational - Degree                     6
Name: required_education, dtype: int64

Only 1442. Some of these may overlap, but hopefully at least half of them do not.

Let's build the imputation code step by step. We pass col as a parameter so that the same function can check two columns, requirements and description.

In [201]:
df4.loc[10897, 'required_education'] = "High School or equivalent"
df4.loc[5220, 'required_education'] = "Unspecified"

In [202]:
keywords = [
    "Master's degree", "master's degree", "Associate's degree", "Associate's Degree", "Associate Degree", "Associate degree",  
    "Associates Degree", "Associates degree", "Bachelor", "bachelor", "Degree", "degree", "Undergraduate", "undergraduate", 
    "BA", "BS", "B.A", "B.S", "BEng", "High School", "High school", "high school", "HS diploma", "High-school", "PhD", 
    "Doctorate", "Ph.D", "Ph.d", "vocational education", "Diploma", "diploma"
]
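Several of these keywords are short abbreviations, and str.contains interprets its pattern as a regular expression by default, so a keyword can match more text than intended. The snippet below is a small illustration on invented strings (the Series named samples is not from the dataset); it compares a plain substring match, the default regex match, a literal match, and a word-boundary match.

```python
import pandas as pd

# Invented example strings, not taken from the dataset
samples = pd.Series(["BA in Economics", "MBA preferred", "BBA graduates", "B.A. required"])

# Plain substring: "BA" is also found inside "MBA" and "BBA"
substring_hits = samples.str.contains("BA")
# Default behaviour: the pattern is a regex, so "." matches any character
regex_hits = samples.str.contains("B.A")
# Literal match: "." must be an actual dot
literal_hits = samples.str.contains("B.A", regex=False)
# Word-boundary match for the bare abbreviation only
word_hits = samples.str.contains(r"\bBA\b")

print(pd.DataFrame({
    "text": samples,
    "substring": substring_hits,
    "regex": regex_hits,
    "literal": literal_hits,
    "word": word_hits,
}))
```

Whether the looser or the stricter match is preferable depends on how many false positives are acceptable for each label.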

In [203]:
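def impute_edu(df, words, col):
    """Impute missing required_education values from keywords found in df[col].

    Keywords are checked in list order and only rows that are still null are
    filled, so earlier keywords take precedence over later ones. Note that
    str.contains treats each keyword as a regular expression by default.
    The DataFrame is modified in place and also returned.
    """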
    print("Total keywords to loop for column", col + ":", len(words))
    for i, word in enumerate(words):
        print("Word number", str(i) + ":", word)
        
        # Master
        if word in ["Master's degree", "master's degree"]:
            if len(df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)), "required_education"]) > 0:
                df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)), 
                       "required_education"] = "Master's Degree"
                print("Imputed", word)
            
        # Associate Degree
        elif word in ["Associate's degree", "Associate's Degree", "Associate Degree", "Associate degree",  
                      "Associates Degree", "Associates degree"]:
            if len(df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)), "required_education"]) > 0:
                df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)), 
                       "required_education"] = "Associate Degree"
                print("Imputed", word)
        
        # Bachelor's Degree
        elif word in ["Bachelor", "bachelor"]:
            if len(df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)), "required_education"]) > 0:
                df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)), 
                       "required_education"] = "Bachelor's Degree"
                print("Imputed", word)
        elif word in ["Degree", "degree"]:
            if len(df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)) & \
                   (~df[col].str.contains("Bachelor", na=False)) & (~df[col].str.contains("bachelor", na=False)), 
                   "required_education"]) > 0:
                df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)) & \
                       (~df[col].str.contains("Bachelor", na=False)) & (~df[col].str.contains("bachelor", na=False)), 
                       "required_education"] = "Bachelor's Degree"
                print("Imputed", word)
        elif word in ['Undergraduate', 'undergraduate']:
            if len(df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)) & \
                   (~df[col].str.contains("Degree", na=False)) & (~df[col].str.contains("degree", na=False)), 
                   "required_education"]) > 0:
                df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)) & \
                       (~df[col].str.contains("Degree", na=False)) & (~df[col].str.contains("degree", na=False)), 
                       "required_education"] = "Bachelor's Degree"
                print("Imputed", word)
            
        # BA or BS or BEng
        elif word in ["BA", "BS", "B.A", "B.S", "BEng"]:
            if len(df.loc[(df['required_education'].isnull()) & (~df[col].str.contains("Bachelor", na=False)) & \
                   (~df[col].str.contains("bachelor", na=False)) & (~df[col].str.contains("Degree", na=False)) & \
                   (~df[col].str.contains("degree", na=False)) & (df[col].str.contains(word, na=False)), 
                   "required_education"]) > 0:
                df.loc[(df['required_education'].isnull()) & (~df[col].str.contains("Bachelor", na=False)) & \
                       (~df[col].str.contains("bachelor", na=False)) & (~df[col].str.contains("Degree", na=False)) & \
                       (~df[col].str.contains("degree", na=False)) & (df[col].str.contains(word, na=False)), 
                       "required_education"] = "Bachelor's Degree"
                print("Imputed", word)
            
        # High School
        elif word in ["High School", "High school", "high school", "HS diploma", "High-school"]:
            if len(df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)) & \
                          (~df[col].str.contains("Tutorizon", na=False)), "required_education"]) > 0:
                df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)) & \
                       (~df[col].str.contains("Tutorizon", na=False)), "required_education"] = "High School or equivalent"
                print("Imputed", word)
            
        # Doctorate
        elif word in ["PhD", "Doctorate", "Ph.D", "Ph.d"]:
            if len(df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)), "required_education"]) > 0:
                df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)), 
                       "required_education"] = "Doctorate"
                print("Imputed", word)
            
        # Vocational
        elif word == "vocational education":
            if len(df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)), "required_education"]) > 0:
                df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)), 
                       "required_education"] = "Vocational"
                print("Imputed", word)
            
        # Diploma (certification)
        elif word in ["Diploma", "diploma"]:
            if len(df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)) & \
                   (~df[col].str.contains("high school", na=False)) & (~df[col].str.contains("High school", na=False)) & \
                   (~df[col].str.contains("High School", na=False)) & (~df[col].str.contains("HS", na=False)) & \
                   (~df[col].str.contains("diplomacy", na=False)) & (~df[col].str.contains("diplomatic", na=False)) & \
                   (~df[col].str.contains("degree", na=False)) & (~df[col].str.contains("Bachelor", na=False)), 
                   "required_education"]) > 0:
                df.loc[(df['required_education'].isnull()) & (df[col].str.contains(word)) & \
                       (~df[col].str.contains("high school", na=False)) & (~df[col].str.contains("High school", na=False)) & \
                       (~df[col].str.contains("High School", na=False)) & (~df[col].str.contains("HS", na=False)) & \
                       (~df[col].str.contains("diplomacy", na=False)) & (~df[col].str.contains("diplomatic", na=False)) & \
                       (~df[col].str.contains("degree", na=False)) & (~df[col].str.contains("Bachelor", na=False)), 
                       "required_education"] = "Certification"
                print("Imputed", word)
                
    return df
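
Every branch of impute_edu follows the same pattern: select rows that are still null, contain the keyword, and contain none of a few excluded strings, then assign a label. A minimal sketch of that pattern as a rule table is shown below. It is not the function used in this notebook: the names RULES, contains_any and impute_from_rules and the toy DataFrame are made up for illustration, only a few of the rules are reproduced, and matching is literal and case-insensitive, so it would not reproduce the counts reported in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical rule table: (keywords, excluded substrings, label to impute).
# Order matters: earlier rules claim rows first, as in impute_edu.
RULES = [
    (["master's degree"], [], "Master's Degree"),
    (["bachelor"], [], "Bachelor's Degree"),
    (["high school", "hs diploma"], [], "High School or equivalent"),
    (["diploma"], ["high school", "degree", "bachelor"], "Certification"),
]

def contains_any(text, words):
    """True where text contains at least one of words (literal, case-insensitive)."""
    mask = pd.Series(False, index=text.index)
    for w in words:
        mask |= text.str.contains(w, case=False, regex=False, na=False)
    return mask

def impute_from_rules(df, col, rules=RULES, target="required_education"):
    """Return a copy of df with null target values filled from the rule table."""
    out = df.copy()
    for words, excluded, label in rules:
        mask = out[target].isnull() & contains_any(out[col], words)
        if excluded:
            mask &= ~contains_any(out[col], excluded)
        out.loc[mask, target] = label
    return out

# Invented toy data standing in for df4
toy = pd.DataFrame({
    "requirements": [
        "Bachelor degree in Accounting",
        "High school diploma required",
        "Diploma in nursing",
        None,
        "Master's Degree preferred",
    ],
    "required_education": [np.nan, np.nan, np.nan, np.nan, "Unspecified"],
})
result = impute_from_rules(toy, "requirements")
print(result["required_education"].tolist())
```

Keeping the rules as data makes it easier to add a keyword or an exclusion without copying a whole branch.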

In [204]:
df4['required_education'].isnull().value_counts()  # 8028 nulls

False    9617
True     8028
Name: required_education, dtype: int64

In [205]:
# Execute the loop function
df5 = df4.copy()
df5 = impute_edu(df5, keywords, "requirements")
df5 = impute_edu(df5, keywords, "description")

Total keywords to loop for column requirements: 31
Word number 0: Master's degree
Imputed Master's degree
Word number 1: master's degree
Imputed master's degree
Word number 2: Associate's degree
Imputed Associate's degree
Word number 3: Associate's Degree
Imputed Associate's Degree
Word number 4: Associate Degree
Imputed Associate Degree
Word number 5: Associate degree
Word number 6: Associates Degree
Imputed Associates Degree
Word number 7: Associates degree
Imputed Associates degree
Word number 8: Bachelor
Imputed Bachelor
Word number 9: bachelor
Imputed bachelor
Word number 10: Degree
Imputed Degree
Word number 11: degree
Imputed degree
Word number 12: Undergraduate
Imputed Undergraduate
Word number 13: undergraduate
Imputed undergraduate
Word number 14: BA
Imputed BA
Word number 15: BS
Imputed BS
Word number 16: B.A
Imputed B.A
Word number 17: B.S
Imputed B.S
Word number 18: BEng
Word number 19: High School
Imputed High School
Word number 20: High school
Imputed High school
Word nu

We missed one keyword, "academic background", so we'll impute it manually.

In [206]:
# Requirements column
df5.loc[(df5['required_education'].isnull()) & (df5['requirements'].str.contains("academic background")), 
        'required_education'] = "Bachelor's Degree"
df5.loc[(df5['required_education'].isnull()) & (df5['requirements'].str.contains("Academic background")), 
        'required_education'] = "Bachelor's Degree"

# Description column
df5.loc[(df5['required_education'].isnull()) & (df5['description'].str.contains("academic background")), 
        'required_education'] = "Bachelor's Degree"
df5.loc[(df5['required_education'].isnull()) & (df5['description'].str.contains("Academic background")), 
        'required_education'] = "Bachelor's Degree"
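
The four statements above differ only in the column and in the capitalisation of the keyword. As a sketch, the same fill can be written as a loop with a case-insensitive match. The DataFrame named demo is invented for illustration, and case=False also matches spellings such as "ACADEMIC BACKGROUND", which the original statements do not.

```python
import numpy as np
import pandas as pd

# Invented toy data standing in for df5
demo = pd.DataFrame({
    "requirements": ["Strong academic background", None, "Team player", "Academic background needed"],
    "description": ["Join us", "Academic background in physics", "Great role", "Any"],
    "required_education": [np.nan, np.nan, np.nan, "Master's Degree"],
})

for col in ["requirements", "description"]:
    # Only rows that are still null and mention the keyword in this column
    mask = demo["required_education"].isnull() & demo[col].str.contains(
        "academic background", case=False, regex=False, na=False
    )
    demo.loc[mask, "required_education"] = "Bachelor's Degree"

print(demo["required_education"].tolist())
```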

Check the counts now.

In [207]:
df5['required_education'].isnull().value_counts()

False    12460
True      5185
Name: required_education, dtype: int64

In [208]:
df4['required_education'].isnull().value_counts().iloc[1] - df5['required_education'].isnull().value_counts().iloc[1]

2843
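
The subtraction above picks the null count with .iloc[1], which relies on True being the second row of value_counts(). A sketch of an order-independent alternative is to sum the boolean mask directly. The two small Series below are invented stand-ins for the column before and after imputation.

```python
import numpy as np
import pandas as pd

# Invented stand-ins for df4['required_education'] and df5['required_education']
before = pd.Series([np.nan, "Bachelor's Degree", np.nan, np.nan], dtype="object")
after = pd.Series(["Master's Degree", "Bachelor's Degree", np.nan, "Doctorate"], dtype="object")

nulls_before = before.isnull().sum()   # True counts as 1, False as 0
nulls_after = after.isnull().sum()
imputed = nulls_before - nulls_after
pct_null_after = after.isnull().mean() * 100

print(imputed, pct_null_after)
```

With the counts shown in this section, the same subtraction is 8028 - 5185 = 2843.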

In [209]:
df5['required_education'].isnull().value_counts() / len(df5) * 100

False    70.614905
True     29.385095
Name: required_education, dtype: float64

In [210]:
df5['required_education'].value_counts()

Bachelor's Degree                    7561
High School or equivalent            2249
Unspecified                          1375
Master's Degree                       442
Associate Degree                      306
Certification                         206
Some College Coursework Completed     100
Professional                           73
Doctorate                              59
Vocational                             47
Some High School Coursework            27
Vocational - HS Diploma                 9
Vocational - Degree                     6
Name: required_education, dtype: int64

We imputed 2843 entries, which is not bad, but 5185 rows (about 29%) are still null, so let's take a look at some of the remaining rows.

In [211]:
df5[df5['required_education'].isnull()].head(15)

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
0,Marketing Intern,"US, NY, New York",Marketing,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,f,t,f,Other,Internship,,,Marketing,f,f,US,NY,New York
1,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,f,t,f,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,f,f,NZ,,Auckland
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p>Our client, located in Houston, is actively seeking an experienced Commissioning Machinery As...",<ul>\r\n<li>Implement pre-commissioning and commissioning procedures for rotary equipment.</li>\...,,f,t,f,,Mid-Senior level,,,,f,f,US,IA,Wever
7,Lead Guest Service Specialist,"US, CA, San Francisco",,,<p>Airenvy’s mission is to provide lucrative yet hassle free full service short term property ma...,<h3>Who is Airenvy?</h3>\r\n<p>Hey there! We are seasoned entrepreneurs in the heart of San Fran...,"<ul>\r\n<li>Experience with CRM software, live chat, and phones, including one year minimum of c...",<p><b>Competitive Pay.</b> You'll be able to eat steak everyday if you choose to. </p>\r\n<p><b...,f,t,t,,Entry level,,,,f,f,US,CA,San Francisco
11,Talent Sourcer (6 months fixed-term contract),"GB, LND, London",HR,,<p><b>Want to build a 21st century financial service?</b></p>\r\n<p>We're convinced that that th...,<p>TransferWise is the clever new way to move money between countries. Co-founded by Skype’s fir...,<p><b>We’re looking for someone who:</b></p>\r\n<ul>\r\n<li>Proven track record in sourcing acro...,<p>You will join one of Europe’s most hotly tipped startups with plenty of opportunities to grow...,f,t,f,,Mid-Senior level,,,,f,f,GB,LND,London
16,Hands-On QA Leader,"IL, , Tel Aviv, Israel",R&D,,<p>At HoneyBook we’re re-imagining the events industry and building a product that is already ch...,"<p>We are looking for a Hands-On QA Leader for our talented R&amp;D team, located in the Center ...",<ul>\r\n<li>Previous experience in client &amp; server testing</li>\r\n<li>Experience in Leading...,,f,t,f,Full-time,Mid-Senior level,,Internet,Engineering,f,f,IL,,"Tel Aviv, Israel"
17,Southend-on-Sea Traineeships Under NAS 16-18 Year Olds Only,"GB, SOS, Southend-on-Sea",,,<p>Established on the principles that full time education is not for everyone Spectrum Learning ...,<p>Government funding is only available for 16-18 year olds.</p>\r\n<p>We have 10 vacancies for ...,<p>16-18 year olds only due to government funding.</p>\r\n<p>Career prospects</p>,<p>Career prospects.</p>,f,t,t,,Entry level,,,,f,f,GB,SOS,Southend
18,Visual Designer,"US, NY, New York",,,<p>Kettle is an independent digital agency based in New York City and the Bay Area. We’re commit...,"<p>Kettle is hiring a Visual Designer!</p>\r\n<p>Job Location: New York, NY</p>\r\n<p>Kettle is ...",,,f,t,f,,Mid-Senior level,,,,f,f,US,NY,New York
20,Marketing Assistant,"US, TX, Austin",,,<p>IntelliBright was created to leverage enterprise level online business practices to generate ...,<p>IntelliBright is growing fast and is looking for a <b>Marketing Assistant </b>to join our tea...,<p><b>Job Requirements</b></p>\r\n<ul>\r\n<li>Assist in creating client online marketing campaig...,,f,t,f,,Associate,,,Marketing,f,f,US,TX,Austin
24,Customer Service,"GB, LND, London",,,,<p>We are a canary wharf based e-commerce company and are recruiting for a full time customer se...,,,f,f,f,,Associate,,,,f,f,GB,LND,London


In [212]:
df5.loc[(df5['requirements'].isnull()) & (df5['required_education'].isnull()), ['title', 'description']]

Unnamed: 0,title,description
18,Visual Designer,"<p>Kettle is hiring a Visual Designer!</p>\r\n<p>Job Location: New York, NY</p>\r\n<p>Kettle is ..."
24,Customer Service,<p>We are a canary wharf based e-commerce company and are recruiting for a full time customer se...
88,Sales and Partnerships Intern,<p>At dopios we are rethinking the way we interact with unknown locations and our goal is to mak...
104,Shipping Clerk,<p>A Local company in Reading PA is looking for a Shipping clerk Mon- Fri Hours 7am-3:30 pm $14-...
108,Software Project Manager,"<p>Skookum Digital Works is looking for a motivated, self-starter to support and facilitate the ..."
...,...,...
17822,5 Guys,<p>Analyze the excel books of the franchise and then post them online for him to use.</p>
17827,Student Positions Part-Time and Full-Time.,"<p>Student Positions Part-Time and Full-Time.<br>You can do it all from home, in your free time,..."
17843,Interior Designer Position Available,"<p>Our client, a home-staging company, is in need of an Interior Designer to join their team ASA..."
17852,GWT Expert,<p>GWT Expert</p>\r\n<p>We are experiencing a rapid worldwide adoption of our flagship open sour...


After spot-checking the requirements and description columns, we see that most of the remaining rows really do not specify any education requirement. We therefore add one more value for required_education, "Empty requirements", for rows whose requirements column is null, and impute the rest with "Unspecified".

In [213]:
# Those with null requirements
df5.loc[(df5['requirements'].isnull()) & (df5['required_education'].isnull()), 'required_education'] = "Empty requirements"

# Those with valid requirements
df5.loc[(df5['requirements'].notnull()) & (df5['required_education'].isnull()), 'required_education'] = "Unspecified"
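
The two assignments above can also be read as a single conditional fill: choose a fallback label per row from whether requirements is null, then use it only where required_education is missing. A minimal sketch on an invented DataFrame (the names demo and fallback are illustrative only):

```python
import numpy as np
import pandas as pd

# Invented toy data standing in for df5
demo = pd.DataFrame({
    "requirements": [None, "<p>Team player</p>", "<p>BSc required</p>", None],
    "required_education": [np.nan, np.nan, "Bachelor's Degree", "Certification"],
})

# Fallback label per row, chosen by whether the requirements text is missing
fallback = pd.Series(
    np.where(demo["requirements"].isnull(), "Empty requirements", "Unspecified"),
    index=demo.index,
)

# fillna aligns on the index and only touches rows that are still null
demo["required_education"] = demo["required_education"].fillna(fallback)
print(demo["required_education"].tolist())
```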

In [214]:
df5['required_education'].isnull().sum()

0

In [215]:
df5['required_education'].value_counts()

Bachelor's Degree                    7561
Unspecified                          5197
High School or equivalent            2249
Empty requirements                   1363
Master's Degree                       442
Associate Degree                      306
Certification                         206
Some College Coursework Completed     100
Professional                           73
Doctorate                              59
Vocational                             47
Some High School Coursework            27
Vocational - HS Diploma                 9
Vocational - Degree                     6
Name: required_education, dtype: int64

#### iv. Function

We have 6378 null entries, fewer than the 8028 that required_education started with, which is a relief, but we still need to put in some effort to impute these entries properly.

In [216]:
df5['function'].isnull().value_counts()

False    11267
True      6378
Name: function, dtype: int64

In [217]:
len(df5['function'].value_counts())

37

In [218]:
df5['function'].value_counts()

Information Technology    1732
Sales                     1455
Engineering               1343
Customer Service          1185
Marketing                  819
Administrative             614
Design                     337
Health Care Provider       327
Education                  325
Other                      325
Management                 308
Business Development       226
Accounting/Auditing        210
Human Resources            203
Project Management         183
Finance                    165
Consulting                 140
Art/Creative               131
Writing/Editing            131
Production                 115
Product Management         113
Quality Assurance          111
Advertising                 90
Business Analyst            83
Data Analyst                82
Public Relations            76
Manufacturing               73
General Business            68
Research                    50
Strategy/Planning           46
Legal                       44
Training                    37
Supply C

Also, let's take a look at industries.

In [219]:
df5['industry'].value_counts().head(25)

Information Technology and Services    1712
Computer Software                      1368
Internet                               1057
Marketing and Advertising               821
Education Management                    819
Financial Services                      753
Hospital & Health Care                  486
Consumer Services                       348
Telecommunications                      329
Oil & Energy                            286
Retail                                  223
Real Estate                             167
Accounting                              159
Construction                            153
E-Learning                              138
Management Consulting                   128
Design                                  127
Staffing and Recruiting                 127
Health, Wellness and Fitness            123
Insurance                               121
Automotive                              117
Logistics and Supply Chain              110
Human Resources                 

In [220]:
len(df5[(df5['function'].isnull()) & (df5['department'].isnull())])

4831

In [221]:
len(df5[(df5['function'].isnull()) & (df5['department'].isnull()) & (df5['industry'].notnull())])

2099

In [222]:
len(df5[(df5['function'].isnull()) & (df5['department'].notnull())])

1547
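
The three counts above split the 6378 rows with a null function by which other columns are available: 4831 have no department (2099 of those still have an industry) and 1547 have a department, and 4831 + 1547 = 6378. As a sketch, the same kind of breakdown can be produced in one table with pd.crosstab. The DataFrame below is invented for illustration, and the names demo, missing_function and pattern are not part of the notebook.

```python
import pandas as pd

# Invented toy data standing in for df5
demo = pd.DataFrame({
    "function": [None, None, None, "Sales", None, None],
    "department": ["HR", None, None, "Sales", "SEO", None],
    "industry": [None, "Internet", None, "Retail", "Internet", "Retail"],
})

# Keep only rows where function still needs to be imputed
missing_function = demo[demo["function"].isnull()]

# Cross-tabulate which helper columns are available for those rows
labels = {True: "yes", False: "no"}
pattern = pd.crosstab(
    missing_function["department"].notnull().map(labels).rename("has_department"),
    missing_function["industry"].notnull().map(labels).rename("has_industry"),
)
print(pattern)
```

Each cell is the number of rows with that combination, so the table shows at a glance how many rows can be imputed from department, from industry, from both, or from neither.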

In [223]:
df5[(df5['function'].isnull()) & (df5['department'].notnull())].iloc[:, :16]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function
11,Talent Sourcer (6 months fixed-term contract),"GB, LND, London",HR,,<p><b>Want to build a 21st century financial service?</b></p>\r\n<p>We're convinced that that th...,<p>TransferWise is the clever new way to move money between countries. Co-founded by Skype’s fir...,<p><b>We’re looking for someone who:</b></p>\r\n<ul>\r\n<li>Proven track record in sourcing acro...,<p>You will join one of Europe’s most hotly tipped startups with plenty of opportunities to grow...,f,t,f,,Mid-Senior level,Unspecified,,
34,I Want To Work At Karmarama,"GB, LND,",All,,"<p>At Karmarama we have a unique hiring policy: nice, talented and decent people who genuinely w...",<p>Didn't see a role for you? Don't fret. We’re always looking for talented people to join our t...,<p>Hey!</p>\r\n<p>Thanks again for applying to Karmarama and showing your interest in our compan...,,f,t,t,Full-time,Not Applicable,Unspecified,Marketing and Advertising,
57,Intensive Case Management Worker (Bilingual Essential),"CA, ON, Ottawa",ICM,,<p><b>Since 1973: Working together to make our community healthy</b></p>\r\n<p>Good health means...,<h3><b>Internal/External Employment Opportunity</b></h3>\r\n<p><i><b>Position Title: </b>Intensi...,"<p><b>Education and Language</b></p>\r\n<ul>\r\n<li>A bachelor's degree in counseling, psycholog...",<p>Sandy Hill Community Health Centre offers employees an excellent benefits package which inclu...,f,t,t,,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,
66,AS3 / Flash Developer,"GR, I, Athens",Engineering,,"<p><b>reEmbedit provides a branded video player for embedded videos (YouTube, etc) together with...",<p>We are looking for a candiate that is capable of developing a Flash/AS3 based video player. &...,<p>AS3 / FLASH</p>\r\n<p>Video player development</p>\r\n<p></p>,<p>Competitive salary</p>\r\n<p>Stock options</p>\r\n<p>Unlimited coffee and snacks!</p>,f,t,t,,Mid-Senior level,Unspecified,,
88,Sales and Partnerships Intern,"GR, I, Athens",Business Development,,"<p><b>Hi, we are dopios</b></p>\r\n<p><i>“We are here to make any location <b>accessible and ope...",<p>At dopios we are rethinking the way we interact with unknown locations and our goal is to mak...,,,f,t,t,,Internship,Empty requirements,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17841,Software Engineer - System Integration,"US, FL, Tampa",Digital Pathology,,<p>Innovative technology for digital pathology and cancer diagnostics</p>,<p><b>Key Responsibilities: </b></p>\r\n<ul>\r\n<li>Develop Simagis API</li>\r\n<li>Develop Inte...,<p><b>Programming Skils</b></p>\r\n<p>Java SE - 5</p>\r\n<p>Apache Tomcat - 5</p>\r\n<p>HL7 Prot...,,f,t,f,,Mid-Senior level,Bachelor's Degree,,
17848,SEO CONTENT WRITER,"GB, LND, london",SEO,,,<p>A Blogger or Journalist is required for delivering content and supporting the published conte...,<p>Key responsibilities within this role:</p>\r\n<ul>\r\n<li>Supporting the Website Development ...,<p>We offer: <br /><br />• Excellent training and development opportunities <br />• Excellent P...,f,f,t,,Mid-Senior level,Bachelor's Degree,,
17853,Call Center/Customer Service,"US, NJ, Newark",Customer Service,,<p>At Command we care enough to consistently place the right candidates in the right jobs. We ha...,<p>At Command we care enough to consistently place the right candidates in the right jobs. We ha...,<p><b>Responsibilities:</b></p>\r\n<ul>\r\n<li>Determines requirements by working with customers...,<p><b>Benefits:</b></p>\r\n<ul>\r\n<li>15/hr (non-negotiable)</li>\r\n<li>Medical/dental coverag...,f,t,t,,Entry level,High School or equivalent,,
17863,Implementation Support Specialist,"US, IA, Dubuque",Services,,"<p>We design, build, sell, and service the most innovative operations management technology in t...",<p><i>WANTED</i>: an Implementation Support Specialist with personality to share and technical e...,"<p><br><b>Who you are…</b></p>\r\n<p>• You know the ins-and-outs of Microsoft Windows, SQL Serve...",,f,t,f,Full-time,Associate,Bachelor's Degree,Computer Software,


In [224]:
df5[(df5['function'].isnull()) & (df5['department'].isnull()) & (df5['industry'].notnull())].iloc[:, :16]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function
8,HP BSM SME,"US, FL, Pensacola",,,<p>Solutions3 is a <b>woman-owned small business </b>whose focus is IT Service Management using ...,<p></p>\r\n<p></p>\r\n<p>Implementation/Configuration/Testing/Training on:</p>\r\n<p>HP Service ...,<p><b>MUST BE A US CITIZEN.</b></p>\r\n<p><b>An active TS/SCI clearance will be required.</b></p...,,f,t,t,Full-time,Associate,Bachelor's Degree,Information Technology and Services,
28,Talent Management Process Manager,"US, MO, St. Louis",,,<p>We Provide Full Time Permanent Positions for many medium to large US companies. We are intere...,<p><b><i>(We have more than 1500+ Job openings in our website and some of them are relevant to t...,,,f,f,f,Full-time,Mid-Senior level,Bachelor's Degree,Management Consulting,
35,English Teacher Abroad,"US, NY, Saint Bonaventure",,,<p>We help teachers get safe &amp; secure jobs abroad :)</p>,"<p>Play with kids, get paid for it </p>\r\n<p>Love travel? Jobs in Asia</p>\r\n<p>$1,500+ USD mo...",<p>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not nec...,<p>See job description</p>,f,t,t,Contract,Entry level,Bachelor's Degree,Education Management,
36,Graduates: English Teacher Abroad,"US, NY, Yonkers",,,<p>We help teachers get safe &amp; secure jobs abroad :)</p>,"<p><img src=""#URL_ec9a1dff9db12b7f5987cf4cae6df01a39cd3ed5bad7cdf0448958cf97610268#""></p>\r\n<p>...",<p>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not nec...,<p>See job description</p>,f,t,f,Contract,Entry level,Bachelor's Degree,Education Management,
40,English Teacher Abroad,"US, PA, Kutztown",,,<p>We help teachers get safe &amp; secure jobs abroad :)</p>,"<p>Play with kids, get paid for it </p>\r\n<p>Love travel? Jobs in Asia</p>\r\n<p>$1,500+ USD mo...",<p>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not nec...,<p>See job description</p>,f,t,t,Contract,Entry level,Bachelor's Degree,Education Management,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17817,CSR,"US, LA, Slidell",,,,<p>Now hiring CSR / Advertising representatives to work from home</p>\r\n<p>Must Have:</p>\r\n<p...,,,f,f,f,Part-time,Entry level,Empty requirements,Online Media,
17852,GWT Expert,"FI, , Turku",,,,<p>GWT Expert</p>\r\n<p>We are experiencing a rapid worldwide adoption of our flagship open sour...,,,f,t,f,Full-time,Mid-Senior level,Empty requirements,Information Technology and Services,
17869,Sr Technical Lead LIMS,"US, DE, Wilmington",,,,<p><b>Job Title: Sr Technical Lead</b></p>\r\n<p><b>Salary: Open</b></p>\r\n<p><b>Duration: Ful...,<p>Responsibilities:</p>\r\n<p> </p>\r\n<ul>\r\n<li>He should be extensive knowledge of Sample M...,,f,f,f,Full-time,Mid-Senior level,Unspecified,Pharmaceuticals,
17871,Water Truck Driver,"US, PA, Waynesburg",,,<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,<ul>\r\n<li>Requires skilled work in operating commercial trucks to load and unload fluids from ...,<ul>\r\n<li>GED or diploma required.</li>\r\n<li>Requires minimum of one year experience with ta...,,f,t,t,Full-time,Entry level,Certification,Oil & Energy,


In [225]:
df5[(df5['function'].isnull()) & (df5['department'].notnull())]['department'].value_counts().head(50)

IT                         65
Sales                      63
Operations                 61
Product                    46
Development                44
Marketing                  38
tech                       32
Engineering                23
Clerical                   20
HR                         18
Legal                      18
Information Technology     17
R&D                        17
Creative                   16
All                        15
Product Development        14
Information Technology     13
Performance Marketing      13
CS                         12
Editorial                  11
International Growth       10
Customer Service           10
Maintenance                10
Merchandising              10
Content                    10
Technology                  9
Reservations                9
EC                          9
Retail                      8
Permanent                   8
Tech                        8
African Program             8
Administrative              7
Services  

In [226]:
df5[(df5['function'].isnull())]['title'].value_counts()

English Teacher Abroad                               309
English Teacher Abroad                                94
Graduates: English Teacher Abroad                     57
Beauty & Fragrance consultants needed                 55
Software Engineer                                     40
                                                    ... 
Graphic Artist, East Asia Pacific Division Office      1
Sous Chef Needed!                                      1
Sr. Oracle PL/SQL Developer                            1
Senior SEM Manager Spanish                             1
   Environmental Technician I                          1
Name: title, Length: 4281, dtype: int64
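
The counts above list `English Teacher Abroad` twice (309 and 94), and `Environmental Technician I` appears with leading spaces. This suggests that some titles differ only in surrounding whitespace. The sketch below normalises whitespace before counting, assuming that is indeed the cause. The `titles` Series is hypothetical toy data, not taken from `df5`.

```python
import pandas as pd

# Hypothetical toy titles; the real values live in df5['title'].
titles = pd.Series([
    "English Teacher Abroad",
    "English Teacher Abroad ",        # trailing space
    "   Environmental Technician I",  # leading spaces
    "Environmental Technician I",
    "Home  Based Typist",             # doubled inner space
])

# Trim both ends, then collapse runs of whitespace into a single space.
clean_titles = titles.str.strip().str.replace(r"\s+", " ", regex=True)

raw_counts = titles.value_counts()
clean_counts = clean_titles.value_counts()
print(clean_counts)
```

After cleaning, titles that looked identical on screen fall into the same bucket, so the counts reflect distinct job titles rather than distinct strings.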

In [227]:
df5[(df5['function'].isnull())]['title'].value_counts().head(40)

English Teacher Abroad                                                  309
English Teacher Abroad                                                   94
Graduates: English Teacher Abroad                                        57
Beauty & Fragrance consultants needed                                    55
Software Engineer                                                        40
Project Manager                                                          24
Account Manager                                                          23
Web Designer                                                             22
English Teacher Overseas                                                 21
Cruise Staff Wanted *URGENT*                                             20
Web Developer                                                            20
Sales Manager                                                            20
Home Based Payroll Typist/Data Entry Clerks Positions Available          20
Application 

In [228]:
# Count titles once among rows with a missing function, then keep counts between 5 and 12
null_title_counts = df5[df5['function'].isnull()]['title'].value_counts()
null_title_counts[(null_title_counts < 13) & (null_title_counts > 4)].head(55)

Regional Inside Sales Representative                                    12
Manufacturing Engineer                                                  12
Home Based Payroll Data Entry Clerk Position - Earn $100-$200 Daily     12
Escrow Officer / Title Closer                                           12
Buyer                                                                   12
CNC Programmer                                                          11
IT Security Analyst                                                     11
Maintenance Technician                                                  11
Local Representative                                                    11
Quality Assurance Manager                                               11
New Product Development Project Leader                                  11
Manufacturing Engineering - Lean Manufacture                            11
Auditor                                                                 11
Office Manager           

In [229]:
df5[(df5['title'].str.contains('Teacher')) & (df5['function'].isnull())]['title'].value_counts()

English Teacher Abroad                                     309
English Teacher Abroad                                      94
Graduates: English Teacher Abroad                           57
English Teacher Overseas                                    21
English Teacher Abroad (Conversational)                      6
English Teacher Overseas (Conversational)                    4
Graduates: English Teacher Overseas                          3
EFL English Language Teachers for Saudi Arabia Tax Free      1
Elementary Teacher                                           1
Teacher's Assistant                                          1
Secondary Special Education Teacher                          1
Middle School Special Education Teacher                      1
Middle School Building Substitute Teacher                    1
High School Science Teacher                                  1
Pre-School Teacher                                           1
Elementary School Building Substitute Teacher          

In [230]:
df5[(df5['title'].str.contains('Teacher')) & (df5['function'].isnull())]['title'].value_counts().sum()

506

In [231]:
df5[(df5['title'].str.contains('Developer')) & (df5['function'].isnull())]['title'].value_counts()

.NET Developer                             20
Web Developer                              20
Application Developer                      20
iOS Developer                              17
Android Developer                          14
                                           ..
Full Stack Web Developer - Node.js          1
(Javascript) Web Application Developer      1
Platform Developer                          1
Frontend Web Developer                      1
Web Designer & Front End Developer          1
Name: title, Length: 382, dtype: int64

In [232]:
df5[(df5['title'].str.contains('Developer')) & (df5['function'].isnull())]['title'].value_counts().sum()

581

In [233]:
df5[(df5['title'].str.contains('Engineer')) & (df5['function'].isnull())]['title'].value_counts()

Software Engineer                                         40
Controls Engineer                                         18
Mechanical Engineer                                       14
Manufacturing Engineering Manager                         13
Process Engineer                                          13
                                                          ..
Passionate Frontend Web Engineer                           1
Ninja Web Engineer                                         1
VP Engineering                                             1
Oracle Systems Engineer with HPC exp and Coherence exp     1
Software Engineer - Java/EDI                               1
Name: title, Length: 376, dtype: int64

In [234]:
df5[(df5['title'].str.contains('Software Engineer')) & (df5['function'].notnull())]['function'].value_counts()

Engineering               188
Information Technology     86
Product Management          2
Design                      1
Research                    1
General Business            1
Accounting/Auditing         1
Other                       1
Customer Service            1
Strategy/Planning           1
Name: function, dtype: int64
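
The cells above repeat one pattern: filter titles by a keyword, then inspect either the titles whose `function` is missing or the `function` values of the labelled rows. A small helper could wrap that pattern. This is only a sketch under assumptions: `keyword_summary` is a hypothetical name, the toy `demo` frame stands in for `df5`, and matching is literal (`regex=False`) rather than the regex matching that `str.contains` uses by default.

```python
import pandas as pd

def keyword_summary(df, keyword):
    """Return (number of matching rows with a missing function,
    function counts among matching rows that have one)."""
    # Literal substring match; na=False treats missing titles as non-matches.
    hit = df["title"].str.contains(keyword, regex=False, na=False)
    n_missing = int((hit & df["function"].isnull()).sum())
    known = df.loc[hit & df["function"].notnull(), "function"].value_counts()
    return n_missing, known

# Hypothetical toy data standing in for df5.
demo = pd.DataFrame({
    "title": ["Software Engineer", "Senior Software Engineer",
              "Software Engineer", "Sales Manager", None],
    "function": ["Engineering", "Engineering",
                 None, "Sales", None],
})

n_missing, known = keyword_summary(demo, "Software Engineer")
print(n_missing)
print(known)
```

Calling `keyword_summary(df5, 'Software Engineer')` would then give both numbers in one step instead of two separate cells.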

In [235]:
df5[df5['title'].str.contains("Cruise")]['description'].iloc[0]

"<p><b>6* Ultra Luxury American Cruise Company is urgently looking for the following positions:</b><br><b>*Hospitality</b>\xa0- For the many Bars &amp; Restaurants on board.<br><b>*Retail</b>\xa0- For the Duty FREE Shops &amp; Boutiques on board.<br><b>*housekeeping</b>\xa0- For the Housekeeping &amp; Cleaning jobs.<br><b>*Office Admin</b>\xa0- For the Front desk &amp; Tour booking jobs<br><b>*Other Positions</b>\xa0- DJ's, Security Staff, Photographers &amp; Nannies.</p>\r\n<p><b>Vessel type or operation:</b>\xa06* Ultra Luxury Cruise.<br><b>Certification &amp; Experience:</b>\xa0Previous experience (not Required)<br>Good English speaker, Some Customer Service Skills, wanting to learn &amp; work.<br><b>Job Type:</b>\xa0Perm.<br><b>Sailing Area:</b>\xa0World wide.<br><b>Benefits:</b>\xa0On board en suite accommodation and food, Medical cover for duration of contract,\xa0<br>world work visa, free wifi,\xa0<b>TAX FREE Salary &amp; more!</b></p>\r\n<p><b>Job Description:</b><br>A 6* Ultra

In [236]:
df5[(df5['title'].str.contains('Engineer')) & (df5['function'].isnull())]['title'].value_counts().sum()

596

In [237]:
df5[(df5['title'].str.contains('Sales')) & (df5['function'].isnull())]['title'].value_counts()

Sales Manager                                                         20
Regional Sales Manager                                                14
Regional Inside Sales Representative                                  12
Regional Field Sales Representative                                   10
Sales Lead Generator                                                   6
                                                                      ..
Sales Assistant - SAACHI                                               1
Residential Escrow Officers & Title Insurance Sales - Houston Area     1
Senior Director of Advertising Sales                                   1
Marketing and Sales Representative- Full Time Position                 1
Outside Sales Professional-Maple Plain                                 1
Name: title, Length: 182, dtype: int64

In [238]:
df5[(df5['title'].str.contains('Sales')) & (df5['function'].isnull())]['title'].value_counts().sum()

270

In [239]:
df5[(df5['title'].str.contains('Financ')) & (~df5['title'].str.contains('Writer', na=False)) & \
    (~df5['title'].str.contains('Attorney', na=False)) & (df5['function'].isnull()) & \
    (~df5['title'].str.contains('Editor', na=False))]['title'].value_counts()

Manager of Finance                                                                 13
Financial Advisor                                                                   3
Director of Finance                                                                 2
Financial Officer                                                                   2
Partnership Manager - High Growth Specialty Finance Company                         2
Internal Audit & Financial Advisory Senior                                          2
Financial Analyst                                                                   2
Love Financial Administration?                                                      2
Finance/Accountancy Recruitment Consultant                                          1
Principal Consultant - Oracle Hyperion Financial Management                         1
Financial Controller                                                                1
Junior Finance Position                               

In [240]:
df5[(df5['title'].str.contains('Financ')) & (~df5['title'].str.contains('Writer', na=False)) & \
    (~df5['title'].str.contains('Attorney', na=False)) & (df5['function'].isnull()) & \
    (~df5['title'].str.contains('Editor', na=False))]['title'].value_counts().sum()

73

In [241]:
df5[(df5['title'].str.contains('Financ')) & (~df5['title'].str.contains('Writer', na=False)) & \
    (~df5['title'].str.contains('Attorney', na=False)) & (df5['function'].notnull()) & \
    (~df5['title'].str.contains('Editor', na=False))]['function'].value_counts()

Finance                   55
Accounting/Auditing       14
Sales                     13
Financial Analyst         13
Information Technology     5
Administrative             3
Business Analyst           3
Business Development       3
Education                  2
Legal                      2
Management                 1
Engineering                1
Consulting                 1
Writing/Editing            1
Health Care Provider       1
Name: function, dtype: int64

In [242]:
df5[(df5['title'].str.contains('Programmer')) & (df5['function'].isnull())]['title'].value_counts()

CNC Programmer                                             11
Peoplesoft HCM Lead - Programmer/Analyst                    1
RoR Developer/Programmer                                    1
Ruby on Rails Programmer                                    1
Information Security Programmer - nDiscovery Team           1
PHP Programmer                                              1
Strong Infrastructure (web-service and API) Programmer      1
Linux Systems Programmer / Python Developer at Flipnode     1
LAMP Programmer for Event Marketing Agency                  1
Software Engineer - Java/EDI Programmer                     1
 CNC Programmer                                             1
Java/EDI Programmer                                         1
Senior Programmer / Developer - L3                          1
Contract Gameplay Programmer                                1
Controls - PLC Programmer                                   1
Programmer/ Analyst                                         1
Programm

In [243]:
df5[(df5['title'].str.contains('Programmer')) & (df5['function'].isnull())]['title'].value_counts().sum()

33

In [244]:
df5[(df5['title'].str.contains('Accountant')) & (df5['function'].isnull())]['title'].value_counts()

Accountant                                             3
Senior Accountant                                      3
Accountant Tax Advisory                                2
Accountant                                             1
Tax Accountant                                         1
Cost Accountant                                        1
Project Accountant Asst                                1
Assistant Accountant/immediate start                   1
Accountant - Leading Private Equity Firm               1
Staff Accountant, AP                                   1
(Assistant) Accountant                                 1
Cost Accountant - QAD ERP WCM PPE EMS - Madison, WI    1
Group Accountant (Head of Accounting)                  1
Client Service Associate / Accountant                  1
Name: title, dtype: int64

In [245]:
df5[(df5['title'].str.contains('Accountant')) & (df5['function'].isnull())]['title'].value_counts().sum()

19

In [246]:
df5[(df5['title'].str.contains('Accountant')) & (df5['function'].notnull())]['function'].value_counts()

Accounting/Auditing       52
Finance                   18
Administrative             2
Financial Analyst          1
Management                 1
Engineering                1
Information Technology     1
Name: function, dtype: int64

In [247]:
df5[(df5['title'].str.contains('HR')) & \
    (df5['function'].notnull())]['function'].value_counts()

Human Resources           44
Customer Service           7
Information Technology     4
Research                   1
Management                 1
Supply Chain               1
Advertising                1
Finance                    1
Accounting/Auditing        1
Name: function, dtype: int64

In [248]:
df5[(df5['title'].str.contains('Resources')) & \
    (df5['function'].notnull())]['function'].value_counts()

Human Resources    35
Education           1
Manufacturing       1
Management          1
Science             1
Administrative      1
Name: function, dtype: int64

In [249]:
df5[((df5['title'].str.contains('Human Resources')) | (df5['title'].str.contains('HR'))) & \
    (df5['function'].isnull())]['title'].value_counts()

Human Resources Manager                                            10
HR Assistant                                                        3
HR Specialist                                                       2
Human Resources Recruiter (45K-60K)                                 2
HR Manager                                                          2
Human Resources Specialist                                          2
Human Resources and Safety Manager                                  1
Product Manager, HRIS & Analytics                                   1
Junior HR Marketing Manager                                         1
HR expert                                                           1
Recruiter/HR                                                        1
Human Resources Generalist                                          1
HR Recruiter--Grand Junction, CO                                    1
HR Research Analyst / Writer - Summer Contract (3-4 Month)          1
Senior HR Manager or

In [250]:
df5[((df5['title'].str.contains('Human Resources')) | (df5['title'].str.contains('HR'))) & \
    (df5['function'].isnull())]['title'].value_counts().sum()

44
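
The plain substring `'HR'` also matches titles such as `Product Manager, HRIS & Analytics` in the list above. If only the standalone abbreviation is wanted, a word-boundary regex is one option. A minimal sketch on hypothetical toy titles (not rows from `df5`):

```python
import pandas as pd

# Hypothetical toy titles for illustration.
titles = pd.Series([
    "HR Manager",
    "Recruiter/HR",
    "Product Manager, HRIS & Analytics",
    "Human Resources Generalist",
    "CHRO Assistant",
])

# Substring match: any title containing the letters "HR".
loose = titles.str.contains("HR", regex=False, na=False)

# \b marks a word boundary, so "HRIS" and "CHRO" no longer match.
strict = titles.str.contains(r"\bHR\b|Human Resources", regex=True, na=False)

print(titles[loose].tolist())
print(titles[strict].tolist())
```

Whether the stricter match is preferable depends on whether roles like HRIS should count as human-resources postings.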

In [251]:
df5[(df5['title'].str.contains('Data Scien')) & (df5['function'].isnull())]['title'].value_counts()

Data Scientist                                  11
Data Scientist (Big Data, Machine Learning)      1
Junior Data Scientist                            1
Data Scientist, UK                               1
Data Scientist (Part-Time)                       1
Data Scientist - Full-time or Consultant         1
Data Scientist (Recommendations)                 1
Pricing Analyst/Data Scientist - OptionsAway     1
Data Scientist                                   1
Client Facing Data Scientist                     1
Data Scientist Manager                           1
Name: title, dtype: int64

In [252]:
df5[(df5['title'].str.contains('Analytics')) & (df5['function'].isnull())]['title'].value_counts()

Data Analytics Engineer                                                                  1
SEO, Adwords & Analytics Specialist                                                      1
BI / Analytics Lead                                                                      1
Senior Data Analytics Engineer                                                           1
Advanced Analytics Architect                                                             1
Insights & Analytics Intern                                                              1
J.P. Morgan - CIB Tehcnology - Credit Analytics Risk and Pricing Developer- Associate    1
AVP / VP Solutions - SAP BI & Analytics                                                  1
Data & Analytics Intern                                                                  1
Web Analytics Specialist                                                                 1
Pfizer - Sr Director, Global Health & Value - Pricing Analytics Lead                     1

In [253]:
df5[(df5['title'].str.contains('Data Scien')) & (df5['function'].notnull())]['function'].value_counts()

Data Analyst              11
Engineering               11
Information Technology     9
Research                   2
Science                    2
Product Management         1
Name: function, dtype: int64

In [254]:
df5[(df5['title'].str.contains('Analytics')) & (df5['function'].notnull())]['function'].value_counts()

Data Analyst              10
Marketing                  7
Business Analyst           5
Information Technology     5
Consulting                 2
Advertising                2
Strategy/Planning          2
Engineering                2
Science                    1
Customer Service           1
Name: function, dtype: int64

In [255]:
df5[(df5['title'].str.contains('Account Executive')) & (df5['function'].isnull())]['title'].value_counts()

Account Executive                                7
Title Account Executive                          6
Global Account Executive                         1
Junior Account Executive                         1
 Jr. Account Executive                           1
Account Executive / Sales Rep                    1
Client Account Executive                         1
Start-Up PR Account Executive                    1
B2B Inside Sales Account Executive               1
Account Executive                                1
Junior PR Account Executive                      1
Corporate Account Executive                      1
Major Account Executive- Philadelphia, PA.       1
Sales & Account Executive                        1
Digital Account Executive                        1
Name: title, dtype: int64

In [256]:
df5[(df5['title'].str.contains('Analyst')) & (df5['function'].isnull())]['title'].value_counts()

Business Analyst                                15
IT Security Analyst                             11
Data Analyst                                     5
Analyst                                          3
Marketing Analyst                                3
                                                ..
Sr. Security Test Analyst                        1
Alliance Data - Pricing & Profit Analyst Job     1
Finance and Accounting Analyst                   1
Trainee PPC Analyst                              1
Cleaner Recruitment Analyst/Associate            1
Name: title, Length: 114, dtype: int64

In [257]:
df5[(df5['title'].str.contains('Analyst')) & (df5['function'].isnull())]['title'].value_counts().head(12)

Business Analyst                 15
IT Security Analyst              11
Data Analyst                      5
Analyst                           3
Marketing Analyst                 3
Compliance Analyst                3
Application Support Analyst       3
Product Analyst                   3
Business Intelligence Analyst     3
Financial Analyst                 2
PPC Analyst                       2
Growth Analyst                    2
Name: title, dtype: int64

In [258]:
df5[(df5['title'].str.contains('Data Analyst')) & (df5['function'].isnull())]['title'].value_counts()

Data Analyst                            5
Business and Data Analyst               1
Data Analyst - SAS                      1
Information Systems Data Analyst        1
Growth Hacker / Data Analyst Manager    1
Quantitative Data Analyst               1
Junior Web Developer/ Data Analyst      1
Data Analyst (Career Development)       1
Name: title, dtype: int64

In [259]:
df5[(df5['title'].str.contains('Business Analyst')) & (df5['function'].isnull())]['title'].value_counts()

Business Analyst                               15
Technical Business Analyst                      1
Business Analyst/Quality Assurance Analyst      1
Business Analyst - Product                      1
Business Analyst/QA intern                      1
Informatica MDM- Business Analyst               1
Business Analyst - Decision Sciences            1
SAP Business Analyst                            1
Business Analyst Intern                         1
Entry Level Business Analyst                    1
Business Analyst with FX (Foreign Exchange)     1
Junior Business Analyst                         1
Project Manager / Business Analyst              1
Lead Business Analyst                           1
E Commerce Business Analyst                     1
Name: title, dtype: int64

In [260]:
df5[(df5['title'].str.contains('Business Intelligence')) & (df5['function'].isnull())]['title'].value_counts()

Business Intelligence Analyst                                  3
Business Intelligence Developer                                2
BUSINESS-ANALYST - DWH-Business Intelligence - (8-10 YEARS)    2
Head of Business Intelligence                                  2
Lead Business Intelligence                                     1
Data Warehouse Manager / Business Intelligence Architect       1
Senior Business Intelligence Engineer                          1
Sr.Business Intelligence Technical Architect                   1
SAP Business Intelligence - .NET UNIX SQL SCI - Washington     1
Name: title, dtype: int64

In [261]:
df5[(df5['title'].str.contains('IT Security')) & (df5['function'].isnull())]['title'].value_counts()

IT Security Analyst                                   11
IT Security                                            4
IT Security Presales Engineer                          1
IT Security Audit Candidate                            1
Urgent Need : IT Security Professional for Bahrain     1
IT Security Admin Analyst II                           1
Name: title, dtype: int64

In [262]:
df5[(df5['title'].str.contains('Account Executive')) & (df5['function'].notnull())]['function'].value_counts()

Sales                   118
Advertising              14
Public Relations         12
General Business          5
Business Development      5
Marketing                 2
Accounting/Auditing       2
Customer Service          2
Consulting                1
Name: function, dtype: int64

In [263]:
df5[(df5['title'].str.contains('Cruise')) & (df5['function'].notnull())]['function'].value_counts()

Series([], Name: function, dtype: int64)

In [264]:
df5[(df5['title'].str.contains('Supervisor')) & (df5['function'].notnull())]['function'].value_counts()

Management              26
Manufacturing           10
Customer Service        10
Health Care Provider     4
Marketing                4
Advertising              3
Accounting/Auditing      3
Public Relations         2
Strategy/Planning        2
Business Analyst         2
Engineering              2
Training                 2
Other                    1
Administrative           1
Human Resources          1
Finance                  1
Project Management       1
Design                   1
Production               1
Name: function, dtype: int64

In [265]:
df5[(df5['title'].str.contains('Analyst')) & (df5['function'].notnull())]['function'].value_counts()

Information Technology    88
Business Analyst          47
Data Analyst              39
Marketing                 17
Finance                   16
Financial Analyst         14
Engineering               14
Quality Assurance         10
Research                   6
Customer Service           6
Accounting/Auditing        6
Product Management         5
Sales                      5
Administrative             5
Other                      4
Supply Chain               4
Consulting                 4
General Business           3
Human Resources            3
Business Development       2
Project Management         2
Management                 1
Purchasing                 1
Legal                      1
Name: function, dtype: int64

In [266]:
df5[(df5['title'].str.contains('CAD Designer')) & (df5['function'].notnull())]['function'].value_counts()

Engineering    1
Name: function, dtype: int64

In [267]:
df5[(df5['title'].str.contains('Data Entry')) & (df5['function'].isnull())]['title'].value_counts()

Home Based Payroll Typist/Data Entry Clerks Positions Available                                                    20
Data Entry Admin/Clerical Positions - Work From Home                                                               18
Home Based Payroll Data Entry Clerk Position - Earn $100-$200 Daily                                                12
Data Entry                                                                                                          7
Data Entry Clerk                                                                                                    2
Data Entry/Administrative Assistant                                                                                 1
 Medical Intake Representative (Data Entry)                                                                         1
Customer service/ Data Entry                                                                                        1
Home  Based Typist/ Data Entry Clerk                    

In [268]:
df5[(df5['title'].str.contains('Account Executive')) & (df5['function'].isnull())]['title'].value_counts().sum()

26

In [269]:
df5[(df5['title'].str.contains('Data Entry')) & (df5['function'].isnull())]['title'].value_counts().sum()

68

In [270]:
df5[(df5['title'].str.contains('Data Entry')) & (df5['function'].notnull())]['function'].value_counts()

Administrative            28
Customer Service           9
Data Analyst               5
Consulting                 1
Accounting/Auditing        1
Marketing                  1
Human Resources            1
Information Technology     1
Production                 1
Name: function, dtype: int64

In [271]:
df5[(df5['title'].str.contains('Business Intelligence')) & (df5['function'].notnull())]['function'].value_counts()

Information Technology    6
Data Analyst              3
Engineering               2
Marketing                 1
Business Analyst          1
Science                   1
Name: function, dtype: int64

In [272]:
df5[(df5['title'].str.contains('Technician')) & (df5['function'].isnull())]['title'].value_counts().head(12)

Electrical Maintenance Technician                   16
Maintenance Technician                              11
Electrical Maintenance Technician - Major States    10
GIS Technician                                       4
Field Technician                                     3
Wastewater Technician                                3
Minnesota Part time Maintenance Technician           2
Warehouse Technician                                 2
HVAC Technician                                      2
Field Technician                                     2
Appliance Repair Technician                          2
Engineering Technician                               1
Name: title, dtype: int64

In [273]:
df5[(df5['title'].str.contains('Maintenance')) & (df5['function'].notnull())]['function'].value_counts()

Manufacturing           13
Engineering              9
Other                    8
Management               2
Health Care Provider     1
Distribution             1
Name: function, dtype: int64

In [274]:
df5[(df5['title'].str.contains('Writer')) & (df5['function'].isnull())]['title'].value_counts().sum()

32

In [275]:
df5[(df5['title'].str.contains('Writer')) & (df5['function'].notnull())]['function'].value_counts()

Writing/Editing           39
Marketing                  9
Information Technology     3
Sales                      2
Human Resources            1
Product Management         1
Customer Service           1
Training                   1
Engineering                1
Name: function, dtype: int64

In [276]:
df5[(df5['title'].str.contains('Developer')) & (df5['function'].isnull())]['title'].value_counts().idxmax()

'.NET Developer'

In [277]:
'  fef  '.strip()

'fef'

In [278]:
for i in df5[(df5['title'].str.contains('Developer')) & (df5['function'].isnull())]['title'].value_counts().index:
    print(i)

.NET Developer
Web Developer
Application Developer
iOS Developer
Android Developer
Java Developer
PHP Developer
Software Developer
Front End Developer
Front-End Developer
Senior Web Developer
Senior Developer
Developer
Front-end Developer
Lead Developer
Senior iOS Developer
Junior Web Developer
Senior Developer Ruby on Rails
Senior .NET Developer
Senior Java Developer
Mobile Developer
Senior Application Developer
Front-End Web Developer
JavaScript Developer
PHP Web Developer
.Net Developer - C# SQL SOA SSIS - Albany, NY
Sr. Systems Developer
Backend Developer
.Net Developer
Business Developer
Front-End Developer/HTML/JavaScript/CSS
BI Developer
Ruby on Rails Web Developer
Senior Ruby on Rails Developer
Senior Front End Developer
SQL Server Developer
Drupal Developer
Frontend Developer
Experienced PHP Developer
Python Developer
Business Intelligence Developer
Game Developer
Javascript Web Application Developer
Android Developer 
Contract SilverStripe Developer
Junior Developer
SQL Devel

In [279]:
for i in df5[(df5['title'].str.contains('Developer')) & (df5['function'].isnull())]['title'].value_counts().index:
    if len(df5[(df5['title'].str.contains(i.strip(), regex=False)) & (df5['function'].notnull())]['function'].value_counts()) > 0:
        print(i, ": " + df5[(df5['title'].str.contains(i.strip(), regex=False)) & (df5['function'].notnull())]['function'].value_counts().idxmax())

.NET Developer : Information Technology
Web Developer : Information Technology
Application Developer : Information Technology
iOS Developer : Engineering
Android Developer : Information Technology
Java Developer : Information Technology
PHP Developer : Information Technology
Software Developer : Information Technology
Front End Developer : Engineering
Front-End Developer : Information Technology
Senior Web Developer : Information Technology
Senior Developer : Information Technology
Developer : Information Technology
Front-end Developer : Engineering
Lead Developer : Information Technology
Senior iOS Developer : Information Technology
Junior Web Developer : Engineering
Senior .NET Developer : Information Technology
Senior Java Developer : Information Technology
Mobile Developer : Information Technology
Front-End Web Developer : Information Technology
JavaScript Developer : Information Technology
PHP Web Developer : Information Technology
Backend Developer : Engineering
.Net Developer : 

In [280]:
df5[(df5['title'].str.contains('Project Manager')) & (df5['function'].notnull())]['function'].value_counts()

Project Management        94
Information Technology    18
Production                12
Management                 6
Consulting                 4
Accounting/Auditing        2
Design                     1
Engineering                1
Purchasing                 1
Product Management         1
Research                   1
General Business           1
Business Development       1
Name: function, dtype: int64

In [281]:
df5[(df5['title'].str.contains('Account Manager')) & (df5['function'].notnull())]['function'].value_counts()

Sales                     76
Business Development      19
Marketing                 13
Customer Service          13
Advertising               10
Project Management         8
Public Relations           6
Accounting/Auditing        4
Human Resources            4
Strategy/Planning          2
Consulting                 2
Business Analyst           1
Writing/Editing            1
Information Technology     1
Production                 1
General Business           1
Administrative             1
Name: function, dtype: int64

In [282]:
df5[(df5['title'].str.contains('Developer')) & (df5['function'].notnull())]['function'].value_counts()

Information Technology    628
Engineering               380
Other                      22
Production                 22
Design                     17
Consulting                 13
Marketing                   8
Sales                       5
Advertising                 5
Accounting/Auditing         5
Business Development        4
Research                    3
Art/Creative                3
Finance                     3
Training                    2
Project Management          2
Public Relations            1
Management                  1
Human Resources             1
Customer Service            1
Product Management          1
Education                   1
Name: function, dtype: int64

In [283]:
df5[(df5['title'].str.contains('English Teacher')) & (df5['function'].notnull())]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
60,Graduates: English Teacher Abroad (Conversational),"US, IA, Iowa city",,,<p>We help teachers get safe &amp; secure jobs abroad :)</p>,"<p><img src=""#URL_ec9a1dff9db12b7f5987cf4cae6df01a39cd3ed5bad7cdf0448958cf97610268#""></p>\r\n<p>...",<p>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not nec...,<p>See job description</p>,f,t,t,Contract,Entry level,Bachelor's Degree,Education Management,Education,f,f,US,IA,Iowa city
143,English Teacher Abroad (Conversational),"US, TX, Hidalgo",,,<p>We help teachers get safe &amp; secure jobs abroad :)</p>,"<p>Play with kids, get paid for it.</p>\r\n<p>Vacancies in Asia</p>\r\n<p>$1500 USD + monthly ($...","<p>University degree required. TEFL / TESOL / CELTA, and/or teaching experience preferred</p>\r\...",<p>See job description</p>,f,t,t,Contract,Entry level,Bachelor's Degree,Education Management,Education,f,f,US,TX,Hidalgo
338,Graduates: English Teacher Abroad (Conversational),"US, AR, Jonesboro",,,<p>We help teachers get safe &amp; secure jobs abroad :)</p>,"<p><img src=""#URL_ec9a1dff9db12b7f5987cf4cae6df01a39cd3ed5bad7cdf0448958cf97610268#""></p>\r\n<p>...",<p>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not nec...,<p>See job description</p>,f,t,t,Contract,Entry level,Bachelor's Degree,Education Management,Education,f,f,US,AR,Jonesboro
404,English Teacher Abroad (Conversational),"US, IN, Greencastle",,,<p>We help teachers get safe &amp; secure jobs abroad :)</p>,"<p><img src=""#URL_ec9a1dff9db12b7f5987cf4cae6df01a39cd3ed5bad7cdf0448958cf97610268#""></p>\r\n<p>...","<p>University degree required. TEFL / TESOL / CELTA, and/or teaching experience preferred, but n...",<p>See job description</p>,f,t,t,Contract,Entry level,Bachelor's Degree,Education Management,Education,f,f,US,IN,Greencastle
415,Graduates: English Teacher Abroad (Conversational),"US, OK, Bethany",,,<p>We help teachers get safe &amp; secure jobs abroad :)</p>,"<p><img src=""#URL_ec9a1dff9db12b7f5987cf4cae6df01a39cd3ed5bad7cdf0448958cf97610268#""></p>\r\n<p>...",<p>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not nec...,<p>See job description</p>,f,t,t,Contract,Entry level,Bachelor's Degree,Education Management,Education,f,f,US,OK,Bethany
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14124,Graduates: English Teacher Abroad (Conversational),"US, IN, Muncie",,,<p>We help teachers get safe &amp; secure jobs abroad :)</p>,"<p><img src=""#URL_ec9a1dff9db12b7f5987cf4cae6df01a39cd3ed5bad7cdf0448958cf97610268#""></p>\r\n<p>...",<p>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not nec...,<p>See job description</p>,f,t,t,Contract,Entry level,Bachelor's Degree,Education Management,Education,f,f,US,IN,Muncie
14135,Graduates: English Teacher Abroad (Conversational),"US, MN, St. Cloud",,,<p>We help teachers get safe &amp; secure jobs abroad :)</p>,"<p><img src=""#URL_ec9a1dff9db12b7f5987cf4cae6df01a39cd3ed5bad7cdf0448958cf97610268#""></p>\r\n<p>...",<p>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not nec...,<p>See job description</p>,f,t,t,Contract,Entry level,Bachelor's Degree,Education Management,Education,f,f,US,MN,St. Cloud
14148,Graduates: English Teacher Abroad (Conversational),"US, WV, Huntington",,,<p>We help teachers get safe &amp; secure jobs abroad :)</p>,"<p><img src=""#URL_ec9a1dff9db12b7f5987cf4cae6df01a39cd3ed5bad7cdf0448958cf97610268#""></p>\r\n<p>...",<p>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not nec...,<p>See job description</p>,f,t,t,Contract,Entry level,Bachelor's Degree,Education Management,Education,f,f,US,WV,Huntington
14157,Graduates: English Teacher Abroad (Conversational),"US, PA, Mansfield",,,<p>We help teachers get safe &amp; secure jobs abroad :)</p>,"<p><img src=""#URL_ec9a1dff9db12b7f5987cf4cae6df01a39cd3ed5bad7cdf0448958cf97610268#""></p>\r\n<p>...",<p>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not nec...,<p>See job description</p>,f,t,t,Contract,Entry level,Bachelor's Degree,Education Management,Education,f,f,US,PA,Mansfield


In [284]:
df5[(df5['function'].isnull()) & (df5['department'].notnull())]['department'].value_counts().head(50).index

Index(['IT', 'Sales', 'Operations', 'Product', 'Development', 'Marketing',
       'tech', 'Engineering', 'Clerical', 'HR', 'Legal',
       'Information Technology', 'R&D', 'Creative', 'All',
       'Product Development', 'Information Technology ',
       'Performance Marketing', 'CS', 'Editorial', 'International Growth',
       'Customer Service', 'Maintenance', 'Merchandising ', 'Content',
       'Technology', 'Reservations', 'EC', 'Retail', 'Permanent', 'Tech',
       'African Program', 'Administrative', 'Services', 'MM', 'Accounting',
       'Technical', 'Administration', 'Business Development',
       'Customer Support', 'Nursing', 'Client Services', 'Data Entry',
       'Internships', 'Development ', 'Design', 'Digital Pathology', 'General',
       'PPC', 'Software Engineering'],
      dtype='object')

'IT', 'Information Technology', 'Information Technology ', 'Technology', 'Tech' = Information Technology
Software Engineering ???

'Sales' = Sales

'Marketing' = Marketing

'Customer Service', 'CS', 'Client Services', 'Customer Support' = Customer Service

'Administration', 'Administrative' = Administrative

'Accounting' = Accounting/Auditing
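
The mapping noted above can also be written as a lookup table. The following is only a minimal sketch on a toy frame: `DEPT_TO_FUNCTION` and `impute_from_department` are hypothetical names that do not appear elsewhere in this notebook. It assumes that stripping whitespace before the lookup is acceptable, so that 'Information Technology ' and 'Information Technology' share one key.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table built from the department notes above.
DEPT_TO_FUNCTION = {
    'IT': 'Information Technology',
    'Information Technology': 'Information Technology',
    'Technology': 'Information Technology',
    'Tech': 'Information Technology',
    'Sales': 'Sales',
    'Marketing': 'Marketing',
    'Customer Service': 'Customer Service',
    'CS': 'Customer Service',
    'Client Services': 'Customer Service',
    'Customer Support': 'Customer Service',
    'Administration': 'Administrative',
    'Administrative': 'Administrative',
    'Accounting': 'Accounting/Auditing',
}

def impute_from_department(df):
    """Return a copy where a missing `function` is filled from an exact department match."""
    out = df.copy()
    # Departments that are missing or not in the table map to NaN and fill nothing.
    mapped = out['department'].str.strip().map(DEPT_TO_FUNCTION)
    out['function'] = out['function'].fillna(mapped)
    return out

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    'department': ['IT', 'Information Technology ', 'CS', 'R&D', np.nan, 'Sales'],
    'function': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Marketing'],
})
result = impute_from_department(toy)
print(result)
```

Rows that already have a function (the last toy row) are left as they are, and departments outside the table stay missing so that a later step can handle them.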

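To get a rough idea of how many rows these title keywords could impute, we sum the number of rows with a missing function that match each keyword. This total is an upper bound rather than an exact count: a title that matches several keywords is counted once per keyword, so the actual number of rows is smaller.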

In [285]:
# For each keyword, count the rows whose title contains it and whose function is missing
total = 0
for w in ['Teacher', 'Design', 'Engineer', 'Sales', 'Developer', 'Human Resources', 'HR', 'Financ', 
          'Programmer', 'Accountant', 'Data Entry', 'Data Analyst', 'Business Analyst', 'Business Intelligence', 
          'Account Manager', 'Project Manager', 'Auditor', 'Supervisor', 'Account Executive', 'Writer', 
          'Data Scien', 'Analytics', 'Receptionist', 'Community Manager', 'Administrative', 'Contact Center']:
    total_w = len(df5[(df5['title'].str.contains(w)) & (df5['function'].isnull())]['title'])
    total += total_w
total

2859

In [286]:
df5['function'].isnull().sum()

6378
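
If an exact count is wanted, the per-keyword matches can be combined into a single boolean mask so that each row is counted once. This is a sketch on toy data, not a rerun of the cell above: the titles and keywords are made up, and it uses literal matching (`regex=False`) where the cell above relies on the default regex matching.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    'title': ['Sales Engineer', 'Web Developer', 'Sales Manager', 'Chef', 'Data Engineer'],
    'function': [np.nan, np.nan, 'Sales', np.nan, np.nan],
})
keywords = ['Sales', 'Engineer', 'Developer']

missing = toy['function'].isnull()

# Per-keyword sum: 'Sales Engineer' is counted for both 'Sales' and 'Engineer'.
naive_total = sum(
    (toy['title'].str.contains(w, regex=False, na=False) & missing).sum()
    for w in keywords
)

# Combined mask: a row counts once if its title matches any keyword.
any_match = pd.Series(False, index=toy.index)
for w in keywords:
    any_match |= toy['title'].str.contains(w, regex=False, na=False)
exact_total = (any_match & missing).sum()

print(naive_total, exact_total)
```

The gap between the two numbers is the amount of double counting caused by titles that match more than one keyword.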

A more accurate way to impute is to check title and industry together. Department is also informative, so where it is available we check department first and fall back to industry only afterwards. The basic idea, for example: if industry is Information Tech and Services, we impute function as Information Technology, which is the most probable value but not necessarily the correct one.

The first pass uses title to impute. The rows that remain are then imputed from department, followed by industry. In my opinion, this sequence gives the best outcome.
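
The title, then department, then industry order can be sketched as a chain of `fillna` calls, where each step only touches rows that are still missing. This is an illustration under assumptions, not the implementation used in this notebook: the three lookup dictionaries and all toy values are hypothetical stand-ins for the keyword rules.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    'title': ['Java Developer', 'Assistant', 'Associate', 'Clerk', 'Sales Rep'],
    'department': [np.nan, 'Sales', np.nan, np.nan, np.nan],
    'industry': ['Banking', 'Retail', 'Information Technology and Services', np.nan, 'Retail'],
    'function': [np.nan, np.nan, np.nan, np.nan, 'Sales'],
})

# Hypothetical per-column guesses; values without a rule map to NaN.
from_title = toy['title'].map({'Java Developer': 'Information Technology'})
from_dept = toy['department'].map({'Sales': 'Sales'})
from_industry = toy['industry'].map(
    {'Information Technology and Services': 'Information Technology'}
)

# Earlier sources win: each fillna fills only the rows that are still missing.
toy['function'] = (
    toy['function']
    .fillna(from_title)
    .fillna(from_dept)
    .fillna(from_industry)
)
print(toy[['title', 'function']])
```

The 'Clerk' row has no usable title rule, department, or industry here, so it stays missing and would need manual handling.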

### Stop here

In [287]:
keywords_title = [
    'Teacher', 'Teaching', 'Design', 'Quality Assurance', 'QA', 'Manufacturing', 'Engineer', 'Sales', 'Developer', 'developer', 
    'Human Resources', 'HR', 'Financ', 'Programmer', 'Accountant', 'Data Entry', 'Data Analyst', 'Business Analyst', 
    'Business Intelligence', 'Analyst', 'Account Manager', 'Project Manager', 'Auditor', 'Supervisor', 'Account Executive', 
    'Writer', 'Data Scien', 'Analytics', 'Receptionist', 'Community Manager', 'Administrative', 'Contact Center', 
    'Executive Assistant', 'Marketing', 'Retail', 'CUSTOMER SERVICE', 'Customer Service', 'Military Veteran', 'Art Director', 
    'Property Preservation', 'Chef', 'Operations Specialist', 'IT Operations', 'Product Manager', 'Buyer', 'Care', 
    'Support Worker', 'Social Media', 'Controller', 'System Administrator', 'IT Security', 'Office Administrator', 'GIS', 
    'Project Coordinator', 'Healthcare', 'Pharmacist', 'Business Development', 'Software', 'Driver', 'Nurse', 'Implementation', 
    'Real Estate', 'Payable', 'Office Assistant', 'Insurance', 'Product', 'Graphics', 'Operator', 'Systems Administrator', 
    'Customer Support', 'Recovery Specialist', 'Attorney', 'Artist', 'Seamstress', 'Geophysicist', 'Technical Support', 
    'Mortgage', 'Database', 'Data ', 'Health', 'Nursing', 'CARE', 'Tax', 'SharePoint', 'Sharepoint', 'Telemarket', 
    'Bartender', 'Account Coordinator', 'Community Management', 'Local Coordinator', 'Therapist', 'Cleaner', 'Client Ambassador', 
    'Fraud', 'Brand Ambassador', 'Wealth', 'writer', 'design', 'Tutor', 'Teach', 'Sysadmin', 'Linux', 'Solutions Architect', 
    'Ruby', 'Billing', 'Growth Hacker', 'SEM', 'Editor', 'Project Management', 'engineer', 'customer service', 'Physician', 
    'Oracle', 'HTML', 'DEVELOPER', 'DBA', 'Secretary', 'Journalis', 'IT', 'Appointment Setter', 'Pharmacy', 'Call Center', 
    'Electronics', 'Bookkeep', 'Optician', 'Technology', 'Pastor', 'Advocate', 'Web', 'Translat', 'SEO', 'Radiolog', 
    'Strategist', 'Social Worker', 'Interpreter', 'Processor', 'Housing', 'Researcher', 'sales'
]

keywords_dept = [
    'IT', 'Information Technology', 'Information Technology ', 'Technology', 'Tech', 'Sales', 'Marketing', 'Customer Service', 
    'CS', 'Client Services', 'Customer Support', 'Administration', 'Administrative', 'Accounting', 'Creative'
]

In [288]:
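def impute_function(df, words, col):
    """Fill missing `function` values in df using keywords matched against col.

    col == 'title': each keyword is imputed with the most frequent known function
    among titles containing it ('Developer' is resolved per distinct title).
    col == 'department': exact department matches are mapped to a fixed function label.
    The dataframe is modified in place and also returned.
    """
    print('Total keywords to be looped for', col + ":", len(words))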
    if col == 'title':
        for word in words:
            if word == 'Developer':
                for w in df[(df['title'].str.contains(word)) & (df['function'].isnull())]['title'].value_counts().index:
                    if len(df[(df['title'].str.contains(w.strip(), regex=False)) & \
                              (df['function'].notnull())]['function'].value_counts()) > 0:
                        mode = df[(df['title'].str.contains(w.strip(), regex=False)) & \
                                  (df['function'].notnull())]['function'].value_counts().idxmax()
                        df.loc[(df['title'].str.contains(w.strip(), regex=False)) & (df['function'].isnull()), 
                               'function'] = mode
                        
            elif word == 'Design':
                mode = df[(df['title'].str.contains(word)) & (df['function'].notnull()) & \
                          (~df['title'].str.contains('CAD', na=False)) & (~df['title'].str.contains('Cad', na=False)) & \
                          (~df['title'].str.contains('Engineer', na=False))]['function'].value_counts().idxmax()
                df.loc[(df['title'].str.contains(word)) & (df['function'].isnull()) & \
                       (~df['title'].str.contains('CAD', na=False)) & (~df['title'].str.contains('Cad', na=False)) & \
                       (~df['title'].str.contains('Engineer', na=False)), 'function'] = mode
                        
            elif word == 'Financ':
                mode = df[(df['title'].str.contains(word)) & (~df['title'].str.contains('Writer', na=False)) & \
                          (~df['title'].str.contains('Attorney', na=False)) & (df['function'].notnull()) & \
                          (~df['title'].str.contains('Editor', na=False))]['function'].value_counts().idxmax()
                df.loc[(df['title'].str.contains(word)) & (df['function'].isnull()) & \
                       (~df['title'].str.contains('Writer', na=False)) & (~df['title'].str.contains('Editor', na=False)) & \
                       (~df['title'].str.contains('Attorney', na=False)), 'function'] = mode
                
            elif word in ['Human Resources', 'HR']:
                mode = df[(df['title'].str.contains(word)) & (df['function'].notnull()) & \
                          (~df['title'].str.contains('HRMS', na=False)) & \
                          (~df['title'].str.contains('Product Manager', na=False))]['function'].value_counts().idxmax()
                df.loc[(df['title'].str.contains(word)) & (df['function'].isnull()) & \
                       (~df['title'].str.contains('HRMS', na=False)) & \
                       (~df['title'].str.contains('Product Manager', na=False)), 'function'] = mode
                        
            else:
                mode = df[(df['title'].str.contains(word)) & (df['function'].notnull())]['function'].value_counts().idxmax()
                df.loc[(df['title'].str.contains(word)) & (df['function'].isnull()), 'function'] = mode
                
        
    elif col == 'department':
        for word in words:
            if word in ['IT', 'Information Technology', 'Information Technology ', 'Technology', 'Tech']:
                if len(df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function']) > 0:
                    df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function'] = "Information Technology"
            
            elif word == 'Sales':
                if len(df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function']) > 0:
                    df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function'] = "Sales"
            
            elif word == 'Marketing':
                if len(df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function']) > 0:
                    df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function'] = "Marketing"
                    
            elif word in ['Customer Service', 'CS', 'Client Services', 'Customer Support']:
                if len(df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function']) > 0:
                    df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function'] = "Customer Service"
                    
            elif word in ['Administration', 'Administrative']:
                if len(df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function']) > 0:
                    df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function'] = "Administrative"
            
            elif word == 'Accounting':
                if len(df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function']) > 0:
                    df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function'] = "Accounting/Auditing"
                    
            elif word == 'Creative':
                if len(df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function']) > 0:
                    df.loc[(df['function'].isnull()) & (df['department'].notnull()) & \
                           (df['department'] == word), 'function'] = "Art/Creative"
                    
    return df
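
One thing to watch in the title branch: `idxmax()` raises a `ValueError` on an empty Series, which happens when a keyword matches no row with a known function. The sketch below shows one possible guard on toy data. `mode_for_keyword` and `impute_keyword` are hypothetical helpers, not part of `impute_function`, and they use literal matching (`regex=False`) as a simplifying assumption.

```python
import numpy as np
import pandas as pd

def mode_for_keyword(df, word):
    """Most frequent known `function` among titles containing `word`, or None if there is none."""
    has_word = df['title'].str.contains(word, regex=False, na=False)
    counts = df.loc[has_word & df['function'].notnull(), 'function'].value_counts()
    return counts.idxmax() if len(counts) > 0 else None

def impute_keyword(df, word):
    """Fill missing `function` for titles containing `word`; skip the keyword if no mode exists."""
    mode = mode_for_keyword(df, word)
    if mode is not None:
        has_word = df['title'].str.contains(word, regex=False, na=False)
        df.loc[has_word & df['function'].isnull(), 'function'] = mode
    return df

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    'title': ['Staff Writer', 'Technical Writer', 'Writer', 'Pastor'],
    'function': ['Writing/Editing', 'Writing/Editing', np.nan, np.nan],
})
toy = impute_keyword(toy, 'Writer')  # fills the 'Writer' row
toy = impute_keyword(toy, 'Pastor')  # no known function for 'Pastor', so the row is left untouched
print(toy)
```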

In [289]:
# Manual imputation
df6 = df5.copy()

for word in ['Cad Designer', 'CAD', 'Maintenance', 'Environmental', 'Machinist', 'EHS', 'Safety Specialist', 'Geologist', 
             'Battery', 'Estimator', 'Smart-Meter', 'Devops', 'Lawn Crew', 'Drafting', 'Electrician', 'Signal Testing', 
             'Shop Foreman', 'Surveyor']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Engineering"

for word in ['Beauty', 'Demonstrator', 'Closer', 'fragrance', 'go getters', 'Telecanvasser', 'Account manager', 
             'Working from Home', 'Research Interview', 'Originator']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Sales"
    
for word in ['Quality Manager', 'Office Manager', 'Project Controls', 'Operations Manager', 'Head of Operations', 
             'Director of Operations', 'Development Manager', 'Program Manager', 'Chief', 'Bar manager', 
             'Executive Director', 'Manager', 'Director', 'CTO', 'Head', 'VP', 'CEO', 'Provision', 'Lead']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Management"
    
for word in ['Payroll', 'Appointment Coordinator', 'Personal Assistant', 'Business Admin', 'Returns Specialist', 
             'Church Administrator', 'Contract Administrator', 'Billing Administrator', 'Fashion E-shop Administrator', 
             'Welcome desk Administrator', 'Government', 'E-commerce administrator', 'Operations Coordinator', 
             'Front Desk', 'ADMIN', 'Operations']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Administrative"
    
for word in ['CAREGIVER', 'Healthcare Support', 'PCP', 'Mental Health', 'Medical', 'Dental', 'Psychiatrist', 
             'Chemical Dependency', 'Group Counselor', 'Substance Abuse', 'ABA Counselor', 'Youth Case Management', 
             'RGN', 'Anaesthetic', 'Neuro', 'CNA']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Health Care Provider"
    
for word in ['Talent Management', 'Recruitment', 'Recruiting', 'Recruiter', 'Staffing', 'Recuiter', 'C.V. para Base de Datos', 
             'Leicestershire Apprenticeships']:
    df6.loc[(df6['title'].str.contains(word, regex=False)) & (df6['function'].isnull()), 'function'] = "Human Resources"
    
for word in ['Cruise', 'In-Store Assistants', 'Guest Service', 'Customer Assistant', 'Collections Representative', 'CSR', 
             'Captioning Assistant', 'Travel Agent', 'Crew Members', 'Food & Beverage Guest', 'Ticket Booth', 
             'School Bus Monitor', 'Community Support']:
    df6.loc[(df6['title'].str.contains(word, regex=False)) & (df6['function'].isnull()), 'function'] = "Customer Service"
    
for word in ['Supply Chain', 'Logistics', 'Elite Agent', 'Stocker', 'Inventory']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Supply Chain"
    
df6.loc[(df6['title'].str.contains('Product Development')) & (df6['function'].isnull()), 'function'] = "Product Management"

df6.loc[(df6['title'].str.contains('Regional')) & (df6['function'].isnull()) & \
        (~df6['title'].str.contains('Library', na=False)), 'function'] = "Sales"
df6.loc[(df6['title'].str.contains('Training')) & (df6['function'].isnull()), 'function'] = "Training"
df6.loc[(df6['title'].str.contains('Warehouse')) & (df6['function'].isnull()) & \
        (~df6['title'].str.contains('Driver', na=False)), 'function'] = "Production"

for word in ['Production', 'Warehousing', 'Producer', 'Broadcast']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Production"

for word in ['Curriculum Associate', 'Lead Instructor', 'Professional Development Coordinator', 'Learning Enterprises']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Education"

for word in ['Technician', 'General Application', 'Labor', 'Housekeeper', 'Cook', 'Security Officer', 'Groomer', 
             'SECURITY OFFICER', 'Janitor', 'INSTALLER', 'Traffic Planner', 'Bricklaying', 'cashier', 'Veteran Interview', 
             'Internship', 'Carpenter', 'Oilfield', 'Summer Intern', 'Casual/Part-Time', 'Make Easy Money at Home', 
             'Voluntourist', 'NY | POOL UI @DS+LW', 'Local Representative', 'SF | ACD (COPY) @OP', 
             'Generic - speculative application', 'Dance', 'OPEN APPLICATIONS - BUSINESS/ACCOUNT/PRODUCERS', 'Asbestos', 
             'Flyer', 'Library Page', 'None of your openings', 'Dog', 'Benefit Counselor', 'Hospitality Security', 
             'Point Nine Talent', 'Cable Tech', 'Field Tech', 'I want to work @Workable', 'Initiativbewerbung', 
             'Want to work at Franq', 'Adcash', 'Open Applications', 'Got Talent', 'Part Time Day Porter', 
             'Intern & Graduate']:
    df6.loc[(df6['title'].str.contains(word, regex=False)) & (df6['function'].isnull()), 'function'] = "Other"
    
for word in ['DGV', 'Driving', 'DRIVER']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Distribution"
    
for word in ['Management Associate', 'Commercial and Operations', 'Account Management Intern', 'Campus Rep']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Business Development"

for word in ['Associate', 'Winter / Spring Internship', 'Coherent array imaging']:
    df6.loc[(df6['title'] == word) & (df6['function'].isnull()), 'function'] = "Research"

df6.loc[(df6['title'] == 'Intern') & (df6['function'].isnull()), 'function'] = "Other"
df6.loc[(df6['title'].str.contains('Clerk')) & (df6['function'].isnull()) & \
        (~df6['title'].str.contains('Payable', na=False)), 'function'] = "Administrative"
df6.loc[(df6['title'].str.contains('IT User')) & (df6['function'].isnull()), 'function'] = "Information Technology"

for word in ['Copywriter', 'Content Strategist', 'Content Coordinator', 'Webmaster, ', 'Reporter']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Writing/Editing"
    
for word in ['Business Owner', 'Earn the Income You Deserve', 'Founder', 'founder', 'Work from Home Executive']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "General Business"

df6.loc[(df6['title'] == 'Management Trainee') & (df6['function'].isnull()), 'function'] = "Human Resources"
df6.loc[(df6['title'].str.contains('Verification Specialist')) & (df6['function'].isnull()), 'function'] = "Quality Assurance"
df6.loc[(df6['title'] == 'Food Quality') & (df6['function'].isnull()), 'function'] = "Other"

for word in ['Technical Lead', 'Administrator', 'Tibco Architect', 'HP BSM SME', 'H1B', 'Applicatieontwikkelaar', 
             'Employee at RhodeCode', 'Motion Trajectory', 'Vend', 'Robust speech separation', 'correction coding', 
             'Model Builder', 'Clinical Informatics', 'SAS', 'Bioinformatics', 'ARCHITECT', 'Junior Web', 
             'Seedcamp Winter Intern', 'Deep learning', 'Backend', 'JavaScript', 'HPC', 'FINANCE SOFTWARE']:
    df6.loc[(df6['title'].str.contains(word, regex=False)) & (df6['function'].isnull()), 'function'] = "Information Technology"

for word in ['SAP', 'Winter Associate', 'Pricing Strategy', 'HCM', 'Consultant', 'CRM']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Consulting"
    
for word in ['Program Host', 'Webcam Model', 'Academy @ Vilnius', 'Illustrator']:
    df6.loc[(df6['title'].str.contains(word, regex=False)) & (df6['function'].isnull()), 'function'] = "Art/Creative"

for word in ['TapHunter', 'Marketeer', 'SUMMER INTERNSHIP', 'Event']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Marketing"
    
for word in ['Talented Architect', 'css']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Design"
    
for word in ['Case Handler', 'Pricing Specialist', 'Housing Counselor']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Finance"
    
for word in ['Social Innovation', 'Projektleder', 'Interested in working at FQ540', 'Consulting - Project Owner']:
    df6.loc[(df6['title'].str.contains(word, regex=False)) & (df6['function'].isnull()), 'function'] = "Project Management"

for word in ['Pipe', 'Electrical Reliability']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Manufacturing"

for word in ['Investment', 'Equities Trader']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Financial Analyst"

for word in ['Digital Executive', 'Analista Comercial']:
    df6.loc[(df6['title'].str.contains(word)) & (df6['function'].isnull()), 'function'] = "Advertising"

df6.loc[(df6['title'].str.contains('Judicatory Proctor')) & (df6['function'].isnull()), 'function'] = "Legal"
df6.loc[(df6['title'].str.contains('Program Host')) & (df6['function'].isnull()), 'function'] = "Art/Creative"
df6.loc[(df6['title'].str.contains('RETENTION')) & (df6['function'].isnull()), 'function'] = "Data Analyst"
df6.loc[(df6['title'].str.contains('Sourcing Specialist')) & (df6['function'].isnull()), 'function'] = "Purchasing"
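
The loops above all follow one pattern: for each keyword, fill the rows whose title contains it and whose function is still missing. A compact way to express that pattern is a rule table applied in order. This is only a sketch with a few illustrative rules on toy data: `RULES` and `apply_rules` are hypothetical names, and literal matching (`regex=False`) is assumed for every keyword.

```python
import numpy as np
import pandas as pd

# Hypothetical rule table: (keywords, function label), applied top to bottom.
RULES = [
    (['CAD', 'Maintenance'], 'Engineering'),
    (['Manager', 'Director'], 'Management'),
    (['Payroll', 'Front Desk'], 'Administrative'),
]

def apply_rules(df, rules):
    """Return a copy where missing `function` values are filled by the first matching rule."""
    out = df.copy()
    for words, label in rules:
        for word in words:
            mask = (out['title'].str.contains(word, regex=False, na=False)
                    & out['function'].isnull())
            out.loc[mask, 'function'] = label
    return out

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    'title': ['Maintenance Manager', 'Payroll Clerk', 'Art Director', 'Barista'],
    'function': [np.nan, np.nan, 'Design', np.nan],
})
result = apply_rules(toy, RULES)
print(result)
```

Because each rule only fills rows that are still missing, the order of the table matters: 'Maintenance Manager' is labelled Engineering here because that rule runs before the Management rule.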

The fraudulent entries must all have valid function values, so we have to impute them by whatever method applies. In this case, we impute them manually.

In [290]:
# Fraudulent entries
df6.loc[((df6['title'].str.contains('Part Time')) | (df6['title'].str.contains('Part-Time'))) & (df6['function'].isnull()) & \
        ((df6['title'].str.contains('Wanted')) | (df6['title'].str.contains('Needed')) | (df6['title'].str.contains('Require'))), 
        'function'] = 'Other'

for word in ['CHEF', 'Hotel', 'PART-TIME WORK FROM YOUR PLACE', 'KMC', 'Furniture mover', '5 Guys', 
             'Vacancy in Halliburton', 'Military Benefits', 'Immediate']:
    df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & (df6['title'].str.contains(word)), 'function'] = "Other"

for word in ['You can do it all from home', 'Rohan', 'No Experience Required And Never']:
    df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & (df6['description'].str.contains(word)), 
            'function'] = "Other"  

for word in ['Information Systems', 'Graphite Expert', 'Clinical Programming', 'EDI Coordinator']:
    df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & (df6['title'].str.contains(word)), 
            'function'] = "Information Technology"
    
for word in ['Admin', 'Document Control', 'Typist']:
    df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & (df6['title'].str.contains(word)), 
            'function'] = "Administrative"

for word in ['Service Associate', 'Daily Money Team']:
    df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & (df6['title'].str.contains(word)), 
            'function'] = "Customer Service"

for word in ['Forward Cap.', 'Fidelity']:
    df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & (df6['title'].str.contains(word)), 
            'function'] = "Financial Analyst"

df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & \
        (df6['description'].str.contains("looking for people that are quick learners")), 'function'] = "Sales"
df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & \
        (df6['description'].str.contains("Do You Want To Own Your Internet Base")), 'function'] = "General Business"
df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & \
        ((df6['title'].str.contains("RN")) | (df6['title'].str.contains("Hospital")) | (df6['title'].str.contains("health"))), 
        'function'] = "Health Care Provider"
df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & \
        (df6['title'].str.contains("OR Specialty")), 'function'] = "Health Care Provider"
df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & \
        (df6['title'].str.contains("Final Expense Agent")), 'function'] = "Sales"
df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & \
        (df6['title'].str.contains("Field Service Tech Capital")), 'function'] = "Manufacturing"
df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & \
        (df6['title'].str.contains("Ux desgin")), 'function'] = "Design"
df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & \
        (df6['title'].str.contains("Assistant position. The best job")), 'function'] = "Business Development"
df6.loc[(df6['fraudulent'] == 't') & (df6['function'].isnull()) & \
        (df6['title'].str.contains("Ninestone")), 'function'] = "Marketing"

# Based on index
df6.loc[3166, 'function'] = "General Business"
df6.loc[17511, 'function'] = "General Business"
df6.loc[17726, 'function'] = "General Business"
df6.loc[17570, 'function'] = "Other"

# Not Fraudulent
df6.loc[11001, 'function'] = "Other"
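The repeated `df6.loc[...]` assignments above all follow one pattern, so they could be driven from a single rule table. The following is only a minimal sketch under stated assumptions: `apply_keyword_rules` is a hypothetical helper, `demo` is toy data standing in for `df6`, and the rules are an excerpt of the ones used above. It passes `regex=False` so that a keyword such as 'Forward Cap.' is matched literally, whereas `str.contains` treats the pattern as a regular expression by default.

```python
import pandas as pd

def apply_keyword_rules(frame, rules, column, only_fraudulent=True):
    """Fill missing 'function' values where `column` contains a keyword.

    `rules` maps a target function label to a list of literal keywords.
    Hypothetical helper, not part of the original notebook.
    """
    out = frame.copy()
    for label, keywords in rules.items():
        for keyword in keywords:
            # Only rows whose function is still missing are candidates
            mask = out['function'].isnull() & out[column].str.contains(keyword, regex=False, na=False)
            if only_fraudulent:
                mask = mask & (out['fraudulent'] == 't')
            out.loc[mask, 'function'] = label
    return out

# Toy data standing in for df6 (illustrative only)
demo = pd.DataFrame({
    'title': ['Admin Clerk', 'Typist Needed', 'Forward Cap. Analyst', 'Admin Lead', 'Chef'],
    'function': [None, None, None, None, 'Other'],
    'fraudulent': ['t', 't', 't', 'f', 't'],
})

# Excerpt of the title rules used in the cell above
title_rules = {
    'Administrative': ['Admin', 'Document Control', 'Typist'],
    'Financial Analyst': ['Forward Cap.', 'Fidelity'],
}

result = apply_keyword_rules(demo, title_rules, 'title')
print(result[['title', 'function']])
```

Keeping the rules in a dictionary makes it easier to review which keyword leads to which function, and the non-fraudulent 'Admin Lead' row stays null because of the `only_fraudulent` filter.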

In [291]:
# Execute the function
df6 = impute_function(df6, keywords_title, 'title')
print("Before imputation:", df5['function'].isnull().sum())
print("After imputation:", df6['function'].isnull().sum())
print('\n')

df6 = impute_function(df6, keywords_dept, 'department')
df6.loc[(df6['title'].str.contains("Account")) & (df6['function'].isnull()), 'function'] = "Accounting/Auditing"
df6.loc[(df6['title'].str.contains("Architect")) & (df6['function'].isnull()), 'function'] = "Information Technology"
df6.loc[(df6['title'].str.contains("Server")) & (df6['function'].isnull()), 'function'] = "Customer Service"
print("After 2nd round of imputation:", df6['function'].isnull().sum())

Total keywords to be looped for title: 139
Before imputation: 6378
After imputation: 529


Total keywords to be looped for department: 15
After 2nd round of imputation: 473


To reduce the amount of manual work, we can also impute the missing functions with the mode (most frequent) function of each department.

In [292]:
# Use for loop and extract mode of function based on department
failed_dep_list = []
for dep in df6[(df6['function'].isnull()) & (df6['department'].notnull())]['department'].value_counts().index:
    if (len(df6[df6['department'] == dep]['function'].value_counts()) > 0) and (dep != "Logistics"):
        mode = df6[df6['department'] == dep]['function'].value_counts().idxmax()
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = mode
        print(dep, "done, mode:", mode)
    else:
        print("**Cannot impute using", dep)
        failed_dep_list.append(dep)
        
# Print the null counts
print("\n")
print("After 3rd round of imputation with dept mode:", df6['function'].isnull().sum())

EC done, mode: Information Technology
All done, mode: Information Technology
General done, mode: Public Relations
Retail done, mode: Sales
MM done, mode: Research
Merchandising  done, mode: Management
Operations done, mode: Other
HR done, mode: Human Resources
DA done, mode: Research
African Program done, mode: Production
Animation done, mode: Production
Design done, mode: Design
**Cannot impute using AML
Photography done, mode: Production
Legal done, mode: Legal
Media done, mode: Advertising
**Cannot impute using Open
Maintenance done, mode: Manufacturing
**Cannot impute using Logistics
Education done, mode: Education
Community Engagement done, mode: Other
**Cannot impute using North Star Shipping
**Cannot impute using Digital and Brand Practice
Fashion:Internships done, mode: Information Technology
**Cannot impute using Home Security
**Cannot impute using Audiology
**Cannot impute using Small Luxury Lodge
Management done, mode: Management
Non-Tech done, mode: Management
**Cannot impu

In [293]:
len(failed_dep_list)

47
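The department-mode loop above could also be written without an explicit loop. This is a minimal sketch on toy data (`demo` stands in for `df6`), and it does not replicate the explicit exclusion of "Logistics" from the original cell.

```python
import pandas as pd

# Toy data standing in for df6 (illustrative only)
demo = pd.DataFrame({
    'department': ['HR', 'HR', 'HR', 'Legal', 'Legal', 'AML', None],
    'function': ['Human Resources', 'Human Resources', None, 'Legal', None, None, None],
})

# Most frequent non-null function per department; departments whose rows
# are all null simply do not appear in the result.
dept_mode = (
    demo.dropna(subset=['function'])
        .groupby('department')['function']
        .agg(lambda s: s.value_counts().idxmax())
)

# Map each row's department to its mode and fill only the missing functions
filled = demo['function'].fillna(demo['department'].map(dept_mode))

# Departments that still have nulls need manual handling
failed = demo.loc[filled.isnull() & demo['department'].notnull(), 'department'].unique().tolist()
print(filled.tolist())
print(failed)
```

Here `failed` plays the same role as `failed_dep_list`: it collects the departments for which no mode could be computed.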

The remaining department names that can't be used for imputation are saved inside a list object.

In [294]:
failed_dep_list

['AML',
 'Open',
 'Logistics',
 'North Star Shipping',
 'Digital and Brand Practice',
 'Home Security',
 'Audiology',
 'Small Luxury Lodge',
 'SWT',
 'Aerospace and Defense Engineering Services',
 'MARKETING ',
 'Media/ Television',
 'Visionary Engineering',
 'COMMERCIAL COLLECTIONS',
 'Parks and Recreation',
 'Student Financial Services',
 'Strategy, Creative, Execution, HR',
 'Sutter Medical Group',
 'Front Office & Guest Services ',
 'videographer',
 'Membership Development & Engagement',
 'Information Systems',
 'Recovery-OPS',
 'Programmer',
 'Trainee',
 'Internships / Special Projects / Part-Time Opportunities',
 'Returns',
 'Designer',
 'body Piercing ',
 'BIOMEDICAL EQUIPMENT TECHNICAN',
 'Oasis',
 'Management Support',
 'The Whole Company',
 'Culinary ',
 'Program',
 'Schools ',
 'Athletics',
 'Nonprofit Only',
 'kitchen',
 'HLT',
 'Electrical',
 'AdOps',
 'Content Programming',
 'Wilton EMS',
 'ICM',
 'incrediblue',
 'Finance and Operations']

In [295]:
for dep in failed_dep_list:
    if dep in ['Programmer', 'Information Systems', 'Content Programming']:
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Information Technology"
        
    elif dep in ['Media/ Television', 'videographer']:
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = 'Art/Creative'
        
    elif dep in ['Electrical', 'Visionary Engineering', 'Aerospace and Defense Engineering Services']:
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Engineering"
    
    elif dep in ['Student Financial Services', 'Finance and Operations', 'Recovery-OPS', 'AML']:
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Finance"
    
    elif dep in ['MARKETING ', 'AdOps', 'Strategy, Creative, Execution, HR']:
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Marketing"
        
    elif dep == 'Designer':
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Design"
        
    elif dep in ['Front Office & Guest Services ', 'Schools ', 'Parks and Recreation', 'Small Luxury Lodge']:
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Customer Service"
    
    elif dep in ['Home Security', 'Culinary ', 'body Piercing ', 'kitchen', 'incrediblue', 'HLT', 'SWT', 
                 'The Whole Company', 'Nonprofit Only']:
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Other"
        
    elif dep in ['Logistics', 'North Star Shipping']:
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Supply Chain"
    
    elif dep in ['Sutter Medical Group', 'Audiology', 'Wilton EMS', 'ICM', 'Oasis']:
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Health Care Provider"
        
    elif dep == 'Digital and Brand Practice':
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Administrative"
        
    elif dep == 'Athletics':
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Training"
        
    elif dep in ['COMMERCIAL COLLECTIONS', 'Membership Development & Engagement', 'Returns']:
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Sales"
        
    elif dep == 'BIOMEDICAL EQUIPMENT TECHNICAN':
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Science"
        
    elif dep in ['Trainee', 'Management Support']:
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Management"
        
    elif dep == 'Program':
        df6.loc[(df6['function'].isnull()) & (df6['department'] == dep), 'function'] = "Project Management"

df6.loc[5162, 'function'] = "Engineering"
df6.loc[12338, 'function'] = "Information Technology"

# Print the null counts:
print("After 4th round of imputation with remaining dept:", df6['function'].isnull().sum())

After 4th round of imputation with remaining dept: 334
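The long `if`/`elif` chain above is effectively a lookup table from department to function. As a sketch only, the same idea can be expressed with a dictionary and `Series.map`; `dept_to_function` below is a short excerpt of the choices made above and `demo` is toy data standing in for `df6`.

```python
import pandas as pd

# Excerpt of the department-to-function choices made above
dept_to_function = {
    'Programmer': 'Information Technology',
    'Information Systems': 'Information Technology',
    'Electrical': 'Engineering',
    'Designer': 'Design',
    'Logistics': 'Supply Chain',
}

# Toy data standing in for df6 (illustrative only)
demo = pd.DataFrame({
    'department': ['Programmer', 'Electrical', 'Logistics', 'Oasis', 'Designer'],
    'function': [None, None, 'Other', None, None],
})

# map() returns NaN for departments missing from the dictionary, and
# fillna() leaves already populated functions untouched.
demo['function'] = demo['function'].fillna(demo['department'].map(dept_to_function))
print(demo)
```

The 'Logistics' row keeps its existing value and the 'Oasis' row stays null, because only missing functions with a known department are filled.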


We still have 334 null entries. Another potentially useful approach is to split the remaining titles into individual words and, for each promising word, look up the mode function of all titles containing it. Let us first count how often each word occurs in the remaining job titles.

In [296]:
word_list = []
for title in df6[(df6['function'].isnull())]['title'].value_counts().index:
    words = title.split()
    word_list += words

count_dict = {}
for word in list(set(word_list)):
    count_dict[word] = word_list.count(word)

In [297]:
sorted(count_dict.items(), key=lambda x: x[1], reverse=True)

[('-', 27),
 ('Specialist', 21),
 ('in', 14),
 ('and', 14),
 ('Assistant', 12),
 ('Coordinator', 12),
 ('&', 12),
 ('to', 9),
 ('for', 7),
 ('|', 7),
 ('Associate', 7),
 ('General', 7),
 ('Customer', 7),
 ('Intern', 6),
 ('Team', 6),
 ('Job', 6),
 ('the', 6),
 ('Business', 5),
 ('Senior', 5),
 ('NAS', 5),
 ('/', 5),
 ('Application', 5),
 ('of', 5),
 ('Support', 5),
 ('Expert', 5),
 ('a', 5),
 ('Work', 5),
 ('16-18', 4),
 ('Australia', 4),
 ('Language', 4),
 ('people', 4),
 ('DevOps', 4),
 ('Under', 4),
 ('Development', 4),
 ('Personal', 4),
 ('ENGINEER', 4),
 ('year', 4),
 ('Commercial', 4),
 ("Registrar's", 3),
 ('Year', 3),
 ('SR', 3),
 ('SF', 3),
 ('Banking', 3),
 ('New', 3),
 ('Pricing', 3),
 ('work', 3),
 ('Talent', 3),
 ('Trainer', 3),
 ('olds', 3),
 ('Network', 3),
 ('Sr.', 3),
 ('Staff', 3),
 ('I', 3),
 ('Equipment', 3),
 ('Mechanic', 3),
 ('For', 3),
 ('Dev', 3),
 ('Specialists', 3),
 ('Only', 3),
 ('Management', 3),
 ('Apprenticeship', 3),
 ('Admin', 3),
 ('NY', 3),
 ('@LDK',

In [298]:
len(count_dict)

849
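The word tally above calls `list.count` once per distinct word, which rescans the whole list each time. A possible alternative, sketched here on a toy list of titles rather than the real data, is `collections.Counter`, which tallies all words in a single pass.

```python
from collections import Counter

# Toy titles standing in for the remaining unmatched job titles
titles = [
    'Customer Support Specialist',
    'Support Coordinator',
    'Sales Specialist - Remote',
]

# Counter tallies every whitespace-separated token in one pass
word_counts = Counter(word for title in titles for word in title.split())

# most_common() returns (word, count) pairs sorted by descending count
top_words = word_counts.most_common(3)
print(top_words)
print(len(word_counts))
```

`word_counts` behaves like the `count_dict` dictionary, so the later steps could use it unchanged.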

As mentioned before, we'll use a similar technique to impute the entries, except that this time we use the keys extracted in `count_dict` as keywords. This is faster because we don't need to check the entries one by one.

In [299]:
for word in ['Coordinator', 'Assistant', 'Customer', 'Associate', 'Development', 'ENGINEER', 'DevOps', 'people', 
             'Commercial', 'Pricing', 'Mechanic', 'Cosmetic', 'Admin', 'Dev', 'Staff', 'Specialists', 'Equipment', 
             'Trainer', 'Talent', 'Fragrance', 'Installer', 'Tech', 'Management', 'MANAGEMENT', 'Administration', 'PR', 
             'development', 'BI', 'Foreman', 'LINUX', 'CONSULTANT', 'Technolog', 'Industrial', 'SysAdmin', 'UX', 
             'Construction', 'Frontend', 'DESIGNER', 'Administration', 'Optomet', 'Electronic', 'Purchasing']:
    if len(df6[(df6['title'].str.contains(word)) & (df6['function'].notnull())]['function'].value_counts()) > 0:
        mode = df6[(df6['title'].str.contains(word)) & (df6['function'].notnull())]['function'].value_counts().idxmax()
        df6.loc[(df6['function'].isnull()) & (df6['title'].str.contains(word)), 'function'] = mode
        print(word, "done, mode:", mode)
    else:
        print("**Cannot impute using", word)

Coordinator done, mode: Administrative
Assistant done, mode: Administrative
Customer done, mode: Customer Service
Associate done, mode: Customer Service
Development done, mode: Sales
ENGINEER done, mode: Information Technology
DevOps done, mode: Engineering
**Cannot impute using people
Commercial done, mode: Sales
Pricing done, mode: Information Technology
Mechanic done, mode: Engineering
Cosmetic done, mode: Sales
Admin done, mode: Administrative
Dev done, mode: Information Technology
Staff done, mode: Sales
Specialists done, mode: Sales
Equipment done, mode: Engineering
Trainer done, mode: Other
Talent done, mode: Human Resources
Fragrance done, mode: Sales
Installer done, mode: Other
Tech done, mode: Information Technology
Management done, mode: Sales
MANAGEMENT done, mode: Business Development
Administration done, mode: Administrative
PR done, mode: Marketing
development done, mode: Information Technology
BI done, mode: Information Technology
Foreman done, mode: Other
LINUX done, m

In [300]:
df6.loc[(df6['title'].str.contains('Trainer')) & (df6['function'].isnull()), 'function'] = "Training"
df6.loc[(df6['title'].str.contains('Coach')) & (df6['function'].isnull()), 'function'] = "Training"
df6.loc[(df6['title'].str.contains('Instructor')) & (df6['function'].isnull()), 'function'] = "Training"
df6.loc[(df6['title'].str.contains('Banking')) & (df6['function'].isnull()), 'function'] = "Finance"

df6.loc[(df6['title'].str.contains('Misc')) & (df6['function'].isnull()), 'function'] = "Other"
df6.loc[(df6['title'].str.contains('Catering')) & (df6['function'].isnull()), 'function'] = "Other"
df6.loc[(df6['title'].str.contains('Onsite')) & (df6['function'].isnull()), 'function'] = "Other"
df6.loc[(df6['title'].str.contains('Emergency')) & (df6['function'].isnull()), 'function'] = "Other"
df6.loc[(df6['title'].str.contains('HELPERS')) & (df6['function'].isnull()), 'function'] = "Other"
df6.loc[(df6['title'].str.contains('Plumber')) & (df6['function'].isnull()), 'function'] = "Other"
df6.loc[(df6['title'].str.contains('Wellness')) & (df6['function'].isnull()), 'function'] = "Other"
df6.loc[(df6['title'].str.contains('Contractor')) & (df6['function'].isnull()), 'function'] = "Other"

df6.loc[(df6['title'].str.contains('Employment')) & (df6['function'].isnull()), 'function'] = "Human Resources"
df6.loc[(df6['title'].str.contains('consultants')) & (df6['function'].isnull()), 'function'] = "Consulting"
df6.loc[(df6['title'].str.contains('consultant')) & (df6['function'].isnull()), 'function'] = "Consulting"
df6.loc[(df6['title'].str.contains('Hygien')) & (df6['function'].isnull()), 'function'] = "Consulting"

df6.loc[(df6['title'].str.contains('ANALYST')) & (df6['function'].isnull()), 'function'] = "Data Analyst"
df6.loc[(df6['title'].str.contains('Quantitative')) & (df6['function'].isnull()), 'function'] = "Data Analyst"
df6.loc[(df6['title'].str.contains('Hostess')) & (df6['function'].isnull()), 'function'] = "Customer Service"
df6.loc[(df6['title'].str.contains('hotel')) & (df6['function'].isnull()), 'function'] = "Customer Service"
df6.loc[(df6['title'].str.contains('Attendant')) & (df6['function'].isnull()), 'function'] = "Customer Service"
df6.loc[(df6['title'].str.contains('Waitress')) & (df6['function'].isnull()), 'function'] = "Customer Service"

df6.loc[(df6['title'].str.contains('Medicine')) & (df6['function'].isnull()), 'function'] = "Health Care Provider"
df6.loc[(df6['title'].str.contains('Urologist')) & (df6['function'].isnull()), 'function'] = "Health Care Provider"
df6.loc[(df6['title'].str.contains('LEVELTHERAPIST')) & (df6['function'].isnull()), 'function'] = "Health Care Provider"
df6.loc[(df6['title'].str.contains('Cardiovascular')) & (df6['function'].isnull()), 'function'] = "Health Care Provider"
df6.loc[(df6['title'].str.contains('Therapeutic')) & (df6['function'].isnull()), 'function'] = "Health Care Provider"

df6.loc[(df6['title'].str.contains('MANAGER')) & (df6['function'].isnull()), 'function'] = "Management"
df6.loc[(df6['title'].str.contains('managers')) & (df6['function'].isnull()), 'function'] = "Management"

df6.loc[(df6['title'].str.contains('Media')) & (df6['function'].isnull()), 'function'] = "Art/Creative"
df6.loc[(df6['title'].str.contains('Photograph')) & (df6['function'].isnull()), 'function'] = "Art/Creative"
df6.loc[(df6['title'].str.contains('Creative')) & (df6['function'].isnull()), 'function'] = "Art/Creative"
df6.loc[(df6['title'].str.contains('Negotiator')) & (df6['function'].isnull()), 'function'] = "Sales"
df6.loc[(df6['title'].str.contains('Publish')) & (df6['function'].isnull()), 'function'] = "Writing/Editing"

df6.loc[(df6['title'].str.contains('Machinery')) & (df6['function'].isnull()), 'function'] = "Engineering"
df6.loc[(df6['title'].str.contains('ELECTRICIAN')) & (df6['function'].isnull()), 'function'] = "Engineering"
df6.loc[(df6['title'].str.contains('Material')) & (df6['function'].isnull()), 'function'] = "Engineering"

df6.loc[(df6['title'].str.contains('COMPUTING')) & (df6['function'].isnull()), 'function'] = "Information Technology"
df6.loc[(df6['title'].str.contains('Programming')) & (df6['function'].isnull()), 'function'] = "Information Technology"
df6.loc[(df6['title'].str.contains('investment')) & (df6['function'].isnull()), 'function'] = "Financial Analyst"
df6.loc[(df6['title'].str.contains('Bankrupt')) & (df6['function'].isnull()), 'function'] = "Finance"

# Print the null counts:
print("After 5th round of imputation:", df6['function'].isnull().sum())

After 5th round of imputation: 177
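One caveat with the keyword loops above is that `str.contains` matches substrings, so short keywords such as 'PR', 'BI' or 'Dev' can also hit longer, unrelated words. The sketch below shows one possible way to require whole-word matches; `contains_word` is a hypothetical helper and the titles are made up for illustration. It would not suit keywords that are deliberately word stems, such as 'Technolog' or 'Hygien'.

```python
import re
import pandas as pd

def contains_word(series, keyword):
    """Boolean mask: True where `keyword` appears as a whole word.

    Hypothetical helper, not part of the original notebook.
    """
    # \b marks a word boundary; re.escape keeps the keyword literal
    pattern = r'\b' + re.escape(keyword) + r'\b'
    return series.str.contains(pattern, regex=True, na=False)

# Made-up titles for illustration
titles = pd.Series(['PR Manager', 'PROJECT Lead', 'Senior PR/Comms', 'BI Analyst', 'MOBILE Dev'])

substring_hits = titles.str.contains('PR', regex=False)
whole_word_hits = contains_word(titles, 'PR')
print(substring_hits.tolist())
print(whole_word_hits.tolist())
```

With plain substring matching, 'PROJECT Lead' counts as a 'PR' title; with word boundaries it does not.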


We are left with only 177 null entries, which account for about 1% of the total rows. This is small enough to handle with a direct imputation method, so we'll simply impute these entries with the value "Other".

In [301]:
# Impute using "Other"
df6.loc[df6['function'].isnull(), 'function'] = "Other"

# Check the counts
df6['function'].isnull().sum()

0

In [302]:
df6['function'].value_counts()

Information Technology    2649
Engineering               2031
Sales                     1917
Management                1446
Customer Service          1392
Administrative             968
Marketing                  951
Other                      941
Education                  859
Health Care Provider       560
Design                     511
Production                 294
Human Resources            276
Accounting/Auditing        263
Consulting                 262
Writing/Editing            244
Business Development       237
Finance                    229
Project Management         196
Art/Creative               167
Quality Assurance          149
Data Analyst               117
Product Management         114
Business Analyst           107
Advertising                 94
General Business            89
Public Relations            84
Manufacturing               83
Supply Chain                72
Research                    65
Legal                       56
Training                    55
Distribu

#### v. Employment Type

There are 3435 null values in the employment type column. This time, we'll use a much faster and simpler imputation method to save time for the more intensive tasks later.

In [303]:
df6['employment_type'].isnull().sum()

3435

In [304]:
df6['employment_type'].value_counts()

Full-time    11457
Contract      1517
Part-time      774
Temporary      237
Other          225
Name: employment_type, dtype: int64

Let's check the keywords for imputation from title and description and see how much we can impute.

In [305]:
df6[df6['employment_type'].isnull()]['title'].str.contains('Part time').value_counts()

False    3432
True        3
Name: title, dtype: int64

In [306]:
df6[df6['employment_type'].isnull()]['title'].str.contains('Part Time').value_counts()

False    3424
True       11
Name: title, dtype: int64

In [307]:
df6[df6['employment_type'].isnull()]['title'].str.contains('Part-Time').value_counts()

False    3427
True        8
Name: title, dtype: int64

In [308]:
df6[df6['employment_type'].isnull()]['title'].str.contains('part time').value_counts()

False    3432
True        3
Name: title, dtype: int64

In [309]:
df6[df6['employment_type'].isnull()]['title'].str.contains('PART TIME').value_counts()

False    3435
Name: title, dtype: int64

In [310]:
df6[df6['employment_type'].isnull()]['title'].str.contains('PART-TIME').value_counts()

False    3435
Name: title, dtype: int64

Honestly, that's not much, but at least there's something to impute. What about contract?

In [311]:
df6[(df6['employment_type'].isnull()) & (~df6['title'].str.contains('Contractor', na=False))]['title'].str.contains('Contract').value_counts()

False    3422
True       13
Name: title, dtype: int64

In [312]:
df6[(df6['title'].str.contains('Contract')) & (~df6['title'].str.contains('Contractor', na=False)) & \
    (df6['employment_type'].isnull())]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
1533,Games Artist - Contract,"GB, LND, London",,,"<p>ustwo offers you the opportunity to be yourself, whilst delivering the best work on the plane...",<p>Do you want to make the type of games you love as part of a close-knit and talented team?<br>...,"<p>Must have...</p>\r\n<p>• Experience using Unity3D, including material setup, particle effects...",,f,t,f,,Mid-Senior level,Unspecified,,Art/Creative,f,f,GB,LND,London
4191,PHP Developer / Web Developer - Phoenix AZ (Contract),"US, AZ, Phoenix",IT,,,<p><b>Duties and Responsibilities:</b></p>\r\n<ul>\r\n<li>Develop and maintain websites and func...,<p><b>Desired Qualifications:&nbsp;</b><br />Experience with:</p>\r\n<ul>\r\n<li>3 - 4 years of ...,,f,f,t,,Mid-Senior level,Bachelor's Degree,,Information Technology,f,f,US,AZ,Phoenix
5784,"iLog Developer - Contract in Mclean, VA","US, VA, Mclean",,,"<p>SampraSoft is a fast growing IT solutions company headquartered in Atlanta, GA, USA, speciali...","<p>Contract through July 2015</p>\r\n<p>Location: Mclean, VA</p>\r\n<p>2 positions</p>\r\n<p>Hi...",<p></p>\r\n<p>Required Skills:</p>\r\n<p> </p>\r\n<p>• Bachelor's degree in either Computer Scie...,,f,t,f,,Mid-Senior level,Bachelor's Degree,,Information Technology,f,f,US,VA,Mclean
6881,Contract Recruitment Specialist,"US, TX, Houston",Engineering,,,<p><b> </b></p>\r\n<ul>\r\n<li>Establish a functional client / service relationship with interna...,<p>Job Requirements</p>\r\n<p><b>Functional Requirements:</b></p>\r\n<ul>\r\n<li>Conduct interne...,,f,f,t,,Mid-Senior level,Bachelor's Degree,,Human Resources,t,f,US,TX,Houston
7674,Contract Office Assistant Needed!,"US, VA, McLean",,,,<p>Does the possibility of working for a nationally ranked accounting firm appeal to you? Our cl...,<p><b>The ideal candidate will possess the following qualifications:</b></p>\r\n<ul>\r\n<li>Bach...,<ul>\r\n<li>Exposure to a nationally ranked firm</li>\r\n<li>Great experience in a fast paced en...,f,f,t,,Mid-Senior level,Bachelor's Degree,,Administrative,f,f,US,VA,McLean
7765,Contract SilverStripe Developer,"NZ, , Auckland",,,<p>SilverStripe CMS &amp; Framework is an open source platform of web development tools. The pla...,"<p>We are looking for a contract developer with SilverStripe experience to work full-time, on-si...",,,f,t,t,,Mid-Senior level,Empty requirements,,Information Technology,f,f,NZ,,Auckland
9192,Contracts & Compliance Manager,"US, TX, Austin",Legal,,"<p><b>Why CSD?</b></p>\r\n<p>CSD is not only a great place to work, but also to learn, grow and ...","<h3><b>CSD's Contracts and Compliance Manager oversees contractual compliance, maintenance and a...",<ul>\r\n<li>Minimum of a Bachelor's degree in related field or equivalent work experience requir...,<p><b><b><b>CSD offers a competitive benefits package for full-time employees. For a full list o...,f,t,t,,Mid-Senior level,Bachelor's Degree,,Legal,f,f,US,TX,Austin
11403,Contract SilverStripe Developer,"NZ, , Wellington",,,<p>SilverStripe CMS &amp; Framework is an open source platform of web development tools. The pla...,"<p>We are looking for a contract SilverStripe developer to work on-site with us for 1-2 months, ...",,,f,t,f,,Mid-Senior level,Empty requirements,,Information Technology,f,f,NZ,,Wellington
13867,Contract Gameplay Programmer,"US, CA, San Francisco",,,,<p>Rogue Rocket Games has an immediate opening for a strong Gameplay Programmer on contract to h...,<ul>\r\n<li>Unity3D Experience</li>\r\n<li>Strong C# proficiency</li>\r\n<li>A strong base under...,,t,f,f,,Mid-Senior level,Unspecified,,Information Technology,f,f,US,CA,San Francisco
14726,Copywriter Extraordinaire (Part-Time/Contract),"AU, VIC, Carlton",Creative,,"<p><a href=""#URL_cceebe444fcb31d92efbb5450f8a5c58e215f49c12e7c3c48dc884bb4c7f78dc#"" rel=""nofollo...","<h3>Company Overview:</h3>\r\n<p><a href=""#URL_cceebe444fcb31d92efbb5450f8a5c58e215f49c12e7c3c48...",<h3>Key Responsibilities:</h3>\r\n<ul>\r\n<li>Work with us on building our brand message</li>\r\...,<ul>\r\n<li>Opportunity to experience a fast-growth startup environment</li>\r\n<li>Opportunitie...,f,t,t,,Mid-Senior level,Bachelor's Degree,,Writing/Editing,f,f,AU,VIC,Carlton
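The "contains 'Contract' but not 'Contractor'" filter above can alternatively be written as a single pattern with a negative lookahead. This is a sketch on made-up titles, and it behaves slightly differently from the original filter: a title that contains both 'Contractor' and a separate 'Contract' is kept here, whereas the original filter drops it.

```python
import pandas as pd

# Made-up titles for illustration
titles = pd.Series([
    'Games Artist - Contract',
    'Contracts & Compliance Manager',
    'General Contractor',
    'Contractor / Contract Role',
    'Sales Associate',
])

# 'Contract' that is not immediately followed by 'or'
is_contract = titles.str.contains(r'Contract(?!or)', regex=True, na=False)
print(is_contract.tolist())
```

Like the original filter, this still matches 'Contracts & Compliance Manager', which is a job about contracts rather than a contract position, so the matched rows are worth reviewing before imputing.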


Now it's time to check the description column for the same keywords.

In [313]:
df6[df6['employment_type'].isnull()]['description'].str.contains('Part time').value_counts()

False    3424
True       11
Name: description, dtype: int64

In [314]:
df6[df6['employment_type'].isnull()]['description'].str.contains('Part Time').value_counts()

False    3417
True       18
Name: description, dtype: int64

In [315]:
df6[df6['employment_type'].isnull()]['description'].str.contains('Part-Time').value_counts()

False    3431
True        4
Name: description, dtype: int64

In [316]:
df6[df6['employment_type'].isnull()]['description'].str.contains('part time').value_counts()

False    3397
True       38
Name: description, dtype: int64

In [317]:
df6[df6['employment_type'].isnull()]['description'].str.contains('PART TIME').value_counts()

False    3434
True        1
Name: description, dtype: int64

In [318]:
df6[df6['employment_type'].isnull()]['description'].str.contains('PART-TIME').value_counts()

False    3432
True        3
Name: description, dtype: int64

The same check for contract jobs:

In [319]:
df6[(df6['employment_type'].isnull()) & \
    (~df6['description'].str.contains('Contractor', na=False))]['description'].str.contains('Contract').value_counts()

False    3404
True       28
Name: description, dtype: int64

What about full-time positions? This may seem redundant, but we want to be sure that we can use this keyword as well.

In [320]:
df6[df6['employment_type'].isnull()]['description'].str.contains('Full time').value_counts()

False    3417
True       18
Name: description, dtype: int64

In [321]:
df6[df6['employment_type'].isnull()]['description'].str.contains('Full Time').value_counts()

False    3381
True       54
Name: description, dtype: int64

In [322]:
df6[df6['employment_type'].isnull()]['description'].str.contains('Full-Time').value_counts()

False    3430
True        5
Name: description, dtype: int64

In [323]:
df6[df6['employment_type'].isnull()]['description'].str.contains('full time').value_counts()

False    3328
True      107
Name: description, dtype: int64

In [324]:
df6[df6['employment_type'].isnull()]['description'].str.contains('FULL TIME').value_counts()

False    3432
True        3
Name: description, dtype: int64

In [325]:
df6[df6['employment_type'].isnull()]['description'].str.contains('FULL-TIME').value_counts()

False    3433
True        2
Name: description, dtype: int64
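The case-by-case checks above could be collapsed into one case-insensitive pattern per employment type. The following is a sketch on made-up descriptions. Note that `[ -]?` also accepts spellings with no separator at all (for example 'parttime'), which the individual checks above do not cover.

```python
import pandas as pd

# Made-up descriptions for illustration
descriptions = pd.Series([
    'Part time role available',
    'We offer PART-TIME hours',
    'work full-time, on-site',
    'Full Time position',
    'parttime helper',
    None,
])

# One pattern per employment type: optional space or hyphen, any casing
part_time_mask = descriptions.str.contains(r'part[ -]?time', case=False, regex=True, na=False)
full_time_mask = descriptions.str.contains(r'full[ -]?time', case=False, regex=True, na=False)
print(part_time_mask.sum(), full_time_mask.sum())
```

Passing `na=False` treats missing descriptions as non-matches, so the masks can be used directly for filtering.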

Let's check whether any entries contain both full-time and part-time keywords.

In [326]:
df6[
    ((df6['description'].str.contains('Full time')) | (df6['description'].str.contains('Full Time')) | \
     (df6['description'].str.contains('Full-Time')) | (df6['description'].str.contains('full time')) | \
     (df6['description'].str.contains('FULL TIME')) | (df6['description'].str.contains('FULL-TIME'))) & \
    ((df6['description'].str.contains('Part time')) | (df6['description'].str.contains('Part Time')) | \
     (df6['description'].str.contains('Part-Time')) | (df6['description'].str.contains('part time')) | \
     (df6['description'].str.contains('PART TIME')) | (df6['description'].str.contains('PART-TIME'))) & \
    (df6['employment_type'].isnull())
]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
214,Recruiter/Recruiting Assistant,"US, CA, Inglewood",,,,<p><i>“We believe our best investment is in our people.”</i> – Healthy Spot Core Value #8</p>\r\...,,,f,f,f,,Entry level,Empty requirements,,Human Resources,f,f,US,CA,Inglewood
266,Assistant Retail Manager -- Must LOVE Dogs,"US, CA, West Hollywood",,,,<p><i>“Pride is a personal commitment. It is an attitude which separates excellence from medioc...,,,f,f,f,,Mid-Senior level,Empty requirements,,Management,f,f,US,CA,West Hollywood
680,Front-end web developer (CSS/HTML),"GB, ,",,,"<h3>Inviting inspirational individuals</h3>\r\n<p>We’re fast becoming a world-class company, mak...",<p>We help charities to transform the lives of millions of people in need. Our online fundraisin...,,<p>£ 35-45</p>,f,t,t,,Mid-Senior level,Empty requirements,,Information Technology,f,f,GB,,
695,All round web design superstar for cutting edge non-profit websites,"GB, , Angel, London",,,"<h3>Inviting inspirational individuals</h3>\r\n<p>We’re fast becoming a world-class company, mak...","<p><b>Us</b></p>\r\n<p>Raising IT is growing quickly! We create stunning, mobile optimised sites...",<p>See above</p>,<p>£ Leading market rates</p>,f,t,t,,Mid-Senior level,Unspecified,,Design,f,f,GB,,"Angel, London"
831,Retail Manager -- Must LOVE Dogs,"US, , West Hollywood",,,,<p><i>“Pride is a personal commitment. It is an attitude which separates excellence from medioc...,,,f,f,f,,Mid-Senior level,Empty requirements,,Management,f,f,US,,West Hollywood
1127,Caregiver - Watervliet/Hartford,"US, MI, Watervliet",,,"<p>""Our mission to our clients is to preserve their independence, enhance their quality of life,...",<p>Home Sweet Home In-Home Care is one of the fastest growing home care agencies in Southwest Mi...,,<ul>\r\n<li>Competitive compensation with performance reviews</li>\r\n<li>Paid orientation and t...,f,t,t,,Mid-Senior level,High School or equivalent,Hospital & Health Care,Health Care Provider,f,f,US,MI,Watervliet
1364,Retail Staff Member- Los Angeles Area,"US, CA,",,,,<p><i>“Pride is a personal commitment. It is an attitude which separates excellence from medioc...,,,f,f,f,,Mid-Senior level,Empty requirements,,Sales,f,f,US,CA,
1493,Assistant Retail Manager -- Must LOVE Dogs,"US, CA, West Hollywood",,,,<p><i>“Pride is a personal commitment. It is an attitude which separates excellence from medioc...,,,f,f,f,,Mid-Senior level,Empty requirements,,Management,f,f,US,CA,West Hollywood
3166,Earn Money Working From Home,"US, MA, Boston",,,,<p>Are you finding it harder to work for a boss? Have you always wanted a better work / life ba...,,"<p>By following our simple 3 step system, on a part time basis, you have the potential to earn a...",t,f,t,,Not Applicable,Empty requirements,,General Business,t,t,US,MA,Boston
3585,Teller Supervisor,"US, MO, St. Peters",Teller,,<p>Missouri Valley Federal Credit Union (MOVFCU) was chartered in 1975 as CTC Central Region Fed...,"<ul>\r\n<li>Supervises the activities of the teller operations area by assigning work, answering...","<p>A college degree or a minimum of three years experience, two of which should be in progressiv...",,f,t,t,,Mid-Senior level,Bachelor's Degree,,Management,f,f,US,MO,St. Peters


In [327]:
len(df6[
    ((df6['description'].str.contains('Full time')) | (df6['description'].str.contains('Full Time')) | \
     (df6['description'].str.contains('Full-Time')) | (df6['description'].str.contains('full time')) | \
     (df6['description'].str.contains('FULL TIME')) | (df6['description'].str.contains('FULL-TIME'))) & \
    ((df6['description'].str.contains('Part time')) | (df6['description'].str.contains('Part Time')) | \
     (df6['description'].str.contains('Part-Time')) | (df6['description'].str.contains('part time')) | \
     (df6['description'].str.contains('PART TIME')) | (df6['description'].str.contains('PART-TIME'))) & \
    (df6['employment_type'].isnull())
])

34

The code above does not capture all the entries we want; 34 seems far too few.
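One likely reason for the low count is that the chained conditions list only a handful of exact spellings and leave out the lowercase hyphenated forms (`full-time`, `part-time`). The filter on null `employment_type` narrows the result further. Below is a minimal sketch of a more forgiving check, assuming a single case-insensitive regular expression is acceptable. The toy frame `demo` and the pattern names `full_pat` and `part_pat` are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for df6
demo = pd.DataFrame({
    "description": [
        "Work full-time or part-time from home",
        "FULL TIME and Part Time shifts available",
        "Full time only",
        None,
    ],
    "employment_type": [None, None, None, None],
})

# One case-insensitive pattern covers 'Full time', 'full-time', 'FULL-TIME', ...
full_pat = r"full[ -]time"
part_pat = r"part[ -]time"

# na=False treats missing descriptions as "no match" instead of NaN
has_full = demo["description"].str.contains(full_pat, case=False, regex=True, na=False)
has_part = demo["description"].str.contains(part_pat, case=False, regex=True, na=False)

# Rows that mention both keywords and still have no employment type
both = demo[has_full & has_part & demo["employment_type"].isnull()]
print(len(both))
```

Because the mask is built once per keyword, each row can match at most once, so no `set` de-duplication is needed afterwards.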

In [328]:
df6.loc[7765, 'description']

'<p>We are looking for a contract developer with SilverStripe experience to work full-time, on-site with one of our clients for several months, as part of our "Field Agent" programme.</p>'

In [329]:
df6.loc[17790, 'description']

'<p>Weekend Cash Jobs Part time &amp; Full time.<br>No Experience Required And Never Any Fees.<br>Work Anytime 1 To 2 Hrs Daily In Free Time.<br>Earn Easily $400 To $500 Extra Per Day.<br>Totally Free To Join &amp; Suitable For All.<br>Take Action &amp; Get Started Here:-<br>#URL_3642a95d0b2308884802999b8ba4f004b69950c970d00995af84c2270b7b570c#</p>'

Let's create a new value for postings that mention both full-time and part-time positions.

In [330]:
full_time = ['Full time', 'Full Time', 'Full-Time', 'full time', 'FULL TIME', 'FULL-TIME', 'full-time']
part_time = ['Part time', 'Part Time', 'Part-Time', 'part time', 'PART TIME', 'PART-TIME', 'part-time']

full_part_ind_des = []
for full in full_time:
    for part in part_time:
        indexes = list(df6[(df6['description'].str.contains(full)) & (df6['description'].str.contains(part))].index)
        full_part_ind_des += indexes
print("Number of index extracted:", len(full_part_ind_des))

full_part_ind_des = list(set(full_part_ind_des))
print("Number of index extracted after applying set:", len(full_part_ind_des))

Number of index extracted: 262
Number of index extracted after applying set: 248


What about the title column? Let's check.

In [331]:
full_part_ind_title = []
for full in full_time:
    for part in part_time:
        indexes = list(df6[(df6['title'].str.contains(full)) & (df6['title'].str.contains(part))].index)
        full_part_ind_title += indexes
print("Number of index extracted:", len(full_part_ind_title))

full_part_ind_title = list(set(full_part_ind_title))
print("Number of index extracted after applying set:", len(full_part_ind_title))

Number of index extracted: 10
Number of index extracted after applying set: 10


Keep in mind that some indexes may appear in both lists, so we should cross-check. With no overlap the combined count would be 258 (248 + 10), but we expect fewer.

In [332]:
len(list(set(full_part_ind_des + full_part_ind_title)))

252

This confirms that 6 of the indexes extracted from titles duplicate those already extracted from descriptions, leaving 252 unique indexes to consider for the new employment type.
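The overlap can also be computed directly rather than inferred from the totals. With the counts printed above, 248 + 10 - 252 = 6. The sketch below shows the same arithmetic on small hypothetical lists; `from_description` and `from_title` are stand-ins for `full_part_ind_des` and `full_part_ind_title`, and both are assumed to be de-duplicated already.

```python
# Hypothetical small index lists standing in for the two extracted lists
from_description = [214, 266, 428, 570]
from_title = [428, 570, 17790]

union = set(from_description) | set(from_title)    # unique indexes overall
overlap = set(from_description) & set(from_title)  # indexes found in both columns

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
assert len(union) == len(from_description) + len(from_title) - len(overlap)
print(len(union), len(overlap))
```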

In [333]:
sorted(list(set(full_part_ind_des + full_part_ind_title)))

[214,
 266,
 428,
 570,
 585,
 586,
 680,
 695,
 763,
 831,
 834,
 907,
 1098,
 1126,
 1127,
 1165,
 1292,
 1364,
 1466,
 1493,
 1494,
 1788,
 1795,
 1796,
 2413,
 2691,
 2802,
 2854,
 2855,
 2873,
 2942,
 3040,
 3074,
 3133,
 3166,
 3277,
 3282,
 3301,
 3585,
 3603,
 3732,
 3764,
 3815,
 3837,
 3842,
 3869,
 3881,
 4095,
 4172,
 4181,
 4252,
 4277,
 4299,
 4348,
 4355,
 4413,
 4434,
 4468,
 4503,
 4506,
 4628,
 4675,
 4725,
 4727,
 4737,
 4835,
 4899,
 4923,
 4999,
 5018,
 5036,
 5180,
 5253,
 5279,
 5362,
 5453,
 5542,
 5590,
 5720,
 5795,
 5860,
 5974,
 6199,
 6310,
 6311,
 6318,
 6398,
 6463,
 6487,
 6549,
 6550,
 6650,
 6653,
 6864,
 6928,
 7238,
 7258,
 7615,
 7654,
 7656,
 7657,
 7658,
 7659,
 7660,
 7661,
 7662,
 7663,
 7665,
 7666,
 7743,
 7766,
 7877,
 8051,
 8261,
 8381,
 8390,
 8514,
 8684,
 8702,
 8706,
 8769,
 8838,
 8851,
 9336,
 9349,
 9412,
 9450,
 9475,
 9497,
 9584,
 10066,
 10108,
 10312,
 10537,
 10542,
 10638,
 10817,
 10919,
 11182,
 11196,
 11228,
 11318,
 11392

In [334]:
df6.loc[sorted(list(set(full_part_ind_des + full_part_ind_title))), 'employment_type'].isnull().value_counts()

False    148
True     104
Name: employment_type, dtype: int64

In [335]:
df6.loc[sorted(list(set(full_part_ind_des + full_part_ind_title))), 'employment_type'].value_counts()

Full-time    72
Part-time    52
Contract     12
Other         8
Temporary     4
Name: employment_type, dtype: int64

In [336]:
df6.loc[sorted(list(set(full_part_ind_des + full_part_ind_title))), :].loc[
    ~df6['employment_type'].isin(['Contract', 'Temporary']), :
]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
214,Recruiter/Recruiting Assistant,"US, CA, Inglewood",,,,<p><i>“We believe our best investment is in our people.”</i> – Healthy Spot Core Value #8</p>\r\...,,,f,f,f,,Entry level,Empty requirements,,Human Resources,f,f,US,CA,Inglewood
266,Assistant Retail Manager -- Must LOVE Dogs,"US, CA, West Hollywood",,,,<p><i>“Pride is a personal commitment. It is an attitude which separates excellence from medioc...,,,f,f,f,,Mid-Senior level,Empty requirements,,Management,f,f,US,CA,West Hollywood
428,Nightlife Editor,"GB, LND, London",,,<p>DICE gets fans the best tickets at face value with No Booking Fees. We're based in Shoreditch...,<p>DICE is building an editorial team in London.</p>\r\n<p>You’re a music obsessive. You live an...,<p>You live in London</p>\r\n<p>A deep understanding of the London clubbing scene and an amazing...,<p>You'll be working with smart people who have amazing ideas that often become reality. We have...,f,t,t,,Associate,Unspecified,,Writing/Editing,f,f,GB,LND,London
570,EMTs (Lift Coaches) Marin,"US, CA, Marin",,,"<p>At Atlas Lift Tech, safety always comes first! We are a fast growing company with an innovat...",<h3>We are looking for EMTs to become Lift Coaches at Atlas Lift Tech in the Marin Area.</h3>\r\...,<p><b>Position Responsibilities:</b></p>\r\n<ul>\r\n<li>Teaching safe patient handling methodolo...,<p>At Atlas Lift Tech we are innovators and we value individual contributions! We encourage cont...,f,t,t,Full-time,Entry level,High School or equivalent,Hospital & Health Care,Health Care Provider,f,f,US,CA,Marin
585,"Clinical Optometrists, Leicester & Nuneaton","GB, , Leicester",,,"<p>Newmedica is a dynamic, innovative UK healthcare company that works in partnership with the N...",<p><b>General Ophthalmology and Glaucoma</b></p>\r\n<p><b>Leicester &amp; Nuneaton<br></b></p>\r...,<p><b>Personal:</b></p>\r\n<ul>\r\n<li>Enjoys the routine and rhythm of a process driven environ...,,f,t,t,Other,Not Applicable,Bachelor's Degree,Hospital & Health Care,Other,f,f,GB,,Leicester
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17790,Weekend Cash Jobs Part time & Full time.,"AU, NSW, Sydney",,,,<p>Weekend Cash Jobs Part time &amp; Full time.<br>No Experience Required And Never Any Fees.<br...,<p>Totally Free To Join &amp; Suitable For All.</p>,<p>Work Anytime 1 To 2 Hrs Daily In Free Time.</p>,f,f,f,Part-time,Not Applicable,Unspecified,,Other,t,t,AU,NSW,Sydney
17810,"Business Opportunity P/T,F/T Available","US, ,",,,,"<p>We have the demand. We are looking for people that are quick learners, and are very efficient...",,,f,f,f,,Mid-Senior level,Empty requirements,,Sales,t,t,US,,
17815,"Urgent Cash Jobs, Part Time & Full Time.","US, CA, Los Angeles",,,,"<p>Urgent Cash Jobs, Part Time &amp; Full Time.<br>No Experience Required And Never Any Fees.<br...",<p>Work Anytime 1 To 2 Hrs Daily In Free Time.</p>,<p>Totally Free To Join &amp; Suitable For All.</p>,f,f,f,Part-time,Entry level,Unspecified,,Other,t,t,US,CA,Los Angeles
17827,Student Positions Part-Time and Full-Time.,"US, CA, Los Angeles",,,,"<p>Student Positions Part-Time and Full-Time.<br>You can do it all from home, in your free time,...",,,f,f,f,Part-time,Not Applicable,Empty requirements,,Other,t,t,US,CA,Los Angeles


In [337]:
len(df6.loc[sorted(list(set(full_part_ind_des + full_part_ind_title))), :].loc[
    ~df6['employment_type'].isin(['Contract', 'Temporary']), :
])

236

After excluding the Contract and Temporary values, we have 236 entries (104 null entries plus 132 existing Full-time, Part-time and Other entries) that we can impute or change to the new value we intend to add.

In [338]:
df6.loc[sorted(list(set(full_part_ind_des + full_part_ind_title))), :].loc[
    ~df6['employment_type'].isin(['Contract', 'Temporary']), :
].index

Int64Index([  214,   266,   428,   570,   585,   586,   680,   695,   763,
              831,
            ...
            17677, 17687, 17700, 17713, 17730, 17790, 17810, 17815, 17827,
            17828],
           dtype='int64', length=236)

In [339]:
# Copy a new df
df7 = df6.copy()

# Fill the 236 rows with a new value: "Full-time & Part-time"
df7.loc[df6.loc[sorted(list(set(full_part_ind_des + full_part_ind_title))), :].loc[
    ~df6['employment_type'].isin(['Contract', 'Temporary']), :
].index, 'employment_type'] = "Full-time & Part-time"

In [340]:
df6['employment_type'].isnull().sum(), df7['employment_type'].isnull().sum()

(3435, 3331)

In [341]:
df7['employment_type'].value_counts()

Full-time                11385
Contract                  1517
Part-time                  722
Temporary                  237
Full-time & Part-time      236
Other                      217
Name: employment_type, dtype: int64

Next, let's search for the full-time and part-time keywords individually, rather than requiring both to appear as before.

In [342]:
full_time = ['Full time', 'Full Time', 'Full-Time', 'full time', 'FULL TIME', 'FULL-TIME', 'full-time']
part_time = ['Part time', 'Part Time', 'Part-Time', 'part time', 'PART TIME', 'PART-TIME', 'part-time']

full_ind_des = []
for full in full_time:
    indexes = list(df7[(df7['description'].str.contains(full)) & (df7['employment_type'].isnull())].index)
    full_ind_des += indexes
print("Number of index extracted:", len(full_ind_des))

full_ind_des = list(set(full_ind_des))
print("Number of index extracted after applying set:", len(full_ind_des))
print("\n")

part_ind_des = []
for part in part_time:
    indexes = list(df7[(df7['description'].str.contains(part)) & (df7['employment_type'].isnull())].index)
    part_ind_des += indexes
print("Number of index extracted:", len(part_ind_des))

part_ind_des = list(set(part_ind_des))
print("Number of index extracted after applying set:", len(part_ind_des))

Number of index extracted: 212
Number of index extracted after applying set: 211


Number of index extracted: 85
Number of index extracted after applying set: 81


In [343]:
full_ind_title = []
for full in full_time:
    indexes = list(df7[(df7['title'].str.contains(full)) & (df7['employment_type'].isnull())].index)
    full_ind_title += indexes
print("Number of index extracted:", len(full_ind_title))

full_ind_title = list(set(full_ind_title))
print("Number of index extracted after applying set:", len(full_ind_title))
print("\n")

part_ind_title = []
for part in part_time:
    indexes = list(df7[(df7['title'].str.contains(part)) & (df7['employment_type'].isnull())].index)
    part_ind_title += indexes
print("Number of index extracted:", len(part_ind_title))

part_ind_title = list(set(part_ind_title))
print("Number of index extracted after applying set:", len(part_ind_title))

Number of index extracted: 20
Number of index extracted after applying set: 20


Number of index extracted: 23
Number of index extracted after applying set: 23


In [344]:
len(sorted(list(set(full_ind_des + full_ind_title)))), len(sorted(list(set(part_ind_des + part_ind_title))))

(227, 92)

Now we can impute the null values with full-time and part-time respectively.

In [345]:
# For full-time
df7.loc[sorted(list(set(full_ind_des + full_ind_title))), 'employment_type'] = "Full-time"

# For part-time
df7.loc[sorted(list(set(part_ind_des + part_ind_title))), 'employment_type'] = "Part-time"

In [346]:
df7['employment_type'].isnull().sum()

3012
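The fill-only-where-null pattern used in the last few cells can be wrapped in a small helper. This is a sketch under stated assumptions, not the notebook's method: `impute_by_keyword` is a hypothetical function, and it matches case-insensitively with a regular expression, whereas the cells above loop over an explicit, case-sensitive list of spellings.

```python
import pandas as pd

def impute_by_keyword(df, columns, pattern, value, target="employment_type"):
    """Fill `target` with `value` where it is null and any of `columns` matches `pattern`."""
    matched = pd.Series(False, index=df.index)
    for col in columns:
        # na=False makes missing text count as "no match"
        matched |= df[col].str.contains(pattern, case=False, regex=True, na=False)
    mask = matched & df[target].isnull()  # never overwrite an existing label
    out = df.copy()
    out.loc[mask, target] = value
    return out, int(mask.sum())

# Hypothetical toy data standing in for df7
demo = pd.DataFrame({
    "title": ["Full Time Nanny", "Driver", "Cashier", "Clerk"],
    "description": ["Care for two kids", "part-time evening shifts",
                    "FULL-TIME role", "Full time desk job"],
    "employment_type": [None, None, "Contract", None],
})

demo, n_full = impute_by_keyword(demo, ["title", "description"], r"full[ -]time", "Full-time")
demo, n_part = impute_by_keyword(demo, ["title", "description"], r"part[ -]time", "Part-time")
print(n_full, n_part)
print(demo["employment_type"].tolist())
```

Returning the number of filled rows alongside the frame makes it easy to reconcile each step against the drop in the null count.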

What about the requirements column?

In [347]:
full_ind_req = []
for full in full_time:
    indexes = list(df7[(df7['requirements'].str.contains(full)) & (df7['employment_type'].isnull())].index)
    full_ind_req += indexes
print("Number of index extracted:", len(full_ind_req))

full_ind_req = list(set(full_ind_req))
print("Number of index extracted after applying set:", len(full_ind_req))
print("\n")

part_ind_req = []
for part in part_time:
    indexes = list(df7[(df7['requirements'].str.contains(part)) & (df7['employment_type'].isnull())].index)
    part_ind_req += indexes
print("Number of index extracted:", len(part_ind_req))

part_ind_req = list(set(part_ind_req))
print("Number of index extracted after applying set:", len(part_ind_req))

Number of index extracted: 86
Number of index extracted after applying set: 86


Number of index extracted: 6
Number of index extracted after applying set: 6


In [348]:
# For full-time
df7.loc[sorted(full_ind_req), 'employment_type'] = "Full-time"

# For part-time
df7.loc[sorted(part_ind_req), 'employment_type'] = "Part-time"

In [349]:
df7['employment_type'].isnull().sum()

2922

The remaining value to impute is Contract. Let's see how many row indexes the same approach can extract.

In [350]:
exclude_words = ["contract terms", "contracts, and real", "contract and fee", "OTC contracts", 
                 "Official employment, contract, visa", "Contract management in", "close contracts", 
                 "contract running out", "engineering and contracting", "Contract Management", "contracts and forms", 
                 "Negotiating contracts with", "Construction Contracting", "product requirements and contract", 
                 "to obtain contracts", "contracted to help", "signing the contracts", "accordance with contract", 
                 "contractual standards", "contracting professional gamers", "contracted security guards", 
                 "work with the contracts department", "digitalize contracts", "vendors contracts", 
                 "contracts and client expectations", "through contract award", "including contract payment", 
                 "discussing contract, follow", "contract negotiations", "contractual obligations", 
                 "establishment of contracts", "other contract members", "contract drafts with", "staffing contracts", 
                 "tenders for new contracts", "establishing contracts for", "all legal contract", 
                 "contractual and licensing", "zero hours contracts!", "legal contracts", "contract drafting", 
                 "supplier contracts", "negotiating contracts", "proposals and contracts", "Audit contracts and", 
                 "Contractual and Negotiating", "Contractor", "contractor"]

con_ind_des = []
for con in ['Contract', 'contract', 'CONTRACT']:
    indexes = list(df7[(df7['description'].replace(exclude_words + full_time, '', regex=True).str.contains(con, case=False)) & \
                       (df7['employment_type'].isnull())].index)
    con_ind_des += indexes
print("Number of index extracted:", len(con_ind_des))

con_ind_des = list(set(con_ind_des))
print("Number of index extracted after applying set:", len(con_ind_des))

Number of index extracted: 291
Number of index extracted after applying set: 97
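Note that the first count (291) is exactly three times the second (97). Because `case=False` is passed to `str.contains`, all three spellings in the loop select the same rows, so a single pass would be enough. The sketch below illustrates this on hypothetical toy data. It also uses `re.escape` so that the exclusion phrases are matched literally rather than interpreted as regular expressions. The names `desc` and `exclude` are assumptions, and unlike the cell above the exclusion here is case-insensitive.

```python
import re
import pandas as pd

# Hypothetical toy descriptions standing in for df7['description']
desc = pd.Series([
    "6 month CONTRACT role",
    "Must handle contract negotiations",
    "Independent contractor wanted",
    "Permanent position",
])

# A short, hypothetical subset of exclusion phrases
exclude = ["contract negotiations", "contractor"]

# Escape each phrase so punctuation is matched literally, then join into one pattern
exclude_pat = "|".join(re.escape(p) for p in exclude)
cleaned = desc.str.replace(exclude_pat, "", case=False, regex=True)

# With case=False, every spelling in the loop selects the same rows
hits = [list(desc[cleaned.str.contains(w, case=False)].index)
        for w in ["Contract", "contract", "CONTRACT"]]
print(hits)
```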


In [351]:
len(exclude_words)

48

In [352]:
df7.loc[sorted(con_ind_des), :]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
20,Marketing Assistant,"US, TX, Austin",,,<p>IntelliBright was created to leverage enterprise level online business practices to generate ...,<p>IntelliBright is growing fast and is looking for a <b>Marketing Assistant </b>to join our tea...,<p><b>Job Requirements</b></p>\r\n<ul>\r\n<li>Assist in creating client online marketing campaig...,,f,t,f,,Associate,Unspecified,,Marketing,f,f,US,TX,Austin
288,Outside Sales Professional,"US, MO, Cape Girardo",,,"<p>ABC Supply Co., Inc. is the nation’s largest wholesale distributor of roofing and one of the ...","<p>As an Outside Sales Representative, you will develop and maintain a growing book of sales acc...","<p>As an Outside Sales Representative, you must have excellent sales talents as well as the will...","<p>As an Outside Sales Representative, you will receive <i>paid</i> sales training, which will i...",f,t,f,,Entry level,Bachelor's Degree,Building Materials,Sales,f,f,US,MO,Cape Girardo
440,Ruby developer,"GB, UKM, Stockholm, Sweden",,,<p>Eviture is a professional services firm that specialise in leading enterprise agile delivery ...,<p>We are looking for a full-stack Ruby software engineer on a 3-4 month contract working for a ...,<p>Essentials:</p>\r\n<ul>\r\n<li>Ruby</li>\r\n<li>Rails</li>\r\n<li>EU passport</li>\r\n<li>Eng...,,f,t,f,,Mid-Senior level,Unspecified,Banking,Information Technology,f,f,GB,UKM,"Stockholm, Sweden"
1148,Storage Administrator or Engineer (Local candidates Required)),"US, VA, Reston",,,,<p><br />Title: Storage Administrator or Engineer<br />Term: Longterm Contract<br />Location: Re...,"<p>Consultants must have a Solid understanding on <br />AIX, Linux, Solaris, and Windows Operati...",,f,f,f,,Mid-Senior level,Bachelor's Degree,,Information Technology,f,f,US,VA,Reston
1160,Bar manager in the hotel St. Regis Doha (Qatar),"QA, , Doha",,,<p><b>ClarusApex</b> is an international recruiting company with representations in the Netherla...,"<p><b>We are looking for people to join our team, with a passion for great customer service, who...","<p><b>Requirements:</b><br>Excellent communication skills and good command of the English, Russi...",,f,t,f,,Mid-Senior level,Unspecified,,Management,f,f,QA,,Doha
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17539,Cruise Staff Wanted *URGENT*,"US, CA, san diego",,,,<p><b>6* Ultra Luxury American Cruise Company is urgently looking for the following positions:</...,<p><b>Certification &amp; Experience:</b> Previous experience (not Required)<br>Good English spe...,"<p><b>Benefits:</b> On board en suite accommodation and food, Medical cover for duration of cont...",f,f,t,,Entry level,Unspecified,"Leisure, Travel & Tourism",Customer Service,t,t,US,CA,san diego
17628,Cruise Staff Wanted *URGENT*,"US, GA, ATLANTA",,,,<p><b>6* Ultra Luxury American Cruise Company is urgently looking for the following positions:</...,<p><b>Certification &amp; Experience:</b> Previous experience (not Required)<br>Good English spe...,"<p><b>Benefits:</b> On board en suite accommodation and food, Medical cover for duration of cont...",f,f,t,,Entry level,Unspecified,"Leisure, Travel & Tourism",Customer Service,t,t,US,GA,ATLANTA
17686,Cruise Staff Wanted *URGENT*,"US, CO, denver",,,,<p><b>6* Ultra Luxury American Cruise Company is urgently looking for the following positions:</...,<p><b>Certification &amp; Experience:</b> Previous experience (not Required)<br>Good English spe...,"<p><b>Benefits:</b> On board en suite accommodation and food, Medical cover for duration of cont...",f,f,t,,Entry level,Unspecified,"Leisure, Travel & Tourism",Customer Service,t,t,US,CO,denver
17703,Cruise Staff Wanted *URGENT*,"US, FL, fort lauderdale",,,,<p><b>6* Ultra Luxury American Cruise Company is urgently looking for the following positions:</...,<p><b>Certification &amp; Experience:</b> Previous experience (not Required)<br>Good English spe...,"<p><b>Benefits:</b> On board en suite accommodation and food, Medical cover for duration of cont...",f,f,f,,Entry level,Unspecified,"Leisure, Travel & Tourism",Customer Service,t,t,US,FL,fort lauderdale


In [353]:
df7.loc[7798, 'description']

'<p>The right candidate will assist in the development, planning and deployment of SEM campaigns.\xa0 Furthermore, he/she will also participate in tasks related to the monitoring of the company’s KPIs (Key Performance Indicators) and other performance metrics.</p>\r\n<p>\xa0Knowledge of various internet marketing channels, search engines, statistical analysis and general marketing principles, are considered great assets for the position.</p>\r\n<p>*Determined contract</p>\r\n<p><b><i>Responsibilities:</i></b></p>\r\n<p>-\xa0\xa0\xa0\xa0\xa0\xa0\xa0 Participate in the management, analysis and optimization of international Google Adwords campaigns.</p>\r\n<p>-\xa0\xa0\xa0\xa0\xa0\xa0\xa0 Assist in the development and implementation of Adwords strategies in the search- and display network.</p>\r\n<p>-\xa0\xa0\xa0\xa0\xa0\xa0\xa0 Assist in research and analyze keywords and advertisements.</p>\r\n<p>-\xa0\xa0\xa0\xa0\xa0\xa0\xa0 Participate in tasks related to Google analytics to prepare pa

After manual inspection, indexes 2205, 2959 and 7798 should be changed back to Full-time.

In [354]:
# Fill the values for description first for now
df7.loc[sorted(con_ind_des), 'employment_type'] = "Contract"

# Manual correction for 3 indexes
df7.loc[[2205, 2959, 7798], 'employment_type'] = "Full-time"

# Check the null counts
df7['employment_type'].isnull().sum()

2825

In [355]:
con_ind_title = []
for con in ['Contract', 'contract', 'CONTRACT']:
    indexes = list(df7[(df7['title'].str.contains(con, case=False)) & (df7['employment_type'].isnull()) & \
                       (~df7['title'].str.contains('Contractor', na=False)) & \
                       (~df7['title'].str.contains('contractor', na=False)) & \
                       (~df7['title'].str.contains('Full time', na=False)) & \
                       (~df7['title'].str.contains('Full Time', na=False)) & \
                       (~df7['title'].str.contains('Full-Time', na=False)) & \
                       (~df7['title'].str.contains('full time', na=False)) & \
                       (~df7['title'].str.contains('full-time', na=False)) & \
                       (~df7['title'].str.contains('FULL TIME', na=False)) & \
                       (~df7['title'].str.contains('FULL-TIME', na=False))].index)
    con_ind_title += indexes
print("Number of index extracted:", len(con_ind_title))

con_ind_title = list(set(con_ind_title))
print("Number of index extracted after applying set:", len(con_ind_title))

Number of index extracted: 15
Number of index extracted after applying set: 5


In [356]:
df7.loc[sorted(con_ind_title), :]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
11,Talent Sourcer (6 months fixed-term contract),"GB, LND, London",HR,,<p><b>Want to build a 21st century financial service?</b></p>\r\n<p>We're convinced that that th...,<p>TransferWise is the clever new way to move money between countries. Co-founded by Skype’s fir...,<p><b>We’re looking for someone who:</b></p>\r\n<ul>\r\n<li>Proven track record in sourcing acro...,<p>You will join one of Europe’s most hotly tipped startups with plenty of opportunities to grow...,f,t,f,,Mid-Senior level,Unspecified,,Human Resources,f,f,GB,LND,London
4191,PHP Developer / Web Developer - Phoenix AZ (Contract),"US, AZ, Phoenix",IT,,,<p><b>Duties and Responsibilities:</b></p>\r\n<ul>\r\n<li>Develop and maintain websites and func...,<p><b>Desired Qualifications:&nbsp;</b><br />Experience with:</p>\r\n<ul>\r\n<li>3 - 4 years of ...,,f,f,t,,Mid-Senior level,Bachelor's Degree,,Information Technology,f,f,US,AZ,Phoenix
6881,Contract Recruitment Specialist,"US, TX, Houston",Engineering,,,<p><b> </b></p>\r\n<ul>\r\n<li>Establish a functional client / service relationship with interna...,<p>Job Requirements</p>\r\n<p><b>Functional Requirements:</b></p>\r\n<ul>\r\n<li>Conduct interne...,,f,f,t,,Mid-Senior level,Bachelor's Degree,,Human Resources,t,f,US,TX,Houston
7674,Contract Office Assistant Needed!,"US, VA, McLean",,,,<p>Does the possibility of working for a nationally ranked accounting firm appeal to you? Our cl...,<p><b>The ideal candidate will possess the following qualifications:</b></p>\r\n<ul>\r\n<li>Bach...,<ul>\r\n<li>Exposure to a nationally ranked firm</li>\r\n<li>Great experience in a fast paced en...,f,f,t,,Mid-Senior level,Bachelor's Degree,,Administrative,f,f,US,VA,McLean
15550,Copywriter (Contract Position),"US, CA, San Diego",,,"<p>We’re Digital Telepathy, but our friends call us DT. Committed to being designers of the Web,...","<p>Here at Digital Telepathy, we’re charting a course towards making the web a better place, and...","<p>OUR REQUIREMENTS</p>\r\n<ul>\r\n<li>3+ years of writing experience, preferably within the des...",,f,t,t,,Mid-Senior level,Bachelor's Degree,,Writing/Editing,f,f,US,CA,San Diego


In [357]:
# Fill the values extracted from the title column
df7.loc[sorted(con_ind_title), 'employment_type'] = "Contract"

# Check the null counts
df7['employment_type'].isnull().sum()

2820

We also want to look at internship roles and see which employment types are assigned to them.

In [358]:
df7[(df7['required_experience'] == 'Internship') & (df7['employment_type'].notnull())]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
0,Marketing Intern,"US, NY, New York",Marketing,,"<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,f,t,f,Other,Internship,Unspecified,,Marketing,f,f,US,NY,New York
106,Gatwick Customer Service Apprenticeship 16-18 Year Olds Only,"GB, , Gatwick",,,<p>Established on the principles that full time education is not for everyone Spectrum Learning ...,<p>You must be 16-18 years old to apply for this position as it is an apprenticeship</p>\r\n<p>P...,<p>Must be 16-18 years olds</p>,<p>Career prospects</p>,f,t,t,Full-time,Internship,High School or equivalent,,Customer Service,f,f,GB,,Gatwick
128,Precision Ag Intern Spring 2015 $2000 Per Month,"US, IA, Harlan or Ames",,,<p>HTS Ag has been working with producers to prove the profitability of precision technology sin...,"<p>At<b> HTS Ag</b>, we attribute our success to our remarkable staff. We promote career growth...",<p>REQUIRED SKILLS: <br>*Requires a High School Diploma and 2+ years agricultural related experi...,<p>This position is not benefits eligible.</p>,f,t,t,Temporary,Internship,Some College Coursework Completed,Farming,Other,f,f,US,IA,Harlan or Ames
165,Sales and Marketing Intern,"GB, , London",,18500-28000,<p>Digital Shadows is a cyber threat intelligence company that protects organisations from data ...,<p><b>Please note that the deadline for applications for this position is Friday 15th August at ...,<p><b>Required skills and qualifications</b><br>The successful candidate will possess most or al...,"<p><b>Salary</b><br>Negotiable on experience. £18,500 - £28,000.<br><br></p>",f,t,t,Full-time,Internship,Bachelor's Degree,Computer Software,Marketing,f,f,GB,,London
289,"Intern with Google, Microsoft, Facebook and more! at Studyhall.com","US, DC, Washington",,,"<p><b>StudyHall</b> creates opportunities for college, university students, and recent graduates...",<p><b>#URL_ab309fb672a2b26317bd303c09c3c6762986d45c2bb1b4970cac579d697432e2#</b> is the #1 inter...,<p><b>Your must have core skills in ONE of the following: </b></p>\r\n<p>Writing Blog + Article...,<p>-Build Portfolio</p>\r\n<p>-Earn Money (Part-time or full-time)</p>\r\n<p>-Work with Top 25 C...,t,t,t,Other,Internship,Bachelor's Degree,Computer Software,Engineering,f,f,US,DC,Washington
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17394,Communications and Marketing Internship,"NZ, N, Auckland",,15000-30000,<ul>\r\n<li>\r\n<p>Frustrated with the status quo?</p>\r\n</li>\r\n<li>\r\n<p>Like to re-imagine...,<p>Want to get a head start on your peers and be part of a globally focused tech team designing ...,<p>The right candidate will have a positive attitude and the ability to work well under pressure...,"<p>We hope that this internship provides an exciting opportunity to learn in a fun, fast paced, ...",f,t,t,Part-time,Internship,Some College Coursework Completed,Consumer Electronics,Marketing,f,t,NZ,N,Auckland
17431,SA807: Object detection and embedded vision,"US, MA, Cambridge",SA,,<h3>MERL's internship program gives students excellent opportunities to work in an industrial re...,<p>MERL is looking for a self-motivated intern to work on the area of object detection and embed...,,,f,t,f,Temporary,Internship,Empty requirements,Research,Research,f,t,US,MA,Cambridge
17650,Data Entry,"US, FL, HILLIARD FL",,25-30,,<p>Prepares source data for computer entry by compiling and sorting information; establishing en...,<p>We are seeking extremely motivated and experienced individual for position of Data Entry cle...,"<p> Health, Dental, Life and AD&amp;D Insurance, Employee Wellness and 401k #URL_c801649eeb40077...",f,f,f,Contract,Internship,Unspecified,Consumer Services,Accounting/Auditing,t,t,US,FL,HILLIARD FL
17729,Intern Development Assistant,"US, CA, Los Angeles",Programming,,,<p>We Are Looking for college interns with the passion to be in the entertainment industry. Indu...,"<p>You will be working alongside of the Producers and writers, helping and developing shows to p...",,f,t,t,Other,Internship,High School or equivalent,Entertainment,Business Development,t,t,US,CA,Los Angeles


In [359]:
df7[(df7['required_experience'] == 'Internship') & (df7['employment_type'].notnull()) & \
    (df7['title'].str.contains("Intern", case=False))]['employment_type'].value_counts()

Full-time                138
Part-time                 69
Temporary                 62
Other                     55
Contract                  20
Full-time & Part-time      5
Name: employment_type, dtype: int64

It seems that Internship roles are commonly classified as Full-time, and sometimes even Part-time, jobs. The Full-time label is understandable, since an internship can be seen as a special kind of full-time job for students who have not graduated yet.

Next, we need to check the fraudulent entries and see how we should impute them.

In [360]:
df7['employment_type'].value_counts()

Full-time                11699
Contract                  1616
Part-time                  820
Temporary                  237
Full-time & Part-time      236
Other                      217
Name: employment_type, dtype: int64

In [361]:
len(df7[(df7['fraudulent'] == 't')]), len(df7[(df7['employment_type'].isnull()) & (df7['fraudulent'] == 't')])

(858, 152)
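In other words, 152 of the 858 fraudulent postings (about 17.7%) still have no employment type. To compare that share against the non-fraudulent class in one step, the null rate can be grouped by the target. This is a minimal sketch on hypothetical toy data, where the frame `demo` stands in for `df7`:

```python
import pandas as pd

# Hypothetical toy data standing in for df7
demo = pd.DataFrame({
    "fraudulent": ["t", "t", "t", "t", "f", "f", "f", "f", "f", "f"],
    "employment_type": [None, "Full-time", None, "Other", "Full-time", None,
                        "Contract", "Full-time", "Part-time", "Full-time"],
})

# Mean of a boolean mask = share of rows where employment_type is null
null_share = demo["employment_type"].isnull().groupby(demo["fraudulent"]).mean()
print(null_share)
```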

In [362]:
df7[(df7['employment_type'] == 'Other') & (df7['fraudulent'] == 't')].head(30)

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
1662,administrative assistance,"US, NY, Moravia",admin,13-20,,<p>This position is for an Administrative Assistant whose job will primarily consist of calling ...,<ul>\r\n<li>Must be proficient with Outlook</li>\r\n<li>Some knowledge of Quickbooks</li>\r\n<li...,"<p>Benefit includes: health and welfare coverage, domestic partner coverage, a retirement progra...",f,f,f,Other,Entry level,High School or equivalent,Accounting,Administrative,t,t,US,NY,Moravia
5691,Network Marketing,"US, DE,",,7200-1380000,,"<p>Are you looking to make anywhere from 600-115,000$ a month? Are you looking to be paid to tak...","<p>An ambition to succeed, A desire to be the best at our field and not be discourage when peopl...","<p>Residual Income, Travel dollars, Car Dollars, the ability to rise in pay grade rapidly</p>",f,f,f,Other,Not Applicable,Unspecified,Market Research,Marketing,t,t,US,DE,
6574,Home Based Commission Roles,"US, IN,",,,,"<p>A Variety of Commission based jobs available.</p>\r\n<p>Visit: <a href=""#URL_0a7c4df3a2fd5dd5...",<p>Internet Access</p>\r\n<p>Home PC or Laptop</p>\r\n<p>Commitment</p>,<p>Great Commission</p>\r\n<p>Be you own Boss</p>\r\n<p>Hours to Suit</p>,f,f,t,Other,Associate,Unspecified,Marketing and Advertising,Other,t,t,US,IN,
6854,Network Marketing,"US, AK,",,7200-1380000,,"<p>Are you looking to make anywhere from 600-115,000$ a month? Are you looking to be paid to tak...","<p>An ambition to succeed, A desire to be the best at our field and not be discourage when peopl...","<p>Residual Income, Travel dollars, Car Dollars, the ability to rise in pay grade rapidly</p>",f,f,f,Other,Not Applicable,Unspecified,Marketing and Advertising,Advertising,t,t,US,AK,
10953,Recruitment & Talent Acquisition Professional,"US, CA, San Francisco",Recruiter Network,,<p>Aptitude Staffing Solutions has redesigned the recruiting wheel. Our innovative new platform ...,<p></p>\r\n<p></p>\r\n<p>We are looking for top ranked Technical Recruiters to join our network ...,<ul>\r\n<li>3+ years in the recruiting and/or staffing industry</li>\r\n<li>3+ years in a techni...,"<ul>\r\n<li>Access to cutting-edge marketing, sourcing, job posting and recruiting technology &a...",f,t,t,Other,Mid-Senior level,Bachelor's Degree,Staffing and Recruiting,Human Resources,t,f,US,CA,San Francisco
17538,administrative assistance,US,admin,13-20,,<p>This position is for an Administrative Assistant whose job will primarily consist of calling ...,<ul>\r\n<li>Must be proficient with Outlook</li>\r\n<li>Some knowledge of Quickbooks</li>\r\n<li...,"<p>Benefit includes: health and welfare coverage, domestic partner coverage, a retirement progra...",f,f,f,Other,Entry level,High School or equivalent,Accounting,Administrative,t,t,US,,
17602,Network Marketing,"US, NH,",,7200-1380000,,"<p>Are you looking to make anywhere from 600-115,000$ a month? Are you looking to be paid to tak...","<p>An ambition to succeed, A desire to be the best at our field and not be discourage when peopl...","<p>Residual Income, Travel dollars, Car Dollars, the ability to rise in pay grade rapidly</p>",f,f,f,Other,Not Applicable,Unspecified,Marketing and Advertising,Marketing,t,t,US,NH,
17661,DATA ENTRY,"US, ,","Data Entry, Clerical Admin, Administrative Assistant, Customer Service, Accounting, payroll Cle...",,,"<p>We produce Networking Software (IOS &amp; NX-OS)Optical Networking , Routers,Toric Marker,Pre...",<p>High School</p>\r\n<p>Bachelors Degree</p>\r\n<p>6 Month accounting experience</p>\r\n<p>Expe...,"<p> Health, Dental, Life and AD&amp;D Insurance, Employee Wellness and 401k #URL_c801649eeb40077...",t,f,t,Other,Entry level,High School or equivalent,Computer Networking,Accounting/Auditing,t,t,US,,
17711,Network Marketing,"US, HI,",,7200-1380000,,"<p>Are you looking to make anywhere from 600-115,000$ a month? Are you looking to be paid to tak...","<p>An ambition to succeed, A desire to be the best at our field and not be discourage when peopl...","<p>Residual Income, Travel dollars, Car Dollars, the ability to rise in pay grade rapidly as lon...",f,f,f,Other,Not Applicable,Unspecified,Marketing and Advertising,Marketing,t,t,US,HI,
17722,administrative assistance,"US, NY, Moravia",admin,13-20,,<p></p>\r\n<p>We are Looking for a person with strong writing skills and demonstrable experience...,<ul>\r\n<li>Must be proficient with Outlook</li>\r\n<li>Some knowledge of Quickbooks</li>\r\n<li...,"<p><br>Benefit includes: health and welfare coverage, domestic partner coverage, a retirement pr...",f,f,f,Other,Entry level,High School or equivalent,Accounting,Administrative,t,t,US,NY,Moravia


In [363]:
df7[(df7['employment_type'].isnull()) & (df7['fraudulent'] == 't')].head(40)

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
144,Forward Cap.,,,,,<p>The group has raised a fund for the purchase of homes in the Southeast. The student on this p...,,,f,f,f,,Not Applicable,Empty requirements,,Financial Analyst,t,t,Undefined,Undefined,Undefined
180,Sales Executive,"PK, SD, Karachi",Sales,,,<p>Sales Executive</p>,<p>Sales Executive</p>,<p>Sales Executive</p>,f,f,f,,Associate,Unspecified,,Sales,t,t,PK,SD,Karachi
1217,Small Business Benefits Consultant,"US, IL, Schaumburg",,,<p>Anthony Warren is a Marketing and Advertising consultant. After completing one enlistment as...,<p>Colonial is looking for 5 sharp people to inform small business owners about the newest healt...,<p>Insurance license.</p>\r\n<p>Car</p>\r\n<p></p>,,f,t,f,,Associate,Unspecified,,Consulting,t,f,US,IL,Schaumburg
1878,KMC,,,,,<p>This is for the KMC project.</p>\r\n<p></p>\r\n<p>We are looking for someone who is reliable ...,,,f,f,f,,Not Applicable,Empty requirements,,Other,t,t,Undefined,Undefined,Undefined
2616,"Developer and Database Administrator Pittsburgh, PA","US, PA, Pittsburgh",,,,<p>This position is a mixture of Developer and Database administrator. The main focus is develop...,<p><b>Job Requirements </b></p>\r\n<p><b>EXPERIENCE</b></p>\r\n<p>· Minimum of 5 years of experi...,<p><b>Salary:82K</b></p>,f,f,f,,Associate,Bachelor's Degree,,Information Technology,t,f,US,PA,Pittsburgh
2927,"LEAN Program Manager, Carol Stream, IL","US, IL, Carol Stream",,,,"<p>The LEAN Programs Manager will lead strategic, enterprise-level initiatives to deliver short ...","<p><b>REQUIRED KNOWLEDGE, SKILLS AND ABILITIES </b></p>\r\n<p>· Master Black Belt certification ...",<p><b>Salary:120K</b></p>\r\n<p><b></b></p>,f,f,f,,Mid-Senior level,Bachelor's Degree,,Management,t,f,US,IL,Carol Stream
2939,Customer Assistant,"CA, ON, Toronto",,,"<p>Inctor Consulting is world wide known for advising, and giving answers to some of the most di...","<p>Every project we deal with involves many different specialists. Their talent and knowledge, s...","<p>- Attention to Detail.<br />- PC Proficiency, Proficient with MS Word and Excel.<br />- Abili...",,t,t,f,,Entry level,Unspecified,,Customer Service,t,f,CA,ON,Toronto
3175,Earn the Income You Deserve,"US, DC, Washington",,,,<p><b>Prepare yourself</b> to learn about an exciting way to earn money without leaving your hom...,<ul>\r\n<li>Professional manner</li>\r\n<li>Positive outlook</li>\r\n<li>Ability to work autonom...,"<p>If you have a laptop, phone and a strong desire to achieve success in your life then this is ...",t,f,t,,Entry level,Unspecified,,General Business,t,f,US,DC,Washington
3179,Home Based Payroll Data Entry Clerk Position - Earn $100-$200 Daily,GB,,,,"<p>We are a full-service marketing and staffing firm, serving companies ranging from Fortune 100...",<p>Requirements</p>\r\n<p>All you need is access to the Internet and you can participate. Comput...,<p>This is an entry level position and we offer full online training. You do NOT need any specia...,f,f,f,,Entry level,Unspecified,,Administrative,t,t,GB,,
3182,Home Based Payroll Data Entry Clerk Position - Earn $100-$200 Daily,"US, MS, Abbeville",,,,"<p>We are a full-service marketing and staffing firm, serving companies ranging from Fortune 100...",<p>Requirements</p>\r\n<p>All you need is access to the Internet and you can participate. Comput...,<p>This is an entry level position and we offer full online training. You do NOT need any specia...,f,f,f,,Entry level,Unspecified,,Administrative,t,t,US,MS,Abbeville


In [364]:
len(df7[(df7['employment_type'].isnull()) & (df7['fraudulent'] == 't') & \
        ((df7['title'].str.contains("Home Based", case=False)) | (df7['title'].str.contains("at Home", case=False)))])

33

In [365]:
len(df7[(df7['employment_type'].isnull()) & (df7['fraudulent'] == 't') & \
        (df7['description'].str.contains('We have several openings available in this area earning', case=False))])

20

In [366]:
df7[(df7['fraudulent'] == 't')]['employment_type'].value_counts()

Full-time                503
Part-time                 74
Contract                  61
Full-time & Part-time     51
Other                     15
Temporary                  2
Name: employment_type, dtype: int64

In [367]:
len(df7[(df7['employment_type'].isnull()) & (df7['fraudulent'] == 't') & \
        (df7['description'].str.contains('Internet Base Business'))])

2

In [368]:
df7[(df7['employment_type'].isnull()) & (df7['fraudulent'] == 't') & \
        (df7['description'].str.contains('Internet Base Business'))]

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
5441,Easy Money,"US, FL, Deltona",,,<p>DMT Instant Reward. We are Hiring all The Time.</p>,"<p>""Do You Want To Own Your Internet Base Business With No Money Down??""</p>\r\n<p>"" WHO WANT`S ...","<p>Computer, Internet and Telephone</p>",<p>Work from home Full-time or Part-time be your own BOSS.</p>,f,t,t,,Not Applicable,Bachelor's Degree,,General Business,t,f,US,FL,Deltona
17571,Work From Home (Easy Money),"US, FL, Orlando,Lake City, Jacksonville,Atlanta,Ocala,Miami,Asbury Park NJ, Belmar NJ, Toms Rive...",,,<p>DMT Instant Reward. We are Hiring all The Time.</p>,"<p>""Do You Want To Own Your Internet Base Business With No Money Down??""</p>\r\n<p>"" WHO WANT`S ...","<p>Computer, Internet and Telephone</p>\r\n<p> </p>","<p>Company looking to fill a few CSR places, Job duties are very simple this is internet Home Bu...",f,t,f,,Not Applicable,Bachelor's Degree,,General Business,t,t,US,FL,"Orlando,Lake City, Jacksonville,Atlanta,Ocala,Miami,Asbury Park NJ, Belmar NJ, Toms River NJ."


A few fraudulent home-based jobs already carry the employment type Other, so for simplicity we assign Other to the remaining home-based jobs whose employment type is missing. The same applies to the home-based internet business (easy money) jobs. All other fraudulent rows with a missing employment type are assumed to be Full-time positions, since there is no reliable way of telling whether those jobs are temporary or not.
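The rule above can also be written as a single vectorised step. The following is only a minimal sketch on made-up rows (the `toy` frame and its values are illustrative assumptions, not part of the dataset), using `np.where` to choose between Other and Full-time and `fillna` so that known values stay untouched.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for fraudulent postings; all values are made up
toy = pd.DataFrame({
    "title": ["Home Based Typist", "Easy Money", "Sales Executive",
              "Work at Home Clerk", "Accountant"],
    "description": ["Typing jobs", "Own your Internet Base Business",
                    "Sales", "Data entry", "Bookkeeping"],
    "employment_type": [np.nan, np.nan, np.nan, np.nan, "Contract"],
})

# Same text rules as in the prose: home-based titles and easy-money descriptions
home_based = toy["title"].str.contains("Home Based|at Home", case=False)
easy_money = toy["description"].str.contains("Internet Base Business", case=False)

# Candidate value for every row: Other if a rule matches, Full-time otherwise
candidate = pd.Series(np.where(home_based | easy_money, "Other", "Full-time"),
                      index=toy.index)

# fillna only touches rows that are actually missing
toy["employment_type"] = toy["employment_type"].fillna(candidate)
print(toy["employment_type"].tolist())
```

Building the candidate column first and filling afterwards keeps the rules in one place, which makes them easier to adjust than several separate assignments.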

In [369]:
# Home-based jobs
df7.loc[(df7['employment_type'].isnull()) & (df7['fraudulent'] == 't') & \
        ((df7['title'].str.contains("Home Based", case=False)) | (df7['title'].str.contains("at Home", case=False))), 
        'employment_type'] = "Other"

# Easy money jobs
df7.loc[(df7['employment_type'].isnull()) & (df7['fraudulent'] == 't') & \
        (df7['description'].str.contains('Internet Base Business', case=False)), 'employment_type'] = "Other"

# The rest will be Full-time
df7.loc[(df7['employment_type'].isnull()) & (df7['fraudulent'] == 't'), 'employment_type'] = "Full-time"

In [370]:
# Check the count again
len(df7[(df7['employment_type'].isnull()) & (df7['fraudulent'] == 't')])

0

In [371]:
# View the count distribution
df7[(df7['fraudulent'] == 't')]['employment_type'].value_counts()

Full-time                620
Part-time                 74
Contract                  61
Full-time & Part-time     51
Other                     50
Temporary                  2
Name: employment_type, dtype: int64

In [372]:
df7['employment_type'].isnull().sum()

2668

Let's see the count distribution up until this point before we make our next move.

In [373]:
df7['employment_type'].value_counts()

Full-time                11816
Contract                  1616
Part-time                  820
Other                      252
Temporary                  237
Full-time & Part-time      236
Name: employment_type, dtype: int64

In [374]:
df7[(df7['fraudulent'] == 'f')]['employment_type'].value_counts()

Full-time                11196
Contract                  1555
Part-time                  746
Temporary                  235
Other                      202
Full-time & Part-time      185
Name: employment_type, dtype: int64

Now that the fraudulent observations, which are the most crucial ones, have been dealt with, the non-fraudulent ones need less care. To speed things up, the remaining rows are assumed to be Full-time positions, since that is the mode for non-fraudulent observations.
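Instead of hard-coding the value, the mode can be computed from the non-fraudulent rows. This is a small sketch on a made-up frame (`toy` and `non_fraud_mode` are illustrative names), assuming the same 't'/'f' encoding of the fraudulent column.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names mirror the dataset
toy = pd.DataFrame({
    "fraudulent": ["f", "f", "f", "f", "t"],
    "employment_type": ["Full-time", "Full-time", "Contract", np.nan, "Part-time"],
})

non_fraud = toy["fraudulent"] == "f"

# mode() ignores NaN and returns a Series (ties are possible), so take the first entry
non_fraud_mode = toy.loc[non_fraud, "employment_type"].mode().iloc[0]

# Fill only the non-fraudulent rows with their own mode
toy.loc[non_fraud, "employment_type"] = (
    toy.loc[non_fraud, "employment_type"].fillna(non_fraud_mode)
)
print(non_fraud_mode)
```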

In [375]:
df7['employment_type'] = df7['employment_type'].fillna("Full-time")

# Check the null count now
df7['employment_type'].isnull().sum()

0

In [376]:
# Check the new distribution
df7['employment_type'].value_counts()

Full-time                14484
Contract                  1616
Part-time                  820
Other                      252
Temporary                  237
Full-time & Part-time      236
Name: employment_type, dtype: int64

Let us view the total null entries for all columns now.

In [377]:
df7.isnull().sum()

title                      0
location                 343
department             11362
salary_range           14813
company_profile         3287
description                0
requirements            2650
benefits                7103
telecommuting              0
has_company_logo           0
has_questions              0
employment_type            0
required_experience        0
required_education         0
industry                4849
function                   0
fraudulent                 0
in_balanced_dataset        0
country                    0
state                      0
city                       0
dtype: int64

Note that columns like state and city contain empty strings, which are not counted as nulls in the output above. We need to take care of them, since these metadata columns cannot be used with empty strings. Also, we won't use the department and industry columns, since imputing their null values properly would take too much time.
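To see why the null count misses these values, here is a short sketch on made-up data (`toy` is an illustrative frame). An empty string is a valid string, so it is not treated as null until it is converted to NaN.

```python
import numpy as np
import pandas as pd

# Made-up rows with empty strings in both columns
toy = pd.DataFrame({"state": ["CA", "", "LND", ""],
                    "city": ["", "London", "London", ""]})

# Empty strings are not NaN, so the null count is 0 for both columns
null_counts = toy.isnull().sum()

# Comparing against '' counts the empty strings directly
empty_counts = (toy == "").sum()

# Converting '' to NaN makes them visible to the usual null tooling
nan_counts = toy.replace("", np.nan).isnull().sum()

print(null_counts.tolist(), empty_counts.tolist(), nan_counts.tolist())
```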

#### vi. Country, State and City

In [378]:
df7['country'].value_counts()

US    10524
GB     2349
GR      939
CA      450
DE      390
      ...  
SV        1
KZ        1
UG        1
JM        1
SD        1
Name: country, Length: 93, dtype: int64

In [379]:
len(df7[df7['country'] == ''])

0

In [380]:
df7['state'].value_counts()

       2200
CA     2025
NY     1248
LND    1004
TX      959
       ... 
TRF       1
OLD       1
HP        1
CBF       1
VLI       1
Name: state, Length: 329, dtype: int64

In [381]:
len(df7[df7['state'] == ''])

2200

We have 2200 empty strings, which is quite a lot. Let's take a look at some rows.

In [382]:
df7[df7['state'] == ''].head(20)

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
1,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,f,t,f,Full-time,Not Applicable,Unspecified,Marketing and Advertising,Customer Service,f,f,NZ,,Auckland
16,Hands-On QA Leader,"IL, , Tel Aviv, Israel",R&D,,<p>At HoneyBook we’re re-imagining the events industry and building a product that is already ch...,"<p>We are looking for a Hands-On QA Leader for our talented R&amp;D team, located in the Center ...",<ul>\r\n<li>Previous experience in client &amp; server testing</li>\r\n<li>Experience in Leading...,,f,t,f,Full-time,Mid-Senior level,Unspecified,Internet,Engineering,f,f,IL,,"Tel Aviv, Israel"
22,Engagement Manager,"AE, ,",Engagement,,<p>Upstream’s mission is to revolutionise the way companies market to consumers through cutting ...,<p>The position reports to the Head of Engagement Management in the Mobile Operator Business Uni...,"<p><b>Requirements</b></p>\r\n<p>The ideal candidate will be bright, ambitious, self-driven, har...",<p><b>Salary &amp; Benefits</b></p>\r\n<ul>\r\n<li>The opportunity to learn and grow in a world-...,f,t,t,Full-time,Mid-Senior level,Bachelor's Degree,Telecommunications,Sales,f,f,AE,,
26,Marketing Exec,"SG, ,",Marketing,,<p></p>\r\n<p>If working in a cubical seems like your idea of hell then joining our awesome star...,<p>We are currently expanding our Marketing Department and we are looking to hire a Content Mark...,"<p></p>\r\n<p>This position is Junior to Mid level. Ideally, you should have:</p>\r\n<p></p>\r\n...",<p>We are looking for Singaporean residents or internationals able to relocate to Singapore imme...,f,t,f,Full-time,Associate,Unspecified,Online Media,Marketing,f,f,SG,,
42,Jr. Developer,US,,40000-50000,,"<p>Entry level Software Developer<br>Location : Atlanta, Georgia<br>Experience : 1-2 years</p>\r...",,,f,f,f,Full-time,Entry level,Bachelor's Degree,Computer Software,Engineering,f,f,US,,
55,Junior HR Marketing Manager,"PL, ,",,,<p><b>We are Netguru and we love to develop web application based on Ruby On Rails framework. We...,<p>We are Netguru and we love to develop web applications based on Ruby On Rails framework. We v...,<p><b>Great Junior HR Marketing Manager is a person who knows how to:</b></p>\r\n<ul>\r\n<li>wri...,<h3>Perks &amp; benefits:</h3>\r\n<ul>\r\n<li>co-financing international conferences</li>\r\n<li...,f,t,t,Full-time,Mid-Senior level,Unspecified,,Management,f,f,PL,,
64,SENIOR FINANCE SOFTWARE RESEARCHER AND ENGINEER,"US, ,",,,,"<p>DUTIES: Conduct research for building technical, statistical, algorithmic and math models</p>...","<p>REQUIREMENTS: Bachelor’s degree in Mathematics, statistics, computer software</p>\r\n<p>engin...",,f,f,f,Full-time,Mid-Senior level,Master's Degree,,Information Technology,f,f,US,,
82,Edinburgh Fragrance and Beauty Promotional Staff,"GB, , Edinburgh",,,<p>Established on the principles that full time education is not for everyone Spectrum Learning ...,<p>We are currently recruiting for an exciting Sales &amp; Customer Service role. We are looking...,<p>Experience in fragrance and sales.</p>,<p>Bonuses are available.</p>,f,t,t,Temporary,Associate,Unspecified,Cosmetics,Sales,f,f,GB,,Edinburgh
91,Recruitment Consultant,"GB, ,",,17000-20000,"<p>Change Automotive Recruitment was established in 1993, with the aim of supplying quality, tal...",<p>Are you looking to continue your Career working for a Boutique Agency who are located in Scar...,"<p>You will be responsible for all aspects of the 360 degree Recruitment Consultant’s role, incl...","<ul>\r\n<li>An Experienced Recruiter / Resourcer , who is looking for further Development Oppor...",f,t,f,Full-time,Mid-Senior level,Bachelor's Degree,Staffing and Recruiting,Consulting,f,f,GB,,
94,WF17 9LU Customer Service Apprenticeship under NAS 16-18 year olds only!,"GB, , Birstall",,,<p>Established on the principles that full time education is not for everyone Spectrum Learning ...,<p>This is fantastic opportunity for someone wanting to start their career in Customer Service. ...,<p>Government funding is only available for 16-18 year olds as this job is an apprenticeship. </p>,<p>Future Prospects</p>,f,t,t,Full-time,Not Applicable,High School or equivalent,,Administrative,f,f,GB,,Birstall


In [383]:
len(df7[(df7['state'] == '') & (df7['city'] == 'London')])

178

In [384]:
(df7['state'] == 'LND').sum()

1004

In [385]:
len(df7[(df7['state'] == 'LND') & (df7['city'] == 'London')])

718

We have 1004 rows with LND as the state and 178 rows with an empty state but London as the city. In addition, 718 rows have both LND as the state and London as the city, which confirms that the two values go together, so using one to impute the other is valid.
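The same kind of check can be done in one table instead of several separate counts. The sketch below uses made-up rows (`toy` is illustrative) and `pd.crosstab` to show how often each state appears with each city, plus the share of London rows with a known state that agree on LND.

```python
import pandas as pd

# Made-up rows; '' marks an empty state or city as in the dataset
toy = pd.DataFrame({
    "state": ["LND", "LND", "", "LND", "MAN", ""],
    "city":  ["London", "London", "London", "", "Manchester", "Leeds"],
})

# Among London rows with a non-empty state, how many agree on LND?
london = toy[toy["city"] == "London"]
known = london[london["state"] != ""]
agreement = (known["state"] == "LND").mean()

# Full cross-tabulation of state against city
table = pd.crosstab(toy["state"], toy["city"])
print(agreement)
print(table)
```

An agreement close to 1 supports using the city to fill in the state. A lower value would suggest the city name is shared by several states.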

In [386]:
len(df7[(df7['state'] == 'LND') & (df7['city'] == '')])

131

In [387]:
(df7['city'] == 'London').sum()

1053

The same applies to empty cities: 131 rows have LND as the state but an empty city, so those 131 entries can be imputed with London.

In [388]:
df7[(df7['state'] == '') & (df7['city'] != '') & (df7['city'].notnull())]['city'].value_counts()

London             178
Athens              61
Brussels            46
Dublin              45
Berlin              41
                  ... 
Abudhabi             1
Wirral               1
Valanciennes         1
New York/Boston      1
New Jersey           1
Name: city, Length: 344, dtype: int64

In [389]:
df7[(df7['state'] == '') & (df7['city'] != '') & (df7['city'].notnull())]['city'].value_counts().head(50)

London                                                            178
Athens                                                             61
Brussels                                                           46
Dublin                                                             45
Berlin                                                             41
Hong Kong                                                          37
Birmingham                                                         19
Manchester                                                         18
Chamberi                                                           18
Quezon City                                                        17
Stockholm                                                          15
Wellington                                                         14
Tel Aviv                                                           13
Work from home                                                     13
Leeds               

In [390]:
df7[(df7['state'] != '') & (df7['city'] == '') & (df7['state'].notnull())]['state'].value_counts()

LND    131
I       40
CA      38
NY      35
01      20
      ... 
44       1
CTT      1
21       1
SPE      1
BPL      1
Name: state, Length: 152, dtype: int64

In [391]:
df7[(df7['state'] != '') & (df7['city'] == '') & (df7['state'].notnull())]['state'].value_counts().head(50)

LND    131
I       40
CA      38
NY      35
01      20
TX      17
PA      16
MA      14
13      13
BE      11
FL      11
ON      10
DC      10
OH      10
GA      10
NJ       9
MI       9
CT       8
B        8
LA       7
34       7
MAN      7
BIR      6
AZ       6
IN       6
N        6
KY       5
TA       5
MD       5
11       5
DA       5
OR       5
DU       5
NV       5
NSW      5
03       4
BRU      4
56       4
ND       4
WI       4
5        4
IL       4
DL       4
LDS      4
SHF      4
PR       4
VA       4
SP       4
TN       4
WA       3
Name: state, dtype: int64

Let's do a group by so we can see which countries those states come from.

Country codes: GR (Greece), DE (Germany), CA (Canada), TR (Turkey), GB (Britain), CY (Cyprus), RO (Romania), AE (United Arab Emirates), QA (Qatar), IL (Israel), AU (Australia), MT (Malta), IN (India)

In [392]:
df7[(df7['state'] != '') & (df7['city'] == '') & (df7['state'].notnull())].groupby(['country', 'state']).size().sort_values(ascending=False).head(40)

country  state
GB       LND      131
GR       I         40
US       CA        38
         NY        35
         TX        17
         PA        16
         MA        14
SG       01        12
US       FL        11
DE       BE        11
CA       ON        10
US       GA        10
         DC        10
         OH        10
JP       13        10
US       NJ         9
         MI         9
         CT         7
TR       34         7
GB       MAN        7
         BIR        6
CY       01         6
US       IN         6
RO       B          6
NZ       N          6
AE       DU         5
QA       DA         5
IL       TA         5
US       KY         5
         LA         5
         MD         5
         NV         5
         OR         5
AU       NSW        5
GB       LDS        4
US       ND         4
MT       56         4
IN       DL         4
US       TN         4
GB       SHF        4
dtype: int64

The idea is that one state can contain many cities, but a city can only belong to one state. Based on this, it is more feasible and logical to impute state and probably discard city later.
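This assumption can be tested before relying on it, because the same city name may appear in more than one state (Springfield in the US is a common example). The sketch below uses made-up rows (`toy` is illustrative) and counts the distinct non-empty states per country and city pair.

```python
import pandas as pd

# Made-up rows; Springfield deliberately appears in two states
toy = pd.DataFrame({
    "country": ["GB", "GB", "US", "US", "US"],
    "city":    ["London", "London", "Springfield", "Springfield", "Austin"],
    "state":   ["LND", "LND", "IL", "MA", "TX"],
})

# Ignore empty states, then count distinct states for each (country, city) pair
known = toy[toy["state"] != ""]
states_per_city = known.groupby(["country", "city"])["state"].nunique()

# Pairs with more than one state break the "one city, one state" idea
ambiguous = states_per_city[states_per_city > 1]
print(ambiguous)
```

Pairs listed in `ambiguous` are the ones where imputing the state from the city alone could go wrong.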

In [393]:
df7[(df7['state'] == '') & (df7['city'] != '') & (df7['city'].notnull())]['city'].value_counts().head(60)

London                                                            178
Athens                                                             61
Brussels                                                           46
Dublin                                                             45
Berlin                                                             41
Hong Kong                                                          37
Birmingham                                                         19
Manchester                                                         18
Chamberi                                                           18
Quezon City                                                        17
Stockholm                                                          15
Wellington                                                         14
Tel Aviv                                                           13
Work from home                                                     13
Leeds               

In [394]:
df7[(df7['state'] == '') & (df7['city'] != '') & (df7['city'].notnull())].groupby(['country', 'city']).size().sort_values(ascending=False).head(60)

country  city                                                          
GB       London                                                            177
GR       Athens                                                             60
BE       Brussels                                                           46
IE       Dublin                                                             45
DE       Berlin                                                             41
HK       Hong Kong                                                          37
GB       Birmingham                                                         19
         Manchester                                                         18
ES       Chamberi                                                           18
PH       Quezon City                                                        17
SE       Stockholm                                                          15
NZ       Wellington                                        

Can we use similar rows to impute these empty states?
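One possible way to answer this, sketched on made-up rows: build a lookup of the most frequent non-empty state per country and city pair, merge it back, and fill only the empty states. The names `toy`, `lookup` and `state_mode` are illustrative assumptions, and this is an alternative formulation rather than the approach used in the cells that follow.

```python
import pandas as pd

# Made-up rows; '' marks an empty state
toy = pd.DataFrame({
    "country": ["GB", "GB", "GB", "GR", "GR", "HK"],
    "city":    ["London", "London", "London", "Athens", "Athens", "Hong Kong"],
    "state":   ["LND", "LND", "", "I", "", ""],
})

# Most frequent non-empty state for each (country, city) pair
known = toy[toy["state"] != ""]
lookup = (known.groupby(["country", "city"])["state"]
               .agg(lambda s: s.value_counts().idxmax())
               .rename("state_mode")
               .reset_index())

# A left merge keeps every original row; pairs without a known state get NaN
merged = toy.merge(lookup, on=["country", "city"], how="left")

# Fill only rows whose state is empty and for which a lookup value exists
fill = (merged["state"] == "") & merged["state_mode"].notnull()
merged.loc[fill, "state"] = merged.loc[fill, "state_mode"]
result = merged.drop(columns="state_mode")
print(result)
```

Pairs such as HK Hong Kong, where no row has a state at all, stay empty and need another source.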

In [395]:
country_city = df7[(df7['state'] == '') & (df7['city'] != '') & (df7['city'].notnull())].groupby([
    'country', 'city']).size().sort_values(ascending=False).reset_index().rename(columns={0: 'count'})
country_city

Unnamed: 0,country,city,count
0,GB,London,177
1,GR,Athens,60
2,BE,Brussels,46
3,IE,Dublin,45
4,DE,Berlin,41
...,...,...,...
359,GB,Urmston,1
360,GB,Uxbridge,1
361,GB,Various,1
362,GB,Various in West Yorkshire,1


In [396]:
for i in country_city.index:
    # States recorded for every row sharing this (country, city) pair
    states = df7.loc[(df7['country'] == country_city.loc[i, 'country']) & 
                     (df7['city'] == country_city.loc[i, 'city']), 'state']
    if len(states) > 0:
        # Most frequent state for this pair (may be the empty string)
        mode = states.value_counts().idxmax()
        if mode != '':
            print(mode)
        else:
            print("@ Mode is an empty string!")
    else:
        print("**Cannot extract")

LND
I
@ Mode is an empty string!
L
BE
@ Mode is an empty string!
BIR
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
AB
N
TA
LDS
NH
@ Mode is an empty string!
NY
84
N
01
ES
ANT
37
@ Mode is an empty string!
B
@ Mode is an empty string!
@ Mode is an empty string!
CT
@ Mode is an empty string!
DU
VL
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
WKF
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
CA
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
MZ
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
RIC
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
@ Mode is an empty string!
AP
@ Mode is an empty string!
13


In [397]:
# Copy a new df
df8 = df7.copy()

# Create a counter for (country, city) pairs whose state cannot be imputed
counts = 0

# Start the imputation with messages
for i in country_city.index:
    country, city = country_city.loc[i, 'country'], country_city.loc[i, 'city']
    # Rows sharing this (country, city) pair, and how often each state occurs among them
    same_place = (df8['country'] == country) & (df8['city'] == city)
    state_counts = df8.loc[same_place, 'state'].value_counts()
    if same_place.sum() > 0:
        if state_counts.idxmax() != '':
            # The most frequent state is non-empty: use it
            mode = state_counts.idxmax()
            df8.loc[same_place & (df8['state'] == ''), 'state'] = mode
            print(country, city, "done, state:", mode)
        elif len(state_counts) > 1:
            # The most frequent state is empty: fall back to the next most frequent one
            mode = state_counts.index[1]
            df8.loc[same_place & (df8['state'] == ''), 'state'] = mode
            print(country, city, "done using 2nd non-empty state, state:", mode)
        else:
            print("@ Mode is an empty string for", country, city)
            counts += 1
    else:
        print("**Cannot extract")

GB London done, state: LND
GR Athens done, state: I
BE Brussels done using 2nd non-empty state, state: BRU
IE Dublin done, state: L
DE Berlin done, state: BE
@ Mode is an empty string for HK Hong Kong
GB Birmingham done, state: BIR
GB Manchester done using 2nd non-empty state, state: MAN
@ Mode is an empty string for ES Chamberi 
PH Quezon City done using 2nd non-empty state, state: 14
SE Stockholm done, state: AB
NZ Wellington done, state: N
IL Tel Aviv done, state: TA
GB Leeds done, state: LDS
NL Amsterdam done, state: NH
@ Mode is an empty string for NL The Hague
US New York done, state: NY
DK Copenhagen done, state: 84
NZ Auckland done, state: N
SG Singapore done, state: 01
FI Helsinki done, state: ES
GB Belfast done, state: ANT
EE Tallinn done, state: 37
GB Durham done using 2nd non-empty state, state: DUR
RO Bucharest done, state: B
GB Sheffield done using 2nd non-empty state, state: SHF
QA Doha done using 2nd non-empty state, state: DA
ES Barcelona done, state: CT
GB Bristol don

US Hillsboro done, state: OR
US Herndon done, state: VA
US Harrisburg done, state: PA
@ Mode is an empty string for US Denver, CO
@ Mode is an empty string for GB wigan
LT Kaunas done using 2nd non-empty state, state: KU
@ Mode is an empty string for GB Bury St Edmunds
@ Mode is an empty string for GB Chelmorton
@ Mode is an empty string for GB Cheadle
@ Mode is an empty string for GB Central Lodon
@ Mode is an empty string for GB Carlisle
@ Mode is an empty string for GB Cantebury
GB Cambridge done, state: CAM
GB Bournemouth done, state: DOR
@ Mode is an empty string for GB BARNSLEY
@ Mode is an empty string for GB Bodelwyddan
@ Mode is an empty string for GB Blackpool
@ Mode is an empty string for GB Birmingham or London
@ Mode is an empty string for GB Bedford
GB Basingstoke done, state: HAM
@ Mode is an empty string for GB Barnsley
@ Mode is an empty string for GB Cheltenham
GB Colchester done, state: ESS
@ Mode is an empty string for GB Cork
@ Mode is an empty string for GB Cornwa

In [398]:
counts

167

In [399]:
(df7['state'] == '').sum(), (df8['state'] == '').sum()

(2200, 1297)

This imputes 903 empty states (2200 - 1297), which is a good start. Let's see what else we can do.
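
The idea behind the imputation above can be sketched in a vectorised form. This is a simplified illustration on made-up toy data, not the loop used in this notebook: the `toy` frame and the helper `most_common_state` are hypothetical, and the sketch takes the mode of the non-empty states directly rather than falling back to a "2nd non-empty state".

```python
import pandas as pd

# Toy data standing in for the real dataframe; '' marks a missing state
toy = pd.DataFrame({
    "country": ["GB", "GB", "GB", "US", "US", "HK"],
    "city":    ["London", "London", "London", "New York", "New York", "Hong Kong"],
    "state":   ["LND", "", "LND", "NY", "", ""],
})

def most_common_state(states: pd.Series) -> str:
    """Return the most frequent non-empty state in the group, or '' if there is none."""
    non_empty = states[states != ""]
    if non_empty.empty:
        return ""
    return non_empty.mode().iloc[0]

# Broadcast each (country, city) group's mode back onto its rows
group_mode = toy.groupby(["country", "city"])["state"].transform(most_common_state)

# Only overwrite rows whose state is currently empty
imputed = toy.copy()
missing = imputed["state"] == ""
imputed.loc[missing, "state"] = group_mode[missing]

print(imputed)
print("Still empty:", (imputed["state"] == "").sum())
```

Groups with no known state at all (Hong Kong in this toy data) stay empty, which mirrors the "Mode is an empty string" messages in the log above.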

Manual state assignments for the remaining cases: ES Chamberi = MD; NL The Hague = ZH; MY Kuala Lumpur = 14; GB Soho, London = LND; AE Media City = DU; UA Kharkov = 63; GB Ossett = CLD; PH Pasig = 00; GB Wolverhampton = WLV; US West Virginia = WV; US Pittsburgh = PA; US North Portland = OR; US St Paul = MN; GB Heathrow = LND; GB Gatwick = WSX; GB Hartlepool = DUR; GB High Wycombe = BKM; GB Southport = LAN; IL Tel Aviv = TA; GB Skipton = NYK; GB Tamworth = STS; GB Brighouse = CLD; US Stocton, CA = CA; BE Antwerpen = VLG; BD Dhaka = C; US Gilroy Hollister = CA; IQ Erbil = AR; IN Hydrebad/Pune/remote = TG; IN INDORE = MP; IT Milan Area = 25; IT Rome = 62; KE Nai = 110; LK Colombo = 1; IN Gurgoan = HR; IL Tel Aviv, Israel = TA; MT Sliema = 56; IL Herzliya = TA; IE Cork = M; ID Medan = SM; GR HALANDRI = I; GT Athena = I; GB wigan = LAN.

In [400]:
df8[(df8['state'] == '') & (df8['city'] != '') & (df8['city'].notnull())].groupby([
    'country', 'city']).size().sort_values(ascending=False).head(60)

country  city                                                          
HK       Hong Kong                                                         37
ES       Chamberi                                                          18
NL       The Hague                                                          9
US       All Locations                                                      5
GB       Southport                                                          4
         Soho, London                                                       4
         See the Requirements section for areas and locations available     4
         Birstall                                                           4
AE       Media City                                                         3
UA       Kharkov                                                            3
GB       Ossett                                                             3
         Heathrow                                                     

Potential errors in the city column: "Madrid, Spain" listed under GB; "Remote" under US and GR; "Work from home" under GB, LU, IT, and IE; "Hydrebad/Pune/remote" under IN.
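
Entries like these could also be flagged programmatically instead of by eye. The following is a minimal sketch on toy data; the keyword list in `pattern` is an assumption and would need extending after inspecting the real values.

```python
import pandas as pd

# Toy rows imitating the kind of values seen in the city column
toy = pd.DataFrame({
    "country": ["GB", "US", "IN", "GB", "LU"],
    "city": ["Work from home", "Remote", "Hydrebad/Pune/remote", "Leeds", "Work from home"],
})

# Hypothetical keyword list for values that are not real city names
pattern = r"remote|work from home|all locations"

# Case-insensitive match anywhere in the string
not_a_city = toy["city"].str.contains(pattern, case=False, regex=True)

print(toy[not_a_city])
```

A keyword match only catches placeholder text. A mismatch such as a Spanish city filed under GB would need a separate check against a list of cities per country.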

In [401]:
# Manual (country, city) -> state lookup for the cases noted above
state_dict = {
    ('ES', 'Chamberi'): "MD", ('NL', 'The Hague'): "ZH", ('MY', 'Kuala Lumpur'): "14", ('GB', 'Soho, London'): "LND", 
    ('AE', 'Media City'): "DU", ('UA', 'Kharkov'): "63", ('GB', 'Ossett'): "CLD", ('PH', 'Pasig'): "00", 
    ('GB', 'Wolverhampton'): "WLV", ('US', 'West Virginia'): "WV", ('US', 'Pittsburgh'): "PA", ('US', 'North Portland'): "OR", 
    ('US', 'St Paul'): "MN", ('GB', 'Heathrow'): "LND", ('GB', 'Gatwick'): "WSX", ('GB', 'Hartlepool'): "DUR", 
    ('GB', 'High Wycombe'): "BKM", ('GB', 'Southport'): "LAN", ('IL', 'Tel Aviv'): "TA", ('GB', 'Skipton'): "NYK", 
    ('GB', 'Tamworth'): "STS", ('GB', 'Brighouse'): "CLD", ('US', 'Stocton, CA'): "CA", ('BE', 'Antwerpen'): "VLG", 
    ('BD', 'Dhaka'): "C", ('US', 'Gilroy Hollister'): "CA", ('IQ', 'Erbil'): "AR", ('IN', 'Hydrebad/Pune/remote'): "TG", 
    ('IN', 'INDORE'): "MP", ('IT', 'Milan Area'): "25", ('IT', 'Rome'): "62", ('KE', 'Nai'): "110", ('LK', 'Colombo'): "1", 
    ('IN', 'Gurgoan'): "HR", ('IL', 'Tel Aviv, Israel'): "TA", ('MT', 'Sliema'): "56", ('IL', 'Herzliya'): "TA", 
    ('IE', 'Cork'): "M", ('ID', 'Medan'): "SM", ('GR', 'HALANDRI'): "I", ('GT', 'Athena'): "I", ('GB', 'wigan'): "LAN"
}

In [402]:
for country_city in state_dict.keys():
    print(country_city[0], country_city[1])

ES Chamberi
NL The Hague
MY Kuala Lumpur
GB Soho, London
AE Media City
UA Kharkov
GB Ossett
PH Pasig
GB Wolverhampton
US West Virginia
US Pittsburgh
US North Portland
US St Paul
GB Heathrow
GB Gatwick
GB Hartlepool
GB High Wycombe
GB Southport
IL Tel Aviv
GB Skipton
GB Tamworth
GB Brighouse
US Stocton, CA
BE Antwerpen
BD Dhaka
US Gilroy Hollister
IQ Erbil
IN Hydrebad/Pune/remote
IN INDORE
IT Milan Area
IT Rome
KE Nai
LK Colombo
IN Gurgoan
IL Tel Aviv, Israel
MT Sliema
IL Herzliya
IE Cork
ID Medan
GR HALANDRI
GT Athena
GB wigan


In [403]:
# Monitor count
print("Before imputation:", (df8['state'] == '').sum())

# Fill only rows whose state is still empty and whose (country, city) is in the lookup
for (country, city), state in state_dict.items():
    df8.loc[(df8['state'] == '') & (df8['country'] == country) & (df8['city'] == city), 
            'state'] = state

# Check new count
print("After imputation:", (df8['state'] == '').sum())

Before imputation: 1297
After imputation: 1234
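
The same dictionary-based fill can be done without building a boolean mask per key. This is a sketch on toy data, assuming a small hypothetical `lookup` dict with the same (country, city) -> state shape as `state_dict`.

```python
import pandas as pd

# Toy frame; '' marks a missing state
toy = pd.DataFrame({
    "country": ["ES", "NL", "GB", "GB"],
    "city":    ["Chamberi", "The Hague", "wigan", "Leeds"],
    "state":   ["", "", "", "LDS"],
})

# Small stand-in for the manual lookup dictionary
lookup = {("ES", "Chamberi"): "MD", ("NL", "The Hague"): "ZH", ("GB", "wigan"): "LAN"}

# Look up every row's (country, city) pair once; pairs not in the dict give None
looked_up = pd.Series(
    [lookup.get(pair) for pair in zip(toy["country"], toy["city"])],
    index=toy.index,
)

# Fill only rows that are empty and have a match in the lookup
fill = (toy["state"] == "") & looked_up.notna()
toy.loc[fill, "state"] = looked_up[fill]

print(toy)
```

This makes one pass over the rows regardless of how many keys the dictionary holds, whereas the loop scans the whole frame once per key.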


In [404]:
df8[df8['city'] == 'Work from home']

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset,country,state,city
1315,Associate Business Development,"GB, , Work from home",,,,<p>Want to build a career in IT? Free training in exchange for your time on revenue share basis<...,,,f,f,f,Full-time,Entry level,Empty requirements,Information Technology and Services,Business Development,f,f,GB,,Work from home
1370,Associate Business Development,"BE, , Work from home",,,,<p>Want to build a career in IT? Free training in exchange for your time on revenue share basis<...,,,f,f,f,Full-time,Entry level,Empty requirements,Information Technology and Services,Business Development,f,f,BE,,Work from home
1949,Associate Business Development,"ES, , Work from home",,,,<p>Want to build a career in IT? Free training in exchange for your time on revenue share basis<...,,,f,f,f,Full-time,Entry level,Empty requirements,Information Technology and Services,Business Development,f,f,ES,,Work from home
2199,Associate Business Development,"IE, , Work from home",,,,<p>Want to build a career in IT? Free training in exchange for your time on revenue share basis<...,,,f,f,f,Full-time,Entry level,Empty requirements,Information Technology and Services,Business Development,f,f,IE,,Work from home
2201,Associate Business Development,"LU, , Work from home",,,,<p>Want to build a career in IT? Free training in exchange for your time on revenue share basis<...,,,f,f,f,Full-time,Entry level,Empty requirements,Information Technology and Services,Business Development,f,f,LU,,Work from home
2208,Associate Business Development,"FR, , Work from home",,,,<p>Want to build a career in IT? Free training in exchange for your time on revenue share basis<...,,,f,f,f,Full-time,Entry level,Empty requirements,Information Technology and Services,Business Development,f,f,FR,,Work from home
2849,Associate Business Development,"AU, , Work from home",,,,<p>Want to build a career in IT? Free training in exchange for your time on revenue share basis<...,,,f,f,f,Full-time,Entry level,Empty requirements,Information Technology and Services,Business Development,f,f,AU,,Work from home
3218,Associate Business Development,"FI, , Work from home",,,,<p>Want to build a career in IT? Free training in exchange for your time on revenue share basis<...,,,f,f,f,Full-time,Entry level,Empty requirements,Information Technology and Services,Business Development,f,f,FI,,Work from home
3247,Associate Business Development,"NO, , Work from home",,,,<p>Want to build a career in IT? Free training in exchange for your time on revenue share basis<...,,,f,f,f,Full-time,Entry level,Empty requirements,Information Technology and Services,Business Development,f,f,NO,,Work from home
3532,Associate Business Development,"CH, , Work from home",,,,<p>Want to build a career in IT? Free training in exchange for your time on revenue share basis<...,,,f,f,f,Full-time,Entry level,Empty requirements,Information Technology and Services,Business Development,f,f,CH,,Work from home


In [405]:
len(df8['country'].unique())

93

In [406]:
len(df8['state'].unique())

334

In [407]:
len(df8['city'].unique())

2344

In [408]:
len(df8['location'].unique())

3106

In [409]:
df8['country'] + ', ' + df8['state'] + ', ' + df8['city']

0            US, NY, New York
1             NZ, N, Auckland
2               US, IA, Wever
3          US, DC, Washington
4          US, FL, Fort Worth
                 ...         
17875         CA, ON, Toronto
17876    US, PA, Philadelphia
17877         US, TX, Houston
17878           NG, LA, Lagos
17879       NZ, N, Wellington
Length: 17645, dtype: object

In [410]:
len((df8['country'] + ', ' + df8['state'] + ', ' + df8['city']).unique())

2867

In [411]:
len(df8['state'].unique())

334

In [412]:
len((df8['country'] + ', ' + df8['state']).unique())

461

After some consideration, we feel that using city is too risky: it has too many unique values, with a lot of noise and errors. The state column is less noisy than city, but it still carries some risk even when combined with country. The final decision is therefore to use only country. df8 will serve as the checkpoint dataframe; we can come back to it if we need any of these columns again.
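
The cardinality comparison behind this decision can be gathered in one place. A minimal sketch on toy data (the rows are made up; only the column names follow the notebook):

```python
import pandas as pd

# Toy rows; '' marks a missing state
toy = pd.DataFrame({
    "country": ["US", "US", "GB", "GB", "GB"],
    "state":   ["NY", "CA", "LND", "LND", ""],
    "city":    ["New York", "Stocton, CA", "London", "Soho, London", "wigan"],
})

# Number of distinct values at each level of location granularity
cardinality = pd.Series({
    "country": toy["country"].nunique(),
    "country + state": toy.groupby(["country", "state"]).ngroups,
    "country + state + city": toy.groupby(["country", "state", "city"]).ngroups,
})

print(cardinality)
```

Note that `nunique()` ignores NaN by default, while `len(series.unique())` counts it as a value, so the two can differ by one on columns that still contain nulls.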

In [413]:
df8.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17645 entries, 0 to 17879
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   title                17645 non-null  object
 1   location             17302 non-null  object
 2   department           6283 non-null   object
 3   salary_range         2832 non-null   object
 4   company_profile      14358 non-null  object
 5   description          17645 non-null  object
 6   requirements         14995 non-null  object
 7   benefits             10542 non-null  object
 8   telecommuting        17645 non-null  object
 9   has_company_logo     17645 non-null  object
 10  has_questions        17645 non-null  object
 11  employment_type      17645 non-null  object
 12  required_experience  17645 non-null  object
 13  required_education   17645 non-null  object
 14  industry             12796 non-null  object
 15  function             17645 non-null  object
 16  frau

#### Filling the Nulls with Empty Strings

The textual columns are still useful for later tasks, so we want to keep them. We'll fill their nulls with empty strings.

In [414]:
# Copy a new df
df9 = df8.copy()

df9['location'] = df9['location'].fillna('')
df9['company_profile'] = df9['company_profile'].fillna('')
df9['requirements'] = df9['requirements'].fillna('')
df9['benefits'] = df9['benefits'].fillna('')

In [415]:
df9.isnull().sum()

title                      0
location                   0
department             11362
salary_range           14813
company_profile            0
description                0
requirements               0
benefits                   0
telecommuting              0
has_company_logo           0
has_questions              0
employment_type            0
required_experience        0
required_education         0
industry                4849
function                   0
fraudulent                 0
in_balanced_dataset        0
country                    0
state                      0
city                       0
dtype: int64
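
The four `fillna` calls can also be written against a list of columns. A minimal sketch on toy data, where the values (including the salary range) are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame with nulls in both text and non-text columns
toy = pd.DataFrame({
    "location": ["US, NY, New York", np.nan],
    "benefits": [np.nan, "<p>Full Benefits Offered</p>"],
    "salary_range": [np.nan, "20000-28000"],
})

# Only the text columns are filled; other columns keep their nulls
text_cols = ["location", "benefits"]
toy[text_cols] = toy[text_cols].fillna("")

print(toy.isnull().sum())
```

Keeping the column names in one list makes it harder to miss a column if the set of text fields changes later.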

In [416]:
df9.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17645 entries, 0 to 17879
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   title                17645 non-null  object
 1   location             17645 non-null  object
 2   department           6283 non-null   object
 3   salary_range         2832 non-null   object
 4   company_profile      17645 non-null  object
 5   description          17645 non-null  object
 6   requirements         17645 non-null  object
 7   benefits             17645 non-null  object
 8   telecommuting        17645 non-null  object
 9   has_company_logo     17645 non-null  object
 10  has_questions        17645 non-null  object
 11  employment_type      17645 non-null  object
 12  required_experience  17645 non-null  object
 13  required_education   17645 non-null  object
 14  industry             12796 non-null  object
 15  function             17645 non-null  object
 16  frau

Now we can finally proceed to the next major task of data preprocessing.

### 3. Dropping Irrelevant Columns

As mentioned before, we'll remove some columns from df9 onwards: department, salary_range, industry, and city. It would be a pity to drop state after spending so much effort on it, so we'll keep it for now for EDA purposes.

In [417]:
# Copy another new df
df10 = df9.copy()

# Drop a total of 4 columns, we keep location for now
df10 = df10.drop(columns=['department', 'salary_range', 'industry', 'city'])

# View the df
df10.head()

Unnamed: 0,title,location,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,function,fraudulent,in_balanced_dataset,country,state
0,Marketing Intern,"US, NY, New York","<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,f,t,f,Other,Internship,Unspecified,Marketing,f,f,US,NY
1,Customer Service - Cloud Video Production,"NZ, , Auckland","<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,f,t,f,Full-time,Not Applicable,Unspecified,Customer Service,f,f,NZ,N
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p>Our client, located in Houston, is actively seeking an experienced Commissioning Machinery As...",<ul>\r\n<li>Implement pre-commissioning and commissioning procedures for rotary equipment.</li>\...,,f,t,f,Full-time,Mid-Senior level,Unspecified,Administrative,f,f,US,IA
3,Account Executive - Washington DC,"US, DC, Washington",<p>Our passion for improving quality of life through geography is at the heart of everything we ...,<p><b>THE COMPANY: ESRI – Environmental Systems Research Institute</b></p>\r\n<p>Our passion for...,"<ul>\r\n<li>\r\n<b>EDUCATION: </b>Bachelor’s or Master’s in GIS, business administration, or a r...","<p>Our culture is anything but corporate—we have a collaborative, creative environment; phone di...",f,t,f,Full-time,Mid-Senior level,Bachelor's Degree,Sales,f,f,US,DC
4,Bill Review Manager,"US, FL, Fort Worth",<p>SpotSource Solutions LLC is a Global Human Capital Management Consulting firm headquartered i...,"<p><b>JOB TITLE:</b> Itemization Review Manager</p>\r\n<p><b>LOCATION:</b> Fort Worth, TX<b> ...",<p><b>QUALIFICATIONS:</b></p>\r\n<ul>\r\n<li>RN license in the State of Texas</li>\r\n<li>Diplom...,<p>Full Benefits Offered</p>,f,t,t,Full-time,Mid-Senior level,Bachelor's Degree,Health Care Provider,f,f,US,FL


In [418]:
df10.shape

(17645, 17)

In [419]:
df10.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17645 entries, 0 to 17879
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   title                17645 non-null  object
 1   location             17645 non-null  object
 2   company_profile      17645 non-null  object
 3   description          17645 non-null  object
 4   requirements         17645 non-null  object
 5   benefits             17645 non-null  object
 6   telecommuting        17645 non-null  object
 7   has_company_logo     17645 non-null  object
 8   has_questions        17645 non-null  object
 9   employment_type      17645 non-null  object
 10  required_experience  17645 non-null  object
 11  required_education   17645 non-null  object
 12  function             17645 non-null  object
 13  fraudulent           17645 non-null  object
 14  in_balanced_dataset  17645 non-null  object
 15  country              17645 non-null  object
 16  stat

### 5. Encoding Binary Columns

We encode 5 columns with the binary values 0 and 1: telecommuting, has_company_logo, has_questions, fraudulent, and in_balanced_dataset.

In [420]:
df10['in_balanced_dataset'].value_counts()

f    16751
t      894
Name: in_balanced_dataset, dtype: int64

In [421]:
df10['has_questions'].value_counts()

f    8969
t    8676
Name: has_questions, dtype: int64

In [422]:
df10['has_company_logo'].value_counts()

t    14011
f     3634
Name: has_company_logo, dtype: int64

In [423]:
df10['fraudulent'].value_counts()

f    16787
t      858
Name: fraudulent, dtype: int64

Let's start the replacement now and take a look.

In [424]:
# Map 't' -> 1 and 'f' -> 0 in each binary column
binary_cols = ['telecommuting', 'has_company_logo', 'has_questions', 'fraudulent', 'in_balanced_dataset']
for col in binary_cols:
    df10[col] = df10[col].replace({'t': 1, 'f': 0})

In [425]:
# Check the values now
df10['fraudulent'].value_counts()

0    16787
1      858
Name: fraudulent, dtype: int64

In [426]:
df10.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17645 entries, 0 to 17879
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   title                17645 non-null  object
 1   location             17645 non-null  object
 2   company_profile      17645 non-null  object
 3   description          17645 non-null  object
 4   requirements         17645 non-null  object
 5   benefits             17645 non-null  object
 6   telecommuting        17645 non-null  int64 
 7   has_company_logo     17645 non-null  int64 
 8   has_questions        17645 non-null  int64 
 9   employment_type      17645 non-null  object
 10  required_experience  17645 non-null  object
 11  required_education   17645 non-null  object
 12  function             17645 non-null  object
 13  fraudulent           17645 non-null  int64 
 14  in_balanced_dataset  17645 non-null  int64 
 15  country              17645 non-null  object
 16  stat

The 5 columns have also changed from object to int64, which is what we want.
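
A slightly more defensive variant of this encoding checks for unexpected values before converting. This is a sketch on toy data using `Series.map`, which is a different call from the `replace` used above; the toy frame and the error message are illustrative only.

```python
import pandas as pd

# Toy frame with two of the 't'/'f' columns
toy = pd.DataFrame({
    "telecommuting": ["f", "t", "f"],
    "fraudulent": ["f", "f", "t"],
})

binary_cols = ["telecommuting", "fraudulent"]
mapping = {"t": 1, "f": 0}

for col in binary_cols:
    # Fail loudly if a column holds anything other than 't' or 'f'
    unexpected = set(toy[col].unique()) - set(mapping)
    if unexpected:
        raise ValueError(f"{col} has unexpected values: {unexpected}")
    toy[col] = toy[col].map(mapping).astype("int64")

print(toy.dtypes)
```

Unlike `replace`, `map` turns any value missing from the dictionary into NaN, which is why the check runs first.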

Looking at our df, we are ready to split it into a metadata df and a text df. Keep in mind, however, that the location and state columns may not be useful, so we may need to drop them later. In my opinion, the state column is suitable only for simple EDA and not for ML, because it is too noisy.

### 6. Split the Dataframe into 2

In [427]:
df10.iloc[:, [0, 1, 2, 3, 4, 5, 13]]

Unnamed: 0,title,location,company_profile,description,requirements,benefits,fraudulent
0,Marketing Intern,"US, NY, New York","<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,0
1,Customer Service - Cloud Video Production,"NZ, , Auckland","<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,0
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p>Our client, located in Houston, is actively seeking an experienced Commissioning Machinery As...",<ul>\r\n<li>Implement pre-commissioning and commissioning procedures for rotary equipment.</li>\...,,0
3,Account Executive - Washington DC,"US, DC, Washington",<p>Our passion for improving quality of life through geography is at the heart of everything we ...,<p><b>THE COMPANY: ESRI – Environmental Systems Research Institute</b></p>\r\n<p>Our passion for...,"<ul>\r\n<li>\r\n<b>EDUCATION: </b>Bachelor’s or Master’s in GIS, business administration, or a r...","<p>Our culture is anything but corporate—we have a collaborative, creative environment; phone di...",0
4,Bill Review Manager,"US, FL, Fort Worth",<p>SpotSource Solutions LLC is a Global Human Capital Management Consulting firm headquartered i...,"<p><b>JOB TITLE:</b> Itemization Review Manager</p>\r\n<p><b>LOCATION:</b> Fort Worth, TX<b> ...",<p><b>QUALIFICATIONS:</b></p>\r\n<ul>\r\n<li>RN license in the State of Texas</li>\r\n<li>Diplom...,<p>Full Benefits Offered</p>,0
...,...,...,...,...,...,...,...
17875,Account Director - Distribution,"CA, ON, Toronto",<p>Vend is looking for some awesome new talent to come join us. You'll be working in an awesome ...,<p>Just in case this is the first time you’ve visited our website Vend is an award winning web b...,<p>To ace this role you:</p>\r\n<ul>\r\n<li>Will eat comprehensive Statements of Work for breakf...,<p><b>What can you expect from us?</b></p>\r\n<p>We have an open culture where we openly share o...,0
17876,Payroll Accountant,"US, PA, Philadelphia",<p>WebLinc is the e-commerce platform and services provider for the fastest growing online retai...,<p></p>\r\n<p>The Payroll Accountant will focus primarily on payroll functions for approximately...,<p></p>\r\n<p>- B.A. or B.S. in Accounting</p>\r\n<p>- <b>Desire to have fun while doing what yo...,<p></p>\r\n<h3>Health &amp; Wellness</h3>\r\n<ul>\r\n<li>Medical plan</li>\r\n<li>Prescription d...,0
17877,Project Cost Control Staff Engineer - Cost Control Exp - TX,"US, TX, Houston",<p>We Provide Full Time Permanent Positions for many medium to large US companies. We are intere...,<p>Experienced Project Cost Control Staff Engineer is required having responsibility to provide ...,<ul>\r\n<li>At least 12 years professional experience.</li>\r\n<li>Ability to work in a diverse ...,,0
17878,Graphic Designer,"NG, LA, Lagos",,<p>Nemsia Studios is looking for an experienced visual/graphic designer to join our Lagos office...,"<p>1. Must be fluent in the latest versions of Corel &amp; Adobe CC (Esp. Photoshop, Illustrator...",<p>Competitive salary (compensation will be based on experience) <br>Casual attire <br>At Nemsia...,0


In [428]:
# Metadata df, excluding state
df_structured = df10.copy()
df_structured = df_structured.iloc[:, 6:16]

# Text df
df_text = df10.copy()
df_text = df_text.iloc[:, [0, 1, 2, 3, 4, 5, 13]]

In [429]:
df_structured.shape, df_text.shape

((17645, 10), (17645, 7))
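
Selecting by position works, but it silently breaks if the column order ever changes. A name-based split is sketched below on an empty one-row frame that only reproduces the 17 column names of df10; the variable names `text_cols` and `structured_cols` are assumptions for illustration.

```python
import pandas as pd

# Column order as shown by df10.info() above
columns = [
    "title", "location", "company_profile", "description", "requirements",
    "benefits", "telecommuting", "has_company_logo", "has_questions",
    "employment_type", "required_experience", "required_education",
    "function", "fraudulent", "in_balanced_dataset", "country", "state",
]
toy = pd.DataFrame([[""] * len(columns)], columns=columns)

# Text columns plus the target
text_cols = ["title", "location", "company_profile", "description",
             "requirements", "benefits", "fraudulent"]

# Structured columns, excluding state
structured_cols = ["telecommuting", "has_company_logo", "has_questions",
                   "employment_type", "required_experience", "required_education",
                   "function", "fraudulent", "in_balanced_dataset", "country"]

toy_text = toy[text_cols].copy()
toy_structured = toy[structured_cols].copy()

print(toy_structured.shape, toy_text.shape)
```

On this column order the two approaches select the same columns, so the choice is about readability and robustness rather than results.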

In [430]:
df_structured.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17645 entries, 0 to 17879
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   telecommuting        17645 non-null  int64 
 1   has_company_logo     17645 non-null  int64 
 2   has_questions        17645 non-null  int64 
 3   employment_type      17645 non-null  object
 4   required_experience  17645 non-null  object
 5   required_education   17645 non-null  object
 6   function             17645 non-null  object
 7   fraudulent           17645 non-null  int64 
 8   in_balanced_dataset  17645 non-null  int64 
 9   country              17645 non-null  object
dtypes: int64(5), object(5)
memory usage: 2.0+ MB


In [431]:
df_text.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17645 entries, 0 to 17879
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   title            17645 non-null  object
 1   location         17645 non-null  object
 2   company_profile  17645 non-null  object
 3   description      17645 non-null  object
 4   requirements     17645 non-null  object
 5   benefits         17645 non-null  object
 6   fraudulent       17645 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 1.6+ MB


## Data Cleaning (Unstructured Data)

### Step 1: Lowercase all characters

To process the text efficiently, we first combine the text columns into one column.

In [432]:
df_text.head(4)

Unnamed: 0,title,location,company_profile,description,requirements,benefits,fraudulent
0,Marketing Intern,"US, NY, New York","<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,0
1,Customer Service - Cloud Video Production,"NZ, , Auckland","<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,0
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p>Our client, located in Houston, is actively seeking an experienced Commissioning Machinery As...",<ul>\r\n<li>Implement pre-commissioning and commissioning procedures for rotary equipment.</li>\...,,0
3,Account Executive - Washington DC,"US, DC, Washington",<p>Our passion for improving quality of life through geography is at the heart of everything we ...,<p><b>THE COMPANY: ESRI – Environmental Systems Research Institute</b></p>\r\n<p>Our passion for...,"<ul>\r\n<li>\r\n<b>EDUCATION: </b>Bachelor’s or Master’s in GIS, business administration, or a r...","<p>Our culture is anything but corporate—we have a collaborative, creative environment; phone di...",0


In [433]:
# Concatenate them into a new column
df_text['text'] = df_text['title'] + ' ' + df_text['location'] + ' ' + df_text['company_profile'] + ' ' + \
                  df_text['description'] + ' ' + df_text['requirements'] + ' ' + df_text['benefits']

# View the new df
df_text.head(4)

Unnamed: 0,title,location,company_profile,description,requirements,benefits,fraudulent,text
0,Marketing Intern,"US, NY, New York","<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,0,"Marketing Intern US, NY, New York <h3>We're Food52, and we've created a groundbreaking and award..."
1,Customer Service - Cloud Video Production,"NZ, , Auckland","<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,0,"Customer Service - Cloud Video Production NZ, , Auckland <h3>90 Seconds, the worlds Cloud Video ..."
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p>Our client, located in Houston, is actively seeking an experienced Commissioning Machinery As...",<ul>\r\n<li>Implement pre-commissioning and commissioning procedures for rotary equipment.</li>\...,,0,"Commissioning Machinery Assistant (CMA) US, IA, Wever <h3></h3>\r\n<p>Valor Services provides Wo..."
3,Account Executive - Washington DC,"US, DC, Washington",<p>Our passion for improving quality of life through geography is at the heart of everything we ...,<p><b>THE COMPANY: ESRI – Environmental Systems Research Institute</b></p>\r\n<p>Our passion for...,"<ul>\r\n<li>\r\n<b>EDUCATION: </b>Bachelor’s or Master’s in GIS, business administration, or a r...","<p>Our culture is anything but corporate—we have a collaborative, creative environment; phone di...",0,"Account Executive - Washington DC US, DC, Washington <p>Our passion for improving quality of lif..."


In [434]:
df_text.shape

(17645, 8)
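
Concatenating with `+` only works here because the nulls were filled with empty strings earlier. The sketch below, on toy data, shows what happens otherwise and one way to make the join robust; the `parts` list is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy frame where one text field is still null
toy = pd.DataFrame({
    "title": ["Marketing Intern", "Graphic Designer"],
    "location": ["US, NY, New York", "NG, LA, Lagos"],
    "benefits": ["", np.nan],
})

# With '+', a single NaN makes the whole concatenated value NaN
plus = toy["title"] + " " + toy["location"] + " " + toy["benefits"]

# Filling first and joining row-wise avoids losing the row
parts = ["title", "location", "benefits"]
joined = toy[parts].fillna("").agg(" ".join, axis=1)

print(plus)
print(joined)
```

In the second row, `plus` is NaN while `joined` still contains the title and location.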

The first step of text pre-processing is to normalise letter case, as mentioned before. We'll test on one entry first to see that it works.

In [435]:
df_text.loc[0, 'text']

"Marketing Intern US, NY, New York <h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.</h3>\r\n<p>We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.</p>\r\n<p>Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.</p>\r\n<p>We're located in Chelsea, in New York City.</p> <p>Food52, a fast-growing, James Bear

Let's check that lowercasing works as expected on this entry.

In [436]:
# Store the text
word = df_text.loc[0, 'text']

# Lowercase them
word.lower()

"marketing intern us, ny, new york <h3>we're food52, and we've created a groundbreaking and award-winning cooking site. we support, connect, and celebrate home cooks, and give them everything they need in one place.</h3>\r\n<p>we have a top editorial, business, and engineering team. we're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. we attract the most talented home cooks and contributors in the country; we also publish well-known professionals like mario batali, gwyneth paltrow, and danny meyer. and we have partnerships with whole foods market and random house.</p>\r\n<p>food52 has been named the best food website by the james beard foundation and iacp, and has been featured in the new york times, npr, pando daily, techcrunch, and on the today show.</p>\r\n<p>we're located in chelsea, in new york city.</p> <p>food52, a fast-growing, james bear

In [437]:
df_text['text'].str.lower()

0        marketing intern us, ny, new york <h3>we're food52, and we've created a groundbreaking and award...
1        customer service - cloud video production nz, , auckland <h3>90 seconds, the worlds cloud video ...
2        commissioning machinery assistant (cma) us, ia, wever <h3></h3>\r\n<p>valor services provides wo...
3        account executive - washington dc us, dc, washington <p>our passion for improving quality of lif...
4        bill review manager us, fl, fort worth <p>spotsource solutions llc is a global human capital man...
                                                        ...                                                 
17875    account director - distribution  ca, on, toronto <p>vend is looking for some awesome new talent ...
17876    payroll accountant us, pa, philadelphia <p>weblinc is the e-commerce platform and services provi...
17877    project cost control staff engineer - cost control exp - tx us, tx, houston <p>we provide full t...
17878    graphic de

We won't apply this right away; for now we only want to confirm that it works. All the steps will be combined into one code cell later.

### Step 2: Removal of Symbols, Punctuations, Links and Special Characters

#### Remove Unicode Characters

This step removes any characters that cannot be represented in ASCII: the text is encoded to ASCII with unencodable characters ignored, then decoded back to a string. We want to check that this works on the data.

In [438]:
# Store the text
word = df_text.loc[0, 'text']

# Remove unicode chars
word.encode('ascii', 'ignore').decode()

"Marketing Intern US, NY, New York <h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.</h3>\r\n<p>We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.</p>\r\n<p>Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.</p>\r\n<p>We're located in Chelsea, in New York City.</p> <p>Food52, a fast-growing, James Bear

In [439]:
df_text['text'].apply(lambda x: x.encode('ascii', 'ignore').decode())

0        Marketing Intern US, NY, New York <h3>We're Food52, and we've created a groundbreaking and award...
1        Customer Service - Cloud Video Production NZ, , Auckland <h3>90 Seconds, the worlds Cloud Video ...
2        Commissioning Machinery Assistant (CMA) US, IA, Wever <h3></h3>\r\n<p>Valor Services provides Wo...
3        Account Executive - Washington DC US, DC, Washington <p>Our passion for improving quality of lif...
4        Bill Review Manager US, FL, Fort Worth <p>SpotSource Solutions LLC is a Global Human Capital Man...
                                                        ...                                                 
17875    Account Director - Distribution  CA, ON, Toronto <p>Vend is looking for some awesome new talent ...
17876    Payroll Accountant US, PA, Philadelphia <p>WebLinc is the e-commerce platform and services provi...
17877    Project Cost Control Staff Engineer - Cost Control Exp - TX US, TX, Houston <p>We Provide Full T...
17878    Graphic De
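One thing worth knowing about the `encode('ascii', 'ignore')` approach is that it silently drops accented letters and curly quotes rather than converting them. The sketch below uses a made-up sample string and a hypothetical helper `to_ascii` (neither is part of the notebook) to compare the plain approach with an optional NFKD normalization pass from the standard `unicodedata` module, which keeps the base letter of accented characters.

```python
import unicodedata

def to_ascii(text, transliterate=True):
    """Drop non-ASCII characters, optionally decomposing accents first."""
    if transliterate:
        # NFKD splits an accented letter into base letter + combining mark,
        # so the base letter survives the ASCII encoding step
        text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode()

# Hypothetical sample: precomposed e-acute and a curly apostrophe
sample = "Caf\u00e9 r\u00e9sum\u00e9 for Upstream\u2019s team"

plain = to_ascii(sample, transliterate=False)   # same as the notebook's approach
folded = to_ascii(sample)                       # with NFKD normalization first

print(plain)
print(folded)
```

In both cases the curly apostrophe disappears, so a possessive such as "Upstream’s" collapses into a single token.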

#### Remove Escape characters and HTML tags

This step takes four substitutions: first remove all escape (control) characters, then HTML tags, then the non-breaking space `\xa0`, and finally the `&amp;` entity.

In [440]:
# Create the filter
filter = ''.join([chr(i) for i in range(1, 32)])

# Store the text
word = df_text.loc[0, 'text']

# Step 1: Apply replacement of escape characters
word = word.translate(str.maketrans('', '', filter))

# Step 2: Apply replacement of HTML characters
word = re.sub('<[^<]+?>', ' ', word)

# Step 3: Remove \xa0
word = re.sub(u'\xa0', u' ', word)

# Step 4: Apply replacement of special combinations like &amp;
word = re.sub('&amp;', ' ', word)

# View the result
word

"Marketing Intern US, NY, New York  We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.  We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.  Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.  We're located in Chelsea, in New York City.   Food52, a fast-growing, James Beard Award-winning online food communit

Let's try it on another string.

In [441]:
# Create the filter
filter = ''.join([chr(i) for i in range(1, 32)])

# Store the text
word = df_text.loc[1, 'text']

# Step 1: Apply replacement of escape characters
word = word.translate(str.maketrans('', '', filter))

# Step 2: Apply replacement of HTML characters
word = re.sub('<[^<]+?>', ' ', word)

# Step 3: Remove \xa0
word = re.sub(u'\xa0', u' ', word)

# Step 4: Apply replacement of special combinations like &amp;
word = re.sub('&amp;', ' ', word)

# View the result
word

"Customer Service - Cloud Video Production NZ, , Auckland  90 Seconds, the worlds Cloud Video Production Service.  90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. 90 Seconds makes video production fast, affordable, and all managed seamlessly in the cloud from purchase to publish.  http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#     90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing global network of over 2,000 rated video professionals in over 50 countries managed by dedicated production success teams in 5 countries, 90 Seconds provides a 100% success guarantee.  90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest

In [442]:
# Combining the first 2 lines of code
df_text['text'].apply(lambda x: re.sub('<[^<]+?>', ' ', x.translate(str.maketrans('', '', filter))))

0        Marketing Intern US, NY, New York  We're Food52, and we've created a groundbreaking and award-wi...
1        Customer Service - Cloud Video Production NZ, , Auckland  90 Seconds, the worlds Cloud Video Pro...
2        Commissioning Machinery Assistant (CMA) US, IA, Wever    Valor Services provides Workforce Solut...
3        Account Executive - Washington DC US, DC, Washington  Our passion for improving quality of life ...
4        Bill Review Manager US, FL, Fort Worth  SpotSource Solutions LLC is a Global Human Capital Manag...
                                                        ...                                                 
17875    Account Director - Distribution  CA, ON, Toronto  Vend is looking for some awesome new talent to...
17876    Payroll Accountant US, PA, Philadelphia  WebLinc is the e-commerce platform and services provide...
17877    Project Cost Control Staff Engineer - Cost Control Exp - TX US, TX, Houston  We Provide Full Tim...
17878    Graphic De

We already saw on single rows that the `\xa0` substitution works, so we skip it here since this is just a trial run.
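The substitutions above handle `&amp;` and `\xa0` explicitly. If other HTML entities turn up in the data, one possible generalization is the standard library's `html.unescape`, which decodes all named and numeric entities at once. This is a minimal sketch on a made-up string; `strip_html` is a hypothetical helper, and unlike the notebook's version it turns `&amp;` into `&` rather than a space, leaving it for the punctuation step to remove.

```python
import html
import re

def strip_html(text):
    """Remove tags, then decode entities such as &amp; and &nbsp;."""
    text = re.sub(r"<[^<]+?>", " ", text)   # same tag pattern as in the notebook
    text = html.unescape(text)              # '&amp;' -> '&', '&nbsp;' -> '\xa0'
    return text.replace("\xa0", " ")        # non-breaking space -> normal space

# Hypothetical sample in the style of the job postings
sample = "<p>Training (Initial &amp; Ongoing)<br>Pay&nbsp;weekly</p>"
cleaned = strip_html(sample)
print(repr(cleaned))
```

Tags are removed before entities are decoded so that an encoded `&lt;` cannot be mistaken for the start of a tag.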

#### Removal of Symbols, Punctuations and Links

In [443]:
df_text.loc[1, 'text']

'Customer Service - Cloud Video Production NZ, , Auckland <h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. 90 Seconds makes video production fast, affordable, and all managed seamlessly in the cloud from purchase to publish. <a href="http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#" rel="nofollow" class="external">http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#</a></p>\r\n<p></p>\r\n<p>90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing global network of over 2,000 rated video professionals in over 50 countries managed by dedicated production success teams in 5 countries, 90 Seconds provides 

First, we'll try a regex to substitute URL links.

In [444]:
# Store the text
word = df_text.loc[0, 'text']

# Remove URL
re.sub(r"https?\S+", " ", word)

"Marketing Intern US, NY, New York <h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.</h3>\r\n<p>We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.</p>\r\n<p>Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.</p>\r\n<p>We're located in Chelsea, in New York City.</p> <p>Food52, a fast-growing, James Bear

In [445]:
df_text['text'].apply(lambda x: re.sub(r"https?\S+", " ", x))

0        Marketing Intern US, NY, New York <h3>We're Food52, and we've created a groundbreaking and award...
1        Customer Service - Cloud Video Production NZ, , Auckland <h3>90 Seconds, the worlds Cloud Video ...
2        Commissioning Machinery Assistant (CMA) US, IA, Wever <h3></h3>\r\n<p>Valor Services provides Wo...
3        Account Executive - Washington DC US, DC, Washington <p>Our passion for improving quality of lif...
4        Bill Review Manager US, FL, Fort Worth <p>SpotSource Solutions LLC is a Global Human Capital Man...
                                                        ...                                                 
17875    Account Director - Distribution  CA, ON, Toronto <p>Vend is looking for some awesome new talent ...
17876    Payroll Accountant US, PA, Philadelphia <p>WebLinc is the e-commerce platform and services provi...
17877    Project Cost Control Staff Engineer - Cost Control Exp - TX US, TX, Houston <p>We Provide Full T...
17878    Graphic De

The code works well, so we proceed to the next part: removal of mentions and hashtags. The filter below shows that some rows do contain both `@` and `#` characters.

In [446]:
df_text[(df_text['text'].str.contains("@")) & (df_text['text'].str.contains("#"))]

Unnamed: 0,title,location,company_profile,description,requirements,benefits,fraudulent,text
25,H1B SPONSOR FOR L1/L2/OPT,"US, NY, New York",<p>i28 Technologies has demonstrated expertise in areas strategic to different business in varyi...,"<p><b>Hello,</b></p>\r\n<p><b>Wish you are doing good...</b><b></b></p>\r\n<p> ...","<p><b>JAVA, .NET, SQL, ORACLE, SAP, Informatica, Bigdata,OBIEE, Web Technologies and Java, Share...",,0,"H1B SPONSOR FOR L1/L2/OPT US, NY, New York <p>i28 Technologies has demonstrated expertise in ar..."
201,JAVA Solution Architect,"BE, , Brussels",<p><b>EUROPEAN DYNAMICS (#URL_c66532ffa1ce76ab447da6774719060c42c584edbf44d74cdb94fc4ac219ca45#)...,"<p>We currently have a vacancy for a <b>JAVA Solution Architect</b>, to offer his/her services a...",<p><b>Your skills:</b></p>\r\n<ul>\r\n<li>Minimum 14 years of relevant University Studies &amp; ...,"<p><b>Our offer: </b></p>\r\n<p>If you are seeking a career in an exciting and dynamic company, ...",0,"JAVA Solution Architect BE, , Brussels <p><b>EUROPEAN DYNAMICS (#URL_c66532ffa1ce76ab447da67747..."
251,Senior Product Manager,"GB, LND, London",<p><b>EUROPEAN DYNAMICS (#URL_c66532ffa1ce76ab447da6774719060c42c584edbf44d74cdb94fc4ac219ca45#)...,"<p>We currently have a vacancy for a <b>Senior</b> <b>Product Manager</b>, fluent in English, to...",<p><b>Your skills:</b></p>\r\n<ul>\r\n<li>University degree with demonstrated experience in pro...,"<p><b>Our offer: </b></p>\r\n<p>If you are seeking a career in an exciting and dynamic company, ...",0,"Senior Product Manager GB, LND, London <p><b>EUROPEAN DYNAMICS (#URL_c66532ffa1ce76ab447da67747..."
298,Sales Representative with Management Training - DirecTV,"US, TX, McAllen","<p>Argenta Field Solutions values the client, creates income streams for them through our sales ...",<p><b>Interviewing Now for Sales Rep Positions in McAllen TX. </b></p>\r\n<p>Football season is ...,"<p>- Sales experience preferred<br>- Ability to work in high energy, team environment<br>- Goal/...",<p>- AFLAC<br>- Health Insurance (Management) <br>- Training (Initial &amp; Ongoing)<br>- Vacati...,0,"Sales Representative with Management Training - DirecTV US, TX, McAllen <p>Argenta Field Solutio..."
319,Engineering Graduate Trainee @ Upstream,"GR, I, Athens",<p>Upstream’s mission is to revolutionise the way companies market to consumers through cutting ...,"<p><img src=""#URL_8f48f907aab2abcab45a47fbd130e074afdf074ea8c5969308f22eb67630dc91#"">If you are ...",,,0,"Engineering Graduate Trainee @ Upstream GR, I, Athens <p>Upstream’s mission is to revolutionise ..."
...,...,...,...,...,...,...,...,...
17426,Project Manager,"BE, BRU, Brussels",<p><b>EUROPEAN DYNAMICS (#URL_c66532ffa1ce76ab447da6774719060c42c584edbf44d74cdb94fc4ac219ca45#)...,"<p>We currently have a vacancy for a <b>Project Manager</b>, fluent in English, to offer his/her...",<p><b>Your skills:</b></p>\r\n<ul>\r\n<li>University Degree in Computer Science or equivalent wi...,"<p><b>Our offer: </b></p>\r\n<p>If you are seeking a career in an exciting and dynamic company, ...",0,"Project Manager BE, BRU, Brussels <p><b>EUROPEAN DYNAMICS (#URL_c66532ffa1ce76ab447da6774719060..."
17454,Beauty & Fragrance consultants needed,"GB, , Milton Keynes",<p>Established on the principles that full time education is not for everyone Spectrum Learning ...,<p>Luxury beauty &amp; fragrance consultants needed for immediate starts!</p>\r\n<p>Pure Placeme...,,,0,"Beauty & Fragrance consultants needed GB, , Milton Keynes <p>Established on the principles that ..."
17503,SQL Developer,"US, NY, NYC","<ul>\r\n<li>Maxnet offers Staff Augmentation Solutions for Big Data Analytics in Retail, Healthc...",<p>We are a growing Fashion and retail company. We have an outstanding career opportunity for a ...,"<p>The Consultant will perform all aspects of the PLM Software Development Life Cycle, including...",,0,"SQL Developer US, NY, NYC <ul>\r\n<li>Maxnet offers Staff Augmentation Solutions for Big Data An..."
17542,Vacancies @ Hyatt Hotel - Apply now before deadline!,"PH, 41, Manila",,<p><b>Job Vacancies @ Hyatt Hotels London - Apply before deadline!!!!</b><br><b>Hyatt Hotels Lon...,,,1,"Vacancies @ Hyatt Hotel - Apply now before deadline! PH, 41, Manila <p><b>Job Vacancies @ Hyatt..."


In [447]:
df_text.loc[201, 'text']

'JAVA Solution Architect  BE, , Brussels <p><b>EUROPEAN DYNAMICS (#URL_c66532ffa1ce76ab447da6774719060c42c584edbf44d74cdb94fc4ac219ca45#)</b> is a leading European Software, Information and Communication Technologies company, operating internationally (Athens, Brussels, Luxembourg, Copenhagen, Berlin, Rome, Stockholm, London, Nicosia, Helsinki, Valetta, etc). The company employs over 600 engineers and IT experts. We design and develop software applications using integrated, state-of-the-art technology. Our current IT and telecoms projects have a value exceeding 250 million EURO. EUROPEAN DYNAMICS is a renowned supplier of IT services to European Union Institutions, international organizations, European Agencies and national government Administrations all over Europe.</p> <p>We currently have a vacancy for a <b>JAVA Solution Architect</b>, to offer his/her services as an expert who will be based in Brussels, Belgium. The work will be carried out either in the company’s premises or on si

In [448]:
# Store the text
word = df_text.loc[201, 'text']

# Remove mentions
re.sub(r"@\S+", " ", word)

'JAVA Solution Architect  BE, , Brussels <p><b>EUROPEAN DYNAMICS (#URL_c66532ffa1ce76ab447da6774719060c42c584edbf44d74cdb94fc4ac219ca45#)</b> is a leading European Software, Information and Communication Technologies company, operating internationally (Athens, Brussels, Luxembourg, Copenhagen, Berlin, Rome, Stockholm, London, Nicosia, Helsinki, Valetta, etc). The company employs over 600 engineers and IT experts. We design and develop software applications using integrated, state-of-the-art technology. Our current IT and telecoms projects have a value exceeding 250 million EURO. EUROPEAN DYNAMICS is a renowned supplier of IT services to European Union Institutions, international organizations, European Agencies and national government Administrations all over Europe.</p> <p>We currently have a vacancy for a <b>JAVA Solution Architect</b>, to offer his/her services as an expert who will be based in Brussels, Belgium. The work will be carried out either in the company’s premises or on si

In [449]:
# Remove hashtags
re.sub(r"#\S+", " ", word)

'JAVA Solution Architect  BE, , Brussels <p><b>EUROPEAN DYNAMICS (  is a leading European Software, Information and Communication Technologies company, operating internationally (Athens, Brussels, Luxembourg, Copenhagen, Berlin, Rome, Stockholm, London, Nicosia, Helsinki, Valetta, etc). The company employs over 600 engineers and IT experts. We design and develop software applications using integrated, state-of-the-art technology. Our current IT and telecoms projects have a value exceeding 250 million EURO. EUROPEAN DYNAMICS is a renowned supplier of IT services to European Union Institutions, international organizations, European Agencies and national government Administrations all over Europe.</p> <p>We currently have a vacancy for a <b>JAVA Solution Architect</b>, to offer his/her services as an expert who will be based in Brussels, Belgium. The work will be carried out either in the company’s premises or on site at the customer premises. In the context of the first assignment, the s
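To make it clearer what these two patterns actually match, here is a small sketch on a made-up string (the shortened URL placeholder and the e-mail address are hypothetical). Both patterns consume everything from the symbol up to the next whitespace, so trailing characters such as `)</b>` are removed along with the hashtag, while a lone `@` followed by a space is left untouched.

```python
import re

# Hypothetical sample combining the patterns seen in the data
sample = ("EUROPEAN DYNAMICS (#URL_c665#)</b> is hiring. "
          "Trainee @ Upstream, mail jobs@example.com now")

# '#' plus every following non-space character is replaced by one space
no_tags = re.sub(r"#\S+", " ", sample)

# '@' must be followed by at least one non-space character to match
no_mentions = re.sub(r"@\S+", " ", sample)

print(no_tags)
print(no_mentions)
```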

#### Expand Contractions before removing punctuation and ticks

This step standardizes shortened words such as don't and won't by expanding them (to do not and will not), so that they are handled consistently once punctuation is removed.

In [450]:
# Store the text
word = df_text.loc[201, 'text']

# Apply the function
contractions.fix(word)

'JAVA Solution Architect  BE, , Brussels <p><b>EUROPEAN DYNAMICS (#URL_c66532ffa1ce76ab447da6774719060c42c584edbf44d74cdb94fc4ac219ca45#)</b> is a leading European Software, Information and Communication Technologies company, operating internationally (Athens, Brussels, Luxembourg, Copenhagen, Berlin, Rome, Stockholm, London, Nicosia, Helsinki, Valetta, etc). The company employs over 600 engineers and IT experts. We design and develop software applications using integrated, state-of-the-art technology. Our current IT and telecoms projects have a value exceeding 250 million EURO. EUROPEAN DYNAMICS is a renowned supplier of IT services to European Union Institutions, international organizations, European Agencies and national government Administrations all over Europe.</p> <p>We currently have a vacancy for a <b>JAVA Solution Architect</b>, to offer his/her services as an expert who will be based in Brussels, Belgium. The work will be carried out either in the company’s premises or on si

So far the steps above work as expected. Next, we remove each apostrophe together with the characters that follow it, then remove the punctuation listed in the string library.

In [451]:
# Remove straight apostrophes and the word characters that follow them
re.sub(r"'\w+", '', word)

'JAVA Solution Architect  BE, , Brussels <p><b>EUROPEAN DYNAMICS (#URL_c66532ffa1ce76ab447da6774719060c42c584edbf44d74cdb94fc4ac219ca45#)</b> is a leading European Software, Information and Communication Technologies company, operating internationally (Athens, Brussels, Luxembourg, Copenhagen, Berlin, Rome, Stockholm, London, Nicosia, Helsinki, Valetta, etc). The company employs over 600 engineers and IT experts. We design and develop software applications using integrated, state-of-the-art technology. Our current IT and telecoms projects have a value exceeding 250 million EURO. EUROPEAN DYNAMICS is a renowned supplier of IT services to European Union Institutions, international organizations, European Agencies and national government Administrations all over Europe.</p> <p>We currently have a vacancy for a <b>JAVA Solution Architect</b>, to offer his/her services as an expert who will be based in Brussels, Belgium. The work will be carried out either in the company’s premises or on si

In [452]:
# Same again for the curly apostrophe (’)
re.sub(r"’\w+", '', word)

'JAVA Solution Architect  BE, , Brussels <p><b>EUROPEAN DYNAMICS (#URL_c66532ffa1ce76ab447da6774719060c42c584edbf44d74cdb94fc4ac219ca45#)</b> is a leading European Software, Information and Communication Technologies company, operating internationally (Athens, Brussels, Luxembourg, Copenhagen, Berlin, Rome, Stockholm, London, Nicosia, Helsinki, Valetta, etc). The company employs over 600 engineers and IT experts. We design and develop software applications using integrated, state-of-the-art technology. Our current IT and telecoms projects have a value exceeding 250 million EURO. EUROPEAN DYNAMICS is a renowned supplier of IT services to European Union Institutions, international organizations, European Agencies and national government Administrations all over Europe.</p> <p>We currently have a vacancy for a <b>JAVA Solution Architect</b>, to offer his/her services as an expert who will be based in Brussels, Belgium. The work will be carried out either in the company premises or on site
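The two apostrophe substitutions can also be written as a single pattern with a character class. The following is a minimal sketch on a made-up sentence; `drop_apostrophe_suffix` is a hypothetical helper name, not something defined in the notebook.

```python
import re

# Matches a straight (') or curly (\u2019) apostrophe plus the word characters after it
APOSTROPHE_SUFFIX = re.compile(r"['\u2019]\w+")

def drop_apostrophe_suffix(text):
    """Remove possessive/contraction suffixes such as 's for both apostrophe styles."""
    return APOSTROPHE_SUFFIX.sub("", text)

# Hypothetical sample with one curly and one straight apostrophe
sample = "the company\u2019s premises and the client's site"
result = drop_apostrophe_suffix(sample)
print(result)
```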

Next, we remove punctuation.

In [453]:
# View the punctuations that will be used
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [454]:
re.escape(string.punctuation)

'!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~'

In [455]:
re.sub('[%s]' % re.escape(string.punctuation), '', word)

'JAVA Solution Architect  BE  Brussels pbEUROPEAN DYNAMICS URLc66532ffa1ce76ab447da6774719060c42c584edbf44d74cdb94fc4ac219ca45b is a leading European Software Information and Communication Technologies company operating internationally Athens Brussels Luxembourg Copenhagen Berlin Rome Stockholm London Nicosia Helsinki Valetta etc The company employs over 600 engineers and IT experts We design and develop software applications using integrated stateoftheart technology Our current IT and telecoms projects have a value exceeding 250 million EURO EUROPEAN DYNAMICS is a renowned supplier of IT services to European Union Institutions international organizations European Agencies and national government Administrations all over Europep pWe currently have a vacancy for a bJAVA Solution Architectb to offer hisher services as an expert who will be based in Brussels Belgium The work will be carried out either in the company’s premises or on site at the customer premises In the context of the firs
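Notice that replacing punctuation with an empty string glues words together (`stateoftheart`, `hisher`). A small sketch on a made-up phrase shows the difference between substituting an empty string and substituting a space; the extra spaces produced by the second variant can be collapsed in a later whitespace step.

```python
import re
import string

# Character class built from every ASCII punctuation character
PUNCT = re.compile("[%s]" % re.escape(string.punctuation))

# Hypothetical sample based on phrases seen in the data
sample = "state-of-the-art technology, his/her services"

joined = PUNCT.sub("", sample)    # punctuation deleted: neighbouring words merge
spaced = PUNCT.sub(" ", sample)   # punctuation replaced by a space: words stay separate

print(joined)
print(spaced)
```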

### Step 3: Removal of Numbers/Digits

This step removes not only standalone digits but also any word that contains digits; for example, 123abc is removed entirely. Some meaningful numbers, such as years of experience, are lost as a result, but we assume that actual words carry more meaning than digits.

In [456]:
re.sub(r'\w*\d+\w*', '', word)

'JAVA Solution Architect  BE, , Brussels <p><b>EUROPEAN DYNAMICS (##)</b> is a leading European Software, Information and Communication Technologies company, operating internationally (Athens, Brussels, Luxembourg, Copenhagen, Berlin, Rome, Stockholm, London, Nicosia, Helsinki, Valetta, etc). The company employs over  engineers and IT experts. We design and develop software applications using integrated, state-of-the-art technology. Our current IT and telecoms projects have a value exceeding  million EURO. EUROPEAN DYNAMICS is a renowned supplier of IT services to European Union Institutions, international organizations, European Agencies and national government Administrations all over Europe.</p> <p>We currently have a vacancy for a <b>JAVA Solution Architect</b>, to offer his/her services as an expert who will be based in Brussels, Belgium. The work will be carried out either in the company’s premises or on site at the customer premises. In the context of the first assignment, the s
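As a quick illustration of what the pattern `\w*\d+\w*` matches, the sketch below applies it to a short made-up string. Any run of word characters that contains at least one digit is removed as a whole, leaving the surrounding spaces behind.

```python
import re

# Any word-character run containing at least one digit
DIGIT_WORD = re.compile(r"\w*\d+\w*")

# Hypothetical sample: plain numbers and a mixed token
sample = "over 600 engineers with 5 years, id 123abc"
result = DIGIT_WORD.sub("", sample)
print(repr(result))
```

The doubled spaces in the result are expected and are handled by the whitespace step that follows.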

### Step 4: Removal of Extra Spaces

Using the curly-brace quantifier `{2,}` in regex, we replace any run of 2 or more whitespace characters with a single space.

In [457]:
re.sub(r'\s{2,}', " ", word)

'JAVA Solution Architect BE, , Brussels <p><b>EUROPEAN DYNAMICS (#URL_c66532ffa1ce76ab447da6774719060c42c584edbf44d74cdb94fc4ac219ca45#)</b> is a leading European Software, Information and Communication Technologies company, operating internationally (Athens, Brussels, Luxembourg, Copenhagen, Berlin, Rome, Stockholm, London, Nicosia, Helsinki, Valetta, etc). The company employs over 600 engineers and IT experts. We design and develop software applications using integrated, state-of-the-art technology. Our current IT and telecoms projects have a value exceeding 250 million EURO. EUROPEAN DYNAMICS is a renowned supplier of IT services to European Union Institutions, international organizations, European Agencies and national government Administrations all over Europe.</p> <p>We currently have a vacancy for a <b>JAVA Solution Architect</b>, to offer his/her services as an expert who will be based in Brussels, Belgium. The work will be carried out either in the company’s premises or on sit
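Note that `\s{2,}` leaves single tabs or newlines in place and keeps one leading or trailing space. If fully normalized spacing is wanted, one common alternative (shown here only as a sketch on a made-up string, not as the notebook's method) is `\s+` followed by `strip()`.

```python
import re

# Hypothetical sample with leading spaces, a tab, and a trailing newline
sample = "  JAVA Solution Architect  BE, ,\tBrussels \n"

# Notebook's pattern: only runs of two or more whitespace characters change
collapsed = re.sub(r"\s{2,}", " ", sample)

# Alternative: every whitespace run becomes one space, then the ends are trimmed
normalized = re.sub(r"\s+", " ", sample).strip()

print(repr(collapsed))
print(repr(normalized))
```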

### Step 5: Combine the 4 Steps Above into One

Let's combine everything into a user-defined function. This function acts as an initial processing step before we tokenize the texts and remove stopwords.

In [458]:
# ASCII control characters 1-31 (includes \t, \n and \r); named to avoid shadowing the built-in filter()
control_chars = ''.join([chr(i) for i in range(1, 32)])

def preprocess_text(x):
    
    # Lowercase text
    x = x.lower()
    
    # Remove unicode chars, escape chars, HTML chars & special chars
    x = x.encode('ascii', 'ignore').decode()
    x = x.translate(str.maketrans('', '', control_chars))
    x = re.sub(r'<[^<]+?>', ' ', x)
    x = re.sub('\xa0', ' ', x)  # already dropped by the ASCII step; kept as a safeguard
    x = re.sub('&amp;', ' ', x)
    
    # Remove URL, mentions and hashtags
    x = re.sub(r"https?\S+", " ", x)
    x = re.sub(r"@\S+", " ", x)
    x = re.sub(r"#\S+", " ", x)
    
    # Expand contractions for shortened words
    x = contractions.fix(x)
    
    # Remove apostrophes (with the characters after them) and punctuation
    x = re.sub(r"'\w+", '', x)
    x = re.sub(r"’\w+", '', x)  # curly apostrophes are already dropped by the ASCII step
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
    
    # Remove digits and words containing digits
    x = re.sub(r'\w*\d+\w*', '', x)
    
    # Collapse runs of 2 or more whitespace characters into one space
    x = re.sub(r'\s{2,}', ' ', x)
    
    return x

In [459]:
df_text['text'].apply(preprocess_text)

0        marketing intern us ny new york we are and we have created a groundbreaking and award winning co...
1        customer service cloud video production nz auckland seconds the worlds cloud video production se...
2        commissioning machinery assistant cma us ia wever valor services provides workforce solutions th...
3        account executive washington dc us dc washington our passion for improving quality of life throu...
4        bill review manager us fl fort worth spotsource solutions llc is a global human capital manageme...
                                                        ...                                                 
17875    account director distribution ca on toronto vend is looking for some awesome new talent to come ...
17876    payroll accountant us pa philadelphia weblinc is the e commerce platform and services provider f...
17877    project cost control staff engineer cost control exp tx us tx houston we provide full time perma...
17878    graphic de

In [460]:
# Apply the function to create a new column
df_text['clean_text'] = df_text['text'].apply(preprocess_text)

# View the df
df_text.head(4)

Unnamed: 0,title,location,company_profile,description,requirements,benefits,fraudulent,text,clean_text
0,Marketing Intern,"US, NY, New York","<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,0,"Marketing Intern US, NY, New York <h3>We're Food52, and we've created a groundbreaking and award...",marketing intern us ny new york we are and we have created a groundbreaking and award winning co...
1,Customer Service - Cloud Video Production,"NZ, , Auckland","<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,0,"Customer Service - Cloud Video Production NZ, , Auckland <h3>90 Seconds, the worlds Cloud Video ...",customer service cloud video production nz auckland seconds the worlds cloud video production se...
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p>Our client, located in Houston, is actively seeking an experienced Commissioning Machinery As...",<ul>\r\n<li>Implement pre-commissioning and commissioning procedures for rotary equipment.</li>\...,,0,"Commissioning Machinery Assistant (CMA) US, IA, Wever <h3></h3>\r\n<p>Valor Services provides Wo...",commissioning machinery assistant cma us ia wever valor services provides workforce solutions th...
3,Account Executive - Washington DC,"US, DC, Washington",<p>Our passion for improving quality of life through geography is at the heart of everything we ...,<p><b>THE COMPANY: ESRI – Environmental Systems Research Institute</b></p>\r\n<p>Our passion for...,"<ul>\r\n<li>\r\n<b>EDUCATION: </b>Bachelor’s or Master’s in GIS, business administration, or a r...","<p>Our culture is anything but corporate—we have a collaborative, creative environment; phone di...",0,"Account Executive - Washington DC US, DC, Washington <p>Our passion for improving quality of lif...",account executive washington dc us dc washington our passion for improving quality of life throu...


In [461]:
df_text.shape

(17645, 9)
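One side effect of the step order in `preprocess_text` is worth illustrating: because the ASCII step runs before the apostrophe step, a curly apostrophe is dropped before its suffix can be removed. This likely explains tokens such as `airenvys` in the cleaned text. The sketch below compares both orders on a made-up string; the two helper functions are hypothetical and only reproduce the two relevant steps.

```python
import re

APOSTROPHE_SUFFIX = r"['\u2019]\w+"

def ascii_first(text):
    """Order used in preprocess_text: drop non-ASCII, then strip apostrophe suffixes."""
    text = text.encode("ascii", "ignore").decode()
    return re.sub(APOSTROPHE_SUFFIX, "", text)

def apostrophe_first(text):
    """Alternative order: strip apostrophe suffixes, then drop non-ASCII."""
    text = re.sub(APOSTROPHE_SUFFIX, "", text)
    return text.encode("ascii", "ignore").decode()

# Hypothetical sample with a curly apostrophe
sample = "airenvy\u2019s mission"

current = ascii_first(sample)
alternative = apostrophe_first(sample)
print(current)
print(alternative)
```

Whether the possessive form matters for the model is a judgment call; the sketch only shows that the order changes the resulting token.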

### Step 6: Tokenization and Stopwords Removal

We have 2 options: the built-in split method, or the word_tokenize function from the nltk library. We can try both and compare.

In [462]:
# Test the word tokenizer
word_tokenize("Good evening my friends")

['Good', 'evening', 'my', 'friends']
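For comparison, here is the `split` option on two made-up strings. On text that has already been cleaned of punctuation, whitespace splitting gives the same four tokens as the `word_tokenize` call above; on raw text, punctuation stays attached to the neighbouring word, which is where a dedicated tokenizer would typically behave differently.

```python
# Hypothetical samples: one already cleaned, one raw
sample_clean = "good evening my friends"
sample_raw = "Good evening, my friends!"

tokens_clean = sample_clean.split()   # split on any run of whitespace
tokens_raw = sample_raw.split()       # punctuation remains glued to the words

print(tokens_clean)
print(tokens_raw)
```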

In [463]:
# View the available stopwords
list(set(stopwords.words('english')))

['which',
 'for',
 'was',
 'both',
 "should've",
 "haven't",
 "that'll",
 've',
 'and',
 'aren',
 "shouldn't",
 'other',
 'i',
 'these',
 'by',
 'when',
 'hadn',
 'myself',
 'have',
 'before',
 'just',
 'ourselves',
 'of',
 'shan',
 'him',
 'am',
 'while',
 'over',
 'there',
 "hasn't",
 'any',
 "wouldn't",
 'whom',
 'same',
 'your',
 's',
 'after',
 'been',
 'y',
 'this',
 'can',
 'them',
 "needn't",
 "you'll",
 "wasn't",
 'yours',
 'don',
 'mightn',
 'ours',
 'my',
 'with',
 'its',
 'some',
 'above',
 "don't",
 'no',
 'does',
 "won't",
 "aren't",
 'shouldn',
 'most',
 'it',
 'who',
 'they',
 'had',
 'here',
 'is',
 'theirs',
 "you'd",
 'mustn',
 'an',
 "you've",
 'himself',
 'about',
 'during',
 'doesn',
 'did',
 'doing',
 'be',
 'what',
 'too',
 'than',
 'will',
 'further',
 'more',
 'up',
 'now',
 'a',
 'until',
 'not',
 'his',
 'off',
 'where',
 'own',
 "shan't",
 "hadn't",
 'as',
 'their',
 'do',
 'itself',
 'why',
 'yourself',
 "weren't",
 'so',
 'below',
 'against',
 "didn't",
 

Let's try applying the tokenizer to the first nine rows of the clean_text column.

In [464]:
df_text.loc[:8, 'clean_text'].apply(word_tokenize)

0    [marketing, intern, us, ny, new, york, we, are, and, we, have, created, a, groundbreaking, and, ...
1    [customer, service, cloud, video, production, nz, auckland, seconds, the, worlds, cloud, video, ...
2    [commissioning, machinery, assistant, cma, us, ia, wever, valor, services, provides, workforce, ...
3    [account, executive, washington, dc, us, dc, washington, our, passion, for, improving, quality, ...
4    [bill, review, manager, us, fl, fort, worth, spotsource, solutions, llc, is, a, global, human, c...
5    [accounting, clerk, us, md, job, overview, apex, is, an, environmental, consulting, firm, that, ...
6    [head, of, content, m, f, de, be, berlin, founded, in, the, fonpit, ag, rose, with, its, interna...
7    [lead, guest, service, specialist, us, ca, san, francisco, airenvys, mission, is, to, provide, l...
8    [hp, bsm, sme, us, fl, pensacola, is, a, woman, owned, small, business, whose, focus, is, it, se...
Name: clean_text, dtype: object

Can we tokenize and remove stopwords in a single chained expression?

In [465]:
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
df_text.loc[:8, 'clean_text'].apply(word_tokenize).apply(lambda x: [
    w for w in x if w.lower() not in stop_words
])

0    [marketing, intern, us, ny, new, york, created, groundbreaking, award, winning, cooking, site, s...
1    [customer, service, cloud, video, production, nz, auckland, seconds, worlds, cloud, video, produ...
2    [commissioning, machinery, assistant, cma, us, ia, wever, valor, services, provides, workforce, ...
3    [account, executive, washington, dc, us, dc, washington, passion, improving, quality, life, geog...
4    [bill, review, manager, us, fl, fort, worth, spotsource, solutions, llc, global, human, capital,...
5    [accounting, clerk, us, md, job, overview, apex, environmental, consulting, firm, offers, stable...
6    [head, content, f, de, berlin, founded, fonpit, ag, rose, international, web, portal, androidpit...
7    [lead, guest, service, specialist, us, ca, san, francisco, airenvys, mission, provide, lucrative...
8    [hp, bsm, sme, us, fl, pensacola, woman, owned, small, business, whose, focus, service, manageme...
Name: clean_text, dtype: object

Chaining the two steps works fine, so let's apply them to the whole column.

In [466]:
# Apply word_tokenize with stopwords removal
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
df_text['tokenized_text'] = df_text['clean_text'].apply(word_tokenize).apply(lambda x: [
    w for w in x if w.lower() not in stop_words
])

# View the df
df_text.head(4)

Unnamed: 0,title,location,company_profile,description,requirements,benefits,fraudulent,text,clean_text,tokenized_text
0,Marketing Intern,"US, NY, New York","<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,0,"Marketing Intern US, NY, New York <h3>We're Food52, and we've created a groundbreaking and award...",marketing intern us ny new york we are and we have created a groundbreaking and award winning co...,"[marketing, intern, us, ny, new, york, created, groundbreaking, award, winning, cooking, site, s..."
1,Customer Service - Cloud Video Production,"NZ, , Auckland","<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,0,"Customer Service - Cloud Video Production NZ, , Auckland <h3>90 Seconds, the worlds Cloud Video ...",customer service cloud video production nz auckland seconds the worlds cloud video production se...,"[customer, service, cloud, video, production, nz, auckland, seconds, worlds, cloud, video, produ..."
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p>Our client, located in Houston, is actively seeking an experienced Commissioning Machinery As...",<ul>\r\n<li>Implement pre-commissioning and commissioning procedures for rotary equipment.</li>\...,,0,"Commissioning Machinery Assistant (CMA) US, IA, Wever <h3></h3>\r\n<p>Valor Services provides Wo...",commissioning machinery assistant cma us ia wever valor services provides workforce solutions th...,"[commissioning, machinery, assistant, cma, us, ia, wever, valor, services, provides, workforce, ..."
3,Account Executive - Washington DC,"US, DC, Washington",<p>Our passion for improving quality of life through geography is at the heart of everything we ...,<p><b>THE COMPANY: ESRI – Environmental Systems Research Institute</b></p>\r\n<p>Our passion for...,"<ul>\r\n<li>\r\n<b>EDUCATION: </b>Bachelor’s or Master’s in GIS, business administration, or a r...","<p>Our culture is anything but corporate—we have a collaborative, creative environment; phone di...",0,"Account Executive - Washington DC US, DC, Washington <p>Our passion for improving quality of lif...",account executive washington dc us dc washington our passion for improving quality of life throu...,"[account, executive, washington, dc, us, dc, washington, passion, improving, quality, life, geog..."


In [467]:
df_text.shape

(17645, 10)
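The stopword filter used in this step can be sketched in plain Python. Building the stop-word set once and reusing it keeps the filter fast, since membership tests on a set take constant time on average. The tiny stop-word list and the function name `remove_stopwords` are hypothetical; the notebook itself relies on NLTK's English list.

```python
# Hypothetical, tiny stop-word list for illustration only; the notebook
# uses NLTK's English list via stopwords.words('english').
STOP_WORDS = frozenset({"we", "are", "and", "have", "a", "the", "is", "of"})

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lower-cased form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "marketing intern us ny new york we are and we have created a groundbreaking".split()
filtered = remove_stopwords(tokens)
print(filtered)
```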

### Step 7: Lemmatization

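Lemmatization is often preferable to the Porter stemmer because it maps each word to a dictionary form (its lemma) using WordNet instead of just stripping suffixes. As before, we can test the lemmatizer on a sample string first. Note that WordNetLemmatizer treats every token as a noun unless a part-of-speech tag is supplied, which is why verbs such as "improving" are left unchanged and "us" becomes "u" in the output below.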

In [468]:
[WordNetLemmatizer().lemmatize(w) for w in df_text.loc[3, 'clean_text'].split()]

['account',
 'executive',
 'washington',
 'dc',
 'u',
 'dc',
 'washington',
 'our',
 'passion',
 'for',
 'improving',
 'quality',
 'of',
 'life',
 'through',
 'geography',
 'is',
 'at',
 'the',
 'heart',
 'of',
 'everything',
 'we',
 'do',
 'esris',
 'geographic',
 'information',
 'system',
 'gi',
 'technology',
 'inspires',
 'and',
 'enables',
 'government',
 'university',
 'and',
 'business',
 'worldwide',
 'to',
 'save',
 'money',
 'life',
 'and',
 'our',
 'environment',
 'through',
 'a',
 'deeper',
 'understanding',
 'of',
 'the',
 'changing',
 'world',
 'around',
 'them',
 'carefully',
 'managed',
 'growth',
 'and',
 'zero',
 'debt',
 'give',
 'esri',
 'stability',
 'that',
 'is',
 'uncommon',
 'in',
 'today',
 'volatile',
 'business',
 'world',
 'privately',
 'held',
 'we',
 'offer',
 'exceptional',
 'benefit',
 'competitive',
 'salary',
 'k',
 'and',
 'profit',
 'sharing',
 'program',
 'opportunity',
 'for',
 'personal',
 'and',
 'professional',
 'growth',
 'and',
 'much',
 'mor

In [469]:
# Create a simple lemmatizer function
wnl = WordNetLemmatizer()  # instantiate once and reuse for every token

def lemmatizer(tokens):
    # Lemmatize each token (treated as a noun by default)
    return [wnl.lemmatize(token) for token in tokens]

# Apply the function to an entire column
df_text['lemmatized_text'] = df_text['tokenized_text'].apply(lemmatizer)

# View the new df
df_text.head(4)

Unnamed: 0,title,location,company_profile,description,requirements,benefits,fraudulent,text,clean_text,tokenized_text,lemmatized_text
0,Marketing Intern,"US, NY, New York","<h3>We're Food52, and we've created a groundbreaking and award-winning cooking site. We support,...","<p>Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and...",<ul>\r\n<li>Experience with content management systems a major plus (any blogging counts!)</li>\...,,0,"Marketing Intern US, NY, New York <h3>We're Food52, and we've created a groundbreaking and award...",marketing intern us ny new york we are and we have created a groundbreaking and award winning co...,"[marketing, intern, us, ny, new, york, created, groundbreaking, award, winning, cooking, site, s...","[marketing, intern, u, ny, new, york, created, groundbreaking, award, winning, cooking, site, su..."
1,Customer Service - Cloud Video Production,"NZ, , Auckland","<h3>90 Seconds, the worlds Cloud Video Production Service.</h3>\r\n<p>90 Seconds is the worlds C...",<p>Organised - Focused - Vibrant - Awesome!<br><br>Do you have a passion for customer service? S...,<p><b>What we expect from you:</b></p>\r\n<p>Your key responsibility will be to communicate with...,<h3><b>What you will get from us</b></h3>\r\n<p>Through being part of the 90 Seconds team you wi...,0,"Customer Service - Cloud Video Production NZ, , Auckland <h3>90 Seconds, the worlds Cloud Video ...",customer service cloud video production nz auckland seconds the worlds cloud video production se...,"[customer, service, cloud, video, production, nz, auckland, seconds, worlds, cloud, video, produ...","[customer, service, cloud, video, production, nz, auckland, second, world, cloud, video, product..."
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",<h3></h3>\r\n<p>Valor Services provides Workforce Solutions that meet the needs of companies acr...,"<p>Our client, located in Houston, is actively seeking an experienced Commissioning Machinery As...",<ul>\r\n<li>Implement pre-commissioning and commissioning procedures for rotary equipment.</li>\...,,0,"Commissioning Machinery Assistant (CMA) US, IA, Wever <h3></h3>\r\n<p>Valor Services provides Wo...",commissioning machinery assistant cma us ia wever valor services provides workforce solutions th...,"[commissioning, machinery, assistant, cma, us, ia, wever, valor, services, provides, workforce, ...","[commissioning, machinery, assistant, cma, u, ia, wever, valor, service, provides, workforce, so..."
3,Account Executive - Washington DC,"US, DC, Washington",<p>Our passion for improving quality of life through geography is at the heart of everything we ...,<p><b>THE COMPANY: ESRI – Environmental Systems Research Institute</b></p>\r\n<p>Our passion for...,"<ul>\r\n<li>\r\n<b>EDUCATION: </b>Bachelor’s or Master’s in GIS, business administration, or a r...","<p>Our culture is anything but corporate—we have a collaborative, creative environment; phone di...",0,"Account Executive - Washington DC US, DC, Washington <p>Our passion for improving quality of lif...",account executive washington dc us dc washington our passion for improving quality of life throu...,"[account, executive, washington, dc, us, dc, washington, passion, improving, quality, life, geog...","[account, executive, washington, dc, u, dc, washington, passion, improving, quality, life, geogr..."


In [470]:
df_text.shape

(17645, 11)
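To make the contrast between stemming and lemmatization from the start of this step concrete, here is a toy sketch. Neither function reproduces NLTK's PorterStemmer or WordNetLemmatizer: the suffix list, the lemma table, and the function names are all hypothetical and chosen only to show rule-based stripping versus dictionary lookup.

```python
# Toy illustration only. The suffix rules and lemma table are hypothetical
# and far simpler than the Porter algorithm or WordNet.
SUFFIXES = ("ies", "ing", "es", "s")
LEMMAS = {"universities": "university", "lives": "life", "benefits": "benefit"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary of known lemmas; fall back to the word itself."""
    return LEMMAS.get(word, word)

words = ["universities", "lives", "benefits"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The stemmer can produce fragments that are not real words, while the lookup returns proper dictionary forms but only for words it knows.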

Keep in mind that this lemmatized text is what we'll be using to create the word cloud during EDA later.

### Converting lemmatized list into full text

By default, CountVectorizer and the other scikit-learn vectorizers expect each document as a single string rather than a list of tokens. Thus, we need to join the lemmatized tokens back into strings before feeding them into the vectorizers.

In [472]:
df_text.loc[:8, 'lemmatized_text'].apply(lambda x: " ".join(x))

0    marketing intern u ny new york created groundbreaking award winning cooking site support connect...
1    customer service cloud video production nz auckland second world cloud video production service ...
2    commissioning machinery assistant cma u ia wever valor service provides workforce solution meet ...
3    account executive washington dc u dc washington passion improving quality life geography heart e...
4    bill review manager u fl fort worth spotsource solution llc global human capital management cons...
5    accounting clerk u md job overview apex environmental consulting firm offer stable leadership gr...
6    head content f de berlin founded fonpit ag rose international web portal androidpit world larges...
7    lead guest service specialist u ca san francisco airenvys mission provide lucrative yet hassle f...
8    hp bsm sme u fl pensacola woman owned small business whose focus service management using best b...
Name: lemmatized_text, dtype: object
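The join shown above can also be done once up front so that the same strings are reused by every vectorizer. A minimal sketch in plain Python, where the two token lists are hypothetical stand-ins for rows of `df_text['lemmatized_text']`:

```python
# Hypothetical token lists standing in for rows of df_text['lemmatized_text']
lemmatized = [
    ["marketing", "intern", "u", "ny", "new", "york"],
    ["customer", "service", "cloud", "video", "production"],
]

# Join each list once; the resulting strings can be passed to any vectorizer
documents = [" ".join(tokens) for tokens in lemmatized]

# Because no token contains whitespace, splitting recovers the original lists
round_trip_ok = all(doc.split() == tokens for doc, tokens in zip(documents, lemmatized))
print(documents)
```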

In [494]:
df_text.columns

Index(['title', 'location', 'company_profile', 'description', 'requirements',
       'benefits', 'fraudulent', 'text', 'clean_text', 'tokenized_text',
       'lemmatized_text'],
      dtype='object')

### A) Bag-of-Words Vectorization

In [476]:
count_vect = CountVectorizer()
count_vector = count_vect.fit_transform(df_text['lemmatized_text'].apply(lambda x: " ".join(x)))

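# Show the dimension and feature names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions)
print(count_vector.shape)
print(count_vect.get_feature_names())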

(17645, 42596)


In [477]:
type(count_vector)

scipy.sparse.csr.csr_matrix

In [478]:
count_vector.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
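To illustrate what the document-term matrix above contains, here is a pure-Python sketch of a bag-of-words count. It is not how scikit-learn implements CountVectorizer; it only mimics two of its defaults: the vocabulary is sorted alphabetically, and the default token pattern keeps only tokens of two or more characters, so single-letter lemmas such as "u" do not become features. The two sample documents are hypothetical.

```python
import re
from collections import Counter

# Hypothetical documents standing in for the joined lemmatized text
docs = [
    "customer service cloud video production service",
    "cloud video production service u",
]

# Mimic CountVectorizer's default token pattern: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")
tokenized = [TOKEN_PATTERN.findall(d.lower()) for d in docs]

# Columns follow the sorted vocabulary; each term maps to a column index
terms = sorted({t for doc in tokenized for t in doc})
vocabulary = {term: i for i, term in enumerate(terms)}

# One row per document, one count per vocabulary term
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[term] for term in terms])

print(vocabulary)
print(matrix)
```

Most entries are zero once the vocabulary grows to tens of thousands of terms, which is why the real result is stored as a sparse matrix.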

In [479]:
print(count_vect.vocabulary_)



### B) Bi-Grams Vectorization

In [480]:
bigram_vect = CountVectorizer(ngram_range=(2, 2))
bigram_vector = bigram_vect.fit_transform(df_text['lemmatized_text'].apply(lambda x: " ".join(x)))

# Show the dimension and feature names
print(bigram_vector.shape)
print(bigram_vect.get_feature_names())

(17645, 829493)


IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [481]:
type(bigram_vector)

scipy.sparse.csr.csr_matrix

In [None]:
bigram_vector.toarray()

In [483]:
print(bigram_vect.vocabulary_)




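Unfortunately, the feature names and vocabulary of the bigram vectorizer cannot be displayed, because printing all 829,493 entries exceeds the notebook's IOPub data rate limit. Converting the matrix with toarray() is likewise impractical: a dense int64 array of 17,645 × 829,493 cells would need roughly 117 GB of memory. Regardless, the vectorizer itself ran without errors.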
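Although the bigram features cannot be printed here, the idea behind them is simple to sketch: each feature is a pair of adjacent tokens joined by a space. The helper name `bigrams` and the sample tokens are hypothetical; this is an illustration, not scikit-learn's implementation.

```python
def bigrams(tokens):
    """Return adjacent token pairs joined by a space, e.g. 'cloud video'."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

tokens = ["cloud", "video", "production", "service"]
pairs = bigrams(tokens)
print(pairs)
```

A document of n tokens yields n - 1 bigrams, and distinct pairs are far more numerous than distinct words, which is consistent with the jump from 42,596 unigram columns to 829,493 bigram columns.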

### C) TFIDF Vectorization

In [484]:
tfidf_vect = TfidfVectorizer()
tfidf_vector = tfidf_vect.fit_transform(df_text['lemmatized_text'].apply(lambda x: " ".join(x)))

# Show the dimension and feature names
print(tfidf_vector.shape)
print(tfidf_vect.get_feature_names())

(17645, 42596)


In [485]:
type(tfidf_vector)

scipy.sparse.csr.csr_matrix

In [486]:
tfidf_vector.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [488]:
print(tfidf_vect.vocabulary_)



In [489]:
tfidf_vect.idf_

array([ 6.77093122,  7.22291635, 10.08511723, ...,  7.08938495,
        7.78253213,  9.39197005])

In [490]:
tfidf_vect.idf_.size

42596
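The idf_ values above can be reproduced by hand. With its default settings (smooth_idf=True, norm='l2'), TfidfVectorizer computes idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then scales each row to unit length. The sketch below is a plain-Python illustration of that formula, not the library's code; the function names and the per-posting counts and document frequencies are hypothetical.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smoothed variant."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(term_counts, doc_freqs, n_docs):
    """Weight raw counts by idf, then scale the row to unit L2 norm."""
    weights = {t: c * smoothed_idf(n_docs, doc_freqs[t]) for t, c in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

# A term found in exactly one of the 17,645 postings
rare_idf = smoothed_idf(17645, 1)
# A term found in three postings
less_rare_idf = smoothed_idf(17645, 3)

# Hypothetical counts and document frequencies for a single posting
row = tfidf_row({"service": 2, "cloud": 1}, {"service": 5000, "cloud": 300}, 17645)

print(round(rare_idf, 4), round(less_rare_idf, 4))
print(row)
```

The two idf results agree with the third and last entries of the idf_ array shown above (about 10.0851 and 9.3920), which suggests those terms occur in one and three postings respectively.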

## Export the cleaned data for EDA

EDA will be done in a new notebook for a smoother coding experience. To that end, we'll export the cleaned data (df10) as a CSV file to be used later.

In [491]:
df10.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17645 entries, 0 to 17879
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   title                17645 non-null  object
 1   location             17645 non-null  object
 2   company_profile      17645 non-null  object
 3   description          17645 non-null  object
 4   requirements         17645 non-null  object
 5   benefits             17645 non-null  object
 6   telecommuting        17645 non-null  int64 
 7   has_company_logo     17645 non-null  int64 
 8   has_questions        17645 non-null  int64 
 9   employment_type      17645 non-null  object
 10  required_experience  17645 non-null  object
 11  required_education   17645 non-null  object
 12  function             17645 non-null  object
 13  fraudulent           17645 non-null  int64 
 14  in_balanced_dataset  17645 non-null  int64 
 15  country              17645 non-null  object
 16  stat

In [492]:
df10.shape

(17645, 17)

In [493]:
df10.to_csv("D:/Documents/Data Science Learning/My Project/Recruitment Scam/02-data/recruitment_cleaned.csv", 
            index=False)