## Requirements

1. Scrape and prepare your own data.

2. **Create and compare at least two models for each section**. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression of your choosing (e.g. Ridge, logistic regression, KNN, SVM, etc).
   - Section 1: Job Salary Trends
   - Section 2: Job Category Factors

3. Prepare a polished Jupyter Notebook with your analysis for a peer audience of data scientists. 
   - Make sure to clearly describe and label each section.
   - Comment on your code so that others could, in theory, replicate your work.

4. Write a brief executive summary for a non-technical audience.
   - The writeup should be 500-1000 words. Define any technical terms and explain your approach, along with its risks and limitations.


## Suggestions for Getting Started

1. Collect data from [Indeed.com](www.indeed.com) (or another aggregator) on data-related jobs to use in predicting salary trends for your analysis.
   - Select and parse data from *at least 1000 postings* for jobs, potentially from multiple location searches.
2. Find out what factors most directly impact salaries (e.g. title, location, department, etc).
   - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
3. Discover which features have the greatest importance when determining a low vs. high paying job.
   - Your boss is interested in which features overall hold the greatest significance.
   - HR is interested in which SKILLS and KEYWORDS hold the greatest significance.
4. Author an executive summary that details the highlights of your analysis for a non-technical audience.
5. If tackling the bonus question, try framing the salary problem as a classification problem detecting low vs. high salary positions.
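
One way to read suggestion 5 is to label each posting as high or low paying relative to the median salary. The brief does not prescribe a cutoff, so the median is an assumption here, and both the `label_high_low` helper and the sample values are hypothetical.

```python
from statistics import median


def label_high_low(salaries):
    """Return 1 for salaries above the median and 0 otherwise."""
    cutoff = median(salaries)
    return [1 if s > cutoff else 0 for s in salaries]


# hypothetical monthly salaries, for illustration only
sample = [5100.0, 7500.0, 8500.0, 9000.0, 10500.0, 17500.0]
labels = label_high_low(sample)
print(labels)
```

Splitting at the median gives two classes of roughly equal size, which keeps a 50% accuracy baseline easy to interpret.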

In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import os 
import re
import pickle

from bs4 import BeautifulSoup
import urllib
import urllib.parse
from time import sleep

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
from sklearn.model_selection import train_test_split,KFold, cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.linear_model import Lasso, ElasticNet, Ridge, LassoCV, ElasticNetCV, \
RidgeCV, LinearRegression
from sklearn.metrics import mean_squared_error

# conventional aliases: sm for the main API, smf for the formula API
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [3]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth', 100)


In [4]:
import warnings
warnings.filterwarnings('ignore')
df = pd.read_pickle("./master1.pkl")
df.shape

(1224, 14)

In [5]:
# something wrong with pickle. had to manually update each one as for loop produced unpickling error 123
# x = pd.read_pickle("./CFData.pkl")
# print(x.shape)
# df = df.append(x, ignore_index=True)
# print(df.shape)

In [8]:
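# **ERROR 123 when executing the loop below. tried edo's solution and it still doesn't work.

# find the pickles whose names start with "CF" and concatenate them
files = os.listdir('.')  # not named `list`, so the builtin stays usable later on
print(files)
frames = []
for fname in files:
    print(fname)
    # the regex "CF*" matched any name containing a "C"; test the prefix explicitly instead
    if fname.startswith('CF') and fname.endswith('.pkl'):
        frames.append(pd.read_pickle(os.path.join('.', fname)))
# DataFrame.append returns a new frame (its result was being discarded), so use pd.concat;
# df is left untouched when no matching pickle is found
if frames:
    df = pd.concat(frames, ignore_index=True)
print(df.shape)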
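
The pattern `"CF*"` reads differently as a regular expression than as a shell-style wildcard, which may be part of why the loop above picked up unexpected files. A minimal sketch of the difference, using a made-up directory listing (the file names are hypothetical):

```python
import fnmatch
import re

# hypothetical directory listing, for illustration only
names = ["CFData.pkl", "CFData2.pkl", "master1.pkl", "Untitled.ipynb", "ACME.csv"]

# As a regular expression, "CF*" means "C followed by zero or more F",
# so it matches any name that contains a capital C.
regex_hits = [n for n in names if re.search("CF*", n)]

# As a shell-style wildcard, "CF*" means "starts with CF".
glob_hits = fnmatch.filter(names, "CF*")

print(regex_hits)
print(glob_hits)
```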

In [9]:
df.shape

(1224, 14)

### SALARY

In [10]:
# check nulls for salary
df[df.salary.isnull()].sample(5)

Unnamed: 0,advertiser,elapsed,industry,jd,jobtitle,jobtype,location,ojoblink,postcode,salary,searchstring,seniority,source,url
969,,,,,,,Singapore,https://content.mycareersfuture.sg/dressing-job-interview-success/,,,Machine Learning,,,https://content.mycareersfuture.sg/dressing-job-interview-success/
624,,,,,,,Singapore,https://www.mycareersfuture.sg/job/avp-senior-associate-data-analyst-analytic-center-excellence-...,,,Data Engineer,,,https://www.mycareersfuture.sg/job/avp-senior-associate-data-analyst-analytic-center-excellence-...
623,,,,,,,Singapore,https://content.mycareersfuture.sg/dressing-job-interview-success/,,,Data Engineer,,,https://content.mycareersfuture.sg/dressing-job-interview-success/
401,,,,,,,Singapore,https://www.mycareersfuture.sg/job/fpa-manager-international-baccalaureate-organization-751ae87b...,,,Business Intelligence,,,https://www.mycareersfuture.sg/job/fpa-manager-international-baccalaureate-organization-751ae87b...
153,,,,,,,Singapore,https://www.mycareersfuture.sg/job/big-data-architect-open-text-78285fb98a68ca067591320823a45c84,,,Big Data,,,https://www.mycareersfuture.sg/job/big-data-architect-open-text-78285fb98a68ca067591320823a45c84


### JD

In [11]:
# let's see if salary info can be extracted from the JD
df.jd.sample(10)

227     Roles & ResponsibilitiesCOMPANY DESCRIPTION CEVA provides world class supply chain solutions for...
1142    Roles & ResponsibilitiesThe Business Consultant is responsible for providing functional applicat...
214     Roles & Responsibilities Lead the design of Cost management software at Airline. Map invoice man...
904                                                                                                     NaN
204                                                                                                     NaN
222                                                                                                     NaN
185     Roles & Responsibilities Demonstrates understanding of business needs & complex business require...
303     Roles & ResponsibilitiesEmployment Type – Permanent Location – Singapore Job roles  Work closely...
1074    Roles & ResponsibilitiesGeneral  Assist & support the CEO in all aspects of work related to the ...
593     Roles & Responsibili

In [12]:
#zooming in on sample
df.jd[14].split('\n')

['Roles & ResponsibilitiesAre you an experienced data warehouse and big data analytics specialist? Do you like to solve the most complex and high scale data challenges in the world today? Do you want to have an impact in the development and use of new data analytics technologies? Would you like a career that gives you opportunities to help customers and partners use cloud computing web services to do big new things faster, at lower cost?\xa0 At AWS, we’re hiring highly technical cloud computing architects to collaborate with our customers and partners on key engagements. Our consultants will develop and deliver proof-of-concept projects, technical workshops, and support implementation projects. These professional services engagements will focus on\xa0customer solutions\xa0such as batch data processing, designing and deploying future state of fully managed, petabyte-scale data warehouse service and assist in building or designing reference configurations to enable our customers and infl

In [13]:
# drop rows with no jds. they have expired. 
df.dropna(axis=0, subset=['jd'], inplace=True)
df.isnull().sum()

advertiser        0
elapsed           0
industry          0
jd                0
jobtitle          0
jobtype           0
location          0
ojoblink          0
postcode        202
salary            0
searchstring      0
seniority         0
source            0
url               0
dtype: int64

In [14]:
# no null salaries left, but some postings are listed as 'Salary undisclosed'
df.salary.value_counts(dropna=False)[:5]

Salary undisclosed    49
$6,000 $8,000         36
$4,000 $8,000         36
$5,000 $7,000         35
$7,000 $12,000        25
Name: salary, dtype: int64

In [15]:
# split the "$min $max" salary string into numeric min and max columns;
# undisclosed salaries are set to None and end up as NaN
df.salary = np.where(df.salary=='Salary undisclosed', None, df.salary)

def to_amount(text, position):
    """Return the amount at `position` (0 = min, 1 = max) as a float, or NaN."""
    if pd.isna(text):
        return np.nan
    return float(text.split(' ')[position].replace('$', '').replace(',', ''))

# no bare try/except here: a malformed salary string should raise rather than pass silently
df['minsalary'] = df.salary.apply(to_amount, position=0)
df['maxsalary'] = df.salary.apply(to_amount, position=1)
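
The split above can also be sketched as a standalone function, which makes it easy to check on a single string. This is a minimal sketch that assumes every disclosed salary has exactly two dollar amounts; `parse_salary_range` is a hypothetical helper, not part of the notebook.

```python
import re


def parse_salary_range(text):
    """Return (min, max, average) as floats, or None if no range is disclosed."""
    if text is None:
        return None
    amounts = re.findall(r"\$([\d,]+)", text)
    if len(amounts) != 2:
        return None
    low, high = (float(a.replace(",", "")) for a in amounts)
    return low, high, (low + high) / 2


print(parse_salary_range("$6,000 $12,000"))
print(parse_salary_range("Salary undisclosed"))
```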

In [16]:
# create avgsalary col 
df['avgsalary'] = (df['minsalary']+df['maxsalary'])/2
df.head()

Unnamed: 0,advertiser,elapsed,industry,jd,jobtitle,jobtype,location,ojoblink,postcode,salary,searchstring,seniority,source,url,minsalary,maxsalary,avgsalary
0,ERNST & YOUNG ADVISORY PTE. LTD.,0.0,"Consulting , Banking and Finance, Information Technology",Roles & ResponsibilitiesWe are the only professional services organisation who has a separate bu...,Big Data Engineer (Financial Services),Full Time,Singapore,https://www.mycareersfuture.sg/job/big-data-engineer-ernst-young-advisory-5c6015c43915d53b5bec72...,48583,"$6,000 $12,000",Big Data,Manager,ERNST & YOUNG ADVISORY PTE. LTD.,https://www.mycareersfuture.sg/job/big-data-engineer-ernst-young-advisory-5c6015c43915d53b5bec72...,6000.0,12000.0,9000.0
1,THATZ INTERNATIONAL PTE LTD,0.0,Information Technology,Roles & Responsibilities Perform and manage BDA software setup and configuration for the new Dat...,Big Data Administrator,Full Time,Singapore,https://www.mycareersfuture.sg/job/big-data-administrator-thatz-international-9564001377ca7f4bf1...,179803,"$4,500 $5,700",Big Data,Executive,THATZ INTERNATIONAL PTE LTD,https://www.mycareersfuture.sg/job/big-data-administrator-thatz-international-9564001377ca7f4bf1...,4500.0,5700.0,5100.0
3,ICON CONSULTING-GROUP PTE. LTD.,0.0,Information Technology,Roles & ResponsibilitiesSolution Architect Description A Solution Architect is expected to d...,Analytics Architect,Permanent,Singapore,https://www.mycareersfuture.sg/job/analytics-architect-icon-consulting-group-440ccbcb6320596315d...,534045,"$15,000 $20,000",Big Data,Professional,ICON CONSULTING-GROUP PTE. LTD.,https://www.mycareersfuture.sg/job/analytics-architect-icon-consulting-group-440ccbcb6320596315d...,15000.0,20000.0,17500.0
5,NTT DATA SINGAPORE PTE. LTD.,0.0,Information Technology,Roles & ResponsibilitiesProvided technical consultation to VGC to build their Big Data Lake and ...,Technical Solutions Architecture Manager,"Permanent, Contract",Singapore,https://www.mycareersfuture.sg/job/technical-solutions-architecture-manager-ntt-data-singapore-8...,89315,"$9,000 $12,000",Big Data,Middle Management,NTT DATA SINGAPORE PTE. LTD.,https://www.mycareersfuture.sg/job/technical-solutions-architecture-manager-ntt-data-singapore-8...,9000.0,12000.0,10500.0
6,SMARTSOFT PTE. LTD.,0.0,Information Technology,Roles & Responsibilities Responsibilities include understanding ETL & Data Engineering requirem...,Senior ETL and DATA Engineer,Full Time,Singapore,https://www.mycareersfuture.sg/job/senior-etl-data-engineer-smartsoft-4b1aeea089cbaa6726eb2cb5dc...,79903,"$6,000 $11,000",Big Data,Senior Executive,SMARTSOFT PTE. LTD.,https://www.mycareersfuture.sg/job/senior-etl-data-engineer-smartsoft-4b1aeea089cbaa6726eb2cb5dc...,6000.0,11000.0,8500.0


In [17]:
df.avgsalary.isnull().sum()

49

In [18]:
df.dtypes

advertiser       object
elapsed         float64
industry         object
jd               object
jobtitle         object
jobtype          object
location         object
ojoblink         object
postcode         object
salary           object
searchstring     object
seniority        object
source           object
url              object
minsalary       float64
maxsalary       float64
avgsalary       float64
dtype: object

In [19]:
# attempt to pull salaries out of the JD text when the salary field is empty (not run)

# for idx, jd in df.jd.items():
#     if '$' in jd and df.salary[idx] is None:
#         dollar = [pos for pos, char in enumerate(jd) if char == '$']
#         #print('**', dollar)
#         for d in dollar:
#             # ('month' or 'year') evaluates to 'month', so test each word separately
#             if 'month' in jd[d:d+30] or 'year' in jd[d:d+30]:
#                 df.loc[idx, 'salary'] = jd[d:d+30]
#
#         print(df.salary[idx], '**')
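
The idea in the commented-out cell above (find a `$` and look for "month" or "year" within the next 30 characters) can be sketched with a single regular expression. This is only an illustration: the pattern, the `find_salary_mentions` helper and the sample text are all hypothetical, and real job descriptions will need more careful handling.

```python
import re

# A dollar amount followed, within 30 characters, by "month" or "year".
SALARY_HINT = re.compile(
    r"\$\s?[\d,]+(?:\.\d+)?[^$]{0,30}?\b(?:month|year)", re.IGNORECASE
)


def find_salary_mentions(jd):
    """Return every snippet of jd that looks like a salary mention."""
    return [m.group(0) for m in SALARY_HINT.finditer(jd)]


# hypothetical job-description text, for illustration only
text = "We offer $5,000 per month plus bonus. The team manages revenue of $2 million."
print(find_salary_mentions(text))
```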


In [20]:
#let's investigate null advertisers
df[df.advertiser.isnull()].source.unique()

array([], dtype=object)

In [21]:
df.elapsed.unique()

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       14., 15., 16., 17., 13., 18., 21., 22., 23., 25., 26., 27., 29.,
       30., 19., 28., 20., 24.])

In [22]:
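# look for scraping artefacts where stray text landed in the 'elapsed' column;
# nothing is displayed because the lookups sit inside the try block
try:
    df[df.elapsed=='save job'].url[:5]
    df[df.elapsed=='ISAC Inc']  # was `dfm`, which is not defined in this notebook
except Exception:
    pass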
    

In [23]:
df.jobtype.value_counts()

Full Time                          373
Permanent                          214
Permanent, Full Time               212
Contract                            93
Contract, Full Time                 79
Permanent, Contract, Full Time      14
Permanent, Contract                 11
Internship                           6
Temporary, Contract                  4
Contract, Full Time, Internship      3
Temporary, Contract, Full Time       3
Temporary                            3
Contract, Internship                 1
Full Time, Internship                1
Part Time, Contract                  1
Part Time, Full Time                 1
Name: jobtype, dtype: int64

In [24]:
# collapse combined job types into a single label. order matters: anything full-time
# or permanent counts as Permanent, then Contract, then Internship
df.jobtype = np.where(df.jobtype.str.contains('Full Time'), 'Permanent', df.jobtype)
df.jobtype = np.where(df.jobtype.str.contains('Permanent'), 'Permanent', df.jobtype)
df.jobtype = np.where(df.jobtype.str.contains('Contract'), 'Contract', df.jobtype)
df.jobtype = np.where(df.jobtype.str.contains('Internship'), 'Internship', df.jobtype)

In [25]:
df.jobtype.unique()

array(['Permanent', 'Contract', 'Internship', 'Temporary'], dtype=object)
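
The four `np.where` passes above apply a fixed precedence to each job type string. The same rule, written as a plain function for a single value (`normalise_jobtype` is a hypothetical name used only for this sketch):

```python
def normalise_jobtype(raw):
    """Collapse a comma-separated job type string into a single label.

    Precedence: Full Time or Permanent first, then Contract, then
    Internship; anything else is returned unchanged.
    """
    if "Full Time" in raw or "Permanent" in raw:
        return "Permanent"
    if "Contract" in raw:
        return "Contract"
    if "Internship" in raw:
        return "Internship"
    return raw


for raw in ["Contract, Full Time", "Temporary, Contract", "Contract, Internship", "Temporary"]:
    print(raw, "->", normalise_jobtype(raw))
```

Note that a posting tagged "Contract, Full Time" ends up as Permanent under this rule, which may or may not be the intended reading.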

In [26]:
# find dups but keep those found using alternate searchstring
df[df.duplicated(subset=['source', 'jobtitle', 'searchstring'], keep=False)]


Unnamed: 0,advertiser,elapsed,industry,jd,jobtitle,jobtype,location,ojoblink,postcode,salary,searchstring,seniority,source,url,minsalary,maxsalary,avgsalary
30,PALO IT SINGAPORE PTE. LTD.,3.0,Information Technology,Roles & ResponsibilitiesPalo IT is an innovation & agile development company. From Design Resear...,Senior Application Designer - Lead / Senior UX Designer,Permanent,Singapore,https://www.mycareersfuture.sg/job/senior-application-designer-lead-senior-ux-designer-palo-sing...,49406,"$7,000 $14,000",Big Data,"Manager, Professional",PALO IT SINGAPORE PTE. LTD.,https://www.mycareersfuture.sg/job/senior-application-designer-lead-senior-ux-designer-palo-sing...,7000.0,14000.0,10500.0
31,ADDSTONES SAS,4.0,Consulting,"Roles & ResponsibilitiesGFI is an international IT services company, currently employing about 1...",Big Data Security Engineer,Permanent,Singapore,https://www.mycareersfuture.sg/job/big-data-security-engineer-addstones-sas-540c77673bfefe3bafdb...,68913,"$5,000 $10,000",Big Data,Manager,ADDSTONES SAS,https://www.mycareersfuture.sg/job/big-data-security-engineer-addstones-sas-540c77673bfefe3bafdb...,5000.0,10000.0,7500.0
35,ORACLE CAPAC SERVICES UNLIMITED COMPANY (SINGAPORE BRANCH),4.0,Information Technology,"Roles & ResponsibilitiesWe are looking for talented, passionate and self-motivated individuals w...",Solution Specialist,Permanent,Singapore,https://www.mycareersfuture.sg/job/solution-specialist-oracle-capac-services-unlimited-company-2...,138522,"$7,500 $15,000",Big Data,Professional,ORACLE CAPAC SERVICES UNLIMITED COMPANY (SINGAPORE BRANCH),https://www.mycareersfuture.sg/job/solution-specialist-oracle-capac-services-unlimited-company-2...,7500.0,15000.0,11250.0
36,ORACLE CAPAC SERVICES UNLIMITED COMPANY (SINGAPORE BRANCH),4.0,Information Technology,"Roles & ResponsibilitiesWe are looking for talented, passionate and self-motivated individuals w...",Solution Specialist,Permanent,Singapore,https://www.mycareersfuture.sg/job/solution-specialist-oracle-capac-services-unlimited-company-9...,138522,"$5,000 $10,000",Big Data,Professional,ORACLE CAPAC SERVICES UNLIMITED COMPANY (SINGAPORE BRANCH),https://www.mycareersfuture.sg/job/solution-specialist-oracle-capac-services-unlimited-company-9...,5000.0,10000.0,7500.0
37,ORACLE CAPAC SERVICES UNLIMITED COMPANY (SINGAPORE BRANCH),4.0,Information Technology,"Roles & ResponsibilitiesWe are looking for talented, passionate and self-motivated individuals w...",Solution Specialist,Permanent,Singapore,https://www.mycareersfuture.sg/job/solution-specialist-oracle-capac-services-unlimited-company-9...,138522,"$7,500 $15,000",Big Data,Professional,ORACLE CAPAC SERVICES UNLIMITED COMPANY (SINGAPORE BRANCH),https://www.mycareersfuture.sg/job/solution-specialist-oracle-capac-services-unlimited-company-9...,7500.0,15000.0,11250.0
38,ORACLE CAPAC SERVICES UNLIMITED COMPANY (SINGAPORE BRANCH),4.0,Information Technology,"Roles & ResponsibilitiesWe are looking for talented, passionate and self-motivated individuals w...",Solution Specialist,Permanent,Singapore,https://www.mycareersfuture.sg/job/solution-specialist-oracle-capac-services-unlimited-company-e...,138522,"$5,000 $10,000",Big Data,Professional,ORACLE CAPAC SERVICES UNLIMITED COMPANY (SINGAPORE BRANCH),https://www.mycareersfuture.sg/job/solution-specialist-oracle-capac-services-unlimited-company-e...,5000.0,10000.0,7500.0
57,ADDSTONES SAS,8.0,"Consulting , Banking and Finance, Information Technology",Roles & ResponsibilitiesGFI Group is an international business and technology solutions provider...,Big Data Security Engineer,Permanent,Singapore,https://www.mycareersfuture.sg/job/big-data-security-engineer-addstones-sas-47d32e8fabf96e2a4f09...,,"$7,000 $12,000",Big Data,Manager,ADDSTONES SAS,https://www.mycareersfuture.sg/job/big-data-security-engineer-addstones-sas-47d32e8fabf96e2a4f09...,7000.0,12000.0,9500.0
61,PALO IT SINGAPORE PTE. LTD.,8.0,Information Technology,Roles & ResponsibilitiesYour profile & role on the project YOU: Thrive on challenge. When was t...,Senior Database Consultant - Big Data Engineer,Permanent,Singapore,https://www.mycareersfuture.sg/job/senior-database-consultant-big-data-engineer-palo-singapore-1...,49406,"$6,000 $12,000",Big Data,Professional,PALO IT SINGAPORE PTE. LTD.,https://www.mycareersfuture.sg/job/senior-database-consultant-big-data-engineer-palo-singapore-1...,6000.0,12000.0,9000.0
62,PALO IT SINGAPORE PTE. LTD.,8.0,Information Technology,Roles & ResponsibilitiesYour profile & role on the project YOU: Thrive on challenge. When was t...,Senior Database Consultant - Big Data Engineer,Permanent,Singapore,https://www.mycareersfuture.sg/job/senior-database-consultant-big-data-engineer-palo-singapore-2...,49406,"$6,000 $12,000",Big Data,Professional,PALO IT SINGAPORE PTE. LTD.,https://www.mycareersfuture.sg/job/senior-database-consultant-big-data-engineer-palo-singapore-2...,6000.0,12000.0,9000.0
64,ERNST & YOUNG ADVISORY PTE. LTD.,8.0,Consulting,"Roles & ResponsibilitiesPowered by big data and advanced technologies, insights from analytics a...","Associate, Advisory Data and Analytics",Permanent,Singapore,https://www.mycareersfuture.sg/job/associate-advisory-data-analytics-ernst-young-advisory-206ebc...,48583,"$3,000 $6,000",Big Data,Executive,ERNST & YOUNG ADVISORY PTE. LTD.,https://www.mycareersfuture.sg/job/associate-advisory-data-analytics-ernst-young-advisory-206ebc...,3000.0,6000.0,4500.0


In [27]:
df.drop_duplicates(subset=['source', 'jobtitle', 'searchstring'], inplace=True)
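
To see how the two calls above differ, here is a small sketch on made-up postings (the `toy` frame is hypothetical). `duplicated(..., keep=False)` flags every member of a duplicate group for inspection, while `drop_duplicates` keeps the first row of each group by default.

```python
import pandas as pd

# hypothetical postings, for illustration only
toy = pd.DataFrame({
    "source": ["ACME", "ACME", "ACME", "BETA"],
    "jobtitle": ["Data Analyst", "Data Analyst", "Data Analyst", "Data Engineer"],
    "searchstring": ["Big Data", "Big Data", "Machine Learning", "Big Data"],
})
keys = ["source", "jobtitle", "searchstring"]

# every row that shares its key with another row is flagged
flagged = toy.duplicated(subset=keys, keep=False)

# the first row of each duplicate group survives
deduped = toy.drop_duplicates(subset=keys)

print(flagged.tolist())
print(len(deduped))
```

Because `searchstring` is part of the key, the same posting found under a different search term is kept.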

In [28]:
# multiple industry overlap...
df.industry.value_counts()[:5]

Information Technology         464
Banking and Finance             68
Engineering                     54
Sciences / Laboratory / R&D     40
Others                          40
Name: industry, dtype: int64

In [29]:
df.shape

(963, 17)

In [30]:
# some postcodes were read as floats (e.g. "48583.0") and some as strings;
# cast everything to str and strip the trailing ".0"
df.postcode = df.postcode.astype(str)
df.postcode = df.postcode.apply(lambda x: str(x)[:-2] if x.find('.')>0 else x)


In [31]:
# restore the leading zero that was dropped from 5-digit postcodes
df.postcode = df.postcode.apply(lambda x: '0'+str(x) if len(x) ==5 else x)
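
The two postcode steps can be combined into one function. This is a sketch rather than a drop-in replacement: unlike the cells above, the hypothetical `normalise_postcode` helper returns `None` for non-numeric values instead of keeping the string as it is.

```python
def normalise_postcode(value):
    """Return a six-digit postcode string, or None if value is not numeric."""
    text = str(value)
    if text.endswith(".0"):
        text = text[:-2]
    if not text.isdigit():
        return None
    # zfill pads with leading zeros up to six characters
    return text.zfill(6)


print(normalise_postcode(48583.0))
print(normalise_postcode("179803"))
print(normalise_postcode(float("nan")))
```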


In [32]:
# https://en.wikipedia.org/wiki/Postal_codes_in_Singapore
# according to the article above, the first two digits are the postal sector,
# so let's group postcodes by their first two digits
df['district'] = df.postcode.str[:2]
plist = df.district.unique().tolist()


In [33]:
print(plist)

['01', '15', '80', '24', '13', '50', '61', '17', '08', '34', '46', '52', '41', '55', '62', '20', '25', '53', '81', '73', '63', '12', 'No', '38', '11', '65', '49', '03', '33', '07', '31', '10', '26', '19', '75', 'na', '14', '09', '18', '48', '40', '57', '04', '22', '06', '16', '23', '05', '36', '56', '60']


In [34]:
# let's further group the sectors into districts per table in wiki
postaldict = {np.nan: np.nan, 'No': np.nan,
              '01': '01', '03': '01', '04': '01', '05': '01', '06': '01',
              '07': '02', '08': '02', '14': '03', '15': '03', '16': '03',
              '09': '04', '10': '04', '11': '05', '12': '05', '13': '05',
              '17': '06', '18': '07', '19': '07', '20': '08', '22': '09', '23': '09',
              '24': '10', '25': '10', '26': '10', '31': '12', '33': '12',
              '34': '13', '36': '13', '38': '14', '40': '14', '41': '14',
              '46': '16', '48': '16', '49': '17', '50': '17', '52': '18',
              '53': '19', '55': '19', '56': '20', '57': '20',
              '60': '22', '61': '22', '62': '22', '63': '22',
              '65': '23', '73': '25', '75': '27', '80': '28', '81': '28'}

In [35]:
df.district = df.district.map(postaldict)
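
A sector-to-district lookup like `postaldict` can also be written the other way round, listing the sectors under each district and inverting the table, so that each sector appears exactly once. The sketch below copies only a few entries from `postaldict`, so it is a partial table for illustration; `district_for` is a hypothetical helper.

```python
# District -> postal sectors, copied from a few entries of postaldict
# (a partial table, for illustration only).
district_sectors = {
    "01": ["01", "03", "04", "05", "06"],
    "02": ["07", "08"],
    "04": ["09", "10"],
}

# Invert to sector -> district.
sector_to_district = {
    sector: district
    for district, sectors in district_sectors.items()
    for sector in sectors
}


def district_for(postcode):
    """Look up the district for a six-digit postcode string; None if unknown."""
    return sector_to_district.get(postcode[:2])


print(district_for("048583"))
print(district_for("999999"))
```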

In [36]:
df.district.value_counts()

01    304
02     83
05     68
14     41
22     38
06     30
16     27
09     26
07     22
03     21
12     15
13     15
20     13
19     12
18      9
04      8
17      8
28      7
10      7
27      7
23      5
08      2
25      1
Name: district, dtype: int64

In [37]:
df.head()

Unnamed: 0,advertiser,elapsed,industry,jd,jobtitle,jobtype,location,ojoblink,postcode,salary,searchstring,seniority,source,url,minsalary,maxsalary,avgsalary,district
0,ERNST & YOUNG ADVISORY PTE. LTD.,0.0,"Consulting , Banking and Finance, Information Technology",Roles & ResponsibilitiesWe are the only professional services organisation who has a separate bu...,Big Data Engineer (Financial Services),Permanent,Singapore,https://www.mycareersfuture.sg/job/big-data-engineer-ernst-young-advisory-5c6015c43915d53b5bec72...,48583,"$6,000 $12,000",Big Data,Manager,ERNST & YOUNG ADVISORY PTE. LTD.,https://www.mycareersfuture.sg/job/big-data-engineer-ernst-young-advisory-5c6015c43915d53b5bec72...,6000.0,12000.0,9000.0,1
1,THATZ INTERNATIONAL PTE LTD,0.0,Information Technology,Roles & Responsibilities Perform and manage BDA software setup and configuration for the new Dat...,Big Data Administrator,Permanent,Singapore,https://www.mycareersfuture.sg/job/big-data-administrator-thatz-international-9564001377ca7f4bf1...,179803,"$4,500 $5,700",Big Data,Executive,THATZ INTERNATIONAL PTE LTD,https://www.mycareersfuture.sg/job/big-data-administrator-thatz-international-9564001377ca7f4bf1...,4500.0,5700.0,5100.0,6
3,ICON CONSULTING-GROUP PTE. LTD.,0.0,Information Technology,Roles & ResponsibilitiesSolution Architect Description A Solution Architect is expected to d...,Analytics Architect,Permanent,Singapore,https://www.mycareersfuture.sg/job/analytics-architect-icon-consulting-group-440ccbcb6320596315d...,534045,"$15,000 $20,000",Big Data,Professional,ICON CONSULTING-GROUP PTE. LTD.,https://www.mycareersfuture.sg/job/analytics-architect-icon-consulting-group-440ccbcb6320596315d...,15000.0,20000.0,17500.0,19
5,NTT DATA SINGAPORE PTE. LTD.,0.0,Information Technology,Roles & ResponsibilitiesProvided technical consultation to VGC to build their Big Data Lake and ...,Technical Solutions Architecture Manager,Permanent,Singapore,https://www.mycareersfuture.sg/job/technical-solutions-architecture-manager-ntt-data-singapore-8...,89315,"$9,000 $12,000",Big Data,Middle Management,NTT DATA SINGAPORE PTE. LTD.,https://www.mycareersfuture.sg/job/technical-solutions-architecture-manager-ntt-data-singapore-8...,9000.0,12000.0,10500.0,2
6,SMARTSOFT PTE. LTD.,0.0,Information Technology,Roles & Responsibilities Responsibilities include understanding ETL & Data Engineering requirem...,Senior ETL and DATA Engineer,Permanent,Singapore,https://www.mycareersfuture.sg/job/senior-etl-data-engineer-smartsoft-4b1aeea089cbaa6726eb2cb5dc...,79903,"$6,000 $11,000",Big Data,Senior Executive,SMARTSOFT PTE. LTD.,https://www.mycareersfuture.sg/job/senior-etl-data-engineer-smartsoft-4b1aeea089cbaa6726eb2cb5dc...,6000.0,11000.0,8500.0,2


In [38]:
df['portal'] = 'CareersFuture'

In [39]:
df.seniority.unique()

array(['Manager', 'Executive', 'Professional', 'Middle Management',
       'Senior Executive', 'Non-executive',
       'Professional, Senior Executive', 'Middle Management, Manager',
       'Fresh/entry level', 'Professional, Executive',
       'Manager, Professional', 'Junior Executive',
       'Fresh/entry level, Non-executive', 'Executive, Senior Executive',
       'Senior Management, Manager, Professional',
       'Manager, Professional, Senior Executive',
       'Fresh/entry level, Executive, Junior Executive',
       'Professional, Executive, Senior Executive',
       'Executive, Junior Executive', 'Senior Management',
       'Manager, Senior Executive', 'Professional, Non-executive',
       'Fresh/entry level, Executive',
       'Fresh/entry level, Junior Executive',
       'Senior Management, Manager',
       'Senior Management, Middle Management',
       'Senior Management, Middle Management, Manager',
       'Middle Management, Manager, Senior Executive',
       'Executive, J

In [40]:
# split words in 'seniority' and perform feature engineering if necessary
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(lowercase=False)

In [41]:
tokens = cvec.fit_transform(df.seniority)

In [42]:
cvec.vocabulary_

{'Manager': 4,
 'Executive': 0,
 'Professional': 7,
 'Middle': 5,
 'Management': 3,
 'Senior': 8,
 'Non': 6,
 'executive': 10,
 'Fresh': 1,
 'entry': 9,
 'level': 11,
 'Junior': 2}
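
The vocabulary above splits labels such as 'Non-executive' into separate tokens because `CountVectorizer` tokenizes with its default `token_pattern`, which keeps runs of two or more word characters. A quick sketch applying that documented default pattern directly with `re` (the sample labels come from the `seniority` column):

```python
import re

# scikit-learn's documented default token_pattern for CountVectorizer
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

labels = ["Non-executive", "Fresh/entry level", "Senior Management, Manager"]
tokenised = {label: re.findall(TOKEN_PATTERN, label) for label in labels}

for label, tokens in tokenised.items():
    print(label, "->", tokens)
```

With `lowercase=False`, 'Executive' and 'executive' stay distinct tokens, which is why both appear in the vocabulary.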

In [43]:
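# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
# toarray() returns a plain ndarray rather than a np.matrix
Xt = pd.DataFrame(tokens.toarray(),
                    columns=cvec.get_feature_names_out(), index=df.index)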

In [44]:
Xt.shape

(963, 12)

In [45]:
Xt.head()

Unnamed: 0,Executive,Fresh,Junior,Management,Manager,Middle,Non,Professional,Senior,entry,executive,level
0,0,0,0,0,1,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0
5,0,0,0,1,0,1,0,0,0,0,0,0
6,1,0,0,0,0,0,0,0,1,0,0,0


In [46]:
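# 'Non-executive' and 'Fresh/entry level' were each split into several tokens;
# keep one token per label ('Non', 'entry') and drop the redundant ones
Xt.drop(['executive', 'level', 'Fresh'], axis=1, inplace=True)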

In [47]:
Xt.head()

Unnamed: 0,Executive,Junior,Management,Manager,Middle,Non,Professional,Senior,entry
0,0,0,0,1,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0
5,0,0,1,0,1,0,0,0,0
6,1,0,0,0,0,0,0,1,0


In [48]:
# split seniority into two parts - function (executive/professional) and seniority level.
# function: 1 = non-executive, 2 = executive, 3 = professional, 4 = management
# level: 1 = entry level, 2 = junior, 3 = manager, 4 = middle, 5 = senior
# if more than one function or level is specified, take the average;
# rows with no matching word divide 0 by 0 and come out as NaN
df['function'] = (Xt.Non*1 + Xt.Executive*2 + Xt.Professional*3 + Xt.Management*4) / (
    Xt.Non + Xt.Executive + Xt.Professional + Xt.Management)
df['level'] = (Xt.entry*1 + Xt.Junior*2 + Xt.Manager*3 + Xt.Middle*4 + Xt.Senior*5) / (
    Xt.entry + Xt.Junior + Xt.Manager + Xt.Middle + Xt.Senior)
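
The averaging rule can be checked on a single label with plain Python. This sketch uses the same weights as the cell above; the `average_score` helper is hypothetical, and it returns 0.0 when no word matches, which is an assumption for the sketch (in the notebook those rows are NaN until they are filled).

```python
import re

FUNCTION_SCORES = {"Non": 1, "Executive": 2, "Professional": 3, "Management": 4}
LEVEL_SCORES = {"entry": 1, "Junior": 2, "Manager": 3, "Middle": 4, "Senior": 5}


def average_score(seniority, scores):
    """Average the scores of all scored words found in a seniority label."""
    tokens = re.findall(r"\w\w+", seniority)
    hits = [scores[t] for t in tokens if t in scores]
    return sum(hits) / len(hits) if hits else 0.0


label = "Senior Management, Manager"
print(average_score(label, FUNCTION_SCORES))
print(average_score(label, LEVEL_SCORES))
```

For this label the function score is 4.0 (Management only) and the level score is 4.0 (the mean of Senior = 5 and Manager = 3).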

In [53]:
df[['seniority', 'function', 'level']].sample(10)

Unnamed: 0,seniority,function,level
810,Manager,0.0,3.0
5,Middle Management,4.0,4.0
363,Executive,2.0,0.0
344,Professional,3.0,0.0
809,Fresh/entry level,0.0,1.0
607,Executive,2.0,0.0
225,"Senior Management, Manager",4.0,4.0
295,Junior Executive,2.0,2.0
1082,Professional,3.0,0.0
1188,Fresh/entry level,0.0,1.0


In [50]:
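# postings with no function or level word get 0; assign the result back
# instead of filling in place through attribute access
df['function'] = df['function'].fillna(0)
df['level'] = df['level'].fillna(0)
df['rank'] = df['function']+df['level']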

In [51]:
df.head()

Unnamed: 0,advertiser,elapsed,industry,jd,jobtitle,jobtype,location,ojoblink,postcode,salary,searchstring,seniority,source,url,minsalary,maxsalary,avgsalary,district,portal,function,level,rank
0,ERNST & YOUNG ADVISORY PTE. LTD.,0.0,"Consulting , Banking and Finance, Information Technology",Roles & ResponsibilitiesWe are the only professional services organisation who has a separate bu...,Big Data Engineer (Financial Services),Permanent,Singapore,https://www.mycareersfuture.sg/job/big-data-engineer-ernst-young-advisory-5c6015c43915d53b5bec72...,48583,"$6,000 $12,000",Big Data,Manager,ERNST & YOUNG ADVISORY PTE. LTD.,https://www.mycareersfuture.sg/job/big-data-engineer-ernst-young-advisory-5c6015c43915d53b5bec72...,6000.0,12000.0,9000.0,1,CareersFuture,0.0,3.0,3.0
1,THATZ INTERNATIONAL PTE LTD,0.0,Information Technology,Roles & Responsibilities Perform and manage BDA software setup and configuration for the new Dat...,Big Data Administrator,Permanent,Singapore,https://www.mycareersfuture.sg/job/big-data-administrator-thatz-international-9564001377ca7f4bf1...,179803,"$4,500 $5,700",Big Data,Executive,THATZ INTERNATIONAL PTE LTD,https://www.mycareersfuture.sg/job/big-data-administrator-thatz-international-9564001377ca7f4bf1...,4500.0,5700.0,5100.0,6,CareersFuture,2.0,0.0,2.0
3,ICON CONSULTING-GROUP PTE. LTD.,0.0,Information Technology,Roles & ResponsibilitiesSolution Architect Description A Solution Architect is expected to d...,Analytics Architect,Permanent,Singapore,https://www.mycareersfuture.sg/job/analytics-architect-icon-consulting-group-440ccbcb6320596315d...,534045,"$15,000 $20,000",Big Data,Professional,ICON CONSULTING-GROUP PTE. LTD.,https://www.mycareersfuture.sg/job/analytics-architect-icon-consulting-group-440ccbcb6320596315d...,15000.0,20000.0,17500.0,19,CareersFuture,3.0,0.0,3.0
5,NTT DATA SINGAPORE PTE. LTD.,0.0,Information Technology,Roles & ResponsibilitiesProvided technical consultation to VGC to build their Big Data Lake and ...,Technical Solutions Architecture Manager,Permanent,Singapore,https://www.mycareersfuture.sg/job/technical-solutions-architecture-manager-ntt-data-singapore-8...,89315,"$9,000 $12,000",Big Data,Middle Management,NTT DATA SINGAPORE PTE. LTD.,https://www.mycareersfuture.sg/job/technical-solutions-architecture-manager-ntt-data-singapore-8...,9000.0,12000.0,10500.0,2,CareersFuture,4.0,4.0,8.0
6,SMARTSOFT PTE. LTD.,0.0,Information Technology,Roles & Responsibilities Responsibilities include understanding ETL & Data Engineering requirem...,Senior ETL and DATA Engineer,Permanent,Singapore,https://www.mycareersfuture.sg/job/senior-etl-data-engineer-smartsoft-4b1aeea089cbaa6726eb2cb5dc...,79903,"$6,000 $11,000",Big Data,Senior Executive,SMARTSOFT PTE. LTD.,https://www.mycareersfuture.sg/job/senior-etl-data-engineer-smartsoft-4b1aeea089cbaa6726eb2cb5dc...,6000.0,11000.0,8500.0,2,CareersFuture,2.0,5.0,7.0


In [54]:
df.to_pickle("./master1.pkl")