## Requirements

1. Scrape and prepare your own data.

2. **Create and compare at least two models for each section**. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression of your choosing (e.g. Ridge, logistic regression, KNN, SVM).
   - Section 1: Job Salary Trends
   - Section 2: Job Category Factors

3. Prepare a polished Jupyter Notebook with your analysis for a peer audience of data scientists. 
   - Make sure to clearly describe and label each section.
   - Comment on your code so that others could, in theory, replicate your work.

4. Write a brief executive summary for a non-technical audience.
   - Writeups should be 500 to 1,000 words and should define any technical terms, explain your approach, and cover any risks and limitations.


## Suggestions for Getting Started

1. Collect data from [Indeed.com](https://www.indeed.com) (or another aggregator) on data-related jobs to use in predicting salary trends for your analysis.
   - Select and parse data from *at least 1000 postings* for jobs, potentially from multiple location searches.
2. Find out what factors most directly impact salaries (e.g. title, location, department).
   - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
3. Discover which features have the greatest importance when determining a low vs. high paying job.
   - Your boss is interested in which features overall hold the greatest significance.
   - HR is interested in which SKILLS and KEY WORDS hold the greatest significance.
4. Author an executive summary that details the highlights of your analysis for a non-technical audience.
5. If tackling the bonus question, try framing the salary problem as a classification problem detecting low vs. high salary positions.

In [351]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import os 
import re

from bs4 import BeautifulSoup
import urllib
import urllib.parse
from time import sleep

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [352]:
from sklearn.model_selection import train_test_split,KFold, cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.linear_model import Lasso, ElasticNet, Ridge, LassoCV, ElasticNetCV, \
RidgeCV, LinearRegression
from sklearn.metrics import mean_squared_error

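# NOTE: these aliases are the reverse of the usual convention
# (statsmodels.api as sm, statsmodels.formula.api as smf)
import statsmodels.formula.api as sm
import statsmodels.api as smf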

In [353]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth', 100)


In [354]:
import warnings
warnings.filterwarnings('ignore')

In [355]:
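# load every pickle whose filename contains the portal tag and concatenate them
def concat(portal):
    frames = []
    for fname in os.listdir('.'):
        # escape the tag so it is matched literally rather than as a regex
        if re.search(re.escape(portal), fname) is not None:
            x = pd.read_pickle("./{}".format(fname))
            print(x.shape)
            frames.append(x)
    # build the frame once with pd.concat (DataFrame.append no longer exists in pandas 2.0+)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()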

In [356]:
df = concat('IDD')

(4986, 11)
(9977, 11)
(9987, 11)
(9290, 11)


In [357]:
df.shape

(34240, 11)

In [358]:
df.head()

Unnamed: 0,advertiser,elapsed,jd,jobtitle,jobtype,location,ojoblink,salary,searchstring,source,url
0,Accenture,16 days ago,Join Accenture and help transform leading organizations and communities around the world. The sh...,Business Analyst (Health and Public Service),,Singapore,https://www.indeed.com.sg/rc/clk?jk=ec5934f649cbf86d&from=vj&pos=top,,Business Analyst,Accenture,https://www.indeed.com.sg/rc/clk?jk=ec5934f649cbf86d&fccid=a4e4e2eaf26690c9&vjs=3
1,Hewlett Packard Enterprise,20 days ago,"At HPE, we bring together the brightest minds to create breakthrough technology solutions and ad...",Business Analyst,,Singapore,https://www.indeed.com.sg/rc/clk?jk=847dc76922d52115&from=vj&pos=top,,Business Analyst,Hewlett Packard Enterprise,https://www.indeed.com.sg/rc/clk?jk=847dc76922d52115&fccid=216eb700022de6f6&vjs=3
2,Carecone Technologies,save job,PermanentI hope you are doing great.Please go through the below job profile and please response ...,Business Analyst,Permanent,Singapore,,,Business Analyst,,https://www.indeed.com.sg/company/Carecone-Technologies/jobs/Business-Analyst-9d4ecfc9e405881d?f...
3,MasterCard,19 days ago,It is essential to continue our growth momentum in the region to deliver on our key strategic pi...,Strategy and Operations Analyst,,Singapore,https://www.indeed.com.sg/rc/clk?jk=89d35853db917c9a&from=vj&pos=top,,Business Analyst,MasterCard,https://www.indeed.com.sg/rc/clk?jk=89d35853db917c9a&fccid=10b5c722d846df43&vjs=3
4,CCTS Global Pte Ltd,save job,"$3,500 - $5,000 a monthJob DescriptionThis role will partner with internal stakeholders across m...",Junior Business Analyst,,Singapore,,"$3,500 - $5,000 a month",Business Analyst,,https://www.indeed.com.sg/company/Consumer-Cloud-Technology-Services/jobs/Junior-Business-Analys...


In [359]:
# count dups
df.duplicated(subset=['source', 'jobtitle', 'searchstring']).sum()

31327

In [360]:
# find dups but keep those found using alternate searchstring

df[df.duplicated(subset=['source', 'jobtitle', 'searchstring'], keep=False)]


Unnamed: 0,advertiser,elapsed,jd,jobtitle,jobtype,location,ojoblink,salary,searchstring,source,url
2,Carecone Technologies,save job,PermanentI hope you are doing great.Please go through the below job profile and please response ...,Business Analyst,Permanent,Singapore,,,Business Analyst,,https://www.indeed.com.sg/company/Carecone-Technologies/jobs/Business-Analyst-9d4ecfc9e405881d?f...
9,Amazon Web Services Singapore,15 days ago,"Qualifications:\n\nAt least 10+ years' experience working in Sales operations, Business operatio...",ASEAN Business Operations Analyst,,Singapore,https://www.indeed.com.sg/rc/clk?jk=49b819b5514a39c5&from=vj&pos=top,,Business Analyst,Amazon.com,https://www.indeed.com.sg/rc/clk?jk=49b819b5514a39c5&fccid=fe2d21eef233e94a&vjs=3
10,Amazon Web Services Singapore,15 days ago,"Qualifications:\n\nAt least 10+ years' experience working in Sales operations, Business operatio...",ASEAN Business Operations Analyst,,Singapore,https://www.indeed.com.sg/rc/clk?jk=49b819b5514a39c5&from=vj&pos=top,,Business Analyst,Amazon.com,https://www.indeed.com.sg/rc/clk?jk=49b819b5514a39c5&fccid=fe2d21eef233e94a&vjs=3
11,Bambu,save job,We are looking for a Business Analyst who shares the same passion with us. You will be at forefr...,Business Analyst,,Singapore,,,Business Analyst,,https://www.indeed.com.sg/company/Bambu/jobs/Business-Analyst-193aef4f4b4eb1ea?fccid=04d5946612e...
18,FINSURGE,save job,ContractRoles & ResponsibilitiesFinSurge is undertaking a major multi-year Murex upgrade program...,Business Analyst,Contract,Singapore,,,Business Analyst,,https://www.indeed.com.sg/company/FINSURGE/jobs/Business-Analyst-b583256035c72a57?fccid=6728301d...
19,Hearti Lab Pte Ltd,save job,FinRisk builds and implements software to help banks and other financial institutions with their...,Business Analyst,,Singapore,,,Business Analyst,,https://www.indeed.com.sg/company/Hearti-Lab-Pte-Ltd/jobs/Business-Analyst-a25df78f29fa0810?fcci...
32,Vision Manpower,save job,"$3,000 - $4,000 a monthPermanentResponsibilitiesTo perform user requirement gathering and explo...",System Analyst,Permanent,Singapore,,"$3,000 - $4,000 a month",Business Analyst,,https://www.indeed.com.sg/company/Vision-Manpower-Pte-Ltd/jobs/System-Analyst-e6c60967e52d102f?f...
42,Percept Solutions,save job,"$5,000 - $7,000 a monthContract, PermanentMandatory Skills: JavaPreferred Skills: Google Apps En...",Business System Analyst,"Contract, Permanent",Singapore,,"$5,000 - $7,000 a month",Business Analyst,,https://www.indeed.com.sg/company/Percept-Solutions/jobs/Business-System-Analyst-dd845643800b9df...
48,Percept Solutions,save job,"$5,500 - $6,500 a monthContract, PermanentJob Description: We are seeking highly motivated candi...",Test Analyst,"Contract, Permanent",Singapore,,"$5,500 - $6,500 a month",Business Analyst,,https://www.indeed.com.sg/company/Percept-Solutions/jobs/Test-Analyst-3f75150eca3fa377?fccid=f62...
58,Expedia,30+ days ago,InternshipExpedia\n\nHomeAway Business Analyst (Emerging Markets) Intern - Singapore\n\nHomeAway...,Internship - Singapore - Business Analyst (Emerging Markets) Intern 2019,Internship,Singapore,https://www.indeed.com.sg/rc/clk?jk=88a3fc593aaa9e11&from=vj&pos=top,,Business Analyst,Expedia,https://www.indeed.com.sg/rc/clk?jk=88a3fc593aaa9e11&fccid=160efb82f2462f14&vjs=3


In [361]:
df.drop_duplicates(subset=['source', 'jobtitle', 'searchstring'], inplace=True)

In [362]:
df.shape

(2913, 11)
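
To make the difference between the duplicate checks above concrete, here is a minimal sketch on a made-up frame. The column names mirror the notebook; the rows are invented for illustration only.

```python
import pandas as pd

# Toy postings: rows 0 and 1 are identical on the key,
# row 2 is the same job found through a different search string
toy = pd.DataFrame({
    'source':       ['A', 'A', 'A', 'B'],
    'jobtitle':     ['Analyst', 'Analyst', 'Analyst', 'Engineer'],
    'searchstring': ['Business Analyst', 'Business Analyst',
                     'Data Scientist', 'Data Scientist'],
})
key = ['source', 'jobtitle', 'searchstring']

# Default keep='first': only the repeats after the first occurrence are flagged
n_repeats = int(toy.duplicated(subset=key).sum())

# keep=False: every member of a duplicated group is flagged, including the first
n_in_groups = int(toy.duplicated(subset=key, keep=False).sum())

# drop_duplicates keeps the first row of each group
deduped = toy.drop_duplicates(subset=key)
print(n_repeats, n_in_groups, deduped.shape)
```

Because `searchstring` is part of the key, the same posting found through two different searches is kept once per search.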

In [363]:
# check for nulls in general
df.isnull().sum()

advertiser        41
elapsed            3
jd                 3
jobtitle           3
jobtype         2199
location           0
ojoblink         359
salary          2792
searchstring       0
source           214
url                0
dtype: int64

***SALARY

In [364]:
# check nulls for salary
df[df.salary.isnull()].sample(5)

Unnamed: 0,advertiser,elapsed,jd,jobtitle,jobtype,location,ojoblink,salary,searchstring,source,url
5874,Argus Media,30+ days ago,The crude reporter will cover the crude oil market in the Asia Pacific region through price asse...,Crude Oil Reporter,,Singapore,https://www.indeed.com.sg/rc/clk?jk=c494ba6f817b69ba&from=vj&pos=top,,Business Intelligence,Argus Media,https://www.indeed.com.sg/rc/clk?jk=c494ba6f817b69ba&fccid=d1bdb1e0aad69657&vjs=3
5711,Intel,26 days ago,"Job Description\nManages programs of large (may be global) scope, impact and complexity through ...",Engineering Program Manager,,Singapore,https://www.indeed.com.sg/rc/clk?jk=84d4b6569ef9011b&from=vj&pos=top,,Business Intelligence,Intel,https://www.indeed.com.sg/rc/clk?jk=84d4b6569ef9011b&fccid=f1374be6a45f4b8a&vjs=3
87,Dion,30+ days ago,"Junior Business Analyst\nLocation:Dion Global Solutions, Singapore\nExperience: Fresher\nEmploym...",Junior Business Analyst,,Singapore,,,Business Analyst,Dion,https://www.indeed.com.sg/rc/clk?jk=1b63219e73f244e6&fccid=936ef7c2d3db3abc&vjs=3
25147,BIOFOURMIS SINGAPORE PTE. LTD.,16 days ago,PermanentRoles & Responsibilities\nThe candidate uses strong coding skills to help in the develo...,Data Scientist - Signal Processing,Permanent,Singapore,https://www.indeed.com.sg/rc/clk?jk=73e232e76dc3e61b&from=vj&pos=top,,Data Scientist,MyCareersFuture.SG,https://www.indeed.com.sg/rc/clk?jk=73e232e76dc3e61b&fccid=ed255c0f8f8768e9&vjs=3
907,ALACROFT GLOBAL PTE. LTD.,8 days ago,Roles & Responsibilities\nThe Corporate Finance Analyst analyses the business performance and af...,analyst market research,,Singapore,https://www.indeed.com.sg/rc/clk?jk=787adabb9bf41f29&from=vj&pos=top,,Business Analyst,MyCareersFuture.SG,https://www.indeed.com.sg/rc/clk?jk=787adabb9bf41f29&fccid=f464645afcc32bb2&vjs=3


***JD

In [365]:
# let's see if salary info can be extracted from the JD
df.jd.sample(10)

25374    Req. ID: 111302\nDescription\nIn this position, you are expected to deliver business value from ...
100      Works with external clients to resolve day-to-day issues and direct them to appropriate resource...
4995     Req. ID: 129563\nRecruiter: JAYA VELUPILLAI\nAs a Strategic Category Analyst at Micron Technolog...
22513    PermanentRoles & Responsibilities\nMSD is an innovative, global healthcare leader committed to i...
25472    Today, developing countries in Asia and the most remote regions across the world lack access to ...
177      Contract, PermanentThe Urban Redevelopment Authority is the national planning authority of Singa...
15976    Unilever is one of the world’s leading suppliers of Food, Home and Personal Care products with s...
810      MathWorks is establishing a new technical team dedicated to developing the market strategy and d...
15521    PermanentThis is an exciting time to join Dentsu Aegis Network (DAN) as we focus on becoming a 1...
25218    Nokia is a

In [366]:
#zooming in on sample
df.jd[14].split('\n')

['What impact will you make?At Deloitte, we offer a unique and exceptional career experience to inspire and empower talents like you to make an impact that matters for our clients, people and society. Whatever your aspirations, Deloitte offers you unrivalled opportunities to realise your full potential. We are always looking for people with the relentless energy to push themselves further, and to find new avenues and unique ways to reach our shared goals. So what are you waiting for?',
 '',
 'Join the winning team now. Responsibilities: Gather and analyze information, formulate and test hypotheses Involve in discussions and work closely with Project Manager in developing recommendations for presentation to client management Implement recommendations with project and client team members Provide support to clients to deliver organisational and change initiatives Support and/or facilitate client workshops Assisting in financial administration of engagements such as budgets, billing, and c

In [367]:
# drop rows with no JD; those postings have expired
df.dropna(axis=0, subset=['jd'], inplace=True)
df.isnull().sum()

advertiser        38
elapsed            0
jd                 0
jobtitle           0
jobtype         2196
location           0
ojoblink         356
salary          2789
searchstring       0
source           211
url                0
dtype: int64

In [368]:
# lots of missing salaries
df.salary.value_counts(dropna=False)[:5]

NaN                        2789
$5,000 - $6,000 a month       7
$3,000 - $5,000 a month       6
$3,000 - $4,000 a month       4
$5,000 - $7,000 a month       4
Name: salary, dtype: int64

In [369]:
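# pull salaries out of the JD text where the salary field is empty

for idx, jd in df.jd.items():
    # look rows up by label (idx): positions from enumerate() stop matching labels once rows are dropped
    if '$' in jd and pd.isnull(df.loc[idx, 'salary']):
        dollar = [pos for pos, char in enumerate(jd) if char == '$']
        for d in dollar:
            snippet = jd[d:d+30]
            # test each keyword separately: ('month' or 'year') evaluates to just 'month'
            if 'month' in snippet or 'year' in snippet:
                df.loc[idx, 'salary'] = snippet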
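
An alternative to slicing a fixed 30 characters after each dollar sign is to match the salary phrase itself with a regular expression. The sketch below is only one possible pattern, built from the formats visible in this data ("a month", "per month", "/month", "a year"); `SALARY_RE` and `find_salary` are hypothetical names, not part of the notebook.

```python
import re

# Hypothetical pattern: a dollar amount, an optional "- $amount" upper bound,
# then a period marker such as "a month", "per month", "/month" or "a year"
SALARY_RE = re.compile(
    r'\$[\d,.]+K?'                   # lower (or only) amount
    r'(?:\s*-\s*\$[\d,.]+K?)?'       # optional upper amount
    r'\s*(?:a|per|/)\s*(?:month|year)',
    flags=re.IGNORECASE,
)

def find_salary(jd):
    """Return the first salary-looking phrase in a job description, or None."""
    match = SALARY_RE.search(jd)
    return match.group(0) if match else None

print(find_salary('$3,500 - $5,000 a monthJob DescriptionThis role will partner'))
print(find_salary('$4,000.00 /monthExperience:Fin'))
```

Matching the phrase rather than a fixed window avoids trailing fragments such as `Experience:Fin` ending up in the salary column.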

In [370]:
string = '$8,000 - $9,000 a month'
string.split('-')[0].strip()
string.split('-')[1].strip()

'$9,000 a month'

In [371]:
df.salary.value_counts(dropna=False)

NaN                                 2742
$5,000 - $6,000 a month                7
$3,000 - $5,000 a month                6
$5,000 - $7,000 a month                4
$3,500 - $5,000 a month                4
$3,000 - $4,000 a month                4
$4,500 - $6,500 a month                3
$4,000 - $6,000 a month                3
$3,500 - $4,500 a month                3
$3,000 - $6,000 a month                3
$8,000 - $11,000 a month               3
$4,000 - $4,500 a month                3
$6,000 - $8,000 a month                3
$8,000 - $9,000 a month                3
$5,000 - $5,500 a month                3
$5,000 - $8,000 a month                3
$4,500 - $5,600 a month                3
$5,000 - $10,000 a month               3
$4,500 - $5,800 a month                3
$9,000 a month                         3
$5,000 - $9,000 a month                2
$4,000 - $8,000 a month                2
$11,000.00 /month                      2
$5,500.00 /month                       2
$2,400 - $3,000 

In [372]:
#let's split salaries into min and max 
df['minsalary'] = df[~df.salary.isnull()].salary.apply(lambda x: x.split('-')[0].strip() if '-' in x else 0)
df['maxsalary'] = df[~df.salary.isnull()].salary.apply(lambda x: x.split('-')[1].strip() if '-' in x else 0)


In [373]:
#let's see what's left
df[(df.minsalary==0)&~(df.salary==0)].salary

24          $4,000.00 /monthExperience:Fin
33                        $6,000.00 /month
35          $3,600.00 /monthExperience:CRM
38          $7,000.00 /monthExperience:SDL
44          $6,500.00 /monthExperience:tes
49          $5,000.00 /monthExperience:Fra
109         $2,000.00 /monthExperience:cal
121       $6100 per month\n\nResponsibilit
122                       $5,500.00 /month
128         $1,200.00 /monthExperience:Mic
143         $9,000.00 /monthExperience:con
163                         $7,000 a month
166                         $9,000 a month
177         $9,000.00 /monthExperience:Dat
187        $6000\nDuration: 6 months contr
203                         $9,000 a month
214         $6,000.00 /monthExperience:dev
224                       $6,500.00 /month
232                       $5,500.00 /month
235                      $11,000.00 /month
263                       $6,000.00 /month
265         $8,000.00 /monthLocation:Singa
269                       $5,000.00 /month
288        

In [374]:
# put what's left in avgsalary column
df['avgsalary'] = df[(df.minsalary==0)&~(df.salary==0)].salary.apply(lambda x: x.split(' ')[0].strip())

In [375]:
df.head()

Unnamed: 0,advertiser,elapsed,jd,jobtitle,jobtype,location,ojoblink,salary,searchstring,source,url,minsalary,maxsalary,avgsalary
0,Accenture,16 days ago,Join Accenture and help transform leading organizations and communities around the world. The sh...,Business Analyst (Health and Public Service),,Singapore,https://www.indeed.com.sg/rc/clk?jk=ec5934f649cbf86d&from=vj&pos=top,,Business Analyst,Accenture,https://www.indeed.com.sg/rc/clk?jk=ec5934f649cbf86d&fccid=a4e4e2eaf26690c9&vjs=3,,,
1,Hewlett Packard Enterprise,20 days ago,"At HPE, we bring together the brightest minds to create breakthrough technology solutions and ad...",Business Analyst,,Singapore,https://www.indeed.com.sg/rc/clk?jk=847dc76922d52115&from=vj&pos=top,,Business Analyst,Hewlett Packard Enterprise,https://www.indeed.com.sg/rc/clk?jk=847dc76922d52115&fccid=216eb700022de6f6&vjs=3,,,
2,Carecone Technologies,save job,PermanentI hope you are doing great.Please go through the below job profile and please response ...,Business Analyst,Permanent,Singapore,,,Business Analyst,,https://www.indeed.com.sg/company/Carecone-Technologies/jobs/Business-Analyst-9d4ecfc9e405881d?f...,,,
3,MasterCard,19 days ago,It is essential to continue our growth momentum in the region to deliver on our key strategic pi...,Strategy and Operations Analyst,,Singapore,https://www.indeed.com.sg/rc/clk?jk=89d35853db917c9a&from=vj&pos=top,,Business Analyst,MasterCard,https://www.indeed.com.sg/rc/clk?jk=89d35853db917c9a&fccid=10b5c722d846df43&vjs=3,,,
4,CCTS Global Pte Ltd,save job,"$3,500 - $5,000 a monthJob DescriptionThis role will partner with internal stakeholders across m...",Junior Business Analyst,,Singapore,,"$3,500 - $5,000 a month",Business Analyst,,https://www.indeed.com.sg/company/Consumer-Cloud-Technology-Services/jobs/Junior-Business-Analys...,"$3,500","$5,000 a month",


In [376]:
#inspect min and max salaries
df.minsalary = df.minsalary.astype(str)

In [377]:
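# regex=False matches a literal dollar sign; as a regex, '$' is the end-of-string anchor and matches every value
df['minsalary'] = np.where(df['minsalary'].str.contains('$', regex=False), df['minsalary'], '0')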

In [378]:
df['minsalary'] = df['minsalary'].apply(lambda x: x.replace('$','').replace(',', '').replace('nan','0') if x is not None else 0)

In [379]:
df.minsalary.unique()

array(['0', '3500', '2800', '3000', '2400', '4000', '5000', '5500',
       '1700', '8000', '4500', '2500', '6000', '7500', '3800', '7000',
       '2000', '8500', '3200', '3300', '2200', '2700', '42000', '10000',
       '7300'], dtype=object)
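
One pitfall when filtering on a dollar sign: `Series.str.contains` treats its pattern as a regular expression by default, and in a regex `$` is the end-of-string anchor rather than a literal character. A small illustration on made-up values:

```python
import pandas as pd

s = pd.Series(['$3,500', '0', 'nan'])

# Default regex=True: '$' anchors the end of the string, so every value matches
as_regex = s.str.contains('$')

# regex=False treats '$' as a literal character
as_literal = s.str.contains('$', regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

Escaping the character (`r'\$'`) has the same effect as `regex=False`.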

In [380]:
df.maxsalary.unique()

array([nan, '$5,000 a month', 0, '$4,000 a month', '$3,000 a month',
       '$6,000 a month', '$3,600 a month', '$7,000 a month',
       '$8,000 a month', '$6,500 a month', '$2,000 a month',
       '$5,500 a month', '$11,000 a month', '$4,500 a month',
       '$9,000 a month', '$3,500 a month', '$5,600 a month',
       '$4,200 a month', '$5,900 a month', '$10,000 a month',
       '$2,200 a month', '$9,500 a month', '$3,400 a month',
       '$6,200 a month', '$12,000 a month', '$5,800 a month',
       '$6,800 a month', '$3,800 a month', '$45,000 a year',
       '$2,500 a month', '$15,000 a month', '$7,500 a month'],
      dtype=object)

In [381]:
df.maxsalary = df.maxsalary.astype(str)
df['maxsalary'] = df.maxsalary.apply(lambda x: x.split(' ')[0].strip())
df['maxsalary'] = df['maxsalary'].apply(lambda x: x.replace('$','').replace(',', '').replace('nan', '0') if x is not None else 0)

In [382]:
df.maxsalary.unique()

array(['0', '5000', '4000', '3000', '6000', '3600', '7000', '8000',
       '6500', '2000', '5500', '11000', '4500', '9000', '3500', '5600',
       '4200', '5900', '10000', '2200', '9500', '3400', '6200', '12000',
       '5800', '6800', '3800', '45000', '2500', '15000', '7500'],
      dtype=object)

In [383]:
df.avgsalary.unique()

array([nan, '$4,000.00', '$6,000.00', '$3,600.00', '$7,000.00',
       '$6,500.00', '$5,000.00', '$2,000.00', '$6100', '$5,500.00',
       '$1,200.00', '$9,000.00', '$7,000', '$9,000', '$6000\nDuration:',
       '$11,000.00', '$8,000.00', '$3300', '$3,500.00', '$4,500.00',
       '$3,500', '$5,900.00', '$10,000.00', '$3500\n13', '$4.8K',
       '$1,000', '$2,200.00', '$9,500.00', '$3,300', '$4,000'],
      dtype=object)

In [384]:
df['avgsalary'] = np.where(df['avgsalary']=='$4.8K', '$4,800', df.avgsalary)
df['avgsalary'] = np.where(df['avgsalary']=='$3500\n13', '$3,500', df.avgsalary)
df['avgsalary'] = np.where(df['avgsalary']=='$6000\nDuration:', '$6,000', df.avgsalary)

In [385]:
df.avgsalary = df.avgsalary.astype(str)
df['avgsalary'] = df['avgsalary'].apply(lambda x: x.replace('$','').replace(',', '').replace('nan', '0') if x is not None else 0)

In [386]:
df.avgsalary.unique()

array(['0', '4000.00', '6000.00', '3600.00', '7000.00', '6500.00',
       '5000.00', '2000.00', '6100', '5500.00', '1200.00', '9000.00',
       '7000', '9000', '6000', '11000.00', '8000.00', '3300', '3500.00',
       '4500.00', '3500', '5900.00', '10000.00', '4800', '1000',
       '2200.00', '9500.00', '4000'], dtype=object)

In [387]:
#convert to float
df[['avgsalary', 'minsalary', 'maxsalary']]=df[['avgsalary', 'minsalary', 'maxsalary']].astype(float)


In [388]:
df.dtypes

advertiser       object
elapsed          object
jd               object
jobtitle         object
jobtype          object
location         object
ojoblink         object
salary           object
searchstring     object
source           object
url              object
minsalary       float64
maxsalary       float64
avgsalary       float64
dtype: object

In [389]:
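# 45000 is quoted 'a year' (and 42000 is presumably its lower bound); convert both to monthly figures
df.maxsalary = np.where(df.maxsalary==45000, df.maxsalary/12, df.maxsalary)
df.minsalary = np.where(df.minsalary==42000, df.minsalary/12, df.minsalary)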

In [390]:
# average max and min
df.avgsalary = np.where(df.avgsalary==0, (df.minsalary+df.maxsalary)/2, df.avgsalary)

In [391]:
df.avgsalary.unique()

array([    0.,  4250.,  4000.,  3400.,  3500.,  6000.,  3600.,  2700.,
        4500.,  7000.,  3200.,  5000.,  6500.,  2000.,  6100.,  5500.,
        1850.,  1200.,  4750.,  9000., 11000.,  5750.,  5250.,  9500.,
        8000.,  3300.,  8500.,  3000.,  5050.,  3750.,  5900., 10000.,
        4800.,  1000.,  8250.,  5450.,  4150.,  7500.,  2200.,  2100.,
        5850.,  3850.,  4550.,  2600.,  3350.,  3625.,  5150.,  2250.,
       12500., 11500.,  7650.])
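
The min/max/average steps above could also be folded into one parsing function. This is a sketch under stated assumptions, not the notebook's method: `parse_salary` is a hypothetical helper, it assumes any string mentioning "year" is an annual figure, and it treats a trailing `K` as thousands.

```python
import re

def parse_salary(text):
    """Return (min, max, avg) monthly salary as floats, or None if no amount is found."""
    amounts = []
    # Each dollar amount, with an optional K suffix (e.g. "$4.8K")
    pattern = r'\$\s*(\d[\d,]*(?:\.\d+)?)(k(?![a-z]))?'
    for number, k in re.findall(pattern, text, flags=re.IGNORECASE):
        value = float(number.replace(',', ''))
        amounts.append(value * 1000 if k else value)
    if not amounts:
        return None
    # Assumption: anything labelled "year" is annual and is divided by 12
    if 'year' in text.lower():
        amounts = [a / 12 for a in amounts]
    low, high = min(amounts), max(amounts)
    return low, high, (low + high) / 2

for raw in ['$3,500 - $5,000 a month', '$11,000.00 /month',
            '$4.8K', '$42,000 - $45,000 a year', 'negotiable']:
    print(repr(raw), parse_salary(raw))
```

A single-value posting returns the same number for all three fields, which avoids using 0 as a stand-in for "missing".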

In [392]:
#let's investigate null advertisers
df[df.advertiser.isnull()].source.unique()

array(['MyCareersFuture.SG'], dtype=object)

In [393]:
df[df.source=='MyCareersFuture.SG']

Unnamed: 0,advertiser,elapsed,jd,jobtitle,jobtype,location,ojoblink,salary,searchstring,source,url,minsalary,maxsalary,avgsalary
104,,8 days ago,"PermanentRoles & Responsibilities\nPresales support\nDevelop, update and improve marketing mater...",IT Business Analyst,Permanent,Singapore,https://www.indeed.com.sg/rc/clk?jk=2a568326890570be&from=vj&pos=top,,Business Analyst,MyCareersFuture.SG,https://www.indeed.com.sg/rc/clk?jk=2a568326890570be&fccid=dd616958bd9ddc12&vjs=3,0.0,0.0,0.0
114,,2 days ago,Roles & Responsibilities\nCOMPANY DESCRIPTION\nCEVA provides world class supply chain solutions ...,Business Analyst,,Singapore,https://www.indeed.com.sg/rc/clk?jk=edb2138041a3ee01&from=vj&pos=top,,Business Analyst,MyCareersFuture.SG,https://www.indeed.com.sg/rc/clk?jk=edb2138041a3ee01&fccid=dd616958bd9ddc12&vjs=3,0.0,0.0,0.0
133,,7 days ago,PermanentRoles & Responsibilities\nBusiness Transformation team is responsible for managing the ...,"Business Analyst, Business Transformation",Permanent,Singapore,https://www.indeed.com.sg/rc/clk?jk=414e17ba109d563f&from=vj&pos=top,,Business Analyst,MyCareersFuture.SG,https://www.indeed.com.sg/rc/clk?jk=414e17ba109d563f&fccid=dd616958bd9ddc12&vjs=3,0.0,0.0,0.0
236,CLSA SINGAPORE PTE LTD,15 days ago,PermanentRoles & Responsibilities\nCompany Description\nCLSA is an Asia’s leading capital market...,Sr. IT Analyst,Permanent,Singapore,https://www.indeed.com.sg/rc/clk?jk=42205fd0e650b521&from=vj&pos=top,,Business Analyst,MyCareersFuture.SG,https://www.indeed.com.sg/rc/clk?jk=42205fd0e650b521&fccid=aa7c119d5927ac09&vjs=3,0.0,0.0,0.0
237,MAVERICKS CONSULTING PTE. LTD.,2 days ago,"Contract, PermanentRoles & Responsibilities\nMavericks Consulting is a powerhouse of skilled IT ...",IT Business Process Consultant / Business Analyst,"Contract, Permanent",Singapore,https://www.indeed.com.sg/rc/clk?jk=851122345e4736b8&from=vj&pos=top,,Business Analyst,MyCareersFuture.SG,https://www.indeed.com.sg/rc/clk?jk=851122345e4736b8&fccid=ab53d60a9132c36f&vjs=3,0.0,0.0,0.0
256,,14 days ago,PermanentRoles & Responsibilities\nThe Business Systems Analyst is the Subject Matter Expert in ...,Business Systems Analyst,Permanent,Singapore,https://www.indeed.com.sg/rc/clk?jk=e198753782e4745a&from=vj&pos=top,,Business Analyst,MyCareersFuture.SG,https://www.indeed.com.sg/rc/clk?jk=e198753782e4745a&fccid=dd616958bd9ddc12&vjs=3,0.0,0.0,0.0
282,,9 days ago,"Roles & Responsibilities\nWork with different IT teams across infrastructure, and other division...","Vice President, Business Analyst",,Singapore,https://www.indeed.com.sg/rc/clk?jk=717717c772da2d69&from=vj&pos=top,,Business Analyst,MyCareersFuture.SG,https://www.indeed.com.sg/rc/clk?jk=717717c772da2d69&fccid=dd616958bd9ddc12&vjs=3,0.0,0.0,0.0
311,M & C SAATCHI (S) PTE LTD,2 days ago,"Roles & Responsibilities\nAs an Insights Strategist, you will hit the ground running and provide...",Strategy Analyst,,Singapore,https://www.indeed.com.sg/rc/clk?jk=9935b8d51ef2d8f0&from=vj&pos=top,,Business Analyst,MyCareersFuture.SG,https://www.indeed.com.sg/rc/clk?jk=9935b8d51ef2d8f0&fccid=98c81e207b341814&vjs=3,0.0,0.0,0.0
361,ERP21 PTE LTD,22 days ago,ContractRoles & Responsibilities\nPosition Overview\nThe ITSM Analyst shall work closely with th...,IT Service Management (ITSM) Analyst,Contract,Singapore,https://www.indeed.com.sg/rc/clk?jk=d3a628e959014107&from=vj&pos=top,"$3,500.00 /monthExperience:ERP",Business Analyst,MyCareersFuture.SG,https://www.indeed.com.sg/rc/clk?jk=d3a628e959014107&fccid=7cdfbf67ffc29a5b&vjs=3,0.0,0.0,3500.0
370,SIEMENS HEALTHCARE PTE. LTD.,9 days ago,PermanentRoles & Responsibilities\nResponsibilities / main duties:\nConducts analyses and create...,Strategy & Business Development Analyst,Permanent,Singapore,https://www.indeed.com.sg/rc/clk?jk=7544f9c0edc40c05&from=vj&pos=top,,Business Analyst,MyCareersFuture.SG,https://www.indeed.com.sg/rc/clk?jk=7544f9c0edc40c05&fccid=3b89b9ec324f96c8&vjs=3,0.0,0.0,0.0


In [394]:
# drop rows with a missing advertiser (all from MyCareersFuture.SG, which is covered in a separate search)
df.dropna(axis=0, subset=['advertiser'], inplace=True)

In [395]:
# investigate time elapsed
df.elapsed.unique()

array(['16 days ago', '20 days ago', 'save job', '19 days ago',
       '9 days ago', '12 days ago', '4 days ago', 'Pacific',
       '15 days ago', '1 day ago', '7 days ago', '14 days ago',
       '2 days ago', '30+ days ago', '9 hours ago', '13 days ago',
       '18 hours ago', '22 days ago', '29 days ago', '8 days ago',
       '28 days ago', '6 days ago', '21 days ago', '18 days ago',
       '24 days ago', 'Cola Refreshments', '22 hours ago', '3 days ago',
       '25 days ago', '10 days ago', '26 days ago', '5 days ago',
       '17 days ago', '2 hours ago', '17 hours ago', '19 hours ago',
       '27 days ago', '15 hours ago', '20 hours ago', '23 days ago',
       '11 days ago', '30 days ago', '21 hours ago', '13 hours ago',
       '5 hours ago', 'ISAC Inc', '1 hour ago', '4 hours ago',
       'Singbridge Pte Ltd', '16 hours ago', '23 hours ago',
       'Artificial Intelligence for the Visual Web', '14 hours ago',
       '6 hours ago', 'SG', 'Royce', 'Sunrice GlobalChef Academy',
     

In [396]:
# get rid of entries with words
df.elapsed = df.elapsed.apply(lambda x: x if x[0].isdigit() else None)
df.dropna(axis=0, subset=['elapsed'], inplace=True)

In [397]:
# postings less than a day old (hours or minutes) count as 0 days elapsed
df.elapsed = np.where(df.elapsed.str.contains('hour|minute'), 0, df.elapsed)

In [398]:
df.elapsed.unique()

array(['16 days ago', '20 days ago', '19 days ago', '9 days ago',
       '12 days ago', '4 days ago', '15 days ago', '1 day ago',
       '7 days ago', '14 days ago', '2 days ago', '30+ days ago', 0,
       '13 days ago', '22 days ago', '29 days ago', '8 days ago',
       '28 days ago', '6 days ago', '21 days ago', '18 days ago',
       '24 days ago', '3 days ago', '25 days ago', '10 days ago',
       '26 days ago', '5 days ago', '17 days ago', '27 days ago',
       '23 days ago', '11 days ago', '30 days ago'], dtype=object)

In [399]:
# remove remaining text and non-numeric signs
df.elapsed = df.elapsed.apply(lambda x: x.split(' ')[0].strip().replace('+','') if x!=0 else 0)

In [400]:
df.elapsed.unique()

array(['16', '20', '19', '9', '12', '4', '15', '1', '7', '14', '2', '30',
       0, '13', '22', '29', '8', '28', '6', '21', '18', '24', '3', '25',
       '10', '26', '5', '17', '27', '23', '11'], dtype=object)
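
The elapsed-time cleanup above leaves a mix of strings and the integer 0. A single helper could return whole days directly; `parse_elapsed` below is a hypothetical sketch that follows the same rules (hours and minutes become 0, `30+` becomes 30, anything that is not a time phrase is rejected).

```python
import re

def parse_elapsed(text):
    """Convert strings like '16 days ago' to whole days; return None for non-dates."""
    match = re.match(r'(\d+)\+?\s*(minute|hour|day)s?\s+ago', str(text))
    if match is None:
        return None            # e.g. 'save job' or a stray company name
    count, unit = int(match.group(1)), match.group(2)
    # Anything younger than a day counts as 0 days, as in the cells above
    return count if unit == 'day' else 0

for raw in ['16 days ago', '1 day ago', '30+ days ago', '9 hours ago', 'save job']:
    print(repr(raw), parse_elapsed(raw))
```

Note that `30+ days ago` is a lower bound, so 30 understates the true age of those postings.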

In [401]:
# investigate jobtype
df.jobtype.value_counts()

Permanent                                    336
Contract                                      89
Internship                                    71
Contract, Permanent                           42
Temporary                                     32
Temporary, Internship                         13
Temporary, Contract                            5
Part-time, Contract                            2
Part-time, Temporary, Contract, Permanent      1
Part-time                                      1
Name: jobtype, dtype: int64

In [402]:
# collapse combined job types into one label; earlier rules take precedence,
# and na=False leaves missing job types untouched
df.jobtype = np.where(df.jobtype.str.contains('Permanent', na=False), 'Permanent', df.jobtype)
df.jobtype = np.where(df.jobtype.str.contains('Contract', na=False), 'Contract', df.jobtype)
df.jobtype = np.where(df.jobtype.str.contains('Internship', na=False), 'Internship', df.jobtype)
df.jobtype = np.where(df.jobtype.str.contains('Temporary', na=False), 'Contract', df.jobtype)
df.jobtype = np.where(df.jobtype.str.contains('Part-time', na=False), 'Parttime', df.jobtype)

In [403]:
df.jobtype.unique()

array([None, 'Internship', 'Permanent', 'Contract', 'Parttime'],
      dtype=object)
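
The precedence used when collapsing job types can be written down as an ordered rule list, which makes it easier to see which label a combined value receives. `JOBTYPE_RULES` and `normalise_jobtype` are hypothetical names; the sketch assumes the same ordering as the `np.where` chain above.

```python
import pandas as pd

# Checked in order, so the first matching keyword wins
JOBTYPE_RULES = [
    ('Permanent', 'Permanent'),
    ('Contract', 'Contract'),
    ('Internship', 'Internship'),
    ('Temporary', 'Contract'),   # Temporary is folded into Contract, as above
    ('Part-time', 'Parttime'),
]

def normalise_jobtype(value):
    """Map a raw jobtype string to a single label; missing values stay None."""
    if pd.isnull(value):
        return None
    for keyword, label in JOBTYPE_RULES:
        if keyword in value:
            return label
    return value

raw = pd.Series(['Contract, Permanent', 'Temporary, Internship',
                 'Temporary', 'Part-time, Contract', 'Part-time', None])
print(raw.apply(normalise_jobtype).tolist())
```

It could be applied with `df.jobtype.apply(normalise_jobtype)` in place of the chained assignments.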

In [404]:
df.head()

Unnamed: 0,advertiser,elapsed,jd,jobtitle,jobtype,location,ojoblink,salary,searchstring,source,url,minsalary,maxsalary,avgsalary
0,Accenture,16,Join Accenture and help transform leading organizations and communities around the world. The sh...,Business Analyst (Health and Public Service),,Singapore,https://www.indeed.com.sg/rc/clk?jk=ec5934f649cbf86d&from=vj&pos=top,,Business Analyst,Accenture,https://www.indeed.com.sg/rc/clk?jk=ec5934f649cbf86d&fccid=a4e4e2eaf26690c9&vjs=3,0.0,0.0,0.0
1,Hewlett Packard Enterprise,20,"At HPE, we bring together the brightest minds to create breakthrough technology solutions and ad...",Business Analyst,,Singapore,https://www.indeed.com.sg/rc/clk?jk=847dc76922d52115&from=vj&pos=top,,Business Analyst,Hewlett Packard Enterprise,https://www.indeed.com.sg/rc/clk?jk=847dc76922d52115&fccid=216eb700022de6f6&vjs=3,0.0,0.0,0.0
3,MasterCard,19,It is essential to continue our growth momentum in the region to deliver on our key strategic pi...,Strategy and Operations Analyst,,Singapore,https://www.indeed.com.sg/rc/clk?jk=89d35853db917c9a&from=vj&pos=top,,Business Analyst,MasterCard,https://www.indeed.com.sg/rc/clk?jk=89d35853db917c9a&fccid=10b5c722d846df43&vjs=3,0.0,0.0,0.0
5,Helix Leisure,9,Helix Leisure is a leading global supplier to the Out of Home Entertainment industry – locations...,Business Analyst,,Singapore,https://www.indeed.com.sg/rc/clk?jk=4435e4019055afff&from=vj&pos=top,,Business Analyst,Helix Leisure,https://www.indeed.com.sg/rc/clk?jk=4435e4019055afff&fccid=778fa220f9b5c1b7&vjs=3,0.0,0.0,0.0
6,General Electric,12,Role Summary:\nThe Business Analyst role is responsible for building live/dynamic models that ca...,Business Analyst - Digitization,,Singapore,https://www.indeed.com.sg/rc/clk?jk=0c87a64013378b6d&from=vj&pos=top,,Business Analyst,GE Careers,https://www.indeed.com.sg/rc/clk?jk=0c87a64013378b6d&fccid=c5c99ec01e2125aa&vjs=3,0.0,0.0,0.0


In [408]:
df[df.avgsalary !=0].avgsalary.count()

41

In [409]:
df['portal'] = 'Indeed'

In [410]:
df.to_pickle("./master2.pkl")