### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).
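
As a minimal sketch of the classification framing, the snippet below turns listed salary strings into numeric midpoints and labels each known salary as high (1) or low (0) relative to the median. The helper names `salary_midpoint` and `label_by_median` are hypothetical, and the sketch assumes every salary is quoted for the same period (for example "a month"); mixed periods would need normalizing first.

```python
import re
from statistics import median


def salary_midpoint(text):
    """Return the mean of the dollar amounts in a salary string, or None if there are none."""
    if not isinstance(text, str):
        return None
    amounts = [
        float(a.replace(",", ""))
        for a in re.findall(r"\$(\d[\d,]*(?:\.\d+)?)", text)
    ]
    return sum(amounts) / len(amounts) if amounts else None


def label_by_median(salaries):
    """Label each known salary 1 (above the median) or 0; unknown salaries stay None."""
    known = [s for s in salaries if s is not None]
    threshold = median(known)
    labels = [None if s is None else int(s > threshold) for s in salaries]
    return threshold, labels


# Illustrative inputs only; the third listing has no salary.
listed = ["$3,000 - $6,000 a month", "$4,000 - $6,000 a month", None, "$8,000 a month"]
midpoints = [salary_midpoint(s) for s in listed]
threshold, labels = label_by_median(midpoints)
```

Listings labelled `None` are the ones whose class the model would later have to predict.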

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.

### 1) Inspect the elements of this page to confirm we can find all of the information we're interested in.

### 2) Use `urllib` and `BeautifulSoup` to read the contents of the HTML.

In [29]:
from bs4 import BeautifulSoup
import urllib.error
import urllib.parse
import urllib.request
from time import sleep
import pandas as pd


In [30]:
# Set the URL we want to visit.
searchstr = "Business Analyst"
plus = searchstr.replace(" ", '+')
location = "Singapore"
jtype = ""  # "fulltime", "permanent", "contract","internship", "temporary", "parttime"
baseurl = "https://www.indeed.com.sg"
url = baseurl + "/jobs?q={}&l={}&jt={}".format(plus, location, jtype)
print(url)


https://www.indeed.com.sg/jobs?q=Business+Analyst&l=Singapore&jt=
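
The same URL can also be built with `urllib.parse.urlencode`, which handles the space-to-plus conversion and escapes any other special characters in the search terms. This is a sketch only; `build_search_url` is a hypothetical helper, not part of the original notebook.

```python
from urllib.parse import urlencode


def build_search_url(baseurl, query, location, jobtype=""):
    """Build a search URL, letting urlencode escape the query parameters."""
    params = {"q": query, "l": location, "jt": jobtype}
    return "{}/jobs?{}".format(baseurl, urlencode(params))


search_url = build_search_url("https://www.indeed.com.sg", "Business Analyst", "Singapore")
print(search_url)
```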


In [31]:
# Visit the URL and grab the HTML of the page.
html = urllib.request.urlopen(url).read()
# We need to convert this into a soup object.
soup = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")


### 5) Extract the HTML link to each job



**investigate how to find link to 'next' page**


In [32]:
# There are 10 entries on each page, so page 1 starts at entry 0 and page 2 starts at entry 10.
# In general, the page that follows page n starts at entry n x 10 (the `start` query parameter).
nextpg = url+"&start=40"
nextpg

'https://www.indeed.com.sg/jobs?q=Business+Analyst&l=Singapore&jt=&start=40'
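
The offset rule can be sketched as a small generator. `page_urls` is a hypothetical helper, and the page size of 10 is taken from the observation above rather than from any documented guarantee.

```python
def page_urls(url, n_pages, per_page=10):
    """Yield the URL of each results page; the first page needs no `start` parameter."""
    for page in range(n_pages):
        if page == 0:
            yield url
        else:
            yield "{}&start={}".format(url, page * per_page)


pages = list(page_urls("https://www.indeed.com.sg/jobs?q=Business+Analyst&l=Singapore&jt=", 5))
```

Here `pages[4]` is the fifth results page, the same `&start=40` URL shown above.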

**write loop to extract all links**

In [33]:
links = []
for i in range(1,500):
    # after examining html, determined that class:jobtitle is unique to job entry
    hits = soup.find_all('h2', {"class":"jobtitle"})
    # the URL we want is the href of the <a> element inside each hit
    for hit in hits:
        h = urllib.parse.urljoin(url, hit.find("a")['href'])
        links.append(h)
    
    # construct url for next page
    
    nextpg = url+"&start={}".format(i*10)
    #print(nextpg)
    # open and read next page into BS4
    html = urllib.request.urlopen(nextpg).read()
    #sleep(1)
    
    #reinitialize soup object to new page
    soup = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")

In [34]:
# let's check out number of items in links
len(links)

4990

In [35]:
links[:5]

['https://www.indeed.com.sg/rc/clk?jk=ec5934f649cbf86d&fccid=a4e4e2eaf26690c9&vjs=3',
 'https://www.indeed.com.sg/rc/clk?jk=847dc76922d52115&fccid=216eb700022de6f6&vjs=3',
 'https://www.indeed.com.sg/company/Carecone-Technologies/jobs/Business-Analyst-9d4ecfc9e405881d?fccid=d0ab2a8c843ca2df&vjs=3',
 'https://www.indeed.com.sg/rc/clk?jk=89d35853db917c9a&fccid=10b5c722d846df43&vjs=3',
 'https://www.indeed.com.sg/company/Consumer-Cloud-Technology-Services/jobs/Junior-Business-Analyst-fa48f3dfd37a8185?fccid=aa0ccaeee49d1a69&vjs=3']
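
If the search pages repeat listings, the list will contain duplicates. A possible way to drop them is to key each link on its job identifier, as sketched below. `job_key` and `dedupe_links` are hypothetical helpers, and treating the trailing token of a `/company/.../jobs/...` path as the job ID is an assumption based only on the links shown above.

```python
from urllib.parse import parse_qs, urlparse


def job_key(link):
    """Return an identifier for a job link: the `jk` parameter, else the path's last token."""
    parsed = urlparse(link)
    jk = parse_qs(parsed.query).get("jk")
    if jk:
        return jk[0]
    # Assumption: company-hosted links end in "-<job id>".
    return parsed.path.rsplit("-", 1)[-1]


def dedupe_links(links):
    """Keep the first occurrence of each job, preserving order."""
    seen = set()
    unique = []
    for link in links:
        key = job_key(link)
        if key not in seen:
            seen.add(key)
            unique.append(link)
    return unique
```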

### Now let's write a loop to extract all the info we want and put it into pandas

In [36]:
# If extracting in batches, comment out `datadict = []` after the initial run so earlier results are kept.
datadict = []

In [37]:
for i in links:
    try:
        readurl = urllib.request.urlopen(i).read()
        
    except urllib.error.HTTPError as err:
        print(err.code)
        continue

    #reinitialize soup object to new page
    soup2 = BeautifulSoup(readurl, 'html.parser', from_encoding="utf-8")
    
    # extracting info
    entry ={}
    #entry['jobid'] = 'jobid'
    entry['url'] = i
    entry['searchstring'] = searchstr
    entry['location'] = location
    # Each field is optional: a missing element leaves the value as None.
    # `except Exception` (rather than a bare `except`) still lets Ctrl-C interrupt the scrape.
    try:
        entry['jobtitle'] = soup2.find('h3', {"class":"icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title"}).text
    except Exception:
        entry['jobtitle'] = None
    try:
        entry['advertiser'] = soup2.find('div', {"class":"icl-u-lg-mr--sm icl-u-xs-mr--xs"}).text
    except Exception:
        entry['advertiser'] = None
    try:
        entry['jd'] = soup2.find('div', {"class":"jobsearch-JobComponent-description icl-u-xs-mt--md"}).text
    except Exception:
        entry['jd'] = None
    try:
        ojob = soup2.find('span', {"id":"originalJobLinkContainer"})
        entry['ojoblink'] = ojob.find('a')['href']
    except Exception:
        entry['ojoblink'] = None
    try:
        elapsed = soup2.find('div', {"class":"jobsearch-JobMetadataFooter"}).text
        entry['elapsed'] = elapsed.split('-')[1].strip()
    except Exception:
        entry['elapsed'] = None
    try:
        # match on the class attribute explicitly
        entry['source'] = soup2.find('span', {"class":"icl-u-textColor--success"}).text
    except Exception:
        entry['source'] = None
    try:
        entry['salary'] = soup2.find('span', {"class":"icl-u-xs-mr--xs"}).text
    except Exception:
        entry['salary'] = None
    try:
        entry['jobtype'] = soup2.find('div', {"class":"jobsearch-JobMetadataHeader-item icl-u-xs-mt--xs"}).text
    except Exception:
        entry['jobtype'] = None
    datadict.append(entry)
    

500
404
500
500
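
The 500 codes printed above are server-side errors, which may be transient, whereas a 404 will not succeed on a second attempt. One option is to retry only the 5xx responses with a short pause, as in the sketch below. `fetch_with_retry` is a hypothetical helper; the `opener` and `sleep` parameters exist only so the function can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=3, delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the page body as bytes, or None if the request keeps failing."""
    for attempt in range(1, retries + 1):
        try:
            return opener(url).read()
        except urllib.error.HTTPError as err:
            # Client errors (4xx) are not retried; give up on the last attempt too.
            if err.code < 500 or attempt == retries:
                print(err.code)
                return None
            # Wait a little longer before each new attempt.
            sleep(delay * attempt)
    return None
```

In the loop above, `readurl = fetch_with_retry(i)` followed by `if readurl is None: continue` would take the place of the `try`/`except` around `urlopen`.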


In [38]:
len(datadict)

4986

In [39]:
df = pd.DataFrame(datadict)
df.sample(10)

| | advertiser | elapsed | jd | jobtitle | jobtype | location | ojoblink | salary | searchstring | source | url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1085 | TELEDIRECT PTE LTD | 6 days ago | PermanentRoles & Responsibilities\nAs a Schedu... | Scheduling Analyst | Permanent | Singapore | https://www.indeed.com.sg/rc/clk?jk=b788fbe02e... | | Business Analyst | MyCareersFuture.SG | https://www.indeed.com.sg/rc/clk?jk=b788fbe02e... |
| 2854 | MSD | 30+ days ago | The Senior Specialist Ops Lead is responsible ... | IT Snr Specialist Ops Lead | | Singapore | https://www.indeed.com.sg/rc/clk?jk=792cb07197... | | Business Analyst | MSD | https://www.indeed.com.sg/rc/clk?jk=792cb07197... |
| 1738 | Mercer | 30+ days ago | An opportunity to work for a market leader\nWo... | Global Mobility Analyst | | Singapore | https://www.indeed.com.sg/rc/clk?jk=96be90325f... | | Business Analyst | Marsh & McLennan Companies | https://www.indeed.com.sg/rc/clk?jk=96be90325f... |
| 4581 | Power IT Consultancy | 30+ days ago | PermanentJob Location : Singapore Job Type : P... | Business Analyst - Calypso | Permanent | Singapore | https://www.indeed.com.sg/rc/clk?jk=c88932d8ad... | | Business Analyst | Power IT Consultancy | https://www.indeed.com.sg/rc/clk?jk=c88932d8ad... |
| 295 | SINGAPORE EXCHANGE LIMITED | 8 days ago | Roles & Responsibilities\nKey Accountabilities... | Business Analyst | | Singapore | https://www.indeed.com.sg/rc/clk?jk=3e2ed800fa... | | Business Analyst | MyCareersFuture.SG | https://www.indeed.com.sg/rc/clk?jk=3e2ed800fa... |
| 37 | Techsource Systems Pte Ltd | save job | $3,000 - $6,000 a monthThe Business Analyst pl... | Business Systems Analyst | | Singapore | | $3,000 - $6,000 a month | Business Analyst | | https://www.indeed.com.sg/company/TechSource-S... |
| 4363 | Sciente | 21 days ago | Exciting career opportunity for Business Analy... | Business Analyst (Murex) | | Singapore | | | Business Analyst | Sciente | https://www.indeed.com.sg/rc/clk?jk=7498502484... |
| 2218 | CCN Hub | 30+ days ago | Software Analyst\n\n\n\nTechnical Requirements... | SOFTWARE ANALYST | | Singapore | https://www.indeed.com.sg/rc/clk?jk=b90d99f056... | | Business Analyst | CCN Hub | https://www.indeed.com.sg/rc/clk?jk=b90d99f056... |
| 94 | Singapore LNG Corporation Pte Ltd | 12 days ago | The successful candidate will liaise closely w... | Business Process Analyst | | Singapore | https://www.indeed.com.sg/rc/clk?jk=02115dfa11... | | Business Analyst | Singapore LNG Corporation Pte Ltd | https://www.indeed.com.sg/rc/clk?jk=02115dfa11... |
| 40 | ECnet Limited | save job | $4,000 - $6,000 a monthYou will be working col... | IT Business Analyst (PERM) | | Singapore | | $4,000 - $6,000 a month | Business Analyst | | https://www.indeed.com.sg/company/R-Systems/jo... |


In [40]:
df.shape

(4986, 11)
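
The `elapsed` column in the sample mixes values such as "6 days ago" and "30+ days ago" with "save job", which looks like a parsing miss rather than a posting age. Before using it as a feature, it could be converted to a day count along the lines below. `elapsed_days` is a hypothetical helper, and it deliberately returns `None` for any wording other than "N day(s) ago".

```python
import re


def elapsed_days(text):
    """Convert strings like '6 days ago' or '30+ days ago' to an int; anything else gives None."""
    if not isinstance(text, str):
        return None
    match = re.match(r"\s*(\d+)\+?\s+days?\s+ago", text)
    return int(match.group(1)) if match else None


sample = ["6 days ago", "30+ days ago", "save job", None]
days = [elapsed_days(s) for s in sample]
```

With a DataFrame, the same function could be applied through `df['elapsed'].map(elapsed_days)`; note that "30+" is a floor, so those rows are censored rather than exact.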

In [41]:
df.to_pickle("./IDD{}.pkl".format(searchstr))