# WEB SCRAPING

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries. 

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer these two questions.

## Requirements

1. Scrape and prepare your own data.

2. **Create and compare at least two models for each section**. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression model of your choosing (e.g. Ridge, logistic regression, KNN, SVM).
   - Section 1: Job Salary Trends
   - Section 2: Job Category Factors

3. Prepare a polished Jupyter Notebook with your analysis for a peer audience of data scientists. 
   - Make sure to clearly describe and label each section.
   - Comment on your code so that others could, in theory, replicate your work.

4. Write a brief executive summary for a non-technical audience.
   - The writeup should be 500-1000 words and should define any technical terms, explain your approach, and describe any risks and limitations.

#### BONUS

5. Answer the salary discussion by using your model to explain the tradeoffs between detecting high vs low salary positions.

6. Convert your executive summary into a public blog post of at least 500 words, in which you document your approach in a tutorial for other aspiring data scientists. Link to this in your notebook.

---

## Suggestions for Getting Started

1. Collect data from [Indeed.com](https://www.indeed.com) (or another aggregator) on data-related jobs to use in predicting salary trends for your analysis.
  - Select and parse data from *at least 1000 postings* for jobs, potentially from multiple location searches.
2. Find out what factors most directly impact salaries (e.g. title, location, department).
  - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
3. Discover which features have the greatest importance when determining a low vs. high paying job.
  - Your boss is interested in which features hold the greatest significance overall.
  - HR is interested in which SKILLS and KEYWORDS hold the greatest significance.
4. Author an executive summary that details the highlights of your analysis for a non-technical audience.
5. If tackling the bonus question, try framing the salary problem as a classification problem detecting low vs. high salary positions.
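
One simple way to frame the salary problem as classification is to split postings at the median salary. The sketch below is only an illustration: the function name `label_high_low` and the example figures are assumptions, not part of the assignment, and a different cutoff (for example a quartile) may suit your data better.

```python
from statistics import median

def label_high_low(salaries):
    """Label each salary 1 (high) if it is above the median, else 0 (low)."""
    cutoff = median(salaries)
    return [1 if s > cutoff else 0 for s in salaries]

# Illustrative monthly salaries, not scraped data
example_salaries = [4000, 5500, 7000, 9000, 12000]
labels = label_high_low(example_salaries)
print(labels)
```

A median split gives roughly balanced classes, which keeps accuracy meaningful as a first metric. Salaries exactly at the median are labelled low here.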


# Web Scraping the Data

In [26]:
import requests
from bs4 import BeautifulSoup
import os
from selenium import webdriver
from time import sleep
import numpy as np
import pandas as pd

In [12]:
from selenium.common.exceptions import NoSuchElementException
#for Pages

### Approach
Collect the link to each job posting from every search results page into a list, then loop over that list, load each link, and use BeautifulSoup to extract the contents of the posting.

In [13]:
joblinks=[]

In [21]:
chromedriver='/Users/techno/Desktop/chromedriver'
# Note: executable_path is deprecated in Selenium 4, which expects a Service object instead
for page_no in range(0,100):
    # open a fresh browser window for each results page
    driver = webdriver.Chrome(executable_path=chromedriver)
    driver.get("https://www.mycareersfuture.sg/search?search=data&sortBy=new_posting_date&page=%s" % page_no)
    sleep(5)  # wait for the page to finish loading
    html = driver.page_source
    # turn into a BeautifulSoup object
    soup = BeautifulSoup(html, 'lxml')
    jobelement=soup.find_all('a',{'class':'bg-white mb3 w-100 dib v-top pa3 no-underline flex-ns flex-wrap JobCard__card___22xP3'})

    # add the link from every job card on this page to our joblinks list
    for card in jobelement:
        joblinks.append(card['href'])
    driver.quit()  # quit() closes the window and ends the chromedriver process

In [23]:
# prefix each relative link with the site's base URL
joblinkslist=['https://www.mycareersfuture.sg' + x for x in joblinks]
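
As an alternative to plain string concatenation, the standard library's `urljoin` resolves relative links and leaves already-absolute ones untouched. The sketch below also drops duplicate links, on the assumption that the same posting can show up on more than one results page. The helper name `absolute_links` and the example paths are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.mycareersfuture.sg'

def absolute_links(hrefs, base=BASE_URL):
    """Resolve each href against the base URL, dropping duplicates but keeping order."""
    resolved = (urljoin(base, href) for href in hrefs)
    return list(dict.fromkeys(resolved))

# Hypothetical hrefs for illustration only
example = absolute_links([
    '/job/abc-123',
    '/job/abc-123',
    'https://www.mycareersfuture.sg/job/xyz-456',
])
print(example)
```

Deduplicating before the second scraping pass avoids opening the same posting twice, which matters when each page load costs several seconds.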

In [27]:
jobdata=pd.DataFrame(columns=['salary','salarytype','joblist','commitment','location','seniority','company','requirements','job_description'])

In [28]:
# Feature extractor: visit each posting and pull one field per selector.
# Each entry maps a jobdata column to the (tag, attributes) that locate it on the page.
selectors = {
    'salarytype': ('span', {'class': 'salary_type dib f5 fw4 black-60 pr1 i pb'}),
    'salary': ('div', {'class': 'lh-solid'}),
    'joblist': ('h1', {'class': 'f3 fw6 mv0 pv0 mb1 dark-pink w-100 dib'}),
    'commitment': ('p', {'id': 'employment_type'}),
    'location': ('a', {'class': 'link dark-pink underline-hover'}),
    'seniority': ('p', {'id': 'seniority'}),
    'company': ('p', {'name': 'company'}),
    'requirements': ('div', {'id': 'requirements'}),
    'job_description': ('div', {'id': 'job_description'}),
}

for i, link in enumerate(joblinkslist):
    linkeddriver=webdriver.Chrome(executable_path=chromedriver)
    linkeddriver.get(link)
    sleep(8)  # wait for the posting to finish loading
    linkedhtml=BeautifulSoup(linkeddriver.page_source,'lxml')

    for column, (tag, attrs) in selectors.items():
        element=linkedhtml.find(tag, attrs)
        # a field that is missing from the page is stored as NaN
        jobdata.loc[i, column]=element.getText() if element is not None else np.nan

    linkeddriver.quit()
    # save after every posting so a crash does not lose the rows scraped so far
    jobdata.to_csv('jobdata.csv')
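
Once the `requirements` text has been collected, HR's question about skills and keywords can be approached with a simple count of how many postings mention each term. This is a rough sketch under stated assumptions: the `SKILLS` list and the function `count_skill_mentions` are hypothetical, and whole-word matching is crude (for example, "R&D" would count as a mention of R).

```python
import re
from collections import Counter

# Hypothetical skill list for illustration; extend it to suit your search
SKILLS = ['python', 'sql', 'r', 'spark', 'tableau', 'machine learning']

def count_skill_mentions(texts, skills=SKILLS):
    """Count how many postings mention each skill (at most once per posting)."""
    counts = Counter()
    for text in texts:
        if not isinstance(text, str):
            continue  # skip NaN / missing requirements
        lowered = text.lower()
        for skill in skills:
            # word boundaries stop 'r' from matching inside 'requirements'
            if re.search(r'\b' + re.escape(skill) + r'\b', lowered):
                counts[skill] += 1
    return counts

# Made-up requirement snippets, not scraped data
sample = ['Strong Python and SQL skills required',
          'Experience with R, Python or Spark',
          None]
skill_counts = count_skill_mentions(sample)
print(skill_counts.most_common())
```

With the scraped data, something like `count_skill_mentions(jobdata['requirements'])` would give posting counts per skill, which can then be turned into binary features for the models.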
    
    

In [35]:
jobdata

| Unnamed: 0 | salary | salarytype | joblist | commitment | location | seniority | company |
|---|---|---|---|---|---|---|---|
| 0 | | | Data Analyst | Permanent | 55 MARKET STREET 048941 | Executive | KIMBERLEY CONSULTING PTE. LTD. |
| 1 | $6,500to$11,700 | Monthly | AVP, Data Scientist, Business Analytics, Consu... | Full Time | MARINA BAY FINANCIAL CENTRE, 12 MARINA BOULEVA... | Senior Management | DBS BANK LTD. |
| 2 | $5,500to$11,000 | Monthly | High-Performance Data Engineer | Permanent | PARKVIEW SQUARE, 600 NORTH BRIDGE ROAD 188778 | Professional | NIOMETRICS (PTE.) LTD. |
| 3 | $5,000to$10,000 | Monthly | Data Scientist | Permanent | PARKVIEW SQUARE, 600 NORTH BRIDGE ROAD 188778 | Professional | NIOMETRICS (PTE.) LTD. |
| 4 | $5,000to$7,500 | Monthly | Data Scientist | Contract, Full Time | 21 LOWER KENT RIDGE ROAD 119077 | Non-executive | NATIONAL UNIVERSITY OF SINGAPORE |
| 5 | $8,000to$16,000 | Monthly | Data scientist / Application production engi... | Full Time | HONG LEONG BUILDING, 16 RAFFLES QUAY 048581 | Non-executive | KEYTEO CONSULTING PTE. LTD. |
| 6 | $6,000to$8,500 | Monthly | Enterprise Architecture Tool Admin | Contract, Full Time | GUOCO TOWER, 1 WALLICH STREET 078881 | Executive | MANPOWER STAFFING SERVICES (SINGAPORE) PTE LTD |
| 7 | $10,000to$15,000 | Monthly | Managing Consultant – Banking and Blockchain | Permanent, Contract | 31 CANTONMENT ROAD 089747 | Professional | WIPRO LIMITED (SINGAPORE BRANCH) |
| 8 | $4,500to$9,000 | Monthly | Global Mobility Solution Operations Specialist... | Full Time | MARINA BAY FINANCIAL CENTRE, 8 MARINA BOULEVAR... | Junior Executive | GOOGLE ASIA PACIFIC PTE. LTD. |
| 9 | $8,500to$10,500 | Monthly | Project Manager | Permanent | 31 CANTONMENT ROAD 089747 | Professional | WIPRO LIMITED (SINGAPORE BRANCH) |
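
The `salary` column holds strings such as `$6,500to$11,700`, so it needs to be converted to numbers before modelling. Here is a minimal sketch of one way to do that with a regular expression. The function name `parse_salary_range` is an assumption, and the pattern only handles the "two dollar amounts" format visible in the output above; anything else comes back as `None`.

```python
import re

def parse_salary_range(text):
    """Return (low, high) as floats from a string like '$6,500to$11,700', or None."""
    if not isinstance(text, str):
        return None  # covers NaN / missing values
    amounts = re.findall(r'\$([\d,]+(?:\.\d+)?)', text)
    if len(amounts) != 2:
        return None  # unexpected format, leave for manual inspection
    low, high = (float(a.replace(',', '')) for a in amounts)
    return low, high

print(parse_salary_range('$6,500to$11,700'))
```

The midpoint `(low + high) / 2` is one reasonable single-number target for regression. Remember to check `salarytype` as well, since a monthly figure and an annual figure are not comparable without rescaling.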
