# Project 4: Job Market Analysis

### Notebook 01: Data Gathering (Web Scrape)

---

## Contents
- [1.0 Introduction](#Intro)
- [2.0 Web Scrape](#scrapemain)
    - [2.1 Defining scraping functions](#functions)
    - [2.2 Scraping the web](#scrape)
- [3.0 Next Steps](#next)

## 1.0 Introduction <a name="Intro"></a>
The goal of this project is to answer the following questions:
1. Which factors in the job market have the greatest effect on salary?
2. Is it possible to identify the key skills and buzzwords for each job category or title?

To answer these questions, I will gather data from the job search site [SEEK Limited AU](https://www.seek.com.au/). For this project I will limit the study to available jobs in data-related fields (e.g. data science, data analysis) in the Sydney and Melbourne areas.

I will split this project into three notebooks as follows:
- **Notebook 01: Data Gathering (Web Scrape)**
- Notebook 02: Data Cleaning and Exploratory Analysis
- Notebook 03: Predictive Model Building

In [1]:
# Import required libraries
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from scrapy.selector import Selector
from time import sleep

## 2.0 Web Scrape <a name="scrapemain"></a>
The following code will scrape the data from the job site SEEK.com.au. I have defined two functions as follows:
- The first function will scrape the search result page and identify all the href links to the job advertisement pages.
- The second function will go to each job advertisement page, and scrape the available data.

The output of these functions is a DataFrame containing all the scraped information, which I will use for analysis and modelling. I use Requests to fetch each page, and a combination of BeautifulSoup and Scrapy's XPath selectors to parse it.

**Important Note:**
> The code below worked when this notebook was created. However, the structure of the website may change in future, in which case the selectors will no longer return the intended results.

### 2.1 Defining scraping functions <a name="functions"></a>

In [2]:
# Define scraping function for each job page
def individual_page(url):
    result = requests.get(url)
    
    if result.status_code == 200:
        x_select = Selector(text=result.text)
        job_title = x_select.xpath('//*/*[@*="job-detail-title"]/span/h1/text()').extract()[0] # Extract Title
        advertiser = x_select.xpath('//*/h2/span/span/text()').extract()                       # Extract Advertiser
        if len(advertiser) > 0:
            advertiser = advertiser[0]
        else:
            advertiser = None
        rating = x_select.xpath('//*/h2/span/span/span/span/text()').extract()                 # Extract Advertiser Rating
        if len(rating) > 0:
            rating = rating[0]
        else:
            rating = None
            
        posted_date = x_select.xpath('//*/*[@*="job-detail-date"]/span/span/text()').extract()[0]  # Extract Post Date
        salary = x_select.xpath('//*/*[@*="jobInfoHeader"]/dl/div/dd/span/span/text()').extract()  # Extract Salary
        if len(salary) > 0:
            salary = salary[0]
        else:
            salary = None
            
        work_type = x_select.xpath('//*/*[@*="job-detail-work-type"]/span/span/text()').extract()[0]  # Extract Contract Type
        category = x_select.xpath('//*/section[@*="jobInfoHeader"]/dl/div/dd/span/span/strong/text()').extract()[0]  # Job Category
        subcategory = x_select.xpath('//*/section[@*="jobInfoHeader"]/dl/div/dd/span/span/span/text()').extract()[0] # Job Subcategory

        # Extract Job description text
        results_parsed = BeautifulSoup(result.text, 'lxml')
              
        body = results_parsed.find('div', {'data-automation': 'mobileTemplate'}).text
        
        return job_title, advertiser, rating, posted_date, salary, work_type, category, subcategory, body
    
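    else:
        # Return a placeholder for every field so the caller can still unpack nine values
        print('Failed:', url)
        return (None,) * 9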
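
The repeated length checks above (for `advertiser`, `rating` and `salary`) could be folded into one helper. The sketch below is a hypothetical refactor, not part of the original notebook: `first_or_none` is an invented name, and it works on any list, such as the one returned by `.extract()`. Applying it to the fields that currently use `.extract()[0]` directly would also avoid an `IndexError` when a selector finds nothing, for example after a change in page layout.

```python
def first_or_none(values):
    """Return the first item of a list, or None when the list is empty.

    Hypothetical helper (not in the original notebook); `values` stands in
    for the list returned by an XPath `.extract()` call.
    """
    return values[0] if values else None


# Stand-ins for extracted results: one match found, then no match found
print(first_or_none(['Data Scientist']))
print(first_or_none([]))
```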

In [3]:
# Define function for job search page scraping
def job_scrape(job, location, res_per_page=20):

    scraped = {'searched_job' : [],
               'searched_city' : [],
               'job_title' : [],
               'advertiser' : [],
               'advertiser_rating' : [],
               'date_posted' : [],
               'salary' : [],
               'work_type' : [],
               'category' : [],
               'subcategory' : [],
               'job_description' : [],
               'url' : []
              }

    url = 'https://www.seek.com.au/' + job + '-jobs/in-' + location
    result = requests.get(url)
    x_selector = Selector(text=result.text)
    # Number of result pages = total job count / results per page, rounded up
    num_results = np.ceil(int(x_selector.xpath('//*/*[@*="totalJobsCount"]/text()').extract()[0].replace(',','')) / res_per_page)
    num_results = int(num_results)
    
    for i in range(1, num_results+1):
        url = 'https://www.seek.com.au/' + job + '-jobs/in-' + location + '?page=' + str(i)
        searches = requests.get(url)
        x_selector = Selector(text=searches.text)
        h_refs = x_selector.xpath('//*/*[@*="searchResults"]/div/div/div/article/span/span/h1/a/@href').extract()
        
        href_count = 1
        for h_ref in h_refs:
            url = 'https://www.seek.com.au' + h_ref
            print(url)
            job_title, advertiser, rating, posted_date, salary, work_type, category, subcategory, body = individual_page(url)
            scraped['searched_job'].append(job)
            scraped['searched_city'].append(location)
            scraped['job_title'].append(job_title)
            scraped['advertiser'].append(advertiser)
            scraped['advertiser_rating'].append(rating)
            scraped['date_posted'].append(posted_date)
            scraped['salary'].append(salary)
            scraped['work_type'].append(work_type)
            scraped['category'].append(category)
            scraped['subcategory'].append(subcategory)
            scraped['job_description'].append(body)
            scraped['url'].append(url)
            
            href_count += 1
            sleep(1)
            
    return pd.DataFrame(scraped)
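
The paging logic in `job_scrape` can be checked in isolation, without touching the network. The sketch below is a minimal illustration under stated assumptions: `search_page_urls` is a hypothetical helper, `total_jobs` stands for the count read from the first results page, and the default of 20 results per page mirrors the notebook's `res_per_page` default.

```python
import math


def search_page_urls(job, location, total_jobs, per_page=20):
    """Build the URL of every search-result page for one job/location pair.

    Hypothetical helper: `total_jobs` is the count read from the first
    results page, and `per_page=20` mirrors the notebook's default.
    """
    # Round up so that a partially filled last page is still included
    n_pages = math.ceil(total_jobs / per_page)
    base = 'https://www.seek.com.au/' + job + '-jobs/in-' + location
    return [base + '?page=' + str(page) for page in range(1, n_pages + 1)]


# Illustrative count only: 101 jobs need six pages of 20
urls = search_page_urls('data-scientist', 'All-Sydney-NSW', total_jobs=101)
print(len(urls))
print(urls[0])
print(urls[-1])
```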

### 2.2 Scraping the web <a name="scrape"></a>
Now that the functions are defined, let us run them and get the data.


**WARNING: Takes a very long time, do not run unless necessary!!**
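
As a rough guide to how long the scrape takes, the sketch below computes a lower bound from the one-second pause that `job_scrape` makes after each advertisement. It is only an estimate under stated assumptions: `min_scrape_minutes` is a hypothetical helper, request and parsing time are ignored, and the counts are the approximate Data Analyst and Data Scientist result numbers quoted later in this notebook, assumed here to be totals.

```python
def min_scrape_minutes(n_adverts, delay_seconds=1):
    """Lower bound on runtime in minutes: only the pause after each advert is counted."""
    return n_adverts * delay_seconds / 60


# Approximate result counts (about 4,000 Data Analyst and 100 Data Scientist adverts)
print(round(min_scrape_minutes(4000 + 100), 1))
```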

In [None]:
# WARNING: Takes a very long time, do not run unless necessary!!

# Searching for Data Science and Data Analyst jobs in Sydney and Melbourne
jobs = pd.DataFrame({'searched_job' : [],
               'searched_city' : [],
               'job_title' : [],
               'advertiser' : [],
               'advertiser_rating' : [],
               'date_posted' : [],
               'salary' : [],
               'work_type' : [],
               'category' : [],
               'subcategory' : [],
               'job_description' : [],
               'url' : []
              })

for loc in ['All-Sydney-NSW', 'All-Melbourne-VIC']:
    for job in ['data-scientist', 'data-analyst']:
        these_jobs = job_scrape(job, loc)
        jobs = pd.concat([jobs, these_jobs], axis=0)
    
# Saving original scraped data to csv file:
jobs.to_csv('./datasets/seekjobs.csv')

The code below scrapes only a subset of the searches and appends the results to the original data. I ran it several times to try to even out the imbalance between search keywords (e.g. Data Analyst returns about 4,000 results, while Data Scientist returns only about 100).

In [11]:
# Saving subsequent Data Science jobs appended to original file
jobs = pd.DataFrame({'searched_job' : [],
               'searched_city' : [],
               'job_title' : [],
               'advertiser' : [],
               'advertiser_rating' : [],
               'date_posted' : [],
               'salary' : [],
               'work_type' : [],
               'category' : [],
               'subcategory' : [],
               'job_description' : [],
               'url' : []
              })

for loc in ['All-Sydney-NSW', 'All-Melbourne-VIC']:             # , 'All-Sydney-NSW', 'All-Melbourne-VIC'
    for job in ['data-engineer']:                              # , 'data-analyst', 'data-scientist'
        these_jobs = job_scrape(job, loc)
        jobs = pd.concat([jobs, these_jobs], axis=0)

cols_for_dupes = ['searched_city', 'job_title', 'advertiser', 'advertiser_rating', 
                  'salary', 'work_type', 'job_description']
jobs = jobs.drop_duplicates(subset=cols_for_dupes, keep='first')    

jobs.to_csv('./datasets/seekjobs.csv', header=False, mode='a')
print('Number of jobs :', jobs.shape[0])
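
To make the duplicate check concrete, here is a plain-Python sketch of what dropping duplicates on a subset of columns with `keep='first'` does. It uses a hypothetical helper and made-up rows rather than pandas. Note that in the cell above the check covers only the newly scraped batch, so rows already in `seekjobs.csv` from earlier runs are not compared.

```python
def drop_duplicate_rows(rows, key_cols):
    """Keep the first row for each distinct combination of `key_cols`."""
    seen = set()
    unique = []
    for row in rows:
        # Only the chosen columns decide whether two rows count as duplicates
        key = tuple(row[col] for col in key_cols)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows for illustration only
rows = [
    {'job_title': 'Data Engineer', 'advertiser': 'A', 'url': '/job/1'},
    {'job_title': 'Data Engineer', 'advertiser': 'A', 'url': '/job/2'},
    {'job_title': 'Data Engineer', 'advertiser': 'B', 'url': '/job/3'},
]
unique = drop_duplicate_rows(rows, ['job_title', 'advertiser'])
print([row['url'] for row in unique])
```

Leaving `url` out of the key columns, as the notebook does, means the same advertisement reached through two different links still counts as one job.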

## 3.0 Next Steps <a name="next"></a>
We have our data! In the next notebook (Notebook 02) I will go over the cleaning and exploratory analysis process.

__________________________________________________________________
**-- NOTEBOOK 01 END --**