Web scraping the Indeed website for data science jobs, based on Greg Reda's [excellent tutorial](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/) and Jesse Steinweg's [excellent analysis](https://jessesw.com/Data-Science-Skills/)

## 1) Admin and Setup

I've already created a conda virtual environment with bs4 installed. Feel free to use my environment.yaml to create a similar environment. I'll update it as I go.

In [97]:
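from bs4 import BeautifulSoup
from urllib2 import urlopen  # Python 2; on Python 3 urlopen lives in urllib.request
import pandas as pd
import re  # used further down by re.findall to extract the job count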

## 2) What do I want to achieve

http://au.indeed.com/jobs?as_and=&as_phr=&as_any=%22customer+analytics%22+%22data+analysis%22&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=50&l=&fromage=any&limit=10&sort=&psf=advsrch

OR

http://au.indeed.com/jobs?as_and=&as_phr=&as_any=%22customer+analytics%22+%22data+analysis%22&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=0&l=Sydney+NSW&fromage=last&limit=10&sort=&psf=advsrch

That is, the results of either search above, in a nice tabular format for data exploration.

## 3) String together webpage url based on different parameters

Build each search parameter (search query, city, salary, etc.) as a separate piece, then string the pieces together into one final query:

final_query = base_url + job_query_string + company_name + salary + location + fromage

#### Base URL

In [47]:
# start of the url - this will not change because the search query is matched
# anywhere in the job ad, not just in the job title.
base_url = 'http://au.indeed.com/jobs?as_and=&as_phr=&as_any='

#### Job query string: inputs are stored in job_query1, job_query2 and job_query3 (restricted to 3 queries)

Step1 - collect 3 search queries<br/>
When this is refactored, these should become function parameters

In [48]:
job_query1 = 'data scientist'
job_query2 = 'customer analytics'
job_query3 = 'data analysis'

Step2 - create query string in the required format

In [49]:
# 1) within the search string, spaces are replaced by '+' in the URL
# 2) each search query is wrapped in "%22" (a URL-encoded double quote)
# 3) the quoted queries are joined into one string separated by "+"
job_query_string = []
job_query_string.append("%22" + job_query1.replace(" ","+") + "%22")
job_query_string.append("%22" + job_query2.replace(" ","+") + "%22")
job_query_string.append("%22" + job_query3.replace(" ","+") + "%22")
job_query_string = "+".join(job_query_string)
job_query_string

'%22data+scientist%22+%22customer+analytics%22+%22data+analysis%22'
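The three `append` calls above can be folded into one small function that takes any number of queries, which is one possible way to do the refactor into parameters mentioned in Step1. This is only a sketch: `build_job_query_string` is a hypothetical name and is not defined anywhere else in this notebook.

```python
def build_job_query_string(queries):
    """Wrap each query in %22, replace spaces with '+', and join with '+'."""
    quoted = ["%22" + q.replace(" ", "+") + "%22" for q in queries]
    return "+".join(quoted)


job_query_string = build_job_query_string(
    ["data scientist", "customer analytics", "data analysis"]
)
print(job_query_string)
```

Passing a list also removes the need for numbered variables, so the limit of three queries becomes a choice rather than a constraint of the code.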

#### Company name, salary, location, fromage - not working on these right now

In [84]:
company_name=''
salary=''
location=''
fromage='any'

#### Create final query

In [85]:
final_query = [base_url,job_query_string,'&as_not=&as_ttl=&as_cmp=',company_name,
               '&jt=all&st=&salary=',salary,'&radius=50&l=',location,
               '&fromage=',fromage,'&limit=10&sort=&psf=advsrch']
final_query = "".join(final_query)
final_query

'http://au.indeed.com/jobs?as_and=&as_phr=&as_any=%22data+scientist%22+%22customer+analytics%22+%22data+analysis%22&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=50&l=&fromage=any&limit=10&sort=&psf=advsrch'
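As an alternative to joining fragments by hand, the standard library's `urlencode` can do the escaping (a double quote becomes `%22`, a space becomes `+`). The sketch below is an assumption about how the refactor could look, not the method used in this notebook: `build_final_query` and its keyword arguments are hypothetical, and the parameter names are simply copied from the URL above.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_final_query(queries, company_name="", salary="", location="",
                      fromage="any", radius=50, limit=10):
    # Wrap each query in double quotes; urlencode escapes '"' as %22 and ' ' as '+'
    as_any = " ".join('"' + q + '"' for q in queries)
    # A list of pairs (rather than a dict) keeps the parameter order of the URL above
    params = [
        ("as_and", ""), ("as_phr", ""), ("as_any", as_any),
        ("as_not", ""), ("as_ttl", ""), ("as_cmp", company_name),
        ("jt", "all"), ("st", ""), ("salary", salary),
        ("radius", radius), ("l", location), ("fromage", fromage),
        ("limit", limit), ("sort", ""), ("psf", "advsrch"),
    ]
    return "http://au.indeed.com/jobs?" + urlencode(params)


print(build_final_query(["data scientist", "customer analytics", "data analysis"]))
```

With the defaults this should produce the same string as `final_query` above, and values such as a location containing spaces are escaped without extra `replace` calls.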

## 4) Open website and read it

Step1: Open the first page <br/>
Step2: Get the html of the first page

In [87]:
html = urlopen(final_query).read()  
soup = BeautifulSoup(html, "lxml")  
soup

<!DOCTYPE html>\n<html>\n<head>\n<meta content="text/html;charset=unicode-escape" http-equiv="content-type"/>\n<!-- pll --><script src="/s/567da8a/en_AU.js" type="text/javascript"></script>\n<link href="/s/a6a334e/jobsearch_all.css" rel="stylesheet" type="text/css"/>\n<link href="http://au.indeed.com/rss?q=%28%22data+scientist%22+or+%22customer+analytics%22+or+%22data+analysis%22%29" rel="alternate" title="Data Scientist Customer Analytics Data Analysis Jobs" type="application/rss+xml"/>\n<link href="/m/jobs?q=%28%22data+scientist%22+or+%22customer+analytics%22+or+%22data+analysis%22%29" media="handheld" rel="alternate"/>\n<script type="text/javascript">\n    \n    window['closureReadyCallbacks'] = [];\n\n    function call_when_jsall_loaded(cb) {\n        if (window['closureReady']) {\n            cb();\n        } else {\n            window['closureReadyCallbacks'].push(cb);\n        }\n    }\n</script>\n<script src="/s/37c105f/jobsearch-all-compiled.js" type="text/javascript"></script

Step3: find out how many jobs the search query returned

In [99]:
number_of_jobs_page_area = soup.find(id="searchCount").string.encode('utf-8')
number_of_jobs_page_area

'Jobs 1 to 10 of 985'

In [100]:
number_of_jobs = re.findall(r'\d+', number_of_jobs_page_area)
number_of_jobs

['1', '10', '985']

In [104]:
total_number_of_jobs = int(number_of_jobs[2])
total_number_of_jobs

985
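The three cells above can be combined into one helper. The sketch below takes the last number in the text rather than index 2, and it also tolerates a thousands separator such as `1,985`. That separator is an assumption, since the only output seen here is `'Jobs 1 to 10 of 985'`, and `parse_total_jobs` is a hypothetical name.

```python
import re


def parse_total_jobs(search_count_text):
    """Return the last number in text such as 'Jobs 1 to 10 of 985'."""
    # \d[\d,]* keeps a figure like '1,985' together as a single match
    numbers = re.findall(r"\d[\d,]*", search_count_text)
    if not numbers:
        raise ValueError("no number found in %r" % search_count_text)
    # Drop any commas before converting to an integer
    return int(numbers[-1].replace(",", ""))


print(parse_total_jobs("Jobs 1 to 10 of 985"))
```

Raising an error on text without digits makes a changed page layout fail loudly instead of returning a wrong count.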

Step4: calculate how many pages to scroll

In [111]:
# return the number of result pages to scroll through, at 10 jobs per page
def return_number_of_scrolls(total_number_of_jobs):
    # // is integer division in both Python 2 and Python 3
    if total_number_of_jobs % 10 == 0:
        return total_number_of_jobs // 10
    else:
        return total_number_of_jobs // 10 + 1

In [110]:
number_of_pages_to_scroll = return_number_of_scrolls(total_number_of_jobs)
number_of_pages_to_scroll

99
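The page count is a ceiling division, so the if/else can be written as a single expression. The sketch below shows that form plus the index of the first job on each page. Both function names are hypothetical, and how Indeed's URL accepts a page offset is not covered in this notebook, so the offsets are only an illustration of the arithmetic.

```python
def number_of_pages(total_jobs, jobs_per_page=10):
    """Ceiling division: pages needed to show total_jobs results."""
    return (total_jobs + jobs_per_page - 1) // jobs_per_page


def page_offsets(total_jobs, jobs_per_page=10):
    """Index of the first job on each page: 0, 10, 20, ..."""
    return list(range(0, total_jobs, jobs_per_page))


print(number_of_pages(985))
print(page_offsets(25))
```

For 985 jobs at 10 per page this gives 98 full pages plus one page with the remaining 5 jobs, matching the 99 above.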

## 5) Store all the deets into pd.df