Web scraping the Indeed website for data science jobs, based on Greg Reda's [excellent tutorial](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/), Jesse Steinweg's [excellent analysis](https://jessesw.com/Data-Science-Skills/) and Sung Pil Moon's [awesome analysis](http://blog.nycdatascience.com/students-work/project-3-web-scraping-company-data-from-indeed-com-and-dice-com/).

## 1) Admin and Setup

I've already created a conda virtual environment with bs4 installed. Feel free to use my environment.yaml to create a similar environment. I'll update it as I go.

In [19]:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import pandas as pd
import re
import numpy as np

## 2) What do I want to achieve

I want to take the results of a search such as

http://au.indeed.com/jobs?as_and=&as_phr=&as_any=%22customer+analytics%22+%22data+analysis%22&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=50&l=&fromage=any&limit=10&sort=&psf=advsrch

OR

http://au.indeed.com/jobs?as_and=&as_phr=&as_any=%22customer+analytics%22+%22data+analysis%22&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=0&l=Sydney+NSW&fromage=last&limit=10&sort=&psf=advsrch

and put them in a nice tabular format for data exploration.

## 3) String together webpage url based on different parameters

Build the logic that takes the different search parameters (search query, city, salary, etc.) as separate values and strings them together into one final query:

final_query = base_url + job_query_string + company_name + salary + location + fromage

#### Base URL

In [20]:
# start of the url - this stays fixed because the query searches anywhere
# in the job ad, not just the job title.
base_url = 'http://au.indeed.com/jobs?as_and=&as_phr=&as_any='

#### Job query string: inputs are stored in job_query1, job_query2 and job_query3 (restricted to 3 parameters)

Step1 - collect 3 search queries<br/>
Note for refactoring: convert these into function parameters.

In [21]:
job_query1 = 'data scientist'
job_query2 = 'customer analytics'
job_query3 = 'data analysis'

Step2 - create query string in the required format

In [22]:
# 1) within each search string, spaces are replaced by '+' in the URL
# 2) each search query is wrapped in "%22" (the URL-encoded double quote)
# 3) the elements of the list are joined into one string separated by "+"
job_query_string = []
job_query_string.append("%22" + job_query1.replace(" ","+") + "%22")
job_query_string.append("%22" + job_query2.replace(" ","+") + "%22")
job_query_string.append("%22" + job_query3.replace(" ","+") + "%22")
job_query_string = "+".join(job_query_string)
job_query_string

'%22data+scientist%22+%22customer+analytics%22+%22data+analysis%22'
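The three `append` calls above can be generalised to any number of search terms. This is a minimal sketch, assuming Python 3 (the notebook itself runs on Python 2, where `quote_plus` lives in `urllib`). `build_job_query_string` is a hypothetical helper name, not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_job_query_string(queries):
    """Wrap each search term in double quotes, URL-encode it, and join with '+'."""
    # quote_plus turns spaces into '+' and the double quote into '%22'
    return "+".join(quote_plus('"%s"' % q) for q in queries)


job_query_string = build_job_query_string(
    ["data scientist", "customer analytics", "data analysis"])
print(job_query_string)
```

For these three terms it should produce the same string as the cell above, while also accepting one term or five.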

#### Company name, salary, location, fromage - not working on these right now

In [23]:
company_name=''
salary=''
location=''
fromage='any'

#### Create final query

In [24]:
final_query = [base_url,job_query_string,'&as_not=&as_ttl=&as_cmp=',company_name,
               '&jt=all&st=&salary=',salary,'&radius=50&l=',location,
               '&fromage=',fromage,'&limit=10&sort=&psf=advsrch']
final_query = "".join(final_query)
final_query

'http://au.indeed.com/jobs?as_and=&as_phr=&as_any=%22data+scientist%22+%22customer+analytics%22+%22data+analysis%22&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=50&l=&fromage=any&limit=10&sort=&psf=advsrch'
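As an alternative to joining URL fragments by hand, the standard library can encode the parameters. The sketch below assumes Python 3 (`urllib.parse`); `build_search_url` and its keyword arguments are hypothetical names chosen to mirror the variables above, and the fixed values (`radius=50`, `limit=10`, and so on) are copied from the hand-built URL.

```python
from urllib.parse import urlencode


def build_search_url(queries, company_name="", salary="", location="", fromage="any"):
    """Assemble the advanced-search URL from named parameters."""
    # a list of pairs keeps the parameters in the same order as the hand-built URL
    params = [
        ("as_and", ""), ("as_phr", ""),
        ("as_any", " ".join('"%s"' % q for q in queries)),
        ("as_not", ""), ("as_ttl", ""), ("as_cmp", company_name),
        ("jt", "all"), ("st", ""), ("salary", salary),
        ("radius", "50"), ("l", location), ("fromage", fromage),
        ("limit", "10"), ("sort", ""), ("psf", "advsrch"),
    ]
    # urlencode applies quote_plus to every value: ' ' becomes '+', '"' becomes '%22'
    return "http://au.indeed.com/jobs?" + urlencode(params)


print(build_search_url(["data scientist", "customer analytics", "data analysis"]))
```

A side benefit is that values such as `location="Sydney NSW"` get encoded without any manual `replace` calls.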

## 4) Open website and read it

Step1: Open the first page <br/>
Step2: Get the html of the first page

In [25]:
html = urlopen(final_query).read()  
soup = BeautifulSoup(html, "lxml")

Step3: find out how many jobs the search query returned

In [26]:
number_of_jobs_page_area = soup.find(id="searchCount").string.encode('utf-8')
number_of_jobs_page_area

'Jobs 1 to 10 of 966'

In [27]:
number_of_jobs = re.findall(r'\d+', number_of_jobs_page_area)
number_of_jobs

['1', '10', '966']

In [28]:
total_number_of_jobs = int(number_of_jobs[2])
total_number_of_jobs

966
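One caveat with taking the third number: if the total were ever rendered with a thousands separator (an assumption, although the review counts on the same page are shown as `1,276 reviews`), `\d+` would split `1,276` into `1` and `276`, and index 2 would return 1. A minimal sketch of a more defensive parser follows; `parse_total_jobs` is a hypothetical helper name.

```python
import re


def parse_total_jobs(search_count_text):
    """Return the total from text such as 'Jobs 1 to 10 of 966' as an int."""
    # take the number after 'of', allowing thousands separators like '1,276'
    match = re.search(r"of\s+([\d,]+)", search_count_text)
    if match is None:
        raise ValueError("unexpected searchCount text: %r" % search_count_text)
    return int(match.group(1).replace(",", ""))


print(parse_total_jobs("Jobs 1 to 10 of 966"))
```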

Step4: calculate how many pages to scroll

In [29]:
#round up the # of records divided by 10 as the number of pages in order to ensure coverage. 
number_of_pages_to_scroll = np.ceil(total_number_of_jobs/10.0)
number_of_pages_to_scroll

97.0
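`np.ceil` returns a float (`97.0`), which would need an `int()` call before it can be passed to `range`. A small sketch of the same rounding-up done in integer arithmetic; `pages_to_scroll` is a hypothetical helper name, and 10 results per page matches `limit=10` in the query.

```python
def pages_to_scroll(total_jobs, jobs_per_page=10):
    """Ceiling division that stays in integers."""
    # negating twice turns floor division into ceiling division
    return -(-total_jobs // jobs_per_page)


print(pages_to_scroll(966))
```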

## 5) Load all the deets into separate lists

In [30]:
# get all the deets of each row. One row pertains to one job.
# Incidentally, I noticed that the class "row result" was only picking up 9 results on the first page.
# This was because the last row sits in another class, 'lastRow row result'.

targetElements = soup.findAll('div', attrs = {'class' : ' row result'})
targetElements.extend(soup.findAll('div', attrs = {'class' : 'lastRow row result'}))

In [31]:
type(targetElements)

bs4.element.ResultSet

In [32]:
for elem in targetElements:
    #print elem # commenting out this line so as to remove metadata from the notebook
    print "\n"
    print "*****************************************************"
    print "\n"



*****************************************************

[... separator printed once per result, 10 in total ...]




#### Job Title

In [33]:
jobtitle = []
for elem in targetElements:
    #print elem.find('a', attrs = {'class':'turnstileLink'}).attrs['title']
    jobtitle.append(elem.find('a', attrs = {'class':'turnstileLink'}).attrs['title'])

jobtitle

['Data Analyst - 4 Days A Week -',
 'Reporting Data Analyst',
 u'Test Analyst \u2013 Business Intelligence & Reporting',
 'Data Analyst Road',
 'Business Intelligence Analyst',
 'Policy Officer / Data Analyst',
 'Administration Support Officer - Melbourne - Victoria',
 'Data Analyst - Health Data',
 'Business Analyst - Market Data',
 'Insight Analytics Manager']

#### Company Name

In [18]:
companyname = []
for elem in targetElements:
    if elem.find('span', attrs = {'itemprop':'name'}) is None:
        companyname.append(None)
    else:
        companyname.append(elem.find('span', attrs = {'itemprop':'name'})
                                     .getText().strip().encode('utf-8'))    
companyname

[None,
 None,
 None,
 'NSW Government',
 'Ashdown Consulting',
 'Macquarie Group Limited',
 'Australian Institute of Family Studies',
 'Cancer Council Victoria',
 'Daimler AG',
 'Unisys']
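The "find, check for None, then extract the text" pattern in the company name cell recurs for most of the other fields in this notebook. A minimal sketch of a helper that wraps it is shown here; `text_or_none` is a hypothetical name, and it only assumes that `elem` offers bs4's `find` and that the result offers `get_text` (the documented spelling of `getText`). It leaves out the `.encode('utf-8')` step, which is a Python 2 concern.

```python
def text_or_none(elem, tag, attrs=None):
    """Return the stripped text of the first matching child tag, or None if absent."""
    found = elem.find(tag, attrs=attrs or {})
    if found is None:
        return None
    return found.get_text().strip()


# usage with the ResultSet from the notebook:
# companyname = [text_or_none(e, 'span', {'itemprop': 'name'}) for e in targetElements]
```

Calling `find` once per element also avoids searching the same row twice.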

#### Location

In [34]:
location = []
for elem in targetElements:
    location.append(elem.find('span', attrs = {'itemprop':'addressLocality'}).getText().strip().encode('utf-8'))

location

['Melbourne VIC',
 'Sydney NSW',
 'Brisbane QLD',
 'Sydney NSW',
 'Sydney NSW',
 'Sydney NSW',
 'Melbourne VIC',
 'North Sydney NSW',
 'Sydney NSW',
 'Melbourne VIC']

#### Summary

In [35]:
summary = []
for elem in targetElements:
    summary.append(elem.find('span', attrs = {'class':'summary'}).getText().strip().encode('utf-8'))
    
summary

['Performing complex data analysis to uncover trends. Critically evaluate data. Collect data through multiple channels;...',
 'If you are successful you will be responsible for dashboarding and reporting with SSRS and data analysis. Due to new projects in the organisation a role is...',
 'Good understanding of Data Warehouse & reporting concepts, with strong Data Analysis skills. Hands on testing experience with business intelligence / data...',
 'We are looking for a Data Analyst who has knowledge and experience to undertake project management change requirements and operational data analysis....',
 'Data visualisation and analysis of large data sets. Obtained an understanding of data analysis through experience gained in a reporting, analytics or insights...',
 'The ideal candidate will have proven skills in data collection and analysis, as well as working with statistics....',
 'Assist with month end ad-hoc reports and data analysis. It is essential that you have a professional cust

#### Company Rating

In [36]:
## can't seem to get this to work for some reason
## (attrs must be a plain dict; a list-wrapped dict is read by bs4 as a class-name search and never matches)
company_rating = []
for elem in targetElements:
    if elem.find('span', attrs = {'class':'ratingNumber'}) is None:
        company_rating.append(None)
    else:
        company_rating.append(elem.find('span', attrs = {'class':'ratingNumber'})
                                     .getText().strip().encode('utf-8'))
    
company_rating

[None, None, None, None, None, None, None, None, None, None]

#### Company rating - Number of Reviews

In [37]:
company_rating_counts = []
for elem in targetElements:
    if elem.find('span', attrs = {'class':'slNoUnderline'}) is None:
        company_rating_counts.append(None)
    else:
        company_rating_counts.append(elem.find('span', attrs = {'class':'slNoUnderline'})
                                     .getText().strip().encode('utf-8'))

company_rating_counts

['99 reviews',
 None,
 '9 reviews',
 '5 reviews',
 '17 reviews',
 None,
 '1,276 reviews',
 None,
 '17 reviews',
 '2,351 reviews']

#### Advertised number of days ago

In [38]:
advertised_number_of_days_ago = []
for elem in targetElements:
    if elem.find('span', attrs = {'class':'date'}) is None:
        advertised_number_of_days_ago.append(None)
    else:
        advertised_number_of_days_ago.append(elem.find('span', attrs = {'class':'date'})
                                     .getText().strip().encode('utf-8'))

advertised_number_of_days_ago

['11 days ago',
 '23 hours ago',
 '20 hours ago',
 '7 days ago',
 '16 days ago',
 '1 day ago',
 '26 minutes ago',
 '10 days ago',
 '13 days ago',
 '30+ days ago']
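For data exploration these strings are easier to work with as durations. The sketch below converts the formats seen in the output above (minutes, hours, days, and the open-ended `30+ days ago`) into `timedelta` objects. `parse_posted_age` is a hypothetical helper; treating `30+` as exactly 30 days is a simplifying assumption, and any other wording returns None.

```python
import re
from datetime import timedelta

# map the singular unit captured by the regex to the timedelta keyword
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days"}


def parse_posted_age(text):
    """Convert text like '11 days ago' to a timedelta; None if unrecognised."""
    if text is None:
        return None
    match = re.match(r"(\d+)\+?\s+(minute|hour|day)s?\s+ago", text.strip())
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNITS[unit]: amount})


for sample in ["11 days ago", "23 hours ago", "26 minutes ago", "30+ days ago"]:
    print(sample, "->", parse_posted_age(sample))
```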

#### Salary

In [39]:
salary = []
for elem in targetElements:
    if elem.find('nobr') is None:
        salary.append(None)
    else:
        salary.append(elem.find('nobr').getText().strip().encode('utf-8'))

salary

[None,
 None,
 None,
 None,
 None,
 '$60,000 - $70,000 a year',
 None,
 None,
 None,
 None]
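The salary text can likewise be split into numbers before analysis. This is a sketch only: the single format observed above is `$60,000 - $70,000 a year`, so support for a lone amount or for other periods such as `an hour` is an assumption. `parse_salary` is a hypothetical helper name.

```python
import re


def parse_salary(text):
    """Split '$60,000 - $70,000 a year' into (low, high, period); None if no amount."""
    if text is None:
        return None
    # every dollar amount, with thousands separators removed
    amounts = [float(a.replace(",", ""))
               for a in re.findall(r"\$([\d,]+(?:\.\d+)?)", text)]
    if not amounts:
        return None
    # the trailing 'a year' / 'an hour' part, if present
    period_match = re.search(r"\ban?\s+(\w+)\s*$", text.strip())
    period = period_match.group(1) if period_match else None
    return (min(amounts), max(amounts), period)


print(parse_salary("$60,000 - $70,000 a year"))
```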

#### Job Link

In [40]:
joblink = []
home_url = 'http://www.indeed.com'

for elem in targetElements:
    joblink.append("%s%s" % (home_url, elem.find('a').get('href')))

joblink

['http://www.indeed.com/rc/clk?jk=0b4d12879ab9039a&fccid=77087bd1709a8148',
 'http://www.indeed.com/rc/clk?jk=a1cddefcdf84358a&fccid=ce783271e420d14a',
 'http://www.indeed.com/rc/clk?jk=43e8543e972f1260&fccid=14616a0554eb52c0',
 'http://www.indeed.com/rc/clk?jk=8e6637871e0d5914&fccid=e60642a4b38e634f',
 'http://www.indeed.com/rc/clk?jk=d0b72c272713c8a4&fccid=ca2285c3548c3efe',
 'http://www.indeed.com/rc/clk?jk=728fe7507d385b30&fccid=ce783271e420d14a',
 'http://www.indeed.com/rc/clk?jk=141c9045356a8de3&fccid=8177021cdb259d9d',
 'http://www.indeed.com/rc/clk?jk=4fc279281b3db1aa&fccid=ce783271e420d14a',
 'http://www.indeed.com/rc/clk?jk=19358d5471cc40cc&fccid=ca2285c3548c3efe',
 'http://www.indeed.com/rc/clk?jk=29209de19e74c885&fccid=5e964c4afc56b180']
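Plain concatenation works as long as every `href` starts with `/`, as it does in the output above. A small sketch using `urljoin`, which also copes with an `href` that is already absolute, assuming Python 3 (`urllib.parse`); `absolute_link` is a hypothetical helper name.

```python
from urllib.parse import urljoin

home_url = 'http://www.indeed.com'


def absolute_link(href, base=home_url):
    """Resolve an href against the site root; absolute hrefs pass through unchanged."""
    return urljoin(base, href)


print(absolute_link('/rc/clk?jk=0b4d12879ab9039a&fccid=77087bd1709a8148'))
```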

## Create a dataframe based on information collected

In [41]:
df_columns=['query_date','jobtitle','companyname','location',
             'advertised_number_of_days_ago','company_rating',
             'company_rating_counts','salary','summary',
             'joblink','job_query_string']

df_joblist = pd.DataFrame({'query_date':pd.to_datetime('today'),
                                'jobtitle':jobtitle,
                                'companyname':companyname,
                                'location':location,
                                'advertised_number_of_days_ago':advertised_number_of_days_ago,
                                'company_rating':company_rating,
                                'company_rating_counts':company_rating_counts,
                                'salary':salary,
                                'summary':summary,
                                'joblink':joblink,
                                'job_query_string':job_query_string},
                         columns = df_columns)

df_joblist.head()

| | query_date | jobtitle | companyname | location | advertised_number_of_days_ago | company_rating | company_rating_counts | salary | summary | joblink | job_query_string |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-06-17 | Data Analyst - 4 Days A Week - | | Melbourne VIC | 11 days ago | | 99 reviews | | Performing complex data analysis to uncover tr... | http://www.indeed.com/rc/clk?jk=0b4d12879ab903... | %22data+scientist%22+%22customer+analytics%22+... |
| 1 | 2016-06-17 | Reporting Data Analyst | | Sydney NSW | 23 hours ago | | | | If you are successful you will be responsible ... | http://www.indeed.com/rc/clk?jk=a1cddefcdf8435... | %22data+scientist%22+%22customer+analytics%22+... |
| 2 | 2016-06-17 | Test Analyst – Business Intelligence & Reporting | | Brisbane QLD | 20 hours ago | | 9 reviews | | Good understanding of Data Warehouse & reporti... | http://www.indeed.com/rc/clk?jk=43e8543e972f12... | %22data+scientist%22+%22customer+analytics%22+... |
| 3 | 2016-06-17 | Data Analyst Road | NSW Government | Sydney NSW | 7 days ago | | 5 reviews | | We are looking for a Data Analyst who has know... | http://www.indeed.com/rc/clk?jk=8e6637871e0d59... | %22data+scientist%22+%22customer+analytics%22+... |
| 4 | 2016-06-17 | Business Intelligence Analyst | Ashdown Consulting | Sydney NSW | 16 days ago | | 17 reviews | | Data visualisation and analysis of large data ... | http://www.indeed.com/rc/clk?jk=d0b72c272713c8... | %22data+scientist%22+%22customer+analytics%22+... |


## Use the information collected above to create a function for data collection

In [64]:
def get_job_board_data(job_query1,job_query2,job_query3):
    #STEP1
    # create query string in the required format
    # 1) within each search string, spaces are replaced by '+' in the URL
    # 2) each search query is wrapped in "%22" (the URL-encoded double quote)
    # 3) the elements of the list are joined into one string separated by "+"
    job_query_string = []
    job_query_string.append("%22" + job_query1.replace(" ","+") + "%22")
    job_query_string.append("%22" + job_query2.replace(" ","+") + "%22")
    job_query_string.append("%22" + job_query3.replace(" ","+") + "%22")
    job_query_string = "+".join(job_query_string)
    
    #STEP2
    # start of the url - this stays fixed because the query searches anywhere
    # in the job ad, not just the job title.
    base_url = 'http://au.indeed.com/jobs?as_and=&as_phr=&as_any='
    
    #STEP3
    #company name, salary, location and fromage are not being worked on currently
    company_name=''
    salary=''
    location=''
    fromage='any'
    
    #STEP4
    #create final query
    final_query = [base_url,job_query_string,'&as_not=&as_ttl=&as_cmp=',company_name,
               '&jt=all&st=&salary=',salary,'&radius=50&l=',location,
               '&fromage=',fromage,'&limit=10&sort=&psf=advsrch']
    final_query = "".join(final_query)
    
    #STEP5
    #open website and read it
    html = urlopen(final_query).read()  
    soup = BeautifulSoup(html, "lxml")
    
    #STEP6
    #determine total number of jobs and consequently, number of pages to scroll
    number_of_jobs_page_area = soup.find(id="searchCount").string.encode('utf-8')
    number_of_jobs = re.findall(r'\d+', number_of_jobs_page_area)
    total_number_of_jobs = int(number_of_jobs[2])
    number_of_pages_to_scroll = np.ceil(total_number_of_jobs/10.0)
    
    return number_of_pages_to_scroll

In [65]:
get_job_board_data('data scientist','customer analytics','data analysis')

97.0

Next steps: create a step that loads the details into lists and appends them to a dataframe. Then create the logic for looping through the required number of pages, thereby building a solid dataset.
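One way the page loop could be driven is by generating one URL per results page. The sketch below rests on a stated assumption: that the site pages through results with a `start` offset parameter (0, 10, 20, ...), which is not confirmed anywhere in this notebook and should be checked against the "next page" links. `page_urls` is a hypothetical helper name.

```python
def page_urls(first_page_url, total_jobs, jobs_per_page=10):
    """Yield one URL per results page by appending an offset to the first-page URL."""
    for offset in range(0, total_jobs, jobs_per_page):
        # ASSUMPTION: the site accepts a 'start' offset parameter for paging
        yield "%s&start=%d" % (first_page_url, offset)


for url in page_urls("http://au.indeed.com/jobs?as_any=%22data+analysis%22", 25):
    print(url)
```

Each URL could then be opened and parsed exactly like the first page, with the per-page lists appended to the dataframe.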