This repository has been archived by the owner on Mar 7, 2018. It is now read-only.

williamxie11/glassdoor-interview-scraper


glassdoor-interview-scraper

Glassdoor web crawler and scraper providing interview experience data for Decoding The Interview.

This was an academic project for CS 410 - Text and Information Systems at UIUC and is no longer maintained.

Installation

  • Python 2.7.*
  • Beautiful Soup 4 (4.4.1)
$ pip install bs4
  • Selenium Webdriver
$ pip install selenium

Usage

  1. Open the scraper Python script with a text editor of your choice.
  2. Add your Glassdoor account username and password.
  3. Specify the number of pages, the company name, and the URL of the company's interviews page on Glassdoor, with your desired filters already selected.
  4. Run the scraper:
$ python scraper_v1.2.py

NOTE: Glassdoor will require you to complete a CAPTCHA on login or during the scraping process. If a CAPTCHA appears during scraping, the script polls until it has been entered.
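The polling behaviour described in the note can be sketched roughly as follows. This is a minimal illustration, not the code in the scraper script: the function name `wait_until`, its parameters, and the default interval are all assumptions made for the example.

```python
import time


def wait_until(condition, poll_interval=1.0, timeout=None, sleep=time.sleep):
    """Poll condition() until it returns a truthy value.

    Hypothetical helper: returns True once the condition holds, or False
    if `timeout` seconds of sleeping have elapsed first. With timeout=None
    it waits indefinitely, e.g. until a person has solved the CAPTCHA.
    """
    waited = 0.0
    while not condition():
        if timeout is not None and waited >= timeout:
            return False
        sleep(poll_interval)
        waited += poll_interval
    return True
```

With Selenium, `condition` might be a small function that checks whether the interview reviews have appeared in `driver.page_source`; the exact check depends on Glassdoor's markup and is not shown here.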

Results

[Image: response]

The web scraper writes a JSON file named "[company name].json" to the same directory. Each entry in the JSON corresponds to one interview review on Glassdoor, with attributes (see above) for each portion of the review.
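As a rough sketch of how the output file might be consumed, the snippet below loads the JSON and lists the attribute names it finds. The exact layout of the file is not documented here, so the code assumes the reviews are either a top-level list of objects or the values of a top-level object; `load_reviews` and `attribute_names` are hypothetical helper names.

```python
import json


def load_reviews(path):
    """Load the scraper output and return the reviews as a list.

    Assumption: the file holds either a list of review objects or a
    dict whose values are review objects.
    """
    with open(path) as f:
        data = json.load(f)
    if isinstance(data, dict):
        return list(data.values())
    return list(data)


def attribute_names(reviews):
    """Return the sorted set of keys used across all review objects."""
    names = set()
    for review in reviews:
        if isinstance(review, dict):
            names.update(review.keys())
    return sorted(names)
```

For a company named Example, `attribute_names(load_reviews("Example.json"))` would show which fields each review carries without assuming their names in advance.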

Changelog

v1.2

  • companyURL now accepts full path of Interview page for ease of use
  • Fixed issue with pagination not working
  • Fixed issue where scraper would erroneously get stuck waiting for the page to load
  • Increased initial sleep time in case of CAPTCHA
  • Reduced polling time on waiting for page load or captcha input
  • Now takes an additional short break every 10 pages to avoid rate limiting
  • Cleaned up and added some more progress dialogue
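The extra break every 10 pages could look something like the sketch below. The delay values and the names `pause_for_page` and `scrape_pages` are assumptions for illustration; the actual script may structure its loop and timings differently.

```python
import time


def pause_for_page(page, base_delay=5.0, extra_delay=30.0, every=10):
    """Return how many seconds to sleep after scraping `page` (1-indexed).

    Hypothetical values: a base delay after each page, plus an extra
    break after every `every`-th page to stay under rate limits.
    """
    delay = base_delay
    if page % every == 0:
        delay += extra_delay
    return delay


def scrape_pages(num_pages, scrape_page, sleep=time.sleep):
    """Call scrape_page(page) for each page, pausing between pages."""
    results = []
    for page in range(1, num_pages + 1):
        results.append(scrape_page(page))
        if page < num_pages:  # no need to wait after the last page
            sleep(pause_for_page(page))
    return results
```

Passing `sleep` as a parameter keeps the pacing logic easy to check without actually waiting.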

v1.1

  • Made maxnum a global pages variable for easier use
  • Removed option and dependency for URL2 as each link ends in ".htm" anyway
  • Removed unnecessary concatenation of URL links at the beginning of get_data(). Glassdoor automatically redirects _IP1 link to the first interview page.
  • Increased sleep time after login
  • Increased sleep time in between scraping interview pages
  • Added some more progress dialogue

v1.0

  • And so it begins ...
