zencodism/DoodleCrawler


This is the start of a fairly simple crawling and parsing project targeting a university website that is fairly nasty to crawl. The problems encountered are heavy AJAX use and a lack of meaningful URLs. The attempted solution is to use a headless browser to simulate a normal user session. The project is currently in the research and experimentation stage, and both the sample crawling and parsing steps work. The next steps are to decide on a target architecture and preferred tech stack, then to concentrate on the agreed approach.

Currently available:

  • Java crawler using the HtmlUnit library. Stored in the DCrawler project subdirectory (a valid Eclipse project); it uses the htmlunit jar in the topmost directory. The script run_java_crawler.sh sets the classpath to include both the crawler class and that jar file. This crawler is strongly suggested as the target solution: the technology looks more mature, the documentation is better, and working with the htmlunit library is easier.
  • Python crawler using Selenium and the PhantomJS web driver. Stored in the Python_Crawler subdirectory; it depends on the requirements listed in the .txt file and on PhantomJS, which is available in both source and binary form and must be installed separately. This crawler is much less usable and would need more work to become stable.
    • Note that there are other possibilities, such as other Selenium-supported drivers, Casper.js with PhantomJS, or perhaps a Node solution. These were examined but not thoroughly researched.
  • Python parser operating on files obtained by the crawler. It is a sample script located in the 'parser' subdirectory along with a few sample files. The output is a Python dictionary containing the extracted data. In one case the parser fails to find an expected element. This is a simple bug, left in for now as a reminder that the input format can vary and the parser has to account for it.
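
The reminder about differing input formats can be illustrated with a minimal sketch. It is not the script from the 'parser' subdirectory. It assumes the wanted data sits in elements identified by an id attribute, and the names FieldExtractor, parse_page, course_name, teacher and room are hypothetical. Only the standard library html.parser module is used. A field whose element is missing comes back as None instead of causing a failure.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FieldExtractor(HTMLParser):
    """Collect the text content of elements whose id is in wanted_ids."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.fields = {}       # id -> collected text
        self._current = None   # id of the element currently being captured
        self._depth = 0        # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            # Nested tag inside a captured element: track depth only.
            self._depth += 1
            return
        element_id = dict(attrs).get("id")
        if element_id in self.wanted_ids:
            self._current = element_id
            self._depth = 1
            self.fields[element_id] = ""

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data


def parse_page(html, wanted_ids):
    """Return a dict of id -> stripped text, with None for missing elements."""
    extractor = FieldExtractor(wanted_ids)
    extractor.feed(html)
    extractor.close()
    return {
        name: extractor.fields[name].strip() if name in extractor.fields else None
        for name in wanted_ids
    }


if __name__ == "__main__":
    # Hypothetical page fragment: the "room" element is absent on purpose.
    sample = ('<div><span id="course_name">Algebra <b>I</b></span>'
              '<p id="teacher">Dr. X</p></div>')
    print(parse_page(sample, ["course_name", "teacher", "room"]))
```

Returning None for an absent field keeps the output dictionary's keys stable across pages, so later steps can decide how to treat incomplete records.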

