GitHub - zencodism/DoodleCrawler: Headless crawling experiments, rudimentary html parsing

This is the start of a fairly simple crawling and parsing project targeting a university website that is fairly nasty to crawl. The main problems are heavy AJAX use and a lack of meaningful URLs. The attempted solution is to use a headless browser to simulate a normal user session. The project is currently in the research and experimentation stage; both the sample crawling and the sample parsing steps work. The next steps are to decide on a target architecture and preferred tech stack, then concentrate on the agreed-upon approach.

Currently available:

Java crawler using the HtmlUnit library: stored in the DCrawler project subdirectory (a valid Eclipse project); uses the htmlunit jar in the topmost directory. The script run_java_crawler.sh sets the classpath to include both the crawler class and that jar file. This crawler is strongly suggested as the target solution: the technology looks more mature, the documentation is better, and working with the htmlunit library is easier.
Python crawler using Selenium and the PhantomJS web driver: stored in the Python_Crawler subdirectory. It depends on the requirements listed in the .txt file and on PhantomJS, which is available in both source and binary form and must be installed separately. This crawler is much less usable and would require more work to become stable.
- Note that there are other possibilities, such as other Selenium-supported drivers, Casper.js + PhantomJS, or perhaps a Node solution. These were examined but not thoroughly researched.
Python parser operating on files obtained by the crawler: a sample script located in the 'parser' subdirectory along with a few sample files. The output is a Python dictionary containing the extracted data. In one case the parser fails to find an expected element; this simple bug is left in for now as a reminder to account for possible differences in input format.
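
As an illustration of the parser's "HTML file in, Python dictionary out" idea, here is a minimal sketch that tolerates a missing element instead of failing. It is not the script from the 'parser' subdirectory: it uses only the standard library's html.parser, and the names FieldExtractor and parse_page and the element ids "name", "teacher" and "room" are hypothetical, chosen purely for the example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class FieldExtractor(HTMLParser):
    """Collect the text of elements whose id attribute is in wanted_ids."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.found = {}        # id -> text collected so far
        self._current = None   # id of the element being read, if any
        self._depth = 0        # nesting depth inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1   # nested tag inside a wanted element
            return
        element_id = dict(attrs).get("id")
        if element_id in self.wanted_ids:
            self._current = element_id
            self._depth = 1
            self.found[element_id] = ""

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:   # the wanted element itself was closed
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.found[self._current] += data


def parse_page(html, wanted_ids):
    """Return {id: text}, mapping ids absent from the page to None."""
    extractor = FieldExtractor(wanted_ids)
    extractor.feed(html)
    extractor.close()
    return {
        i: extractor.found[i].strip() if i in extractor.found else None
        for i in wanted_ids
    }
```

Calling parse_page(page_source, ["name", "teacher", "room"]) on a page that lacks the "room" element yields None for that key rather than raising, which is one way to absorb the differences in input format mentioned above.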

Repository contents:

DCrawler
Python_Crawler
parser
README.md
htmlunit-2.15-OSGi.jar
java_crawler.bat
java_crawler.sh
