
GSoC 2014 Ideas

Shane Evans edited this page Feb 14, 2014 · 52 revisions

Google Summer of Code 2014 - Ideas

Scrapy is a fast, high-level web crawling and scraping framework for writing spiders that crawl websites and extract structured data from them. Scrapy has a healthy and active community of developers.

Getting involved

If you're interested in participating in GSoC 2014 as a student, you should join the scrapy-developers mailing list and post any questions, comments, etc. there.

You can also join the #scrapy IRC channel on Freenode to chat with other Scrapy users and developers, using either a web browser or any IRC client.

All Scrapy development happens at GitHub: https://github.com/scrapy/scrapy

Adding ideas

Please follow this template:

Brief explanation
Expected results
Required skills
Mentor(s)

For more information on conventions and best practices see: http://en.flossmanuals.net/GSoCMentoring/making-your-ideas-page/

Ideas for GSoC 2014

Add-ons (SEP-021)

Brief explanation Simplified architecture for extensions
Expected results The add-ons functionality implemented as described in SEP-021
Required skills Python, general understanding of Scrapy extensions desirable but not required
Mentor(s) Pablo Hoffman

Core API cleanup & per-spider settings (SEP-19)

Brief explanation Finish core API cleanup and native support for per-spider settings
Expected results Core API implemented, documented and tested, as documented in SEP-019
Required skills Python
Mentor(s) Pablo Hoffman, Nicolas Ramirez

Better generator support

TODO:

This one should be specified in more detail; "expected results" are missing.

Brief explanation Improve Scrapy API using generators
Expected results ---
Required skills Python, general understanding of async code, API design
Mentor(s) Mikhail Korobov, Rolando Espinoza

There are areas where Scrapy usability and efficiency can be improved by using generators.
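As a sketch of the kind of improvement meant here (illustrative only, not an existing Scrapy API): a callback written as a generator yields items one at a time, so the framework can consume them lazily instead of holding a full list in memory.

```python
# Hypothetical sketch: a spider-style callback that yields items
# lazily instead of building a complete list up front.

def parse_listing(rows):
    """Yield one item per row as it is processed (generator style)."""
    for row in rows:
        yield {"title": row.strip().title()}

# A generator lets the caller consume items one at a time, so memory
# use stays constant regardless of how many rows the page contains.
items = parse_listing(["first product ", "second product"])
first = next(items)
```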

Python 3 support

Brief explanation Add Python 3.3 support to Scrapy, keeping 2.7 compatibility
Expected results The Scrapy test suite should pass most tests, and a basic spider should work under Python 3.3
Required skills Python 2 & 3, Testing, Communication skills
Mentor(s) Mikhail Korobov
Mikhail:

The Python 3 porting project is quite hard because Twisted doesn't even install on a Mac with Python 3. This project would require contributing not only to Scrapy, but to Twisted as well.

Shane:

One good thing about this project is that it's easy to make progress and you can keep going. I think it's OK if it doesn't go as far as Windows or Mac support.

API from a Scrapy spider

Brief explanation Write an HTTP API that wraps any Scrapy spider: it should accept requests, execute them in Scrapy, and return the data extracted by the spider.
Expected results Working server, Twisted API, docs and tests.
Required skills Python, Twisted, Scrapy
Skill level Intermediate
Mentor(s) Shane Evans

Scrapy supports crawling in batch mode very well, i.e., a long-running process that starts from seed requests, extracts and follows links, and exports data. Sometimes users would like to reuse the same code to extract data interactively. This project provides an API to support this usage and allows Scrapy extraction code to be reused from other applications.
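A minimal sketch of the request/response shape such an API might have, using only the standard library. All names here are invented (`extract` is a stand-in for actually running a URL through a Scrapy spider); the real project would plug Scrapy and Twisted in behind an endpoint like this.

```python
# Hypothetical sketch: an HTTP endpoint that accepts a JSON request
# containing a URL and returns the items "extracted" for it as JSON.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def extract(url):
    # Stand-in for running the URL through a Scrapy spider.
    return [{"url": url, "title": "example"}]

class ExtractHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"items": extract(payload["url"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

def serve(port=8080):
    """Run the demo server (blocking); not invoked automatically."""
    HTTPServer(("127.0.0.1", port), ExtractHandler).serve_forever()
```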

Better IPython integration

TODO:

Expected results missing

Brief explanation Create a better UI for developing Scrapy spiders using IPython
Expected results ---
Required skills Python, JavaScript, HTML, Interface Design, Security
Mentor(s) Mikhail Korobov, Shane Evans
Mikhail:

Develop an IPython + Scrapy layer. It is possible to display the HTML page inline in the console, provide some interactive widgets, and run Python code against the results (an old hacky demo of Scrapy + IPython is in the attachments). The IPython developers are going to release version 2.0 soon, and it should provide a standard protocol for such things.

Profiling Scrapy

Brief explanation Develop more comprehensive benchmarks. Profile and address CPU bottlenecks found. Address both known memory inefficiencies (which will be provided) and new ones uncovered.
Expected results Reusable benchmarks, measurable performance improvements.
Required skills Python, Profiling, Algorithms and Data Structures
Skill level Advanced
Mentor(s) Mikhail Korobov, Daniel Graña, Shane Evans
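A toy sketch of what "reusable benchmarks" could look like, using only the stdlib `timeit` module; the function name and result shape are invented for illustration, and the real project would benchmark whole crawls rather than a single function.

```python
# Hypothetical micro-benchmark helper: time a callable a fixed number
# of times and return a small, comparable result record.
import timeit

def benchmark(name, func, number=1000):
    seconds = timeit.timeit(func, number=number)
    return {"name": name, "number": number, "total_seconds": seconds}

result = benchmark(
    "join-urls", lambda: "/".join(["http://example.com", "page"])
)
```

Keeping results as plain records like this makes it easy to store runs and compare them across commits, which is what "measurable performance improvements" requires.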

Improve JavaScript integration

Brief explanation Improve JavaScript integration by using Splash to render and execute JavaScript.
Expected results A Scrapy middleware to integrate with Splash
Required skills Scrapy
Mentor(s) Mikhail Korobov, Daniel Graña

Support for spiders in other languages

Brief explanation A project that allows users to define a Scrapy spider by creating a stand-alone script or executable
Expected results Demo spiders in a programming language other than Python, a documented API, and tests.
Required skills Python and another programming language
Mentor(s) Shane Evans

Scrapy has a lot of useful functionality not available in frameworks for other programming languages. The goal of this project is to allow developers to write spiders simply and easily in any programming language, while letting Scrapy manage concurrency, scheduling, item exporting, caching, etc. This project takes inspiration from Hadoop Streaming, a utility that allows Hadoop MapReduce jobs to be written in any language.

This task will involve writing a Scrapy spider that forks a process and communicates with it using a protocol that needs to be defined and documented. It should also allow for crashed processes to be restarted without stopping the crawl.
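One possible shape for such a protocol, sketched with the standard library only: the parent writes one JSON request per line to the child's stdin and reads extracted items back as JSON lines, Hadoop-Streaming style. All names here are invented; the actual protocol is to be defined and documented as part of the project.

```python
# Hypothetical sketch of a line-oriented JSON protocol between the
# framework (parent) and an external spider (child process).
import json
import subprocess
import sys

# A tiny "spider" run as a child process; in practice it could be
# written in any language that can read stdin and write stdout.
CHILD = r"""
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    print(json.dumps({"item": {"url": request["url"], "status": "parsed"}}))
    sys.stdout.flush()
"""

def run_external_spider(urls):
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    items = []
    for url in urls:
        proc.stdin.write(json.dumps({"url": url}) + "\n")
        proc.stdin.flush()
        items.append(json.loads(proc.stdout.readline())["item"])
    proc.stdin.close()
    proc.wait()
    return items
```

Because the child is a separate process, the parent can detect a crash (EOF on stdout) and restart it without stopping the crawl, which is one of the stated requirements.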

Stretch goals:

  • Library support in python and another language. This should make writing spiders similar to how it is currently done in Scrapy
  • Recycle spiders periodically (e.g. to control memory usage)
  • Use multiple cores by forking multiple processes and load balancing between them.

Multi-platform Scrapy GUI for running spiders

Brief explanation Develop a multi-platform GUI interface for running Scrapy spiders.
Expected results This interface is a companion to Scrapyd and must use (and possibly extend) the Scrapyd API. Basic features: schedule spider jobs (start, stop, pause), view/search items, view/filter/search logs, export items/logs.
Required skills Multi-platform Python GUI development.
Mentor(s) Rolando Espinoza

Integration tests

Brief explanation Add integration tests for different networking scenarios
Expected results Be able to test scenarios ranging from vertical to horizontal crawling, against websites on the same and on different IPs, respecting throttling and handling timeouts, retries, and DNS failures. It must be simple to define new scenarios from predefined components (websites, proxies, routers, injected error rates)
Required skills Python, Networking and Virtualization
Mentor(s) Daniel Graña

New HTTP1.1 download handler

Brief explanation Replace the current HTTP/1.1 download handler with an in-house solution easily customizable to crawling needs.
Expected results An HTTP parser that degrades gracefully when parsing invalid responses, filtering out offending headers and cookies as browsers do. It must be able to avoid downloading responses bigger than a size limit, it can be configured to throttle the bandwidth used per download, and if there is enough time it can lay out the interface for response streaming
Required skills Python, Twisted and HTTP protocol
Mentor(s) Daniel Graña

The current HTTP/1.1 download handler depends on code shipped with Twisted that is not easily extensible by us. We ship Twisted code under scrapy.xlib.tx to support running Scrapy on older Twisted versions, for distributions that don't ship up-to-date Twisted packages. But this is an ongoing cat-and-mouse game: the HTTP download handler is an essential component of a crawling framework, and having no control over its release cycle leaves us with code that is hard to support.

The idea of this task is to depart from the current Twisted code, looking for a design that can cover current and future needs, taking into account that the goal is to deal with websites that don't follow the standards to the letter.
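To illustrate the "degrade gracefully" requirement (this is an illustrative sketch only, not the proposed handler's API): a lenient parser can keep well-formed `Name: value` header lines and silently drop malformed ones instead of failing the whole response, which is roughly what browsers do.

```python
# Hypothetical sketch: lenient parsing of a raw HTTP header block.
# Malformed lines are dropped rather than aborting the response.
def parse_headers_leniently(raw):
    headers = []
    for line in raw.split("\r\n"):
        if not line:
            break  # blank line ends the header block
        name, sep, value = line.partition(":")
        # Keep only lines that have a colon and a plausible header name.
        if sep and name and all(c.isalnum() or c == "-" for c in name):
            headers.append((name.strip(), value.strip()))
    return headers
```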

Refactor signal dispatcher

Brief explanation Profile and look for alternatives to the backend of our signal dispatcher, which is based on the pydispatcher library. Django moved away from pydispatcher a long time ago by simplifying the API and improving its signal dispatching performance. See Scrapy issue #8
Required skills Python
Mentor(s) Daniel Graña
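For context, the kind of simplification Django made can be sketched in a few lines (illustrative only, not Scrapy's or Django's actual implementation): each signal owns its receiver list directly, instead of going through pydispatcher's global dispatch tables.

```python
# Hypothetical minimal signal dispatcher: each Signal instance keeps
# its own receivers, so send() is a direct iteration with no global
# lookup tables.
class Signal:
    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        # Return (receiver, response) pairs from all receivers.
        return [(r, r(sender=sender, **kwargs)) for r in self._receivers]

# Example: a signal named after one of Scrapy's built-in signals.
spider_closed = Signal()
```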

Mentors

  • Daniel Graña (@dangra)
  • Shane Evans (@shane42)
  • Pablo Hoffman (@pablohoffman)
  • Mikhail Korobov (@kmike)
  • Nicolas Ramirez (@nramirezuy)
  • Rolando Espinoza (@darkrho)
  • Paul Tremberth (@redapple)