Google Summer of Code 2013

Chris Aniszczyk edited this page Sep 5, 2013 · 20 revisions

At Twitter, we love open source, working with students, and Google Summer of Code (GSoC)! What is GSoC? Every year, Google invites students to come up with interesting problems for their favorite open-source projects and work on them over the summer. Participants get support from the community, plus a mentor who makes sure they don't get lost and that they meet their goals. Aside from the satisfaction of solving challenging problems and contributing to the open source community, students get paid and get some sweet swag for their work! In our opinion, this is a great opportunity to get involved with open source, improve your skills and help out the community!

Information for Students

These ideas were contributed by our developers and our community; they are only meant to be a starting point. If you wish to submit a proposal based on one of these ideas, contact the developers to find out more about the particular suggestion you're looking at.

Being accepted as a Google Summer of Code student is quite competitive. Accepted students typically have thoroughly researched the technologies of their proposed project and have been in frequent contact with potential mentors. Simply copying and pasting an idea from this page will not work. On the other hand, creating a completely new idea without first consulting potential mentors is unlikely to work out.

If no specific contact is given, you can ask questions via @TwitterOSS or via the twitter-gsoc mailing list.

Accepted Projects

Mesos: Security and authentication

  • Brief explanation: Add security and authentication support to Mesos (including integration with LDAP).
  • Student: Ilim Ugur (@ilimugur)
  • Mentor: Benjamin Hindman (@benh) and Vinod Kone (@vinodkone)

Scalding: Matrix optimizations

  • Brief explanation: How should we multiply A*B*C? Perhaps (A*B)*C takes a lot longer than A*(B*C) due to the sizes of the matrices. What about matrices with huge skew (some columns or rows are MUCH denser than others)? What about matrices that are nearly vectors (just a few columns but MANY rows, for instance)? By optimizing at the Matrix API layer, we can easily reap the benefits at higher layers.
  • Expected results: Matrix multiplications for a few selected benchmark matrices should be faster. The optimizer may use either hinting from the user about the matrix sizes, or sampling/approximation to estimate the sizes.
  • Student: Tomas Tauber (@shoguncz)
  • Mentor: Oscar Boykin (@posco)

Netty: Asynchronous DNS resolver

  • Brief explanation: Netty is an asynchronous network application framework, yet it relies on the blocking DNS resolver provided by the JDK. A built-in asynchronous DNS resolver will keep slow or overloaded DNS servers from degrading the performance of applications built on top of Netty.
  • Expected results: 1) A DNS protocol codec package which can be reused by user applications built on top of Netty. 2) A built-in asynchronous DNS resolution mechanism for Netty based on (1).
  • Student: Mohamed Bakkar
  • Mentor: Trustin Lee (@trustin)

Adding a Proposal

Please follow this template:

  • Brief explanation:
  • Expected results:
  • Knowledge Prerequisite:
  • Mentor:

When adding an idea to this section, please try to include all of the fields listed above.

If you are not a developer but have a good idea for a proposal, first get in contact with the relevant developers or @TwitterOSS.

Project Ideas

Check out the Finagle mailing list if you have any questions.

Distributed debugging (DTrace-like instrumentation for distributed systems)

  • Brief explanation: DTrace is a very powerful and versatile tool for debugging local applications. We would like to employ similar types of instrumentation on a cluster of machines that form a distributed system, tracing requests based on specific conditions like the state of the server.
  • Knowledge Prerequisite: Scala, Distributed systems
  • Mentor: Marius Eriksen (@marius) and Steve Gury (@stevegury)

System profiler

  • Brief explanation: Be able to analyze the performance characteristics of a server based on the requests that pass through it (where does the latency come from, ...).
  • Knowledge Prerequisite: Scala, Distributed systems
  • Mentor: Marius Eriksen (@marius) and Steve Gury (@stevegury)

Kerberos authentication in Mux

  • Brief explanation: Mux is a new RPC session protocol in use at Twitter. We would like to add Kerberos authentication.
  • Knowledge Prerequisite: Scala, Distributed systems
  • Mentor: Marius Eriksen (@marius) and Steve Gury (@stevegury)

Pure Finagle ZooKeeper client

  • Brief explanation: ZooKeeper is the open source service that we use at Twitter for cluster membership. Right now the integration relies on the stock ZooKeeper client library. We would like to implement a ZooKeeper client purely in Finagle.
  • Knowledge Prerequisite: Scala, Distributed systems
  • Mentor: Marius Eriksen (@marius) and Steve Gury (@stevegury)

Security and authentication

  • Brief explanation: Add security and authentication support to Mesos (including integration with LDAP).
  • Knowledge Prerequisite: C++
  • Mentor: Benjamin Hindman (@benh) and Vinod Kone (@vinodkone)

Scalding is Twitter's library for programming in Scala on Hadoop. It is approachable by newcomers with a fields/data-frame-like API as well as a type-safe API. There is also a linear algebra API to support working with giant matrices and vectors on Hadoop.

Scalding REPL

Make a REPL that allows playing with Scalding in local and remote mode.

  • Brief explanation: The challenge here is determining which portions of a job are ready to run and which are not. You will build a DAG, and when a node is materialized, you schedule the parts of the job that depend on that output.
  • Expected results: Be able to type "scalding" and be inside an interactive Scalding job.
  • Knowledge Prerequisite: You need to know Scala; familiarity with the Scala compiler/REPL would be helpful. Knowing the basics of Hadoop would help, but is not needed.
  • Mentor: Oscar Boykin (@posco)
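
The scheduling problem described above can be illustrated with a small, language-neutral sketch. It is written in Python for brevity and is not Scalding's implementation: the step names and the `deps` mapping are hypothetical, and each step is treated as "materialized" as soon as its wave has run.

```python
def ready_steps(deps, done):
    """Return the steps that are not done yet but whose inputs are all materialized.

    deps maps a step name to the list of steps it depends on (hypothetical names).
    """
    return sorted(
        step
        for step, needs in deps.items()
        if step not in done and set(needs) <= done
    )


def run_in_waves(deps):
    """Walk the DAG, grouping steps into waves that could run once earlier waves finish."""
    done, waves = set(), []
    while len(done) < len(deps):
        wave = ready_steps(deps, done)
        if not wave:
            # Nothing is ready but work remains: the graph is not a valid DAG.
            raise ValueError("cycle or missing dependency")
        waves.append(wave)
        done.update(wave)
    return waves


# Hypothetical job: two outputs that both depend on a filtered input.
job = {
    "read": [],
    "filter": ["read"],
    "count": ["filter"],
    "join": ["filter", "read"],
}
waves = run_in_waves(job)
```

In a real REPL the waves would not be computed up front; the same readiness check would run each time a step finishes, so that newly typed expressions can be attached to already materialized outputs.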

Matrix optimizations

  • Brief explanation: How should we multiply A*B*C? Perhaps (A*B)*C takes a lot longer than A*(B*C) due to the sizes of the matrices. What about matrices with huge skew (some columns or rows are MUCH denser than others)? What about matrices that are nearly vectors (just a few columns but MANY rows, for instance)? By optimizing at the Matrix API layer, we can easily reap the benefits at higher layers.

  • Expected results: Matrix multiplications for a few selected benchmark matrices should be faster. The optimizer may use either hinting from the user about the matrix sizes, or sampling/approximation to estimate the sizes.

  • Knowledge Prerequisite: You need to know Scala and linear algebra; dynamic programming or compilers would be helpful. Knowing the basics of Hadoop would help, but is not needed.

  • Mentor: Oscar Boykin (@posco)
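
To make the A*B*C question concrete, here is a minimal sketch of the classic matrix-chain ordering dynamic program, written in Python for brevity. It assumes dense matrices and counts scalar multiplications only; that cost model is an assumption for illustration, and a real optimizer for sparse or skewed matrices on Hadoop would need size estimates from hints or sampling, as the idea above says.

```python
from functools import lru_cache


def best_order(dims):
    """Find the cheapest parenthesization of a chain of dense matrices.

    Matrix i has dims[i] rows and dims[i + 1] columns.
    Returns (number of scalar multiplications, parenthesized expression).
    """
    n = len(dims) - 1
    names = [chr(ord("A") + i) for i in range(n)]  # A, B, C, ...

    @lru_cache(maxsize=None)
    def solve(i, j):
        if i == j:
            return 0, names[i]
        best = None
        for k in range(i, j):  # split the chain between matrix k and k + 1
            left_cost, left = solve(i, k)
            right_cost, right = solve(k + 1, j)
            cost = left_cost + right_cost + dims[i] * dims[k + 1] * dims[j + 1]
            if best is None or cost < best[0]:
                best = (cost, f"({left}*{right})")
        return best

    return solve(0, n - 1)


# A is 10x1000, B is 1000x10, C is 10x1000.
cost, order = best_order([10, 1000, 10, 1000])
```

For these shapes, (A*B)*C needs 200,000 scalar multiplications while A*(B*C) needs 20,000,000, so the choice of order alone gives a 100x difference.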

Integrate Algebird and Spire

  • Brief explanation: Spire is a Scala library modeling many algebraic concepts. Algebird is a Twitter library that is very similar and has a subset of the objects in Spire. We would like to use the type classes of Spire in Algebird. Algebird is focused on streaming/aggregation algorithms, which are a subset of Spire's use cases.

  • Expected results: A branch of Algebird that has all the same functionality and all the same tests passing, but with all code that is in common with Spire removed.

  • Knowledge Prerequisite: You need to know Scala and linear algebra. Knowing the basics of Hadoop would help, but is not needed.

  • Mentor: Oscar Boykin (@posco) and Sam Ritchie (@sritchie)

Async Disk I/O

  • Brief explanation: fatcache deals with two kinds of I/O: disk I/O and network I/O. Network I/O in fatcache is async, but disk I/O is sync. To exploit CPU and SSD parallelism, we are forced to run multiple instances of fatcache because a) fatcache is single-threaded, and b) disk I/O is sync. However, if we made disk I/O async (using libaio, perhaps), a single instance of fatcache could exploit SSD parallelism to a greater extent.
  • Knowledge Prerequisite: C
  • Mentor: Manju Rajashekhar (@manju) and Yao Yue (@thinkingfish)

Lock Revamp

  • Brief explanation: Twemcache inherits the multithreaded design and locking mechanism of Memcached (1.4.4, to be exact). While Twemcache has great performance and low CPU utilization in general, lock contention is often the bottleneck that prevents it from taking advantage of the many cores on modern CPUs. Redesigning locking in Twemcache can push single-instance throughput to a whole new level.
  • Knowledge Prerequisite: C, Operating Systems
  • Mentor: Yao Yue (@thinkingfish) and Manju Rajashekhar (@manju)

Data Structure/Scripting Support

  • Brief explanation: Twemcache is a key-value store that treats values mostly as data blobs. Adding the most basic data structures, such as lists, will simplify how many users use Twemcache. On top of that, scripting support (such as Lua) will allow users to define, store and operate on arbitrary data structures, and do so without multiple round trips or atomicity concerns.
  • Knowledge Prerequisite: C
  • Mentor: Yao Yue (@thinkingfish) and Manju Rajashekhar (@manju)

Please visit the Twemcache GSoC wiki page for more information about these projects.

Graph compression and new algorithms

  • Brief explanation: Experiment and innovate with graph compression in Cassovary. Also implement some new algorithms, and explore performance trade-offs with compression.
  • Expected results: A few compressed versions of the various graph data structures contributed to Cassovary, and a study of the performance trade-offs of some algorithms like random walks. As just one example, you will be coding up the Layered Label Propagation algorithm.
  • Knowledge Prerequisite: Scala, algorithms, basic graph theory.
  • Mentor: Pankaj Gupta (@pankaj) and Aneesh Sharma (@aneeshs)
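
As one illustration of the kind of trade-off involved, the sketch below compresses a sorted adjacency list with gap encoding plus variable-length bytes, a common baseline for graph compression. It is written in Python for brevity, it is not Cassovary's storage format, and it assumes node ids are non-negative integers.

```python
def encode_neighbors(neighbors):
    """Store sorted neighbor ids as gaps, each gap as a little-endian varint."""
    out = bytearray()
    prev = 0
    for node in sorted(set(neighbors)):
        gap = node - prev
        prev = node
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # high bit set: more bytes follow
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_neighbors(data):
    """Invert encode_neighbors by summing the decoded gaps."""
    nodes, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += gap
            nodes.append(prev)
            gap, shift = 0, 0
    return nodes


neighbors = [1000, 1001, 1003, 5000]
packed = encode_neighbors(neighbors)
```

Here four neighbor ids fit in 6 bytes instead of 16 bytes as 32-bit integers. The price is that reading the i-th neighbor requires decoding every gap before it, which is the sort of cost a random-walk benchmark would expose.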

Asynchronous DNS resolver

  • Brief explanation: Netty is an asynchronous network application framework, yet it relies on the blocking DNS resolver provided by the JDK. A built-in asynchronous DNS resolver will keep slow or overloaded DNS servers from degrading the performance of applications built on top of Netty.
  • Expected results: 1) A DNS protocol codec package which can be reused by user applications built on top of Netty. 2) A built-in asynchronous DNS resolution mechanism for Netty based on (1).
  • Knowledge Prerequisite: Java, Netty, DNS protocol
  • Mentor: Trustin Lee (@trustin)
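
To give a feel for what the codec in (1) has to produce, here is a minimal sketch of encoding a DNS query and decoding a message header, following the wire format in RFC 1035. It is written in Python for brevity and is not Netty's API; the function names are hypothetical.

```python
import struct


def encode_query(name, query_id, qtype=1, qclass=1):
    """Build a DNS query packet (qtype 1 = A record, qclass 1 = IN)."""
    # Header: id, flags (only the recursion-desired bit set), one question,
    # and zero answer, authority and additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError("each label must be 1 to 63 bytes long")
        qname += bytes([len(raw)]) + raw  # length-prefixed label
    qname += b"\x00"  # the root label terminates the name
    return header + qname + struct.pack("!HH", qtype, qclass)


def decode_header(packet):
    """Parse the fixed 12-byte header found at the start of every DNS message."""
    query_id, flags, questions, answers, _, _ = struct.unpack("!HHHHHH", packet[:12])
    return {
        "id": query_id,
        "is_response": bool(flags & 0x8000),
        "rcode": flags & 0x000F,
        "questions": questions,
        "answers": answers,
    }


packet = encode_query("example.com", 0x1234)
header = decode_header(packet)
```

An asynchronous resolver would send such a packet over a non-blocking UDP channel and match the response to its pending request by the `id` field, instead of blocking a thread as the JDK resolver does.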

Convert to Arel

  • Brief explanation: The activerecord-reputation-system gem helps you build a reputation system on top of ActiveRecord and Rails. Currently this gem does not return ActiveRecord::Relation objects, which means calls to the reputation system cannot be chained or composed. This makes it difficult to compose another framework (e.g. an ACL system) with the reputation system. For example, queries like "what is the karma of all the users who have access to project 'foo'" are not readily possible.
  • Expected results: Users should be able to chain/compose reputation system calls with other ActiveRecord::Relation objects.
  • Knowledge Prerequisite: Ruby, Arel, ActiveRecord
  • Mentor: Sumit Shah (@bigloser) and Cameron Dutro (@camertron)

Project

Project URL

Project Idea (e.g., New Feature)

  • Brief explanation:
  • Expected results:
  • Knowledge Prerequisite:
  • Mentor:

General Proposal Requirements

Proposals will be submitted via http://www.google-melange.com/gsoc/homepage/google/gsoc2013, so plain text is the best way to go. We expect your application to be around 1000 words. Anything much shorter than that will probably not contain enough information for us to determine whether you are the right person for the job. Your proposal should contain at least the following information, but feel free to include anything that you think is relevant:

  • Your name and Twitter handle
  • Title of your proposal
  • Abstract of your proposal
  • A link to your GitHub profile (if you have one)
  • Detailed description of your idea, including an explanation of why it is innovative
  • Description of previous work and existing solutions (links to prototypes and bibliography are more than welcome)
  • Details of your academic studies, any previous work and internships
  • Any relevant skills that will help you achieve the goal (programming languages, frameworks)
  • Any previous open-source projects (or even previous GSoC projects) you have contributed to
  • Any other commitments during GSoC that may affect your work, including planned vacations or holidays
  • Contact details

Good luck!