Outreachy (Round 10)

Chris Aniszczyk edited this page Mar 29, 2015 · 25 revisions

@TwitterOSS is a proud participant of Outreachy. Outreachy helps people from groups underrepresented in free and open source software get involved. They provide a supportive community for beginning to contribute any time throughout the year and offer focused internship opportunities twice a year with a number of free software organizations. The current round of internships is open to women (cis and trans), trans men, genderqueer people, and all participants of the Ascend Project regardless of gender.

Information for Applicants

These ideas were contributed by our developers and our community; they are only meant to be a starting point. If you wish to submit a proposal based on these ideas, consider contacting the developers to find out more about the particular suggestion you're looking at.

Timeline

Check all the details about this round at the program page.

  • February 17: participating organizations are announced
  • February 17 - March 24: applicants need to get in touch with at least one project and make a contribution to it
  • March 3: application system opens
  • March 24: application deadline at 7pm UTC
  • April 7: EXTENDED! new application deadline at 7pm UTC
  • April 27: accepted participants announced on this page at 7pm UTC
  • May 25 - August 25 internship period

Accepted Projects

TBD

Adding a Proposal

Please follow this template:

  • Brief explanation:
  • Expected results:
  • Knowledge Prerequisite:
  • Mentor:

See here for more information: https://wiki.gnome.org/OutreachProgramForWomen#Send_in_an_Application

If you are not a developer but have a good idea for a proposal, get in contact with relevant developers first or @TwitterOSS.

Project Ideas

A good starting point for Finagle is the Quickstart: http://twitter.github.io/finagle/guide/Quickstart.html

You could also start digging in the code here: https://github.com/twitter/finagle/

Check out the Finagle mailing list if you have any questions.

finagle-http2

  • Brief explanation: HTTP/2 has been finalized and offers advantages over HTTP 1 (such as multiplexing) that would be useful for Finagle HTTP clients and servers.
  • Expected results: An experimental finagle-http2 implementation
  • Knowledge prerequisites: Scala, HTTP/2, and an interest in learning about Finagle
  • Mentors: Travis Brown (@travisbrown)

Kerberos authentication in Mux

  • Brief explanation: Mux is a new RPC session protocol in use at Twitter. We would like to add Kerberos authentication.
  • Knowledge Prerequisite: Scala, Distributed systems
  • Mentor: Marius Eriksen (@marius) and Steve Gury (@stevegury)

Examples and Service Adaptors for Stitch

  • Brief explanation: Stitch is a library for RPC service composition that makes it easy to take advantage of batch APIs without muddling up your code with explicit batching logic. We'd like to develop better examples and tools for developers who want to use Stitch in the context of Finagle.
  • Knowledge prerequisites: Scala and an interest in learning about Finagle
  • Mentors: Travis Brown (@travisbrown)

Alternative Service Representations for Scrooge

  • Brief explanation: Scrooge is a Thrift code generator that can create client and server adaptors for Finagle, but the current representation of service interfaces makes it difficult to wrap endpoints in Finagle filters, for example. We're interested in exploring other approaches that would allow Scrooge-generated clients and servers to fit more cleanly into the abstractions provided by Finagle.
  • Knowledge prerequisites: Scala and an interest in learning about Thrift and Finagle
  • Mentors: Nik Shkrob (@nshkrob)

Apache Aurora

  • Project Description: Apache Aurora is a service scheduler that runs on top of Mesos, enabling you to run long-running services that take advantage of Mesos' scalability, fault-tolerance, and resource isolation.
  • Aurora Mailing List: http://aurora.apache.org/community/
  • Aurora IRC: #aurora
  • Aurora on Twitter: @ApacheAurora
  • Where to start: The project also has several tickets labeled newbie if you'd like to introduce yourself by submitting a patch.

Interactive web-based Aurora CLI tutorial

  • Brief explanation: Implement a web-based terminal that provides a simulated interaction with the Aurora CLI, to guide new users through the process of running their first Aurora job. This project requires integrating existing libraries, potentially writing a new JavaScript library, and customizing it for the Aurora CLI. An example is what Docker has developed to get folks started (https://www.docker.com/tryit/). However, this code must be licensed under the Apache v2 license.
  • JIRA Issues: AURORA-1164
  • Knowledge Prerequisite: JavaScript
  • Mentor: Dave Lester (@davelester)

Apache Mesos

  • Project Description: Apache Mesos is a cluster manager that abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.
  • Mesos Mailing List: http://mesos.apache.org/community/
  • Mesos IRC: #mesos
  • Mesos on Twitter: @ApacheMesos
  • Where to start: The project also has several tickets labeled newbie if you'd like to introduce yourself by submitting a patch.

Updating FrameworkInfo

  • Brief explanation: Allow frameworks to update their FrameworkInfo without having to restart masters or slaves or tasks/executors. In other words, the updated FrameworkInfo should be properly reconciled across the cluster in a seamless fashion.
  • Expected results: A working implementation that allows all fields of FrameworkInfo to be updated.
  • JIRA ticket: MESOS-703
  • Knowledge Prerequisite: C++, Ability to understand Mesos architecture/codebase
  • Mentor: Vinod Kone (vinod@twitter.com)

Libprocess Benchmark Suite

  • Brief explanation: Implement a benchmark suite for libprocess to identify potential performance improvements and test for performance regressions.
  • Knowledge Prerequisite: C++
  • Mentors: Ben Mahler (@bmahler) and Jie Yu (@jie_yu)
  • JIRA Issue: MESOS-1018

Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.

Addition of Akka backend for streaming compute

  • Brief explanation: Akka (http://akka.io) is a popular open source distributed actor system. Integrating it into Summingbird would increase the range of potential compute platforms for users, making the system more accessible and suitable for more varied tasks.
  • Knowledge Prerequisite: Need to know Scala (or strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop.
  • Mentor: Oscar Boykin (@posco)

Addition of Samza backend for streaming compute

  • Brief explanation: Samza (http://samza.incubator.apache.org/) is a new Apache incubator project allowing compute to be placed between two Kafka streams. Integrating it into Summingbird would increase the range of potential compute platforms for users, making the system more accessible and suitable for more varied tasks.
  • Knowledge Prerequisite: Need to know Scala (or strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop and YARN.
  • Mentor: Oscar Boykin (@posco) or Ian O'Connell (@0x138)

Better Spark Support in Summingbird

  • Brief explanation: We currently have an alpha version of Spark support for batch computation. This should be completed, along with a demo application. After that, we should add a realtime layer using spark-streaming.
  • Knowledge Prerequisite: Need to know Scala (or strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Spark.
  • Mentor: Oscar Boykin (@posco) or Ian O'Connell (@0x138)

Addition of Tez backend for offline batch compute

  • Brief explanation: Tez (http://tez.incubator.apache.org) is a new Apache incubator project that generalizes and expands the map/reduce model of computation. Summingbird should be able to automatically take advantage of map-reduce-reduce plans and other optimizations that Tez enables. This should perform better than the existing Hadoop-via-Cascading-via-Scalding backend that is currently available.
  • Knowledge Prerequisite: Need to know Scala (or strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop and YARN.
  • Mentor: Ian O'Connell (@0x138)

Addition of batch key/value store on Mesos or Yarn

  • Brief explanation: Something that is sorely missing from the open source release of Scalding is a good batch-writable, read-only key-value store to use for batch jobs. This could be something like ElephantDB (https://github.com/nathanmarz/elephantdb) or HBase. Having such a project set up with Summingbird would be a huge coup for the open-source community.
  • Knowledge Prerequisite: Need to know Scala (or strong knowledge of Java with some functional programming background). Ideally familiar with Mesos or YARN, and with low-latency key-value stores like HBase or ElephantDB.
  • Mentor: Oscar Boykin (@posco) or Ian O'Connell (@0x138)

Scalding is Twitter's library for programming in Scala on Hadoop. It is approachable by newcomers, with a fields/data-frame-like API as well as a type-safe API. There is also a linear algebra API to support working with giant matrices and vectors on Hadoop.

Apache Tez support for Scalding

  • Brief explanation: Cascading 3 supports Apache Tez, which may compete with Spark in some workloads. If we update Scalding to use Cascading 3, we should be able to run Scalding on Tez. There are lots of little issues here and a few big ones, as some concepts from Hadoop are not present in Tez (the distributed cache changes) and some Cascading features are not yet supported.
  • Knowledge Prerequisite: Need to know Scala (or strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop. Must be familiar with graphs for modeling flows of computation.
  • Mentor: Oscar Boykin @posco

Query Optimization in Scalding

  • Brief explanation: Right now, Scalding has some optimizations that it can do because it can see the types of the data and functions. Those optimizations are baked into how the graph is produced. This project would instead create an AST in the typed API of Scalding, and only just before running would we do a global optimization to produce the best Cascading plan. This work can leverage existing code created for Summingbird to optimize these graphs.
  • Knowledge Prerequisite: Need to know Scala (or strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop. Must be familiar with graphs for modeling flows of computation.
  • Mentor: Oscar Boykin @posco
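
To illustrate the idea of building an AST first and optimizing it only just before running, here is a minimal sketch. It is written in Java purely for illustration (Scalding itself is Scala), and every type and method name in it (PlanNode, SourceNode, FilterNode, PlanOptimizerSketch, fuseFilters) is hypothetical rather than part of Scalding or Cascading. It shows a single rewrite rule, fusing stacked filters into one node, as one example of what a global optimization pass might do.

```java
import java.util.function.Predicate;

// Hypothetical AST for a deferred pipeline; these are not Scalding classes.
abstract class PlanNode {}

final class SourceNode extends PlanNode {
    final String name;

    SourceNode(String name) {
        this.name = name;
    }
}

final class FilterNode extends PlanNode {
    final PlanNode input;
    final Predicate<Integer> predicate;

    FilterNode(PlanNode input, Predicate<Integer> predicate) {
        this.input = input;
        this.predicate = predicate;
    }
}

final class PlanOptimizerSketch {
    // Rewrites Filter(Filter(x, p), q) into Filter(x, p AND q), working bottom-up.
    static PlanNode fuseFilters(PlanNode node) {
        if (node instanceof FilterNode) {
            FilterNode outer = (FilterNode) node;
            PlanNode input = fuseFilters(outer.input);
            if (input instanceof FilterNode) {
                FilterNode inner = (FilterNode) input;
                return new FilterNode(inner.input, inner.predicate.and(outer.predicate));
            }
            return new FilterNode(input, outer.predicate);
        }
        return node;
    }

    // Number of nodes on the path from this node down to its source.
    static int depth(PlanNode node) {
        return node instanceof FilterNode ? 1 + depth(((FilterNode) node).input) : 1;
    }
}
```

Because the plan is plain data until it runs, rules like this can be applied to the whole graph at once instead of being baked into how each step is constructed.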

Use statistics to implement page level filtering in the filter2 API

  • Brief explanation: We currently apply filters to entire row groups as well as individual records, but we could apply them to pages as well. This would work similarly to how row group filtering currently works.
  • Expected results:
    • Statistics based filtering applied to pages in the parquet read path
    • Additional tests for correctness
  • Knowledge Prerequisite: Java, Hadoop, Test frameworks
  • Mentor: Alex Levenson (@THISWILLWORK) and/or Julien Le Dem (@J_)
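
As a rough illustration of statistics-based filtering at the page level, the sketch below decides whether a page can be skipped from its min/max statistics alone. It does not use Parquet's actual classes: PageStats, PageFilterSketch, and their methods are hypothetical names, and the values are assumed to be plain long integers with no nulls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for per-page statistics; not a Parquet class.
final class PageStats {
    final long min;
    final long max;

    PageStats(long min, long max) {
        this.min = min;
        this.max = max;
    }
}

final class PageFilterSketch {
    // "value > threshold" cannot match any record when even the page max fails it.
    static boolean canDropGreaterThan(PageStats stats, long threshold) {
        return stats.max <= threshold;
    }

    // "value == target" cannot match when target lies outside [min, max].
    static boolean canDropEquals(PageStats stats, long target) {
        return target < stats.min || target > stats.max;
    }

    // Indexes of the pages that still have to be read for "value > threshold".
    static List<Integer> pagesToRead(List<PageStats> pages, long threshold) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            if (!canDropGreaterThan(pages.get(i), threshold)) {
                keep.add(i);
            }
        }
        return keep;
    }
}
```

A page that survives this check may still contain no matching records, so record-level filtering is still needed afterwards; the statistics only prove when a page is safe to skip.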

Collect more statistics for rowgroups + pages

  • Brief explanation: We currently only collect the number of records, number of nulls, min, and max for a chunk of records. We could make use of more statistics when filtering in the read path.
  • Expected results:
    • Investigate which statistics would be most useful
    • Add more types of statistics, such as:
      • A bloom filter of the values when a chunk is not dictionary encoded (good for filtering)
      • A HyperLogLog of the values (good for fast count-distinct)
      • A CountMinSketch of the values (good for heavy hitters)
    • Additional tests for correctness
  • Knowledge Prerequisite: Java, Hadoop, Test frameworks
  • Mentor: Alex Levenson (@THISWILLWORK) and/or Julien Le Dem (@J_)
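
To make the bloom filter suggestion concrete, here is a minimal sketch of one over long values. It is not taken from Parquet: the class name BloomFilterSketch, its parameters, and the double-hashing scheme are all assumptions chosen for brevity, and a real implementation would need a stronger hash function and a serialized format.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch; names and hashing scheme are hypothetical, not Parquet APIs.
final class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int position(long value, int i) {
        int h1 = Long.hashCode(value);
        int h2 = Long.hashCode(value * 0x9E3779B97F4A7C15L + 1);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(long value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(value, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(long value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }
}
```

For filtering, only the "definitely absent" answer matters: if mightContain returns false for the value in an equality filter, the whole chunk can be skipped without reading it.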

Add more filters to the filter2 API

  • Brief explanation: Parquet can currently filter values by ==, !=, >, >=, <, <= -- we could add some more, for example filter by value in(1,2,3) or notIn(1,2,3)
  • Expected results:
    • Add more filter types to the filter2 API
    • Additional tests for correctness
  • Knowledge Prerequisite: Java, Hadoop, Test frameworks
  • Mentor: Alex Levenson (@THISWILLWORK) and/or Julien Le Dem (@J_)
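
The following sketch shows one way in(...) and notIn(...) filters could behave, both per record and against min/max statistics. It is an assumption-laden illustration, not the filter2 API: InFilterSketch and its methods are hypothetical, and values are assumed to be non-null longs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical in/notIn filter over long values; not part of the filter2 API.
final class InFilterSketch {
    private final Set<Long> values;

    InFilterSketch(Long... values) {
        this.values = new HashSet<>(Arrays.asList(values));
    }

    // Record-level check for in(...).
    boolean keepIn(long v) {
        return values.contains(v);
    }

    // Record-level check for notIn(...).
    boolean keepNotIn(long v) {
        return !values.contains(v);
    }

    // Chunk-level check for in(...): drop when no candidate falls inside [min, max].
    boolean canDropIn(long min, long max) {
        for (long candidate : values) {
            if (candidate >= min && candidate <= max) {
                return false;
            }
        }
        return true;
    }
}
```

Under these assumptions, in(1,2,3) can skip a chunk whose statistics report min=10 and max=20, while notIn(1,2,3) generally cannot be decided from min and max alone.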

Take advantage of dictionary encoding in the filter2 API

  • Brief explanation: When applying filters to dictionary encoded columns, apply the filter to the dictionary instead of to the individual values.
  • Expected results:
    • Use dictionaries when filtering row groups
    • Use dictionaries when filtering individual records
    • Additional tests for correctness
  • Knowledge Prerequisite: Java, Hadoop, Test frameworks
  • Mentor: Alex Levenson (@THISWILLWORK) and/or Julien Le Dem (@J_)
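
A small sketch may help show why applying the filter to the dictionary is cheaper: the predicate runs once per distinct value, and each record then needs only an array lookup on its encoded id. The names below (DictionaryFilterSketch, matchingIds, countMatches, canDropChunk) are hypothetical, and the dictionary is modeled as a plain String array rather than Parquet's own dictionary classes.

```java
import java.util.function.Predicate;

// Hypothetical dictionary-aware filtering; not Parquet's actual read path.
final class DictionaryFilterSketch {
    // Evaluates the predicate once per distinct dictionary entry.
    static boolean[] matchingIds(String[] dictionary, Predicate<String> predicate) {
        boolean[] matches = new boolean[dictionary.length];
        for (int id = 0; id < dictionary.length; id++) {
            matches[id] = predicate.test(dictionary[id]);
        }
        return matches;
    }

    // Record-level filtering: look up each encoded id instead of decoding the value.
    static int countMatches(int[] encodedIds, boolean[] matches) {
        int count = 0;
        for (int id : encodedIds) {
            if (matches[id]) {
                count++;
            }
        }
        return count;
    }

    // Row-group-level filtering: the chunk can be dropped if no dictionary entry matches.
    static boolean canDropChunk(boolean[] matches) {
        for (boolean m : matches) {
            if (m) {
                return false;
            }
        }
        return true;
    }
}
```

The same boolean array serves both expected results: canDropChunk corresponds to filtering row groups, and countMatches stands in for filtering individual records.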

Profile and improve the assembly time for parquet-thrift

  • Brief explanation: When assembling Thrift records, Parquet uses a TProtocol that boxes every value in an anonymous class. Investigate and implement a more efficient solution.
  • Expected results:
    • Faster implementation of assembling thrift records
  • Knowledge Prerequisite: Java, Hadoop, Test frameworks
  • Mentor: Alex Levenson (@THISWILLWORK) and/or Julien Le Dem (@J_)

Parquet compatibility across tools

  • Brief explanation: Develop cross-tool compatibility tests for Parquet (https://github.com/Parquet/parquet-mr/issues/300)
  • Expected results:
    • Compatibility of nested data types across tools - pig, hive, avro, thrift etc.
    • Automated compatibility check between java implementation and impala (across release versions)
  • Knowledge Prerequisite: Java, Hadoop, Test frameworks
  • Mentor: Alex Levenson (@THISWILLWORK) and/or Julien Le Dem (@J_)

Decouple Parquet from the Hadoop API

Study state of the art floating point compression algorithms

(https://github.com/Parquet/parquet-mr/issues/306)

  • Brief explanation: Study existing lossless floating point compression papers and implement benchmarks.
  • Expected results: Provide a reference implementation and benchmark comparison, with integration into the Parquet library.
  • Mentor: Julien Le Dem (@J_)

You can learn more about getting involved with the Netty Project here: http://netty.io/community.html

Android testsuite

  • Brief explanation:
    • The Netty project team would like to officially support Android 4.0 Ice Cream Sandwich, and we need an automated testsuite to achieve that goal.
  • Expected results:
    • During the build process, an Android emulator is automatically started and stopped to run all (or applicable) JUnit tests inside the Android emulator.
    • The result of the JUnit tests inside the emulator affects the build result so that we can run the Android compatibility test in our CI machine.
    • All Android compatibility issues found during the test are fixed.
  • Knowledge Prerequisite:
    • Java and Android programming
    • Custom JUnit runners
    • Experience with building a network application atop Netty
  • Mentor: Trustin Lee (@trustin)

For more information about Pants, check these out:

Pants Interactive Tutorial

  • Brief explanation: Add an interactive tutorial to learn Pants
  • Expected results: Create an interactive tutorial that will guide you in learning Pants. Use JQueryTerminal to help stage a prompt that guides you through some simple Pants use cases.
  • Knowledge Prerequisite: Python, Java, JavaScript
  • Mentor: Chris Aniszczyk (@cra)

Eclipse Integration

  • Brief explanation: Add Eclipse integration to Pants
  • Expected results: Create a classpath container based on integrating with Pants and a launcher.
  • Knowledge Prerequisite: Python, Java, Eclipse
  • Mentor: Chris Aniszczyk (@cra)

You can read more about the project here https://github.com/pantsbuild/intellij-pants-plugin#intellij-pants-plugin

Running tests using Pants.

The Pants plugin can compile using Pants through the “Build” or “Rebuild” menu items. However, we still rely on IntelliJ's built-in JUnit test runner to run tests. Users can manually create a Pants run configuration to run tests.

We want to replace IntelliJ's built-in JUnit test runner and create a Pants run configuration whenever users run tests using the menu option or the right-click action. The next step will be to add functionality to Pants to construct IntelliJ test output. This will generate a pretty test tree view.

This project will give you the opportunity to work on both intellij-pants-plugin and Pants.

Import Pants Python projects

The Pants plugin can currently import Java and Scala projects. We want to add the ability to import Python projects. This will require adding the necessary APIs to Pants to get the dependencies for a project, as well as an understanding of how Python requirements are specified and resolved in Pants. The minimum expectations for the imported project are: the user should be able to navigate through the source code, and clicking on a 3rdparty dependency import should take you to the source code of the Python library.

Pants Plugin Wizard.

IntelliJ provides a wizard to create a new project, easily configure basic dependencies and start working with it right away. Currently we don’t have such integration for the Pants plugin.

Sample Import Wizard Screen

We want to start by creating a simple wizard for Pants projects with just a few basic templates for Java and Scala projects. In the wizard, a user will be able to choose the type of project they want to create, configure a main class if needed, and add some additional dependencies. After everything is configured, a project will be created with BUILD files and targets. The next step will be to add the ability to create custom templates in your repo. For example, here at Twitter we usually create Thrift services with a common structure and initial configuration. It would be great to be able to create Pants projects for such services in just a few clicks.