Google Summer of Code 2017

Krzysztof Nowak edited this page Mar 27, 2017 · 104 revisions

Introduction

Welcome to Zenodo's Google Summer of Code 2017 wiki page!

Thank you for taking the first step towards making science more open! We hope you will choose to contribute to our mission and have a great learning experience while doing so. On this wiki you will find all the necessary information about GSoC and about getting started with Zenodo development.

If you find any information missing, let us know and we'll add it here.

Table of Contents

  1. General Information
    1. Code and documentation
    2. Production and QA instances
    3. Discussion channels
    4. How to get started
    5. What we can offer and what we expect
  2. Zenodo Troubleshooting
  3. How to contact us
  4. Ideas List
    1. Researcher Profiles
    2. Research data metadata extraction
    3. Advanced Data Previewers
    4. Spam filtering and content classification
    5. Zenodo CLI
  5. How to submit a proposal
    1. Guidelines
    2. Proposal template
  6. Mentors

General Information

Code and documentation

Zenodo's code lives in a GitHub repository: https://github.com/zenodo/zenodo.

The developer documentation can be found at http://zenodo.readthedocs.io. It contains a getting started guide on how to set up Zenodo locally. Don't hesitate to contact us if you get stuck.

Zenodo is a digital archive written in Python and Flask. It is largely based on Invenio - a general purpose digital library software initially created and developed at CERN, which nowadays receives contributions from many other scientific institutes worldwide. Most of the available projects will involve adding something to an already existing Invenio module or creating a new one, and then writing an integration layer in Zenodo.

To get familiar with Invenio, please take a look at http://invenio.readthedocs.io. Because Invenio is used by so many libraries and institutes worldwide, it was designed in a highly modular fashion, with each module being a separate repository installed with pip. You can think of the modules as plugins or extensions out of which you can build your own library, archive, document server, and much more. Each module in Invenio is usually very generic and unaware of the specific service it's going to be used for. Instead, it's highly configurable, so that the basic functionality can be easily tailored to specific needs.
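
To make the "generic but configurable" idea more concrete, here is a minimal sketch of how such a module could register itself and provide defaults that a service overrides. It loosely follows the common Flask extension pattern (init_app); the names InvenioFoobar, FOOBAR_PAGE_SIZE and FOOBAR_TITLE are hypothetical, and a plain stand-in object replaces the real Flask application so that the sketch runs without any dependencies.

```python
from types import SimpleNamespace

# Hypothetical module defaults; a real module would keep these in its own config.
DEFAULTS = {"FOOBAR_PAGE_SIZE": 10, "FOOBAR_TITLE": "Generic library"}


class InvenioFoobar:
    """Hypothetical extension: generic behaviour plus overridable defaults."""

    def __init__(self, app=None):
        if app is not None:
            self.init_app(app)

    def init_app(self, app):
        # Only fill in values the service has not configured itself.
        for key, value in DEFAULTS.items():
            app.config.setdefault(key, value)
        app.extensions["invenio-foobar"] = self


def make_app(**overrides):
    """Stand-in for a Flask app: just a config dict and an extensions dict."""
    return SimpleNamespace(config=dict(overrides), extensions={})


# A service such as Zenodo overrides only what it needs.
zenodo_app = make_app(FOOBAR_TITLE="Zenodo")
InvenioFoobar(zenodo_app)
```

In this sketch the service keeps its own title while inheriting the module's default page size, which is the kind of tailoring described above.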

You can find the full collection of Invenio modules at https://github.com/inveniosoftware. Invenio modules follow the naming convention "Invenio-[Foobar]". To see which modules Zenodo is using, just take a look at the dependency list in Zenodo's setup.py.

Production and QA instances

A production instance of Zenodo lives at https://zenodo.org. Feel free to create an account and publish any papers, theses or other scientific output you have, but don't create test records there. This is a production service, and every time you publish a record, Zenodo registers a DOI. It's the real deal.

For playing around we have set up a Sandbox server: https://sandbox.zenodo.org. You don't have to worry about spamming Sandbox - information stored there stays there, and we wipe it clean once in a while anyway. In particular, use Sandbox when testing the REST API!
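
As a small illustration of pointing a script at Sandbox rather than production, the sketch below prepares (but does not send) a request that would create an empty deposit. The /api/deposit/depositions path and the access_token query parameter are assumptions to be checked against the REST API documentation at https://zenodo.org/dev; the token value is a placeholder and the helper name is hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SANDBOX_API = "https://sandbox.zenodo.org/api"  # never the production host


def build_create_deposit_request(token, base_url=SANDBOX_API):
    """Prepare a POST that would create an empty deposit (not sent here)."""
    url = "{}/deposit/depositions?{}".format(
        base_url, urlencode({"access_token": token})
    )
    return Request(
        url,
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_deposit_request("YOUR-SANDBOX-TOKEN")
# To actually send it: urllib.request.urlopen(request)
```

Keeping the base URL in one constant makes it harder to hit production by accident while experimenting.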

During development you will mostly be using your local instance of Zenodo. Completed projects will first end up on Sandbox, where they will be tested for a while by us as well as by our users. Finally, when the feature is ready, it goes to production.

Discussion channels

We do our general chatting, troubleshooting and help on Gitter: https://gitter.im/zenodo/zenodo. Feel free to stop by to say Hi, and mention that you're a GSoC student :)

Formal development discussions should take place in GitHub Issues (https://github.com/zenodo/zenodo/issues) or inside pull requests. Gitter is great for quick updates, but not so good for long discussions or searching chat history.

How to get started

The next two sections might be the most important ones in this guide, as they describe how to successfully apply with us, what we can offer you as mentors, and what we expect from you during this GSoC. Please read them carefully. It's crucial to your project's success!

Before you submit an idea proposal, please make sure you have completed the following:

  1. Install Zenodo locally, load the demo records and play around with the system to get a feel for how it works and what it offers to users. Information on how to do that is available in the Zenodo documentation. If you experience issues with setting up Zenodo locally, take a look at our Troubleshooting page - your problem has most likely been solved already. Otherwise, ask away on Gitter :)
  2. Get familiar with the Zenodo source code. Zenodo is based on the Invenio framework - in many places you will see something like from invenio_foobar import bazbar, spam, eggs. Go to the corresponding Invenio module to check out what it actually does. Most of your code will end up being an Invenio module, with a small layer of Zenodo code on top of it. You can find more on the Zenodo and Invenio codebases in the Code and documentation section above.
  3. If you already know which Project Idea from the list you would like to pursue, find the relevant Zenodo and Invenio code and try to close a small task or an issue related to it. Otherwise, demonstrate your skills by tackling some of the issues on our GitHub repository tagged as low-hanging-fruit.
  4. To prevent double work, comment under the Issue that you have started work on it.
  5. If somebody else commented more than 24 hours ago without any follow-up (a PR or questions), ask if you can take it.
  6. Choose an idea from the Ideas List, think about it for a while and contact us if you have any questions or more ideas! We'll be happy to discuss your project idea with you and see how you would attack the problem!
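
For step 2, a quick way to see which Invenio modules a file depends on is to scan its imports. The sketch below uses the standard ast module. The mapping from an import name such as invenio_foobar to a repository under https://github.com/inveniosoftware is an assumption based on the naming convention, so verify that each repository actually exists; the helper names are hypothetical.

```python
import ast

GITHUB_ORG = "https://github.com/inveniosoftware"


def invenio_modules(source):
    """Return the sorted top-level invenio_* packages imported in `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            top = name.split(".")[0]
            if top.startswith("invenio_"):
                found.add(top)
    return sorted(found)


def repo_url(package):
    """Guess the repository URL: invenio_foobar -> .../invenio-foobar."""
    return "{}/{}".format(GITHUB_ORG, package.replace("_", "-"))


example = "from invenio_foobar import bazbar, spam, eggs\nimport os\n"
modules = invenio_modules(example)
urls = [repo_url(m) for m in modules]
```

Running invenio_modules over the contents of a Zenodo source file gives you a short reading list of modules to explore.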

What we can offer and what we expect

Zenodo is part of our daily jobs. Because of that we can offer you a high availability of our guidance and mentorship during working hours (8:00-18:00 CEST), and some availability outside and on the weekends. We are researchers, educators and tutors ourselves and we have experience with mentoring interns, students as well as GSoC students. We have dedicated part of our daily jobs at CERN during GSoC exclusively for mentoring and guidance, so we can make this a great learning experience for you.

We expect a similar dedication from you.

If you plan on getting any other secret cool internships or side jobs this summer, please don't apply with us. We know that this is a remote job, and you might think you'll be able to pull it off, but trust us, it won't be the same and we'll notice... Even the easiest of the projects here requires the dedication of a full-time job - if you don't give it that, you'll be at a very high risk of failing. We promise to take our role seriously. If you don't, you'll waste a spot that somebody else would have used to the fullest.

By participating in GSoC with us, you will not only have a chance to greatly improve your coding skills, but you'll learn a lot about Open Science. Our goal is to put all of the accepted projects into production at the end or shortly after GSoC. This means that your work will have a direct impact on the progress of Open Science, and the work of thousands of scientists and researchers using Zenodo today!

If you successfully complete your project, we hope you will stay with us. We can give you tips on how you can continue your journey with Zenodo. Have an idea for a cool service on top of Zenodo? Perhaps you'd like to work on Zenodo as part of a BSc/MSc thesis? Want to use Zenodo data for a cool machine learning or other research project? Ask away, and we'll be happy to help and guide you further!

Zenodo Troubleshooting

Once again, if you find any problems with installation or development setup, please take a look at our Troubleshooting guide. You will find information on how to browse for solved problems and how to effectively report new problems!

How to contact us

The best and most recommended way to contact us is by talking to us on our public Gitter channel: https://gitter.im/zenodo/zenodo

If you have issues or administrative questions that require private communication, you can contact our organization admins Krzysztof (Gitter PM: https://gitter.im/krzysztof, k.nowak@cern.ch) or Alex (Gitter PM: https://gitter.im/slint, a.ioannidis@cern.ch).

If you have problems setting up your development environment or would like to discuss a project idea, use our main Gitter channel. This way the whole team can pitch in and help you faster.

Finally, for general questions about Zenodo you can also write an email to info@zenodo.org, but it's highly recommended that you use any of the previous methods.

Ideas List

Basic requirements:

All projects from the list require very good knowledge of Python. Previous experience with Flask, Django or another Python web framework is a big plus!

All of the projects require good knowledge of Object Oriented Programming and relational databases (knowledge of PostgreSQL and SQLAlchemy is also a big plus!).

Extra requirements:

Most projects require some basic knowledge of Elasticsearch and asynchronous tasks (Celery) and some basic frontend skills (AngularJS, JavaScript, HTML + Bootstrap).

Some of the projects also require some code/API design skills, either REST API design, class structure design or JSON Schema design.

Two projects require specialized knowledge of Machine learning, Data analysis and the corresponding Python tools (Scikit-learn, NumPy, Matplotlib and others).

We don't expect you to know all of the extra requirements, but basic requirements should be met. And, of course, the more of the extra requirements you already know, the better!

Project difficulty

Project difficulty spans from a 2nd-5th semester BSc Computer Science course (Difficulty: 2.5 - 4) to an advanced class in an MSc Computer Science course (Difficulty: 4.5 - 5.0).

All projects from the list involve:

  • Initial planning and design of the module or a feature
  • Coding and writing unit tests
  • Developer's (and sometimes User's) documentation.

Those have to be accounted for in the proposal's Deliverables.

Ideas:

Researcher Profiles

Description: Implement a public profile pages module for Zenodo users, so they can share their most recent work, include their bio and follow other researchers.

Difficulty: 3/5

Relevant skills: Data modeling (OOP, databases), Frontend (Javascript/AngularJS), Search (Elasticsearch DSL), REST API

Long description: Currently Zenodo does not offer any public pages for researchers to showcase their work. We would like to build something simple (similar to GitHub user profiles) and relevant to readers. Prior to taking on this project we encourage you to look at other research/work social sites with user profiles (GitHub, Orcid.org, Researchgate, Academia.edu, LinkedIn, etc.) and try to come up with a nice mix of features. If you want, you can even create a quick profile page mockup and link to it in your proposal.

Some suggestions on profile-relevant information:

  • List of Zenodo records, highlighting recently added papers and the researcher's top articles.
  • Researcher activity (contributions calendar heatmap à la GitHub)
  • Researcher "branding": Picture, Short Bio, "Skills", "Interests", etc.
  • Profiles might include a simple messaging system - How can you contact this researcher?
  • "Following" other users.
  • Some basic statistics about the researcher's total contribution.

Related Issues:

Expected outcome: A public profile page showcasing a researcher's work. A REST API endpoint for querying researcher profiles.

Possible mentors: Alex, Krzysztof, Jose

Research data metadata extraction

Description: A module capable of extracting metadata from a variety of research data formats. Generic and extendable by a variety of format-specific "extractors".

Difficulty: 4.5/5

Relevant skills: Data modeling (OOP, JSON Schema), Search (Elasticsearch DSL), Apache Tika, Simple Data Visualization (Matplotlib)

Data stored on Zenodo would have very little meaning if it wasn't for the metadata. Metadata is simply data which describes other data (so meta!). At the moment, when you upload a paper, dataset or software to Zenodo, you are required to provide some basic metadata on the content. This is provided by the user, but perhaps we could let the data speak for itself! The basic idea of this project is to create a simple module which will take a single piece of data (a file) and, depending on its format and content, output some basic metadata, such as Title, Abstract, Keywords and Authors (from PDFs), or even other kinds such as image metadata, CSV headers, geo-spatial information, audio information and video metadata. It's a lot of different formats to handle, but you can use off-the-shelf tools! A large part of this project will be the design of a JSON Schema, which will standardize the expected output from this module and make the metadata searchable.

Some relevant technologies:

  • Apache Tika
  • Metadata extraction from different files (CSV, FITS, PDF)

Expected outcome:

  • An Invenio module capable of ingesting files from the system, processing them according to the filetype, and resulting in extracted metadata as JSON.
  • Module should be generic and allow for further extension with other file formats in the future.
  • Part of the project can involve the analysis of the extraction algorithms and visualization of the results, e.g.: How accurate are the extracted titles?
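
One possible shape for a generic module with format-specific extractors is a small registry keyed by file extension. This is only a sketch under assumptions: the register decorator, the extract function and the output keys are hypothetical, and a real design would emit documents that conform to the JSON Schema developed during the project.

```python
import csv
import io
import os

EXTRACTORS = {}  # maps a lowercase file extension to an extractor function


def register(*extensions):
    """Decorator that registers an extractor for the given extensions."""
    def decorator(func):
        for ext in extensions:
            EXTRACTORS[ext.lower()] = func
        return func
    return decorator


@register(".csv")
def extract_csv(content):
    """Report the header row and the number of data rows of a CSV file."""
    rows = list(csv.reader(io.StringIO(content.decode("utf-8"))))
    return {"columns": rows[0] if rows else [], "rows": max(len(rows) - 1, 0)}


def extract(filename, content):
    """Dispatch on the extension and return JSON-serializable metadata."""
    ext = os.path.splitext(filename)[1].lower()
    extractor = EXTRACTORS.get(ext)
    metadata = extractor(content) if extractor else {}
    return {"filename": filename, "format": ext.lstrip("."), "metadata": metadata}


result = extract("data.csv", b"x,y\n1,2\n3,4\n")
unknown = extract("notes.xyz", b"")
```

Supporting a new format then only requires writing one more decorated function, and unknown formats fall back to empty metadata instead of failing.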

Possible mentors: Krzysztof, Alex

Advanced Data Previewers

Description: Enhance Zenodo's in-browser data previewing plugins with better image previewing using IIIF and support for more data formats.

Difficulty: 3-4/5 (depending on the number of previewers proposed for the project)

Relevant skills: Strong frontend skills (Javascript, Bootstrap, HTML), basic Elasticsearch.

Users can take a peek into the data currently stored on Zenodo without needing to download it. At the moment only some formats are supported: CSV, PDF, TIFF, PNG, JPG and ZIP. Zenodo contains over 100,000 image records (take a look HERE). The goal of this project would be to first enhance the image previewing capabilities (required) and, as a secondary objective, extend previewing to more formats (optional).

  1. To improve our image previewing support we could use IIIF (the International Image Interoperability Framework), so that large images can be scaled, rotated and cropped. Some records contain multiple images; perhaps we could render those as tiles or a gallery. In addition, it would be helpful if users could see a small thumbnail of the images directly in the search results. This will require generating thumbnails and some AngularJS and Elasticsearch work to make them appear in search results (e.g. imagine a small image thumbnail next to each record HERE).

  2. To support more data formats, we could add previewers for IPython notebooks, geographical metadata, 3D models, or plots of CSV data.
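
To illustrate the scaling, rotation and cropping mentioned in point 1, the sketch below builds request URLs following the IIIF Image API pattern {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}. The base URL and the identifier are placeholders, not real Zenodo endpoints, and the size values shown ("full", or "200," for a 200-pixel width) should be checked against the IIIF Image API version the server implements.

```python
from urllib.parse import quote


def iiif_url(base, identifier, region="full", size="full",
             rotation=0, quality="default", fmt="jpg"):
    """Build an IIIF Image API URL for one image request."""
    return "{}/{}/{}/{}/{}/{}.{}".format(
        base.rstrip("/"),
        quote(identifier, safe=""),  # identifiers must be URL-encoded
        region, size, rotation, quality, fmt,
    )


BASE = "https://example.org/iiif"  # placeholder server
# Whole image scaled to a width of 200 pixels, e.g. for search results.
thumbnail = iiif_url(BASE, "record-1:figure.png", size="200,")
# Top-left 512x512 crop, rotated by 90 degrees.
detail = iiif_url(BASE, "record-1:figure.png",
                  region="0,0,512,512", rotation=90)
```

Because every transformation is encoded in the URL, a thumbnail in the search results and a zoomable viewer can share the same image server.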

This project is partially focused on frontend work and requires strong knowledge of JavaScript. Nonetheless, it can rely mostly on off-the-shelf tools and libraries, with only a small JS integration layer and some customization.

This project also involves some investigation. In your proposal, feel free to recommend some of the libraries for data previewing you might have found or used elsewhere!

Some more ideas:

  • Multiple-image galleries for records with more than one image, preferably by using a IIIF JavaScript viewer like OpenSeadragon
  • In-browser 3D models previewing library: http://vcg.isti.cnr.it/3dhop/
  • Extension of archive previewing formats with TAR.GZ and RAR
  • Plotting of CSV data (simple 2D scatter plot, pie distribution charts, etc.)
  • Jupyter notebook previewing

Related Issues:

Expected outcome:

  • Integration of IIIF into image previewer
  • Image files thumbnail generation tasks, and integration into search results.
  • Extension of the Invenio-Previewer with richer previewers and more previewing formats.

Possible mentors: Alex, Lars

Spam filtering and content classification

Description: A module for Record metadata analysis, capable of training and then classifying records based on Spam/Non-Spam label as well as domain-specific labels.

Difficulty: 5/5

Relevant skills: Data modeling (OOP, JSON), Machine Learning (SciKit-learn, NumPy, SciPy, TensorFlow), Data Analysis (NumPy, Matplotlib)

Note: This is a challenging project, but it offers a research flavour! Great for MSc/PhD students!

Zenodo, like many open repositories, will sooner or later have to face the problem of spam content. The idea of this project is to analyze the metadata of current Zenodo records (which is public domain - CC0) and do some basic data analysis to determine whether we can train a classifier for spam records (you can count on us to help you gather the already labelled spam records on Zenodo). The classifier can be an off-the-shelf open source tool, but also a custom implementation of some classifier.

Moreover, since this problem can be treated more generally, it would be even better to create a module for general metadata classification and training of different classifiers, e.g. classification of records by scientific domain, abstract similarity or topic similarity. This way, it can serve as a foundation for a paper recommendation system (as well as a spam filter). The result of such a classifier would be a small piece of metadata (JSON), which can then be plugged into our search engine.

Getting started information:

  • A snapshot of Zenodo's metadata prepared for this project can be found here: https://doi.org/10.5281/zenodo.375909

  • Description of the dataset and its format is available in the record description.

  • The metadata contains many more features than you might find useful for classification. Moreover, some data might be of poor quality (standard in real-world machine learning). You might need to determine which features carry meaningful information.

  • Some metadata needs cleaning, e.g. keywords contain very useful information in general, but the field is "dirty" - sometimes it's a list of strings (as it should be), but sometimes it's a list with a single string containing comma-separated keywords. You might need to take a good look at the data and clean it up a bit to get some useful information.

  • A good classifier will depend on general text fields such as "description", "title", "keywords", "filename"/"filetype", "subjects" etc. and less on publishing conditions or time-related fields, such as "publication_date". Many SPAM records on Zenodo were published within one day by a spammer with a script. In this case "publication_date" might emerge as a very good SPAM/NON-SPAM differentiating feature and give you a falsely "good" classification score during training, but it would probably fail miserably for any new batch of SPAM that might come in the future.

  • There is no guarantee that a record labelled as non-spam is not in fact SPAM. Labelling was done selectively by a human, so the data probably contains a lot of false negatives.

  • For the "domain" classifier, you can use the information available in "communities". A simple way to define a "Domain" is to do a search on Zenodo's Communities (which looks into descriptions and titles). E.g. one could define domains as follows:
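
Returning to the keyword cleaning and feature selection points in the list above, a first pass could look like the sketch below. It assumes each record is a dictionary with fields such as "title", "description" and "keywords"; the actual layout is described in the dataset record, so treat the field names and helper names here as hypothetical.

```python
TEXT_FIELDS = ("title", "description")  # time-related fields are left out


def normalize_keywords(raw):
    """Turn a list of strings, or one comma-separated string, into a clean list."""
    if raw is None:
        return []
    if isinstance(raw, str):
        raw = [raw]
    cleaned = []
    for item in raw:
        for part in str(item).split(","):
            part = part.strip().lower()
            if part and part not in cleaned:
                cleaned.append(part)
    return cleaned


def text_features(record):
    """Join the text fields and cleaned keywords; ignore publication_date."""
    parts = [str(record.get(field, "")) for field in TEXT_FIELDS]
    parts.extend(normalize_keywords(record.get("keywords")))
    return " ".join(p for p in parts if p)


record = {
    "title": "Cheap watches",
    "description": "Buy now",
    "keywords": ["Watches, Deals , watches"],
    "publication_date": "2017-01-01",
}
keywords = normalize_keywords(record["keywords"])
features = text_features(record)
```

Note that this simple split would also break up a keyword that legitimately contains a comma, so it is worth inspecting the data before settling on a rule.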

Expected outcome:

  • Initial data analysis of current Zenodo records based on spam records.
  • Implementation of a Generic Invenio record classifier module for JSON data input with simple programmatic API and configuration.
    • Extendible with different classifiers through entry-points.
  • Classifier can either be run as part of the main Flask application (with Celery tasks to off-load the main application from the heavy computation), or a stand-alone service with a REST API.
  • Infrastructure documentation for running the tool (a Dockerfile), with tasks, notifications, results UI

Possible mentors: Alex, Krzysztof

Zenodo CLI

Difficulty: 2.5/5

Relevant skills: REST API, Data modeling (OOP, CLI design)

Description: A stand-alone CLI application, wrapping Zenodo REST API calls in a clean and easy-to-use command line tool.

There are two ways to get content into Zenodo - using the web interface, or writing a script (or using curl) which calls our REST API (see the Zenodo REST API documentation here: https://zenodo.org/dev). The web interface is quite easy to use, but can be cumbersome for large files or for uploading multiple records at the same time. The REST API is powerful, but requires programming knowledge or curl and Bash skills.

A perfect solution would be to create a simple Zenodo command line interface tool, which will allow users to perform batch operations without the need for any programming or scripting knowledge. This way users will be able to upload data directly from remote machines without a graphical interface, or upload huge datasets by leaving the CLI tool running in a tmux session instead of keeping the browser window alive for days. All of that would come without the need to know and understand the REST API or any scripting language.

NOTE: This project requires almost no coding in the main Zenodo repository. Instead, it's a stand-alone application which will talk to Zenodo's REST API. Nonetheless, it will have to follow the coding guidelines and coding style of our main codebase.

Some ideas and requirements:

  • Relevant RFC from Invenio community on how to make it a generic module: https://github.com/inveniosoftware/invenio/issues/3796
  • Reading, testing and very good understanding of Zenodo's REST API
    • Remember to use your local instance of Zenodo for testing, not production!
  • Propose a clean set of CLI instructions for common actions such as:
    • CRUD operations on a deposit
    • Files upload
    • Record publishing
    • Editing record's metadata
    • Retrieving search results from the CLI with a JSON response, e.g.: zenodo-cli search "The First Black Holes in the Cosmic Dark Ages" (prints the JSON response from the server).
  • Write a self-documented CLI in Python, using the click Python library
  • Optional but recommended - Write a CLI as a generic Invenio-CLI module and configurable to be a service-specific CLI.
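
As a starting point for the command set, the sketch below wires up the search example from the list. It uses argparse from the standard library only to stay dependency-free; the project itself asks for click. The /api/records endpoint with a q parameter is an assumption to be checked against the REST API documentation, and the sketch only prints the URL it would request instead of contacting a server.

```python
import argparse
from urllib.parse import urlencode

DEFAULT_BASE = "https://sandbox.zenodo.org/api"  # use Sandbox or a local instance


def search_url(query, base_url=DEFAULT_BASE):
    """Build the URL for a record search."""
    return "{}/records?{}".format(base_url, urlencode({"q": query}))


def build_parser():
    """Create the top-level parser with one sub-command per action."""
    parser = argparse.ArgumentParser(prog="zenodo-cli")
    parser.add_argument("--base-url", default=DEFAULT_BASE)
    commands = parser.add_subparsers(dest="command")
    search = commands.add_parser("search", help="search published records")
    search.add_argument("query")
    return parser


def main(argv=None):
    """Parse the arguments and run the selected command."""
    args = build_parser().parse_args(argv)
    if args.command == "search":
        url = search_url(args.query, args.base_url)
        print(url)  # a real tool would fetch this URL and print the JSON
        return url
    return None


url = main(["search", "The First Black Holes"])
```

Further commands (deposit CRUD, file upload, publishing) would follow the same shape as additional sub-commands, which also keeps the built-in --help output self-documenting.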

Expected outcome: A CLI tool which you can install with pip install zenodo-cli. Apart from the technical documentation, the tool will require a detailed user guide with examples.

Possible mentors: Krzysztof, Lars, Jose

How to submit a proposal

In this section you will find information on how to write a good proposal. Please read the guidelines carefully. You can use the template included below.

Guidelines

If you haven't already, make sure you have read the Elements of a Quality Proposal section of the Students Guide. You should read the rest of the Guide too, it's crucial for success!

  • Your proposal should follow the structure mentioned in the Elements of a Quality Proposal and contain the seven sections (Name and Contact Information, ..., Biographical Information). Some extra tips and requirements:
    • The GSoC proposal submission system should allow you to choose one of the 5 tags, each corresponding to a project on the Ideas List. Choose the one that matches your project.
    • Related work - If you have closed any issues on Zenodo tagged low-hanging-fruit, include the PR URLs. This will help us assess your technical skills better! Code speaks more than words!
    • Synopsis - You might want to include one or two sentences on why you would like to contribute to Zenodo, and why you chose these projects.
    • Deliverables
      • Some of the projects require some initial investigation and research, include what you have found. Have you seen a similar feature somewhere else already? Mention it!
      • Don't be afraid to suggest a design or a roadmap for a module which might be "wrong". We want to see if you have given the idea enough thought and if you understand what type of work the given project involves. The fine details of the milestones and deadlines can be worked out later with your mentor. Nonetheless, be specific and treat it as a real Software Engineering work plan.
  • The document should not be too long (2-4 pages). Feel free to write a longer document if you prefer, but please stay relevant to the project.
  • If you need to pass some exams, or study during some weeks of GSoC it's fine. GSoC almost always coincides with some exam period in many countries. Just make sure you mention this in your proposal so you and your mentor can better organize the work around that time.
  • You can use the template below to start on your proposal.

Proposal template

Name and Contact Information:

First name, Last name, email, GitHub handle, Twitter handle, Homepage URL, Skype/Hangouts handle for video conferencing

Title:

Project title - your subtitle (optional)

Synopsis:

One or two sentences on why you chose Zenodo from the organizations list and selected this specific project - Do you like the technical challenge? Do you want to see the feature go live?

The rest of the synopsis.

Benefits to Community:

What problem does my project address? How does it make Zenodo better? How does it make Open Science better?

Deliverables:

Plan of work, milestones, deadlines, draft of the idea and proposed solution.

Related Work:

Hey, I did something similar in the past! Also, I harvested the following Zenodo low-hanging-fruits:

  • URL to PR or Issue
  • URL to PR or Issue

Biographical Information:

Your education, work experience, coding skills, scientific contributions, open source contributions and more.

Mentors

Name                 | Roles                      | Gitter    | Email
Krzysztof Nowak      | Organization Admin, Mentor | krzysztof | k.nowak@cern.ch
Alexander Ioannidis  | Organization Admin, Mentor | slint     | a.ioannidis@cern.ch
Lars Holm Nielsen    | Mentor                     | lnielsen  | lars.holm.nielsen@cern.ch
Jose Benito Gonzalez | Mentor                     | jbenito3  | jose.benito.gonzalez@cern.ch