first commit

yahoo · Oct 17, 2018 · 368262f
commit 368262f
Showing 14 changed files with 877 additions and 0 deletions.
diff --git a/Code-Of-Conduct.md b/Code-Of-Conduct.md
@@ -0,0 +1,54 @@
+# Oath Open Source Code of Conduct
+
+## Summary
+This Code of Conduct is our way to encourage good behavior and discourage bad behavior in our open source community. We invite participation from many people to bring different perspectives to support this project. We pledge to do our part to foster a welcoming and professional environment free of harassment. We expect participants to communicate professionally and thoughtfully during their involvement with this project. 
+
+Participants may lose their good standing by engaging in misconduct. For example: insulting, threatening, or conveying unwelcome sexual content. We ask participants who observe conduct issues to report the incident directly to the project's Response Team at opensource-conduct@oath.com. Oath will assign a respondent to address the issue. We may remove harassers from this project. 
+
+This code does not replace the terms of service or acceptable use policies of the websites used to support this project. We acknowledge that participants may be subject to additional conduct terms based on their employment which may govern their online expressions.
+
+## Details
+This Code of Conduct makes our expectations of participants in this community explicit.
+* We forbid harassment and abusive speech within this community.
+* We request participants to report misconduct to the project’s Response Team.
+* We urge participants to refrain from using discussion forums to play out a fight.
+
+### Expected Behaviors
+We expect participants in this community to conduct themselves professionally. Since our primary mode of communication is text on an online forum (e.g. issues, pull requests, comments, emails, or chats) devoid of vocal tone, gestures, or other context that is often vital to understanding, it is important that participants are attentive to their interaction style.
+
+* **Assume positive intent.** We ask community members to assume positive intent behind other people’s communications. We may disagree on details, but we expect all suggestions to be supportive of the community goals.
+* **Respect participants.** We expect participants will occasionally disagree. Even if we reject an idea, we welcome everyone’s participation. Open Source projects are learning experiences. Ask, explore, challenge, and then respectfully assert if you agree or disagree. If your idea is rejected, be more persuasive, not bitter.
+* **Welcome new members.** New members bring new perspectives. Some may raise questions that have been addressed before. Kindly point them to existing discussions. Everyone is new to every project once.
+* **Be kind to beginners.** Beginners use open source projects to get experience. They might not be talented coders yet, and projects should not accept poor quality code. But we were all beginners once, and we need to engage kindly.
+* **Consider your impact on others.** Your work will be used by others, and you depend on the work of others. We expect community members to be considerate and to balance their self-interest with communal interest.
+* **Use words carefully.** We may not understand intent when you say something ironic. Poe’s Law suggests that without an emoticon people will misinterpret sarcasm. We ask community members to communicate plainly.
+* **Leave with class.** When you wish to resign from participating in this project for any reason, you are free to fork the code and create a competitive project. Open Source explicitly allows this. Your exit should not be dramatic or bitter. 
+
+### Unacceptable Behaviors
+Participants remain in good standing when they do not engage in misconduct or harassment. To elaborate: 
+* **Don't be a bigot.** Calling out project members by their identity or background in a negative or insulting manner. This includes, but is not limited to, slurs or insinuations related to protected or suspect classes, e.g. race, color, citizenship, national origin, political belief, religion, sexual orientation, gender identity and expression, age, size, culture, ethnicity, genetic features, language, profession, national minority status, mental or physical ability.
+* **Don't insult.** Insulting remarks about a person’s lifestyle practices.
+* **Don't dox.** Revealing private information about other participants without explicit permission.
+* **Don't intimidate.** Threats of violence or intimidation of any project member.
+* **Don't creep.** Unwanted sexual attention or content unsuited for the subject of this project.
+* **Don't disrupt.** Sustained disruptions in a discussion.
+* **Let us help.** Refusal to assist the Response Team to resolve an issue in the community.
+
+We do not list all forms of harassment, nor imply some forms of harassment are not worthy of action. Any participant who *feels* harassed or *observes* harassment should report the incident. Victims of harassment should not address grievances in the public forum, as this often intensifies the problem. Report it, and let us address it off-line.
+
+### Reporting Issues
+If you experience or witness misconduct, or have any other concerns about the conduct of members of this project, please report it by contacting our Response Team at opensource-conduct@oath.com, who will handle your report with discretion. Your report should include:
+* Your preferred contact information. We cannot process anonymous reports.
+* Names (real or usernames) of those involved in the incident.
+* Your account of what occurred, and if the incident is ongoing. Please provide links to or transcripts of the publicly available records (e.g. a mailing list archive or a public IRC logger), so that we can review it.
+* Any additional information that may be helpful to achieve resolution.
+
+After filing a report, a representative will contact you directly to review the incident and ask additional questions. If a member of the Oath Response Team is named in an incident report, that member will be recused from handling your incident. If the complaint originates from a member of the Response Team, it will be addressed by a different member of the Response Team. We will consider reports to be confidential for the purpose of protecting victims of abuse. 
+
+### Scope
+Oath will assign a Response Team member with admin rights on the project and legal rights on the project copyright. The Response Team is empowered to restrict some privileges to the project as needed. Since this project is governed by an open source license, any participant may fork the code under the terms of the project license. The Response Team’s goal is to preserve the project if possible, and it will restrict or remove participation by those who disrupt the project.
+
+This code does not replace the terms of service or acceptable use policies that are provided by the websites used to support this community. Nor does this code apply to communications or actions that take place outside of the context of this community. Many participants in this project are also subject to codes of conduct based on their employment. This code is a social contract that informs participants of our social expectations. It is not a terms of service or legal contract.
+
+## License and Acknowledgment
+This text is shared under the [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/). This code is based on a study conducted by the [TODO Group](https://todogroup.org/) of many codes used in the open source community. If you have feedback about this code, contact our Response Team at the address listed above. 
diff --git a/Contributing.md b/Contributing.md
@@ -0,0 +1,26 @@
+# How to contribute
+First, thanks for taking the time to contribute to our project! The following information provides a guide for making contributions.
+
+## Code of Conduct
+
+By participating in this project, you agree to abide by the [Oath Code of Conduct](Code-Of-Conduct.md). Everyone is welcome to submit a pull request or open an issue to improve the documentation, add improvements, or report bugs.
+
+## How to Ask a Question
+
+If you simply have a question that needs an answer, [create an issue](https://help.github.com/articles/creating-an-issue/), and label it as a question.
+
+## How To Contribute
+
+### Report a Bug or Request a Feature
+
+If you encounter any bugs while using this software, or want to request a new feature or enhancement, feel free to [create an issue](https://help.github.com/articles/creating-an-issue/) to report it. Make sure you add a label to indicate what type of issue it is.
+
+### Contribute Code
+Pull requests are welcome for bug fixes. If you want to implement something new, please [request a feature first](#report-a-bug-or-request-a-feature) so we can discuss it.
+
+#### Creating a Pull Request
+Before you submit any code, we need you to agree to our [Contributor License Agreement](https://yahoocla.herokuapp.com/); this ensures we can continue to protect your contributions under an open source license well into the future.
+
+Please follow [best practices](https://github.com/trein/dev-best-practices/wiki/Git-Commit-Best-Practices) for creating git commits.
+
+When your code is ready to be submitted, you can [submit a pull request](https://help.github.com/articles/creating-a-pull-request/) to begin the code review process.
diff --git a/LICENSE-MIT b/LICENSE-MIT
@@ -0,0 +1,9 @@
+MIT License
+
+Copyright 2018 Oath Inc.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,136 @@
+# Training Recipes for Reproducible Machine Learning Models
+
+> Technical documentation guidelines to improve the reproducibility of machine learning models.
+
+Did you build a machine learning model and want to make sure you're not the only one who knows how to train it? This repository contains guidelines for how to write technical documentation that makes it easier for others to reproduce the results of your model. 
+
+## Table of Contents
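+- [Background](#background)
+- [Usage](#usage)
+  - [Data](#data)
+  - [Training](#training)
+  - [Inference](#inference)
+  - [Miscellaneous](#miscellaneous)
+- [Example](#example)
+- [Contribute](#contribute)
+- [Maintainers](#maintainers)
+- [License](#license)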
+
+## Background
+
+Reproducing the results of a machine learning model is notoriously difficult. This is particularly true in the deep learning community, where high-dimensional, over-parameterized non-convex optimization problems require multiple heuristics to converge to local minima with good performance. Hence, if the optimization procedure or the source of the training data is not properly documented, it can take considerable experimentation to achieve the expected results.
+
+We propose the training recipe, a technical document whose aim is to provide sufficient information for a researcher to train a model and achieve a target performance without requiring any external help. We recommend writing training recipes for models that are deployed to production, models that are published, or models that reproduce results from academic papers. As we are well aware that writing technical documentation can be cumbersome, we provide simple guidelines that make it possible to produce the recipe in a timely fashion. We recommend writing the recipe as a Markdown document and reviewing it with a GitHub pull request.
+
+## Usage
+
+Here are the components of the training recipe:
+- Data
+  - Raw Data
+  - Data Processing Methods
+- Training
+  - Optimization Methods
+  - Performance Metrics
+- Inference
+
+The following sections provide a checklist of critical items for each of these components and explain why each item matters.
+
+### Data
+
+#### Raw Data
+
+**Description**
+
+Describe the data, labels, as well as additional available meta-data. It is important to understand the data schema when adding new data samples. Furthermore, while developing machine learning models one may identify biases or peculiar behaviors that can be caused by how the training data was generated. Information about its source makes it possible to verify hypotheses and make necessary adjustments. 
+
+**Path**
+
+Provide information about how to access the raw data, e.g. a website, a set of Hive tables, or a Hadoop Distributed File System (HDFS) or S3 URI. Make sure that you respect the data governance if access is restricted.
+
+#### Data Processing Methods
+
+**Description**
+
+Describe the data processing pipeline. The raw dataset is often in a format that is not appropriate for running the model training script directly. Here are some examples of common data processing methods:
+
+* The data is image URLs and images are downloaded.
+* The data and labels are contained in separate Hive tables which are joined.
+* The data is split into training, validation, and test sets. Because performance on the test set is the ultimate measure of reproducibility, it is critical to document in detail how the test set is built.
+* The class distribution is highly skewed and the dataset is balanced in order for the classifier to better learn the rare classes.
+* For natural language data a dictionary of fixed size is computed and words are mapped to indices.
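A split like the one in the list above is easier to reproduce when each sample's assignment depends only on a stable identifier rather than on file order or random state. The following is a minimal sketch, assuming every sample has a unique string ID. The function name `assign_split`, the salt value, and the 80/10/10 proportions are illustrative assumptions, not part of the recipe.

```python
import hashlib


def assign_split(sample_id: str, val_pct: int = 10, test_pct: int = 10,
                 salt: str = "recipe-v1") -> str:
    """Map a sample ID to 'train', 'val', or 'test' deterministically."""
    # Hash the salted ID so the assignment is stable across runs and machines.
    digest = hashlib.sha1(f"{salt}:{sample_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # integer bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


if __name__ == "__main__":
    # Hypothetical sample IDs, used only to show the resulting proportions.
    counts = {"train": 0, "val": 0, "test": 0}
    for i in range(10000):
        counts[assign_split(f"image_{i:05d}")] += 1
    print(counts)
```

Because the assignment is a pure function of the ID and the salt, recording those two values in the recipe is enough for someone else to rebuild the same test set, even after new samples are added.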
+
+**Code**
+
+Provide the code and instructions to run the data processing script. Include the git commit SHA-1 hash. 
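One way to avoid copying the commit hash by hand is to have the script read it from the repository when it runs. This is a minimal sketch, assuming the script is executed from inside a git checkout and that the `git` executable is on the path. The helper name `current_commit_sha` is hypothetical.

```python
import subprocess
from typing import Optional


def current_commit_sha(repo_dir: str = ".") -> Optional[str]:
    """Return the full hash of HEAD in repo_dir, or None if unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a git repository.
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    print("commit:", current_commit_sha())
```

The hash only identifies the code if the working tree has no uncommitted changes, so it is worth checking that before a run whose results will go into the recipe.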
+
+**Path**
+
+Include a link to the processed data. This makes it possible to train the model without running the preprocessing scripts, which may take a long time. Note that data governance rules such as GDPR may not allow keeping a cache of the processed data, in which case the dataset should be reprocessed whenever a new model is trained. If you can provide a link and access is restricted, make sure that you respect the data governance.
+
+### Training
+
+#### Optimization Methods
+
+**Description**
+
+Describe the model architecture and the loss function.
+
+**Code**
+
+Provide the code and instructions to run the training script. Include the git commit SHA-1 hash. 
+
+**Hyperparameters**
+
+Provide details about the optimization hyperparameters, such as:
+
+* Batch size
+* Number of GPUs
+* Optimizer information and learning rate schedule, e.g. stochastic gradient descent, momentum, Adam, RMSProp, etc.
+* Location of the pre-trained model (if the model is fine-tuned)
+* Data augmentation methods
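The items above can be kept in a small structured record that the training script writes next to its outputs, so that the recipe and the run cannot drift apart. This is a sketch under stated assumptions: the class `Hyperparameters`, its field names, and the example values are all hypothetical and are not recommended settings.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Hyperparameters:
    """Hypothetical record of the settings listed above."""
    batch_size: int
    num_gpus: int
    optimizer: str
    learning_rate_schedule: str
    pretrained_model_path: str = ""  # empty when training from scratch
    data_augmentation: List[str] = field(default_factory=list)


def to_json(params: Hyperparameters) -> str:
    """Serialize with sorted keys so two records diff cleanly."""
    return json.dumps(asdict(params), indent=2, sort_keys=True)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = Hyperparameters(
        batch_size=256,
        num_gpus=8,
        optimizer="sgd with momentum",
        learning_rate_schedule="step decay",
        data_augmentation=["random crop", "horizontal flip"],
    )
    print(to_json(example))
```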
+
+**Dynamics**
+
+Describe the training dynamics, such as the total training time, the evolution of the loss function on the training and validation sets, or the evolution of other relevant metrics such as accuracy. As training a model to completion may take several days, the dynamics provide a way to get earlier feedback about whether we are "on track" to reproduce performance.
+
+**Outputs**
+
+Provide a link to the trained model, e.g. a URL, or an HDFS or S3 URI.
+
+#### Performance Metrics
+
+**Target metrics**
+
+Provide target metrics, such as top-K accuracy, mean average precision, BLEU score, etc.
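Stating exactly how a metric is computed removes one source of mismatch between the reported number and a reproduction. As an illustration, here is a minimal dependency-free sketch of top-K accuracy, assuming one row of per-class scores per sample and integer class labels. The function name `top_k_accuracy` is illustrative and is not taken from any particular library.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


if __name__ == "__main__":
    # Made-up scores for three samples and three classes.
    scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    labels = [1, 1, 0]
    for k in (1, 2, 3):
        print(k, top_k_accuracy(scores, labels, k))
```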
+
+**Code**
+
+Provide the code and instructions to run the evaluation script. Include the git commit SHA-1 hash. 
+
+### Inference
+
+**Code**
+
+Provide an example of how to run the model on a new data sample. This demonstrates how to combine the data processing steps and model inference, which is helpful for model deployment. It also makes it possible to explore the model's predictions interactively and get insight into its performance beyond the metrics provided above. We recommend using Jupyter Notebooks to show examples of how to run inference.
+
+**Timing information**
+
+Provide timing information along with relevant details such as the hardware (e.g. NVIDIA Tesla V100, Intel Core i7), batch size, and software versions. Having access to these numbers makes it easier to plan for capacity and discuss product integrations.
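Timing numbers are easier to compare when they are collected the same way each time and stored with the system details. The sketch below assumes the model call can be wrapped in a zero-argument function. The name `time_inference` and the report fields are hypothetical. For a model running on an accelerator, asynchronous execution usually has to be synchronized inside `predict`, otherwise the clock may stop before the work is finished.

```python
import platform
import statistics
import time
from typing import Callable, Dict


def time_inference(predict: Callable[[], object], warmup: int = 3,
                   runs: int = 20) -> Dict[str, object]:
    """Time repeated calls to predict() and report them with system details."""
    for _ in range(warmup):
        predict()  # warm-up calls are excluded from the measurements
    durations_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "median_ms": statistics.median(durations_ms),
        "max_ms": max(durations_ms),
        "python_version": platform.python_version(),
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a call to the real model.
    print(time_inference(lambda: sum(range(100000))))
```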
+
+**Serialization**
+
+Provide the code and instructions to serialize the model. Include the git commit SHA-1 hash. Note that this is only required for models that are deployed to production in a format that differs from the output of the training script, e.g. ONNX, TFLite.
+
+### Miscellaneous
+
+This section contains additional relevant information, such as academic papers, an experiment journal and notes detailing the experiments performed to reach the best performance, a list of action items to improve the model, the code, etc.
+
+## Example
+
+Please refer to the training [recipe](recipe_resnet50_imagenet.md) for an example of how to train a ResNet-50 model on ImageNet. 
+
+## Contribute
+
+Please refer to [Contributing.md](Contributing.md) for information about how to get involved. We welcome issues, questions, and pull requests.
+
+## Maintainers
+
+Pierre Garrigues: garp@oath.com
+
+## License
+
+This project is licensed under the terms of the [MIT](LICENSE-MIT) open source license. 
diff --git a/figures/accuracy3.png b/figures/accuracy3.png
diff --git a/figures/cross_entropy2.png b/figures/cross_entropy2.png
diff --git a/figures/l2_loss2.png b/figures/l2_loss2.png
diff --git a/figures/loss2.png b/figures/loss2.png