
Wishlist for FSDL v2 #4

Open
josh-tobin opened this issue Jul 10, 2020 · 10 comments

@josh-tobin
Contributor

If we decide to do another version of the course, here are some new topics that could be exciting to add. This is off the top of my head, feel free to suggest other topics.

Bias / fairness

  • Detecting and reducing bias in ML systems
  • Ethics for ML practitioners

Deployment

  • More complicated web serving scenarios (ensembles, graphs of models, low-latency, larger models)
  • More prescriptive recommendations on deployment (how to do A/B tests, shadow mode, instant rollbacks, etc.)
  • Model optimization (quantization, distillation, compression, etc)
  • Edge / mobile deployment
  • On-prem or data-sensitive deployment
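To make the shadow-mode idea above concrete, here is a minimal sketch in pure Python (function and parameter names are made up for illustration, not taken from any particular serving framework):

```python
import logging

def serve_with_shadow(request, primary_model, shadow_model,
                      log=logging.getLogger("shadow")):
    """Run a candidate model on live traffic, but only ever return the
    primary model's answer. The shadow's output is logged for offline
    comparison before promoting it."""
    primary_out = primary_model(request)
    try:
        shadow_out = shadow_model(request)
        log.info("shadow_disagrees=%s", shadow_out != primary_out)
    except Exception:
        # A broken shadow model must never affect user-facing traffic.
        log.exception("shadow model failed")
    return primary_out
```

The key property is that the shadow path is fully fault-isolated: the candidate can crash or disagree without users noticing.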

Troubleshooting

  • More specific PyTorch recommendations

Testing

  • More specific testing recommendations -- "test coverage" for ML, what to do when tests fail, etc.
  • More on data slices, how to pick them, and how to manage them
  • Testing suggestions for language data
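One way to make slice-based testing concrete: group an evaluation set by a slicing function and compute the metric per slice, so a regression on a small subgroup can't hide inside the aggregate number. A minimal sketch in pure Python (the helper names are hypothetical):

```python
from collections import defaultdict

def metric_by_slice(examples, predict, slice_fn, metric):
    """Group (input, label) pairs into slices and compute a metric per slice."""
    slices = defaultdict(list)
    for x, y in examples:
        slices[slice_fn(x)].append((predict(x), y))
    return {name: metric(pairs) for name, pairs in slices.items()}

def accuracy(pairs):
    """Fraction of (prediction, label) pairs that match."""
    return sum(p == y for p, y in pairs) / len(pairs)
```

A test would then assert a minimum metric value per slice, not just overall.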

Monitoring

  • More on what to monitor
  • How to set up a monitoring system
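The core loop of "how to set up a monitoring system" can be sketched very simply: track a metric over a sliding window and alert when the window average crosses a threshold. A toy illustration in pure Python (class and parameter names are invented):

```python
from collections import deque

class RollingMonitor:
    """Alert when the rolling-window average of a metric (e.g. error rate,
    latency, prediction drift score) exceeds a threshold."""
    def __init__(self, window, threshold):
        self.values = deque(maxlen=window)  # old values fall off automatically
        self.threshold = threshold

    def observe(self, value):
        self.values.append(value)
        return self.alerting()

    def alerting(self):
        avg = sum(self.values) / len(self.values)
        return avg > self.threshold
```

Real systems add persistence, aggregation across replicas, and paging, but the windowed-threshold shape is the same.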

Data

  • Managing data at a larger scale
  • Managing user data for ML

Infrastructure / tooling

  • Feature stores -- why, when, and how
  • Logging infrastructure for ML
  • Spark -- why, when, and how
  • Tools for building reproducible data pipelines (Airflow, Kubeflow, etc)
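On the "feature stores -- why, when, and how" point: the core trick a feature store provides is point-in-time-correct lookups, so training data never leaks feature values from the future. A toy sketch of that idea in pure Python (names are hypothetical, not from Feast or any real product):

```python
import bisect

class FeatureStore:
    """Append-only store with as-of lookups to avoid training-time leakage."""
    def __init__(self):
        self._rows = {}  # (entity, feature) -> sorted list of (timestamp, value)

    def write(self, entity, feature, ts, value):
        key = (entity, feature)
        self._rows.setdefault(key, []).append((ts, value))
        self._rows[key].sort()

    def read_asof(self, entity, feature, ts):
        """Latest value written at or before `ts`, or None if nothing exists."""
        rows = self._rows.get((entity, feature), [])
        timestamps = [t for t, _ in rows]
        i = bisect.bisect_right(timestamps, ts)
        return rows[i - 1][1] if i else None
```

Training reads with the label's timestamp; serving reads with "now" -- same API, no skew.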

Model lifecycle management

  • How to know when to retrain models
  • How to set up reproducible retraining pipelines
  • How to select data for your next training run (active learning & friends)
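The simplest member of the "active learning & friends" family is uncertainty sampling: label the pool examples the current model is least sure about. A minimal sketch in pure Python (helper names are made up):

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool, predict_proba, k):
    """Pick the k unlabeled examples with the highest predictive entropy,
    i.e. the ones the model is least certain about."""
    ranked = sorted(pool, key=lambda x: entropy(predict_proba(x)), reverse=True)
    return ranked[:k]
```

Variants swap entropy for margin, disagreement between ensemble members, or expected gradient length, but the select-by-uncertainty loop is the same.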
@josh-tobin
Contributor Author

Suggestion from @sayakpaul

Add a section to the troubleshooting lecture on things that are common practice in the research world that may not be worth the added complexity in the real world. Examples:

  • Careless data augmentation can degrade image quality unnecessarily. So how can we incorporate augmentation while preserving quality as much as possible? The fast.ai team's answer is a simple augmentation policy called presizing.
  • My model is stuck in a local minimum -- what approaches should I take? E.g., LR schedules with decay, cyclical learning rates, etc. (just providing examples).
  • Incorporating batch normalization can make model performance highly dependent on batch statistics, which may be undesirable at inference time. So how should this be handled?
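For the cyclical-learning-rate example in the list above, the triangular schedule from Leslie Smith's CLR paper fits in a few lines. A sketch in pure Python (parameter names are my own):

```python
def triangular_clr(step, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate: LR ramps linearly from base_lr up
    to max_lr and back down, one full cycle every 2 * step_size steps."""
    cycle_pos = step % (2 * step_size)
    frac = cycle_pos / step_size
    if frac > 1:
        frac = 2 - frac  # descending half of the triangle
    return base_lr + (max_lr - base_lr) * frac
```

In practice you would wrap this in a framework scheduler (e.g. PyTorch ships a built-in CyclicLR), but the formula is the whole idea.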

@sayakpaul

@josh-tobin do you mind sharing the platforms/approaches/tools you have in mind to cover "Managing data at a larger scale"? Or do you plan to keep it platform-independent?

One approach that I have found useful for quite a while now (apologies if it is naive):

  • Convert my data into multiple shards of TFRecord files
  • Copy them over to a GCS bucket in the same zone where my cloud VM resides for improved performance
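The sharding step is framework-agnostic; the essence is just splitting the records into roughly equal groups before each group is written out as its own file. A sketch in pure Python (the function name is mine):

```python
def shard(records, num_shards):
    """Round-robin records into num_shards roughly equal shards, e.g. before
    writing each shard out as a separate TFRecord file. Round-robin keeps
    shard sizes balanced even when the record count isn't known up front."""
    shards = [[] for _ in range(num_shards)]
    for i, rec in enumerate(records):
        shards[i % num_shards].append(rec)
    return shards
```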

@josh-tobin
Contributor Author

> @josh-tobin do you mind sharing the platforms/approaches/tools you have in mind to cover "Managing data at a larger scale"? Or do you plan to keep it platform-independent?
>
> One approach that I have found useful for quite a while now (apologies if it is naive):
>
> • Convert my data into multiple shards of TFRecord files
> • Copy them over to a GCS bucket in the same zone where my cloud VM resides for improved performance

This is somewhat similar to how I've done it in the past. Some considerations are HDFS vs GCS/S3/etc, and how to build performant data loaders. I'd want to dig into this some more before making any concrete recommendations though.
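The main trick behind a performant data loader is overlapping I/O with compute: fetch the next batch in the background while the current one is being consumed. A minimal prefetching wrapper in pure Python (framework loaders like tf.data and PyTorch's DataLoader do this for you; names here are illustrative):

```python
import queue
import threading

def prefetch(iterable, buffer_size=4):
    """Yield items from `iterable`, producing them in a background thread so
    loading overlaps with the consumer (e.g. a GPU training step)."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks end of stream

    def producer():
        for item in iterable:
            q.put(item)  # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item
```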

@sayakpaul

Great! This could also be stretched to show how impactful the data input pipeline is for training a model with good hardware utilization.

@chesterxgchen

Data management should also cover the new table formats, such as:

Apache Iceberg -- originally from Netflix, currently used by many big companies (Netflix, Apple, Alibaba, Tencent, Adobe, LinkedIn (?), ...)

Delta Lake -- originated at Databricks, open-sourced under the Linux Foundation, with the greatest momentum due to its integration with Databricks Cloud, Spark/Spark SQL/Spark Streaming, and MLflow

Apache Hudi -- originated at Uber, well suited for upserts.

All three have the time-travel support needed for data versioning. MLflow is now integrated with Delta Lake for data version control.
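To illustrate what "time travel" buys you for ML, here is a toy versioned table in pure Python: every commit produces a new version id, and a training run can pin an exact version the way Delta Lake / Iceberg / Hudi readers do (a real format stores deltas and metadata, not full snapshots; this sketch stores full copies for simplicity):

```python
class VersionedTable:
    """Toy time-travel table: each commit snapshots the data under a new
    version id, so any past state can be re-read exactly."""
    def __init__(self):
        self._snapshots = []  # version i -> list of rows at that commit

    def commit(self, rows):
        self._snapshots.append(list(rows))
        return len(self._snapshots) - 1  # the new version id

    def read(self, version=None):
        """Read the latest snapshot, or a pinned historical version."""
        return self._snapshots[-1 if version is None else version]
```

Pinning `version` in a training config is what makes a retraining run reproducible even as the underlying table keeps changing.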

@chesterxgchen

Monitoring --
Data quality monitoring: feature quality, feature distribution visualization, feature skew, data distribution change over time, train/test/validation feature distribution mismatch, etc.
Model monitoring --
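A standard statistic for the "feature distribution change over time" item is the Population Stability Index (PSI), computed over binned feature values. A sketch in pure Python (the conventional rule of thumb in the comment is folklore from credit-risk practice, not a formal result):

```python
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population Stability Index between two binned distributions
    (e.g. a feature's training histogram vs. its live histogram).
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total
```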

@KDDS

KDDS commented Jul 14, 2020

More aspects of data engineering, like implementing massively parallel processing techniques and other cutting-edge solutions used today to make big data ready for DL/ML.
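The entry point for parallel data preparation in Python is usually an executor pool over a per-record transform. A minimal sketch (names are mine; threads suit I/O-bound work like downloads and decoding, while CPU-bound transforms want a process pool or a framework like Spark/Beam):

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess_parallel(records, transform, workers=8):
    """Apply `transform` to every record with a worker pool.
    `Executor.map` preserves input order in its results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, records))
```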

@DanielhCarranza
Contributor

Data
Dataset shifts:

  • Proactive approaches (e.g., causal diagrams, DAGs, PAGs)
  • Reactive approaches

How to effectively handle Long-tail Data
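One common baseline for long-tail data is resampling with inverse-class-frequency weights, so rare classes are drawn as often as common ones during training. A sketch in pure Python (the function name is mine; in PyTorch these weights would feed a WeightedRandomSampler):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-example sampling weights inversely proportional to class frequency,
    so long-tail classes are not drowned out by the head of the distribution."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]
```

More elaborate options (class-balanced loss, logit adjustment, two-stage training) build on the same reweighting intuition.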

@josh-tobin
Contributor Author

@DanielhCarranza could you say more about the reactive/proactive approaches you have in mind?

@nickdavidhaynes

I'd love to see a discussion of peer review (maybe it fits in the section on teams, or in the testing/deployment sections?). There are a lot of pieces that need to be reviewed!

  • Training code/configuration
  • Serving code
  • Modeling approach — data and features selected, model architecture, choice of metric(s)
  • Experiment results
  • Plan and code for monitoring performance

I'm aware of a couple good blog posts on the topic, but I'm not sure anything definitive exists.
