Flexible data slicing logic #20

Closed
Firenze11 opened this issue Sep 9, 2019 · 1 comment
Labels
enhancement New feature or request

Comments

Firenze11 commented Sep 9, 2019

Problem

Currently the following logic is hardcoded in Manifold:

  • With auto segmentation, users can choose the comparison metric; with manual segmentation, they cannot. The metric stays at whatever was set while they were in auto mode.
  • With manual segmentation, users can only select feature columns, not prediction columns.
  • The comparison metric cannot be flexibly defined.

Because of this, the following use cases cannot be easily implemented:

Use cases

Real-world uses that this improvement will support (from customer interview):

  1. Highlight direction of errors
  2. Identify really badly-performing data points (outliers) for inspection

Solutions

Use case 1 can be implemented by setting the performance metric to indicate over/under prediction (instead of absolute prediction error) and allowing users to manually segment data based on this metric column (i.e. set the segmentation threshold to 0).

Use case 2 can be implemented by allowing users to manually segment data based on the metric column (i.e. set the segmentation threshold to a value high enough that only a few data points fall in group 0).
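
A minimal sketch of both use cases, assuming a signed error defined as prediction minus ground truth and a single-threshold manual split. The helper names (`signedError`, `segmentByThreshold`), the sample rows, and the convention that values above the threshold land in group 0 are all hypothetical and not taken from the Manifold codebase.

```typescript
// Hypothetical helpers for illustration; not part of Manifold's API.

// Signed error: positive means over-prediction, negative means under-prediction.
function signedError(predicted: number, actual: number): number {
  return predicted - actual;
}

// Manual segmentation on one metric column. Rows whose metric is above the
// threshold go to group 0, all other rows to group 1. Returns row indices.
function segmentByThreshold(
  metric: number[],
  threshold: number
): [number[], number[]] {
  const group0: number[] = [];
  const group1: number[] = [];
  metric.forEach((value, index) => {
    if (value > threshold) {
      group0.push(index);
    } else {
      group1.push(index);
    }
  });
  return [group0, group1];
}

// Made-up sample rows.
const rows = [
  { predicted: 10, actual: 9 },
  { predicted: 12, actual: 15 },
  { predicted: 7, actual: 7 },
  { predicted: 30, actual: 5 },
];

// Use case 1: a threshold of 0 on the signed error separates over-predictions.
const signed = rows.map((r) => signedError(r.predicted, r.actual));
const [overPredicted, notOverPredicted] = segmentByThreshold(signed, 0);

// Use case 2: a high threshold on the absolute error isolates a few outliers.
const absolute = signed.map((e) => Math.abs(e));
const [outliers] = segmentByThreshold(absolute, 20);
```

Both use cases rely on the same split function; only the metric column and the threshold differ, which is why the metric needs to be configurable independently of the segmentation mode.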

To enable these, we need to make the following fields in state independent knobs (instead of hard-coding the value of one field based on the value of another), and then hook each of them to UI controls:

  • isManualSegmentation: whether to apply manual (filter-based) or automatic (k-means) data slicing
  • baseCols: which columns to slice on (either by creating filters for these columns, or by feeding them into k-means clustering)
  • nClusters: number of clusters to use in automatic slicing (only applicable to automatic slicing)
  • segmentFilters: filter logic corresponding to data segment (only applicable to manual slicing)
  • segmentGroups: which segments to group together for comparing against each other
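
One way to picture these independent knobs is as a flat state object updated through partial patches. This is only a sketch: the field names come from the list above, but the types, the `SegmentFilter` shape, the default values, the example column name, and the `updateSlicing` helper are assumptions, not Manifold's actual state definition.

```typescript
// Hypothetical filter shape: keep rows whose column value is within [min, max].
interface SegmentFilter {
  column: string;
  range: [number, number];
}

// Field names follow the list above; all types here are assumed.
interface SlicingState {
  isManualSegmentation: boolean;
  baseCols: string[];
  nClusters: number; // only read when slicing automatically
  segmentFilters: SegmentFilter[][]; // one filter list per segment; manual slicing only
  segmentGroups: number[][]; // segment ids in each comparison group
}

// Made-up defaults for illustration.
const initialState: SlicingState = {
  isManualSegmentation: false,
  baseCols: [],
  nClusters: 4,
  segmentFilters: [],
  segmentGroups: [[0], [1]],
};

// Applies only the fields present in the patch. No field is derived from
// another, so changing one knob never silently resets the rest.
function updateSlicing(
  state: SlicingState,
  patch: Partial<SlicingState>
): SlicingState {
  return { ...state, ...patch };
}

// Switching to manual mode leaves the chosen columns and cluster count untouched.
const withCols = updateSlicing(initialState, { baseCols: ["signedError"] });
const manual = updateSlicing(withCols, { isManualSegmentation: true });
```

Fields that apply to only one mode (`nClusters`, `segmentFilters`) are kept in state regardless of the current mode, so toggling back and forth preserves the user's earlier choices.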

Milestone

To validate the success of the change, we will evaluate how the 2 user tasks in the "Use cases" section can be achieved.

Appendix

A complete list of variables in the slicing logic

  1. Ways to define “performance column”
  2. Column type to segment on
    a. Feature column
    b. Performance column
    c. Prediction column
    d. Create a new column (e.g. delta between 2 performance columns)
  3. Data segmentation strategies
    a. Auto segmentation (through k-means)
    b. Manual segmentation (through defining filter values)
  4. Number of columns to segment on
    a. Single column
    b. Multiple columns

Items in 1, 2, 3, 4 are independent, e.g. you can have 1a + 2a + 3a + 4a, or 1a + 2b + 3a + 4b, based on specific needs.

Code structure

Data slicing is only part of the logic in the application. Conceptually, the application's functionality will be structured into the following components. (We are not actively working on the refactoring; restructuring will be done piecemeal so that functionality work takes priority.)

  • Data generation: updating the performance metric, computing the performance score, computing the delta between 2 performance columns, etc. These actions cause changes in the data field.
  • Data slicing: toggling auto/manual data slicing, choosing base columns to slice, configuring segmentation filters, etc. These actions don't change the data field but do change data subsets.
  • Visualization configuration: changing which feature column to color by, etc. These actions don't change data slices but do change the display.
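
The three-way split above could be sketched as a reducer in which each kind of action touches exactly one part of the state. The action names, the `AppState` shape, and the sample values below are hypothetical and chosen only to illustrate the separation; they are not Manifold's actual actions.

```typescript
// Hypothetical action types, one per component described above.
type DataGenerationAction = { type: "UPDATE_METRIC"; metric: string };
type DataSlicingAction = { type: "SET_MANUAL_SEGMENTATION"; isManual: boolean };
type VisualizationAction = { type: "SET_COLOR_BY"; column: string };
type Action = DataGenerationAction | DataSlicingAction | VisualizationAction;

// Assumed state layout: one sub-object per component.
interface AppState {
  data: { metric: string };
  slicing: { isManualSegmentation: boolean };
  display: { colorBy: string | null };
}

function reducer(state: AppState, action: Action): AppState {
  switch (action.type) {
    case "UPDATE_METRIC":
      // Data generation: changes the data field.
      return { ...state, data: { ...state.data, metric: action.metric } };
    case "SET_MANUAL_SEGMENTATION":
      // Data slicing: changes how data is subset, not the data itself.
      return {
        ...state,
        slicing: { ...state.slicing, isManualSegmentation: action.isManual },
      };
    case "SET_COLOR_BY":
      // Visualization configuration: changes display settings only.
      return { ...state, display: { ...state.display, colorBy: action.column } };
    default:
      return state;
  }
}

// Made-up starting state.
const start: AppState = {
  data: { metric: "absoluteError" },
  slicing: { isManualSegmentation: false },
  display: { colorBy: null },
};

const afterSlicing = reducer(start, {
  type: "SET_MANUAL_SEGMENTATION",
  isManual: true,
});
const afterColor = reducer(afterSlicing, { type: "SET_COLOR_BY", column: "age" });
```

Because untouched sub-objects keep their identity, a downstream consumer could use a reference check to skip recomputing slices when only the display configuration changed.
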
Firenze11 added the enhancement and refactoring labels Sep 9, 2019
Firenze11 removed the refactoring label Sep 18, 2019

Firenze11 commented:

#34
