New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flexible data slicing #34
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Created a helper function
validateAndSetDefaultStatesConfigurator
to execute the logic above.
Please clarify a thing. Where should the validateAndSetDefaultStatesConfigurator
function to be used. Since this function modifies the state, I suppose it shouldn't be used inside a reducer, but rather inside the components which modify the related states (isManualSegmentation
, etc), e.g. via triggering an action insidecomponentDidUpdate
. If this is the case, you must prepare for the complexity that this type of pattern introduces.
@@ -388,6 +358,32 @@ export function zipObjects(arrays, joinField, rename) { | |||
}); | |||
} | |||
|
|||
export function product(collectionArr) { | |||
let result = []; | |||
function recur(collection) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this have to be done in a recursive way. Maybe a simple loop does the job more efficiently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no space or time complexity difference between recursion and iterative approach.
// columns: array of array of columns; fields: array of field metadata | ||
data: {columns: [], fields: []}, | ||
// collumnTypeRanges: map from "column type" (x, yPred, etc) to array of 2 elements indicating the start and end index of that column type in dataset | ||
// map from "column type" (x, yPred, etc) to array of 2 elements indicating the start and end index of that column type in dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we use start
and end
indices here, are we assuming the columns of the same type will be provided in consecutive order? or we will sort the columns for the user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we are assuming the columns of the same type will be provided in consecutive order. We don't remove or insert columns (only replace or append) since we refer to columns by ids, and want to ensure the reference is still valid after replacing or appending.
const nonGroup = segmentIds.filter( | ||
e => !segmentGroups[0].includes(e) && !segmentGroups[1].includes(e) | ||
); | ||
return nonGroup.concat(segmentGroups[1]).concat(segmentGroups[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we comment on the order here?
[nonGroup, segmentGroups[1], segmentGroups[0]]
why 1
and then 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const clusteringInputDataset = gatherDataset(data, colIds); | ||
|
||
const {columns} = clusteringInputDataset; | ||
const clusterIds = computeClusters(columns, nClusters, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not in this diff but what do you think of using one object as the input parameter, instead of three parameters? such that the order of the params won't matter and it is easier to infer the keys (e.g. what are we putting the true
for here?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, true
was an indicator of whether input shape is [nDataPoints, nFeatures] or [nFeatures, nDataPoints]. Was added as a quick fix when we did the refactoring to migrate to column based data format. Will update computeClusters
so we don't have to use true
}, | ||
], | ||
}; | ||
} | ||
|
||
/** | ||
* compute model score columns given yPred, yTrue columns | ||
* @param {Array<Array<Number|String>} yPred prediction columns, from all models and all classes | ||
* @param {Array<Array<Number>} yPred prediction columns, from all models and all classes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you explain this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a mistake before, our documentation https://github.com/uber/manifold#upload-csv-to-demo-app maintains that values in yPred
should always be numbers
segmentFilter.length === baseCols.length && | ||
segmentFilter.every((filter, i) => { | ||
const colId = filter.key; | ||
// each baseCol and their corresponding filter must be in the same order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides returning false, shall we provide (e.g. console.debug or assert) detailed info like this?
Easier to debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was thinking about enabling some logging in validateAndSetDefaultStateSingle
or validateAndSetDefaultStatesConfigurator
when env === 'development'. Probably not in isValid...
functions themselves since this type of invalidity are not an indicator of bugs, but a legal invalidation due to some other states get changed.
|
||
case FEATURE_TYPE.CATEGORICAL: | ||
return [[domain[0]], [domain[1]]]; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do we derive the default filter value for FEATURE_TYPE.FUNC
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean FILTER_TYPE.FUNC
? This has not been implemented. Currently FEATURE_TYPE
only has a few options https://github.com/uber/manifold/blob/master/modules/mlvis-common/src/constants/index.js#L14 and only CATEGORICAL
and NUMERICAL
are allowed to have corresponding filters.
return ( | ||
filterType === FILTER_TYPE.INCLUDE && | ||
// filter must consain a subset of column domain | ||
value.length < domain.length && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can value.length === domain.length
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we assume a filter should contain only a subset of original data.
* @return {boolean} | ||
*/ | ||
export const isValidSegmentFilterFromFieldDef = (filter, field) => { | ||
const {type: filterType, value} = filter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: shall we rename value
to range
in this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since filter.value could also be a function, let's not rename it to "range"
const nSegments = isManualSegmentation ? segmentFilters.length : nClusters; | ||
for (let i = 0; i < segmentGroups.length; i++) { | ||
if (!segmentGroups[i].length) { | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's provide more detail on why the isValidSegmentGroups
check fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed comments
segmentFilter.length === baseCols.length && | ||
segmentFilter.every((filter, i) => { | ||
const colId = filter.key; | ||
// each baseCol and their corresponding filter must be in the same order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was thinking about enabling some logging in validateAndSetDefaultStateSingle
or validateAndSetDefaultStatesConfigurator
when env === 'development'. Probably not in isValid...
functions themselves since this type of invalidity are not an indicator of bugs, but a legal invalidation due to some other states get changed.
|
||
case FEATURE_TYPE.CATEGORICAL: | ||
return [[domain[0]], [domain[1]]]; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean FILTER_TYPE.FUNC
? This has not been implemented. Currently FEATURE_TYPE
only has a few options https://github.com/uber/manifold/blob/master/modules/mlvis-common/src/constants/index.js#L14 and only CATEGORICAL
and NUMERICAL
are allowed to have corresponding filters.
* @return {boolean} | ||
*/ | ||
export const isValidSegmentFilterFromFieldDef = (filter, field) => { | ||
const {type: filterType, value} = filter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since filter.value could also be a function, let's not rename it to "range"
const nSegments = isManualSegmentation ? segmentFilters.length : nClusters; | ||
for (let i = 0; i < segmentGroups.length; i++) { | ||
if (!segmentGroups[i].length) { | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments
@@ -388,6 +358,32 @@ export function zipObjects(arrays, joinField, rename) { | |||
}); | |||
} | |||
|
|||
export function product(collectionArr) { | |||
let result = []; | |||
function recur(collection) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no space or time complexity difference between recursion and iterative approach.
This reverts commit 125729c.
* Enable flexible data slicing mechanism, based on all of the following fields in `state`: * `isManualSegmentation`: whether to apply manual (filter-based) or automatic (kmeans) data slicing * `baseCols`: use which columns to slice (either through creating filters for these columns, or through inputting them to kmeans clustering) * `nClusters`: number of clusters to use in automatic slicing (only applicable to automatic slicing) * `segmentFilters`: filter logic corresponding to data segment (only applicable to manual slicing) * `segmentGroups`: which segments to group together for comparing against each other * Enable automatic updating all fields mentioned above once one of the fields gets changed (usually this happens when user triggers an action to update one of the fields) * States form a dependency chain: `{data, isManualSegmentation} -> {baseCols} -> {nClusters, segmentFilters} -> {segmentGroups}`. * Any change in upstream states could cause changes in downstream states (e.g. change in `data` will result in change in `baseCols`, if the current value of `baseCols` becomes invalid given the current value of `data` * Downstream states change won't affect upstream states (e.g changing `segmentGroups` won't cause `nClusters` to become invalid * States update logic: If a state A changes, for one its downstream states B, check if it has become invalid, if so, set it to default based on all other current upstream states; if still valid, do not change the value; change in B could result in further changes in C, so we need to recurse on all downstream states. * Created a helper function `validateAndSetDefaultStatesConfigurator` to execute the logic above.
state
:isManualSegmentation
: whether to apply manual (filter-based) or automatic (kmeans) data slicingbaseCols
: use which columns to slice (either through creating filters for these columns, or through inputting them to kmeans clustering)nClusters
: number of clusters to use in automatic slicing (only applicable to automatic slicing)segmentFilters
: filter logic corresponding to data segment (only applicable to manual slicing)segmentGroups
: which segments to group together for comparing against each other{data, isManualSegmentation} -> {baseCols} -> {nClusters, segmentFilters} -> {segmentGroups}
.data
will result in change inbaseCols
, if the current value ofbaseCols
becomes invalid given the current value ofdata
segmentGroups
won't causenClusters
to become invalidvalidateAndSetDefaultStatesConfigurator
to execute the logic above.