Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flexible data slicing #34

Merged
merged 5 commits into from Sep 16, 2019
Merged

Flexible data slicing #34

merged 5 commits into from Sep 16, 2019

Conversation

Firenze11
Copy link
Contributor

  • Enable flexible data slicing mechanism, based on all of the following fields in state:
    • isManualSegmentation: whether to apply manual (filter-based) or automatic (kmeans) data slicing
    • baseCols: use which columns to slice (either through creating filters for these columns, or through inputting them to kmeans clustering)
    • nClusters: number of clusters to use in automatic slicing (only applicable to automatic slicing)
    • segmentFilters: filter logic corresponding to data segment (only applicable to manual slicing)
    • segmentGroups: which segments to group together for comparing against each other
  • Enable automatic updating all fields mentioned above once one of the fields gets changed (usually this happens when user triggers an action to update one of the fields)
    • States form a dependency chain: {data, isManualSegmentation} -> {baseCols} -> {nClusters, segmentFilters} -> {segmentGroups}.
    • Any change in upstream states could cause changes in downstream states (e.g. change in data will result in change in baseCols, if the current value of baseCols becomes invalid given the current value of data
    • Downstream states change won't affect upstream states (e.g changing segmentGroups won't cause nClusters to become invalid
    • States update logic: If a state A changes, for one its downstream states B, check if it has become invalid, if so, set it to default based on all other current upstream states; if still valid, do not change the value; change in B could result in further changes in C, so we need to recurse on all downstream states.
    • Created a helper function validateAndSetDefaultStatesConfigurator to execute the logic above.

Copy link
Collaborator

@kenns29 kenns29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Created a helper function validateAndSetDefaultStatesConfigurator to execute the logic above.

Please clarify a thing. Where should the validateAndSetDefaultStatesConfigurator function to be used. Since this function modifies the state, I suppose it shouldn't be used inside a reducer, but rather inside the components which modify the related states (isManualSegmentation, etc), e.g. via triggering an action insidecomponentDidUpdate. If this is the case, you must prepare for the complexity that this type of pattern introduces.

modules/manifold/src/utils/default-states.js Show resolved Hide resolved
modules/manifold/src/utils/default-states.js Outdated Show resolved Hide resolved
@@ -388,6 +358,32 @@ export function zipObjects(arrays, joinField, rename) {
});
}

export function product(collectionArr) {
let result = [];
function recur(collection) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have to be done in a recursive way. Maybe a simple loop does the job more efficiently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no space or time complexity difference between recursion and iterative approach.

jest.config.js Outdated Show resolved Hide resolved
modules/manifold/src/constants/constants.js Outdated Show resolved Hide resolved
// columns: array of array of columns; fields: array of field metadata
data: {columns: [], fields: []},
// collumnTypeRanges: map from "column type" (x, yPred, etc) to array of 2 elements indicating the start and end index of that column type in dataset
// map from "column type" (x, yPred, etc) to array of 2 elements indicating the start and end index of that column type in dataset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we use start and end indices here, are we assuming the columns of the same type will be provided in consecutive order? or we will sort the columns for the user?

Copy link
Contributor Author

@Firenze11 Firenze11 Sep 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we are assuming the columns of the same type will be provided in consecutive order. We don't remove or insert columns (only replace or append) since we refer to columns by ids, and want to ensure the reference is still valid after replacing or appending.

const nonGroup = segmentIds.filter(
e => !segmentGroups[0].includes(e) && !segmentGroups[1].includes(e)
);
return nonGroup.concat(segmentGroups[1]).concat(segmentGroups[0]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we comment on the order here?
[nonGroup, segmentGroups[1], segmentGroups[0]]
why 1 and then 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image
groups are ordered reversely, we assume user cares most about segmentGroups[0], then segmentGroups[1], then nonGroup.

const clusteringInputDataset = gatherDataset(data, colIds);

const {columns} = clusteringInputDataset;
const clusterIds = computeClusters(columns, nClusters, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not in this diff but what do you think of using one object as the input parameter, instead of three parameters? such that the order of the params won't matter and it is easier to infer the keys (e.g. what are we putting the true for here?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, true was an indicator of whether input shape is [nDataPoints, nFeatures] or [nFeatures, nDataPoints]. Was added as a quick fix when we did the refactoring to migrate to column based data format. Will update computeClusters so we don't have to use true

},
],
};
}

/**
* compute model score columns given yPred, yTrue columns
* @param {Array<Array<Number|String>} yPred prediction columns, from all models and all classes
* @param {Array<Array<Number>} yPred prediction columns, from all models and all classes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a mistake before, our documentation https://github.com/uber/manifold#upload-csv-to-demo-app maintains that values in yPred should always be numbers

modules/manifold/src/utils/data-processor.js Outdated Show resolved Hide resolved
segmentFilter.length === baseCols.length &&
segmentFilter.every((filter, i) => {
const colId = filter.key;
// each baseCol and their corresponding filter must be in the same order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides returning false, shall we provide (e.g. console.debug or assert) detailed info like this?
Easier to debug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking about enabling some logging in validateAndSetDefaultStateSingle or validateAndSetDefaultStatesConfigurator when env === 'development'. Probably not in isValid... functions themselves since this type of invalidity are not an indicator of bugs, but a legal invalidation due to some other states get changed.


case FEATURE_TYPE.CATEGORICAL:
return [[domain[0]], [domain[1]]];

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we derive the default filter value for FEATURE_TYPE.FUNC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean FILTER_TYPE.FUNC? This has not been implemented. Currently FEATURE_TYPE only has a few options https://github.com/uber/manifold/blob/master/modules/mlvis-common/src/constants/index.js#L14 and only CATEGORICAL and NUMERICAL are allowed to have corresponding filters.

modules/manifold/src/utils/default-states.js Outdated Show resolved Hide resolved
return (
filterType === FILTER_TYPE.INCLUDE &&
// filter must consain a subset of column domain
value.length < domain.length &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can value.length === domain.length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we assume a filter should contain only a subset of original data.

* @return {boolean}
*/
export const isValidSegmentFilterFromFieldDef = (filter, field) => {
const {type: filterType, value} = filter;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shall we rename value to range in this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since filter.value could also be a function, let's not rename it to "range"

const nSegments = isManualSegmentation ? segmentFilters.length : nClusters;
for (let i = 0; i < segmentGroups.length; i++) {
if (!segmentGroups[i].length) {
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's provide more detail on why the isValidSegmentGroups check fails.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments

@Firenze11
Copy link
Contributor Author

Firenze11 commented Sep 16, 2019

Please clarify a thing. Where should the validateAndSetDefaultStatesConfigurator function to be used. Since this function modifies the state, I suppose it shouldn't be used inside a reducer, but rather inside the components which modify the related states (isManualSegmentation, etc), e.g. via triggering an action insidecomponentDidUpdate. If this is the case, you must prepare for the complexity that this type of pattern introduces.

validateAndSetDefaultStatesConfigurator doesn't mutate state, and should be used in reducer. In particular, it should be used in action handlers such as handleUpdateFieldA, where updating fieldA would cause other fields in current state state.fieldB state.fieldC to be invalidated:

const validateAndSetDefaultStates = validateAndSetDefaultStatesConfigurator(
    ['fieldB', 'fieldC'],
    isValidFuncs,
    setDefaultFuncs
);
const handleUpdateFieldA = (state, action) => {
    return validateAndSetDefaultStates({
        ...state,
        fieldA: action.payload,
    });
}

Copy link
Contributor Author

@Firenze11 Firenze11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed comments

segmentFilter.length === baseCols.length &&
segmentFilter.every((filter, i) => {
const colId = filter.key;
// each baseCol and their corresponding filter must be in the same order
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking about enabling some logging in validateAndSetDefaultStateSingle or validateAndSetDefaultStatesConfigurator when env === 'development'. Probably not in isValid... functions themselves since this type of invalidity are not an indicator of bugs, but a legal invalidation due to some other states get changed.


case FEATURE_TYPE.CATEGORICAL:
return [[domain[0]], [domain[1]]];

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean FILTER_TYPE.FUNC? This has not been implemented. Currently FEATURE_TYPE only has a few options https://github.com/uber/manifold/blob/master/modules/mlvis-common/src/constants/index.js#L14 and only CATEGORICAL and NUMERICAL are allowed to have corresponding filters.

* @return {boolean}
*/
export const isValidSegmentFilterFromFieldDef = (filter, field) => {
const {type: filterType, value} = filter;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since filter.value could also be a function, let's not rename it to "range"

const nSegments = isManualSegmentation ? segmentFilters.length : nClusters;
for (let i = 0; i < segmentGroups.length; i++) {
if (!segmentGroups[i].length) {
return false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments

@@ -388,6 +358,32 @@ export function zipObjects(arrays, joinField, rename) {
});
}

export function product(collectionArr) {
let result = [];
function recur(collection) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no space or time complexity difference between recursion and iterative approach.

@Firenze11 Firenze11 changed the base branch from publish to master September 16, 2019 20:23
@Firenze11 Firenze11 changed the base branch from master to publish September 16, 2019 20:24
@Firenze11 Firenze11 merged commit 125729c into publish Sep 16, 2019
@Firenze11 Firenze11 mentioned this pull request Sep 16, 2019
Firenze11 added a commit that referenced this pull request Sep 16, 2019
Firenze11 added a commit that referenced this pull request Sep 16, 2019
* Enable flexible data slicing mechanism, based on all of the following fields in `state`:
  * `isManualSegmentation`: whether to apply manual (filter-based) or automatic (kmeans) data slicing
  * `baseCols`: use which columns to slice (either through creating filters for these columns, or through inputting them to kmeans clustering)
  * `nClusters`: number of clusters to use in automatic slicing (only applicable to automatic slicing)
  * `segmentFilters`: filter logic corresponding to data segment (only applicable to manual slicing)
  * `segmentGroups`: which segments to group together for comparing against each other
* Enable automatic updating all fields mentioned above once one of the fields gets changed (usually this happens when user triggers an action to update one of the fields)
  * States form a dependency chain: `{data, isManualSegmentation} -> {baseCols} -> {nClusters, segmentFilters} -> {segmentGroups}`.
  * Any change in upstream states could cause changes in downstream states (e.g. change in `data` will result in change in `baseCols`, if the current value of `baseCols` becomes invalid given the current value of `data`
  * Downstream states change won't affect upstream states (e.g changing `segmentGroups` won't cause `nClusters` to become invalid
  * States update logic: If a state A changes, for one its downstream states B, check if it has become invalid, if so, set it to default based on all other current upstream states; if still valid, do not change the value; change in B could result in further changes in C, so we need to recurse on all downstream states.
  * Created a helper function `validateAndSetDefaultStatesConfigurator` to execute the logic above.
@Firenze11 Firenze11 mentioned this pull request Sep 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants