Request to improve the documentation #97

kaurao · 2023-12-11T15:17:40Z

@kwinkunks Thanks a lot for creating a nice package.
Here are some more detailed comments that I think could be helpful.

wasserstein could return a pandas DataFrame with appropriate index and column names.
It would be helpful to see an example where the data is not identically distributed by construction, e.g. after applying different StandardScalers.
The sentence "This shows us that the distributions of the PE log in well indices 6 and 7 are somewhat different and may be anomalous." should be revised to match the text/str indices shown in the figure above.
is_correlated: how does this work? Which correlation is used? How does it convert the correlation value to a binary outcome? How are the chunks correlated?
The sentence "That is, shuffling the data removes the correlation, but does not mean the records are independent." is unclear and confusing. How is a user supposed to make sense of the output of this function?
feature_importances: it is unclear how the order of the output relates to the input. Also, do lower or higher values mean higher importance?
The API documentation says "In each case, the n normalized importances with the most variance are averaged.". This is unclear; it would help users to have more precise information on what was done, e.g. how were the scores normalized? If something like z-scoring was done, would that be appropriate, given that it could make the importance scores negative?
Also this combining of importance scores is non-standard, at least I have not seen it commonly used. So it will be helpful to have references for this approach.
More generally, the documentation should provide more details on what exactly the functions are doing.
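
To make the first two suggestions concrete, here is a minimal sketch of a labelled pairwise result on data that is not identically distributed by construction. It is not redflag's implementation: `pairwise_wasserstein` is a hypothetical helper, the well names are made up, and it relies only on `scipy.stats.wasserstein_distance` for 1D samples.

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance


def pairwise_wasserstein(values, groups):
    """Hypothetical helper (not part of redflag): pairwise 1D Wasserstein
    distances between groups, returned as a labelled DataFrame."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups).tolist()  # sorted group labels
    dist = pd.DataFrame(0.0, index=labels, columns=labels)
    for a in labels:
        for b in labels:
            dist.loc[a, b] = wasserstein_distance(
                values[groups == a], values[groups == b]
            )
    return dist


# Synthetic data: wells A and B share a distribution; well C is shifted.
rng = np.random.default_rng(42)
values = np.concatenate([
    rng.normal(0.0, 1.0, 500),
    rng.normal(0.0, 1.0, 500),
    rng.normal(3.0, 1.0, 500),
])
groups = np.repeat(["well-A", "well-B", "well-C"], 500)

dist = pairwise_wasserstein(values, groups)
print(dist.round(2))
```

With row and column labels taken from the group names, the anomalous well can be read directly from the table rather than inferred from positional indices.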

ping openjournals/joss-reviews#6065


Part of #97

Part of #97. The goal is to make the function more explainable. Moved the aggregation function to utils.py where it can be inspected more easily. By default it's a simple sum, but for the feature importance measure it normalizes before summing over features, then normalizes the result so the importances sum to 1. Borda counts are an option too but I think scores are best for importance, to preserve the magnitudes of the coefficients. Also switched to LinearRegression instead of Lasso, which should really be fitted for alpha, see #98.
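
The aggregation described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in utils.py: the function name `aggregate_importances` is hypothetical, and the comment does not say which per-method normalization is used, so the sketch assumes each method's absolute scores are scaled to sum to 1.

```python
import numpy as np


def aggregate_importances(scores):
    """Hypothetical sketch of score-based aggregation (not redflag's code).

    scores: array-like of shape (n_methods, n_features), one row per
    importance method, columns in the same order as the input features.
    """
    # Assumption: importance is measured by magnitude, so take absolute values.
    scores = np.abs(np.asarray(scores, dtype=float))
    # Assumed normalization: scale each method's scores to sum to 1.
    # (A row of all zeros would divide by zero; not handled in this sketch.)
    normalized = scores / scores.sum(axis=1, keepdims=True)
    # Sum the normalized scores for each feature across methods...
    combined = normalized.sum(axis=0)
    # ...then renormalize so the final importances sum to 1.
    return combined / combined.sum()


# Two methods scoring three features.
scores = [
    [4.0, 1.0, 0.0],
    [0.5, 0.3, 0.2],
]
importances = aggregate_importances(scores)
print(importances)  # [0.65 0.25 0.1 ]
```

In this sketch the output keeps the input feature order, higher values mean more important, and no value can be negative because only magnitudes are scaled.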

kaurao added the bug label Dec 11, 2023

kwinkunks added a commit that referenced this issue Dec 17, 2023

Improve is_correlated docstring

1a995b4

Part of #97

kwinkunks added a commit that referenced this issue Dec 21, 2023

Intermediate towards #97

dcb61a5

kwinkunks mentioned this issue Apr 9, 2024

[REVIEW]: Redflag: machine learning safety by design openjournals/joss-reviews#6065

Open
