Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request to improve the documentation #97

Open
kaurao opened this issue Dec 11, 2023 · 0 comments
Open

Request to improve the documentation #97

kaurao opened this issue Dec 11, 2023 · 0 comments
Labels
bug Something isn't working

Comments

@kaurao
Copy link

kaurao commented Dec 11, 2023

@kwinkunks Thanks a lot for creating a nice package.
Here are some more detailed comments that I think could be helpful.

  1. wasserstein could return a pandas DataFrame with appropriate index and column names.
    It will be helpful to see an example where the data is not identically distributed by construction, e.g. after applying different SrandardScalers?
    The sentence "This shows us that the distributions of the PE log in well indices 6 and 7 are somewhat different and may be anomalous." should be revised to match with the text/str indices as shown in the figure above.

  2. is_correlated how does this work? Which correlation is used? How does it convert the correlation value to a binary outcome? How are the chunks correlated?
    The sentence "That is, shuffling the data removes the correlation, but does not mean the records are independent." is unclear and confusing. How is a user supposed to make sense of the output of this function?

  3. feature_importances it is unclear how the order of the output is related to the input. Also do the lower or higher values mean higher importance?
    The API documentation says "In each case, the n normalized importances with the most variance are averaged.". This is unclear and will be helpful for a user to get more precise information on what was done like how were the scores normalized? If something like Z-scoring was done then would that be appropriate as it could change the importance scores to negative.
    Also this combining of importance scores is non-standard, at least I have not seen it commonly used. So it will be helpful to have references for this approach.

  4. More genereally, the documentation should provide more details on what the functions are exactly doing.

ping openjournals/joss-reviews#6065

@kaurao kaurao added the bug Something isn't working label Dec 11, 2023
kwinkunks added a commit that referenced this issue Dec 17, 2023
kwinkunks added a commit that referenced this issue Dec 17, 2023
Part of #97. The goal is to make the function more explainable.
Moved the aggregation function to utils.py where it can be
inspected more easily. By default it's a simple sum, but
for the feature importance measure it normalizes before
summing over features, then normalizes the result so the
importances sum to 1. Borda counts are an option too but
I think scores are best for importance, to preserve the
magnitudes of the coefficients.

Also switched to LinearRegression instead of Lasso, which
should really be fitted for alpha, see #98.
kwinkunks added a commit that referenced this issue Dec 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant