Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve transparency concerning background samples #3650

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

CloseChoice
Copy link
Collaborator

@CloseChoice CloseChoice commented May 11, 2024

Overview

Closes #3461

Description of the changes proposed in this pull request:

Checklist

  • All pre-commit checks pass.
  • Unit tests added (if fixing a bug or adding a new feature)

@CloseChoice CloseChoice marked this pull request as ready for review May 11, 2024 19:46
if masker_length is not None and hasattr(self.masker, 'max_samples') and masker_length > self.masker.max_samples:
logging.warn(f"Your background dataset is larger than the max_samples {self.masker.max_samples}. You can hand over max_samples by explicitly "
"defining the masker to use: "
"`shap.Explainer(model, masker=maskers.Independent(X, max_samples=1000))`")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few suggestions:

  1. This seems possibly an odd place for the warning to be raised. Would it be more appropriate to raise it in the masker itself, where the subsampling actually occurs, for example in Tabluar.__init__()? I think this would be a bit tidier.

https://github.com/CloseChoice/shap/blob/b23058409eb05f5cb9e7e5cc7c59ea22b38a5695/shap/maskers/_tabular.py#L57-L58

  1. It could help to adopt the common pattern of using a module-specific logger, e.g. log = logging.getLogger(__name__)

  2. I believe .warn is deprecated in favour of .warning.

  3. I think the warning message could perhaps explain the consequences of having more samples (i.e. that the dataset will just be subsampled). Otherwise, this warning might be taken as an error that something is wrong and needs to be fixed by the user. E.g.:

log.warning(
    "The background dataset exceeds the masker's `max_samples`, "
    "so will be subsampled from {n} rows to {max_samples} rows."
)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's an odd place but with this approach we do not need to repeat code in the maskers itself but have one place where this check happens. Would you rather duplicate the code in the maskers (I believe we currently have 2 maskers with the max_samples keyword)?

As for the other points I totally agree and will implement these when we've settled the point above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. On balance I'd probably prefer to log a message in both maskers. That way, the same warning is raised whether the old shap_values or the new Explainer interface is used.

One other thing actually, going back to the discussion on #3461 : I think there is a risk that excessive logging could be annoying for most users. We had originally proposed logging.INFO, but here is logging.WARNING. I think the latter will display in an IPython or Jupyter output by default.

Is this desirable? IMO the subsampling is pretty usual behaviour, so "INFO" feels more appropriate. Subsampling will be invoked probably the majority of the time, so it seems excessive to have a warning message visible on most runs of the package! I would think a visible warning would only show for things that are unusual / unintended / require user attention.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be honest, might be that our behaviour is standard but I still think it's bad practice to change user behaviour without at least an info (I would even prefer a warning and think it would be better if the user manually configures the correct max_samples if the provided data exceeds the default.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's bad practice to change user behaviour without ...

Please would you elaborate on this point: does the subsampling code "change user behaviour"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the user selects 1000 rows, potentially a well balanced dataset and we sample this down to 100 destroying the balance. Even if something like this is standard behaviour I wouldn't consider this good practice. At least we should inform the user how to use the balanced dataset WITHOUT sampling it down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: increase transparency of background dataset sub-sampling
2 participants