
[DOC] Completing estimator class docstrings #1148

Closed
fkiraly opened this issue Jul 17, 2021 · 7 comments
Labels
documentation Documentation & tutorials

Comments


fkiraly commented Jul 17, 2021

Every estimator class should have a complete docstring.
This should be worked on class by class, and feel free to complete only individual rubrics if it is unclear what to fill in for the others.

A good estimator docstring should include rubrics:

  • one-line description (top), starting capitalized and ending with `.`
  • description paragraph - what is the algorithm?
  • Components block - only if there are estimator components. The list of components should be identical to the constructor arguments that are estimators (inheriting from BaseClassifier, BaseForecaster, etc.).
  • Parameters block - individual parameters listed as `param_name : type` plus an explanation; the explanation should include the value/structure convention if the expectation is more specific than the type alone, e.g., `n : int, integer between 0 and 42`.
    The list of Parameters should be identical to the constructor arguments that are not estimators.
  • Attributes block - these are the most important attributes of object instances which are not parameters or components. It should include attributes that correspond to the "fitted model".
  • Notes - details, formulae, academic references
  • Example - self-contained example on sktime internal toy data that runs
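As a quick orientation, a minimal skeleton covering these rubrics might look like the following. The class and parameter names here are hypothetical, purely for illustration, not an actual sktime estimator:

```python
class MyEstimator:
    """One-line description of the estimator, capitalized, ending with a period.

    Description paragraph: what does the algorithm do, at a high level?

    Parameters
    ----------
    n : int, default=10
        Integer between 0 and 42, controlling some aspect of the algorithm.

    Attributes
    ----------
    model_ : object
        The fitted model, populated by `fit`.

    Notes
    -----
    Details, formulae, and academic references go here.

    Example
    -------
    >>> est = MyEstimator()
    """

    def __init__(self, n=10):
        self.n = n
```

A Components block would be added only if some constructor arguments were themselves estimators.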

For formatting, we use the NumPy docstring style, though note that the rubrics are slightly different (because we are dealing with algorithms/estimators).
Also look at the extension template for the algorithm scitype for a "fill-in template" that algorithm implementers are using (or should be using).

Here's an example of a good class docstring:

class BOSSEnsemble(BaseClassifier):
    """Ensemble of bag of Symbolic Fourier Approximation Symbols (BOSS).

    Implementation of BOSS Ensemble from Schäfer (2015). [1]_

    Overview: Input "n" series of length "m", and BOSS performs a grid search over
    a set of parameter values, evaluating each with LOOCV. By default, it then
    retains all ensemble members within 92% of the best for use in the ensemble.
    There are three primary parameters:
        - alpha: alphabet size
        - w: window length
        - l: word length.

    For any combination, a single BOSS slides a window length "w" along the
    series. The w length window is shortened to an "l" length word through
    taking a Fourier transform and keeping the first l/2 complex coefficients.
    These "l" coefficients are then discretized into alpha possible values,
    to form a word of length "l". A histogram of words for each
    series is formed and stored.

    Fit involves finding "n" histograms.

    Predict uses 1 nearest neighbor with a bespoke BOSS distance function.

    Parameters
    ----------
    threshold : float, default=0.92
        Threshold used to determine which classifiers to retain. All classifiers
        within percentage `threshold` of the best one are retained.
    max_ensemble_size : int or None, default=500
        Maximum number of classifiers to retain. Will limit number of retained
        classifiers even if more than `max_ensemble_size` are within threshold.
    max_win_len_prop : int or float, default=1
        Maximum window length as a proportion of the series length.
    min_window : int, default=10
        Minimum window size.
    n_jobs : int, default=1
        The number of jobs to run in parallel for both `fit` and `predict`.
        ``-1`` means using all processors.
    random_state : int or None, default=None
        Seed for the random number generator.

    Attributes
    ----------
    n_classes : int
        Number of classes. Extracted from the data.
    n_instances : int
        Number of instances. Extracted from the data.
    n_estimators : int
        The final number of classifiers used. Will be <= `max_ensemble_size` if
        `max_ensemble_size` has been specified.
    series_length : int
        Length of all series (assumed equal).
    classifiers : list
        List of DecisionTree classifiers.

    See Also
    --------
    IndividualBOSS, ContractableBOSS

    Notes
    -----
    For the Java version, see
    `TSML <https://github.com/uea-machine-learning/tsml/blob/master/src/
    main/java/tsml/classifiers/dictionary_based/BOSS.java>`_.

    References
    ----------
    .. [1] Patrick Schäfer, "The BOSS is concerned with time series classification
       in the presence of noise", Data Mining and Knowledge Discovery, 29(6): 2015
       https://link.springer.com/article/10.1007/s10618-014-0377-7

    Example
    -------
    >>> from sktime.classification.dictionary_based import BOSSEnsemble
    >>> from sktime.datasets import load_italy_power_demand
    >>> X_train, y_train = load_italy_power_demand(split="train", return_X_y=True)
    >>> X_test, y_test = load_italy_power_demand(split="test", return_X_y=True)
    >>> clf = BOSSEnsemble()
    >>> clf.fit(X_train, y_train)
    BOSSEnsemble(...)
    >>> y_pred = clf.predict(X_test)
    """

mloning commented Jul 17, 2021

Running `pydocstyle sktime/ --config=setup.cfg | grep ./ | cut -d ':' -f1 | uniq` from a Unix command line should give you a list of all files that have incomplete docstrings.
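To see what the filtering half of that pipeline does, here is a sketch that feeds it simulated pydocstyle-style output (`path:line: code message`, one line per violation) via `printf`; the file names are made up, since the real command needs an sktime checkout with pydocstyle installed:

```shell
# Simulated pydocstyle output: one "path:line: CODE message" line per violation.
# grep ./ keeps only lines containing a path separator, cut takes the path
# (field 1 before ':'), and uniq collapses adjacent duplicates, leaving one
# entry per offending file.
printf 'sktime/a.py:1: D100 Missing docstring\nsktime/a.py:9: D102 Missing docstring\nsktime/b.py:3: D103 Missing docstring\n' \
    | grep ./ | cut -d ':' -f1 | uniq
```

Note that `uniq` only merges adjacent duplicates, which works here because pydocstyle reports violations grouped by file.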

@fkiraly fkiraly changed the title Completing estimator class docstrings [DOC] Completing estimator class docstrings Jul 17, 2021

RNKuhns commented Jul 17, 2021

@mloning and @fkiraly, I've got an update to the docstring above that fixes some formatting issues that would not render well in Sphinx (e.g. some of the formatting of the Parameters and Attributes sections). Note that this moves the reference to the paper to the References section, as specified by the NumPy docstring format. I also moved the reference to the Java version to the See Also section, but still need to figure out how to make the link work correctly there.

This also cleans up some typos, capitalization issues, and other minor things.

class BOSSEnsemble(BaseClassifier):
    """Ensemble of bag of Symbolic Fourier Approximation Symbols (BOSS).

    Implementation of BOSS Ensemble from Schäfer (2015). [1]_

    Overview: Input "n" series of length "m", and BOSS performs a grid search over
    a set of parameter values, evaluating each with LOOCV. By default, it then
    retains all ensemble members within 92% of the best for use in the ensemble.
    There are three primary parameters:
        - alpha: alphabet size
        - w: window length
        - l: word length.

    For any combination, a single BOSS slides a window length "w" along the
    series. The w length window is shortened to an "l" length word through
    taking a Fourier transform and keeping the first l/2 complex coefficients.
    These "l" coefficients are then discretized into alpha possible values,
    to form a word of length "l". A histogram of words for each
    series is formed and stored.

    Fit involves finding "n" histograms.

    Predict uses 1 nearest neighbor with a bespoke BOSS distance function.

    Parameters
    ----------
    threshold : float, default=0.92
        Threshold used to determine which classifiers to retain. All classifiers
        within percentage `threshold` of the best one are retained.
    max_ensemble_size : int or None, default=500
        Maximum number of classifiers to retain. Will limit number of retained
        classifiers even if more than `max_ensemble_size` are within threshold.
    max_win_len_prop : int or float, default=1
        Maximum window length as a proportion of the series length.
    min_window : int, default=10
        Minimum window size.
    n_jobs : int, default=1
        The number of jobs to run in parallel for both `fit` and `predict`.
        ``-1`` means using all processors.
    random_state : int or None, default=None
        Seed for the random number generator.

    Attributes
    ----------
    n_classes : int
        Number of classes. Extracted from the data.
    n_instances : int
        Number of instances. Extracted from the data.
    n_estimators : int
        The final number of classifiers used. Will be <= `max_ensemble_size` if
        `max_ensemble_size` has been specified.
    series_length : int
        Length of all series (assumed equal).
    classifiers : list
        List of DecisionTree classifiers.

    See Also
    --------
    :py:class:`IndividualBOSS`, :py:class:`ContractableBOSS`

    For the Java version, see
    `TSML <https://github.com/uea-machine-learning/tsml/blob/master/src/
    main/java/tsml/classifiers/dictionary_based/BOSS.java>`_.

    References
    ----------
    .. [1] Patrick Schäfer, "The BOSS is concerned with time series classification
       in the presence of noise", Data Mining and Knowledge Discovery, 29(6): 2015
       https://link.springer.com/article/10.1007/s10618-014-0377-7

    Example
    -------
    >>> from sktime.classification.dictionary_based import BOSSEnsemble
    >>> from sktime.datasets import load_italy_power_demand
    >>> X_train, y_train = load_italy_power_demand(split="train", return_X_y=True)
    >>> X_test, y_test = load_italy_power_demand(split="test", return_X_y=True)
    >>> clf = BOSSEnsemble()
    >>> clf.fit(X_train, y_train)
    BOSSEnsemble(...)
    >>> y_pred = clf.predict(X_test)
    """


fkiraly commented Jul 18, 2021

Looks good, @RNKuhns, thanks!
I cooked up mine just very quickly to help people in the sprint.

Would you mind:

  • adding the example
  • editing/overwriting my original post, so the "thing at the top" is the perfect docstring?


RNKuhns commented Jul 18, 2021

@fkiraly -- no problem. I've added the example (didn't copy that over correctly in my post) and updated your post at the top.


RNKuhns commented Jul 18, 2021

I was able to join the SciPy documentation sprint and figure out how to specify the link. I've updated both posts to include the correct link usage in "See Also".


RNKuhns commented Jul 20, 2021

I've made another minor tweak to the doc -- the references to related classifiers in See Also will now also work.


mloning commented Dec 4, 2021

I think we should transfer this issue into the relevant developer guide section and close it.

@fkiraly fkiraly closed this as completed Apr 18, 2022