Preprocessing

Robin van de Water edited this page Jun 2, 2023 · 2 revisions

Creating a preprocessing pipeline

Our preprocessing pipeline is designed to be as general as possible and supports custom implementations, each defined as a subclass of the Preprocessor class and passed as a command-line argument. We provide a default preprocessing pipeline for both classification and regression tasks. The snippet below shows the class structure of the default classification preprocessor; the private methods of this class are used to apply the feature generation steps. The abstract Preprocessor has two methods that need to be implemented: __init__() (which initializes the preprocessor and configures its settings) and apply(data, vars) (which returns the preprocessed data dictionary of features and labels for each of the train, validation, and test splits).

@gin.configurable("base_classification_preprocessor")
class DefaultClassificationPreprocessor(Preprocessor):
    def __init__(self, generate_features: bool = True, scaling: bool = True, use_static_features: bool = True):
        """
        Args:
            generate_features: Generate features for dynamic data.
            scaling: Scaling of dynamic and static data.
            use_static_features: Use static features.
        """
        ...

    def apply(self, data, vars):
        """
        Args:
            data: Train, validation and test data dictionary. Further divided in static, dynamic, and outcome.
            vars: Variables for static, dynamic, outcome.
        Returns:
            Preprocessed data.
        """
        ...
        return data

    def _process_static(self, data, vars):
        ...
        return data

    def _process_dynamic(self, data, vars):
        ...
        return data

    def _dynamic_feature_generation(self, data, dynamic_vars):
        ...
        return data
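
As a rough illustration of the subclassing pattern described above, the sketch below defines a custom preprocessor that standardizes dynamic variables using statistics from the training split only. It is a minimal, self-contained example rather than the project's implementation: the Preprocessor class here is a stand-in for the real abstract class, and the class name StandardScalingPreprocessor, the split keys ("train", "val", "test"), the "dynamic" key, and the use of plain Python lists for values are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from statistics import mean, pstdev


class Preprocessor(ABC):
    """Stand-in for the project's abstract Preprocessor (assumed interface)."""

    @abstractmethod
    def __init__(self):
        ...

    @abstractmethod
    def apply(self, data, vars):
        ...


class StandardScalingPreprocessor(Preprocessor):
    """Hypothetical custom preprocessor that z-scores dynamic variables."""

    def __init__(self, scaling: bool = True):
        # Configure the settings of the preprocessor.
        self.scaling = scaling

    def apply(self, data, vars):
        # data: {split: {"dynamic": {variable: [values]}}} (assumed layout)
        # vars: {"dynamic": [variable names]} (assumed layout)
        if not self.scaling:
            return data
        for name in vars["dynamic"]:
            # Fit on the training split only to avoid leaking
            # validation or test statistics into the scaling.
            train_values = data["train"]["dynamic"][name]
            mu = mean(train_values)
            sigma = pstdev(train_values) or 1.0  # guard against zero variance
            for split in data.values():
                split["dynamic"][name] = [
                    (value - mu) / sigma for value in split["dynamic"][name]
                ]
        return data
```

With this layout, calling StandardScalingPreprocessor().apply(data, vars) rewrites each listed dynamic variable in every split and returns the same dictionary. In the actual pipeline the splits presumably hold tabular data rather than lists, so the scaling step would need to be adapted accordingly.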