# Writing your own Extractor

This tutorial shows how you can extend RDV with your own custom feature extractors. 

Note that some outputs may not work when viewed on GitHub, since they use Dash apps that require a server running in the background. We recommend cloning this repo and executing the notebooks locally.

When writing your feature extractor, you need to implement the `rdv.extractors.Extractor` interface. This interface inherits from the `rdv.globals.Buildable` and `rdv.globals.Serializable` interfaces. The code snippet below shows the full interface that has to be implemented.

In [5]:
from raymon.profiling.extractors import Extractor


class KMeansOutlierScorer(Extractor):
    """Extractor Interface"""
    def extract_feature(self, data):
        """Extracts a feature from a data instance.

        Parameters
        ----------
        data : any
            The data instance you want to extract a feature from. The type is up to you.

        """
        # Do something
        return 1.0 # Return a float, int or str
    
    """Buildable interface"""
    def build(self, input, output, actual):
        """Your feature extractor must be Buildable.

        This means that it may use data to set some reference values, which are then used to calculate the feature to be extracted from a data sample. A good example of this is the `rdv.extractors.structured.KMeansOutlierScorer` extractor, which clusters the data at building time and saves those clusters as a reference in the object's state. If you don't require any buildable state, like the `rdv.extractors.structured.ElementExtractor`, don't do anything in this function.

        Parameters
        ----------
        data : any
            The set of data available at building time. Can be any type you want.
        Returns
        -------
        None
        """
        pass

    def is_built(self):
        """
        Check whether the object has been built. Typically, this method checks whether the required references for the object are set. If your Extractor does not use any references, simply return True.

        Returns
        -------
        is_built : bool
        """
        return True


    """Serializable interface"""
    def to_jcr(self):
        """Return a JSON compatible representation of the object. Will generally return a dict containing the object's state, but can return anything JSON serializable. json.dumps(xxx) will be called on the output xxx of this function."""
        # Return a json-compatible representation
        raise NotImplementedError()

    @classmethod
    def from_jcr(cls, jcr):
        """Given the JSON compatible representation from the function above, load an object of this type with the desired state.

        Parameters
        ----------
        jcr : [dict]
            The jcr representation returned from the `to_jcr` function above. Will generally be a dict, but can be anything JSON serializable.
            
        Returns
        -------
        obj : this
            An object with type of your extractor.
        """
        # Load a json-compatible representation
        raise NotImplementedError()

You can pass any object that implements the interface above as an Extractor. **You do need to make sure that the code definition is available at train and test time**, because RDV saves the classpath of every Extractor in the schema JSON when saving, and tries to load it when instantiating the Schema and its Features, as can be seen [here](https://github.com/raymon-ai/data-validation/blob/master/rdv/feature.py#L168).
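
To illustrate why the class definition must be importable at load time, here is a minimal standard-library sketch of saving and resolving a classpath. It is not RDV's actual code: the helper names `class_path` and `load_class` are hypothetical, and `fractions.Fraction` merely stands in for an extractor class.

```python
import importlib
from fractions import Fraction


def class_path(obj):
    """Return 'module.ClassName' for an object (hypothetical helper)."""
    cls = type(obj)
    return f"{cls.__module__}.{cls.__qualname__}"


def load_class(path):
    """Resolve a 'module.ClassName' string back to the class (hypothetical helper)."""
    module_name, class_name = path.rsplit(".", 1)
    # import_module raises ImportError when the module cannot be found,
    # which is what happens if your extractor's code is not available.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


saved_path = class_path(Fraction(1, 2))  # the string that would be stored in the JSON
loaded_cls = load_class(saved_path)      # the class recovered when loading

try:
    load_class("module_that_is_not_installed.MyExtractor")
    missing_definition_detected = False
except ImportError:
    missing_definition_detected = True
```

If the module that defines your extractor cannot be imported in the environment where the schema is loaded, the lookup fails in the same way as the last call in this sketch.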

You can use the existing extractors as a guide to implement your own. Good examples are the simple [ElementExtractor](https://github.com/raymon-ai/data-validation/blob/master/rdv/extractors/structured/element.py) and the more advanced [KMeansOutlierScorer](https://github.com/raymon-ai/data-validation/blob/master/rdv/extractors/structured/kmeans.py).
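
As a rough illustration of an extractor with buildable state, the sketch below implements the five methods of the interface on a plain class, so that it runs without RDV installed. The class `DistanceToMean` is hypothetical, and the sketch assumes that `build` receives a single collection of numbers; in a real project the class would subclass the library's `Extractor` and follow its `build` signature.

```python
import json
from statistics import mean


class DistanceToMean:
    """Hypothetical extractor: distance of a number to the mean seen at build time."""

    def __init__(self, reference_mean=None):
        self.reference_mean = reference_mean

    def extract_feature(self, data):
        # The feature is a float: how far this value lies from the reference mean.
        return abs(float(data) - self.reference_mean)

    def build(self, data):
        # Assumption: `data` is an iterable of numbers available at building time.
        self.reference_mean = float(mean(data))

    def is_built(self):
        # The extractor is built once its reference value has been set.
        return self.reference_mean is not None

    def to_jcr(self):
        # A plain dict of the object's state, safe to pass to json.dumps.
        return {"reference_mean": self.reference_mean}

    @classmethod
    def from_jcr(cls, jcr):
        return cls(reference_mean=jcr["reference_mean"])


extractor = DistanceToMean()
extractor.build([1.0, 2.0, 3.0])

# Round-trip through JSON, as would happen when a schema is saved and reloaded.
restored = DistanceToMean.from_jcr(json.loads(json.dumps(extractor.to_jcr())))
feature = restored.extract_feature(5)
```

Keeping all state in JSON-compatible types (here a single float) means `to_jcr` and `from_jcr` stay trivial; state such as NumPy arrays would first need converting to lists.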