ADS Sample Notebook.

Copyright (c) 2021 Oracle, Inc. All rights reserved. Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.

***
# <font color=red>Feature Type Validator</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle Cloud Infrastructure Data Science Service Team</font></p>

***

# Overview:

Feature types are a powerful way to abstract the nature of the data away from the way that it is represented on the computer. Feature type validators are a flexible way to improve and speed up your data validation process. They allow a feature type to be dynamically extended so that the data validation process is reproducible and can be shared across projects. `ADS` comes with various validation routines. However, the system is designed for you to create your own validation routines to use on your organization's feature types. This notebook shows you how to create feature type validators.


---

## Prerequisites:
- Experience with a specific topic: Novice
- Professional experience: Novice

---

## Objectives:

- <a href='#overview'>Overview</a>
- <a href="#list_validators">List Feature Type Validators</a>
    - <a href="#list_validators_ftm">List Feature Type Validators Using the Feature Type Manager</a>
    - <a href="#list_validators_fto">List Feature Type Validators on a Feature Type Object</a>
    - <a href="#list_validators_series">List Feature Type Validators on a Series</a>
    - <a href="#list_validators_df">List Feature Type Validators on a Dataframe</a>
- <a href="#using_handlers">Using Feature Type Validators</a>
    - <a href="#using_handlers_series">Using Feature Type Validators on a Series</a>
    - <a href="#using_handlers_fto">Using Feature Type Validators on a Feature Type Object</a>
- <a href="#default_handler">Default Feature Type Validator</a>
    - <a href="#create_handler">Create a Default Feature Type Validator</a> 
    - <a href='#validator_register'>Registering a Default Feature Type Validator</a>
- <a href='#condition_handler'>Condition Feature Type Validator</a>
    - <a href='#condition_handler_closed'>Closed Value Condition Feature Type Validator</a>   
    - <a href='#condition_handler_open'>Open Value Condition Feature Type Validator</a>
    - <a href='#condition_handler_disambiguation'>Disambiguation</a>
- <a href='#unregistering'>Unregistering a Feature Type Validator</a>
- <a href="#reference">References</a>    
    
---

**Important:**

Placeholder text for required values is surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name, `database_name = "<database_name>"` would become `database_name = "production"`.

---

Datasets are provided as a convenience. Datasets are considered third party content and are not considered materials under your agreement with Oracle applicable to the services. The [`orcl_attrition` dataset](oracle_data/UPL.txt) is distributed under the UPL license.


In [None]:
import ads
import numpy as np
import os
import pandas as pd
import re
from ads.feature_engineering import feature_type_manager
from ads.feature_engineering import FeatureType

<a id="overview"></a>
# Feature Type Validators

One aspect of exploratory data analysis (EDA) is to ensure that all the data is valid. For example, you may have credit card data and want to ensure that all the numbers are valid credit card numbers. The feature type validators are a way of performing this validation. There are built-in methods for the feature types that are provided by `ADS`, but the idea is for you to create these methods for your custom feature types.

Feature type validators are defined on the feature type level. You can define functions that can be applied to the columns and features of the same feature type and use them across your entire organization. The feature types that are provided by `ADS` come with a default set of handlers.

The feature type validators are a set of `.is_*()` methods, where `*` is generally the name of the feature type. For example, the method `.is_credit_card()` would be called to ensure that the data are all credit card numbers. A feature type validator returns a boolean pandas series that is the same length as the data. An element is `True` if the corresponding datum meets the criteria specified in the feature type validator; otherwise, it is `False`. The `.is_*()` method is called the **validator**.

The feature type validator system is extensible. You can have multiple validators for any feature type. For example, if you had a credit card feature type, you may want to have the validator `.is_credit_card()`. It would check a set of credit card numbers to make sure that they are valid credit card numbers. You may want to add other validators, like `.is_visa()` and `.is_mastercard()`, to determine if the credit card numbers are associated with Visa or Mastercard accounts.

The feature type validator can also be extended through the use of conditions. **Conditions** are a set of arguments that you define, and they allow you to have different sets of feature type validators. For example, if you wanted to see if a credit card is a Visa card, you could create a condition like `.is_credit_card(card_type='Visa')`. Then you register a handler with that condition, and it runs when you pass in that condition.

<a id="list_validators"></a>
# List Feature Type Validators

To make the association between the feature type, validator, condition, and feature type validator, the feature type validator must be registered with the feature type manager. The module `feature_type_manager` is used to manage these relationships.

<a id="list_validators_ftm"></a>
## List Feature Type Validators Using the Feature Type Manager

To list the current feature handlers and their conditions use the `feature_type_manager.validator_registered()` method. It shows the registered handlers in a dataframe format. The columns in the dataframe are:
- `Feature Type`: Feature type class name.
- `Validator`:    Validation functions that you can call to validate a pandas series.
- `Condition`:    Condition that the handler is registered in.
- `Handler`:      Registered handler.

In [None]:
feature_type_manager.validator_registered()

<a id="list_validators_fto"></a>
## List Feature Type Validators on a Feature Type Object

Each feature type object also has a `.validator.registered()` method that returns a dataframe with the validators, conditions, and feature type validators that are associated with the given feature type. 

The next cell uses the feature type manager to obtain the feature type object for the credit card feature type. The cell then obtains a list of validators, conditions, and feature type validators that are associated with the credit card feature type.

In [None]:
CreditCard = feature_type_manager.feature_type_object('credit_card')
CreditCard.validator.registered()

<a id="list_validators_series"></a>
## List Feature Type Validators on a Series

The `.validator_registered()` method can be used on a pandas series by calling `.ads.validator_registered()`.

The next cell creates a series that contains valid credit card numbers. The series has its feature type set to `credit_card`. The call to `series.ads.validator_registered()` reports multiple handlers because the series has multiple feature types associated with it (credit card and string).

In [None]:
series = pd.Series(["4532640527811543", "4556929308150929", "4539944650919740"], name='creditcard')
series.ads.feature_type = ['credit_card']
series.ads.validator_registered()

<a id="list_validators_df"></a>
## List Feature Type Validators on a Dataframe

Like a pandas series, `.validator_registered()` can be used on a dataframe to obtain information on the feature type validators that are associated with the columns of the dataframe. It only displays columns that have feature type validators associated with them.

The next cell loads a sample dataset into a pandas dataframe. The feature types are then assigned to these columns and `.ads.validator_registered()` is called on the dataframe.

In [None]:
attrition_path = os.path.join('/opt', 'notebooks', 'ads-examples', 'oracle_data', 'orcl_attrition.csv')
df = pd.read_csv(attrition_path, 
                 usecols=['Attrition', 'TravelForWork', 'JobFunction', 'EducationalLevel'])
df.ads.feature_type = {'Attrition': ['boolean', 'category'],
                       'TravelForWork': ['category'],
                       'JobFunction': ['category'],
                       'EducationalLevel': ['category']}

df.ads.validator_registered()

<a id="using_handlers"></a>
# Using Feature Type Validators

The goal of the feature type validator is to validate the data against some set of criteria. This can be done using the feature type object itself or on a pandas series.

<a id="using_handlers_series"></a>
## Using Feature Type Validators on a Series

For a pandas series, the feature type validator is invoked by using the name of the validator and any condition arguments that may be required. To do this, the series object calls `.ads.validator` followed by a call to the validator name. For example, `series.ads.validator.is_credit_card(starts_with='4')`, where `.is_credit_card()` is the validator name and `starts_with='4'` is the condition.

The next cell creates a pandas series that contains a set of valid credit card numbers along with a set of invalid numbers. This series has its feature type set to `credit_card` and invokes the `.is_credit_card()` feature type validator.

In [None]:
visa = ["4532640527811543", "4556929308150929", "4539944650919740", "4485348152450846", "4556593717607190"]
invalid = [np.nan, None, "", "123", "abc"]

series = pd.Series(visa + invalid, name='creditcard')
series.ads.feature_type = ['credit_card']
series.ads.validator.is_credit_card()

A series can have multiple feature type validators associated with it. In this example, `.is_string()` could have also been called. Since the return type is a pandas series, the `.any()` and `.all()` methods can be used to provide summary information about the validation.

The next cell checks whether every datum is a valid string. A result of `False` means that at least one datum is not a valid string.

In [None]:
series.ads.validator.is_string().all()
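
Because a validator returns an ordinary boolean pandas series, the usual pandas tools apply to the result. The following is a minimal sketch that does not call `ADS`; the `is_valid` series is a hypothetical stand-in for the output of a validator such as `.is_credit_card()`.

```python
import pandas as pd

# Hypothetical validator output: True where the element passed validation.
is_valid = pd.Series([True, True, False, True, False], name="creditcard")

all_valid = bool(is_valid.all())     # True only if every element passed
any_valid = bool(is_valid.any())     # True if at least one element passed
n_invalid = int((~is_valid).sum())   # number of elements that failed
invalid_positions = is_valid.index[~is_valid].tolist()  # where they failed

print(all_valid, any_valid, n_invalid, invalid_positions)
```

The positions of the failing elements can then be used to inspect or filter the original series.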

<a id="using_handlers_fto"></a>
## Using Feature Type Validators on a Feature Type Object

A feature type object can be used to invoke the feature type validator by using the name of the validator and any condition arguments that may be required. On the feature type object, call the validator through `.validator` and pass the series as the first argument. For example, `CreditCard.validator.is_credit_card(series, starts_with='4')`, where `.is_credit_card()` is the validator and `starts_with='4'` is the condition.

The next cell uses the feature type manager to obtain the feature type object for the credit card feature type. This object is used to call the feature type validator by passing in the pandas series that is to be assessed.

In [None]:
CreditCard = feature_type_manager.feature_type_object('credit_card')
CreditCard.validator.is_credit_card(series)

<a id="default_handler"></a>
# Default Feature Type Validator

The power of the feature type system is that you can quickly create new feature type validators to validate your data. This is a two step process:

1. Define a function that will act as the feature type validator.
1. Register the feature type validator.

Each feature type has a default handler that is called when no other handler can process a request.

<a id="create_handler"></a>
## Create a Default Feature Type Validator

A feature type validator is a function that respects these rules:

* It takes a series as a first argument.
* If there are any condition arguments, the `*args` and `**kwargs` should be used.
* It returns a boolean series that is the same length as the input series.
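
The rules above can be checked mechanically before a handler is registered. The following is a minimal sketch that uses only pandas; `follows_validator_contract`, `is_non_empty_handler`, and `drops_nulls_handler` are hypothetical names introduced for illustration and are not part of `ADS`.

```python
import pandas as pd

def follows_validator_contract(handler, sample: pd.Series, **conditions) -> bool:
    """Return True if handler(sample, **conditions) is a boolean Series of equal length."""
    result = handler(sample, **conditions)
    return (
        isinstance(result, pd.Series)
        and len(result) == len(sample)
        and pd.api.types.is_bool_dtype(result)
    )

def is_non_empty_handler(data: pd.Series, *args, **kwargs) -> pd.Series:
    """Hypothetical handler: True for non-null, non-empty strings."""
    return data.apply(lambda x: isinstance(x, str) and len(x) > 0)

def drops_nulls_handler(data: pd.Series, *args, **kwargs) -> pd.Series:
    """Hypothetical faulty handler: the result is shorter than the input."""
    return data.dropna().apply(lambda x: True)

sample = pd.Series(["4532640527811543", "", None])
print(follows_validator_contract(is_non_empty_handler, sample))
print(follows_validator_contract(drops_nulls_handler, sample))
```

The second handler breaks the contract because dropping null values changes the length of the result.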

To register your own handler, you need to define the handler and then register it to the feature type. If the handler already exists, there is no need to create a new one.

In this example, a new feature type validator, `is_visa_card_handler()`, is created. It checks to see if the credit card number is issued by Visa. The next cell creates the `is_visa_card_handler()` function, which tests whether each element in the `data` parameter is a valid Visa credit card number. It returns a boolean series that is the same length as `data`.

In [None]:
def is_visa_card_handler(data: pd.Series, *args, **kwargs) -> pd.Series:
    """
    Processes given data and indicates if the data matches Visa credit card.

    Parameters
    ----------
    data: pd.Series
        The data to process.

    Returns
    --------
    pd.Series: The logical list indicating if the data matches requirements.
    """
    _pattern_string = r"""
        ^(?:4[0-9]{12}(?:[0-9]{3})?         # Visa
        |  ^4[0-9]{12}(?:[0-9]{6})?$        # Visa 19 digit
        )$
    """
    PATTERN = re.compile(_pattern_string, re.VERBOSE)
    def _is_credit_card(x) -> bool:
        # x is a single element of the series, not the series itself.
        return (
            not pd.isnull(x)
            and PATTERN.match(str(x)) is not None
        )
    return data.apply(_is_credit_card)

<a id='validator_register'></a>
## Registering a Default Feature Type Validator

The feature type validator needs to be registered with the feature type. You do that using the `.validator.register()` method of the feature type object. The feature type manager is used to obtain a link to the feature type object.


The `.register()` method has the following parameters:
- `name`: The validator name that is used to invoke the feature type validator.
- `handler`: The function name of the default feature type validator.
- `replace`: The flag indicating if the registered handler should be replaced with the new one.

The next cell obtains the feature type object, `CreditCard`, and then registers the default feature type validator. If a validator with the same name exists, it is replaced. A call to `CreditCard.validator.registered()` returns the registered handlers for the credit card feature type.

In [None]:
CreditCard = feature_type_manager.feature_type_object('credit_card')
CreditCard.validator.register(name='is_visa_card', handler=is_visa_card_handler, replace=True)
CreditCard.validator.registered()

The next cell demonstrates how to use the `.is_visa_card()` feature type validator. A series is created that contains credit cards from Visa, Mastercard and American Express. The `.is_visa_card()` feature type validator is called and it returns a boolean series indicating which credit cards are issued by Visa.

In [None]:
visa = ["4532640527811543", "4556929308150929"]
mastercard = ["5406644374892259", "5440870983256218"]
amex = ["371025944923273", "374745112042294"]
series = pd.Series(visa + mastercard + amex, name='Credit Card')
series.ads.feature_type = ['credit_card']
CreditCard.validator.is_visa_card(series)

Using `CreditCard.validator.is_visa_card?`, with a question mark, shows that `is_visa_card` is an object that has several built-in methods, for example `.register()` and `.registered()`. These methods help you to extend the functionality of the handler.

In [None]:
CreditCard.validator.is_visa_card?

<a id='condition_handler'></a>
# Condition Feature Type Validator

Each feature type validator has a default handler that is called when no condition handler can handle a request. A condition feature type validator allows you to specify arbitrary parameters that are passed to the feature type system. The system examines these parameters and determines the best handler that should be dispatched. This determination is done by using the most restrictive set of parameter criteria.

Use the `.register()` method to register a condition handler. The parameter `name` gives a user-friendly name to the registered handler. The `condition` parameter specifies the conditions that must be met to invoke the handler. Conditions are user-defined parameters and values that identify the circumstances under which the handler is dispatched. The `handler` parameter is the function that is dispatched when the conditions are met. The parameter `replace` determines if the handler should replace one that matches the condition. If set to `True`, it replaces the existing handler if there is one. If set to `False`, it does not replace an existing matching handler and throws an error instead.

A feature type validator is a function that respects the following rules:

* It takes a series as a first argument.
* It has formal arguments for the conditions.
* It returns a boolean series that is the same length as the input series.

<a id='condition_handler_closed'></a>
## Closed Value Condition Feature Type Validator

Closed value condition feature type validators allow you to specify any number of key-value pairs to a condition handler and control which feature type validator is dispatched. However, when calling the handler, all of the key-value pairs must match.

The `condition` parameter of the `.register()` method can explicitly define key-value pairs that are used to determine which handler to dispatch. In the <a href="#default_handler">Default Feature Type Validator</a> section, the `is_visa_card` validator was created to determine if the credit cards were issued by Visa. It is possible to create the same effect by using a condition feature type validator on the `is_credit_card` validator using explicit key-value pairs. To do this, the `condition` parameter accepts a dictionary of key-value pairs where the key is the parameter name and the value is the parameter value. For example, `CreditCard.validator.register(name='is_credit_card', condition={"card_type": "Visa"}, handler=is_visa_card_handler, replace=True)` links the parameter `card_type` to the value `Visa`.

In the next cell, the credit card feature type has a condition handler registered. It uses the same handler, `is_visa_card_handler`, that was used to create the `is_visa_card` default feature type validator.

In [None]:
CreditCard = feature_type_manager.feature_type_object('credit_card')
CreditCard.validator.register(name='is_credit_card', condition={"card_type": "Visa"}, handler=is_visa_card_handler, replace=True)
CreditCard.validator.registered()

The next cell creates a series of credit card numbers and uses the parameter `card_type="Visa"` when calling the `is_credit_card` validator. Notice that only the first two elements are flagged as being issued by Visa. If the default handler had been called, all the returned values would have been `True` because they are all valid credit card numbers.

In [None]:
visa = ["4532640527811543", "4556929308150929"]
mastercard = ["5406644374892259", "5440870983256218"]
amex = ["371025944923273", "374745112042294"]
series = pd.Series(visa + mastercard + amex, name='Credit Card')
series.ads.feature_type = ['credit_card']
CreditCard.validator.is_credit_card(series, card_type="Visa")

The previous cell called the feature type validator using the feature type object, `CreditCard`. It can also be called using the pandas series.

In [None]:
series.ads.validator.is_credit_card(card_type="Visa")

With closed value condition feature type validators, the key and values must match what was registered. If they don't, the condition feature type validator isn't called. In the next cell, the value is set to `Mastercard` to cause the default handler to be called. 

In [None]:
CreditCard.validator.is_credit_card(series, card_type="Mastercard")

To register a closed value feature type validator that has multiple conditions, you use a dictionary with multiple key-value pairs. For example, to create a condition that checks that the country code is 1 and area code is 902, you could do the following:

```python
PhoneNumber.validator.register(name='is_phone_number', condition={"country_code": "1", "area_code": "902"},
                               handler=is_1_902_handler)
```

<a id='condition_handler_open'></a>
## Open Value Condition Feature Type Validator

Open value condition feature type validators are similar to their closed value counterparts except that the value is not used in the matching process.

To register an open value condition feature type validator, the same process is used as for the <a href='#condition_handler_closed'>Closed Value Condition Feature Type Validator</a> with the exception that a tuple is used to specify the conditions and no values are provided. For example, `CreditCard.validator.register(name='is_credit_card', condition=("card_type",), handler=is_any_card_handler)`.

This cell defines a feature type condition handler that accepts the card type as a parameter:

In [None]:
def is_any_card_handler(data: pd.Series, card_type: str) -> pd.Series:
    """
    Processes given data and indicates if the data matches the given credit card type.

    Parameters
    ----------
    data: pd.Series
        The data to process.
    card_type: str
        The card issuer to match: 'Visa', 'Mastercard', or 'Amex'.

    Returns
    --------
    pd.Series: The logical list indicating if the data matches requirements.
    """
    
    if card_type == 'Visa':
        _pattern_string = r"""
            ^(?:4[0-9]{12}(?:[0-9]{3})?         # Visa
            |  ^4[0-9]{12}(?:[0-9]{6})?$        # Visa 19 digit
            )$
        """    
    elif card_type == 'Mastercard':
        _pattern_string = r"""
            ^(?:5[1-5][0-9]{14}|(?:222[1-9]|22[3-9]\d|2[3-6]\d{2}|27[01]\d|2720)[0-9]{12})$
        """
        
    elif card_type == "Amex":
        _pattern_string = r"""
            ^3[47][0-9]{13}$
        """
    else:
        raise ValueError(f"Unsupported card_type: {card_type!r}")

    PATTERN = re.compile(_pattern_string, re.VERBOSE)
    def _is_credit_card(x) -> bool:
        # x is a single element of the series, not the series itself.
        return (
            not pd.isnull(x)
            and PATTERN.match(str(x)) is not None
        )
    return data.apply(_is_credit_card)
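
A handler that branches on a condition value can also be written with a lookup table of compiled patterns. The following is a sketch of that alternative, not the notebook's handler: `CARD_PATTERNS` and `card_type_handler_sketch` are hypothetical names, and the patterns only check the number format for the three card families used in this notebook.

```python
import re
import pandas as pd

# Format-only patterns for the card families used in this notebook.
CARD_PATTERNS = {
    "Visa": re.compile(r"^4[0-9]{12}(?:[0-9]{3}|[0-9]{6})?$"),
    "Mastercard": re.compile(
        r"^(?:5[1-5][0-9]{14}"
        r"|(?:222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12})$"
    ),
    "Amex": re.compile(r"^3[47][0-9]{13}$"),
}

def card_type_handler_sketch(data: pd.Series, card_type: str) -> pd.Series:
    """Hypothetical handler: True where the element matches the card_type pattern."""
    try:
        pattern = CARD_PATTERNS[card_type]
    except KeyError:
        raise ValueError(f"Unsupported card_type: {card_type!r}") from None
    # Null values never match; everything else is compared as a string.
    return data.apply(lambda x: not pd.isnull(x) and pattern.match(str(x)) is not None)

cards = pd.Series(
    ["4532640527811543", "4556929308150929",   # Visa
     "5406644374892259", "5440870983256218",   # Mastercard
     "371025944923273", "374745112042294"]     # American Express
)
print(card_type_handler_sketch(cards, card_type="Mastercard").tolist())
```

Adding support for another card family then means adding one dictionary entry instead of another `elif` branch.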

The next cell registers the open value feature type validator using a tuple. Notice that values for the `card_type` parameter aren't specified. However, the `is_any_card_handler` function has a formal argument for it, and the value of the parameter is passed into the handler. Also, note the trailing comma in `('card_type',)`, which is what makes Python treat the `condition` argument as a tuple. The output of the cell is the currently registered feature type validators.

In [None]:
CreditCard.validator.register(name='is_credit_card', condition=("card_type",), handler=is_any_card_handler)
CreditCard.validator.registered()
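
The trailing comma in a one-element condition is plain Python behavior and can be checked without `ADS`. In this short illustration, the variable names are arbitrary.

```python
with_comma = ("card_type",)     # a one-element tuple
without_comma = ("card_type")   # only a parenthesized string
doubled = (("country_code", "area_code"))  # extra parentheses change nothing

print(type(with_comma).__name__, type(without_comma).__name__)
print(doubled == ("country_code", "area_code"))
```

Without the comma, the `condition` argument would be a string and not a tuple of parameter names.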

To determine which credit card numbers in the variable `series` are issued by Mastercard, pass the parameter `card_type="Mastercard"` into the `.is_credit_card()` feature type validator. The feature type system examines the parameters and then dispatches `is_any_card_handler`. `is_any_card_handler` accepts the `card_type` parameter and has logic to detect which numbers are Mastercard.

In [None]:
series.ads.validator.is_credit_card(card_type="Mastercard")

This approach can also be used by the feature type object, `CreditCard`. In this example, the values in the variable `series` are checked to see if they match American Express credit card numbers:

In [None]:
CreditCard.validator.is_credit_card(series, card_type="Amex")

To register an open value feature type validator that has multiple conditions, you would use a tuple with multiple values. For example, if you wanted to create a condition that would check the country and area codes of a phone number, you could use the following:

```python
PhoneNumber.validator.register(name='is_phone_number', condition=("country_code", "area_code"),
                               handler=is_country_area_handler)
```
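
A handler for an open condition with two parameters receives both values as formal arguments. The following is a minimal sketch of what such a handler might look like. The name `country_area_handler_sketch` is hypothetical, and the phone number format (`+<country>-<area>-ddd-dddd`) is an assumption made only for this illustration.

```python
import re
import pandas as pd

def country_area_handler_sketch(data: pd.Series, country_code: str, area_code: str) -> pd.Series:
    """Hypothetical handler; assumes numbers are formatted like '+1-902-555-0100'."""
    pattern = re.compile(
        rf"^\+{re.escape(country_code)}-{re.escape(area_code)}-\d{{3}}-\d{{4}}$"
    )
    # Non-string values (including nulls) are never valid.
    return data.apply(lambda x: isinstance(x, str) and pattern.match(x) is not None)

phones = pd.Series(["+1-902-555-0100", "+1-416-555-0199", None])
result = country_area_handler_sketch(phones, country_code="1", area_code="902")
print(result.tolist())
```

Because the condition is open, the same handler serves any combination of country and area codes that is passed in at call time.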

It's not possible to mix open and closed condition feature type validators.

<a id='condition_handler_disambiguation'></a>
## Disambiguation

In this notebook, a closed condition feature type validator was created for `card_type='Visa'`. An open condition feature type validator was also created to handle all conditions that specify the `card_type` parameter. There appears to be a conflict in that both conditions support the case of `card_type='Visa'`. In fact, there is no conflict. The feature type system determines the most restrictive case and dispatches on it, so the `is_visa_card_handler` handler is called.

In [None]:
CreditCard.validator.registered()

The next cell causes the `is_visa_card_handler` to be dispatched because it has the most restrictive set of requirements that match the parameters given.

In [None]:
series.ads.validator.is_credit_card(card_type="Visa")
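
The dispatch order described in this section can be modeled in a few lines of plain Python. This is a toy model of the documented behavior and not the `ADS` implementation; `select_handler` and the stand-in handler functions are hypothetical, and the exact matching rules (for example, requiring the full set of parameter names to match) are assumptions.

```python
from typing import Callable, Dict, List, Tuple, Union

Condition = Union[Dict[str, object], Tuple[str, ...]]

def select_handler(
    default: Callable,
    registered: List[Tuple[Condition, Callable]],
    call_kwargs: Dict[str, object],
) -> Callable:
    """Pick a handler: closed match first, then open match, then the default."""
    # Closed conditions are the most restrictive: keys and values must match.
    for condition, handler in registered:
        if isinstance(condition, dict) and condition == call_kwargs:
            return handler
    # Open conditions only require the parameter names to match.
    for condition, handler in registered:
        if isinstance(condition, tuple) and set(condition) == set(call_kwargs):
            return handler
    return default

# Stand-ins for the handlers registered in this notebook.
def default_handler(): ...
def visa_handler(): ...
def any_card_handler(): ...

registered = [
    (("card_type",), any_card_handler),        # open condition
    ({"card_type": "Visa"}, visa_handler),     # closed condition
]

chosen_visa = select_handler(default_handler, registered, {"card_type": "Visa"})
chosen_mastercard = select_handler(default_handler, registered, {"card_type": "Mastercard"})
chosen_plain = select_handler(default_handler, registered, {})
print(chosen_visa.__name__, chosen_mastercard.__name__, chosen_plain.__name__)
```

In this model, `card_type="Visa"` reaches the closed handler even though the open condition is listed first, which mirrors the behavior shown in the cell above.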

<a id='unregistering'></a>
# Unregistering a Feature Type Validator

The `.unregister()` method is used to remove a feature type validator. Condition feature type validators are removed by passing the validator name along with the `condition` parameter. For a closed condition feature type validator, `condition` is a dictionary, and it must match the dictionary that was used to register the handler. For an open condition feature type validator, `condition` is a tuple. Again, the tuple must match the tuple that was used to register the handler.

To remove a default feature type validator, use the feature type object along with the `.unregister()` method. In this case, the parameter is the name of the validator. Removing the default feature type validator also removes any condition feature type validators that are associated with it.
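
The removal rules described above can be illustrated with a toy registry. This is only a model of the described behavior, not the `ADS` implementation: `ToyRegistry` is a hypothetical class, handlers are represented by their names as strings, and raising `ValueError` for an unknown condition is an assumption.

```python
class ToyRegistry:
    """Toy model of validator registration; not the ADS implementation."""

    def __init__(self):
        # name -> {"default": handler, "conditions": [(condition, handler), ...]}
        self._validators = {}

    def register(self, name, handler, condition=None):
        entry = self._validators.setdefault(name, {"default": None, "conditions": []})
        if condition is None:
            entry["default"] = handler
        else:
            entry["conditions"].append((condition, handler))

    def unregister(self, name, condition=None):
        if condition is None:
            # Removing the default removes the validator and all of its conditions.
            del self._validators[name]
            return
        entry = self._validators[name]
        remaining = [(c, h) for c, h in entry["conditions"] if c != condition]
        if len(remaining) == len(entry["conditions"]):
            raise ValueError(f"condition {condition!r} is not registered for {name!r}")
        entry["conditions"] = remaining

    def registered(self):
        rows = []
        for name, entry in self._validators.items():
            rows.append((name, None, entry["default"]))
            rows.extend((name, c, h) for c, h in entry["conditions"])
        return rows

reg = ToyRegistry()
reg.register("is_credit_card", "default_handler")
reg.register("is_credit_card", "is_visa_card_handler", condition={"card_type": "Visa"})
reg.register("is_credit_card", "is_any_card_handler", condition=("card_type",))

reg.unregister("is_credit_card", condition={"card_type": "Visa"})
after_condition = reg.registered()   # the closed condition is gone

reg.unregister("is_credit_card")
after_default = reg.registered()     # the validator and its open condition are gone
print(after_condition, after_default)
```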

The next cell lists the current feature type validators.

In [None]:
CreditCard = feature_type_manager.feature_type_object('credit_card')
CreditCard.validator.registered()

Remove the closed condition for the case where `'card_type'='Visa'` on the validator `is_credit_card` as in the next cell. Note that the handler has been removed.

In [None]:
try:
    CreditCard.validator.unregister(name="is_credit_card", condition={"card_type": "Visa"})
except Exception as e:
    # The condition may already be unregistered, for example if this cell is re-run.
    print(e)
CreditCard.validator.registered()

Remove the open condition for `card_type` on the validator `is_credit_card` as in the next cell. Note that the handler has been removed.

In [None]:
CreditCard.validator.unregister(name="is_credit_card", condition=("card_type",))
CreditCard.validator.registered()

Remove the default feature type validator for `is_visa_card` as in the next cell. Note that the handler has been removed.

In [None]:
CreditCard.validator.unregister(name='is_visa_card')
CreditCard.validator.registered()

<a id="reference"></a>
# References

- [ADS Library Documentation](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/index.html)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
