ADS Sample Notebook.

Copyright (c) 2021 Oracle, Inc. All rights reserved. Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.

***
# <font color=red>Custom Feature Type</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal>Oracle Cloud Infrastructure Data Science Service Team</font></p>

***


# Overview:

The feature type system allows data scientists to separate the concept of how data is represented physically from what the data actually measures. The data can have feature types that classify the data based on what it represents and not how the data is stored in memory. Each set of data can have multiple feature types through a system of multiple inheritance. While ADS comes with a collection of common feature types, the power of the system is that an organization can create custom feature types for their data. This notebook provides an overview of how to create custom feature types.

---

## Prerequisites:
- Experience with a specific topic: Intermediate
- Professional experience: Basic

---

## Objectives:

- <a href='#overview'>Feature Type System</a>
- <a href="#feature_type_create">Create a Custom Feature Type</a>
- <a href="#feature_type_attributes">Attributes</a>
- <a href="#feature_type_stats">Feature Type Statistics</a>
- <a href="#feature_type_plot">Feature Type Plot</a>
- <a href="#feature_type_method">Custom Method in a Feature Type Class</a>
- <a href='#unregister_custom_type'>Unregistering a Custom Feature Type</a>
- <a href="#reference">References</a>

---

**Important:**

Placeholder text for required values is surrounded by angle brackets, which must be removed when you add the indicated content. For example, when adding a database name, `database_name = "<database_name>"` would become `database_name = "production"`.

---

<font color=gray>Datasets are provided as a convenience. Datasets are considered third-party content and are not considered materials under your agreement with Oracle applicable to the services. The [`orcl_attrition` dataset](oracle_data/UPL.txt) is distributed under the UPL license.
</font>

In [None]:
import ads
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import seaborn as sns

from ads.feature_engineering import feature_type_manager, FeatureType
from ads.common.card_identifier import card_identify

<a id='overview'></a>
# Feature Type System

The feature type system allows the data scientist to separate the concept of how data is represented physically from what the data actually measures. The data can have feature types that classify the data based on what it represents and not how the data is stored in memory. Each feature can have multiple feature types through a system of multiple inheritance. For example, an organization that sells cars might have a set of data that represents the purchase price of a car (the wholesale price). This feature could have the feature types `wholesale_price`, `car_price`, `USD`, and `continuous`.

While ADS comes with a number of built-in feature types, the power lies in its flexibility for you to quickly create feature types that can be shared across your organization. A custom feature type is created by defining a class that is derived from the `FeatureType` class or one of its subclasses. This class comes with several attributes and methods that can be overridden so that it meets your specific needs. However, there is no requirement to override any property of the base class.

A feature type has the following attributes that can be overridden:

- `description`: A description of the feature type.
- `name`: The name of the feature type.

If you wish to create custom summary statistics for a feature type, then override the `.feature_stat()` method. To create a custom summary plot, override the `.feature_plot()` method.

The focus of this notebook is to demonstrate how to create a custom feature type.

<a id="feature_type_create"></a>
# Create a Custom Feature Type

The feature type framework comes with some common feature types. However, the power of using feature types is that you can easily create your own, and apply them to your specific data. You don't need to try to represent your data in a synthetic way that does not match the nature of your data. This framework allows you to create methods that validate whether the data fits the specifications of your organization.
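As one illustration of the kind of check such a validation method might perform, the sketch below implements the Luhn checksum that payment card numbers use. It is a standalone helper: the name `luhn_is_valid` is hypothetical, it is not part of ADS, and it is not wired into the feature type framework here.

```python
def luhn_is_valid(card_number) -> bool:
    """Return True if card_number passes the Luhn checksum.

    Hypothetical helper for illustration only; not an ADS API.
    """
    if not isinstance(card_number, str) or not card_number.isdecimal():
        return False
    total = 0
    # Walk the digits from right to left, doubling every second one.
    for position, char in enumerate(reversed(card_number)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_is_valid("4111111111111111"))  # a widely used test number
print(luhn_is_valid("4111111111111112"))  # same number with the last digit changed
```

A checksum like this only confirms that a number is well formed. It says nothing about whether the card exists.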

To create a custom feature type, you need to create a class that inherits from the `FeatureType` class. The class must be registered with ADS before it can be used. You do this by passing the class to the `feature_type_manager.feature_type_register()` method.

In the next cell, the custom feature type, `CustomCreditCard`, is created and is inherited from the `FeatureType` base class. The class overrides the `name` attribute to set a custom name for the class. If it is not overridden then the name will automatically be determined by the class name and converted to snake case. The name can be used to register and unregister a class. It is also used in several outputs to identify the class. When assigning a feature type to a series, the name can be used to identify the class.
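To make the default naming concrete, here is a minimal sketch of a CamelCase to snake case conversion. The function `to_snake_case` is hypothetical, and the exact rule that ADS applies is not shown in this notebook, so treat this as one plausible conversion rather than the library's implementation.

```python
import re


def to_snake_case(class_name: str) -> str:
    """Convert a CamelCase class name to snake_case (illustrative only)."""
    # Insert an underscore wherever a lowercase letter or digit is followed by a capital.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", class_name).lower()


print(to_snake_case("CustomCreditCard"))  # custom_credit_card
```

Under this assumption, a `CustomCreditCard` class that did not override `name` would be identified by a string such as `custom_credit_card` instead of 'Custom Credit Card'.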

The `description` attribute allows you to provide detailed information about a feature type. If it is not overridden the description will default to 'Base Feature Type.'

The `.feature_stat()` method provides summary statistics for a feature type. If this method is not overridden and `.feature_stat()` is called on a feature type, the multiple inheritance mechanism is used. For example, suppose you create a feature type called `wholesale_price` and assign it to a Pandas series of car prices that has the following feature types: `wholesale_price`, `car_price`, `USD`, and `continuous`. Assuming that `.feature_stat()` is not overridden in the `wholesale_price` class, the system attempts to generate the feature statistics using the `.feature_stat()` method defined in its parent feature type, `car_price`. This continues until a feature type that defines `.feature_stat()` is found. If none of the custom feature type classes implement `.feature_stat()`, the call falls through to the default feature type, which always implements this method.
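The fall-through behavior can be pictured with a plain Python analogy. The sketch below does not use ADS. The four classes and the `resolve_feature_stat` helper are hypothetical stand-ins that walk an ordered list of types and pick the first one that defines `feature_stat` itself.

```python
class Continuous:
    name = "continuous"

    @staticmethod
    def feature_stat(values):
        # Only this most generic type defines summary statistics.
        return {"count": len(values), "mean": sum(values) / len(values)}


class USD:
    name = "USD"


class CarPrice:
    name = "car_price"


class WholesalePrice:
    name = "wholesale_price"


def resolve_feature_stat(feature_types):
    """Return the first type in the ordered list that defines feature_stat itself."""
    for feature_type in feature_types:
        if "feature_stat" in vars(feature_type):
            return feature_type
    raise LookupError("no feature type defines feature_stat")


# Most specific type first, most generic type last.
ordered_types = [WholesalePrice, CarPrice, USD, Continuous]
provider = resolve_feature_stat(ordered_types)
stats = provider.feature_stat([18500.0, 22000.0, 19750.0])
print(provider.name)
print(stats)
```

Here none of the three more specific types defines the method, so the statistics come from the last type in the list.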

In the next cell, the `.feature_stat()` method is implemented. It will create a dataframe where the index is the name of the metric. There will be one column that is the value that is associated with the metric.

The `.feature_plot()` method can be overridden such that a custom plot can be created for the data. Just like the `.feature_stat()` method, if it is not overridden a plot will be generated by the multiple inheritance mechanism. All default feature types have a defined `.feature_plot()`. Since the default feature type's plotting method is generic, it may or may not provide a good representation of the data.

In the next cell, the `.feature_plot()` method creates a bar chart with the count of the issuing financial institution. 


As with all classes, it is possible to add additional attributes and methods that may be helpful when working with a feature type. The next cell implements a method called `.issuer()`. Since the class that we are creating is a custom credit card, this would be a handy piece of information.

The important rule is that all methods of the feature type must be static methods or class methods (declared with `@staticmethod` or `@classmethod`) and should take a Pandas series as their first data argument.

In [None]:
def assign_issuer(cardnumber):
    """
    Identifies the credit card type.
    """
    if pd.isnull(cardnumber):
        return "missing"
    else:
        return card_identify().identify_issue_network(cardnumber)

class CustomCreditCard(FeatureType):
    """Type representing custom credit card numbers.

    Attributes
    ----------
    description: str
        The feature type description.
    name: str
        The feature type name.
    
    Methods
    --------
    register_handler(cls, name: str, default_handler: Callable) -> None
        Registers a new handler for a feature type.
    unregister_handler(cls, name: str) -> None
        Unregisters a handler.
    registered_handlers(cls) -> pd.DataFrame
        Gets the list of registered handlers.
    feature_stat(x: pd.Series) -> pd.DataFrame
        Generates feature statistics.        
    feature_plot(x: pd.Series) -> plt.Axes
        Generates plot object.
    """
    description = "This is an example of a custom credit card feature type."
    name="Custom Credit Card"
       
    @staticmethod
    def issuer(series: pd.Series) -> pd.Series:
        """Identifies the credit card type.
        
        Parameters
        ----------
        series: pd.Series
            The data to process.
        
        Returns
        -------
        pd.Series: The result of processing data.
        """
        return series.apply(lambda card_number: 'Missing'
                       if pd.isnull(card_number) else card_identify().identify_issue_network(card_number))    
    
    @staticmethod
    def feature_stat(series: pd.Series) -> pd.DataFrame:
        """Generates feature statistics.

        Feature statistics include the total count, the unique count, the missing count, and
        the count of each credit card type.
        
        Parameters
        ----------
        series: pd.Series
            The data to process.

        Returns
        -------
        pd.DataFrame
            Summary statistics of the series provided.
        """
        
        def _count_unique_missing(series: pd.Series):
            """Return the total count, unique count and count of missing values of a series."""

            def _add_missing(series, df):
                """Add count of missing values."""
                n_missing = pd.isnull(series.replace(r'', np.nan)).sum()
                if n_missing > 0:
                    df.loc['missing'] = n_missing
                return df

            df_stat = pd.Series({'count': len(series),
                                 'unique': len(series.replace(r'', np.nan).dropna().unique())},
                                name=series.name).to_frame()
            return _add_missing(series, df_stat)

        df_stat = _count_unique_missing(series)
        card_types = series.apply(assign_issuer)
        value_counts = card_types.value_counts()
        value_counts.index = [
            "count_" + cardtype for cardtype in list(value_counts.index)
        ]
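        # Name the column explicitly so it lines up with df_stat on any pandas version.
        return pd.concat([df_stat, value_counts.to_frame(name=df_stat.columns[0])])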
    
    @staticmethod
    def feature_plot(x: pd.Series) -> plt.Axes:
        """Generates plot object.

        The bar plot shows the count of each credit card type.
        
        Parameters
        ----------
        x: pd.Series
            The data to process.

        Returns
        -------
        plt.Axes
            Plot object for the series based on the CustomCreditCard feature type.
        """
        card_types = x.apply(assign_issuer)
        df = card_types.value_counts().to_frame()
        if len(df.index):
            ax = sns.barplot(x=df.index, y=list(df.iloc[:, 0]), color='#76A2A0')
            ax.set(xlabel="Issuing Financial Institution")
            ax.set(ylabel="Count")
            return ax

Once the feature type class has been created, it must be registered with ADS using the `feature_type_manager.feature_type_register()` method. The next cell registers the `CustomCreditCard` and displays all feature types.

In [None]:
try:
    feature_type_manager.feature_type_register(CustomCreditCard)
except Exception:
    # Re-running this cell can raise an error if the type is already registered; ignore it.
    pass
feature_type_manager.feature_type_registered()

<a id="feature_type_attributes"></a>
# Attributes

The `name` and `description` attributes can be accessed in the same way as the attributes of any other class.

The name assigned to the `CustomCreditCard` class can be determined by calling `CustomCreditCard.name`.

In [None]:
CustomCreditCard.name

The same approach works for the `description` attribute.

In [None]:
CustomCreditCard.description

<a id="feature_type_stats"></a>
# Feature Type Statistics

The `.feature_stat()` method is used to provide summary statistics about a Pandas series that are specific for the feature type. In the `CustomCreditCard` class, the summary statistics were count, unique credit card numbers, the number of missing values and the number of credit cards from each issuer. The power of the feature type statistic is that you can create any number of summary statistics that will adequately summarize your feature type.

The summary statistics can be accessed by calling `.feature_stat()` on the feature type class and passing in a series. The next cell creates a dataframe of credit card numbers and uses this approach to output the summary statistics.

In [None]:
creditcard_numbers = ["4532640527811543", "4556929308150929", "4539944650919740", "4485348152450846", "4556593717607190",
                       "5406644374892259", "5440870983256218", "5446977909157877", "5125379319578503", "5558757254105711",
                       "371025944923273", "374745112042294", "340984902710890", "375767928645325", "370720852891659",
                       np.nan, None, "", "111", "0"]
df = pd.DataFrame({'credit_card': creditcard_numbers})
CustomCreditCard.feature_stat(df['credit_card'])

The more common approach to generating summary statistics is to call `.feature_stat()` on the Pandas series itself. Before doing this, the series needs to be associated with the feature type 'Custom Credit Card'. The next cell makes `df['credit_card']` have the feature type 'Custom Credit Card' and then displays the summary statistics.

In [None]:
df['credit_card'].ads.feature_type = ['Custom Credit Card']
df['credit_card'].ads.feature_stat()
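To see what the count, unique, and missing rows measure without ADS installed, the following sketch recomputes them with plain Pandas on a small, made-up series. It uses `Series.mask` to treat empty strings as missing, which is a substitution for the `replace` call in the class above and not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Made-up sample: one duplicate, one empty string, two nulls, one short value.
cards = pd.Series(
    ["4532640527811543", "4532640527811543", "", None, np.nan, "111"],
    name="credit_card",
)

# Treat empty strings as missing values.
cleaned = cards.mask(cards == "")

stats = pd.Series(
    {
        "count": len(cards),                      # every row, missing or not
        "unique": cleaned.dropna().nunique(),     # distinct non-missing values
        "missing": int(cleaned.isna().sum()),     # nulls plus empty strings
    },
    name=cards.name,
).to_frame()
print(stats)
```

Note that `count` includes the missing rows, so `count` minus `missing` gives the number of populated values.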

<a id="feature_type_plot"></a>
# Feature Type Plot

The `.feature_plot()` method is used to provide a univariate plot that is specific to a feature type. You can create your own plot that will visualize your feature type. For the `CustomCreditCard` feature type, a bar chart was implemented in the `.feature_plot()` method. It counts the number of credit cards based on the issuing financial institution. It then sorts the counts from highest to lowest. The method returns a Matplotlib `Axes` object, created with Seaborn, that can then be customized.

The plot method can be accessed by calling `.feature_plot()` on a series. The next cell creates the plot for the feature type Custom Credit Card.

In [None]:
df['credit_card'].ads.feature_plot()
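Because the plot method is annotated as returning `plt.Axes`, the usual Matplotlib calls can be applied to the result. The sketch below builds a comparable bar chart directly with Matplotlib from hypothetical issuer counts, so that it runs without ADS, and then applies two customizations of the kind you could apply to the returned object.

```python
import matplotlib.pyplot as plt

# Hypothetical issuer counts, ordered from highest to lowest.
issuer_counts = {"Visa": 5, "Mastercard": 5, "Amex": 5, "missing": 3}

# Stand-in for the Axes that a feature_plot() implementation would return.
fig, ax = plt.subplots()
ax.bar(list(issuer_counts.keys()), list(issuer_counts.values()), color="#76A2A0")
ax.set(xlabel="Issuing Financial Institution", ylabel="Count")

# Customizations applied after the plot has been created.
ax.set_title("Credit cards by issuer")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```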

<a id="feature_type_method"></a>
# Custom Method in a Feature Type Class

It is often desirable to add methods to a feature type class that provide extra functionality when working with that feature type. In the `CustomCreditCard` class the method `.issuer()` was added. This method returns the name of the organization that issued the credit card number.

The custom method can be accessed by calling `.issuer()` on the feature type class and passing in a series. The next cell uses this approach to create a series of credit card issuers. This approach has the advantage that the series, `df['credit_card']`, does not have to have the `CustomCreditCard` feature type.

In [None]:
CustomCreditCard.issuer(df['credit_card'])

The more common approach is to call `.issuer()` on the Pandas series itself. Before doing this, the series needs to be associated with the feature type 'Custom Credit Card'. The next cell assigns the feature type 'Custom Credit Card' to `df['credit_card']` and then returns the issuer of each credit card number.

In [None]:
df['credit_card'].ads.feature_type = ['Custom Credit Card']
df['credit_card'].ads.issuer()
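For intuition about how an issuer can be inferred from a card number, here is a simplified sketch based on leading digits. The function `guess_issuer` and its labels are hypothetical. It is not the logic of `card_identify().identify_issue_network()`, it ignores number length, and it omits other ranges such as the newer Mastercard prefixes.

```python
def guess_issuer(card_number) -> str:
    """Guess a card network from the leading digits (simplified illustration)."""
    # Nulls, NaN (a float), and empty strings are all treated as missing.
    if not isinstance(card_number, str) or card_number == "":
        return "missing"
    if card_number.startswith("4"):
        return "Visa"
    if card_number[:2] in {"34", "37"}:
        return "Amex"
    if card_number[:2] in {"51", "52", "53", "54", "55"}:
        return "Mastercard"
    return "unknown"


for number in ["4532640527811543", "5406644374892259", "371025944923273", None, "111"]:
    print(number, "->", guess_issuer(number))
```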

<a id='unregister_custom_type'></a>
# Unregistering a Custom Feature Type

To unregister a custom feature type, use the `feature_type_manager.feature_type_unregister()` method. This method accepts either the feature type class, as in `feature_type_manager.feature_type_unregister(CustomCreditCard)`, or its name, as in `feature_type_manager.feature_type_unregister("Custom Credit Card")`.

The next cell unregisters the `CustomCreditCard` feature type and displays the currently registered feature types. Notice that `CustomCreditCard` is not listed.

In [None]:
feature_type_manager.feature_type_unregister(CustomCreditCard)
feature_type_manager.feature_type_registered()
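The idea of unregistering by class or by name can be illustrated with a toy registry. The `TypeRegistry` and `DemoCard` classes below are hypothetical and are not how ADS implements `feature_type_manager`. The sketch only shows one way a single method can accept both forms of argument.

```python
class TypeRegistry:
    """Toy registry keyed by a type's `name` attribute (not the ADS implementation)."""

    def __init__(self):
        self._types = {}

    def register(self, cls):
        if cls.name in self._types:
            raise ValueError(f"'{cls.name}' is already registered")
        self._types[cls.name] = cls

    def unregister(self, cls_or_name):
        # Accept either the class itself or its registered name.
        name = cls_or_name if isinstance(cls_or_name, str) else cls_or_name.name
        if name not in self._types:
            raise KeyError(f"'{name}' is not registered")
        del self._types[name]

    def registered(self):
        return sorted(self._types)


class DemoCard:
    name = "Demo Card"


registry = TypeRegistry()
registry.register(DemoCard)
registry.unregister("Demo Card")  # unregister by name
registry.register(DemoCard)
registry.unregister(DemoCard)     # unregister by class
print(registry.registered())  # []
```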

<a id="reference"></a>
# References
- [Oracle ADS Library documentation](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/index.html)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)