ADS Sample Notebook.

Copyright (c) 2021 Oracle, Inc. All rights reserved. Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.

***
# <font color=red>Feature Type</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal>Oracle Cloud Infrastructure Data Science Service Team</font></p>

***

# Overview:

There is a distinction between the data type of a feature and the nature of data that it represents. The data type represents the form of the data that the computer understands. ADS uses the term "feature type" to refer to the nature of the data. For example, a medical record id could be represented as an integer, its data type, but the feature type would be "medical record id". The feature type represents the data the way the data scientist understands it. ADS provides the feature type module on top of your Pandas dataframes and series to manage and use the typing information to better understand your data.

The feature type framework comes with some common feature types. However, the power of using feature types is that you can easily create your own and apply them to your specific data. You don't need to try to represent your data in a synthetic way that does not match the nature of your data. This framework allows you to create methods that validate whether the data fits the specifications of your organization. For example, for a medical record type you could create methods to validate that the data is properly formatted. You can also have the system generate warnings to ensure that the data is valid as a whole, or create graphs for summary plots.

The framework allows you to create and assign multiple feature types. For example, a medical record id could also have a feature type id and the integer feature type.

---

## Prerequisites:
 - Experience with specific topic: Novice
 - Professional experience: None

---

## Objectives:

- <a href='#overview'>Overview</a>
- <a href="#feature_type">Feature Type</a>
    - <a href="#set_feature_type">Setting Feature Types</a>
        - <a href="#set_feature_type_series">Series</a>
        - <a href="#set_feature_type_df">Dataframe</a>
    - <a href="#default_type">Default Feature Type</a>
    - <a href="#tag">Tag</a>
- <a href="#warnings">Feature Type Warnings</a>
- <a href="#handlers">Feature Type Validators</a>
- <a href="#feature_select">Feature Type Selection</a>

---

**Important:**

Placeholder text for required values is surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name, `database_name = "<database_name>"` would become `database_name = "production"`.

---

<font color=gray>Datasets are provided as a convenience. Datasets are considered third party content and are not considered materials under your agreement with Oracle applicable to the services. The [`orcl_attrition` dataset](oracle_data/UPL.txt) is distributed under the UPL license.
</font>

In [None]:
import ads
import numpy as np
import os
import pandas as pd

from ads.feature_engineering import feature_type_manager, FeatureType, Tag

<a id="overview"></a>
# Feature Type System

The feature type system allows data scientists to separate the concept of how data is represented physically from what the data actually measures. That is, the data can have feature types that classify the data based on what it represents and not how the data is stored in memory. Each set of data can have multiple feature types through a system of multiple inheritance. As a concrete example, an organization that sells cars might have a set of data that represents their purchase price of a car, that is, the wholesale price. You could have a feature set of `wholesale_price`, `car_price`, `USD`, and `continuous`. This multiple inheritance allows a data scientist to create <a href="#warnings">feature type warnings</a> and <a href="#handlers">feature type validators</a> for each feature type.

Feature type warnings are used for rapid validation of the data. For example, the `wholesale_price` might have a method that ensures that the value is a positive number because you can't purchase a car with negative money. The `car_price` feature type may have a check to ensure that it is within a reasonable price range. `USD` can check the value to make sure that it represents a valid US dollar amount. It can't have values below one cent. The `continuous` feature type is the <a href="#default_type">default feature type</a> and it represents the way the data is stored internally.

The feature type validators are a set of `is_*` methods, where `*` is generally the name of the feature type. For example, the method `.is_wholesale_price()` can create a boolean Pandas series that indicates which values meet the validation criteria. It allows you to quickly identify which values need to be filtered or require further examination into problems in the data pipeline. The feature type validators can be as complex as they need to be. For example, they might take a client ID and call an API to validate that each client ID is active.
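The layering described above can be sketched in plain Python. This is an illustration only and does not use the ADS API: the check functions, the price range, and the `failed_checks` helper are all hypothetical names chosen for this example.

```python
# Hypothetical checks, one per feature type in the car-price example.
# None of these names come from ADS; they only illustrate the layering.

def check_wholesale_price(value):
    # You cannot purchase a car with negative money.
    return value > 0


def check_car_price(value, low=500.0, high=500_000.0):
    # Assumed "reasonable" price range; the bounds are made up.
    return low <= value <= high


def check_usd(value):
    # A dollar amount should not contain a fraction of a cent.
    cents = value * 100
    return abs(cents - round(cents)) < 1e-6


def check_continuous(value):
    # The default type reflects how the value is stored.
    return isinstance(value, float)


# Ordered from the most specific feature type to the default type.
FEATURE_TYPES = [
    ("wholesale_price", check_wholesale_price),
    ("car_price", check_car_price),
    ("USD", check_usd),
    ("continuous", check_continuous),
]


def failed_checks(value):
    """Return the names of the feature types whose check rejects the value."""
    return [name for name, check in FEATURE_TYPES if not check(value)]
```

Calling `failed_checks(-100.0)` reports both `wholesale_price` and `car_price`, which shows how each feature type in the list contributes its own independent check.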


<a id="feature_type"></a>
# Feature Type

Pandas dtypes are physical data types that indicate how data are stored. You can call `.dtype` on your Pandas dataframe or series to inspect the physical types. Feature types are the logical types that define how the data should be interpreted by the end user. Feature types categorize the features from the machine learning perspective. Different feature types could be the same physical type. For example, both categorical and ordinal can be an integer dtype. However, the difference between `categorical` and `ordinal` feature types is that `ordinal` features have an ordering while `categorical` features don't. 
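The difference between physical and logical types can be seen with Pandas alone. The following sketch does not use ADS. The column names and the size coding are assumptions made up for the example. It shows two series that share a dtype but differ in whether an ordering is meaningful.

```python
import pandas as pd

# Two columns with the same physical dtype but different meaning.
shirt_size = pd.Series([1, 3, 2, 1], name="shirt_size")   # 1=S, 2=M, 3=L (ordered)
department = pd.Series([1, 3, 2, 1], name="department")   # arbitrary codes (unordered)

same_dtype = shirt_size.dtype == department.dtype

# Pandas can record the ordering distinction on a categorical.
ordinal = pd.Categorical(shirt_size, categories=[1, 2, 3], ordered=True)
nominal = pd.Categorical(department, categories=[1, 2, 3], ordered=False)

# A maximum is only defined when the categories are ordered.
largest_size = ordinal.max()
```

Calling `nominal.max()` raises a `TypeError` because an unordered categorical has no notion of a largest value, which mirrors the distinction between the `ordinal` and `categorical` feature types.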

ADS allows a set of data to have multiple feature types through a system of inheritance. For example, a hospital may have a medical record number for each patient. That data might have the feature types `patient_id`, `id`, and `integer`. The `patient_id` is the child feature type with `id` being its parent. The `integer` is the parent of the `id` feature type. It is also the last feature type in the inheritance chain and is called the <a href="#default_type">default feature type</a>.

In addition to the regular feature types, there are two special versions. The <a href="#default_type">default type</a> is based on the Pandas dtype and cannot be changed without changing the Pandas dtype. There is no need to set it because it is always the last feature type in the inheritance chain. The <a href="#tag">tag</a> feature type supports neither <a href="#warnings">feature type warnings</a> nor <a href="#handlers">feature type validators</a>. It is designed to allow you to tag data with extra information.

Calling `feature_type_manager.feature_type_registered()` gives you an overview of all the registered feature types. ADS comes with various common feature types, but the idea is that you create feature types that explicitly define your data.

`feature_type_manager.feature_type_registered()` returns a dataframe with these columns:
- `Class`: registered feature type class.
- `Name`: feature type class name.
- `Description`: Description of each feature type class.

In [None]:
feature_type_manager.feature_type_registered()

<a id="set_feature_type"></a>
## Setting Feature Types
    
The `.feature_type` property is used to store the feature types that are to be associated with a dataset. It accepts an ordered list of the feature types that are to be associated with the dataset. The next cell creates a series of credit card numbers. It then sets the `.feature_type` property with a list of strings that are the class names of the feature types.

A call to `.feature_type_description` returns a dataframe that is ordered by inheritance. It contains the following columns:

* `Feature Type`: The class that defines the feature type.
* `Description`: A description of the feature type.


<a id="set_feature_type_series"></a>
### Series

To assign feature types to a Pandas series, use the `.ads.feature_type` property on the series.


In [None]:
series = pd.Series(["4532640527811543", "4556929308150929", "4539944650919740", "4485348152450846"], name='Credit Card')
series.ads.feature_type = ['credit_card', 'string']
series.ads.feature_type_description

The `.feature_type` property doesn't have to take the class name because it can accept the class itself. For example, the feature type object that is associated with the `credit_card` class name can be used. This gives you flexibility for how you want to define the feature types.

The feature type manager can be used to get a `FeatureType` object that is based on the `credit_card` feature type. You do this with the `.feature_type_object()` method.

You can repeat the preceding example by replacing `credit_card` with `CreditCard` and obtain the same results.

In [None]:
CreditCard = feature_type_manager.feature_type_object('credit_card')
series.ads.feature_type = [CreditCard, 'string']
series.ads.feature_type

The `.feature_type_object()` method of the `feature_type_manager` object takes a class name and returns an object based on `FeatureType`. This object represents the feature type. For example, `string` is the class name of the feature type `String`. The next cell checks if the class is a subclass of `FeatureType`.
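The idea of looking up a feature type by its class name can be sketched as a small registry in plain Python. This is not how ADS is implemented. The classes `FeatureTypeSketch`, `StringSketch`, and `CreditCardSketch`, the `REGISTRY` dictionary, and the `lookup` function are hypothetical stand-ins.

```python
class FeatureTypeSketch:
    """Hypothetical base class standing in for a feature type."""
    name = None
    description = ""


class StringSketch(FeatureTypeSketch):
    name = "string"
    description = "Free-form text."


class CreditCardSketch(FeatureTypeSketch):
    name = "credit_card"
    description = "Payment card number."


# Map each class name to the class that defines the feature type.
REGISTRY = {cls.name: cls for cls in (StringSketch, CreditCardSketch)}


def lookup(name):
    """Return the class registered under the given class name."""
    try:
        return REGISTRY[name]
    except KeyError:
        raise ValueError(f"feature type {name!r} is not registered") from None
```

With a registry like this, a name such as `'string'` and the class it maps to are interchangeable, which is why a feature type list can mix both forms.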

In [None]:
String = feature_type_manager.feature_type_object('string')
issubclass(String, FeatureType)

Now use the `String` and the `CreditCard` class to set the feature types:

In [None]:
series.ads.feature_type = [CreditCard, String]
series.ads.feature_type

<a id="set_feature_type_df"></a>
### Dataframe

Like a Pandas series, `.feature_type` can be used on a dataframe to set the feature types for the columns in the dataframe. The property accepts a dictionary where the key in the dictionary is the column name and the value is a list of feature types associated with that column.

In [None]:
attrition_path = os.path.join('/opt', 'notebooks', 'ads-examples', 'oracle_data', 'orcl_attrition.csv')
df = pd.read_csv(attrition_path, 
                 usecols=['Attrition', 'TravelForWork', 'JobFunction', 'EducationalLevel'])
df.ads.feature_type = {'Attrition': ['boolean', 'category'],
                         'TravelForWork': ['category'],
                         'JobFunction': ['category'],
                         'EducationalLevel': ['category']}
df.ads.validator_registered()

<a id="default_type"></a>
## Default Feature Type

There is a special feature type called the default feature type. It is based on the Pandas dtype. It doesn't have to be set by the user, but it can be.

Feature types allow for multiple inheritance and the default feature type is an ancestor to all other feature types, with the exception of <a href="#tag">tags</a>. Each series only has one default feature type. It can't be mutated or removed unless the underlying Pandas dtype has changed. For example, suppose you have a Pandas series called `series` that has a dtype of `string`, so its default feature type is `string`. If you change the type by calling `series = series.astype('category')`, then the default feature type is `category` instead of `string`.

ADS automatically detects the dtype of each series and sets the default feature type. The default feature type can be one of the following:

- `boolean`
- `date_time`
- `category`
- `string`
- `continuous`
- `integer`
- `object`

This cell creates a Pandas series of credit card numbers and prints the default feature type:

In [None]:
series = pd.Series(["4532640527811543", "4556929308150929", "4539944650919740"], name='creditcard')
series.ads.default_type

The `.feature_type` property allows you to list the feature types that are assigned to the data. In the next cell, `series` has the `credit_card` feature type added and the feature types are displayed. Notice that the `string` feature type is the last in the list and it wasn't included in the `.feature_type` property. This is because it is the default feature type and it's included automatically.

In [None]:
series.ads.feature_type = ['credit_card']
series.ads.feature_type

The default feature type can be included in the `.feature_type` property. If you do that, then the default feature type isn't added a second time.

In [None]:
series.ads.feature_type = ['credit_card', 'string']
series.ads.feature_type

<a id="tag"></a>
## Tag

A non-tag feature type must have a Python class defined and registered with ADS. However, it is often convenient to tag a dataset with additional information without the need to create a feature type class. This is the role of the `Tag`, which allows you to create a feature type without having to explicitly define and register a class. The tradeoff is that you can have neither <a href="#warnings">feature type warnings</a> nor <a href="#handlers">feature type validators</a>. Tags are semantic and provide more context about the actual meaning of a feature. This could directly affect the interpretation of the information. Tags are optional for any dataset.

The process of creating your tag is the same as setting the feature types because it is a feature type. You use the `.feature_type` property to create tags. 

The next cell creates a set of credit card numbers, sets the feature type to `credit_card`, and tags the dataset as being inactive cards. It also tags the cards as being from North American financial institutions. You can put any text you want in the `Tag()` because no underlying feature type class has to exist.

In [None]:
series = pd.Series(["4532640527811543", "4556929308150929", "4539944650919740", "4485348152450846"], name='Credit Card')
series.ads.feature_type=['credit_card', Tag('Inactive Card'), Tag('North American')]
series.ads.feature_type

Tags are always listed after the other feature types.
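The ordering rules can be sketched in plain Python. This is an assumption-laden illustration, not the ADS implementation: `SketchTag` and `order_feature_types` are hypothetical names, and the sketch assumes that the default feature type is placed after the other named feature types and before the tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchTag:
    """Hypothetical stand-in for a tag; not the ADS Tag class."""
    name: str


def order_feature_types(requested, default_type):
    """Order a requested list: named types, the default type once, then tags."""
    named = [
        ft for ft in requested
        if not isinstance(ft, SketchTag) and ft != default_type
    ]
    tags = [ft for ft in requested if isinstance(ft, SketchTag)]
    # The default type is appended exactly once, even if it was requested.
    return named + [default_type] + tags


ordered = order_feature_types(
    [SketchTag("Inactive Card"), "credit_card", SketchTag("North American")],
    default_type="string",
)
```

Here `ordered` lists `credit_card` first, then `string`, then the two tags in the order they were given, regardless of where the tags appeared in the request.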

In [None]:
series.ads.feature_type = [Tag('Inactive Card'), 'credit_card', Tag('North American')]
series.ads.feature_type

A list of tags can be obtained using the `tags` attribute.

In [None]:
series.ads.tags

<a id="warnings"></a>
# Feature Type Warnings

Part of exploratory data analysis (EDA) is to check the state or condition of your data. This means checking that all the values are in a given range, or that they are properly formatted with no missing values. For categorical data, you often want to ensure that the cardinality is low enough for the type of modeling that you are doing. Since the feature type system is meant to understand the nature of your data, it's an ideal mechanism to help automate the evaluation of the data. This evaluation is done by registering warnings.

Feature type warnings are functions that are built-in or user-defined. They perform an analysis of a feature to determine whether there are any data condition problems. For example, a warning might report that a feature is skewed when it is expected that the data is normally distributed. Another common example is that the data may have more than some threshold of missing values. ADS comes with various common warnings built in for the feature types that it supports. However, you are able to create and register any warnings that you want.
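As a rough illustration of what such a function does, the following plain-Python sketch checks the fraction of missing values against a threshold. It is not an ADS warning handler. The function names, the record layout, and the 5% threshold are assumptions made for this example.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_values_warning(values, threshold=0.05):
    """Return a list of warning records; the list is empty when no problem is found."""
    if not values:
        return []
    fraction = sum(is_missing(v) for v in values) / len(values)
    if fraction <= threshold:
        return []
    return [{
        "warning": "missing",
        "message": f"{fraction:.1%} of values are missing",
        "value": fraction,
    }]


report = missing_values_warning(["a", None, float("nan"), "b"])
```

Returning an empty result when the data looks fine keeps the caller simple, because several warning functions can be run and their results concatenated.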

Feature warnings are defined at the feature type level. Warnings can be registered dynamically at run time. The `feature_type_manager.warning_registered()` method shows a dataframe of the registered warnings for each registered feature type. The three columns of the returned dataframe are:

- `Feature Type`: Feature type class name.
- `Warning`: Warning name.
- `Handler`: Registered warning handler for that feature type.

In [None]:
feature_type_manager.warning_registered()

The feature type object can be used to list the warnings for a given feature type. You do this by calling `.warning.registered()` on the feature type object. The next cell uses the `.feature_type_object()` method in the feature type manager to obtain a credit card feature type object. This object can then be used to list the warnings that are associated with the credit card feature type.

In [None]:
CreditCard = feature_type_manager.feature_type_object('credit_card')
CreditCard.warning.registered()

The feature type warning system is used to provide warnings about the data. The next cell creates a dataset called `series` with a number of credit card values and some invalid values. The `series.ads.warning()` is called to produce a dataframe of the warnings.

In [None]:
visa = ["4532640527811543", "4556929308150929", "4539944650919740", "4485348152450846", "4556593717607190"]
amex = ["371025944923273", "374745112042294", "340984902710890", "375767928645325", "370720852891659"]
invalid = [np.nan, None, "", "123", "abc"]
series = pd.Series(visa + amex + invalid, name='creditcard')
series.ads.feature_type = ['credit_card']
series.ads.warning()

<a id="handlers"></a>
# Feature Type Validators

One aspect of EDA is to ensure that all the data is valid. For example, you might have credit card data and want to ensure that the numbers are all valid credit card numbers. The feature type validators are a way of performing this validation. Built-in methods are included for the feature types that are provided by ADS, but the idea is for you to create these methods for your custom feature types.

The feature type validators are a set of `is_*` methods, where `*` is generally the name of the feature type. For example, the method `.is_credit_card()` could be called to ensure that the data are all credit cards. It returns a boolean Pandas series that has the same length as the data. Each element is `True` if the corresponding value meets the criteria specified in the feature type validator, or `False` otherwise.

Feature type validators are defined at the feature type level. You can define functions that can be applied to the columns and features of the same feature type. The feature types that are provided by ADS come with a default set of handlers.

Feature handlers also have a concept of a `condition`. Conditions allow you to have different sets of feature handlers based on the condition. For example, if you wanted to see if a credit card was a Visa card, you could create a condition like `.is_credit_card(startswith='4')`. You would then register a feature handler with that condition and it would run when you passed in that condition.
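The combination of a validator and a condition can be sketched in plain Python. This is an illustration under stated assumptions, not the ADS handler. It uses the Luhn checksum as the validity rule, which may differ from the rules that the built-in `credit_card` validator applies. The `luhn_valid` helper is a hypothetical name, and the sketch returns a list of booleans instead of a Pandas series.

```python
def luhn_valid(number):
    """Return True if the value is a digit string that passes the Luhn checksum."""
    if not isinstance(number, str) or not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk the digits from right to left, doubling every second digit.
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def is_credit_card(values, startswith=None):
    """Return one boolean per value; the optional condition narrows the check."""
    result = []
    for value in values:
        ok = luhn_valid(value)
        if ok and startswith is not None:
            ok = value.startswith(startswith)
        result.append(ok)
    return result


cards = ["4532640527811543", "371025944923273", "123", None, ""]
any_card = is_credit_card(cards)
visa_like = is_credit_card(cards, startswith="4")
```

Without a condition, both the 16-digit and the 15-digit numbers pass. With `startswith="4"`, only the first number passes, which is the behavior that a condition is meant to provide.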

To list the current feature type validators and their conditions, use the `feature_type_manager.validator_registered()` method. It shows the registered default handlers in a dataframe format. The four columns in the dataframe are:

- `Feature Type`: Feature type class name.
- `Validator`: Validation functions that you can call to validate a Pandas series.
- `Condition`: The condition that the handler is registered in.
- `Handler`: Registered handler.

In [None]:
feature_type_manager.validator_registered()

With the `CreditCard` feature type object, you can view the handlers that are associated with the credit card feature type.

In [None]:
CreditCard = feature_type_manager.feature_type_object('credit_card')
CreditCard.validator.registered()

The feature type validator can be called on a Pandas series to test the values in the series. The next cell creates a dataset called `series` that contains some valid and invalid credit card numbers. This series is assigned a credit card feature type. The `.is_credit_card()` feature type validator is executed to determine which values are valid credit card numbers.

In [None]:
visa = ["4532640527811543", "4556929308150929"]
invalid = [np.nan, None, "", "123", "abc"]
series = pd.Series(visa + invalid, name='creditcard')
series.ads.feature_type = ['credit_card']
series.ads.validator.is_credit_card()

The feature type object can also be used to evaluate a Pandas series:

In [None]:
CreditCard.validator.is_credit_card(series) 

<a id="feature_select"></a>
# Feature Type Selection

You can select a subset of columns based on the feature types using `feature_select`. 
- `include` defaults to `None`. It takes a list of feature types (feature type object or feature type name) to include in the returned dataframe.
- `exclude` defaults to `None`. It takes a list of feature types (feature type object or feature type name) to exclude from the returned dataframe. 
- `include` and `exclude` cannot both be `None`. A feature type cannot be both included and excluded at the same time.
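The rules in the list above can be sketched in plain Python on a dictionary that maps column names to feature type names. This is not the ADS implementation. The `select_columns` function is hypothetical, and the sketch assumes that a column is kept when it has at least one included feature type and no excluded feature type.

```python
def select_columns(column_types, include=None, exclude=None):
    """Return the column names that match the include and exclude lists."""
    if include is None and exclude is None:
        raise ValueError("include and exclude cannot both be None")
    include = set(include or [])
    exclude = set(exclude or [])
    if include & exclude:
        raise ValueError("a feature type cannot be both included and excluded")

    selected = []
    for column, types in column_types.items():
        types = set(types)
        # Skip columns that have none of the included feature types.
        if include and not (types & include):
            continue
        # Skip columns that have any excluded feature type.
        if types & exclude:
            continue
        selected.append(column)
    return selected


column_types = {
    "Attrition": ["boolean"],
    "TravelForWork": ["category"],
    "JobFunction": ["category"],
    "EducationalLevel": ["category"],
}
only_boolean = select_columns(column_types, include=["boolean"])
without_boolean = select_columns(column_types, exclude=["boolean"])
```

In this sketch `only_boolean` contains just `Attrition`, and `without_boolean` contains the three remaining columns.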

In [None]:
attrition_path = os.path.join('/opt', 'notebooks', 'ads-examples', 'oracle_data', 'orcl_attrition.csv')
df = pd.read_csv(attrition_path, 
                 usecols=['Attrition', 'TravelForWork', 'JobFunction', 'EducationalLevel'])
df.ads.feature_type = {'Attrition': ['boolean'],
                         'TravelForWork': ['category'],
                         'JobFunction': ['category'],
                         'EducationalLevel': ['category']}

Next, create a dataframe that only has columns that have a boolean feature type:

In [None]:
df.ads.feature_select(include=['boolean'])

Now, create a dataframe that excludes columns that have a boolean feature type:

In [None]:
df.ads.feature_select(exclude=['boolean'])

<a id="reference"></a>
# References

- [ADS Library Documentation](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/index.html)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
