ADS Sample Notebook.

Copyright (c) 2021 Oracle, Inc. All rights reserved. Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.

***
# <font color=red>Feature Type Warnings</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal>  Oracle Cloud Infrastructure Data Science Service Team </font></p>

***

# Overview:

Feature type warnings are part of the feature type system in ADS. They allow you to automate the process of checking for data quality issues. One of the most time-consuming parts of exploratory data analysis (EDA) is examining the data to make sure that it meets data quality standards. Historically, this has been a manual process, and when the existing data is updated or a new dataset is used, the process has to be repeated. The feature type warning infrastructure allows you to code checks on the data and then rerun them each time a new dataset is used. Since the code is at the feature type level, you can reuse the feature type warnings across an entire organization's data.

The feature type warning system works across an entire feature. For example, you can check the number of missing values and set a threshold for the permitted upper limit. This can be a count, a percentage, or some other metric. You can also create checks that ensure that the data has the distribution assumed by the model class that you want to use. For example, inference with linear regression assumes normally distributed residuals, and some workflows also check the features themselves for normality. Therefore, a feature type warning might run a Shapiro-Wilk test and compare the result against a threshold.
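
As a rough illustration of a distribution check, the following sketch builds a handler that warns when the sample skewness of a feature exceeds a limit. It is a simple stand-in for a formal normality test such as Shapiro-Wilk, not an ADS built-in. The names `make_skew_handler` and `max_abs_skew`, and the default limit of 1.0, are assumptions made for this example. The four-column result follows the handler contract described in the "Register a Feature Type Warning" section of this notebook.

```python
import math
from typing import Callable, Optional

import pandas as pd


def make_skew_handler(max_abs_skew: float = 1.0) -> Callable[[pd.Series], Optional[pd.DataFrame]]:
    """Build a handler that warns when the absolute sample skewness exceeds a limit."""

    def handler(x: pd.Series) -> Optional[pd.DataFrame]:
        skew = float(x.skew())  # pandas skips NaN values by default
        if math.isnan(skew) or abs(skew) <= max_abs_skew:
            return None  # nothing to report
        return pd.DataFrame({
            "Warning": ["skew"],
            "Message": [f"skewness {skew:.2f} is outside the range "
                        f"{-max_abs_skew:.2f} to {max_abs_skew:.2f}"],
            "Metric": ["skewness"],
            "Value": [skew],
        })

    return handler


check_skew = make_skew_handler(max_abs_skew=1.0)
symmetric = pd.Series([1, 2, 3, 4, 5])
lopsided = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1, 100])
symmetric_result = check_skew(symmetric)  # no warning expected
lopsided_result = check_skew(lopsided)    # one warning row expected
```

Building the handler through a factory keeps the threshold configurable while the handler itself still accepts only a series.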

Each feature can have as many feature type warnings as you want. Also, the multiple inheritance nature of the feature type system allows you to write only the feature type warnings that are relevant for a specific feature type, because the warnings for all inherited feature types are also checked. This reduces code duplication and speeds up your EDA.

---

## Prerequisites:
 - Experience with a specific topic: Intermediate
 - Professional experience: Basic

---

## Objectives:

 - <a href='#overview'>Overview</a>
 - <a href="#warnings">Feature Type Warnings</a>
 - <a href="#validating_data">Validating Data</a>
 - <a href="#examining">Examining Feature Type Warnings</a>
 - <a href="#register_unregister">Registering and Unregistering a Feature Type Warning</a>
     - <a href="#register">Register a Feature Type Warning</a>
     - <a href="#unregister">Unregister a Feature Type Warning</a>
 - <a href="#dataframe">Working with a DataFrame</a>
 - <a href="#reference">References</a>

---

**Important:**

Placeholder text for required values is surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name, `database_name = "<database_name>"` would become `database_name = "production"`.

---

<font color=gray>Datasets are provided as a convenience. Datasets are considered third party content and are not considered materials under your agreement with Oracle applicable to the services. The [`orcl_attrition` dataset](oracle_data/UPL.txt) is distributed under the UPL license.
</font>

In [None]:
import ads
import numpy as np
import os
import pandas as pd

from ads.feature_engineering import feature_type_manager, Tag

<a id="overview"></a>
# Feature Type System

The feature type system allows the data scientist to separate the concept of how data is represented physically from what the data actually measures. The data can have feature types that classify the data based on what it represents and not how the data is stored in memory. Each feature can have multiple feature types through a system of multiple inheritance. For example, an organization that sells cars might have a set of data that represents the purchase price of a car (the wholesale price). This could have a feature set of `wholesale_price`, `car_price`, `USD`, and `continuous`. This multiple inheritance allows a data scientist to create <a href="#warnings">feature type warnings</a> for each feature type.

Feature type warnings are used for rapid validation of the data. For example, the `wholesale_price` might have a method that ensures that the value is a positive number because you can't purchase a car with negative money. The `car_price` feature type might have a check to ensure that it is within a reasonable price range. `USD` can check the value to make sure that it represents a valid US dollar amount and it isn't below one cent. The `continuous` feature type is the default type, and it represents the way the data is stored internally.
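
The following plain Python sketch illustrates the idea that a feature carrying several feature types gets the checks of all of them. It is not how ADS implements the feature type system. The `CHECKS` registry, the `run_checks()` function, the check names, and the 500,000 price ceiling are all hypothetical and exist only for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Check = Callable[[List[float]], Optional[str]]


def negative_price(values: List[float]) -> Optional[str]:
    return "negative price" if any(v < 0 for v in values) else None


def above_ceiling(values: List[float]) -> Optional[str]:
    # 500_000 is an arbitrary ceiling chosen for this illustration.
    return "price above 500000" if any(v > 500_000 for v in values) else None


def below_one_cent(values: List[float]) -> Optional[str]:
    return "amount below one cent" if any(0 < v < 0.01 for v in values) else None


# Hypothetical registry: feature type name -> {warning name -> check}.
CHECKS: Dict[str, Dict[str, Check]] = {
    "wholesale_price": {"negative": negative_price},
    "car_price": {"out_of_range": above_ceiling},
    "USD": {"sub_cent": below_one_cent},
    "continuous": {},
}


def run_checks(values: List[float], feature_types: List[str]) -> List[Tuple[str, str, str]]:
    """Run every check of every feature type assigned to the feature."""
    found = []
    for feature_type in feature_types:
        for name, check in CHECKS[feature_type].items():
            message = check(values)
            if message is not None:
                found.append((feature_type, name, message))
    return found


prices = [18500.0, -200.0, 0.005, 42000.0]
warnings_found = run_checks(prices, ["wholesale_price", "car_price", "USD", "continuous"])
```

Each check is written once, next to the feature type it belongs to, and every feature that carries that type picks it up.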

<a id="warnings"></a>
# Feature Type Warnings

Part of exploratory data analysis (EDA) is checking the state or condition of your data. For example, you might check that there are no missing values. With categorical data, you often want to ensure that the cardinality is low enough for the type of modeling that you are doing. Since the feature type system is meant to capture the nature of your data, it is an ideal mechanism for automating the evaluation of the data. This evaluation is done by registering feature type warning handlers with ADS.

Feature type warning handlers are functions that are built-in or user-defined. They perform an analysis of a feature to determine whether there are any data condition problems. For example, it might report that a feature is skewed when the expectation is that the data is normally distributed. Another common example is that the data might have more than some threshold of missing values. ADS comes with various common warnings built-in for the feature types that it supports. However, you are able to create and register any warnings that you want.

Feature type warnings are defined at the feature type level. Warnings can be registered dynamically at runtime. The `feature_type_manager.warning_registered()` method returns a dataframe of the warnings registered for each registered feature type. The three columns of the returned dataframe are:

- `Feature Type`: Feature Type class name.
- `Warning`: Warning name.
- `Handler`: Registered warning handler for that feature type.

In [None]:
feature_type_manager.warning_registered()

<a id="validating_data"></a>
# Validating Data

The `.warning()` method runs all the data quality tests on a feature. It creates a dataframe where each row is the result of a test that generated a warning. The dataframe contains the feature and warning type that generated a warning. There is also a human-readable message that explains the warning. The metric and value columns contain information about the metric that was used and the value of that metric.

In the next cell, a set of credit card values is used as the dataset. The feature type is set to `credit_card` and the feature type warnings are reported.

In [None]:
visa = ["4532640527811543", "4556929308150929", "4539944650919740", "4485348152450846", "4556593717607190"]
amex = ["371025944923273", "374745112042294", "340984902710890", "375767928645325", "370720852891659"]
invalid = [np.nan, None, "", "123", "abc"]
series = pd.Series(visa + amex + invalid, name='creditcard')
series.ads.feature_type = ['credit_card']
series.ads.warning()

There are several things to notice about the generated dataframe. While the feature type was set to `credit_card`, the dataframe also lists `string` in the feature type column. This is because the default feature type is `string` so the feature type warning system also ran the tests for the `string` feature type.

The tuple (`credit_card`, `missing`) reports two warnings. This is because each warning can perform as many tests as it wants and reports as many warnings as required. This is also true for the (`string`, `missing`) tuple.
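
To make the point about one warning producing several rows concrete, here is a minimal sketch of a handler that reports missing values both as a count and as a percentage. It is an illustration only, and the built-in ADS `missing` handlers may compute or word their results differently. The column names follow the handler contract described in the "Register a Feature Type Warning" section.

```python
import numpy as np
import pandas as pd


def missing_handler(x: pd.Series) -> pd.DataFrame:
    """Report missing values twice: once as a count and once as a percentage."""
    columns = ["Warning", "Message", "Metric", "Value"]
    count = int(x.isna().sum())
    if count == 0:
        return pd.DataFrame(columns=columns)  # an empty frame means no warnings
    pct = round(100 * count / len(x), 2)
    rows = [
        ["missing", f"{count} missing values", "count", count],
        ["missing", f"{pct}% missing values", "percentage", pct],
    ]
    return pd.DataFrame(rows, columns=columns)


cards = pd.Series(["4532640527811543", np.nan, None, "", "123"], name="creditcard")
result = missing_handler(cards)  # two rows: one per metric
```

Note that pandas treats `np.nan` and `None` as missing, but not the empty string, so only two of the five entries are counted here.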

<a id="examining"></a>
# Examining Feature Type Warnings

Feature type warning handlers are functions that have been registered with ADS. The `.warning.registered()` method is part of the `warning` module and allows you to examine what warnings have been registered for a given feature type. 

The next cell uses the feature type manager to get a handle to the `credit_card` class. This class is an `ads.feature_engineering.feature_type.base.FeatureType` object that lists the name of the feature type warning and its handler.

In [None]:
CreditCard = feature_type_manager.feature_type_object('credit_card')
CreditCard.warning.registered()

<a id="register_unregister"></a>
# Registering and Unregistering a Feature Type Warning

In the <a href="#validating_data">Validating Data</a> section, a set of credit card numbers was created. Several missing values were found by the feature type warning tests. However, the set also contains credit card numbers that aren't valid. The feature type warning system didn't detect them because there was no test for invalid numbers. You can solve this problem by creating a feature type warning that detects them.


<a id="register"></a>
## Register a Feature Type Warning

This section shows you how to create a new feature type warning. The goal of this example is to create a feature type warning handler using the built-in  `.is_credit_card()` feature handler. The example also registers the handler and demonstrates that it is able to create warnings for invalid credit card numbers.

There are two steps to creating a feature type warning. The first is to write a function that accepts a Pandas series and returns a carefully crafted dataframe. If there are no warnings, then the dataframe can be empty or the handler can return `None`. The dataframe has to have the following columns:

* `Warning`: A string that describes the type of warning.
* `Message`: A human-readable message about the warning.
* `Metric`: A string that describes what is being measured.
* `Value`: The value associated with the metric.

The next cell uses the feature type warning handler, `invalid_credit_card_handler`, which uses the `.is_credit_card()` feature handler to create a binary list indicating whether the credit card number is valid or not. It then computes the number of invalid cards. If there are any invalid cards, then it creates a row in a dataframe with the relevant information. If not, it returns `None`. This example returns one row at most. However, it can return as many rows as needed. For example, it could return a row for each credit card that is invalid and provide the credit card number.

In [None]:
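def invalid_credit_card_handler(x: pd.Series):
    """Warn when the series contains values that are not valid credit card numbers."""
    # Count the entries that fail the built-in credit card validator.
    value = len(x) - CreditCard.validator.is_credit_card(x).sum()
    if value > 0:
        # Build the single-row dataframe in the layout that ADS expects.
        df = pd.DataFrame(columns=['Warning', 'Message', 'Metric', 'Value'])
        df.Value = [value]
        df.Warning = ['invalid credit card count']
        df.Message = [f'{df.Value.values[0]} invalid credit cards']
        df.Metric = ['count']
        return df
    # No invalid entries, so there is nothing to report.
    return None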
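
As a variation on the handler above, the following sketch returns one row per invalid entry instead of a single count. So that it runs without ADS, it uses a Luhn checksum as a stand-in validator. Whether `.is_credit_card()` uses the Luhn checksum is an assumption here, and a production validator would typically also check length and issuer prefix. The names `luhn_valid` and `invalid_credit_card_rows` are hypothetical.

```python
import pandas as pd


def luhn_valid(number) -> bool:
    """Return True when `number` is a digit string that passes the Luhn checksum."""
    if not isinstance(number, str) or not number.isdigit():
        return False
    total = 0
    # Walk the digits right to left, doubling every second one.
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def invalid_credit_card_rows(x: pd.Series):
    """Return one warning row per invalid entry, or None when all are valid."""
    invalid = x[~x.map(luhn_valid)]
    if invalid.empty:
        return None
    return pd.DataFrame({
        "Warning": ["invalid credit card"] * len(invalid),
        "Message": [f"{value!r} is not a valid credit card number" for value in invalid],
        "Metric": ["value"] * len(invalid),
        "Value": list(invalid),
    })


cards = pd.Series(["4532640527811543", "371025944923273", "123", "abc"], name="creditcard")
rows = invalid_credit_card_rows(cards)  # one row each for "123" and "abc"
```

Returning a row per value makes the offending entries visible in the warning report, at the cost of a longer report on large datasets.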

Feature type warnings are registered with the feature type of interest. You can assign the same handler to multiple feature types. In this example, the feature type manager is used to get a handle to the `credit_card` class, which is an `ads.feature_engineering.feature_type.base.FeatureType` class. The  `.register()` method in the `warning` module registers the handler for the warning. You give it a name for the handler and the handler function. The optional `replace = True` parameter overwrites the handler when the name exists.

In [None]:
CreditCard = feature_type_manager.feature_type_object('credit_card')
CreditCard.warning.register(name='invalid_credit_card',
                            handler=invalid_credit_card_handler,
                            replace=True)

Using the `.registered()` method in the `warning` module, you can see that the `invalid_credit_card` handler has been registered:

In [None]:
CreditCard.warning.registered()

Run the `.warning()` method on the feature to show the updated warnings. Notice the `invalid credit card count` in the warning column:

In [None]:
series.ads.warning()

<a id="unregister"></a>
## Unregister a Feature Type Warning

You can remove a feature type warning from a feature type using the `.unregister()` method of the `warning` module. It accepts the name of the feature type warning. If the warning doesn't exist, it raises a `WarningNotFound` exception.

The next cell removes the `high_cardinality` warning and displays the remaining feature type warnings:

In [None]:
CreditCard.warning.unregister('high_cardinality')
CreditCard.warning.registered()
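
The register, replace, and unregister semantics described above can be sketched with a toy registry. This is not the ADS implementation. The `WarningRegistry` and `WarningNotFoundError` classes are hypothetical stand-ins, and raising an error when a name already exists and `replace` is not set is an assumption about the behavior.

```python
class WarningNotFoundError(KeyError):
    """Stand-in for the exception raised when a warning name is unknown."""


class WarningRegistry:
    """Toy name -> handler registry; not the ADS implementation."""

    def __init__(self):
        self._handlers = {}

    def register(self, name, handler, replace=False):
        # Refuse to overwrite silently unless the caller opts in.
        if name in self._handlers and not replace:
            raise ValueError(f"warning {name!r} is already registered")
        self._handlers[name] = handler

    def unregister(self, name):
        if name not in self._handlers:
            raise WarningNotFoundError(name)
        del self._handlers[name]

    def registered(self):
        return sorted(self._handlers)


registry = WarningRegistry()
registry.register("missing", lambda x: None)
registry.register("high_cardinality", lambda x: None)
registry.register("missing", lambda x: None, replace=True)  # overwrite is allowed
registry.unregister("high_cardinality")

try:
    registry.unregister("high_cardinality")  # already removed
    second_unregister_failed = False
except WarningNotFoundError:
    second_unregister_failed = True
```

With the real API, the same pattern applies: wrap `.unregister()` in a `try` block if the warning might already have been removed, for example when a notebook cell is rerun.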

<a id="dataframe"></a>
# Working with a DataFrame

While working with a Pandas series can be powerful, it is often more convenient to check for warnings on an entire dataframe. This section uses the orcl_attrition.csv dataset to look at dataframe level operations that you can perform with the feature type warning tools.

The next cell loads the sample dataset:

In [None]:
attrition_path = os.path.join('/opt', 'notebooks', 'ads-examples', 'oracle_data', 'orcl_attrition.csv')
df = pd.read_csv(attrition_path, 
                 usecols=['Age', 'Attrition', 'JobFunction', 'EducationalLevel', 'EducationField', 
                          'Gender', 'JobRole','MonthlyIncome'
                          ])
df.head()

In [None]:
df.ads.feature_type

In [None]:
df.ads.feature_type = {'Age': ['integer'],
 'Attrition': ['category'],
 'JobFunction': ['string'],
 'EducationalLevel': ['string'],
 'EducationField': ['string'],
 'Gender': ['string'],
 'JobRole': ['string'],
 'MonthlyIncome': ['integer']}

In [None]:
df.ads.feature_type

In [None]:
df['Attrition'].ads.feature_type

When data is first loaded, each column has only a default feature type, which is inferred from the Pandas dtype.

In [None]:
df.ads.feature_type

The `.feature_type` property accepts a dictionary where the keys are the column names and the values are lists of the feature types to assign to each column. Only the columns that you want to update are needed in the dictionary. The next cell excludes `Age` because its default feature type, `integer`, is sufficient:

In [None]:
df.ads.feature_type = {'Attrition': ['boolean', 'category', Tag('target')],
                         'JobFunction': ['category'],
                         'EducationalLevel': ['category'],
                         'EducationField': ['category'],
                         'Gender': ['category'],
                         'JobRole': ['category'],
                         'MonthlyIncome': ['continuous']}
df.ads.feature_type

The `.warning_registered()` method on the dataframe shows all of the feature type warnings for all of the columns in the dataframe:

In [None]:
df.ads.warning_registered()

The `.warning()` method on the dataframe shows all of the warnings for all of the columns in the dataframe. This is a quick way to check all of your data with a single command.

In [None]:
df.ads.warning()
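
Conceptually, a dataframe-level report can be thought of as running the handlers column by column and stacking the results. The sketch below shows that idea with a hypothetical cardinality handler and made-up rows that are not taken from `orcl_attrition.csv`. The `collect_warnings()` function, the `Column` label, and the limit of 2 are assumptions for illustration, and the actual layout returned by `df.ads.warning()` may differ.

```python
import pandas as pd

COLUMNS = ["Warning", "Message", "Metric", "Value"]


def make_cardinality_handler(limit: int):
    """Build a handler that warns when a column has more than `limit` unique values."""

    def handler(x: pd.Series):
        unique = int(x.nunique())  # NaN values are not counted
        if unique <= limit:
            return None
        row = ["high_cardinality", f"{unique} unique values", "count", unique]
        return pd.DataFrame([row], columns=COLUMNS)

    return handler


def collect_warnings(df: pd.DataFrame, handlers) -> pd.DataFrame:
    """Run each handler on each column and stack the results with a Column label."""
    frames = []
    for column in df.columns:
        for handler in handlers:
            result = handler(df[column])
            if result is not None and not result.empty:
                frames.append(result.assign(Column=column))
    if not frames:
        return pd.DataFrame(columns=["Column"] + COLUMNS)
    return pd.concat(frames, ignore_index=True)[["Column"] + COLUMNS]


# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "JobRole": ["Sales", "Research", "Support", "Manager"],
})
report = collect_warnings(sample, [make_cardinality_handler(limit=2)])
```

Only `JobRole` appears in the report because `Gender` has two unique values, which is within the limit.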

<a id="reference"></a>
# References

- [ADS Library Documentation](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/index.html)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
