ADS Sample Notebook.

Copyright (c) 2021 Oracle, Inc. All rights reserved. Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.

***
# <font color=red>Using Feature Types in Exploratory Data Analysis</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal>Oracle Cloud Infrastructure Data Science Service Team</font></p>

***




# Overview:

In exploratory data analysis (EDA) the data scientist uses plots and statistics to summarize the characteristics of the data. The feature type system, in the ADS library, has been designed to speed up this process. The feature type system allows data scientists to separate the concept of how data is represented physically from what the data actually measures. The data can have feature types that classify the data based on what it represents and not how the data is stored in memory. In doing so, a feature type can have a set of built-in summary statistics and a plot. This notebook demonstrates how to use feature type plots and summary statistics in your EDA.

---

## Prerequisites:
- Experience with a specific topic: Intermediate
- Professional experience: Basic

---

## Objectives:

- <a href='#overview'>Feature Type System</a>
- <a href='#feature_count'>Feature Count</a>
- <a href='#feature_stat'>Feature Statistics</a>
- <a href="#warnings">Feature Type Warnings</a>
- <a href="#correlation">Correlation</a>
    - <a href="#correlation_pearson">Pearson Correlation Coefficient</a>
    - <a href="#correlation_correlation_ratio">Correlation Ratio</a>
    - <a href="#correlation_cramers_v">Cramér's V</a>
- <a href='#plots'>Feature Plot</a>
- <a href="#reference">References</a>


---

**Important:**

Placeholder text for required values is surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name, `database_name = "<database_name>"` would become `database_name = "production"`.

---

<font color=gray>Datasets are provided as a convenience. Datasets are considered third-party content and are not considered materials under your agreement with Oracle applicable to the services. The [`orcl_attrition` dataset](oracle_data/UPL.txt) is distributed under the UPL license.
</font>

In [None]:
import ads
import os
import pandas as pd
import seaborn as sns

from ads.dataset.dataset_browser import DatasetBrowser
from ads.feature_engineering import feature_type_manager, FeatureType
import matplotlib.pyplot as plt
from os import path

<a id='overview'></a>
# Feature Type System

The feature type system allows the data scientist to separate the concept of how data is represented physically from what the data actually measures. The data can have feature types that classify the data based on what it represents and not how the data is stored in memory. Each feature can have multiple feature types through a system of multiple inheritance. For example, an organization that sells cars might have a set of data that represents the purchase price of a car (the wholesale price). This could have a feature set of `wholesale_price`, `car_price`, `USD`, and `continuous`.

All default feature types have methods for creating summary statistics and a plot to represent the data. This allows you to have summary information for each feature of your dataset while only using a single command. However, the default feature types may not provide the exact details needed in your specific use case. Therefore, feature types have been designed with expandability in mind. When creating a new feature type, the summary statistics and plots that are specific to your feature type can be customized.

The feature type system works at the Pandas dataframe and series levels. This allows you to create summary information across all of your data and at the same time dig into the specifics of one feature.

The `.feature_count()` method returns a dataframe that provides a summary of the feature types used in a dataframe. Each row represents a feature type. It provides a count of the number of times that feature type is used in the dataframe. It also provides a count of the number of times that the feature type was the primary feature type. The primary feature type is the feature type that has no child feature types.

The `.feature_stat()` method returns a dataframe where each row represents a summary statistic and the numerical value for that statistic.

The `.feature_plot()` method returns a Seaborn plot object that summarizes the feature. It can be modified after it is returned so that you can customize it to fit your needs.

There are also a number of correlation methods such as `.correlation_ratio()`, `.pearson()`, and `.cramersv()` that provide information about the correlation between different features in the form of a dataframe. Each row represents a single correlation metric. This information can also be represented in a plot with the `.correlation_ratio_plot()`, `.pearson_plot()`, and `.cramersv_plot()` methods.

<a id='feature_count'></a>
# Feature Count

Each column in a Pandas dataframe is associated with at least one feature type. This is the default feature type, and it is determined by the Pandas dtype. However, the feature type system allows you to associate a feature with multiple feature types using a system of inheritance. As discussed in the <a href='#overview'>Feature Type System</a> section, a feature could have a feature set of `wholesale_price`, `car_price`, `USD`, and `continuous`.

The `.feature_count()` method can be called on a dataframe to provide a summary of what feature types are being used. The output is a dataframe where each row represents a feature type, which is listed in the Feature Type column. The next column lists the number of times the feature type appears in any of the columns. Since each feature can have multiple feature types, it counts all occurrences. The next column, Primary, is the count of the number of times that the feature type is listed as the primary feature type, that is, the feature type that has no subclasses.

In the next cell, the orcl_attrition dataset is loaded. The feature types for the selected columns will be assigned and the top of the dataframe is displayed.

In [None]:
attrition_path = os.path.join('/opt', 'notebooks', 'ads-examples', 'oracle_data', 'orcl_attrition.csv')
df = pd.read_csv(attrition_path, 
                 usecols=['Attrition', 'TravelForWork', 'JobFunction', 'TrainingTimesLastYear'])
df.ads.feature_type = {'Attrition': ['boolean', 'category'],
                         'TravelForWork': ['category'],
                         'JobFunction': ['category'],
                         'TrainingTimesLastYear': ['integer']}
df.head()

In the preceding cell, the `.ads.feature_type` property was used to store the feature types associated with each column. For example, the Attrition column has the feature types boolean and category. The `.ads.feature_type` property can also be used to return a dictionary that lists the feature types that are assigned to each feature. In the output of the next cell, notice that the Attrition feature has the feature types boolean, category, and string associated with it. However, in the preceding cell only boolean and category were specified. That is because the feature type system automatically appends the feature type string based on the Pandas dtype. This is called the default feature type. In the case of TrainingTimesLastYear, the feature type that was specified was integer. Since this matches the dtype, no additional feature type was appended.

In [None]:
df.ads.feature_type

The `.feature_count()` method is called on the dataframe in the next cell. It provides a summary of what features are being used across all features in the dataframe. The output dataframe has one row for each feature type that is represented in the dataframe. This is listed in the Feature Type column. The next column lists the number of times the feature type appears in any of the columns. For example, the category feature type appears in the Attrition, TravelForWork, and JobFunction columns. Therefore, it has a count of three. The Primary column is the count of the number of times that the feature type is listed as the primary feature type. For the category feature type, the value is two as TravelForWork and JobFunction have this as their primary feature type. While category is a feature type of Attrition, it is not the primary feature type, boolean is. In the case of the string feature type, it occurs in the Attrition, TravelForWork and JobFunction features, however, it is not the primary feature type in these features and thus its Count is 3 but its Primary count is 0.

In [None]:
df.ads.feature_count()
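
The counting rule described above can be reproduced by hand. The following is a minimal sketch in plain Python, not the ADS implementation: it assumes the feature types from the earlier cell, assumes the default feature types stated in the text (string for the three text columns, integer for TrainingTimesLastYear), and treats the first type in each list as the primary one.

```python
from collections import Counter

# Feature types as assigned in the notebook, listed from most to least specific.
specified = {
    "Attrition": ["boolean", "category"],
    "TravelForWork": ["category"],
    "JobFunction": ["category"],
    "TrainingTimesLastYear": ["integer"],
}
# Default feature type implied by each column's dtype (assumed from the text).
default_type = {
    "Attrition": "string",
    "TravelForWork": "string",
    "JobFunction": "string",
    "TrainingTimesLastYear": "integer",
}

# Append the default type only when it was not already specified.
feature_types = {
    col: types + ([default_type[col]] if default_type[col] not in types else [])
    for col, types in specified.items()
}

# Count: every occurrence of a type. Primary: only the first (most specific) type.
count = Counter(t for types in feature_types.values() for t in types)
primary = Counter(types[0] for types in feature_types.values())

for name in sorted(count):
    print(f"{name:<10} Count={count[name]} Primary={primary[name]}")
```

The category type is counted three times but is primary only twice, and string is counted three times without ever being primary, which matches the description above.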

<a id='feature_stat'></a>
# Feature Statistics

One of the main goals of EDA is to gain an understanding of the nature of your data. Computing summary statistics is one of the most common tasks in this process. The goal of the `.feature_stat()` method is to produce relevant summary statistics for the feature set. The feature type framework allows you to customize what statistics will be used in a feature type.

The `.feature_stat()` outputs a Pandas dataframe where each row represents a summary statistic. The statistics that are reported depend on the multiple inheritance of the feature types. The feature type framework will iterate from the primary feature type to the default feature type looking for a feature type that has the `.feature_stat()` method defined and then will dispatch on that.
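
The lookup order described above can be pictured with a small sketch. This is plain Python and only an illustration of the idea, not how ADS is implemented: the `handlers` registry, the `resolve_stat` function, and the two summary functions are hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Optional

StatFn = Callable[[List[str]], Dict[str, int]]

def category_stat(values: List[str]) -> Dict[str, int]:
    """Summary for categorical data: observations and distinct categories."""
    return {"count": len(values), "unique": len(set(values))}

def string_stat(values: List[str]) -> Dict[str, int]:
    """Fallback summary for the default string type."""
    return {"count": len(values)}

# Hypothetical registry: feature type name -> summary function (None if not defined).
handlers: Dict[str, Optional[StatFn]] = {
    "job_function": None,  # custom type that does not define its own statistics
    "category": category_stat,
    "string": string_stat,
}

def resolve_stat(feature_types: List[str]) -> StatFn:
    """Return the first handler found walking from primary to default type."""
    for name in feature_types:
        fn = handlers.get(name)
        if fn is not None:
            return fn
    raise LookupError("no feature type defines a summary statistic")

# Feature types ordered from the primary type to the default type.
chain = ["job_function", "category", "string"]
stat = resolve_stat(chain)
print(stat(["Admin", "TPM", "Admin"]))  # {'count': 3, 'unique': 2}
```

Because `job_function` has no handler in this sketch, the lookup falls through to `category`. Giving the primary type its own handler would take precedence.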

In the next cell, the `.feature_stat()` for the integer feature type is run. This feature set will return the count of the observations, the mean value, the standard deviation and <a href="https://en.wikipedia.org/wiki/Five-number_summary">Tukey's Five Numbers</a> (sample minimum, lower quartile, median, upper quartile, and sample maximum).

In [None]:
df['TrainingTimesLastYear'].ads.feature_stat()
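
To make the listed statistics concrete, the sketch below computes the same kinds of values with the Python standard library on a small made-up list. The data is hypothetical, and the choice of the sample standard deviation and the inclusive quartile method is an assumption made here, so the numbers ADS reports for a real column may be computed slightly differently.

```python
import statistics

# Hypothetical values standing in for an integer feature.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Quartiles by linear interpolation over the observed range.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "standard deviation": statistics.stdev(values),  # sample standard deviation
    "sample minimum": min(values),
    "lower quartile": q1,
    "median": median,
    "upper quartile": q3,
    "sample maximum": max(values),
}
for metric, value in summary.items():
    print(f"{metric}: {value}")
```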

The summary statistics that are created depend on the feature type. For example, the JobFunction column is categorical so it produces a count of the number of observations and the number of unique categories.

In [None]:
df['JobFunction'].ads.feature_stat()

This may not be an ideal summary for the JobFunction feature. Instead, you may want to know the number of observations in each job function. This can be done by creating a new feature type and its associated `.feature_stat()` method. In the next cell, a new feature type called `JobFunction` is created. It overrides the `.feature_stat()` method to produce a count of each job function in the data. This feature type is then registered and the JobFunction column in the dataframe is updated so that it now inherits from the `JobFunction` feature type. Then it prints the feature summary statistics for the JobFunction column.

In [None]:
try:  # remove the feature type if it is already registered
    feature_type_manager.feature_type_unregister(JobFunction)
except Exception:  # on the first run JobFunction is not yet defined or registered
    pass

# Create the JobFunction feature type
class JobFunction(FeatureType):
    @staticmethod
    def feature_stat(series: pd.Series) -> pd.DataFrame:
        result = dict()
        job_function = ['Product Management', 'Software Developer', 'Software Manager', 'Admin', 'TPM']
        for label in job_function:
            result[label] = len(series[series == label])
        return pd.DataFrame.from_dict(result, orient='index', columns=[series.name])

# Register the JobFunction feature type and assign it to the dataframe    
feature_type_manager.feature_type_register(JobFunction)
df['JobFunction'].ads.feature_type = ['job_function', 'category']
df['JobFunction'].ads.feature_stat()

The `.feature_stat()` method also works at the dataframe level. It will produce a similar output to the output for the series, except it will have an additional column that lists the column name where the metric was computed.

In [None]:
df.ads.feature_stat()

The `.feature_stat()` method outputs its data in row-dominant (long) format because it is easy to work with. However, there are times when the column-dominant (wide) format makes the data easier to understand. This is often the case when the features share similar summary statistics. Converting from the row-dominant to the column-dominant format can be done with the Pandas `.pivot_table()` method. Where there are missing values, a `NaN` is inserted.

In [None]:
df.ads.feature_stat().pivot_table(index='Column', columns='Metric', values = 'Value')
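
The effect of the pivot, including where `NaN` appears, is easier to see on a tiny table. The sketch below uses made-up statistics. Only the column names `Column`, `Metric`, and `Value` are taken from the cell above, while the feature names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical long-format statistics: one row per (column, metric) pair.
long_stats = pd.DataFrame(
    {
        "Column": ["Age", "Age", "Salary", "Salary", "Salary"],
        "Metric": ["count", "mean", "count", "mean", "max"],
        "Value": [4.0, 35.0, 4.0, 50.0, 90.0],
    }
)

# One row per column, one column per metric; metrics a feature lacks become NaN.
wide_stats = long_stats.pivot_table(index="Column", columns="Metric", values="Value")
print(wide_stats)
```

In this sketch only Salary has a `max` entry, so the `max` cell for Age is `NaN`.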

<a id="warnings"></a>
# Feature Type Warnings

Part of exploratory data analysis (EDA) is to check the state or condition of your data. This means checking that all the values are in a given range, or that they are properly formatted with no missing values. For categorical data, you often want to ensure that the cardinality is low enough for the type of modeling that you are doing. Since the feature type system is meant to understand the nature of your data, it's an ideal mechanism to help automate the evaluation of the data. This evaluation is done by registering warnings.

Feature type warnings are functions that are built-in or user-defined. They perform an analysis of a feature to determine if there are any data condition problems with the data. For example, it might report that a feature is skewed when it is expected that the data is normally distributed. Another common example is that the data may have more than some threshold of missing values. ADS comes with various common warnings built-in for the feature types that it supports. However, you are able to create and register any warnings that you want.

Feature warnings are defined at the feature type level. To create a feature type warning you create a function that takes a Pandas series and returns a dataframe with the following columns:
* `Warning`: The name of the warning.
* `Message`: A customized message to the user that describes the warning.
* `Metric`: What metric was used to generate the warning.
* `Value`: The value computed by the metric.

If there are no warnings then an empty dataframe with the above headings must be returned.
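
As an illustration of this contract, here is a minimal sketch of a handler that flags a feature when too many values are missing. It uses only Pandas and is not registered with ADS. The function name `missing_values_handler`, the 10% default threshold, and the wording of the message are assumptions made for this example.

```python
import pandas as pd

WARNING_COLUMNS = ["Warning", "Message", "Metric", "Value"]

def missing_values_handler(series: pd.Series, threshold: float = 0.1) -> pd.DataFrame:
    """Warn when the share of missing values exceeds a (hypothetical) threshold."""
    share = float(series.isna().mean()) if len(series) else 0.0
    rows = []
    if share > threshold:
        rows.append({
            "Warning": "missing values",
            "Message": f"{share:.0%} of the values are missing",
            "Metric": "share missing",
            "Value": share,
        })
    # With no rows this is an empty dataframe that still has the four headings.
    return pd.DataFrame(rows, columns=WARNING_COLUMNS)

print(missing_values_handler(pd.Series(["Admin", None, "TPM", None])))
print(missing_values_handler(pd.Series(["Admin", "TPM"])))
```

The first call yields one warning row and the second yields an empty dataframe with the same four columns.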

The warning handler must be registered with the feature type system using the `.warning.register()` method on the feature type object.

In the next cell, the function `missing_job_function_handler` is created. It checks that every job function category is present in the dataset and generates a warning for each one that is missing. Each warning is on a separate row. The handler is registered with the `.warning.register()` method, where the first parameter is the name of the warning and the second parameter is the function that will be used as the handler. This function accepts a Pandas series.

A call to the `.warning.registered()` method will list the feature type warnings that are registered to the feature type object.

In [None]:
def missing_job_function_handler(series: pd.Series) -> pd.DataFrame:
    job_function = ['Product Management', 'Software Developer', 'Software Manager', 'Admin', 'TPM']
    # Build one row per expected category that never appears in the series.
    # DataFrame.append was removed in pandas 2.0, so the rows are collected in a list.
    rows = [{'Warning': 'missing category',
             'Message': f'The category {label} is not in the dataset',
             'Metric': 'existence',
             'Value': False}
            for label in job_function if (series == label).sum() == 0]
    # When nothing is missing this is an empty dataframe with the required columns.
    return pd.DataFrame(rows, columns=['Warning', 'Message', 'Metric', 'Value'])

JobFunction = feature_type_manager.feature_type_object('job_function')
JobFunction.warning.register("missing_job_function", missing_job_function_handler, replace=True)
JobFunction.warning.registered()

Feature type warnings can be generated by using a Pandas series and calling `.warning()`. It returns the four columns that were described above for the feature type warning handler (`Warning`, `Message`, `Metric`, and `Value`) plus the column `Feature Type` which is the name of the feature type that generated the warning. Since each feature can have multiple feature types it is possible that different feature types will generate different warnings. 

In [None]:
df['JobFunction'].ads.warning()

Generally, it is more convenient to check the warnings on an entire dataframe. This is done by calling `.warning()` on a dataframe. The output is the same as for the series except there is an additional column, `Column`, that lists the dataframe column that is associated with the warning message.

In [None]:
df.ads.warning()

<a id="correlation"></a>
# Correlation

Generally, a data scientist wants to make a model as parsimonious as possible. This often involves determining what features are highly correlated and removing some of them. While some models such as decision trees are not sensitive to correlated variables, other ones such as ordinary least squares regression are. You may also want to remove correlated variables as it reduces the cost of collecting and processing the data.

The EDA features in ADS speed up your analysis by providing methods to compute different types of correlations. Several correlation techniques are provided because they have different use cases. Each technique has two methods: one returns a dataframe with the correlation information, and its partner method generates a plot.

What correlation technique you use depends on the type of data that you are working with. When using these correlation techniques you will need to slice your dataframe so that only the appropriate feature types are used in the calculation. The following is a summary of the different correlation techniques and what data should be used.
* `pearson`: The <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">Pearson correlation coefficient</a> is a normalized measure of the covariance between two sets of data. In essence, it measures the linear correlation between the datasets. This method is used when both datasets consist of continuous values.
* `correlation_ratio`: The <a href="https://en.wikipedia.org/wiki/Correlation_ratio">Correlation ratio</a> measures the extent to which a distribution is spread out within individual categories relative to the spread of the entire population. This metric is used to compare categorical variables to continuous values.
* `cramersv`: The <a href="https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V">Cramér's V</a> provides a measure of the degree of association between two categorical/nominal datasets.

<a id="correlation_pearson"></a>
## Pearson Correlation Coefficient

The <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">Pearson correlation coefficient</a> is known by a number of names such as Pearson's r, Pearson product-moment correlation coefficient, bivariate correlation, or the correlation coefficient. It has a range of [-1, 1], where 1 means that the two datasets are perfectly positively correlated and -1 means that they are perfectly negatively correlated, so that when one dataset is increasing the other one is decreasing. A value of 0 means that there is no linear correlation.

The Pearson correlation coefficient is a normalized value of the covariance between the continuous datasets X and Y. It is normalized by the product of the standard deviations of X and Y and is given by the following formula:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}$$
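
The formula can be checked by hand on a few numbers. The sketch below is a plain Python transcription of it using population (divide by n) covariance and standard deviations. The data is made up for illustration and is not taken from the attrition dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)

# Hypothetical data, not taken from the attrition dataset.
age = [25, 30, 35, 40, 45]
years_on_job = [1, 3, 5, 7, 9]           # increases linearly with age
years_to_retire = [40, 35, 30, 25, 20]   # decreases linearly with age

print(pearson(age, years_on_job))     # approximately 1
print(pearson(age, years_to_retire))  # approximately -1
```

Because the same divisor appears in the covariance and in both standard deviations, using the sample (n - 1) versions instead would give the same coefficient.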

In [None]:
df = pd.read_csv(attrition_path,
                usecols=['Age', 'YearsinIndustry', 'YearsOnJob', 'YearsWithCurrManager', 'YearsAtCurrentLevel'])
df.ads.feature_type = {'Age': ['continuous'], 'YearsinIndustry': ['continuous'], 'YearsOnJob': ['continuous'], 
                     'YearsWithCurrManager': ['continuous'], 'YearsAtCurrentLevel': ['continuous']}
df.ads.pearson()

This same information can be represented in a plot using the `.pearson_plot()` method.

In [None]:
df.ads.pearson_plot()

<a id="correlation_correlation_ratio"></a>
## Correlation Ratio

Statistical dispersion, or scatter, is a measure of the spread of a distribution with variance being a common metric. The <a href="https://en.wikipedia.org/wiki/Correlation_ratio">Correlation ratio</a> is a measure of dispersion within categories relative to the dispersion across the entire dataset. The square of the Correlation ratio is the weighted variance of the category means over the variance of all samples. It is given by the formula:
$$\eta = \sqrt{\frac{\sigma_{\bar{y}}^2}{\sigma_y^2}}$$

where:
$$\sigma_{\bar{y}}^2 = \frac{\sum_x n_x(\bar{y}_x - \bar{y})^2}{\sum_x n_x}$$
$$\sigma_{y}^2 = \frac{\sum_{x,i} (y_{x,i} - \bar{y})^2}{n}$$

where $n$ is the total number of observations and $n_x$ is the number of observations in a category $x$. $y_{x,i}$ is the $i$-th observation in category $x$, $\bar{y}_x$ is the mean value in category $x$, and $\bar{y}$ is the overall mean.

Values of $\eta$ near zero indicate that there is no dispersion between the means of the different categories. A value of $\eta$ near one suggests that there is no dispersion within the respective categories.
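
Both extremes can be reproduced with a few lines of plain Python. The sketch below computes the weighted variance of the category means and divides it by the variance of all observations. The category labels and values are made up, and the function assumes the values are not all identical.

```python
import math
from collections import defaultdict

def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a continuous variable."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    n = len(values)
    overall_mean = sum(values) / n
    # Variance of the category means, weighted by category size.
    between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    ) / n
    # Variance of all observations around the overall mean.
    total = sum((v - overall_mean) ** 2 for v in values) / n
    return math.sqrt(between / total)

# Hypothetical data: two job functions and an age for each employee.
job = ["Admin", "Admin", "TPM", "TPM"]

# Ages are identical within each category, so eta is 1.
print(correlation_ratio(job, [30, 30, 50, 50]))
# Both categories have the same mean age, so eta is 0.
print(correlation_ratio(job, [30, 50, 30, 50]))
```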

In [None]:
df = pd.read_csv(attrition_path,
                usecols=['JobFunction', 'Age', 'YearsinIndustry', 'YearsOnJob', 'YearsWithCurrManager', 'YearsAtCurrentLevel'])
df.ads.feature_type = {'Age': ['continuous'], 'YearsinIndustry': ['continuous'], 'YearsOnJob': ['continuous'], 
                     'YearsWithCurrManager': ['continuous'], 'YearsAtCurrentLevel': ['continuous'],
                      'JobFunction': ['category']}
df.ads.correlation_ratio()

In [None]:
df.ads.correlation_ratio_plot()

<a id='target'></a>

<a id="correlation_cramers_v"></a>
## Cramér's V

<a href="https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V">Cramér's V</a> is used to measure the amount of association between two categorical/nominal variables. A value of zero means that there is no association between the bivariates and a value of one means that there is complete association. $V$ is the percentage of the maximum association between the variables, which is dependent on the frequency in which the tuples $(x_i, y_j)$ occur.

The value of $V$ is related to the chi-squared statistic, $\chi^2$, and is given by:
$$V = \sqrt{\frac{\chi^2}{\min(k-1, r-1)\,n}}$$

where: $k$ and $r$ are the number of categories in the datasets $x$ and $y$. $n$ is the sample size.
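
The following is a minimal sketch of this formula in plain Python. It computes the chi-squared statistic from the observed and expected counts without any bias or continuity correction, so it may not match the values ADS reports. The data is made up, and each variable is assumed to have at least two categories.

```python
import math
from collections import Counter

def cramers_v(x, y):
    """Uncorrected Cramér's V between two categorical sequences."""
    n = len(x)
    observed = Counter(zip(x, y))           # counts of each (x_i, y_j) pair
    x_totals, y_totals = Counter(x), Counter(y)
    chi2 = 0.0
    for a in x_totals:
        for b in y_totals:
            expected = x_totals[a] * y_totals[b] / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    k, r = len(x_totals), len(y_totals)
    return math.sqrt(chi2 / (min(k - 1, r - 1) * n))

# Hypothetical data: how often an employee travels and their job function.
travel = ["often", "often", "rarely", "rarely"]

# Job function is fully determined by travel frequency: complete association.
print(cramers_v(travel, ["TPM", "TPM", "Admin", "Admin"]))
# Every combination occurs equally often: no association.
print(cramers_v(travel, ["TPM", "Admin", "TPM", "Admin"]))
```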


In [None]:
df = pd.read_csv(attrition_path,
                 usecols=['TravelForWork', 'JobFunction', 'EducationField', 'EducationalLevel'])
df.ads.feature_type = {'TravelForWork': ['category'], 'JobFunction': ['category'], 'EducationField': ['category'], 
                     'EducationalLevel': ['category']}
df.ads.cramersv()

In [None]:
df.ads.cramersv_plot()

<a id='plots'></a>
# Feature Plot

Visualization of a dataset is a quick way to gain insights into the distribution of values. The feature type system in ADS provides plots for all ADS-supported feature types, and it is also easy to create feature plots for your custom feature types. Calling `.feature_plot()` on a Pandas series will produce a univariate plot. The next cell produces a bar chart that counts the employees by how often they travel.

In [None]:
df = pd.read_csv(attrition_path, 
                usecols=['Attrition', 'TravelForWork', 'JobFunction', 'TrainingTimesLastYear'])
df.ads.feature_type = {'Attrition': ['category'], 'TravelForWork': ['category'],
                       'JobFunction': ['category'], 'TrainingTimesLastYear': ['continuous']}
df['TravelForWork'].ads.feature_plot()

The `.feature_plot()` method on a Pandas series returns the plot object, which exposes Matplotlib axes methods such as `.set_title()`. This allows you to modify the plot to customize it further. The next cell captures the plot and adds a title.

In [None]:
travel_plot = df['TravelForWork'].ads.feature_plot()
travel_plot.set_title("Count of the Number of Employees and How Much they Travel")

It is often expedient to produce the feature plots for all the features in the dataframe. This can be done by calling `.feature_plot()` on the dataframe. It will return a dataframe where each row represents a feature. There are two columns, `Column` which is the name of the column and `Plot` which is the plot object.

In [None]:
df.ads.feature_plot()

<a id="reference"></a>
# References
- [Oracle ADS Library documentation](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/index.html)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)