# Using `redflag`

It's early days, but there are a few things you can do:

- Outlier detection
- Clipping detection
- Imbalance metrics (for labels and any other categorical variables)
- Distribution shape
- Self-correlation

In [None]:
import redflag as rf

rf.__version__

## Load some data

In [None]:
import pandas as pd

df = pd.read_csv('https://geocomp.s3.amazonaws.com/data/Panoma_training_data.csv')
df.head()

In [None]:
df.info()

## Outliers

Currently, the only way to detect outliers is with the Z-score. The function transforms the feature to Z-scores, then counts how many samples lie more than 3 standard deviations from the mean (ordinary outliers) and how many lie more than 4.89 standard deviations from it (extreme outliers). It returns the ratio of actual to expected samples, along with the indices of those samples.

In [None]:
rf.has_outliers(df['PHIND'])

In [None]:
import seaborn as sns

sns.displot(df['PHIND'])
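
To make the Z-score test concrete, here is a minimal sketch in plain NumPy. It is not `redflag`'s implementation: the helper name `zscore_outlier_ratio` is hypothetical, and it assumes that the 'expected' count is the number of samples a normal distribution would place beyond the threshold.

```python
import math

import numpy as np


def zscore_outlier_ratio(x, threshold=3.0):
    """Return (actual / expected count, indices) for samples beyond `threshold` sigma."""
    x = np.asarray(x, dtype=float)
    z = (x - np.nanmean(x)) / np.nanstd(x)
    idx, = np.where(np.abs(z) > threshold)
    # Assumption: 'expected' is the two-sided tail mass of a normal distribution.
    expected = math.erfc(threshold / math.sqrt(2)) * x.size
    return idx.size / expected, idx


# Synthetic data: 1000 normal samples plus two injected outliers at the end.
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(size=1000), [15.0, -15.0]])

ratio, idx = zscore_outlier_ratio(data)
print(ratio, idx)
```

A ratio well above 1 would suggest more outliers than a normal distribution predicts. Passing `threshold=4.89` gives the corresponding check for extreme outliers.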

## Clipping

If a feature has been clipped, it will have multiple samples at its minimum and/or maximum value. There are legitimate reasons why this might happen: for example, the feature may be naturally bounded (e.g. porosity is always greater than 0), or it may have been deliberately clipped as part of the data preparation process.

In [None]:
rf.is_clipped(df['GR'])

In [None]:
sns.displot(df['GR'])
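
The idea of 'multiple samples at the min and/or max' can be sketched by counting how often the extreme values occur. This is an illustration only, not how `rf.is_clipped` works internally; the helper name `count_at_extremes` is hypothetical.

```python
import numpy as np


def count_at_extremes(x):
    """Count how many samples sit exactly at the minimum and at the maximum."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]  # Ignore missing values.
    n_min = int(np.sum(x == x.min()))
    n_max = int(np.sum(x == x.max()))
    return n_min, n_max


raw = np.arange(-30, 31) / 10       # 61 distinct values from -3.0 to 3.0.
clipped = np.clip(raw, -1.0, 1.0)   # Everything outside [-1, 1] piles up at the limits.

print(count_at_extremes(raw))
print(count_at_extremes(clipped))
```

Unclipped continuous data usually has a single sample at each extreme, so counts much larger than 1 are worth a closer look.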

## Imbalance metrics

For binary targets, the metric is the imbalance ratio (the ratio between the majority and minority class).

For multiclass targets, the metric is the imbalance degree, a single-value measure that reflects (a) how many minority classes there are and (b) how skewed the supports are.

In [None]:
rf.class_imbalance(df['Lithology'])

In [None]:
rf.minority_classes(df['Lithology'])
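
As a rough illustration of these ideas, the sketch below computes the imbalance ratio and lists the classes with less than an equal share of the samples. Both helper names are hypothetical, and treating 'minority' as 'below an equal share' is an assumption rather than a statement about what `redflag` does. The imbalance degree itself is more involved and is not reproduced here.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def below_equal_share(labels):
    """Classes with fewer samples than they would have if all classes were equal."""
    labels = list(labels)
    counts = Counter(labels)
    equal_share = len(labels) / len(counts)
    return sorted(c for c, n in counts.items() if n < equal_share)


binary = ['sand'] * 90 + ['shale'] * 10
multi = ['sand'] * 6 + ['shale'] * 3 + ['lime'] * 1

print(imbalance_ratio(binary))
print(below_equal_share(multi))
```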

## Distribution shape

`best_distribution` tries to guess the shape of the distribution, choosing from the following `scipy.stats` distributions:

- `'norm'`
- `'cosine'`
- `'expon'`
- `'exponpow'`
- `'gamma'`
- `'gumbel_l'`
- `'gumbel_r'`
- `'powerlaw'`
- `'triang'`
- `'trapz'`
- `'uniform'`

The name is returned, along with the shape parameters (if any), location and scale.

In [None]:
rf.best_distribution(df['PHIND'])

In [None]:
sns.displot(df['PHIND'])
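
One plausible way to guess a distribution is to fit each candidate and keep the one with the smallest Kolmogorov-Smirnov statistic. The sketch below does that for a short list of candidates. It is an assumption about the general approach, not `redflag`'s actual method, and `guess_distribution` is a hypothetical name.

```python
import numpy as np
from scipy import stats


def guess_distribution(x, candidates=('norm', 'expon', 'uniform')):
    """Fit each candidate and return (name, fitted params) with the lowest KS statistic."""
    x = np.asarray(x, dtype=float)
    best = None
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(x)  # Shape parameters (if any), then loc and scale.
        ks = stats.kstest(x, name, args=params).statistic
        if best is None or ks < best[1]:
            best = (name, ks, params)
    return best[0], best[2]


rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=2000)

name, params = guess_distribution(sample)
print(name, params)
```

The returned tuple has the same layout as described above: any shape parameters first, followed by location and scale.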

## Self-correlation

If a feature is correlated with lagged (shifted) versions of itself, then the dataset may be ordered by that feature, or the records may not be independent. If several features are correlated with themselves in this way, then the records may not be independent.

In [None]:
rf.is_correlated(df['GR'])

Shuffling the data removes the correlation, but that does not mean the records are independent.

In [None]:
import numpy as np

gr = df['GR'].to_numpy(copy=True)
np.random.shuffle(gr)
rf.is_correlated(gr)
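
To see what 'correlated with a lagged version of itself' means, here is a minimal sketch that computes the correlation between a series and a shifted copy of it. The helper `lag_autocorrelation` is hypothetical and is not part of `redflag`, and the sine wave is synthetic stand-in data for an ordered log.

```python
import numpy as np


def lag_autocorrelation(x, lag=1):
    """Pearson correlation between x and a copy of x shifted by `lag` samples."""
    x = np.asarray(x, dtype=float)
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])


# A smooth, ordered signal: each sample is close to its neighbour.
signal = np.sin(np.linspace(0, 20, 500))

# The same values in random order.
rng = np.random.default_rng(0)
shuffled = rng.permutation(signal)

print(lag_autocorrelation(signal))
print(lag_autocorrelation(shuffled))
```

The ordered signal should give a value close to 1, while the shuffled copy should give a value near 0, even though both contain exactly the same samples.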