In [37]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 12 - A collection of outlier detection datasets

The datasets below come from the Outlier Detection DataSets (ODDS) library: http://odds.cs.stonybrook.edu/

## Glass Dataset

The original glass identification dataset from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Glass+Identification is a classification dataset. 
The study of the classification of glass types was motivated by criminological investigation. 
Glass left at the scene of a crime can be used as evidence if it is correctly identified. 
This dataset contains attributes of several glass types.

In [26]:
url1 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/Glass.csv'
url2 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/Glass_labels.csv'
data = pd.read_csv(url1, header=None)
labels = pd.read_csv(url2, header=None)
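
The same two `read_csv` calls recur for every dataset in this notebook. A minimal sketch of a helper that wraps the pattern, assuming the `<name>.csv` / `<name>_labels.csv` file naming seen in these cells; `dataset_urls` and `load_dataset` are hypothetical names introduced here, not part of any library.

```python
import pandas as pd

# Base URL copied from the loading cells in this notebook.
BASE_URL = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/'

def dataset_urls(name, base_url=BASE_URL):
    """Return (data_url, labels_url) for a file stem such as 'cardio'."""
    return base_url + name + '.csv', base_url + name + '_labels.csv'

def load_dataset(name, base_url=BASE_URL):
    """Download the features and labels as header-less DataFrames (needs network access)."""
    data_url, labels_url = dataset_urls(name, base_url)
    data = pd.read_csv(data_url, header=None)
    labels = pd.read_csv(labels_url, header=None)
    return data, labels

# The file stem is case-sensitive: the glass files are 'Glass.csv' and 'Glass_labels.csv'.
print(dataset_urls('Glass'))
```

With this in place, `data, labels = load_dataset('cardio')` would replace the four lines of a loading cell.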

The 9 features are:

    RI: refractive index
    Na: Sodium (unit of measurement: weight percent in the corresponding oxide, as are Mg through Fe below)
    Mg: Magnesium
    Al: Aluminum
    Si: Silicon
    K: Potassium
    Ca: Calcium
    Ba: Barium
    Fe: Iron
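
Because the CSV files are read with `header=None`, the columns are numbered rather than named. A small sketch of attaching the names listed above, assuming the CSV columns follow the same order as the list (this is not verified here). The `demo` frame is a random stand-in for the real data.

```python
import numpy as np
import pandas as pd

# Feature names in the order listed above (ASSUMED to match the CSV column order).
glass_features = ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe']

# Stand-in for the real data: 5 random rows and 9 unnamed columns,
# mimicking the shape of a frame read with header=None.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((5, 9)))

# Attach the names only if the column count matches.
if demo.shape[1] == len(glass_features):
    demo.columns = glass_features

print(demo.columns.tolist())
```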

## Cardio Dataset

The original Cardiotocography (Cardio) dataset from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Cardiotocography consists of measurements of fetal heart rate (FHR) and uterine contraction (UC) features on cardiotocograms classified by expert obstetricians. 
This is a classification dataset, where the classes are normal, suspect, and pathologic. For outlier detection, the normal class forms the inliers, while the pathologic (outlier) class is downsampled to 176 points. 
The suspect class is discarded.
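
The construction described above (keep one class as inliers, downsample another as outliers, discard the rest) can be sketched as follows. This is an illustration on toy data, not the procedure actually used to build the files. The function name `make_outlier_dataset`, the class codes 0/1/2, and the 0 = inlier, 1 = outlier label convention are all assumptions.

```python
import numpy as np

def make_outlier_dataset(X, y, inlier_classes, outlier_classes, n_outliers=None, seed=0):
    """Keep inlier classes, optionally downsample outlier classes, drop the rest.

    Returns (X_new, labels) with labels 0 for inliers and 1 for outliers.
    """
    rng = np.random.default_rng(seed)
    inlier_idx = np.flatnonzero(np.isin(y, inlier_classes))
    outlier_idx = np.flatnonzero(np.isin(y, outlier_classes))
    # Downsample the outlier class without replacement if requested.
    if n_outliers is not None and n_outliers < len(outlier_idx):
        outlier_idx = rng.choice(outlier_idx, size=n_outliers, replace=False)
    keep = np.concatenate([inlier_idx, outlier_idx])
    labels = np.concatenate([np.zeros(len(inlier_idx), dtype=int),
                             np.ones(len(outlier_idx), dtype=int)])
    return X[keep], labels

# Toy stand-in with three classes: 0 = normal, 1 = suspect, 2 = pathologic.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 4))
y_toy = np.repeat([0, 1, 2], 100)

# Normal class as inliers, pathologic class downsampled to 20, suspect class dropped.
X_new, labels_toy = make_outlier_dataset(X_toy, y_toy, [0], [2], n_outliers=20)
print(X_new.shape, labels_toy.sum())
```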

In [28]:
url1 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/cardio.csv'
url2 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/cardio_labels.csv'
data = pd.read_csv(url1, header=None)
labels = pd.read_csv(url2, header=None)

The 21 features are:

    LB - FHR baseline (beats per minute)
    AC - # of accelerations per second
    FM - # of fetal movements per second
    UC - # of uterine contractions per second
    DL - # of light decelerations per second
    DS - # of severe decelerations per second
    DP - # of prolonged decelerations per second
    ASTV - percentage of time with abnormal short term variability
    MSTV - mean value of short term variability
    ALTV - percentage of time with abnormal long term variability
    MLTV - mean value of long term variability
    Width - width of FHR histogram
    Min - minimum of FHR histogram
    Max - maximum of FHR histogram
    Nmax - # of histogram peaks
    Nzeros - # of histogram zeros
    Mode - histogram mode
    Mean - histogram mean
    Median - histogram median
    Variance - histogram variance
    Tendency - histogram tendency

## Letters Dataset

The original letter recognition dataset from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Letter+Recognition is a multi-class classification dataset. 
The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet, where letters of the alphabet are represented in 16 dimensions.
To get data suitable for outlier detection, we subsample data from 3 letters to form the normal class and randomly concatenate pairs of them so that their dimensionality doubles. 
To form the outlier class, we randomly select a few instances of letters that are not in the normal class and concatenate them with instances from the normal class. 
The concatenation process is performed in order to make the detection much more challenging as each outlier will also show some normal attribute values. In total, we have 1500 normal data points and 100 outliers (6.25% outliers) in 32 dimensions.
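
The concatenation step can be sketched on toy data as follows. The pool sizes, sampling with replacement, and the position of the foreign letter within an outlier (first half) are assumptions made for illustration; the text above does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for 16-dimensional letter vectors with integer features.
normal_letters = rng.integers(0, 16, size=(200, 16))  # instances of the 3 "normal" letters
other_letters = rng.integers(0, 16, size=(50, 16))    # instances of letters outside the normal class

def concat_random_pairs(left, right, n, rng):
    """Draw n rows from each pool (with replacement) and join them side by side."""
    i = rng.integers(0, len(left), size=n)
    j = rng.integers(0, len(right), size=n)
    return np.hstack([left[i], right[j]])

# Normal points: two normal-class letters side by side (32 dimensions).
normal = concat_random_pairs(normal_letters, normal_letters, 150, rng)
# Outliers: one foreign letter joined to one normal-class letter, so half of
# each outlier's attributes still look normal.
outliers = concat_random_pairs(other_letters, normal_letters, 10, rng)

X = np.vstack([normal, outliers])
y = np.r_[np.zeros(len(normal), dtype=int), np.ones(len(outliers), dtype=int)]
print(X.shape, y.mean())
```

The toy sizes (150 normal points, 10 outliers) keep the same 6.25% outlier ratio as the 1500 and 100 points quoted above, since 100 / 1600 = 0.0625.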

In [31]:
url1 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/letters.csv'
url2 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/letters_labels.csv'
data = pd.read_csv(url1, header=None)
labels = pd.read_csv(url2, header=None)

The 16 features of a single letter are (each data point concatenates two letters, hence 32 dimensions):

    x-box: horizontal position of box (integer)
    y-box: vertical position of box (integer)
    width: width of box (integer)
    high: height of box (integer)
    onpix: total # on pixels (integer)
    x-bar: mean x of on pixels in box (integer)
    y-bar: mean y of on pixels in box (integer)
    x2bar: mean x variance (integer)
    y2bar: mean y variance (integer)
    xybar: mean x y correlation (integer)
    x2ybr: mean of x * x * y (integer)
    xy2br: mean of x * y * y (integer)
    x-ege: mean edge count left to right (integer)
    xegvy: correlation of x-ege with y (integer)
    y-ege: mean edge count bottom to top (integer)
    yegvx: correlation of y-ege with x (integer)



## Musk Dataset

The original musk dataset from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Musk+%28Version+2%29 contains several musk and non-musk classes. 
The non-musk classes j146, j147, and 252 are combined to form the inliers, while the musk classes 213 and 211 are added as outliers without downsampling. Other classes are discarded. 

In [33]:
url1 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/musk.csv'
url2 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/musk_labels.csv'
data = pd.read_csv(url1, header=None)
labels = pd.read_csv(url2, header=None)

## Pima Indians Diabetes Dataset

The original Pima Indians diabetes dataset from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes is a binary classification dataset.
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. 
The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

In [35]:
url1 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/pima.csv'
url2 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/pima_labels.csv'
data = pd.read_csv(url1, header=None)
labels = pd.read_csv(url2, header=None)

The 8 features are:

    number of times pregnant
    plasma glucose concentration at 2 hours in an oral glucose tolerance test
    diastolic blood pressure (mm Hg)
    triceps skin fold thickness (mm)
    2-hour serum insulin (mu U/ml)
    body mass index (weight in kg/(height in m)^2)
    diabetes pedigree function
    age (years)

## Satellite Dataset

The original Statlog (Landsat Satellite) dataset from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Statlog+%28Landsat+Satellite%29 is a multi-class classification dataset.
The database consists of the multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood.
Here, the training and test data are combined. 
The three smallest classes (2, 4, and 5) are combined to form the outlier class, while all the other classes are combined to form the inlier class. 
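
Picking the rarest classes as outliers can be sketched with `value_counts`. The class counts below are made up for the example, chosen only so that the three rarest classes are 2, 4, and 5 as in the text; the 0 = inlier, 1 = outlier convention is an assumption.

```python
import pandas as pd

# Toy class labels with unequal class sizes (invented counts, not the real ones).
classes = pd.Series([1] * 50 + [2] * 5 + [3] * 40 + [4] * 8 + [5] * 10 + [7] * 30)

# Count the members of each class and pick the three rarest.
counts = classes.value_counts()
smallest = counts.nsmallest(3).index.tolist()

# 1 = outlier (member of a rare class), 0 = inlier.
is_outlier = classes.isin(smallest).astype(int)
print(sorted(smallest), is_outlier.sum())
```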

In [17]:
url1 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/satellite.csv'
url2 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/satellite_labels.csv'
data = pd.read_csv(url1, header=None)
labels = pd.read_csv(url2, header=None)

## Satellite Dataset 2

The original Statlog (Landsat Satellite) dataset from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Statlog+%28Landsat+Satellite%29 is a multi-class classification dataset. The database consists of the multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood.
Here, the training and test data are combined. Class 2 is downsampled to 71 outliers, while all the other classes are combined to form the inlier class. 

In [18]:
url1 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/satimage.csv'
url2 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/satimage_labels.csv'
data = pd.read_csv(url1, header=None)
labels = pd.read_csv(url2, header=None)

## Shuttle Dataset

The original Statlog (Shuttle) dataset from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Statlog+%28Shuttle%29 is a multi-class classification dataset with dimensionality 9. 
Here, the training and test data are combined. The five smallest classes (2, 3, 5, 6, and 7) are combined to form the outlier class, while class 1 forms the inlier class. Data for class 4 is discarded.
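
A relabeling of this kind, including the discarded class, can be written compactly with a dictionary and `Series.map`. The class numbers follow the text; the rows are made up, and the 0 = inlier, 1 = outlier convention is an assumption.

```python
import pandas as pd

# Toy class column (invented rows).
classes = pd.Series([1, 1, 4, 2, 1, 7, 4, 3, 1, 5, 6])

# Class 1 -> inlier (0); classes 2, 3, 5, 6, 7 -> outlier (1). Class 4 is absent
# from the mapping, so map() yields NaN for it and dropna() discards those rows.
mapping = {1: 0, 2: 1, 3: 1, 5: 1, 6: 1, 7: 1}
labels_toy = classes.map(mapping).dropna().astype(int)

print(labels_toy.tolist())
```

The index of `labels_toy` keeps the original row positions, so it can be used to select the matching feature rows.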

In [19]:
url1 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/shuttle.csv'
url2 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/shuttle_labels.csv'
data = pd.read_csv(url1, header=None)
labels = pd.read_csv(url2, header=None)

## Speech Dataset

The real-world speech dataset consists of 3686 segments of English speech spoken with different accents.
This dataset is provided by the Speech Processing Group at Brno University of Technology, Czech Republic.
The majority of the data corresponds to an American accent, and only 1.65% corresponds to one of seven other accents (these are referred to as outliers). 
The speech segments are represented by 400-dimensional so-called i-vectors, which are widely used state-of-the-art features for speaker and language recognition.
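
As a quick sanity check, the stated percentage can be turned into an approximate outlier count. The result is only approximate because 1.65% is itself a rounded figure.

```python
n_segments = 3686
outlier_fraction = 0.0165  # 1.65%, as stated above

# The product is about 60.8, so round to the nearest whole segment.
approx_outliers = round(n_segments * outlier_fraction)
print(approx_outliers)
```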

In [21]:
url1 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/speech.csv'
url2 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/speech_labels.csv'
data = pd.read_csv(url1, header=None)
labels = pd.read_csv(url2, header=None)

## Thyroid Dataset

The original thyroid disease (ann-thyroid) dataset from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Thyroid+Disease is a classification dataset. 
The dataset contains 3428 instances and has 6 features. 
The problem is to determine whether a patient referred to the clinic is hypothyroid. 

In [22]:
url1 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/thyroid.csv'
url2 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/thyroid_labels.csv'
data = pd.read_csv(url1, header=None)
labels = pd.read_csv(url2, header=None)

## Vowels Dataset

The original Japanese Vowels (Vowels) dataset from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Japanese+Vowels is a multivariate time series dataset, in which nine male speakers uttered the two Japanese vowels /ae/ successively. 
Here, one utterance by a speaker forms a time series whose length is in the range 7-29, and each point of a time series has 12 features (12 coefficients). 
This is a classification dataset whose task is to identify the speaker. For outlier detection, each frame in the training data is treated as an individual data point.
In this case, class (speaker) 1 is downsampled to 50 outliers. The inliers consist of classes 6, 7 and 8. Other classes are discarded.
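
Treating each frame as its own data point amounts to stacking the utterances and repeating each speaker label once per frame. A minimal sketch on toy utterances, where the lengths and speaker numbers are illustrative values within the ranges given above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy utterances: each is a (length, 12) array with a length between 7 and 29.
lengths = [7, 15, 29]
speakers = [1, 6, 7]
utterances = [rng.normal(size=(n, 12)) for n in lengths]

# Stack all frames into one matrix; each frame inherits the speaker of its utterance.
frames = np.vstack(utterances)
frame_speaker = np.repeat(speakers, lengths)

print(frames.shape, frame_speaker.shape)
```

After this step the temporal order within an utterance no longer matters to the detector: the data is an ordinary table with 12 columns.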

In [23]:
url1 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/vowels.csv'
url2 = 'https://raw.githubusercontent.com/um-perez-alvaro/Anomaly-Detection/master/vowels_labels.csv'
data = pd.read_csv(url1, header=None)
labels = pd.read_csv(url2, header=None)