Create a baseline model on the wine dataset using the SVC classifier!

In [58]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# load the dataset
X, y = load_wine(return_X_y=True)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# summarize the shape of the training dataset
print("Shape of dataset: ",X_train.shape, y_train.shape)

# fit the model
model = SVC()
model.fit(X_train, y_train)
# evaluate the model
y_pred = model.predict(X_test)
# evaluate predictions
accuracy = model.score(X_test,y_test)
print('Accuracy: %.3f' % accuracy)

Shape of dataset:  (124, 13) (124,)
Accuracy: 0.685


Outlier Detection based on Isolation Forests

Isolation Forest, or iForest for short, is a tree-based anomaly detection algorithm. It is based on modeling the normal data in such a way as to isolate anomalies that are both few in number and different in the feature space

In [61]:
#Outlier Detection based on Isolation Forests
from sklearn.ensemble import IsolationForest

# identify outliers in the wine training dataset based on IsolationForest class and an assumed contamination of 0.1
iso = IsolationForest(contamination=0.1)
y_out = iso.fit_predict(X_train)
# build a mask to select all rows that are not outliers (inlier=1, outlier=-1)
mask = y_out != -1
X_train_red, y_train_red = X_train[mask, :], y_train[mask]
# Inliers vs. Outliers
print("Inliers: ",X_train_red.shape[0],"Outliers",X_train.shape[0]-X_train_red.shape[0])
# fit the model
model = SVC()
model.fit(X_train_red, y_train_red)
# evaluate the model
y_pred = model.predict(X_test)
# evaluate predictions
accuracy = model.score(X_test,y_test)
print('Accuracy: %.3f' % accuracy)

Inliers:  111 Outliers 13
Accuracy: 0.704


Outlier Detection based on Minimum Covariance Determinant / Elliptic Envelope

If the input variables have a Gaussian distribution, then simple statistical methods like Elliptic Envelope can be used to detect outliers.

Create two functions to modularize your code from the last task:
1. identify_inliers(classifier, X_training, y_training)
2. fit_and_evaluate_model(X_inl, y_inl, X_tes, y_tes)

Create an outlier detection classifier based on the EllipticEnvelope class, contamination=0.1

Call identify_inliers with this classifier and the training data

Call fit_and_evaluate_model with the returned inliers and the test data


In [65]:
from sklearn.covariance import EllipticEnvelope

def identify_inliers(classifier, X_training, y_training):
    y_out = classifier.fit_predict(X_training)
    # build a mask to select all rows that are not outliers (inlier=1, outlier=-1)
    mask = y_out != -1
    X_train_red, y_train_red = X_training[mask, :], y_training[mask]
    # Inliers vs. Outliers
    print("Inliers: ",X_train_red.shape[0],"Outliers",X_train.shape[0]-X_train_red.shape[0])
    return X_train_red, y_train_red

def fit_and_evaluate_model(X_inl, y_inl, X_tes, y_tes):
    # fit the model
    model = SVC()
    model.fit(X_inl, y_inl)
    # evaluate the model
    y_pred = model.predict(X_tes)
    # evaluate predictions
    accuracy = model.score(X_tes,y_tes)
    print('Accuracy: %.3f' % accuracy)
    
ee = EllipticEnvelope(contamination=0.1)
X_inlier, y_inlier= identify_inliers(ee , X_train, y_train)
fit_and_evaluate_model(X_inlier, y_inlier, X_test, y_test)

Inliers:  111 Outliers 13
Accuracy: 0.685


Outlier Detection based on Local Outlier Factor

The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier detection. Each example is assigned a scoring of how isolated or how likely it is to be outliers based on the size of its local neighborhood. Those examples with the largest score are more likely to be outliers.

Create an outlier detection classifier based on the LocalOutlierFactor class

Call identify_inliers with this classifier and the training data

Call fit_and_evaluate_model with the returned inliers and the test data

In [66]:
# evaluate model performance with outliers removed using local outlier factor
from sklearn.neighbors import LocalOutlierFactor

#Create an outlier detection classifier based on the LocalOutlierFactor class
lof = LocalOutlierFactor()

#Call identify_inliers with this classifier and the training data
X_inlier, y_inlier= identify_inliers(ee , X_train, y_train)

#Call fit_and_evaluate_model with the returned inliers and the test data
fit_and_evaluate_model(X_inlier, y_inlier, X_test, y_test)

Inliers:  111 Outliers 13
Accuracy: 0.704


Outlier Detection based on One-Class SVM

When modeling one class, the algorithm captures the density of the majority class and classifies examples on the extremes of the density function as outliers. This modification of SVM is referred to as One-Class SVM.

Create an outlier detection classifier based on the OneClassSVM class, contamination=0.01

Call identify_inliers with this classifier and the training data

Call fit_and_evaluate_model with the returned inliers and the test data

In [67]:
# evaluate model performance with outliers removed using one class SVM
from sklearn.svm import OneClassSVM

#Create an outlier detection classifier based on the OneClassSVM class, contamination=0.01
ee = OneClassSVM(nu=0.01)

#Call identify_inliers with this classifier and the training data
X_inlier, y_inlier= identify_inliers(ee , X_train, y_train)

#Call fit_and_evaluate_model with the returned inliers and the test data
fit_and_evaluate_model(X_inlier, y_inlier, X_test, y_test)

Inliers:  123 Outliers 1
Accuracy: 0.704
