# Comparison
Scikit-learn provides three robust regression estimators: RANSAC, Theil-Sen, and HuberRegressor.
1. HuberRegressor should be faster than RANSAC and Theil-Sen unless the number of samples is very large, i.e. n_samples >> n_features. This is because RANSAC and Theil-Sen fit on smaller subsets of the data. However, both Theil-Sen and RANSAC are unlikely to be as robust as HuberRegressor with the default parameters.
2. RANSAC is faster than Theil-Sen and scales much better with the number of samples.
3. RANSAC will deal better with large outliers in the y direction (the most common situation).
4. Theil-Sen will cope better with medium-size outliers in the X direction, but this property disappears in high-dimensional settings.

When in doubt, use RANSAC.
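
To make the "fit on smaller subsets" idea concrete, here is a minimal pure-Python sketch of the RANSAC principle for a single feature. It is not scikit-learn's implementation: the helper names `fit_line` and `ransac_line`, the trial count, the residual threshold, and the toy data are all assumptions chosen for illustration.

```python
import random


def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ransac_line(points, n_trials=100, threshold=0.5, seed=0):
    """Illustrative RANSAC loop (hypothetical helper, not the sklearn API)."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_trials):
        # Fit a candidate model on a minimal random subset (two points).
        sample = rng.sample(points, 2)
        if sample[0][0] == sample[1][0]:
            continue  # identical x values cannot define a slope
        slope, intercept = fit_line(sample)
        # Points whose residual is within the threshold count as inliers.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on the largest consensus set found.
    return fit_line(best_inliers)


# Toy data (assumed): y = 2x + 1, with every fifth target replaced by 30.
points = []
for i in range(50):
    x = i / 10
    y = 30.0 if i % 5 == 0 else 2.0 * x + 1.0
    points.append((x, y))

ols_slope, ols_intercept = fit_line(points)
ransac_slope, ransac_intercept = ransac_line(points)
print('OLS    : slope = %.3f, intercept = %.3f' % (ols_slope, ols_intercept))
print('RANSAC : slope = %.3f, intercept = %.3f' % (ransac_slope, ransac_intercept))
```

Because each candidate line is fit on only two points, a single trial is cheap, and the large y outliers never enter the final fit as long as at least one trial draws two clean points. In practice `RANSACRegressor` handles the subset size, threshold, and stopping criteria for you.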


In [9]:
from matplotlib import pyplot as plt
import numpy as np

from sklearn.linear_model import LinearRegression, TheilSenRegressor, RANSACRegressor
from sklearn.linear_model import HuberRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

np.random.seed(42)

X = np.random.normal(size=4000)
y = np.sin(X)
# Make sure that X is 2D
X = X[:, np.newaxis]

X_test = np.random.normal(size=200)
y_test = np.sin(X_test)
X_test = X_test[:, np.newaxis]

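# Moderate outliers: overwrite every third target with 3
y_errors = y.copy()
y_errors[::3] = 3

# Moderate outliers: overwrite every third input with 3
X_errors = X.copy()
X_errors[::3] = 3

# Large outliers: overwrite every third target with 10
y_errors_large = y.copy()
y_errors_large[::3] = 10

# Large outliers: overwrite every third input with 10
X_errors_large = X.copy()
X_errors_large[::3] = 10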

estimators = [('OLS', LinearRegression()),
              ('Theil-Sen', TheilSenRegressor(random_state=42)),
              ('RANSAC', RANSACRegressor(random_state=42)),
              ('HuberRegressor', HuberRegressor())
             ]
linestyle = {'OLS': '-',
             'Theil-Sen': '-.',
             'RANSAC': '--',
             'HuberRegressor': '--'
            }
lw = 3

x_plot = np.linspace(X.min(), X.max())
for title, this_X, this_y in [
        ('Modeling Errors Only', X, y),
        ('Corrupt X, Small Deviants', X_errors, y),
        ('Corrupt y, Small Deviants', X, y_errors),
        ('Corrupt X, Large Deviants', X_errors_large, y),
        ('Corrupt y, Large Deviants', X, y_errors_large)]:
    for name, estimator in estimators:
        model = make_pipeline(PolynomialFeatures(3), estimator)
        model.fit(this_X, this_y)
        # mean_squared_error expects (y_true, y_pred)
        mse = mean_squared_error(y_test, model.predict(X_test))
        y_plot = model.predict(x_plot[:, np.newaxis])
        print('%s for %s : error = %.3f' % (name, title, mse))
    print()

OLS for Modeling Errors Only : error = 0.002
Theil-Sen for Modeling Errors Only : error = 0.018
RANSAC for Modeling Errors Only : error = 0.003
HuberRegressor for Modeling Errors Only : error = 0.006

OLS for Corrupt X, Small Deviants : error = 0.001
Theil-Sen for Corrupt X, Small Deviants : error = 0.002
RANSAC for Corrupt X, Small Deviants : error = 0.003
HuberRegressor for Corrupt X, Small Deviants : error = 0.002

OLS for Corrupt y, Small Deviants : error = 1.065
Theil-Sen for Corrupt y, Small Deviants : error = 0.061
RANSAC for Corrupt y, Small Deviants : error = 0.001
HuberRegressor for Corrupt y, Small Deviants : error = 0.005

OLS for Corrupt X, Large Deviants : error = 0.068
Theil-Sen for Corrupt X, Large Deviants : error = 0.106
RANSAC for Corrupt X, Large Deviants : error = 0.148
HuberRegressor for Corrupt X, Large Deviants : error = 0.084

OLS for Corrupt y, Large Deviants : error = 11.264
Theil-Sen for Corrupt y, Large Deviants : error = 0.250
RANSAC for Corrupt y, Large D