### Most popular methods for outlier detection

- Z-Score or Extreme Value Analysis (parametric)
- Probabilistic and Statistical Modeling (parametric)
- Linear Regression Models (PCA, LMS)
- Proximity Based Models (non-parametric)
- Information Theory Models
- High Dimensional Outlier Detection Methods (high dimensional sparse data)

### Z-Score

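* Compute the mean and standard deviation of the data.
* Standardize the data (compute the z-score of each point).
* Points whose absolute z-score exceeds N are outliers (for example N = 3; for normally distributed data about 0.13% of points lie beyond 3 standard deviations in each tail, 0.27% in both tails combined).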

An alternative implementation: https://github.com/junmoan/outlier-detection/blob/master/Outlier_Detection_z_score.ipynb

In [1]:
import numpy as np


def outliers_z_score(data, threshold):
    """Print and plot the points of a pandas Series whose |z-score| exceeds threshold."""
    x = data.values
    mean_x = np.mean(x)
    print('mean:', mean_x)
    std_x = np.std(x)  # population standard deviation (ddof=0)
    print('std:', std_x)
    print(threshold, '* std:', std_x * threshold)
    # Vectorized z-scores: distance from the mean in units of standard deviation
    z_scores = (x - mean_x) / std_x

    outliers = x[np.abs(z_scores) > threshold]

    print('outliers:', outliers)
    # Density estimate, with all points and then the outliers marked on the x-axis
    ax = data.plot.kde()
    ax.scatter(x, np.zeros(x.shape[0]))
    ax.scatter(outliers, np.zeros(outliers.shape[0]))

    # Faint guide lines at the mean and at 1 to 4 standard deviations on each side
    for i in range(5):
        left, right = mean_x - i * std_x, mean_x + i * std_x
        ax.plot([right, right], [-.01, 0.1], color='red', alpha=0.1)
        ax.plot([left, left], [-.01, 0.1], color='red', alpha=0.1)

    # Stronger lines at the chosen outlier threshold
    left, right = mean_x - threshold * std_x, mean_x + threshold * std_x
    print('left, right thresholds:', left, right)
    ax.plot([right, right], [-.01, 0.1], color='red', alpha=0.5)
    ax.plot([left, left], [-.01, 0.1], color='red', alpha=0.5)

### IQR method

* Draw a box plot (box-and-whisker plot).
* The box spans the 25th and 75th percentiles, i.e. the first and third quartiles (Q1 and Q3).
* The line inside the box is the median (the middle of the data, with the same number of observations on either side of it).
* Compute the width of the box, the IQR (interquartile range): IQR = Q3 - Q1.
* Multiply the IQR by 1.5.
* Any point more than 1.5 * IQR below Q1 or above Q3 is an outlier.
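
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `outliers_iqr`, the parameter `k`, and the sample values are assumptions made for the example.

```python
import numpy as np


def outliers_iqr(values, k=1.5):
    """Return (outliers, lower_fence, upper_fence) using the k * IQR rule."""
    x = np.asarray(values, dtype=float)
    # Q1 and Q3 are the edges of the box in a box plot
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    # Fences lie k * IQR below Q1 and above Q3
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (x < lower) | (x > upper)
    return x[mask], lower, upper


# Hypothetical sample: nine values close together and one far away
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers, lower, upper = outliers_iqr(sample)
print('fences:', lower, upper)
print('outliers:', outliers)
```

The multiplier 1.5 is the conventional default; a larger `k` flags only more extreme points.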

### DBSCAN: Density Based Spatial Clustering of Applications with Noise

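In [2]:
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN


def dbscan(data, eps, minPts):
    """Cluster a two-column DataFrame with DBSCAN and plot the points colored by label."""
    x = data.values

    model = DBSCAN(eps=eps, min_samples=minPts)
    model.fit(x)

    # Label -1 marks noise points, i.e. the outliers
    labels = model.labels_
    print('labels: ', set(labels))
    plt.scatter(x[:, 0], x[:, 1], c=labels)
    plt.axis('equal')
    plt.show()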
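
scikit-learn's DBSCAN assigns the label -1 to points that do not belong to any cluster, so the outliers can be read directly from the labels. Below is a minimal sketch without plotting; the helper name `dbscan_outliers`, the synthetic data, and the parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def dbscan_outliers(x, eps, min_pts):
    """Return (noise_points, labels); DBSCAN marks noise with the label -1."""
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(x)
    return x[labels == -1], labels


# Hypothetical data: one dense blob plus a single distant point
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
far_point = np.array([[10.0, 10.0]])
x = np.vstack([cluster, far_point])

noise, labels = dbscan_outliers(x, eps=0.5, min_pts=5)
print('number of noise points:', len(noise))
print('label of the distant point:', labels[-1])
```

The result depends strongly on `eps` and `min_pts`: too small an `eps` turns ordinary points into noise, too large an `eps` absorbs outliers into clusters.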

### Isolation Forests

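In [4]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import IsolationForest


def iforest(data, trees, samples):
    """Fit an Isolation Forest to a two-column DataFrame and plot its anomaly scores."""
    x = data.values

    model = IsolationForest(n_estimators=trees, max_samples=samples)
    model.fit(x)

    # decision_function is negative for outliers and positive for inliers,
    # so keep the sign instead of taking the absolute value
    scores = model.decision_function(x)
    plt.scatter(x[:, 0], x[:, 1], c=scores)

    # Score a regular grid to draw the decision surface behind the points
    grid = np.linspace(x.min(), x.max(), 100)
    xx, yy = np.meshgrid(grid, grid)
    Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r, alpha=0.2)

    plt.axis('equal')
    plt.show()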

### One-Class SVMs
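
A One-Class SVM learns a boundary around the bulk of the training data and flags points outside it. The sketch below uses scikit-learn's `OneClassSVM`; the helper name `ocsvm`, the synthetic data, and the values of `nu` and `gamma` are assumptions for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.svm import OneClassSVM


def ocsvm(x, nu=0.05, gamma='scale'):
    """Fit an RBF One-Class SVM; nu upper-bounds the fraction of training outliers."""
    model = OneClassSVM(kernel='rbf', nu=nu, gamma=gamma)
    model.fit(x)
    return model


# Hypothetical training data: a single Gaussian blob around the origin
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
model = ocsvm(train)

# Score one central point and one point far from the training data
new_points = np.array([[0.0, 0.0], [8.0, 8.0]])
pred = model.predict(new_points)               # +1 = inlier, -1 = outlier
scores = model.decision_function(new_points)   # negative = outside the boundary
print('predictions:', pred)
print('scores:', scores)
```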

### LOF: Local Outlier Factor

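In [3]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def lof(data, n_neighbors):
    """Fit LOF to a two-column DataFrame and plot its outlier scores."""
    x = data.values

    # novelty=True exposes the public decision_function for scoring new points
    model = LocalOutlierFactor(n_neighbors=n_neighbors, novelty=True)
    model.fit(x)

    # Scores of the training points: the lower the value, the more abnormal
    scores = model.negative_outlier_factor_
    plt.scatter(x[:, 0], x[:, 1], c=scores)

    # Score a regular grid to draw the decision surface behind the points
    grid = np.linspace(x.min(), x.max(), 100)
    xx, yy = np.meshgrid(grid, grid)
    Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r, alpha=0.2)

    plt.axis('equal')
    plt.show()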
    

### The ensemble of the above

### Density-based algorithm

### Median Absolute Deviation (MADe) Method
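
The MADe rule is usually stated in terms of the median absolute deviation: take the median of the data, compute MAD as the median of the absolute deviations from it, scale it by 1.4826 so that it is comparable to a standard deviation for normally distributed data, and flag points more than k scaled MADs from the median. The sketch below assumes this formulation; the function name `outliers_made`, the default `k = 3`, and the sample values are illustrative choices.

```python
import numpy as np


def outliers_made(values, k=3.0):
    """Return (outliers, lower, upper) for the rule median +/- k * MADe."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    # MAD: median of the absolute deviations from the median
    mad = np.median(np.abs(x - median))
    # Scale factor that makes MAD consistent with the standard deviation
    # under a normal distribution
    made = 1.4826 * mad
    lower, upper = median - k * made, median + k * made
    # Note: if mad is 0, every point different from the median is flagged
    mask = (x < lower) | (x > upper)
    return x[mask], lower, upper


# Hypothetical sample: nine values close together and one far away
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers, lower, upper = outliers_made(sample)
print('bounds:', lower, upper)
print('outliers:', outliers)
```

Because both the center and the spread are medians, a single extreme value barely moves the bounds, unlike the mean and standard deviation used by the z-score.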