In [None]:
%pip install scikit-learn

# Imputer
1. Simple Imputer

2. KNN

3. Iterative Imputer (Regression)


## Simple Imputer

Replace missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value.

In [4]:
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
train = [[7, 2, 3], [4, np.nan, 6], [10, 5, 9]]

imputer.fit(train)

In [5]:
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imputer.transform(X))

[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
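The other strategies named above go through the same class. A minimal sketch, assuming a small made-up array `X_demo` (not taken from the cells above), compares the `median`, `most_frequent`, and `constant` fills:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical demo data: one missing value in each column.
X_demo = np.array([[1.0, 2.0],
                   [np.nan, 2.0],
                   [3.0, np.nan],
                   [8.0, 5.0]])

# Median of each column, ignoring NaN: column 0 -> 3.0, column 1 -> 2.0
median_filled = SimpleImputer(strategy="median").fit_transform(X_demo)

# Most frequent value of each column: column 1 -> 2.0 (appears twice)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X_demo)

# A fixed constant chosen by the caller (0 here is an arbitrary choice)
const_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X_demo)

print(median_filled)
print(mode_filled)
print(const_filled)
```

The median is usually the safer default when a column contains extreme values, since a single large entry shifts the mean but not the median.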


## KNN

Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.

In [10]:
import numpy as np
from sklearn.impute import KNNImputer


X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)

In [9]:
imputer.fit_transform(X)

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
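To see which neighbors are picked, it can help to look at the distances directly. The sketch below assumes the default `nan_euclidean` metric and uses `nan_euclidean_distances` from `sklearn.metrics.pairwise`, which sums squared differences over the coordinates both rows have and rescales by (total features / shared features). The variable names are illustrative only.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X_knn = np.array([[1, 2, np.nan],
                  [3, 4, 3],
                  [np.nan, 6, 5],
                  [8, 8, 7]], dtype=float)

# Pairwise distances that skip missing coordinates.
D = nan_euclidean_distances(X_knn)

# Row 0 vs row 1 share columns 0 and 1: sqrt(3/2 * (2**2 + 2**2)) = sqrt(12)
print(D[0, 1])

# Candidate donors for row 0, closest first (index 0 is the row itself).
print(np.argsort(D[0]))

# With n_neighbors=2 the donors for row 0 are rows 1 and 2,
# so the missing third value is the mean of 3 and 5.
imputed = KNNImputer(n_neighbors=2).fit_transform(X_knn)
print(imputed[0, 2])  # 4.0
```

A larger `n_neighbors` averages over more donors, which smooths the imputed values but pulls them toward the column mean.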

## Iterative Imputer

Multivariate imputer that estimates each feature from all the others.

A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion.

In [11]:
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=0)

imputer.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

IterativeImputer(random_state=0)

X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]

imputer.transform(X)

array([[ 6.95847623,  2.        ,  3.        ],
       [ 4.        ,  2.6000004 ,  6.        ],
       [10.        ,  4.99999933,  9.        ]])
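The round-robin idea can be sketched by hand. This is a simplified illustration, not the `IterativeImputer` implementation: it assumes plain `LinearRegression` as the per-feature model (the library defaults to a different estimator and adds stopping criteria), and the helper `round_robin_impute` and its demo data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def round_robin_impute(X, n_iter=10):
    """Fill NaNs by regressing each incomplete column on the others, in turn."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initial guess: replace every NaN with its column mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue  # nothing to impute in this column
            others = np.delete(filled, j, axis=1)
            # Fit on rows where column j was observed, predict the rest.
            model = LinearRegression().fit(others[~rows], filled[~rows, j])
            filled[rows, j] = model.predict(others[rows])
    return filled


# Hypothetical data where the second column is twice the first.
X_rr = [[1, 2], [2, 4], [3, np.nan], [4, 8], [np.nan, 10]]
result = round_robin_impute(X_rr)
print(result)
```

Because each pass reuses the values filled in by the previous one, the estimates for different columns refine each other over the iterations.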

# Outlier Detection

## Isolation Forest

The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

In [13]:
from sklearn.ensemble import IsolationForest


X = [[-1.1], [0.3], [0.5], [100]]
clf = IsolationForest(random_state=0).fit(X)
clf.predict([[0.1], [0], [90]])

array([ 1,  1, -1])
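Points that can be isolated with few random splits are scored as more anomalous. A short sketch of how the scores relate to the labels, reusing the data from the cell above (the names `forest`, `scores`, and `labels` are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X_train = [[-1.1], [0.3], [0.5], [100]]
X_new = [[0.1], [0], [90]]

forest = IsolationForest(random_state=0).fit(X_train)

# decision_function: negative values are flagged as outliers,
# non-negative values as inliers.
scores = forest.decision_function(X_new)
labels = forest.predict(X_new)

for point, score, label in zip(X_new, scores, labels):
    print(point, round(float(score), 3), label)
```

The threshold between the two labels is controlled by the `contamination` parameter, so the raw scores are often more informative than the hard -1/1 labels.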

## EllipticEnvelope

An object for detecting outliers in a Gaussian distributed dataset.

In [14]:
import numpy as np
from sklearn.covariance import EllipticEnvelope

true_cov = np.array([[.8, .3],
                      [.3, .4]])
X = np.random.RandomState(0).multivariate_normal(mean=[0, 0],
                                                  cov=true_cov,
                                                  size=500)

cov = EllipticEnvelope(random_state=0).fit(X)

# predict returns 1 for an inlier and -1 for an outlier
cov.predict([[0, 0],
              [3, 3]])

array([ 1, -1])

## OneClassSVM

Unsupervised Outlier Detection.

Estimate the support of a high-dimensional distribution.

The implementation is based on libsvm.

In [15]:
from sklearn.svm import OneClassSVM

X = [[0], [0.44], [0.45], [0.46], [1]]
clf = OneClassSVM(gamma='auto').fit(X)
clf.predict(X)

array([-1,  1,  1,  1, -1], dtype=int64)

## LocalOutlierFactor

Unsupervised Outlier Detection using the Local Outlier Factor (LOF).

The anomaly score of each sample is called the Local Outlier Factor. It measures the local deviation of the density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood. More precisely, locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers.

In [17]:
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = [[-1.1], [0.2], [101.1], [0.3]]
clf = LocalOutlierFactor(n_neighbors=2)
clf.fit_predict(X)

array([ 1,  1, -1,  1])
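The score behind each label is available after fitting. A brief sketch, reusing the data from the cell above, reads the `negative_outlier_factor_` attribute, which stores the LOF with its sign flipped so that lower means more abnormal:

```python
from sklearn.neighbors import LocalOutlierFactor

X_lof = [[-1.1], [0.2], [101.1], [0.3]]

lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X_lof)

# Inliers tend to sit near -1; strongly negative values indicate
# a sample whose local density is far below that of its neighbors.
scores = lof.negative_outlier_factor_

for point, score, label in zip(X_lof, scores, labels):
    print(point, round(float(score), 2), label)
```

Since the score is relative to the neighborhood, the choice of `n_neighbors` changes which points look isolated; it is worth trying a few values on real data.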

# Order Encoding

1. Ordinal Encoder
2. One-hot encoding
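A minimal sketch of both encoders, assuming made-up `sizes` and `colors` columns: `OrdinalEncoder` suits categories with a natural order, while `OneHotEncoder` suits categories without one.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordered categories: pass the order explicitly, otherwise it is alphabetical.
sizes = [["small"], ["large"], ["medium"], ["small"]]
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = ordinal.fit_transform(sizes)
print(codes.ravel())  # [0. 2. 1. 0.]

# Unordered categories: one indicator column per category.
colors = [["red"], ["green"], ["blue"], ["green"]]
onehot = OneHotEncoder(handle_unknown="ignore")
matrix = onehot.fit_transform(colors).toarray()  # result is sparse by default
print(onehot.categories_[0])  # ['blue' 'green' 'red']
print(matrix)
```

With `handle_unknown="ignore"`, a category unseen during `fit` is encoded as an all-zero row at transform time instead of raising an error.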

# Scalers

1. Min-Max
2. Standard (less affected by outliers than Min-Max, but still sensitive to them)
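A small sketch comparing the two scalers on a made-up column with one extreme value (the array `X_scale` is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Four ordinary values and one outlier.
X_scale = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-Max: (x - min) / (max - min), so the outlier defines the range.
minmax = MinMaxScaler().fit_transform(X_scale)

# Standard: (x - mean) / std, so the outlier inflates both statistics.
standard = StandardScaler().fit_transform(X_scale)

print(minmax.ravel())
print(standard.ravel())
```

In both outputs the four ordinary values end up squeezed close together, which is why neither scaler should be treated as robust to outliers.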

# Custom Transform

Function transformer
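`FunctionTransformer` wraps an ordinary function so it can be used like any other transformer. A minimal sketch, assuming a log transform is the custom step wanted (the choice of `np.log1p` and the demo array are illustrative):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p(x) = log(1 + x) handles zeros; expm1 is its exact inverse.
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)

X_fn = np.array([[0.0, 1.0], [9.0, 99.0]])
X_log = log_transformer.fit_transform(X_fn)
X_back = log_transformer.inverse_transform(X_log)

print(X_log)
print(X_back)
```

Supplying `inverse_func` is optional, but it lets the transformed values be mapped back to the original scale later.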

# Make Column Selector

# Pipeline
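A pipeline chains the steps above so they are fitted and applied together. A minimal sketch, assuming mean imputation followed by standard scaling (the step names `impute` and `scale` are arbitrary labels):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill NaN with column means
    ("scale", StandardScaler()),                 # then zero mean, unit variance
])

X_pipe = np.array([[7.0, 2.0], [4.0, np.nan], [10.0, 5.0]])
X_out = pipe.fit_transform(X_pipe)
print(X_out)
```

Calling `pipe.transform` on new data reuses the means and scales learned during `fit`, which keeps statistics from the test set out of the preprocessing.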

# Melt, Pivot, Unpivot

# Table visualization

# Time Series

# Feature Extraction

# Dimensionality reduction

# Random Projection

# Plotly