# Sklearn and outlier detection
- For a single (univariate) variable, an extremely high or low value can be used as a signal that a point is an outlier.
- A multivariate outlier can be characterized by an unusual combination of values.
    + We can use dimensionality reduction techniques, where the new features are linear or non-linear combinations of the original features.
    + We can then draw a 2D or 3D plot and look for isolated points or clusters.
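
The univariate case can be sketched with a simple interquartile-range (IQR) rule. This is only an illustration: the helper name `iqr_outliers`, the multiplier `k=1.5`, and the sample values are assumptions, not part of the original notebook.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical measurements with one extremely high value.
data = np.array([10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 55.0])
mask = iqr_outliers(data)
print(data[mask])  # only the extreme value 55.0 is flagged
```

A rule like this looks at one variable at a time, so it cannot catch a point whose individual values are ordinary but whose combination is unusual.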

# Some other options in sklearn
## Distribution based
- Fit a distribution to the data and flag the points that are unlikely under it as outliers.
    + `EllipticEnvelope` (in `sklearn.covariance`) fits a multivariate Gaussian.
        * You need to set the `contamination` parameter (the proportion of outliers present in your data set).
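
To get a feel for `contamination`, the sketch below fits `EllipticEnvelope` on synthetic Gaussian data with two different settings and reports the fraction of points labeled `-1` (outlier). The data, the seed, and the name `flagged_fraction` are assumptions made for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic 2D Gaussian data (illustrative only).
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))

flagged_fraction = {}
for contamination in (0.01, 0.1):
    est = EllipticEnvelope(contamination=contamination, random_state=0)
    labels = est.fit_predict(X)  # 1 = inlier, -1 = outlier
    flagged_fraction[contamination] = np.mean(labels == -1)

print(flagged_fraction)
```

The fraction flagged on the training data tracks `contamination` closely, because the decision threshold is chosen from that proportion. The parameter therefore decides how many points are called outliers, not whether any truly exist.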
## Novelty detection based

- One-class SVM or isolation forest. These decide whether a data point is a novelty or not.

# Let's see `EllipticEnvelope` in practice

In [None]:
from sklearn.datasets import make_blobs
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns



In [None]:
blob = make_blobs(n_samples=100, n_features=2, centers=1, cluster_std=1.0)

In [None]:
plt.scatter(blob[0][:,0], blob[0][:,1])

In [None]:
from sklearn.covariance import EllipticEnvelope

In [None]:
cov_est= EllipticEnvelope(contamination=.1)

In [None]:
cov_est.fit(blob[0])

In [None]:
predict = cov_est.predict(blob[0])

In [None]:
predict

In [None]:
detection_df = pd.DataFrame(blob[0], columns = ['x', 'y'])
detection_df.head()

In [None]:
detection_df['is_outlier'] =predict
detection_df.head()

In [None]:
sns.lmplot(x="x", y="y", fit_reg=False, hue="is_outlier", data=detection_df)

In [None]:
from sklearn.datasets import fetch_california_housing
house = fetch_california_housing()

In [None]:
cali_house_df = pd.DataFrame(house.data, columns=house.feature_names)
# The target is the median house value, in units of $100,000.
cali_house_df['avg_house_val'] = house.target
cali_house_df.head()

In [None]:
from sklearn.preprocessing import RobustScaler, StandardScaler

In [None]:
x = cali_house_df.drop('avg_house_val', axis=1)


In [None]:
#sc =RobustScaler()
sc= StandardScaler()
sc.fit(x)

x_rsc = sc.transform(x)


In [None]:
from sklearn.decomposition import PCA 


In [None]:
pca = PCA(n_components=2)
pca.fit(x_rsc)

In [None]:
x_rsc.shape

In [None]:
# Project the data into pca basis
X_pca = pca.transform(x_rsc)
X_pca.shape

In [None]:
pca.explained_variance_ratio_

In [None]:
cov_est= EllipticEnvelope(contamination=.0001, assume_centered=True)

In [None]:
cov_est.fit(X_pca)

In [None]:
predict = cov_est.predict(X_pca)

In [None]:
cali_pca_df= pd.DataFrame(X_pca, columns=['PCA1', 'PCA2'])
cali_pca_df['is_outlier'] = predict
cali_pca_df.head()

In [None]:
sns.lmplot(x="PCA1", y="PCA2", fit_reg=False, hue="is_outlier", data=cali_pca_df)

Elliptic envelope is a parametric method.
- It fits a multivariate Gaussian.
- This is a strong assumption, and it may not hold for real data.
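
To make the Gaussian assumption concrete, here is a rough sketch of the underlying idea: score each point by its squared Mahalanobis distance from the fitted Gaussian and flag the largest distances. It uses the plain sample mean and covariance, whereas `EllipticEnvelope` uses a robust covariance estimate, so treat it as a simplification. The data, the planted point, and the 99th-percentile cutoff are assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
# Correlated 2D Gaussian data, plus one point with an unusual combination of values.
X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 1.2], [1.2, 1.0]], size=300)
X = np.vstack([X, [[6.0, -4.0]]])

# Fit the Gaussian: sample mean and covariance (not robust, unlike EllipticEnvelope).
mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

# Squared Mahalanobis distance of every point from the fitted Gaussian.
diff = X - mean
d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

# Flag the points whose distance exceeds the 99th percentile (assumed cutoff).
threshold = np.percentile(d2, 99)
outliers = np.where(d2 > threshold)[0]
print(outliers)
```

If the data are far from elliptical (for example, several clusters or a curved shape), these distances stop being a meaningful measure of how unusual a point is.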

One-class SVM is often a more flexible choice, as it learns the boundary from the data itself rather than assuming a particular distribution.

https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html

In [None]:
from sklearn.svm import OneClassSVM

In [None]:
pca= PCA(n_components=.95)

In [None]:
X_pca = pca.fit_transform(x_rsc)

In [None]:
X_pca.shape

In [None]:
outlier_frac= 0.0001

In [None]:
nu = 0.95*outlier_frac + 0.05

In [None]:
# `degree` is only used by the 'poly' kernel, so it is omitted for 'rbf'.
svm_detection = OneClassSVM(kernel='rbf', gamma=1.0/x_rsc.shape[0], nu=nu)

In [None]:
predict = svm_detection.fit_predict(X_pca)

In [None]:
cali_pca_df= pd.DataFrame(X_pca, columns=['PCA'+str(i) for i in range(1, X_pca.shape[1] +1)])
cali_pca_df['is_outlier'] = predict
cali_pca_df.head()

In [None]:
# Plot PCA1 against each of the remaining components.
for i in range(2, X_pca.shape[1] + 1):
    sns.lmplot(x="PCA1", y="PCA"+str(i), fit_reg=False, hue="is_outlier", data=cali_pca_df)

Other methods
- Isolation Forest
https://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html
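
As a minimal sketch of how an isolation forest could be used in the same way as the estimators above, the example below fits `IsolationForest` on synthetic data with two planted far-away points. The data, the seed, and `contamination=0.02` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 200 ordinary points around the origin, plus two hypothetical far-away points.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_points = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([inliers, far_points])

iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
labels = iso.fit_predict(X)      # 1 = inlier, -1 = outlier
scores = iso.score_samples(X)    # lower score = more anomalous

print(labels[-2:])               # labels of the two planted points
print(np.sum(labels == -1))      # total number of points flagged
```

Like `EllipticEnvelope`, it exposes a `contamination` parameter, but it makes no Gaussian assumption about the data.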

# Validating the model

When we build a machine learning or data system, we need a way to measure how well the system is performing.
The type of validation metric depends on the task:

- Classification
    + Binary
    + Multi-class
- Regression
- Clustering

Check the GitHub page of Ben Hamner, co-founder and CTO of Kaggle:

https://github.com/benhamner

# Binary classification

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()

In [None]:
data.target_names

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.3, random_state=12)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# A higher max_iter helps the default solver converge on these unscaled features.
lr = LogisticRegression(max_iter=10000)

In [None]:
lr.fit(X_train, y_train)

In [None]:
y_test_pred = lr.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
print(lr.score(X_test, y_test))
print(accuracy_score(y_test, y_test_pred))


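Is this the whole story? Accuracy alone does not tell us which kinds of errors the model makes, so let's look at the confusion matrix.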
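
A small made-up example shows why accuracy can mislead. The labels below are hypothetical (they are not the breast cancer data): 95 negatives, 5 positives, and a "model" that always predicts the majority class.

```python
# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of actual positives that were predicted positive.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(accuracy)  # 0.95
print(recall)    # 0.0
```

The accuracy looks high even though not a single positive case is found.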

In [None]:
cm = confusion_matrix(y_test, y_test_pred)

In [None]:
cm

In [None]:
ax= sns.heatmap(cm, annot=True, cmap=plt.cm.Blues)

ax.set_xlabel('Predicted class')
ax.set_ylabel('True class')

precision = $\frac{TP}{TP+FP}$: among the predicted positives, how many are actually positive.

recall = $\frac{TP}{TP+FN}$: among the actual positives, how many are predicted positive. Also called true positive rate or sensitivity.
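
These formulas can be worked through by hand. The counts below are hypothetical, and F1 is computed as the harmonic mean of precision and recall.

```python
# Hypothetical confusion-matrix counts.
tn, fp, fn, tp = 50, 10, 5, 35

precision = tp / (tp + fp)   # 35 / 45
recall = tp / (tp + fn)      # 35 / 40
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4))   # 0.7778
print(round(recall, 4))      # 0.875
print(round(f1, 4))          # 0.8235
```

For a binary problem, the same four counts can be read from scikit-learn with `tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()`.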

In [None]:
precision_score(y_test, y_test_pred)

In [None]:
recall_score(y_test, y_test_pred)

In [None]:
f1_score(y_test, y_test_pred)

# Multi class classification

Multi-class classification is used when we want to classify data into more than two classes.
- Classifying images into different categories.
    + https://www.cs.toronto.edu/~kriz/cifar.html
- Sentiment classification of tweets, reviews, etc.
In [None]:
iris_df = sns.load_dataset('iris')
iris_df.head()
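
As a rough sketch of how the binary metrics carry over to three classes, the example below trains a classifier on iris and computes a 3 x 3 confusion matrix and a macro-averaged F1 score. It is not part of the original notebook: it assumes scikit-learn's bundled `load_iris` (rather than the seaborn copy loaded above), a logistic regression model, and the same split settings used earlier.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=12, stratify=iris.target)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
# Macro averaging: compute F1 per class, then take the unweighted mean.
macro_f1 = f1_score(y_test, y_pred, average="macro")

print(cm)
print(macro_f1)
```

With more than two classes, precision, recall, and F1 need an `average` setting such as `"macro"`, `"micro"`, or `"weighted"`, because there is no longer a single positive class.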

We will start with regression measures next time.



<center> <font size ="6">Thank you </font> </center>
 