
outlier score highly correlated with overall distance to the point of origin #64

Closed
flycloudking opened this issue Apr 3, 2019 · 10 comments

@flycloudking

I calculated the distance of each data point to the origin using `np.linalg.norm(x)`, where x is a single multivariate sample, then normalized all these values to 0-1; I call this 'global_score'. When I compared the global score to scores from different methods, it turned out to be highly correlated (0.99) with PCA, autoencoder, CBLOF, and KNN. So it seems all these methods are just computing the overall distance of each sample from the origin, rather than detecting anomalies relative to multiple clusters.
I was very troubled by this fact and hope you can confirm whether this is true, and if it is, what the reason is.

Thanks
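[Editor's note: the global_score computation described above can be sketched in a few lines; the random data here is only a stand-in for the actual samples.]

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))  # stand-in for the real data

# distance of each sample to the origin
dist = np.linalg.norm(X, axis=1)

# min-max normalize to [0, 1] -> the "global_score"
global_score = (dist - dist.min()) / (dist.max() - dist.min())
```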

@yzhao062
Owner

yzhao062 commented Apr 3, 2019

could you share the code and how to reproduce it?
I guess it is not what you think, as the autoencoder should not even behave that way...
Thanks, Yue

@flycloudking
Author

flycloudking commented Apr 3, 2019 via email

@yzhao062
Owner

yzhao062 commented Apr 3, 2019

I still cannot reproduce the result from this. What about the following?

    from pyod.utils.data import generate_data

    contamination = 0.1  # percentage of outliers
    n_train = 1000  # number of training points
    n_test = 100  # number of testing points

    # Generate sample data
    X_train, y_train, X_test, y_test = \
        generate_data(n_train=n_train,
                      n_test=n_test,
                      n_features=20,
                      contamination=contamination,
                      random_state=42)

The code above will create some numerical data for you; X_train has shape [1000, 20]. You can then add your code below it to show how you process the data. Please reply on GitHub rather than via email, otherwise the formatting is lost.

@flycloudking
Author

Hi Yue,

Please see the code below; just add the library imports (I use seaborn for the plot). This time the scores are negatively correlated, not sure why, but the correlation is strong.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    from pyod.models.pca import PCA
    from pyod.utils.data import generate_data

    contamination = 0.1  # percentage of outliers
    n_train = 1000  # number of training points
    n_test = 100  # number of testing points

    # Generate sample data
    X_train, y_train, X_test, y_test = \
        generate_data(n_train=n_train,
                      n_test=n_test,
                      n_features=20,
                      contamination=contamination,
                      random_state=42)

    print(X_train.shape)

    # global score: distance of each sample to the origin, min-max normalized to 0-1
    overall_dist = np.array([np.linalg.norm(x) for x in X_train])
    overall_dist = (overall_dist - overall_dist.min()) / (overall_dist.max() - overall_dist.min())

    clf_name = 'PCA'
    clf = PCA()
    clf.fit(X_train)

    # normalize the outlier scores to 0-1
    mm_score = (clf.decision_scores_ - clf.decision_scores_.min()) / \
        (clf.decision_scores_.max() - clf.decision_scores_.min())

    # prediction labels and outlier scores of the training data
    score_df = pd.DataFrame({clf_name + '_outlier': clf.labels_,
                             clf_name + '_score': mm_score.flatten()})
    score_df['global_score'] = overall_dist
    display(score_df.head())

    print(score_df.global_score.corr(score_df[clf_name + '_score']))
    sns.jointplot(x='global_score', y=clf_name + '_score', data=score_df)

@yzhao062
Owner

yzhao062 commented Apr 5, 2019

I just created an example for you. Download it here.

In this example, I checked the Pearson correlation between the distance and the scores of PCA, IForest, and KNN. The correlations are not that high when the dimension is low (d == 2), but when the dimension is high there can be a high correlation. This is an interesting observation, but I guess it is caused by the nature of the dataset... it is a simple dataset and the outlier pattern is clear.

So I provide another example (lines 41-67) with real data; you can comment out lines 22-37 to run it. You can see that the high correlation is not that serious on real-world datasets, so the phenomenon is data dependent.

Hope this helps.
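[Editor's note: the dimension check Yue describes can be sketched with numpy alone. This is only an illustrative stand-in: a plain k-th-nearest-neighbor distance replaces PyOD's KNN detector, and a single Gaussian blob replaces generate_data, so the exact correlation values will differ from the attached example.]

```python
import numpy as np

rng = np.random.default_rng(42)

def knn_score(X, k=5):
    """Outlier score: distance to the k-th nearest neighbor."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the zero distance to self

corrs = {}
for dim in (2, 20):
    X = rng.standard_normal((500, dim))       # one Gaussian blob
    global_score = np.linalg.norm(X, axis=1)  # distance to the origin
    corrs[dim] = np.corrcoef(global_score, knn_score(X))[0, 1]
    print(f"d={dim}: Pearson corr(global, kNN) = {corrs[dim]:.2f}")
```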

@flycloudking
Author

I also realized that standardizing the data has a big impact on the results. Simply adding

    X_train = StandardScaler().fit_transform(X_train)

before anomaly detection makes the correlation even higher; maybe the simulated data are not supposed to be standardized.
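[Editor's note: for reference, a numpy-only equivalent of the StandardScaler step above, assuming sklearn's default settings (per-feature zero mean, unit variance); the random data here is a stand-in.]

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 20))

# per-feature standardization, equivalent to
# sklearn's StandardScaler().fit_transform(X_train)
X_std = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
```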

@yzhao062
Owner

yzhao062 commented Apr 7, 2019

Yeah. So I believe there is no need to worry about the high correlation, which is data dependent. I will close this issue if you are happy with it :)

@flycloudking
Author

Not totally. I understand it may depend on the data. I have tested on several of my own data sets, which are real spending data from different countries. Some countries have reasonable clusters; some have no good clusters at all. All of them show a high correlation between the global distance and the PCA and autoencoder scores. I guess I was a bit disappointed, as I was hoping the autoencoder would provide a better method for anomaly detection, and it turned out to be essentially the same as a distance measure.
Anyway, thanks for spending time investigating this. I appreciate your help.

@flycloudking
Author

flycloudking commented Apr 8, 2019 via email

@yzhao062
Owner

yzhao062 commented Apr 8, 2019

Cool. I will close this thread, but feel free to open a separate one for a feature request :)

@yzhao062 yzhao062 closed this as completed Apr 8, 2019