# What this notebook is about

**Context:** Several papers report that the "hubness phenomenon" (i.e. the presence of nodes with very high degree in nearest-neighbor graphs) causes problems for many machine learning algorithms, in particular for the KNN classifier.
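
To make the notion of a hub concrete, here is a minimal sketch that counts how often each point appears in the k-nearest-neighbor lists of the other points (the "k-occurrence") and computes the skewness of those counts, which is a commonly used hubness indicator. The helper names `k_occurrence` and `skewness` and the random demo data are assumptions made for illustration; they are not taken from the paper or from scikit-hubness.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_occurrence(X, k=5, metric="cosine"):
    """Count how often each point occurs among the k nearest neighbors of other points."""
    X = np.asarray(X)
    nn = NearestNeighbors(n_neighbors=k, metric=metric).fit(X)
    # Calling kneighbors() without a query uses the training points
    # and does not count a point as its own neighbor.
    neighbors = nn.kneighbors(return_distance=False)
    return np.bincount(neighbors.ravel(), minlength=X.shape[0])


def skewness(values):
    """Sample skewness (third standardized moment); 0.0 if all values are equal."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        return 0.0
    return float(np.mean((values - values.mean()) ** 3) / std ** 3)


# Hypothetical demo data: 200 random points in 500 dimensions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 500))

counts_demo = k_occurrence(X_demo, k=5)
skew_demo = skewness(counts_demo)
print("k-occurrence skewness:", round(skew_demo, 3))
print("largest k-occurrence:", counts_demo.max())
```

A strongly positive skewness means that a few points (hubs) appear in very many neighbor lists while many others appear in few or none. The same helper could be applied to the Dexter features loaded later in this notebook, e.g. `k_occurrence(X.values)`.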

The Python package "scikit-hubness" (https://pypi.org/project/scikit-hubness/ , https://arxiv.org/abs/1912.00706 "scikit-hubness: Hubness Reduction and Approximate Neighbor Search") provides, among other things, several "hubness reduction" algorithms, which sometimes improve the scores of KNN classifiers. (See the example in the paper.)

**Present notebook:** This notebook reproduces an example from the original paper.
KNN with hubness reduction improves accuracy from 0.793 to 0.893 compared to the standard KNN classifier.
The dataset is a part of the Dexter dataset.

**PS:** Some other examples of hubness reduction notebooks on Kaggle:

https://www.kaggle.com/alexandervc/moa-improve-knn-via-scikit-hubness-test-of-idea

https://www.kaggle.com/alexandervc/gendatasetsfromopenml-2-hubnes


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
f = '/kaggle/input/hubness-for-high-dimensional-datasets/X_Dexter300from_scikit_hubness.csv'
X = pd.read_csv(f, index_col=0)
X

In [None]:
f = '/kaggle/input/hubness-for-high-dimensional-datasets/y_Dexter300from_scikit_hubness.csv'
y = pd.read_csv(f, index_col=0)
y

In [None]:
y = y.values.ravel()

In [None]:
# Fraction of samples with label 1 (class balance check)
(y == 1).sum() / len(y)

In [None]:
!pip install scikit-hubness


# Advantage of hubness reduction - example from the paper

Standard KNN from sklearn gives an accuracy score of 0.793; with hubness reduction it reaches 0.893.

## Standard KNN classifier

In [None]:
%%time
# from skhubness.data import load_dexter
# X, y = load_dexter()
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier as KNeighborsClassifier_sklearn

# Standard kNN with cosine distance, no hubness reduction
knn_vanilla = KNeighborsClassifier_sklearn(n_neighbors=5, metric="cosine")
# Accuracy on each of the 5 cross-validation folds
acc_vanilla = cross_val_score(knn_vanilla, X, y, cv=5)
# Mean accuracy (vanilla kNN):
print(f"{acc_vanilla.mean():.3f}")
# 0.793

## KNN classifier with Hubness reduction -  "mutual_proximity"
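
As a rough illustration of what a mutual-proximity style rescaling does, the sketch below computes an empirical secondary distance from a plain distance matrix: two points count as close if few other points lie nearer to either of them than they lie to each other. This is a simplified O(n^3) version written for readability under that assumption. It is not the scikit-hubness implementation, and the function name `mutual_proximity_distances` and the random demo data are hypothetical.

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def mutual_proximity_distances(D):
    """Empirical mutual proximity, returned as the secondary distance 1 - MP."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    D_mp = np.zeros_like(D)
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = D[i, j]
            # Points that are farther from both i and j than i and j are from each other.
            both_farther = (D[i] > d_ij) & (D[j] > d_ij)
            both_farther[[i, j]] = False  # never count the pair itself
            mp = both_farther.sum() / (n - 2)
            # High mutual proximity means a small secondary distance.
            D_mp[i, j] = D_mp[j, i] = 1.0 - mp
    return D_mp


# Hypothetical demo data: 50 random points in 100 dimensions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 100))

D_demo = pairwise_distances(X_demo, metric="cosine")
D_mp = mutual_proximity_distances(D_demo)
print(D_mp.shape, D_mp.min(), D_mp.max())
```

The cell below does not use this helper; it relies on the `hubness="mutual_proximity"` option of the scikit-hubness classifier.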

In [None]:
%%time
# from skhubness.data import load_dexter
# X, y = load_dexter()
from sklearn.model_selection import cross_val_score
from skhubness.neighbors import KNeighborsClassifier

# kNN with cosine distance and mutual proximity hubness reduction
knn_mp = KNeighborsClassifier(n_neighbors=5, metric="cosine", hubness="mutual_proximity")
# Accuracy on each of the 5 cross-validation folds
acc_mp = cross_val_score(knn_mp, X, y, cv=5)
# Mean accuracy (hubness-reduced kNN):
print(f"{acc_mp.mean():.3f}")
# 0.893


## Test other hubness reduction methods 

In [None]:
%%time
# from skhubness.data import load_dexter
# X, y = load_dexter()
from skhubness.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Compare several hubness reduction methods; None means no hubness reduction
for hubness in ["mutual_proximity", "local_scaling", "dis_sim_local", None]:
    print(hubness)
    knn = KNeighborsClassifier(n_neighbors=5, metric="cosine", hubness=hubness)
    acc = cross_val_score(knn, X, y, cv=5)
    # Mean 5-fold cross-validation accuracy for this method
    print(f"{acc.mean():.3f}")
