# What this notebook is about


Several papers report that the "hubness phenomenon" (i.e. the presence of nodes with very high degree in nearest-neighbor graphs) creates problems for many machine learning algorithms, in particular for the KNN classifier.
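
To make the idea concrete, hubness is often quantified through the "k-occurrence" of each point (how many times it appears among the k nearest neighbors of other points) and the skewness of that distribution. The sketch below is a minimal illustration with scikit-learn and SciPy on random data. The helper name `k_occurrence` and the demo data are assumptions made for this example, and this is not the scikit-hubness implementation.

```python
import numpy as np
from scipy.stats import skew
from sklearn.neighbors import NearestNeighbors


def k_occurrence(X, k=5, metric="cosine"):
    """Count how often each point appears among the k nearest neighbors of the other points."""
    nn = NearestNeighbors(n_neighbors=k, metric=metric).fit(X)
    # Calling kneighbors() without arguments queries the training points
    # and does not count a point as its own neighbor.
    _, neighbor_idx = nn.kneighbors()
    return np.bincount(neighbor_idx.ravel(), minlength=X.shape[0])


# Random high-dimensional demo data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50))

counts = k_occurrence(X_demo, k=5)
k_skew_demo = float(skew(counts))  # strongly positive skewness indicates a few "hub" points
print("max k-occurrence:", counts.max(), "skewness:", round(k_skew_demo, 3))
```

A point with a k-occurrence far above k is a hub; a point with k-occurrence 0 never appears in any neighbor list.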

The Python package "scikit-hubness" (https://pypi.org/project/scikit-hubness/ , https://arxiv.org/abs/1912.00706 "scikit-hubness: Hubness Reduction and Approximate Neighbor Search") provides, among other things, several "hubness reduction" algorithms that sometimes improve the scores of KNN classifiers. (See the examples in the paper.)
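
As an illustration of what a hubness reduction method can look like, here is a rough sketch of an empirical "mutual proximity" rescaling of a distance matrix: two points are considered close if many other points are farther away from both of them. This is a simplified NumPy version written for this notebook under the assumption that the fraction is normalized by the number of points (conventions vary); the function name `mutual_proximity_distance` is hypothetical and the code is not taken from scikit-hubness.

```python
import numpy as np


def mutual_proximity_distance(D):
    """Turn a symmetric distance matrix D into secondary distances 1 - MP (empirical variant)."""
    n = D.shape[0]
    D_mp = np.zeros_like(D, dtype=float)
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = D[i, j]
            # Number of points that are farther than d_ij from BOTH i and j
            both_farther = np.sum((D[i] > d_ij) & (D[j] > d_ij))
            # Assumption: normalize by n; the diagonal is left at 0
            D_mp[i, j] = D_mp[j, i] = 1.0 - both_farther / n
    return D_mp


# Small random demo (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 20))
diff = X_demo[:, None, :] - X_demo[None, :, :]
D_demo = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean distance matrix

D_mp_demo = mutual_proximity_distance(D_demo)
```

The rescaled matrix could then be passed to any neighbor search that accepts precomputed distances.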

In this notebook we test that approach on gene expression datasets from the OpenML database.
(Dataset information is given below.)

The current version uses only default parameters of the scikit-hubness tools. Tuning the parameters might improve the results.
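
A possible way to tune parameters is a cross-validated grid search. The sketch below uses scikit-learn's own `KNeighborsClassifier` on synthetic data as a stand-in, so that it runs without scikit-hubness installed; the parameter grid is an arbitrary choice for illustration. Whether the scikit-hubness classifier and its `hubness` argument can be dropped into the same grid is an assumption that would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical; the real notebook uses OpenML datasets)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=50, n_informative=10, random_state=0
)

# Arbitrary illustrative grid over the number of neighbors and the metric
param_grid = {"n_neighbors": [3, 5, 11], "metric": ["cosine", "euclidean"]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```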

-----------------------------------------------------------


OpenML is a large collection of datasets for machine learning.
It contains at least 45 datasets related to gene expression.
Here is an example of how to work with them.

Below we load from OpenML the 45 datasets related to gene expression (microarray technology, cancer cells).



The datasets seem to originate from:
https://www.ncbi.nlm.nih.gov/gds?term=GSE2109

They can be found in the OpenML dataset storage, for example:
https://www.openml.org/d/1163
"GEMLeR provides a collection of gene expression datasets that can be used for benchmarking gene expression oriented machine learning algorithms. They can be used for estimation of different quality metrics (e.g. accuracy, precision, area under ROC curve, etc.) for classification, feature selection or clustering algorithms."
In the OpenML listing they can be selected by their number of features: 10937.


Reference:
Gregor Stiglic and Peter Kokol, "Stability of Ranked Gene Lists in Large Microarray Analysis Studies", Journal of Biomedicine and Biotechnology (Hindawi Publishing Corporation), Volume 2010, Article ID 616358, 9 pages, doi:10.1155/2010/616358.
https://pdfs.semanticscholar.org/4088/4ec5bcd5fbe1626cd3c95179e2ac601a0097.pdf


-----------

To work with OpenML:

**0 )**

To find the "did" (dataset id), we can first load the information on all datasets:

openml_df = openml.datasets.list_datasets(output_format="dataframe") # get pandas dataframe

The "did" column of that pandas dataframe gives the dataset ids; the other columns contain further information.

**1 )**

Once we know the "did" of a dataset, loading it takes two lines:

data = openml.datasets.get_dataset(did)

X, y, categorical_indicator, attribute_names = data.get_data(
      dataset_format="array", target=data.default_target_attribute )

See the code below.
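
The selection step can be tried offline on a small hand-made table with the same column names as the OpenML listing. The rows below are made up for illustration only (they are not real OpenML entries); the point is just the filter on `NumberOfFeatures` and the extraction of the ids.

```python
import pandas as pd

# Made-up listing with the same columns as the OpenML dataframe (hypothetical rows)
toy_listing = pd.DataFrame({
    "did": [101, 102, 103, 104],
    "name": ["toy_a", "toy_b", "toy_c", "toy_d"],
    "NumberOfInstances": [150, 300, 1000, 200],
    "NumberOfFeatures": [5, 10937, 20, 10937],
})

# Keep only rows with 10937 features, largest datasets first
selected = toy_listing[toy_listing["NumberOfFeatures"] == 10937].sort_values(
    "NumberOfInstances", ascending=False
)

# Plain Python ints, ready to be passed one by one to a dataset loader
dids = selected["did"].astype(int).tolist()
print(dids)
```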


Further docs: https://docs.openml.org/Python-guide/

See also some code here:

https://datascience.stackexchange.com/a/84241/32240




In [None]:
import numpy as np
import pandas as pd

import time



# Install scikit-hubness (should be done before installing openml!)

In [None]:
t0 = time.time()
!pip install scikit-hubness
# If this does not work in your environment (e.g. it does not on Google Colab), try:
# !pip install git+https://github.com/j-bac/scikit-hubness.git
print(time.time()-t0,'seconds passed')

from skhubness.neighbors import KNeighborsClassifier


# Install openml, get the list of datasets, select the gene-related datasets

In [None]:
!pip install openml

In [None]:
import openml

In [None]:
t0 = time.time()
openml_df = openml.datasets.list_datasets(output_format="dataframe")
print(time.time()-t0,'seconds passed ')

if 0:  # The same can be done with more lines of code
    openml_list = openml.datasets.list_datasets()  # returns a dict
    # Show a nice table with some key data properties
    openml_df = pd.DataFrame.from_dict(openml_list, orient="index")
    # Fixed: the original referred to an undefined name "datalist"
    openml_df = openml_df[["did", "name", "NumberOfInstances", "NumberOfFeatures", "NumberOfClasses"]]

    print(f"First 10 of {len(openml_df)} datasets...")
    openml_df.head(n=10)


openml_df

In [None]:
# Among all datasets there is a subset of 45 datasets with microarray gene expression data (mainly cancer).
# They can be selected by their number of features == 10937.
df_gemler = openml_df[openml_df.NumberOfFeatures == 10937].sort_values("NumberOfInstances", ascending=False)
print(df_gemler.shape)
df_gemler

# Load several datasets

In [None]:
list_datasets = []
t00 = time.time()

# Iterate over all selected datasets instead of hard-coding their number (45)
for i in range(len(df_gemler)):
  nm = df_gemler['name'].iloc[i]
  print(nm, i)
  did = int(df_gemler['did'].iloc[i])
  t0 = time.time()
  data = openml.datasets.get_dataset(did)
  # X: feature matrix, y: target vector
  X, y, categorical_indicator, attribute_names = data.get_data(
      dataset_format="array", target=data.default_target_attribute
  )
  dict_dataset_data = {'X': X, 'y': y, 'name': nm, 'did': did}
  list_datasets.append(dict_dataset_data)
  print(X.shape, y.shape, 'X size bytes:', X.size * X.itemsize, time.time() - t0, 'secs passed')

print(time.time() - t00, 'seconds passed total')



# Compare KNN to KNN+hubness reduction

In [None]:
from sklearn.model_selection import cross_val_score
from skhubness.neighbors import KNeighborsClassifier

df_stat = pd.DataFrame()
verbose = 0
t_previous_info_print = 0
timedelta4output_in_seconds = 3600

t00 = time.time()

#from sklearn.preprocessing import StandardScaler
#scaler = StandardScaler()

for i, dict_dataset_data in enumerate(list_datasets):
    X = dict_dataset_data['X']
    y = dict_dataset_data['y']
    nm = dict_dataset_data['name']

    # Use at most this many rows (the first ones); set a small cutoff to quickly test the idea
    cutoff4sample_size = min(100_000, X.shape[0])

    print(nm)

    df_stat.loc[i, 'Dataset'] = nm

    # Start the timer for the two cross-validation runs below
    t0 = time.time()

    # vanilla kNN
    knn_standard = KNeighborsClassifier(n_neighbors=5,
                                        metric='cosine')
    acc_standard = cross_val_score(knn_standard, X[:cutoff4sample_size,:], y[:cutoff4sample_size], cv=5)
    df_stat.loc[i,'Score'] = acc_standard.mean()

    
    # kNN with hubness reduction (mutual proximity)
    knn_mp = KNeighborsClassifier(n_neighbors=5,
                                  metric='cosine',
                                  hubness='mutual_proximity')
    acc_mp = cross_val_score(knn_mp, X[:cutoff4sample_size,:], y[:cutoff4sample_size], cv=5)
    df_stat.loc[i,'Score Hub Reduced'] = acc_mp.mean()
    
    df_stat.loc[i,'Score Improve'] = acc_mp.mean() - acc_standard.mean()
    
    # Service things:
    # Percentage of 1s in the target (this assumes a binary 0/1 target)
    df_stat.loc[i,'%1 in target'] = np.round(y.sum()/len(y) * 100 , 1) 
    df_stat.loc[i,'Time (seconds)'] = np.round(time.time() - t0, 2)
    df_stat.loc[i,'Sample size'] = cutoff4sample_size
    
    from skhubness import Hubness
    hub = Hubness(k=5, metric='cosine')
    hub.fit(X[:cutoff4sample_size,:])
    k_skew = hub.score()
    df_stat.loc[i,'Skewness'] = k_skew

    #print(f'Skewness = {k_skew:.3f}')
    #print(f'Robin hood index: {hub.robinhood_index:.3f}')
    #print(f'Antihub occurrence: {hub.antihub_occurrence:.3f}')
    #print(f'Hub occurrence: {hub.hub_occurrence:.3f}')

    if verbose > 10:
        print(f'Accuracy (vanilla kNN): {acc_standard.mean():.3f}')
        print(f'Accuracy (kNN with hubness reduction): {acc_mp.mean():.3f}')
        print(time.time() - t0, 'seconds passed')
    if (time.time() - t_previous_info_print  ) > timedelta4output_in_seconds:
        print(f'Processed {(i+1):d} datasets. Passed {time.time()-t00:.3f} seconds')
        t_previous_info_print = time.time()
        
        
        
print(time.time() - t00, 'total seconds passed')
df_stat        
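
The commented-out `StandardScaler` lines in the cell above hint at scaling the features before KNN. One way to try this without leaking statistics from the test folds is to put the scaler and the classifier into a scikit-learn pipeline, so that the scaler is refit on each training fold. The sketch below uses scikit-learn's own `KNeighborsClassifier` and synthetic data as stand-ins (assumptions made so that the example runs on its own); it is not part of the experiment above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=50, n_informative=10, random_state=0
)

# The scaler is fit only on the training part of each cross-validation split
knn_scaled = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="cosine"),
)
acc_scaled = cross_val_score(knn_scaled, X_demo, y_demo, cv=5)
print(f"Accuracy (scaled kNN): {acc_scaled.mean():.3f}")
```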

In [None]:
df_stat['Score Improve'].describe()
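
Besides the summary statistics, it may be useful to check whether the per-dataset differences are systematically positive. A possible sketch, assuming paired scores for the same datasets, is the share of improved datasets together with a Wilcoxon signed-rank test from SciPy. The numbers below are random placeholders, not results from this notebook; with the real data one would pass `df_stat['Score Hub Reduced']` and `df_stat['Score']` instead.

```python
import numpy as np
from scipy.stats import wilcoxon

# Random placeholder scores for 45 datasets (NOT real results)
rng = np.random.default_rng(0)
score_plain = rng.uniform(0.70, 0.95, size=45)
score_reduced = np.clip(score_plain + rng.normal(0.0, 0.02, size=45), 0.0, 1.0)

improve = score_reduced - score_plain
share_improved = float((improve > 0).mean())  # fraction of datasets with a higher score

# Paired, non-parametric test of whether the differences are centered at zero
stat, p_value = wilcoxon(score_reduced, score_plain)
print(f"improved on {share_improved:.0%} of datasets, Wilcoxon p-value = {p_value:.3f}")
```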

In [None]:
import matplotlib.pyplot as plt
plt.hist(df_stat['Score Improve'],bins = 20, label = 'Score Improve')
plt.legend()
plt.grid()
plt.show()
plt.hist(df_stat['Score Improve'].iloc[:20],bins = 20, label = 'Score Improve, first 20 datasets (largest by number of instances)')
plt.legend()
plt.grid()
plt.show()

In [None]:
df_stat.to_csv('df_stat.csv')

In [None]:

plt.figure(figsize = (20,6))
plt.plot(df_stat['Score Improve'], label = 'Score Improve')
plt.legend()
plt.grid()
plt.show()

plt.figure(figsize = (20,6))
plt.plot(df_stat['Score'], label = 'Score')
plt.legend()
plt.grid()
plt.show()

In [None]:
df_stat.iloc[:50,:]


# Create Plots 

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import umap 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
t00=time.time()

for dict_dataset_data in list_datasets:
    X = dict_dataset_data['X']
    y = dict_dataset_data['y']
    nm = dict_dataset_data['name']
    
    print(nm)
    
    t0=time.time()
    fig = plt.figure(figsize = (20,4) )
    fig.suptitle(nm +'\n'+str(X.shape) )
    c = 0;  n_x_subplots = 4

    c += 1; fig.add_subplot(1,n_x_subplots,c)  
    X2 = PCA().fit_transform(X)
    plt.scatter(X2[:,0],X2[:,1]   , c = y )
    plt.title('PCA')
    plt.grid()
    
    c += 1; fig.add_subplot(1,n_x_subplots,c)  
    X2 = PCA().fit_transform(scaler.fit_transform(X) ) 
    plt.scatter(X2[:,0],X2[:,1]   , c = y )
    plt.title('StandardScaler+PCA')
    plt.grid()

    c += 1; fig.add_subplot(1,n_x_subplots,c)  
    # UMAP on the raw (unscaled) features; the scaled variant is in the next subplot
    X2 = umap.UMAP().fit_transform(X) 
    plt.scatter(X2[:,0],X2[:,1]   , c = y )
    plt.title('umap')
    plt.grid()
    
    c += 1; fig.add_subplot(1,n_x_subplots,c)  
    X2 = umap.UMAP().fit_transform(scaler.fit_transform(X) ) 
    plt.scatter(X2[:,0],X2[:,1]   , c = y )
    plt.title('StandardScaler+umap')
    plt.grid()
    
    print(time.time()-t0,'seconds passed')

print(time.time()-t00,'seconds passed total')

plt.show()    