# What this is about

OpenML is a large collection of datasets for machine learning.
It contains at least 45 datasets related to gene expression.
This notebook is an example of how to work with them.

It works with the 45 gene expression datasets in the OpenML collection (microarray technology, cancer cells).



The datasets appear to originate from here:
https://www.ncbi.nlm.nih.gov/gds?term=GSE2109

They can be found in the OpenML dataset storage, for example:
https://www.openml.org/d/1163
The description there reads: "GEMLeR provides a collection of gene expression datasets that can be used for benchmarking gene expression oriented machine learning algorithms. They can be used for estimation of different quality metrics (e.g. accuracy, precision, area under ROC curve, etc.) for classification, feature selection or clustering algorithms."
The whole group can be found by its number of features: 10937.


Reference: Gregor Stiglic and Peter Kokol, "Stability of Ranked Gene Lists in Large Microarray Analysis Studies", Journal of Biomedicine and Biotechnology (Hindawi Publishing Corporation), Volume 2010, Article ID 616358, 9 pages, doi:10.1155/2010/616358
https://pdfs.semanticscholar.org/4088/4ec5bcd5fbe1626cd3c95179e2ac601a0097.pdf


-----------

To work with openml:

**1 )**

Basically, we need to know the "did" (dataset id) and then use two lines:

data = openml.datasets.get_dataset(did)

X, y, categorical_indicator, attribute_names = data.get_data(
      dataset_format="array", target=data.default_target_attribute )

**0 )**

To find the "did" (dataset id), we can first load the information on all datasets:

openml_df = openml.datasets.list_datasets(output_format="dataframe") # get pandas dataframe

The "did" column of that pandas dataframe gives the dataset id; the other columns contain further metadata.

See the code below.
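
If the installed openml version returns pandas objects (for example when `dataset_format="dataframe"` is requested instead of `"array"`), the features and labels can be converted to NumPy arrays afterwards. The following is only a minimal sketch on toy data: the helper name `frame_to_arrays`, the column names and the label values are hypothetical, and nothing is downloaded.

```python
import numpy as np
import pandas as pd


def frame_to_arrays(X_df, y_series):
    """Convert a feature DataFrame and a label Series to NumPy arrays.

    Labels are mapped to integer codes so that they can be used
    directly as colors in a scatter plot.
    """
    X = X_df.to_numpy(dtype=np.float32)
    y = pd.Series(y_series).astype("category").cat.codes.to_numpy()
    return X, y


# Toy stand-in for what data.get_data(dataset_format="dataframe", ...) might return
X_df = pd.DataFrame({"gene_a": [0.1, 0.5, 0.9], "gene_b": [1.0, 2.0, 3.0]})
y_series = pd.Series(["Breast", "Colon", "Breast"])

X, y = frame_to_arrays(X_df, y_series)
print(X.shape, y)
```

The integer codes follow the sorted order of the label values, so the same label always gets the same code within one dataset.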


Further docs: https://docs.openml.org/Python-guide/

See also some code here:

https://datascience.stackexchange.com/a/84241/32240




# Install openml, get the list of datasets, select gene-related datasets

In [None]:
pip install openml

In [None]:
import numpy as np
import pandas as pd
import time

import openml

In [None]:
t0 = time.time()
openml_df = openml.datasets.list_datasets(output_format="dataframe")
print(time.time()-t0,'seconds passed ')

if 0:  # The same can be done with more lines of code
    openml_list = openml.datasets.list_datasets()  # returns a dict
    # Show a nice table with some key data properties
    openml_df = pd.DataFrame.from_dict(openml_list, orient="index")
    openml_df = openml_df[["did", "name", "NumberOfInstances", "NumberOfFeatures", "NumberOfClasses"]]

    print(f"First 10 of {len(openml_df)} datasets...")
    openml_df.head(n=10)


openml_df

In [None]:
# Among all datasets there is a subset of 45 datasets with microarray gene expression data (mainly cancer);
# it can be selected by number of features == 10937
df_gemler = openml_df[openml_df.NumberOfFeatures == 10937].sort_values(["NumberOfInstances"], ascending=False)
print(df_gemler.shape)
df_gemler
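
The selection above can be tried offline on a small hand-made table. This is only a sketch: the rows, names and ids in `toy_df` are hypothetical and stand in for the real metadata returned by `openml.datasets.list_datasets`.

```python
import pandas as pd

# Hypothetical metadata rows; real values come from openml.datasets.list_datasets
toy_df = pd.DataFrame(
    {
        "did": [11, 12, 13, 14],
        "name": ["toy_a", "toy_b", "toy_c", "toy_d"],
        "NumberOfInstances": [150, 1545, 300, 630],
        "NumberOfFeatures": [5, 10937, 10937, 10937],
    }
)

# Keep only rows with the characteristic feature count, largest datasets first
selected = toy_df[toy_df["NumberOfFeatures"] == 10937].sort_values(
    "NumberOfInstances", ascending=False
)

# Plain Python ints are convenient to pass on to openml.datasets.get_dataset
dids = selected["did"].astype(int).tolist()
print(dids)
```

Selecting by feature count is a shortcut that relies on no unrelated dataset having the same number of features, so it is worth checking the names of the selected rows as well.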

# Load first five datasets

In [None]:
import time
list_datasets = []
for i in range(5):
  nm = df_gemler['name'].iloc[i]
  print(nm, i )
  did =  int( df_gemler['did'].iloc[i] )
  t0 = time.time()
  data = openml.datasets.get_dataset(did)
  X, y, categorical_indicator, attribute_names = data.get_data(
      dataset_format="array", target=data.default_target_attribute
  )
  dict_dataset_data = {'X':X, 'y': y, 'name':nm , 'did':did}  
  list_datasets.append(dict_dataset_data)
  print(X.shape, y.shape, 'X size bytes:', X.size*X.itemsize,  time.time()-t0,'secs passed' )
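
The size printed above follows directly from the array shape and the element size. A small arithmetic sketch, assuming 4-byte (float32) values, a hypothetical 1545 samples, and 10936 feature columns (10937 minus the target column, which is an assumption about how the feature count is reported):

```python
def dense_size_bytes(n_rows, n_cols, itemsize=4):
    """Bytes needed for a dense n_rows x n_cols array with the given item size."""
    return n_rows * n_cols * itemsize


# 1545 rows is a hypothetical sample count; 10936 columns and
# itemsize=4 (float32) are assumptions, see the text above
size = dense_size_bytes(1545, 10936, itemsize=4)
print(f"{size / 1e6:.1f} MB")
```

Estimates like this help to decide how many datasets can be kept in `list_datasets` at once before memory becomes a concern.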

# Create Plots 

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import umap 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

for dict_dataset_data in list_datasets:
    X = dict_dataset_data['X']
    y = dict_dataset_data['y']
    nm = dict_dataset_data['name']
    
    print(nm)
    
    t0=time.time()
    fig = plt.figure(figsize = (20,4) )
    fig.suptitle(nm +'\n'+str(X.shape) )
    c = 0;  n_x_subplots = 4

    c += 1; fig.add_subplot(1,n_x_subplots,c)  
    X2 = PCA().fit_transform(X)
    plt.scatter(X2[:,0],X2[:,1]   , c = y )
    plt.title('PCA')
    plt.grid()
    
    c += 1; fig.add_subplot(1,n_x_subplots,c)  
    X2 = PCA().fit_transform(scaler.fit_transform(X) ) 
    plt.scatter(X2[:,0],X2[:,1]   , c = y )
    plt.title('StandardScaler+PCA')
    plt.grid()

    c += 1; fig.add_subplot(1,n_x_subplots,c)  
    X2 = umap.UMAP().fit_transform(X)  # no scaling here, to compare with the scaled version in the next panel
    plt.scatter(X2[:,0],X2[:,1]   , c = y )
    plt.title('umap')
    plt.grid()
    
    c += 1; fig.add_subplot(1,n_x_subplots,c)  
    X2 = umap.UMAP().fit_transform(scaler.fit_transform(X) ) 
    plt.scatter(X2[:,0],X2[:,1]   , c = y )
    plt.title('StandardScaler+umap')
    plt.grid()
    
    print(time.time()-t0,'seconds passed')


plt.show()    
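
The 'PCA' and 'StandardScaler+PCA' panels can look different because PCA follows the directions of largest variance, so features on a large scale dominate unless the data are standardized first. The sketch below illustrates this on synthetic data with NumPy only; the helper `first_pc` and the two toy features are hypothetical and not part of the datasets above.

```python
import numpy as np


def first_pc(X):
    """Return the first principal direction of X (rows are samples)."""
    Xc = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]


rng = np.random.default_rng(0)
n = 200
# Two independent synthetic features on very different scales
X = np.column_stack([rng.normal(0, 100.0, n), rng.normal(0, 1.0, n)])

# Without scaling, the first component points almost entirely along feature 0
pc_raw = first_pc(X)

# After standardization both features contribute to the first component
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc_std = first_pc(X_std)

print("raw:", np.round(np.abs(pc_raw), 3))
print("standardized:", np.round(np.abs(pc_std), 3))
```

Whether scaling helps on the gene expression data depends on how the values were preprocessed, which is why the notebook plots both variants side by side.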