# Tutorial 1 - Using MetaFeaturesRegression + OpenML API <a class="tocSkip">
 
The objective of this tutorial is to show how to download datasets through the OpenML API and compute their meta-features using the `MetaFeaturesRegression` class from PyMeta.

## Installing Python's OpenML API

To use the OpenML API from Python, install the `openml` package with pip:

```pip install openml```

Then sign up on the [OpenML site](https://www.openml.org/) to get the API key used for authentication.

## Downloading dataset from OpenML API

First we need to import the `openml` package and set the API key.

In [1]:
import openml

# you must configure your API key
openml.config.apikey = "your api key goes here"
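
Hard-coding the key in a notebook makes it easy to leak. A minimal sketch of one alternative, assuming the key has been exported in an environment variable; the variable name `OPENML_API_KEY` and the helper `read_api_key` are hypothetical and not part of the `openml` package:

```python
import os


def read_api_key(env_var="OPENML_API_KEY", environ=None):
    """Return the API key stored in `env_var`, or raise if it is missing."""
    # `environ` can be overridden with a plain dict, which makes testing easy
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError("Set the %s environment variable first" % env_var)
    return key


# Example with an explicit mapping instead of the real environment
print(read_api_key(environ={"OPENML_API_KEY": "  my-secret-key  "}))
```

The returned string could then be assigned to `openml.config.apikey` exactly as in the cell above.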

After setting the API key we can download a dataset. Keep in mind that each dataset is referenced by an integer ID, which is the last segment of the dataset's page URL on OpenML.

For the Boston house-price dataset, for example, the ID is 531, as you can see in the page URL:

https://www.openml.org/d/531
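
If only the page URL is at hand, the ID can be read off its last path segment. A small sketch, assuming URLs of the form shown above; `dataset_id_from_url` is a hypothetical helper, not an `openml` function:

```python
from urllib.parse import urlparse


def dataset_id_from_url(url):
    """Extract the integer dataset ID from an OpenML dataset page URL."""
    # Keep the non-empty path segments, e.g. "/d/531" -> ["d", "531"]
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) < 2 or segments[-2] != "d" or not segments[-1].isdigit():
        raise ValueError("Not an OpenML dataset page URL: %r" % url)
    return int(segments[-1])


print(dataset_id_from_url("https://www.openml.org/d/531"))  # prints 531
```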

Let's download this dataset and see summary information about it:

In [2]:
# This is done based on the dataset ID.
dataset = openml.datasets.get_dataset(531)

# Print a summary
print("This is dataset '%s', the target feature is '%s'" %
      (dataset.name, dataset.default_target_attribute))
print("URL: %s" % dataset.url)
print(dataset.description)

This is dataset 'boston', the target feature is 'MEDV'
URL: https://www.openml.org/data/v1/download/52643/boston.arff
**Author**:   
**Source**: Unknown - Date unknown  
**Please cite**:   

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.
Variables in order:
CRIM     per capita crime rate by town
ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS    proportion of non-retail business acres per town
CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX      nitric oxides concentration (parts per 10 million)
RM       average number of rooms per dwelling
AGE      proportion of owner-occupied units built prior to 1940
DIS      weighted distances to five Bost

For this task, we will treat `MEDV` as the target column. Passing it to `get_data` returns the features DataFrame, the target, a boolean mask marking the categorical columns, and the feature names.

In [3]:
X, y, categorical_columns, columns_names = dataset.get_data('MEDV')
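
The categorical mask is a list of booleans aligned with the feature names: position `i` is `True` when feature `i` is categorical. As a sketch of how the two lists fit together, the snippet below rebuilds them by hand from the columns shown in this tutorial (the literal lists are typed in for illustration rather than fetched from OpenML) and recovers the categorical names:

```python
from itertools import compress

# Feature names in the order shown in the features table of this tutorial
columns_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Hand-written mask: True marks a categorical feature (CHAS and RAD here)
categorical_columns = [name in {"CHAS", "RAD"} for name in columns_names]

# compress() keeps the names whose mask entry is True
categorical_names = list(compress(columns_names, categorical_columns))
print(categorical_names)  # prints ['CHAS', 'RAD']
```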

In [4]:
import numpy

In [5]:
# view the first five samples of the features DataFrame
X.head()

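| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 |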


In [12]:
# view the first five target values
y.head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64

In [7]:
# view the first five samples of the categorical columns
X.loc[:, categorical_columns].head()

| | CHAS | RAD |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 2 |
| 2 | 0 | 2 |
| 3 | 0 | 3 |
| 4 | 0 | 3 |


## Get meta features with MetaFeaturesRegression

First, import the `MetaFeaturesRegression` class from PyMeta.

In [8]:
import sys
from os.path import join, abspath
from pathlib import Path

# get project dir
project_dir = Path(abspath('')).resolve().parent
# add it to path
sys.path.append(join(project_dir))

# get MetaFeaturesRegression
from pymeta.meta_learning import MetaFeaturesRegression

Then, instantiate the object.

In [9]:
mfr = MetaFeaturesRegression(
        dataset_name='Boston',
        random_state=42,
        n_jobs=3,
        categorical_mask=categorical_columns        
)

Fit the meta-features on the Boston dataset.

In [10]:
mfr.fit(X, y)

MetaFeaturesRegression(coeficient_of_variation_target=0.40776152837415536,
            collective_feature_efficiency=0.8932806324110671,
            dataset='Boston', error_of_nn_regressor=3.2978260869565217,
            example_features_ratio=38.92307692307692,
            individual_feature_efficiency=0.17391304347826086,
            input_distribution=0.7683558091314391,
            max_feature_correlation_target=0.8529141394922163,
            max_kurtosis_numerical_features=37.13050912952209,
            max_mean_numerical_features=0.8985678340323596,
            max_skewness_numerical_features=5.223148798243857,
            max_std_numerical_features=0.3216357176621353,
            mean_absolute_residuos=3.3357555472569778,
            mean_feature_correlation=0.493447984615287,
            mean_feature_correlation_target=0.5383475150530329,
            mean_kurtosis_numerical_features=4.324372369472598,
            mean_mean_numerical_features=0.41640480866218943,
            me

In [11]:
# get metafeatures as pandas.DataFrame
mfr.qualities()

| | 0 |
|---|---|
| dataset | Boston |
| n_of_examples | 506 |
| n_of_features | 13 |
| proportion_of_categorical | 0.153846 |
| example_features_ratio | 38.9231 |
| proportion_of_attributes_outliers | 0.636364 |
| coeficient_of_variation_target | 0.407762 |
| outliers_on_target | 1 |
| stationarity_of_target | 0 |
| r2_without_categorical | 0.722457 |
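
Some of these values can be sanity-checked by hand. The sketch below assumes, without having checked the PyMeta source, that `example_features_ratio` is the number of examples divided by the number of features and that `proportion_of_categorical` is the share of categorical features; under those assumptions the results agree with the table to the printed precision.

```python
n_of_examples = 506      # from the qualities table
n_of_features = 13       # from the qualities table
n_categorical = 2        # CHAS and RAD, per the categorical mask

# Assumed definitions (not confirmed against the PyMeta implementation)
example_features_ratio = n_of_examples / n_of_features
proportion_of_categorical = n_categorical / n_of_features

print(round(example_features_ratio, 4))     # prints 38.9231
print(round(proportion_of_categorical, 6))  # prints 0.153846
```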
