## Integration with Auctus

First, import the `AutoML` class along with the other libraries used in this notebook. If you plan to run AlphaD3M via Docker or Singularity, import the `DockerAutoML` or `SingularityAutoML` class instead.

In [1]:
from alphad3m import AutoML
# from alphad3m_containers import DockerAutoML as AutoML  # or SingularityAutoML
from io import BytesIO
import requests
import pandas
import json
import zipfile

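If `alphad3m` is not installed in the current environment, the import raises `ModuleNotFoundError`. The following is a minimal sketch of one way to fall back from the native package to a container wrapper. The helper `load_first_available` is hypothetical (it is not part of AlphaD3M), and the module and class names in `CANDIDATES` are taken from the import lines in this notebook; adjust them to your installation.

```python
import importlib

# (module, attribute) pairs tried in order; names taken from this notebook's imports.
CANDIDATES = [
    ("alphad3m", "AutoML"),
    ("alphad3m_containers", "DockerAutoML"),
    ("alphad3m_containers", "SingularityAutoML"),
]


def load_first_available(candidates):
    """Return (qualified_name, object) for the first importable candidate, or None."""
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:  # ModuleNotFoundError is a subclass of ImportError
            continue
        if hasattr(module, attr_name):
            return f"{module_name}.{attr_name}", getattr(module, attr_name)
    return None


found = load_first_available(CANDIDATES)
if found is None:
    print("No AlphaD3M package could be imported.")
else:
    name, AutoML = found
    print(f"Using {name}")
```

Note that this sketch also skips a package whose import fails because of a missing dependency, so check the printed name before continuing.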

### Generating pipelines

In this example, we generate pipelines for a CSV dataset, the [PHEM dataset](https://gitlab.com/ViDA-NYU/d3m/alphad3m/-/tree/devel/examples/datasets). The data is provided by the Ethiopian Public Health Institute (PHEM) and contains reported cases of malnutrition at the woreda (district) level.

In [None]:
output_path = 'tmp/'
train_dataset = 'datasets/PHEM/train_data.csv'

automl = AutoML(output_path)
automl.plot_summary_dataset(train_dataset)

In [3]:
automl.search_pipelines(train_dataset, time_bound=5, target='Malnutrition_total Cases', task_keywords=['regression'])

INFO: Initializing AlphaD3M AutoML...
INFO: Creating Docker container automl-container-57437...
INFO: Connecting via gRPC to localhost:57437...
INFO: AlphaD3M AutoML initialized!
INFO: Found pipeline id=a1b793c8-cc6a-46f2-9e8f-6d1871f07cf0, time=0:00:46.867823, scoring...
INFO: Found pipeline id=ddb0ddac-da84-40e8-93ec-e17d0b466989, time=0:00:59.579115, scoring...
INFO: Found pipeline id=ac1829b1-d646-42bb-b9c6-2327f858fb83, time=0:01:09.562032, scoring...
INFO: Found pipeline id=304c141d-2c2c-4bfe-9ff7-4f18cac790d9, time=0:01:11.814985, scoring...
INFO: Found pipeline id=8bfb0119-2726-441e-8f81-34254ec54560, time=0:01:14.227199, scoring...
INFO: Found pipeline id=c4196b62-f5d1-43c7-944e-4b9b79494676, time=0:01:20.966919, scoring...
INFO: Scored pipeline id=a1b793c8-cc6a-46f2-9e8f-6d1871f07cf0, root_mean_squared_error=6.67721
INFO: Scored pipeline id=ddb0ddac-da84-40e8-93ec-e17d0b466989, root_mean_squared_error=6.75922
INFO: Scored pipeline id=8bfb0119-2726-441e-8f81-34254ec54560, root
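To pick out the best-scoring pipeline from a log like this one, the scored lines can be parsed directly. This is only a sketch that assumes the `Scored pipeline id=..., <metric>=<value>` format shown in the log; `parse_scores` is a hypothetical helper, and incomplete lines are skipped rather than guessed at.

```python
import re

# Matches lines such as:
# INFO: Scored pipeline id=<uuid>, root_mean_squared_error=6.67721
SCORED_LINE = re.compile(
    r"Scored pipeline id=(?P<id>[0-9a-f-]+), "
    r"(?P<metric>\w+)=(?P<value>-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_scores(log_text):
    """Return {pipeline_id: score} for every complete 'Scored pipeline' line."""
    scores = {}
    for line in log_text.splitlines():
        match = SCORED_LINE.search(line)
        if match:
            scores[match.group("id")] = float(match.group("value"))
    return scores


log_text = """\
INFO: Scored pipeline id=a1b793c8-cc6a-46f2-9e8f-6d1871f07cf0, root_mean_squared_error=6.67721
INFO: Scored pipeline id=ddb0ddac-da84-40e8-93ec-e17d0b466989, root_mean_squared_error=6.75922
INFO: Scored pipeline id=8bfb0119-2726-441e-8f81-34254ec54560, root
"""

scores = parse_scores(log_text)
best_id = min(scores, key=scores.get)  # lower RMSE is better
print(best_id, scores[best_id])
```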

### Searching for datasets with Auctus

In [4]:
def print_results(results):
    if not results:
        return
    for result in results:
        print(result['metadata']['name'])
        print('Score: ', result['score'])
        if 'augmentation' in result:
            aug_type = result['augmentation']['type']
            print('Augmentation: %s' % aug_type)
            print("Left Columns: %s" %
                  str(result['augmentation']['left_columns_names']))
            print("Right Columns: %s" %
                  str(result['augmentation']['right_columns_names']))
            
        print("-------------------\n")

In [2]:
url = 'https://auctus.vida-nyu.org/api/v1/search'
query = {
    'keywords': ['weather']
}

with open(train_dataset, 'rb') as data:
    response = requests.post(
        url,
        files={
            'data': data,
            'query': ('query.json', json.dumps(query), 'application/json'),
        }
    )
if response.status_code == 400:
    try:
        print('Error: %s' % response.json()['error'])
    except (KeyError, ValueError):
        pass
response.raise_for_status()
query_results = response.json()['results']
print_results(query_results)

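The same error-handling pattern (print the server's message on HTTP 400, then raise) is used for every Auctus request in this notebook, so it can be factored into one function. A minimal sketch, where `raise_for_api_error` is a hypothetical helper name and `response` is assumed to behave like a `requests.Response`:

```python
def raise_for_api_error(response):
    """Print the server's error message on HTTP 400, then raise on any HTTP error."""
    if response.status_code == 400:
        try:
            print('Error: %s' % response.json()['error'])
        except (KeyError, TypeError, ValueError):
            pass  # body was not JSON, or it had no 'error' field
    response.raise_for_status()
```

Calling `raise_for_api_error(response)` would then replace the `if response.status_code == 400:` block and the `response.raise_for_status()` call that follows it.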

In [14]:
index_to_use = 3  # Chosen by the user; here, the "Weather Data for Oromia, Ethiopia" dataset
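A hard-coded index depends on the order in which the search results happen to be returned, which may change between runs. As an alternative, the sketch below looks the dataset up by name. `find_result_index` is a hypothetical helper, and `example_results` is a made-up stand-in for `query_results` that contains only the `metadata.name` field the helper reads.

```python
def find_result_index(results, name_fragment):
    """Return the index of the first result whose name contains name_fragment."""
    fragment = name_fragment.lower()
    for index, result in enumerate(results):
        if fragment in result['metadata']['name'].lower():
            return index
    raise LookupError('No search result matches %r' % name_fragment)


# Stand-in for query_results; the first name is a placeholder.
example_results = [
    {'metadata': {'name': 'Some other dataset'}},
    {'metadata': {'name': 'Weather Data for Oromia, Ethiopia'}},
]
print(find_result_index(example_results, 'weather data for oromia'))
```

With real results this would be used as `index_to_use = find_result_index(query_results, 'Weather Data for Oromia')`.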

### Downloading a Dataset

In [15]:
dataset_id = query_results[index_to_use]['id']
response = requests.get('https://auctus.vida-nyu.org/api/v1/download/' + dataset_id)
response.raise_for_status()
new_dataset = pandas.read_csv(BytesIO(response.content))
display(new_dataset)

Unnamed: 0,date_time,wind_dir,wind_speed,visibility,temperature,air_pressure,precipitation
0,2009-12-28,186.533333,1.747573,23963.809524,19.860606,1014.888889,11.050000
1,2010-01-04,197.232143,1.556180,24493.513514,19.887079,1014.413846,
2,2010-01-11,172.769231,1.756477,23966.336634,19.951579,1016.488136,
3,2010-01-18,168.333333,2.571429,27142.857143,27.200000,1011.200000,
4,2010-01-25,166.052632,2.536842,24858.000000,22.061053,1014.450000,3.000000
...,...,...,...,...,...,...,...
467,2019-09-30,164.557823,2.360544,24097.560976,19.067317,1017.753846,12.765625
468,2019-10-07,151.896552,2.491379,23376.300578,19.180347,1016.910127,7.280952
469,2019-10-14,137.559055,2.685039,25774.345550,20.099476,1018.671084,7.400000
470,2019-10-21,173.852459,2.565574,24414.634146,21.063415,1018.568750,4.550000
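The preview shows empty `precipitation` cells for several weeks, so it may be worth counting missing values before joining this table to the training data. The sketch below does this without extra dependencies for the first five preview rows, copied verbatim; `count_missing` is a hypothetical helper. On the actual DataFrame, `new_dataset.isna().sum()` reports the same per-column counts for the whole table.

```python
import csv
import io

# First five rows of the preview, copied verbatim.
preview = """\
Unnamed: 0,date_time,wind_dir,wind_speed,visibility,temperature,air_pressure,precipitation
0,2009-12-28,186.533333,1.747573,23963.809524,19.860606,1014.888889,11.050000
1,2010-01-04,197.232143,1.556180,24493.513514,19.887079,1014.413846,
2,2010-01-11,172.769231,1.756477,23966.336634,19.951579,1016.488136,
3,2010-01-18,168.333333,2.571429,27142.857143,27.200000,1011.200000,
4,2010-01-25,166.052632,2.536842,24858.000000,22.061053,1014.450000,3.000000
"""


def count_missing(csv_text):
    """Count empty cells per column in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row.get(name)
            if value is None or value.strip() == '':
                missing[name] += 1
    return missing


missing = count_missing(preview)
print(missing)
```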


### Augmenting a Dataset

In [16]:
url = 'https://auctus.vida-nyu.org/api/v1/augment'
task = query_results[index_to_use]

with open(train_dataset, 'rb') as data:
    response = requests.post(
        url,
        files={
            'data': data,
            'task': ('task.json', json.dumps(task), 'application/json'),
        },
        stream=True,
    )
if response.status_code == 400:
    try:
        print('Error: %s' % response.json()['error'])
    except (KeyError, ValueError):
        pass
response.raise_for_status()
zip_ = zipfile.ZipFile(BytesIO(response.content), 'r')
learning_data = pandas.read_csv(zip_.open('tables/learningData.csv'))
dataset_doc = json.load(zip_.open('datasetDoc.json'))
zip_.close()
augmented_dataset = 'datasets/PHEM/train_data_augmented.csv'

learning_data.to_csv(augmented_dataset, index=False)
learning_data.head()

Unnamed: 0,Week_date_time,Week_date_time_woreda,RegionName,ZoneName,WoredaName,Total Malaria Confirmed and Clinical,TMalaria_OutP_Cases,TMalaria_InP_Cases,TMalaria_InP_Deaths,TMSuspected Fever Examined,...,max temperature,min temperature,mean air_pressure,sum air_pressure,max air_pressure,min air_pressure,mean precipitation,sum precipitation,max precipitation,min precipitation
0,2017-01-02,2017-01-02_North Shewa_Hidabu Abote,Oromia,North Shewa,Hidabu Abote,7,7.0,0.0,0.0,27.0,...,19.563514,19.563514,1017.787156,1017.787156,1017.787156,1017.787156,,,,
1,2017-01-02,2017-01-02_Qeleme Wellega_Dambi Dolo Hospital,Oromia,Qeleme Wellega,Dambi Dolo Hospital,4,3.0,1.0,0.0,161.0,...,19.563514,19.563514,1017.787156,1017.787156,1017.787156,1017.787156,,,,
2,2017-01-02,2017-01-02_Buno Bedele_Boricha,Oromia,Buno Bedele,Boricha,3,3.0,0.0,0.0,12.0,...,19.563514,19.563514,1017.787156,1017.787156,1017.787156,1017.787156,,,,
3,2017-01-02,2017-01-02_Jimma_Gera,Oromia,Jimma,Gera,1,1.0,0.0,0.0,41.0,...,19.563514,19.563514,1017.787156,1017.787156,1017.787156,1017.787156,,,,
4,2017-01-02,2017-01-02_Arsi_Gololcha,Oromia,Arsi,Gololcha,11,11.0,0.0,0.0,31.0,...,19.563514,19.563514,1017.787156,1017.787156,1017.787156,1017.787156,,,,
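The ZIP handling in the cell above can be wrapped in a small function that uses context managers, so the archive is closed even if reading one of its members fails. This is a sketch: `read_augmented_zip` is a hypothetical helper, the two member names are the ones used in this notebook, and the archive contents built for the demonstration are made-up placeholders.

```python
import io
import json
import zipfile


def read_augmented_zip(zip_bytes):
    """Return (learning_data_csv_bytes, dataset_doc_dict) from an augmentation ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as archive:
        csv_bytes = archive.read('tables/learningData.csv')
        with archive.open('datasetDoc.json') as doc_file:
            dataset_doc = json.load(doc_file)
    return csv_bytes, dataset_doc


# Build a tiny in-memory archive with the same two member names to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('tables/learningData.csv', 'id,value\n0,1.5\n')
    archive.writestr('datasetDoc.json', json.dumps({'about': {'datasetName': 'demo'}}))

csv_bytes, dataset_doc = read_augmented_zip(buffer.getvalue())
print(csv_bytes.decode('utf-8'))
print(dataset_doc)
```

With a real response, `read_augmented_zip(response.content)` returns the CSV bytes, which can be loaded with `pandas.read_csv(BytesIO(csv_bytes))`.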


### Generating pipelines with the augmented dataset

In [17]:
automl_datamart = AutoML(output_path)
automl_datamart.search_pipelines(augmented_dataset, time_bound=5, target='Malnutrition_total Cases', task_keywords=['regression'])

INFO: Initializing AlphaD3M AutoML...
INFO: Creating Docker container automl-container-63038...
INFO: Connecting via gRPC to localhost:63038...
INFO: AlphaD3M AutoML initialized!
INFO: Found pipeline id=10b94779-2830-426c-9fc7-188f6c7a5825, time=0:00:51.006853, scoring...
INFO: Found pipeline id=48bf726f-7ecd-48c1-a102-5258a19f6e43, time=0:01:00.753065, scoring...
INFO: Found pipeline id=49f52617-3e52-4519-b451-557edbf42d2a, time=0:01:13.817289, scoring...
INFO: Found pipeline id=67b56577-5e85-4979-9f2b-6e8a2943cfd5, time=0:01:17.350129, scoring...
INFO: Found pipeline id=71502d01-6edb-401b-ae43-b918777691ec, time=0:01:19.947414, scoring...
INFO: Found pipeline id=ec1253fe-7c4c-4f9f-ad4b-b3503f4a74be, time=0:01:25.073955, scoring...
INFO: Scored pipeline id=10b94779-2830-426c-9fc7-188f6c7a5825, root_mean_squared_error=6.96094
INFO: Scored pipeline id=48bf726f-7ecd-48c1-a102-5258a19f6e43, root_mean_squared_error=6.66513
INFO: Scored pipeline id=71502d01-6edb-401b-ae43-b918777691ec, root
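For a rough sense of the effect of augmentation, the best complete RMSE values visible in the two logs can be compared. This is a worked calculation only: both logs are truncated in this notebook, so the values below cover just the scores that are fully visible and may not include the best pipeline from either search.

```python
# Complete RMSE values visible in the two truncated logs in this notebook.
rmse_original = min(6.67721, 6.75922)    # search on train_data.csv
rmse_augmented = min(6.96094, 6.66513)   # search on train_data_augmented.csv

absolute_change = rmse_original - rmse_augmented
relative_change = absolute_change / rmse_original

print('Absolute RMSE reduction: %.5f' % absolute_change)
print('Relative RMSE reduction: %.2f%%' % (100 * relative_change))
```

A difference of this size, taken from a short search and partial logs, should not be read as evidence for or against augmentation by itself.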

### Visualizing pipelines using Pipeline Profiler

In [18]:
without_datamart = automl.create_pipelineprofiler_inputs()
with_datamart = automl_datamart.create_pipelineprofiler_inputs(source_name='AlphaD3M_AUG')

INFO: Inputs for PipelineProfiler created!
INFO: Inputs for PipelineProfiler created!


To compare the pipelines produced with and without the datamart augmentation, we can use [PipelineProfiler](https://github.com/VIDA-NYU/PipelineVis), a visualization tool that lets users explore and compare the pipelines generated by AutoML systems.

After the pipeline search process is completed, we can use PipelineProfiler with:

In [None]:
automl_datamart.plot_comparison_pipelines(precomputed_pipelines=without_datamart+with_datamart)

After the analysis is complete, end the session to stop the process and clean up temporary files:

In [20]:
automl.end_session()
automl_datamart.end_session()

INFO: Ending session...
INFO: Session ended!
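If an earlier cell raises an exception, `end_session()` is never reached and the session keeps running. One way to guard against this is a small context manager that always ends the session. This is a sketch: `automl_session` is a hypothetical helper, not part of AlphaD3M, and `DummyAutoML` is a stand-in used only so the example runs without AlphaD3M installed. The only method assumed on the real object is `end_session()`, as used in this notebook.

```python
from contextlib import contextmanager


@contextmanager
def automl_session(automl):
    """Yield automl and call its end_session() on exit, even after an error."""
    try:
        yield automl
    finally:
        automl.end_session()


class DummyAutoML:
    """Stand-in used only to demonstrate the pattern."""

    def __init__(self):
        self.ended = False

    def end_session(self):
        self.ended = True


dummy = DummyAutoML()
try:
    with automl_session(dummy) as session:
        raise RuntimeError('simulated failure during the search')
except RuntimeError:
    pass
print(dummy.ended)
```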
