# Exploratory Data Analysis

To get started, we will prototype the workflow locally.

**Warning:** this notebook may fail if your local machine does not have sufficient resources. 

## Install requirements

Install required packages.

In [None]:
!pip install --upgrade dask distributed bokeh fastparquet adlfs xgboost pandas

## Get Data

The data is modified from a Kaggle competition and hosted publicly.

Start a distributed `Client`.

In [None]:
from distributed import Client

c = Client()
c

Initialize the Pythonic filesystem.

**Tip:** if your data is not public, you need to provide storage credentials. These can be retrieved through Azure ML Datastores, e.g.:

```python
from azureml.core import Workspace

ws = Workspace.from_config()
ds = ws.get_default_datastore() # ws.datastores["my-datastore-name"]

storage_options = {
    "account_name": ds.account_name,
    "account_key": ds.account_key
}
```
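
If you would rather not go through a workspace, a minimal sketch of building the same `storage_options` dictionary from environment variables is shown below. The helper name `storage_options_from_env` and the variable names `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY` are assumptions made for this example, not something the notebook requires.

```python
import os


def storage_options_from_env(env=None):
    """Build a storage_options dict from environment-style variables.

    Hypothetical helper: the variable names are assumptions for this sketch.
    """
    env = os.environ if env is None else env

    # the account name is required; a missing value raises KeyError
    options = {"account_name": env["AZURE_STORAGE_ACCOUNT_NAME"]}

    # the key is optional, since public containers work without one
    key = env.get("AZURE_STORAGE_ACCOUNT_KEY")
    if key:
        options["account_key"] = key
    return options


# demo with a plain dict standing in for os.environ
public_options = storage_options_from_env({"AZURE_STORAGE_ACCOUNT_NAME": "azuremlexamples"})
private_options = storage_options_from_env(
    {"AZURE_STORAGE_ACCOUNT_NAME": "myaccount", "AZURE_STORAGE_ACCOUNT_KEY": "not-a-real-key"}
)
```

The result can be passed on unchanged, e.g. `AzureBlobFileSystem(**storage_options_from_env())`.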

In [None]:
from adlfs import AzureBlobFileSystem

container_name = "malware"
storage_options = {"account_name": "azuremlexamples"}

fs = AzureBlobFileSystem(**storage_options)
fs

List the processed (partitioned) files.

In [None]:
files = fs.ls(f"{container_name}/processed")
files

Read the data into a Dask dataframe. Note that pandas also accepts the `storage_options` argument.

In [None]:
import dask.dataframe as dd

for f in files:
    if "train" in f:
        df_train = dd.read_parquet(f"az://{f}", storage_options=storage_options)
    elif "test" in f:
        df_test = dd.read_parquet(f"az://{f}", storage_options=storage_options)

df_train
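
The loop above keeps only the last file whose path contains "train" or "test". If the listing ever holds several files per split, one option is to group the paths first. The sketch below is a hypothetical helper (`split_by_name`), and the example paths are made up for illustration; they are not the real container listing.

```python
def split_by_name(paths, protocol="az"):
    """Group blob paths into train/test lists and prefix each with the protocol."""
    groups = {"train": [], "test": []}
    for path in paths:
        # match on the last path component so directory names are ignored
        name = path.rstrip("/").rsplit("/", 1)[-1]
        for key in groups:
            if key in name:
                groups[key].append(f"{protocol}://{path}")
                break
    return groups


# hypothetical listing, shaped like the output of fs.ls
example_files = [
    "malware/processed/train.parquet",
    "malware/processed/test.parquet",
    "malware/processed/_SUCCESS",
]
urls = split_by_name(example_files)
```

Since `dd.read_parquet` also accepts a list of paths, each group could then be read in one call, e.g. `dd.read_parquet(urls["train"], storage_options=storage_options)`.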

## Test model training

In [None]:
# set up remote tracking
from azureml.core import Workspace

ws = Workspace.from_config()
tracking_uri = ws.get_mlflow_tracking_uri()
tracking_uri

In [None]:
import mlflow
import xgboost as xgb 

def train_a_model(tracking_uri, df, params, num_boost_round=10):
    # set up remote tracking
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment("many-models-experiments")

    # turn on autolog
    mlflow.xgboost.autolog()

    # materialize the Dask partition as a pandas dataframe
    df = df.compute()

    # drop object-dtype columns (placeholder for real data prep)
    cols = [col for col in df.columns if df.dtypes[col] != "object"]
    data = df[cols].drop("HasDetections", axis=1).values
    label = df["HasDetections"].values

    # train xgboost
    dtrain = xgb.DMatrix(data, label=label)
    model = xgb.train(params, dtrain, num_boost_round=num_boost_round)

    # return model
    return model
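
The column filter inside `train_a_model` drops only `object` columns. A stricter alternative, sketched below on a toy pandas dataframe, keeps numeric columns only via `select_dtypes`. The helper name `split_features_and_label` and the toy values are assumptions for illustration; this is not the notebook's actual data prep.

```python
import pandas as pd


def split_features_and_label(df, label_col="HasDetections"):
    """Return (features, label) as NumPy arrays, keeping numeric columns only."""
    # include="number" keeps ints and floats; object, category and bool are dropped
    numeric = df.select_dtypes(include="number")
    features = numeric.drop(columns=[label_col], errors="ignore").to_numpy()
    label = df[label_col].to_numpy()
    return features, label


# toy dataframe with made-up values
toy = pd.DataFrame(
    {
        "AVProductsInstalled": [1, 2, 1],
        "OsVer": ["10.0.0.0", "10.0.0.0", "6.1.1.0"],
        "HasDetections": [0, 1, 0],
    }
)
features, label = split_features_and_label(toy)
```

Note that this filter is not equivalent to the one above: it also discards `category`, `bool` and datetime columns, which may or may not be what you want.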

In [None]:
num_boost_round = 10
params = {
    "objective": "binary:logistic",
    "learning_rate": 0.1,
    "gamma": 0,
    "max_depth": 8,
}

In [None]:
models = [
    c.submit(train_a_model, tracking_uri, df, params, num_boost_round=num_boost_round)
    for df in df_train.partitions
]
models

In [None]:
models[0]

In [None]:
models
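
Re-displaying `models` is a quick way to watch the futures progress. For a long list, a tally by status may be easier to read. The sketch below relies only on each future exposing a `status` attribute, as `distributed` futures do; the helper name `tally_statuses` is hypothetical, and the demo uses stand-in objects so it runs without a cluster.

```python
from collections import Counter
from types import SimpleNamespace


def tally_statuses(futures):
    """Count futures by their `status` attribute (e.g. "pending", "finished", "error")."""
    return Counter(f.status for f in futures)


# stand-ins for real futures, for demonstration only
fake_futures = [
    SimpleNamespace(status="finished"),
    SimpleNamespace(status="pending"),
    SimpleNamespace(status="finished"),
]
counts = tally_statuses(fake_futures)
```

In the notebook this would be called as `tally_statuses(models)`. Once everything has finished, `c.gather(models)` should return the trained boosters.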

In [None]:
c.restart()

In [None]:
c.close()