# Introduction

This notebook demonstrates how to work with AzureML Jupyter Notebooks.

The sample data was taken from *Kaggle*'s [*What's Cooking*](https://www.kaggle.com/c/whats-cooking/overview) competition.

# Working With Datastores

[Datastores](https://docs.microsoft.com/en-us/azure/machine-learning/concept-data#connect-to-storage-with-datastores) are references to
Azure storage services. You create a datastore once and reference it by name from within your code. Because the datastore holds the
connection details, your code does not need to contain connection information, secret keys, etc.

Use the [`Datastore` class in `azureml.core`](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py) 
to work with AzureML Datastores.

## Authentication and Workspace Reference

To get the workspace, use the `from_config()` method. In simple scenarios it works without further setup because Azure
provides the necessary configuration file. However, if you have multiple AAD tenants, you need
to authenticate explicitly (see [`azureml.core.authentication` package](azureml.core.authentication) for details).

In [None]:
from azureml.core import Workspace
from azureml.core.authentication import InteractiveLoginAuthentication

# Change the following variable to your AAD tenant ID
AAD_TENANT = '022e4faf-c745-475a-be06-06b1e1c9e39d'
login = InteractiveLoginAuthentication(tenant_id = AAD_TENANT)

ws = Workspace.from_config(auth = login)
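
For reference, the configuration file that `from_config()` reads is a small `config.json` with three keys. The following is a minimal sketch of such a file, written by hand. All values are placeholders (assumptions), and the file is written to a temporary directory so that nothing in a real project folder is touched.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values (assumptions): replace with your own subscription,
# resource group, and workspace name
workspace_config = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "my-resource-group",
    "workspace_name": "my-workspace",
}

# Write the file to a temporary ".azureml" folder for illustration
config_dir = Path(tempfile.mkdtemp()) / ".azureml"
config_dir.mkdir(parents=True, exist_ok=True)
config_path = config_dir / "config.json"
config_path.write_text(json.dumps(workspace_config, indent=4))

# Read the file back to check that it is valid JSON with the expected keys
loaded_config = json.loads(config_path.read_text())
print(sorted(loaded_config))
```

If the file lives somewhere other than the working directory, `Workspace.from_config()` also accepts a `path` argument pointing to it.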

## Iterating Over Datastores

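You can get the datastores of your workspace using its `datastores` property. It returns a dictionary that maps each datastore name to the corresponding datastore object, so iterating over it yields the names.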

In [None]:
for datastore in ws.datastores:
    print('Found datastore: {ds}'.format(ds = datastore))
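
To print more than the name, you can iterate over the dictionary's items. The sketch below uses a plain dictionary as a stand-in for `ws.datastores` so that it runs without a workspace. The datastore names and type strings are made up for illustration.

```python
from types import SimpleNamespace

# Stand-in for `ws.datastores` (assumption: names and type strings are illustrative)
datastores = {
    "workspaceblobstore": SimpleNamespace(name="workspaceblobstore", datastore_type="AzureBlob"),
    "demostore": SimpleNamespace(name="demostore", datastore_type="AzureBlobFs"),
}

# .items() yields both the registered name and the datastore object
lines = [
    "Found datastore: {name} ({kind})".format(name=name, kind=datastore.datastore_type)
    for name, datastore in datastores.items()
]
print("\n".join(lines))
```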

## Creating a Datastore



You can create datastores using the AzureML portal or using Python. Datastores are registered with the `register_*` methods of the
[`Datastore` class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore(class)?view=azure-ml-py#methods). The following
example registers an *Azure Data Lake Storage Gen2* account.

Keeping datastore registrations separate from the code lets administrators create them centrally. During the registration process,
they deal with authentication and authorization. The details depend on the kind of datastore the admin wants to register
(see [Connect to storage services on Azure](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data) for details).

Data scientists then no longer need to deal with authentication. They simply use the datastores that the admins registered.

In [None]:
from azureml.core import Workspace, Datastore

# Specify a datastore name. You will later reference the datastore with this name.
ADLSGEN2_DATASTORE = 'demostore'

# Get ADLS account name from Azure portal.
ADLSGEN2_ACCOUNT = 'stdatasciencelab'

# Get service principal from Key Vault.
kv = ws.get_default_keyvault()
CLIENT_ID = kv.get_secret('azureml-dls-appid')
CLIENT_SECRET = kv.get_secret('azureml-dls-secret')

# Unregister datastore if it already exists
if ADLSGEN2_DATASTORE in ws.datastores:
    print('Datastore already exists, unregistering...')
    datastore = ws.datastores[ADLSGEN2_DATASTORE]
    datastore.unregister()
    print('Datastore unregistered')

# Register datastore
print('Registering datastore')
datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws, 
    datastore_name=ADLSGEN2_DATASTORE, 
    account_name=ADLSGEN2_ACCOUNT,
    filesystem='data',
    tenant_id=AAD_TENANT,
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET)

# Optionally, we can set the default datastore for our workspace.
# Note that this expects the datastore name, not the storage account name.
ws.set_default_datastore(ADLSGEN2_DATASTORE)

# Working with Datasets

## Registering Datasets

Datasets are packaged data objects that are readily consumable. AzureML supports two types of datasets:

* *File*: Unstructured data
* *Tabular*: Structured data (e.g. CSV, JSON, RDBMS etc.)

In this example we focus on *Tabular*. For details about methods used to create tabular datasets, see
[`TabularDatasetFactory` class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory).

In [None]:
from azureml.core import Workspace, Dataset, Datastore
from azureml.data.datapath import DataPath

# Get datastore reference by name or use default datastore
#ds = ws.datastores[ADLSGEN2_DATASTORE]
ds = ws.get_default_datastore()

# Create the data path and dataset
dp = DataPath(ds, 'train.jl')
dset = Dataset.Tabular.from_json_lines_files(path = dp, validate = True)

# Register dataset with name
cooking_ds = dset.register(workspace = ws, name = "cooking-train")
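
The file `train.jl` is in JSON Lines format, i.e. one complete JSON object per line without an enclosing array. The following sketch shows what that looks like using only the standard library. The records are made up and merely mimic the shape of the competition data (`id`, `cuisine`, `ingredients`).

```python
import json

# Made-up records in the shape of the What's Cooking training data
records = [
    {"id": 1, "cuisine": "italian", "ingredients": ["tomatoes", "basil", "olive oil"]},
    {"id": 2, "cuisine": "mexican", "ingredients": ["tomatillos", "chile pepper", "garlic"]},
]

# JSON Lines: serialize every record on its own line
json_lines = "\n".join(json.dumps(record) for record in records)
print(json_lines)

# Reading it back means parsing each non-empty line separately
parsed = [json.loads(line) for line in json_lines.splitlines() if line.strip()]
```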

## Using Datasets

Once the dataset is registered, we can look it up by name and, for example, load it as a
[pandas `DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Here we use the [`head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method
to get the first few rows of our dataset.

In [None]:
cooking_ds = ws.datasets['cooking-train']
df = cooking_ds.to_pandas_dataframe()
df.head()

# Have fun with pandas dataframe in `df`

For this sample, we have built a simple preprocessing function that cleans up our data.

In [None]:
from utils import preprocess
print(preprocess(['Half and Half 15 ounce of Grains']))
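
The implementation of `preprocess` lives in `utils` and is not shown in this notebook. To give an idea of what such a function might do, here is a purely hypothetical sketch. It assumes that the function takes a list of ingredient strings and returns a single cleaned-up string. The name `preprocess_sketch` and the list of unit words are assumptions, and the real function may behave differently.

```python
import re

# Hypothetical unit words to drop; the real `utils.preprocess` may use a different list
UNIT_WORDS = {"ounce", "ounces", "oz", "pound", "pounds", "lb", "cup", "cups"}

def preprocess_sketch(ingredients):
    """Turn a list of ingredient strings into one lowercase, space-separated string."""
    tokens = []
    for ingredient in ingredients:
        # Keep only runs of letters, which drops digits and punctuation
        for word in re.findall(r"[a-z]+", ingredient.lower()):
            if word not in UNIT_WORDS:
                tokens.append(word)
    return " ".join(tokens)

print(preprocess_sketch(["Half and Half 15 ounce of Grains"]))
# prints: half and half of grains
```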

We can apply the preprocessing function to the `ingredients` column of our dataframe.

In [None]:
from utils import preprocess
X_train = df['ingredients'].apply(preprocess)
Y_train = df['cuisine']

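Now we can use a simple multinomial naive Bayes model for prediction. Here we use [*scikit-learn*](https://scikit-learn.org/stable/index.html): a `CountVectorizer` turns each ingredient string into word counts, and `MultinomialNB` is trained on those counts.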

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from utils import preprocess

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)

mnb = MultinomialNB()
mnb.fit(X_train_vec, Y_train)

keywords = preprocess([
        "pork stew meat",
        "salt",
        "tomatoes",
        "tomatillos",
        "chile pepper",
        "pepper",
        "garlic"
        ])
print(keywords)
X_test = vectorizer.transform([keywords])
mnb.predict(X_test)

We can persist the trained model by dumping it to a file with joblib.

In [None]:
import joblib
from pathlib import Path
Path("./output").mkdir(exist_ok=True)
joblib.dump(mnb, './output/mnb_model.pkl')

Later we can reload it and make further predictions. Note that the fitted `vectorizer` from above is still required to transform the input.

In [None]:
import joblib
mnb = joblib.load('./output/mnb_model.pkl')

X_test = vectorizer.transform([preprocess([
      "sugar",
      "large egg yolks",
      "grated lemon peel",
      "rhubarb",
      "cream",
      "salt",
      "ground cinnamon",
      "golden brown sugar",
      "all-purpose flour",
      "sliced almonds",
      "unsalted butter"
    ])])
mnb.predict(X_test)
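
One way to avoid keeping the vectorizer around separately is to bundle it with the classifier in a scikit-learn pipeline and persist the pipeline as a whole. The following is a minimal sketch of that idea, not the approach used in the cells above. It uses a tiny made-up training set and writes to a temporary directory.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, for illustration only
texts = [
    "tomatoes basil olive oil",
    "tomatillos chile pepper garlic",
    "basil parmesan pasta",
    "chile pepper corn tortillas",
]
labels = ["italian", "mexican", "italian", "mexican"]

# The pipeline keeps the fitted vocabulary and the classifier together
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

model_path = Path(tempfile.mkdtemp()) / "mnb_pipeline.pkl"
joblib.dump(pipeline, str(model_path))

# After reloading, preprocessed text can be passed in directly
reloaded = joblib.load(str(model_path))
prediction = reloaded.predict(["garlic chile pepper"])
print(prediction)
```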

We can also use the model to generate results from test data.

In [None]:
import pandas as pd
from utils import preprocess

# Read test dataframe and apply preprocessing function
test_json = pd.read_json('./input/test.json')
test = test_json['ingredients'].apply(preprocess)
testfinal = vectorizer.transform(test)

# Do prediction
result = mnb.predict(testfinal)

# Generate result CSV
result_transformed = pd.DataFrame(result)
result_with_ids = pd.concat([test_json['id'], result_transformed], join = 'outer', axis = 1)
result_with_ids.to_csv('./output/result_vectorizer.csv', index = False)
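
With `pd.concat`, the prediction column ends up with the default name `0`. If a header such as `id,cuisine` is wanted, the frame can be built from a dictionary instead. The sketch below uses made-up ids and predictions that stand in for `test_json['id']` and `result`.

```python
import pandas as pd

# Made-up ids and predictions standing in for `test_json['id']` and `result`
ids = pd.Series([101, 102, 103], name="id")
predictions = ["italian", "mexican", "indian"]

# Building the frame from a dict gives both columns explicit names
submission = pd.DataFrame({"id": ids, "cuisine": predictions})

# Without a path, to_csv returns the CSV content as a string
csv_text = submission.to_csv(index=False)
print(csv_text)
```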

Let's also try logistic regression:

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000).fit(X_train_vec, Y_train)
result = clf.predict(testfinal)

result_transformed = pd.DataFrame(result)
result_with_ids = pd.concat([test_json['id'], result_transformed], join = 'outer', axis = 1)
result_with_ids.to_csv('./output/result_logistic_regression.csv', index = False)
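
Because the test data comes without labels, the result files alone do not tell us which model performs better. One option is cross-validation on the training data. The following is a minimal sketch with a tiny made-up data set. In the notebook, `X_train` and `Y_train` would take the place of `texts` and `labels`, and a larger number of folds would be more appropriate.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up data set, for illustration only
texts = [
    "tomatoes basil olive oil", "basil parmesan pasta",
    "mozzarella tomatoes basil", "pasta garlic olive oil",
    "tomatillos chile pepper garlic", "chile pepper corn tortillas",
    "corn tortillas avocado lime", "avocado chile pepper cilantro",
]
labels = ["italian"] * 4 + ["mexican"] * 4

candidates = {
    "naive_bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)),
}

# Mean accuracy over two stratified folds for each candidate
scores = {
    name: cross_val_score(model, texts, labels, cv=2).mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print("{name}: {score:.2f}".format(name=name, score=score))
```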

And finally, let's try XGBoost's `XGBClassifier`:

In [None]:
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Recent XGBoost releases expect integer class labels (0..n-1),
# so encode the cuisine names before fitting
label_encoder = LabelEncoder()
Y_train_encoded = label_encoder.fit_transform(Y_train)

model = XGBClassifier()
model.fit(X_train_vec, Y_train_encoded)

# Map the predicted class indices back to cuisine names
result = label_encoder.inverse_transform(model.predict(testfinal))

result_transformed = pd.DataFrame(result)
result_with_ids = pd.concat([test_json['id'], result_transformed], join = 'outer', axis = 1)
result_with_ids.to_csv('./output/result_xgbclassifier.csv', index = False)