Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

### Intel NLP-Architect ABSA on AzureML 

### INSTRUCTOR VERSION

> **This instructor version of the notebook adds instructions on which cells should and should not be run in demo mode. It assumes that you execute the complete notebook before the demo, and then re-run selected cells during the demo to demonstrate the working process.**

This notebook contains an end-to-end walkthrough of using Azure Machine Learning Service to train, fine-tune and test [Aspect Based Sentiment Analysis Models using Intel's NLP Architect](http://nlp_architect.nervanasys.com/absa.html).

### Prerequisites

* Understand the architecture and terms introduced by Azure Machine Learning (AML)
* Have a working Jupyter Notebook environment. You can:
    - Install a Python environment locally, as described below in **Local Installation**
    - Use [Azure Notebooks](https://docs.microsoft.com/ru-ru/azure/notebooks/azure-notebooks-overview/?wt.mc_id=absa-notebook-abornst). In this case, upload the `absa.ipynb` file to a new Azure Notebooks project, or clone the [GitHub Repo](https://github.com/microsoft/ignite-learning-paths-training-aiml/tree/master/aiml40).
* Have an Azure Machine Learning Workspace in your Azure Subscription

#### Local Installation

Install the Python SDK, making sure to include the `notebooks` and `contrib` extras:

```shell
conda create -n azureml -y python=3.6
source activate azureml
pip install --upgrade "azureml-sdk[notebooks,contrib]"
conda install ipywidgets
jupyter nbextension install --py --user azureml.widgets
jupyter nbextension enable azureml.widgets --user --py
```

You will need to restart Jupyter after this. Detailed instructions are [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/quickstart-create-workspace-with-python/?WT.mc_id=absa-notebook-abornst).

If you need a free trial account to get started, you can get one [here](https://azure.microsoft.com/en-us/offers/ms-azr-0044p/?WT.mc_id=absa-notebook-abornst).

#### Creating Azure ML Workspace

An Azure ML Workspace can be created in one of the following ways:
* Manually through the [Azure Portal](http://portal.azure.com/?WT.mc_id=absa-notebook-abornst) - [here is the complete walkthrough](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-workspace/?wt.mc_id=absa-notebook-abornst)
* Through the [Azure CLI](https://docs.microsoft.com/ru-ru/cli/azure/?view=azure-cli-latest&wt.mc_id=absa-notebook-abornst), using the following commands:

```shell
az extension add -n azure-cli-ml
az group create -n absa -l westus2
az ml workspace create -w absa_space -g absa
```

## Initialize workspace

To access an Azure ML Workspace, you will need to import the AML library and provide the following information:
* A name for your workspace (in our example - `absa_space`)
* Your subscription id (can be obtained by running `az account list`)
* The resource group name (in our case `absa`)

Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace/?WT.mc_id=absa-notebook-abornst) object from the existing workspace you created in the Prerequisites step or create a new one. 

> **This cell can be run without problems, because it only creates a connection object for the workspace. Make sure to insert the correct `subscription_id` value before use, or have a `config.json` file ready.**

In [1]:
from azureml.core import Workspace

#subscription_id = ''
#resource_group  = 'absa'
#workspace_name  = 'absa_space'
#ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
#ws.write_config()

try:
    ws = Workspace.from_config()
    print(ws.name, ws.location, ws.resource_group, ws.location, sep='\t')
    print('Library configuration succeeded')
except Exception as e:  # a bare except would also swallow KeyboardInterrupt
    print('Workspace not found:', e)

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


hal	westus2	robots	westus2
Library configuration succeeded
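
`Workspace.from_config()` reads the workspace details from a `config.json` file. As a minimal sketch, assuming the three keys that `ws.write_config()` stores (`subscription_id`, `resource_group`, `workspace_name`) and a placeholder subscription id, such a file could be written by hand with the standard library. The helper name `write_workspace_config` and the temporary directory are illustrative only:

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the three workspace identifiers."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=4))
    return path


# Placeholder values: replace the all-zero subscription id with your own.
demo_dir = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_dir, "00000000-0000-0000-0000-000000000000", "absa", "absa_space")
print(config_path.read_text())
```

In a real project the file would live next to the notebook (or in a `.azureml` subfolder) rather than in a temporary directory.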


## Compute

There are two compute options: run-once (preview) and persistent compute. For this demo we will use persistent compute; to learn more about run-once compute, check out the [docs](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute?WT.mc_id=absa-notebook-abornst).

> **This cell can be run because it will not re-create a cluster, although it does not make much sense to run it.**

In [2]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cluster_name = "gandalf"

# Verify that cluster does not exist already
try:
    cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D3_V2',
                                                           vm_priority='lowpriority',
                                                           min_nodes=1,
                                                           max_nodes=4)
    cluster = ComputeTarget.create(ws, cluster_name, compute_config)

cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished
Minimum number of nodes requested have been provisioned


## Upload Data

The dataset we are using comes from the [Women's E-Commerce Clothing Reviews dataset](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/) and is publicly available. It can be replaced with any CSV file with rows of text, because the ABSA model is unsupervised.

The documentation for uploading data can be found [here](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.azure_storage_datastore.azureblobdatastore/?WT.mc_id=absa-notebook-abornst); for now we will use the `ds.upload` command.

In [None]:
# If using this as a separate notebook, fetch the files from the GitHub repo
import os

if not os.path.isdir('dataset'):
    !mkdir dataset
    !wget -O 'dataset/clothing_absa_train_small.csv' 'https://raw.githubusercontent.com/microsoft/ignite-learning-paths-training-aiml/master/aiml40/dataset/clothing_absa_train_small.csv'
    !wget -O 'dataset/clothing_absa_train.csv' 'https://raw.githubusercontent.com/microsoft/ignite-learning-paths-training-aiml/master/aiml40/dataset/clothing_absa_train.csv'
    !wget -O 'dataset/clothing-absa-validation.json' 'https://raw.githubusercontent.com/microsoft/ignite-learning-paths-training-aiml/master/aiml40/dataset/clothing-absa-validation.json'

In [None]:
!wget -O 'dataset/glove.840B.300d.zip' 'http://nlp.stanford.edu/data/glove.840B.300d.zip'

In [4]:
#import os                            
#lib_root = os.path.dirname(os.path.abspath("__file__"))
#ds = ws.get_default_datastore()
#ds.upload('./dataset', target_path='clothing_data', overwrite=True, show_progress=True)
from azureml.core import Datastore
ds = Datastore.get(ws, 'absa')

Now that the GloVe file is uploaded to our datastore, we can remove it from our local directory.

In [None]:
#!rm 'dataset/glove.840B.300d.zip'

## Create an Experiment

Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment/?WT.mc_id=absa-notebook-abornst) to track all the runs in your workspace for this ABSA tutorial.

> **In most cases, you will want to skip the following 3 cells during the demo so that the experiment is not run again. However, if time permits you may start another experiment, in which case you can run them.**

In [5]:
from azureml.core import Experiment

In [7]:
experiment_name = 'absa'

exp = Experiment(workspace=ws, name=experiment_name)

In [None]:
from azureml.train.estimator import Estimator

script_params = {
    '--data_folder': ds,
}

nlp_est = Estimator(source_directory='.',
                   script_params=script_params,
                   compute_target=cluster,
                   environment_variables = {'NLP_ARCHITECT_BE':'CPU'},
                   entry_script='train.py',
                   pip_packages=['git+https://github.com/NervanaSystems/nlp-architect.git@absa',
                                 'spacy==2.1.8']
)

In [None]:
run = exp.submit(nlp_est)
run_id = run.id
print(run_id)

Note: If you accidentally run the following cell more than once, you can cancel a run with the `run.cancel()` command.

In [None]:
# run.cancel()

> **To retrieve the run, we use the run id here. It can either be hard-coded from the previous pre-demo run, or, if the Jupyter kernel has not been restarted, read from the `run_id` variable. So, if the kernel has not been restarted, run the second of the two cells below; otherwise run the first.**

In [10]:
run = [r for r in exp.get_runs() if r.id == 'absa_1568985331_df076c3c'][0]

NameError: name 'exp' is not defined

In [None]:
run = [r for r in exp.get_runs() if r.id == run_id][0]
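
The list-comprehension lookups above raise a bare `IndexError` when no run matches the id. A more defensive variant, sketched here with stand-in objects so it runs without a workspace, returns the first match or fails with a descriptive message. `find_run` and `FakeRun` are hypothetical names introduced for this illustration; with the SDK you would pass `exp.get_runs()` as the first argument:

```python
from collections import namedtuple


def find_run(runs, run_id):
    """Return the first run whose id matches, or raise a descriptive error."""
    match = next((r for r in runs if r.id == run_id), None)
    if match is None:
        raise LookupError(f"No run with id {run_id!r} found in this experiment")
    return match


# Stand-in run objects: anything with an `id` attribute works.
FakeRun = namedtuple('FakeRun', ['id'])
runs = [FakeRun('absa_1'), FakeRun('absa_2')]

print(find_run(runs, 'absa_2').id)
```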

> **Run this to show the result of the run, either in progress or completed**

In [None]:
from azureml.widgets import RunDetails

RunDetails(run).show()

## Fine-Tuning NLP Architect with AzureML HyperDrive
Although ABSA is an unsupervised method, its hyperparameters, such as the aspect and opinion word thresholds, can be fine-tuned if a small sample of labeled data is provided.

In [None]:
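# Import only the names used below instead of a wildcard import
from azureml.train.hyperdrive import (RandomParameterSampling, choice,
                                      MedianStoppingPolicy, HyperDriveConfig,
                                      PrimaryMetricGoal)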

param_sampling = RandomParameterSampling({
         '--asp_thresh': choice(range(2,5)),
         '--op_thresh': choice(range(2,5)), 
         '--max_iter': choice(range(2,5))
    })
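
To make the search space concrete: `range(2,5)` covers the values 2, 3 and 4, so each of the three hyperparameters has three candidates and there are 27 possible combinations. The following is a plain-Python sketch of what uniform random sampling over such a space looks like. It is an illustration only, not HyperDrive's implementation, and `sample_configuration` is a hypothetical helper:

```python
import random

# Search space mirroring the cell above: each value is drawn from {2, 3, 4}.
search_space = {
    '--asp_thresh': list(range(2, 5)),
    '--op_thresh': list(range(2, 5)),
    '--max_iter': list(range(2, 5)),
}


def sample_configuration(space, rng):
    """Draw one value per hyperparameter, uniformly and independently."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_configuration(search_space, rng) for _ in range(4)]
for sample in samples:
    print(sample)
```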

### Early Termination Policy
First we will define an early termination policy. [Median stopping](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.medianstoppingpolicy?WT.mc_id=absa-notebook-abornst) is an early termination policy based on running averages of primary metrics reported by the runs. This policy computes running averages across all training runs and terminates runs whose performance is worse than the median of the running averages.

This policy takes the following configuration parameters:

- evaluation_interval: the frequency for applying the policy (optional parameter).
- delay_evaluation: delays the first policy evaluation for a specified number of intervals (optional parameter).
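
The idea behind median stopping can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the HyperDrive implementation: it ignores `evaluation_interval` and `delay_evaluation`, compares each run's best metric so far against the median of the running averages, and uses made-up metric values. The function names are hypothetical:

```python
from statistics import median


def running_average(values):
    """Mean of all metric values a run has reported so far."""
    return sum(values) / len(values)


def runs_to_stop(metric_history, maximize=True):
    """Return ids of runs whose best metric is worse than the median running average.

    metric_history maps a run id to the list of primary-metric values it has reported.
    """
    reported = {rid: vals for rid, vals in metric_history.items() if vals}
    if not reported:
        return []
    threshold = median(running_average(vals) for vals in reported.values())
    stopped = []
    for rid, vals in reported.items():
        best = max(vals) if maximize else min(vals)
        is_worse = best < threshold if maximize else best > threshold
        if is_worse:
            stopped.append(rid)
    return stopped


# Illustrative numbers only, not results from this notebook.
history = {
    'run_a': [0.70, 0.80],
    'run_b': [0.50, 0.55],
    'run_c': [0.60, 0.90],
}
print(runs_to_stop(history))
```

Here the running averages are 0.75, 0.525 and 0.75, so the median is 0.75 and only `run_b`, whose best value is 0.55, falls below it.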


In [None]:
early_termination_policy = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=0)

See [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters#specify-early-termination-policy?WT.mc_id=absa-notebook-abornst) for more information on the Median stopping policy and other policies available.

Now that we've defined our early termination policy, we can define our HyperDrive configuration to maximize our model's weighted F1 score. HyperDrive can optimize any metric, as long as it's logged by the training script.
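
For reference, a weighted F1 score averages the per-class F1 scores, weighting each class by its number of true instances. The training script `train.py` is not shown in this notebook, so the following is only a sketch of the standard definition, with made-up labels, and not necessarily how the script computes `f1_weighted`:

```python
def f1_weighted(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        support = tp + fn  # number of true instances of this class
        score += f1 * support / total
    return score


# Illustrative labels only.
y_true = ['POS', 'POS', 'POS', 'NEG']
y_pred = ['POS', 'POS', 'NEG', 'NEG']
print(round(f1_weighted(y_true, y_pred), 4))
```

Whatever value the script computes, it has to be logged under the same name that is passed as `primary_metric_name` for HyperDrive to pick it up.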


In [None]:
hd_config = HyperDriveConfig(estimator=nlp_est,
                            hyperparameter_sampling=param_sampling,
                            policy=early_termination_policy,
                            primary_metric_name='f1_weighted',
                            primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                            max_total_runs=16,
                            max_concurrent_runs=4)

Finally, launch the hyperparameter tuning job.

In [6]:
experiment = Experiment(workspace=ws, name='absa_hyperdrive')

In [None]:
hyperdrive_run = experiment.submit(hd_config)

In [None]:
hyperdrive_run.id

In [10]:
hyperdrive_run = [r for r in experiment.get_runs() if r.id == 'absa_hyperdrive_1578973612991526'][0]

### Monitor HyperDrive runs
We can monitor the progress of the runs with the following Jupyter widget. 

In [None]:
from azureml.widgets import RunDetails

RunDetails(hyperdrive_run).show()

In [None]:
hyperdrive_run.cancel()

### Find and register the best model
Once all the runs complete, we can find the run that produced the model with the highest weighted F1 score, the primary metric configured above.

In [11]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
print(best_run)
print('Best Run is:\n  F1: {0:.5f}'.format(
        best_run_metrics['f1_weighted']
     ))

Run(Experiment: absa_hyperdrive,
Id: absa_hyperdrive_1578973612991526_1,
Type: azureml.scriptrun,
Status: Completed)
Best Run is:
  F1: 0.89455


In [12]:
best_run.get_file_names()

['azureml-logs/55_azureml-execution-tvmps_f0be58f37b4253cd11263408e3571dc29440bc3a9e32c89331b3a8ff6ea8b0d5_d.txt',
 'azureml-logs/65_job_prep-tvmps_f0be58f37b4253cd11263408e3571dc29440bc3a9e32c89331b3a8ff6ea8b0d5_d.txt',
 'azureml-logs/70_driver_log.txt',
 'azureml-logs/75_job_post-tvmps_f0be58f37b4253cd11263408e3571dc29440bc3a9e32c89331b3a8ff6ea8b0d5_d.txt',
 'azureml-logs/process_info.json',
 'azureml-logs/process_status.json',
 'logs/azureml/153_azureml.log',
 'logs/azureml/job_prep_azureml.log',
 'logs/azureml/job_release_azureml.log',
 'outputs/generated_aspect_lex.csv',
 'outputs/generated_opinion_lex_reranked.csv',
 'outputs/parsed/clothing_absa_train_small/0.0.json',
 'outputs/parsed/clothing_absa_train_small/0.1.json',
 'outputs/parsed/clothing_absa_train_small/0.10.json',
 'outputs/parsed/clothing_absa_train_small/0.100.json',
 'outputs/parsed/clothing_absa_train_small/0.101.json',
 'outputs/parsed/clothing_absa_train_small/0.102.json',
 'outputs/parsed/clothing_absa_train_sm

In [13]:
best_run.download_files()

In [14]:
import os
from shutil import copyfile, rmtree
if os.path.exists('model'):
    rmtree('model')
    
os.makedirs('model')

aspect_lex = copyfile('outputs/generated_aspect_lex.csv', 'model/generated_aspect_lex.csv')
opinion_lex = copyfile('outputs/generated_opinion_lex_reranked.csv', 'model/generated_opinion_lex_reranked.csv')

In [12]:
best_run.upload_folder(name="model", path="model")

AzureMLException: AzureMLException:
	Message: UserError: Resource Conflict: ArtifactId ExperimentRun/dcid.absa_hyperdrive_1578973612991526_1/model/generated_aspect_lex.csv already exists.
UserError: Resource Conflict: ArtifactId ExperimentRun/dcid.absa_hyperdrive_1578973612991526_1/model/generated_opinion_lex_reranked.csv already exists.
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "UserError: Resource Conflict: ArtifactId ExperimentRun/dcid.absa_hyperdrive_1578973612991526_1/model/generated_aspect_lex.csv already exists.\nUserError: Resource Conflict: ArtifactId ExperimentRun/dcid.absa_hyperdrive_1578973612991526_1/model/generated_opinion_lex_reranked.csv already exists."
    }
}

## Register Model Outputs

In [31]:
model = best_run.register_model(model_name='absa', model_path='model')

In [13]:
from azureml.core import Model


model = Model.register(workspace=ws, model_name='absa', model_path='model', 
                      description='Aspect Based Sentiment Analysis - Intel',
                      tags={'area': 'NLP', 'type': 'unsupervised', 'model_author': "INTEL"})


Registering model absa


## Test Locally

### Install Local PIP Dependencies

In [None]:
!pip install git+https://github.com/NervanaSystems/nlp-architect.git@absa   

In [None]:
!pip install spacy==2.0.18

### Load Model From AzureML

In [32]:
from azureml.core.model import Model
from nlp_architect.models.absa.inference.inference import SentimentInference
c_aspect_lex = 'outputs/generated_aspect_lex.csv'
c_opinion_lex = 'outputs/generated_opinion_lex_reranked.csv' 
inference = SentimentInference(c_aspect_lex, c_opinion_lex)

### Run Model On Sample Data 

In [33]:
docs = ["Loved the sweater but hated the pants",
       "Really great outfit, but the shirt is the wrong size",
       "I absolutely love this jacket! i wear it almost everyday. works as a cardigan or a jacket. my favorite retailer purchase so far"]

sentiment_docs = []

for doc_raw in docs:
    sentiment_doc = inference.run(doc=doc_raw)
    sentiment_docs.append(sentiment_doc)

Processing batch 0
Batch 0 Done
Processing batch 0
Batch 0 Done
Processing batch 0
Batch 0 Done


### Visualize Model Results

In [34]:
import spacy
from spacy import displacy
from nlp_architect.models.absa.inference.data_types import TermType
for doc in sentiment_docs:
    if doc:
        doc_viz = {'text': doc._doc_text, 'ents': []}
        for s in doc._sentences:
            for ev in s._events:
                for e in ev:
                    if e._type == TermType.ASPECT:
                        ent = {'start': e._start, 'end': e._start + e._len,
                               'label': str(e._polarity.value),
                               'text': str(e._text)}
                        # Keep only the first aspect found at a given offset in this document
                        if all(known_e['start'] != ent['start'] for known_e in doc_viz['ents']):
                            doc_viz['ents'].append(ent)
        doc_viz['ents'].sort(key=lambda m: m["start"])
        displacy.render(doc_viz, style="ent", options={'colors':{'POS':'#7CFC00', 'NEG':'#FF0000'}}, 
                        manual=True, jupyter=True)
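
With `manual=True`, displaCy renders a plain dictionary rather than a parsed spaCy document: a `text` string plus a list of `ents`, each with `start`, `end` and `label`, ordered by start offset. The sketch below builds such a dictionary from hand-written `(start, length, polarity)` tuples. The helper `build_displacy_doc` is hypothetical and the polarities are typed in by hand for illustration, not produced by the model:

```python
def build_displacy_doc(text, aspects):
    """Build a displaCy 'manual' entity dict from (start, length, polarity) tuples."""
    ents = {}
    for start, length, polarity in aspects:
        # setdefault keeps the first entity seen at each start offset
        ents.setdefault(start, {'start': start, 'end': start + length,
                                'label': polarity})
    return {'text': text,
            'ents': sorted(ents.values(), key=lambda e: e['start'])}


text = "Loved the sweater but hated the pants"
# Hand-written spans, deliberately unordered and containing one duplicate.
aspects = [(32, 5, 'NEG'), (10, 7, 'POS'), (10, 7, 'POS')]

doc_viz = build_displacy_doc(text, aspects)
for ent in doc_viz['ents']:
    print(text[ent['start']:ent['end']], ent['label'])
```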

## Create configuration files


### Create Environment File
Create an environment file, called `myenv.yml`, that specifies all of the script's package dependencies. This file is used to ensure that all of those dependencies are installed in the Docker image. This model needs nlp-architect and the azureml-sdk.

In [41]:
from azureml.core.conda_dependencies import CondaDependencies 

# Pip dependencies for the scoring image (the stray empty entry was removed)
pip = ["azureml-defaults", "azureml-monitoring", 
       "git+https://github.com/NervanaSystems/nlp-architect.git@absa", 
       "spacy==2.0.18"]

myenv = CondaDependencies.create(pip_packages=pip)

with open("myenv.yml","w") as f:
    f.write(myenv.serialize_to_string())

### Create Environment Config
Create an environment configuration, specifying the environment file and the environment variables required by the application.

In [42]:
from azureml.core import Environment

deploy_env = Environment.from_conda_specification('absa_env', "myenv.yml")
deploy_env.environment_variables={'NLP_ARCHITECT_BE': 'CPU'}

### Inference and Deployment Config 
Create an inference configuration that receives the deployment environment and the entry script, as well as a deployment configuration for running inference.

In [43]:
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

inference_config = InferenceConfig(environment=deploy_env,
                                   entry_script="score.py")

deploy_config = AciWebservice.deploy_configuration(cpu_cores=1, 
                                               memory_gb=1,
                                               description='Aspect-Based Sentiment Analysis - Intel')
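
The entry script `score.py` referenced above is not shown in this notebook. AzureML entry scripts define an `init()` function, called once when the service starts, and a `run()` function, called for every request. The following is a hedged sketch of that shape only: `StubInference` is a hypothetical stand-in for the real `SentimentInference` object, and the `{"data": [...]}` request format is an assumption rather than the format the actual script uses:

```python
import json

inference = None  # populated once by init()


class StubInference:
    """Hypothetical stand-in for the real model so this sketch runs anywhere."""

    def run(self, doc):
        return {'text': doc, 'length': len(doc)}


def init():
    """Called once at service start-up: load the model into a global."""
    global inference
    inference = StubInference()


def run(raw_data):
    """Called per request: parse the JSON body and score each document."""
    try:
        docs = json.loads(raw_data)['data']  # assumed request schema
        return {'results': [inference.run(doc=d) for d in docs]}
    except Exception as e:
        # Return the error to the caller instead of crashing the service
        return {'error': str(e)}


init()
print(run(json.dumps({'data': ['Loved the sweater']})))
```

In a real script, `init()` would load the registered lexicon files and construct the inference object from them instead of the stub.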

### Quick Deploy!
Create a deployment of the model using the scoring file.

In [None]:
deployment = Model.deploy(ws, 'absa', 
                 models=[model], 
                 inference_config=inference_config, 
                 deployment_config=deploy_config, 
                 overwrite=True)

## Next Steps

We have now gone through all the steps for production training of a custom open source model using the AzureML Service. Check out AIML50 to learn how to deploy models and manage retraining pipelines.