Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Text classification by convolutional neural network

This sample pipeline uses several modules that implement a Text CNN (a convolutional neural network for text) for sentiment classification.

You will learn how to:
* Register modules from local code using the module SDK.
* Build a pipeline with registered modules and AzureML built-in modules.

## Prerequisites
* Install the Azure CLI with the azure-cli-ml extension by following the [instructions here](setup-environment.ipynb).

## Setup workspace

Log in to Azure with the CLI and set the default workspace using the `az ml folder attach` command.

After this operation, the workspace can be retrieved with `Workspace.from_config()` for SDK usage.

In [None]:
# NOTE: Update the following information with your environment

SUBSCRIPTION_ID = '<your subscription ID>'
WORKSPACE_NAME = '<your workspace name>'
RESOURCE_GROUP_NAME = '<your resource group>'

In [None]:
!az login -o none 
!az account set -s $SUBSCRIPTION_ID 
!az ml folder attach -w $WORKSPACE_NAME -g $RESOURCE_GROUP_NAME 

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()

## Retrieve or create an Azure Machine Learning compute target
Azure Machine Learning Compute is a service for provisioning and managing clusters of Azure virtual machines for running machine learning workloads. The pipeline runs on this compute target.

The next cell looks up a compute target by name and creates it in the current workspace if it does not exist. Creation takes two steps:

1. Create the configuration
2. Create the Azure Machine Learning compute

**This process takes a few minutes and prints only sparse output. Wait until the call returns before moving to the next cell.**

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

aml_compute_target = "cpu-cluster"
try:
    aml_compute = AmlCompute(ws, aml_compute_target)
    print("found existing compute target.")
except ComputeTargetException:
    print("creating new compute target")
    
    provisioning_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2",
                                                                min_nodes=1,
                                                                max_nodes=4)
    aml_compute = ComputeTarget.create(ws, aml_compute_target, provisioning_config)
    aml_compute.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
print("Azure Machine Learning Compute attached")

## Prepare training dataset

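Download the [IMDB Dataset of 50k Movie Reviews](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) and register it "From Local" via the Azure Machine Learning portal. If no dataset named `IMDB_Dataset_Samples` exists in the workspace, the cell below uploads the local `assets/text-classification/data/` folder and registers it under that name.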
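
The registration cell below uploads the local folder to a datastore folder named after the SHA-256 hash of the local path, so the same local path always maps to the same datastore location. A minimal sketch of that naming scheme in isolation follows; the helper name `datastore_folder_for` is hypothetical and is not part of the Azure ML SDK.

```python
import hashlib


def datastore_folder_for(local_path: str) -> str:
    """Return a deterministic datastore folder for a local path (hypothetical helper)."""
    digest = hashlib.sha256(local_path.encode()).hexdigest()
    return f'/data/{digest}'


folder = datastore_folder_for('assets/text-classification/data/')
print(folder)
```

Because the hash covers the path string rather than the file contents, editing files under the same local path does not change the target folder.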

In [None]:
import hashlib
from azureml.core import Dataset

training_data_name = 'IMDB_Dataset_Samples'
path = 'assets/text-classification/data/'

if training_data_name not in ws.datasets:
    print('Registering a training dataset ...')

    # Derive a deterministic datastore folder name from the local path
    m = hashlib.sha256()
    m.update(path.encode())
    ds_path = m.hexdigest()

    # Upload the local folder to the default datastore
    datastore = ws.get_default_datastore()
    path_on_datastore = folder_on_datastore = f'/data/{ds_path}'
    datastore.upload(path, target_path=folder_on_datastore)

    # Create a FileDataset
    datastore_paths = [(datastore, path_on_datastore + '/**')]
    train_data = Dataset.File.from_files(datastore_paths)
    print(f"Registering dataset for path {path}")
    train_data.register(workspace=ws,
                        name=training_data_name,
                        description='Training data (just for illustrative purpose)')
    print('Registered')
else:
    train_data = ws.datasets[training_data_name]
    print('Training dataset found in workspace')

## Load or register TextCNN modules

Load the TextCNN modules from the workspace. If they are not found, register them from the local YAML specs with the module SDK.
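
The cell below wraps all three loads in a single `try` block, so a single failed load sends all three modules down the registration path. A per-module fallback is one alternative. The sketch below shows the idea with plain callables standing in for `Module.load` and `Module.register`; the helper `load_or_register` and the stand-in functions are hypothetical, and catching `Exception` is an assumption because the exception type raised on a failed load is not stated in this sample.

```python
from typing import Callable, TypeVar

T = TypeVar('T')


def load_or_register(load: Callable[[], T], register: Callable[[], T]) -> T:
    """Try load(); fall back to register() if loading fails (hypothetical helper)."""
    try:
        return load()
    except Exception:  # assumption: the real exception type is not known here
        return register()


# Stand-ins for Module.load / Module.register so the sketch runs without a workspace.
registry = {'TextCNN Train Model': 'train-module'}


def fake_load(name: str) -> str:
    return registry[name]  # raises KeyError when the module is missing


def fake_register(name: str) -> str:
    registry[name] = f'registered:{name}'
    return registry[name]


train = load_or_register(lambda: fake_load('TextCNN Train Model'),
                         lambda: fake_register('TextCNN Train Model'))
score = load_or_register(lambda: fake_load('TextCNN Score Model'),
                         lambda: fake_register('TextCNN Score Model'))
print(train, score)
```

With the real SDK, the two callables would wrap the `Module.load(...)` and `Module.register(...)` calls used in the cell below.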

In [None]:
from azureml.pipeline.wrapper import Module

try:
    textcnn_train_module_func = Module.load(ws, namespace='microsoft.com/azureml/samples', name='TextCNN Train Model')
    textcnn_score_module_func = Module.load(ws, namespace='microsoft.com/azureml/samples', name='TextCNN Score Model')
    textcnn_preprocess_module_func = Module.load(ws, namespace='microsoft.com/azureml/samples', name='TextCNN Word to Id')
    print("Load modules successfully.")
except Exception:  # fall back to registration if any module fails to load
    print("Registering modules ...")
    textcnn_train_module_func = Module.register(ws, 'modules/textcnn-train/train.yaml')
    textcnn_score_module_func = Module.register(ws, 'modules/textcnn-score/score.yaml')
    textcnn_preprocess_module_func = Module.register(ws, 'modules/textcnn-preprocess/preprocess.yaml')
    print("Modules registered and loaded successfully.")

## Load built-in modules

AzureML Designer provides built-in modules, which are located in the `azureml` namespace.

Use the following code to load a built-in module.

In [None]:
split_data_module_func = Module.load(ws, namespace='azureml', name='Split Data')

## Create pipeline and run

Create a pipeline from the modules and submit it as an experiment to AzureML using the module SDK.

Here is a [preview of the pipeline](assets/text-classification/pipeline.png).
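
The first step of the pipeline below uses the built-in Split Data module in `Split Rows` mode with a randomized fraction of 0.8, so roughly 80% of the rows feed training and the remainder feeds validation and scoring. As a rough illustration of what such a split does (this is not the module's actual implementation, and `split_rows` with its `seed` parameter is hypothetical):

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar('T')


def split_rows(rows: Sequence[T], fraction: float, seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle rows and split them so the first part holds `fraction` of them."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError('fraction must be between 0 and 1')
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


reviews = [f'review {i}' for i in range(10)]
train_rows, validation_rows = split_rows(reviews, fraction=0.8)
print(len(train_rows), len(validation_rows))
```

In the pipeline, the two outputs correspond to `results_dataset1` and `results_dataset2` of the Split Data step.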


In [None]:
from azureml.pipeline.wrapper import dsl

# Create the pipeline
@dsl.pipeline(name='textcnn_train_pipeline_with_builtin_modules',
              description='TextCNN training pipeline with IMDB_Dataset_Samples.csv dataset',
              default_compute_target=aml_compute_target)
def sample_pipeline():
    split_data_module = split_data_module_func(
        dataset=train_data, 
        splitting_mode='Split Rows',
        fraction_of_rows_in_the_first_output_dataset = 0.8,
        randomized_split = True
    )
    
    textcnn_train_module = textcnn_train_module_func(
        train_data_file=split_data_module.outputs.results_dataset1, 
        validation_data_file=split_data_module.outputs.results_dataset2,
        label_column_name='sentiment',
        true_label_value='positive',
        text_column_name='review'
    )
    
    textcnn_preprocess_module = textcnn_preprocess_module_func(
        input_vocab=textcnn_train_module.outputs.vocab, 
        input_text=split_data_module.outputs.results_dataset2,
        text_column_name='review'
    )

    textcnn_score_module = textcnn_score_module_func(
        trained_model=textcnn_train_module.outputs.trained_model, 
        predict_data=textcnn_preprocess_module.outputs.processed_data
    )

pipeline = sample_pipeline()
pipeline.validate()

In [None]:
# Submit pipeline
run = pipeline.submit(
    experiment_name='textcnn_train'
)

# Show details of the run
run

In [None]:
# Wait until the run completes
run.wait_for_completion()