**Run this Jupyter notebook from a compute instance on AzureML, using the `Python 3.10 - SDK v2 (Python 3.10.11)` kernel.**

## Create a client connection to the AzureML workspace

The following cells create a client object called `ml_client` that is connected to the AzureML workspace. You have to create this client in every notebook or Python script that interacts with the AzureML platform.

In [1]:
!python --version
!pip install azure-ai-ml --upgrade

Python 3.10.11


In [30]:
from azure.ai.ml import MLClient, spark, Input, Output
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import UserIdentityConfiguration

Use this authentication mechanism if you are running this notebook from your compute instance within Azure Machine Learning:

In [31]:
ml_client = MLClient.from_config(
    DefaultAzureCredential(),
)


Found the config file in: /config.json


You can also run this control-plane notebook from your laptop. In that case, first install the Python libraries listed in the `requirements.txt` file, then create the client with explicit workspace details:

In [32]:
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    workspace_name="project-group-35",
    subscription_id="21ff0fc0-dd2c-450d-93b7-96eeb3699b22",
    resource_group_name="project-group-35"
)
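
Hard-coding the subscription ID works, but it ties the notebook to a single workspace. As a minimal sketch, the same three values could instead be read from environment variables. The variable names and the helper `workspace_settings_from_env` are assumptions made for this illustration, not names the SDK itself looks for.

```python
import os

# Hypothetical environment variable names, chosen for this sketch only.
REQUIRED_VARS = {
    "subscription_id": "AZURE_SUBSCRIPTION_ID",
    "resource_group_name": "AZUREML_RESOURCE_GROUP",
    "workspace_name": "AZUREML_WORKSPACE_NAME",
}


def workspace_settings_from_env(env=None):
    """Return the MLClient keyword arguments found in the environment.

    Raises KeyError listing every variable that is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {kwarg: env[var] for kwarg, var in REQUIRED_VARS.items()}
```

With these variables set, the client could be created with `MLClient(credential=DefaultAzureCredential(), **workspace_settings_from_env())`.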

## Download the Spark-NLP jar to your working directory so it can be added to the job cluster

You only need to do this once. The jar file needs to be in the same directory as the script that will be run as a job.

In [33]:
# Download the spark-nlp jar and save it locally. This needs to be done before submitting a job.
import requests

jar_url = "https://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.12/5.0.2/spark-nlp_2.12-5.0.2.jar"
response = requests.get(jar_url, timeout=300)
response.raise_for_status()  # Fail early instead of saving an HTTP error page as the jar
with open("spark-nlp_2.12-5.0.2.jar", "wb") as f:
    f.write(response.content)
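
Because the download only has to happen once, the cell can be made safe to re-run by skipping the request when the jar is already present. The following is a minimal sketch that uses only the standard library (`urllib` rather than `requests`); the helper name `download_once` is hypothetical.

```python
import os
import shutil
import urllib.request


def download_once(url, dest_path):
    """Download url to dest_path unless a non-empty file already exists there.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        return False
    tmp_path = dest_path + ".part"
    # Stream to a temporary file so an interrupted download never
    # leaves a truncated jar under the final name.
    with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as f:
        shutil.copyfileobj(response, f)
    os.replace(tmp_path, dest_path)
    return True
```

Called with the Maven URL used in this notebook and `"spark-nlp_2.12-5.0.2.jar"` as the destination, the first call would fetch the jar and later calls would return `False` without touching the network.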

## Define the Job

The following cell defines the job. It is an object of the [Spark class](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities.spark?view=azure-python) that contains the information required to run a job.

For more information about the parameters used in the job definition, [read the documentation](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-submit-spark-jobs?view=azureml-api-2&tabs=sdk#submit-a-standalone-spark-job).

In [34]:
sparknlp_job_def = spark(
    display_name="last-test-expect-success_2",  
    code="./",
    entry={"file": "sample-spark-nlp-job-tiana.py"},
    driver_cores=1,
    driver_memory="7g",
    executor_instances=1,
    executor_cores=1,
    executor_memory="7g",
    resources={
        "instance_type": "Standard_E4S_V3",
        "runtime_version": "3.4",
    },
    jars=["spark-nlp_2.12-5.0.2.jar"],
    environment="sparknlp-python-env@latest",
    identity=UserIdentityConfiguration()
)

## Submit the job

The following cells check that the entry script exists and then submit the job you defined above. If you are submitting multiple jobs, you may want to create separate job definition objects for clarity. You can submit more than one job, but remember that each job spins up its own Spark cluster.

In [35]:
import os
print("File exists:", os.path.exists("sample-spark-nlp-job-tiana.py"))

File exists: True
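
The check above covers only the entry script, but the jar also has to sit in the code directory that gets uploaded. A small sketch of a combined pre-submission check follows; `missing_job_files` is a hypothetical helper, not part of the AzureML SDK.

```python
import os


def missing_job_files(code_dir, filenames):
    """Return the names from filenames that are not regular files in code_dir."""
    return [
        name for name in filenames
        if not os.path.isfile(os.path.join(code_dir, name))
    ]


if __name__ == "__main__":
    # File names taken from the job definition in this notebook.
    required = ["sample-spark-nlp-job-tiana.py", "spark-nlp_2.12-5.0.2.jar"]
    missing = missing_job_files("./", required)
    print("Missing files:", missing if missing else "none")
```

Returning the list of missing names, rather than a single boolean, makes it clear which file to fix before the job is submitted.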


In [36]:
sparknlp_job = ml_client.jobs.create_or_update(sparknlp_job_def)

Uploading spark-job-with-sparknlp-copy (45.76 MBs): 100%|██████████| 45760087/45760087 [00:01<00:00, 30064788.54it/s]



## Get the Job Studio URL

Once you submit the job, you can navigate to it in AzureML Studio and monitor its progress. There are ways to do this through the SDK, but for now just use the Studio. These are unattended jobs, which means you can shut down this notebook and the compute instance, and the job will still go through its lifecycle:

- Job is submitted
- Job is queued
- Job is run
- Job completes (assuming no errors)

**Each job's Studio URL will be different.**

In [37]:
print(sparknlp_job.studio_url)

https://ml.azure.com/runs/silly_peach_9s98nzyw3n?wsid=/subscriptions/21ff0fc0-dd2c-450d-93b7-96eeb3699b22/resourcegroups/project-group-35/workspaces/project-group-35&tid=fd571193-38cb-437b-bb55-60f28d67b643
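
If you later want to follow the job lifecycle from code rather than from the Studio, one option is to poll the job's status. The sketch below keeps the polling logic independent of the SDK so it can be tried without a workspace. `wait_for_status` is a hypothetical helper, and the set of terminal status strings is an assumption that should be checked against the statuses your jobs actually report.

```python
import time

# Assumed terminal states; verify against the status values your jobs report.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_status(get_status, poll_seconds=30.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Call get_status() until it returns a terminal state, then return that state.

    Raises TimeoutError once the accumulated sleep time reaches timeout_seconds
    without a terminal state being seen.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still in state {status!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook, the status callable could be `lambda: ml_client.jobs.get(sparknlp_job.name).status`, which fetches the job again on every poll.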


In [38]:
import zipfile

# Sanity check: read the jar's manifest and print the version it declares.
jar_file = 'spark-nlp_2.12-5.0.2.jar'
with zipfile.ZipFile(jar_file, 'r') as z:
    with z.open('META-INF/MANIFEST.MF') as manifest:
        contents = manifest.read().decode('utf-8')
        for line in contents.splitlines():
            if 'Implementation-Version' in line:
                print(line)

Implementation-Version: 5.0.2
