	•	azureml-sdk: Installs the Azure Machine Learning SDK (v1), which is used for building and managing machine learning workflows on Azure. It provides the azureml.core classes used below (Workspace, Environment, Run, Datastore) along with the pipeline packages.
	•	pillow: Installs the Pillow library, a Python Imaging Library (PIL) fork. This is typically used for image processing tasks, such as loading and manipulating image data.
	•	matplotlib: Installs Matplotlib, a plotting library used for creating static, interactive, and animated visualizations in Python. This can be helpful for data visualization in machine learning tasks.
	•	azure-ai-ml: Installs the Azure Machine Learning Python SDK v2. It provides the MLClient class used below to look up workspace assets such as the registered data asset.
	•	azure-identity: Installs the Azure Identity library, which handles authentication to Azure services. DefaultAzureCredential tries several mechanisms in turn, such as environment variables for a service principal, a managed identity, or an existing Azure CLI login.
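
Before running the install cells it can help to see which of these packages are already present. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical and not part of any Azure SDK.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The five packages installed by the cells in this notebook.
required = ["azureml-sdk", "pillow", "matplotlib", "azure-ai-ml", "azure-identity"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as "not installed" still needs the corresponding pip install cell.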


In [1]:
!pip install azureml-sdk pillow matplotlib



In [2]:
!pip install azure-ai-ml
!pip install azure-identity



	•	from azureml.core import Workspace: This imports the Workspace class from the Azure ML SDK. The Workspace object represents a centralized environment where you can store datasets, models, experiments, and compute targets.
	•	ws = Workspace(...): This creates a Workspace object by passing the following parameters:
		•	subscription_id: The Azure subscription ID where your resources (like the workspace) are located.
		•	resource_group: The resource group under which the workspace is organized. Resource groups are containers for managing related Azure resources.
		•	workspace_name: The name of the specific Azure ML workspace you’re connecting to. In this case, it’s "Breast_cancer_detection".
	•	print(ws.name, ws.location, ws.resource_group): This line prints the workspace’s:
		•	name: The name of the workspace (in this case, "Breast_cancer_detection").
		•	location: The geographic location where the workspace is hosted (e.g., "eastus2").
		•	resource_group: The resource group name ("naiks01-rg").
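
The three parameters above together identify one workspace. As a rough illustration, they can be combined into the Azure Resource Manager ID of that workspace. This is a sketch only: the helper name workspace_resource_id is hypothetical, and the subscription ID is an all-zero placeholder rather than a real one.

```python
def workspace_resource_id(subscription_id, resource_group, workspace_name):
    """Build the ARM resource ID that the three Workspace parameters point to."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.MachineLearningServices"
        f"/workspaces/{workspace_name}"
    )


# Placeholder subscription ID; the other two values are the ones used in this notebook.
resource_id = workspace_resource_id(
    subscription_id="00000000-0000-0000-0000-000000000000",
    resource_group="naiks01-rg",
    workspace_name="Breast_cancer_detection",
)
print(resource_id)
```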


In [3]:
from azureml.core import Workspace

ws = Workspace(subscription_id="1eec3e0f-7d92-4d23-a1ec-35283850f6c3",
               resource_group="naiks01-rg",
               workspace_name="Breast_cancer_detection")
print(ws.name, ws.location, ws.resource_group)

Breast_cancer_detection eastus2 naiks01-rg


In [4]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Initialize MLClient
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="1eec3e0f-7d92-4d23-a1ec-35283850f6c3",
    resource_group_name="naiks01-rg",
    workspace_name="Breast_cancer_detection"
)

# Connect to the data asset
data_asset = ml_client.data.get(name="Mammograms", version="1")  # Replace "1" with correct version if needed

# Print data asset details
print(f"Name: {data_asset.name}")
print(f"Path: {data_asset.path}")

Name: Mammograms
Path: azureml://subscriptions/1eec3e0f-7d92-4d23-a1ec-35283850f6c3/resourcegroups/naiks01-rg/workspaces/Breast_cancer_detection/datastores/datalake_breastcancer/paths/Mammograms/


In [5]:
data_path = data_asset.path
print(f"Data Path: {data_path}")

Data Path: azureml://subscriptions/1eec3e0f-7d92-4d23-a1ec-35283850f6c3/resourcegroups/naiks01-rg/workspaces/Breast_cancer_detection/datastores/datalake_breastcancer/paths/Mammograms/
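
The printed path is a long-form azureml:// datastore URI. A minimal sketch of splitting it into its parts with a regular expression follows; the function name parse_datastore_uri is hypothetical, the pattern is based only on the URI shape printed above, and the subscription ID in the example is a placeholder.

```python
import re

# Pattern derived from the URI shape printed by the notebook.
_DATASTORE_URI = re.compile(
    r"^azureml://subscriptions/(?P<subscription_id>[^/]+)"
    r"/resourcegroups/(?P<resource_group>[^/]+)"
    r"/workspaces/(?P<workspace_name>[^/]+)"
    r"/datastores/(?P<datastore_name>[^/]+)"
    r"/paths/(?P<relative_path>.*)$",
    re.IGNORECASE,
)


def parse_datastore_uri(uri):
    """Return the named components of a long-form azureml:// datastore URI."""
    match = _DATASTORE_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a long-form azureml datastore URI: {uri}")
    return match.groupdict()


example_uri = (
    "azureml://subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourcegroups/naiks01-rg/workspaces/Breast_cancer_detection"
    "/datastores/datalake_breastcancer/paths/Mammograms/"
)
parts = parse_datastore_uri(example_uri)
print(parts["datastore_name"], parts["relative_path"])
```

The datastore name and relative path recovered this way are the same values used later when fetching the datastore and listing the Mammograms folders.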


In [6]:
from azureml.core import Environment

# Define a new environment or update the existing one
env = Environment(name="breast-cancer-env")
env.python.conda_dependencies.add_pip_package("joblib")

# Register or update the environment in your workspace
env.register(workspace=ws)

{
    "assetId": "azureml://locations/eastus2/workspaces/c58af33a-ccba-4490-99d4-10e2497b63ed/environments/breast-cancer-env/versions/6",
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20240908.v1",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "buildContext": null,
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "breast-cancer-env",

In [7]:
import os

# Build the class subfolder paths under the data asset path
benign_folder = os.path.join(data_path, "Benign")
malignant_folder = os.path.join(data_path, "Malignant")

In [8]:
from azure.storage.blob import ContainerClient
from PIL import Image
import io
import os
import numpy as np

# Read the connection string from an environment variable rather than hard-coding the account key.
# AZURE_STORAGE_CONNECTION_STRING is an assumed variable name; set it before running this cell.
connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
container_name = "breastcancermammograms"

# Initialize the container client
container_client = ContainerClient.from_connection_string(connection_string, container_name)

# Function to load and preprocess images from a specific folder
def load_images_from_azure(subfolder_path, label):
    data = []
    labels = []
    blobs = container_client.list_blobs(name_starts_with=f"{subfolder_path}/")
    for blob in blobs:
        # Download the blob content
        blob_data = container_client.download_blob(blob.name).readall()
        # Open the image using PIL
        image = Image.open(io.BytesIO(blob_data)).resize((224, 224)).convert("RGB")
        data.append(np.array(image))
        labels.append(label)
    return data, labels

# Load benign and malignant images
benign_data, benign_labels = load_images_from_azure("Mammograms/Benign", 0)
malignant_data, malignant_labels = load_images_from_azure("Mammograms/Malignant", 1)

# Combine the data
data = np.array(benign_data + malignant_data)
labels = np.array(benign_labels + malignant_labels)

print(f"Loaded {len(data)} images.")

Loaded 2006 images.
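
The loader above opens every blob under the prefix as an image. If the container could also hold non-image entries (an assumption, nothing in the output suggests it does here), a small name filter can guard the Image.open call. The helper name is_image_blob, the suffix set, and the example blob names are all hypothetical.

```python
from pathlib import PurePosixPath

# Assumed set of image extensions; adjust to the formats actually stored.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def is_image_blob(blob_name):
    """Return True when the blob name ends in a known image extension."""
    return PurePosixPath(blob_name).suffix.lower() in IMAGE_SUFFIXES


# Hypothetical blob names, used only to illustrate the filter.
names = [
    "Mammograms/Benign/img_001.PNG",
    "Mammograms/Benign/notes.txt",
    "Mammograms/Benign",
]
image_names = [name for name in names if is_image_blob(name)]
print(image_names)
```

Inside load_images_from_azure, the same check could be applied to blob.name before downloading.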


In [9]:
# Normalize pixel values
data = data.astype('float32') / 255.0
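
To make the effect of this step concrete, here is a small sketch on synthetic data. The random batch stands in for the real images and is an assumption; only the shapes and the division by 255 mirror the notebook.

```python
import numpy as np

# Synthetic stand-in for four 224x224 RGB images with 8-bit pixel values.
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 224, 224, 3), dtype=np.uint8)

# Same normalization as the notebook: cast to float32, then scale into [0, 1].
scaled = batch.astype("float32") / 255.0

# Flattening as done later for the random forest: one row of features per image.
flat = scaled.reshape(len(scaled), -1)

print(scaled.dtype, float(scaled.min()), float(scaled.max()))
print(flat.shape)

# Memory needed for the full 2006-image array once stored as float32.
bytes_needed = 2006 * 224 * 224 * 3 * np.dtype("float32").itemsize
print(f"{bytes_needed / 1e9:.2f} GB")
```

The cast to float32 quadruples the memory of the uint8 array, which is worth keeping in mind on a small compute instance.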

In [10]:
from sklearn.model_selection import train_test_split
import numpy as np

# Split the data into training and testing sets
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.2, random_state=42
)

# Save the datasets as .npy files
np.save("train_data.npy", train_data)
np.save("train_labels.npy", train_labels)
np.save("test_data.npy", test_data)
np.save("test_labels.npy", test_labels)
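
As an alternative to four separate .npy files, the arrays could be bundled into one compressed .npz archive. This is only a sketch of that option with tiny synthetic arrays; the file name splits.npz is hypothetical and the notebook itself keeps using the four .npy files.

```python
import os
import tempfile

import numpy as np

# Tiny synthetic arrays standing in for the real splits.
train_data = np.zeros((3, 2, 2, 3), dtype="float32")
train_labels = np.array([0, 1, 0])
test_data = np.ones((1, 2, 2, 3), dtype="float32")
test_labels = np.array([1])

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "splits.npz")

    # Store all four arrays under named keys in a single compressed file.
    np.savez_compressed(
        bundle_path,
        train_data=train_data,
        train_labels=train_labels,
        test_data=test_data,
        test_labels=test_labels,
    )

    # Load them back by key; the context manager closes the archive afterwards.
    with np.load(bundle_path) as bundle:
        restored_train = bundle["train_data"]
        restored_test_labels = bundle["test_labels"]

print(restored_train.shape, restored_test_labels)
```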


In [11]:
import joblib
import numpy as np
from azureml.core import Run
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Get the Azure ML run context
run = Run.get_context()

def main():
    # Define the file paths relative to the script
    train_data_path = "train_data.npy"
    train_labels_path = "train_labels.npy"
    test_data_path = "test_data.npy"
    test_labels_path = "test_labels.npy"

    # Load the data
    print("Loading data...")
    train_data = np.load(train_data_path)
    train_labels = np.load(train_labels_path)
    test_data = np.load(test_data_path)
    test_labels = np.load(test_labels_path)
    print("Data successfully loaded.")

    # Flatten the data for Random Forest
    print("Preprocessing data...")
    train_data_flatten = train_data.reshape(len(train_data), -1)
    test_data_flatten = test_data.reshape(len(test_data), -1)

    # Train the Random Forest model
    print("Training the model...")
    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_model.fit(train_data_flatten, train_labels)

    # Predict on the test set
    print("Evaluating the model...")
    test_predictions = rf_model.predict(test_data_flatten)

    # Log metrics
    accuracy = accuracy_score(test_labels, test_predictions)
    print(f"Accuracy: {accuracy}")
    run.log("accuracy", accuracy)

    # Log classification report
    report = classification_report(test_labels, test_predictions, output_dict=True)
    for label, metrics in report.items():
        if isinstance(metrics, dict):
            for metric_name, value in metrics.items():
                run.log(f"{label}_{metric_name}", value)

    # Save the model
    print("Saving the model...")
    joblib.dump(rf_model, "random_forest_model.pkl")
    print("Model saved as random_forest_model.pkl.")

    # Upload the model to Azure ML
    print("Uploading the model to Azure ML...")
    run.upload_file(name="outputs/random_forest_model.pkl", path_or_stream="random_forest_model.pkl")
    print("Model uploaded successfully.")

    # Save this script as a .py file
    print("Saving the script as train.py...")
    script_content = """
import argparse
import joblib
import numpy as np
from azureml.core import Run
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Get the Azure ML run context
run = Run.get_context()

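def main():
    # Accept the data paths as arguments, defaulting to the local .npy files
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data", type=str, default="train_data.npy")
    parser.add_argument("--train_labels", type=str, default="train_labels.npy")
    parser.add_argument("--test_data", type=str, default="test_data.npy")
    parser.add_argument("--test_labels", type=str, default="test_labels.npy")
    args = parser.parse_args()

    train_data = np.load(args.train_data)
    train_labels = np.load(args.train_labels)
    test_data = np.load(args.test_data)
    test_labels = np.load(args.test_labels)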

    train_data_flatten = train_data.reshape(len(train_data), -1)
    test_data_flatten = test_data.reshape(len(test_data), -1)

    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_model.fit(train_data_flatten, train_labels)

    test_predictions = rf_model.predict(test_data_flatten)

    accuracy = accuracy_score(test_labels, test_predictions)
    run.log("accuracy", accuracy)

    report = classification_report(test_labels, test_predictions, output_dict=True)
    for label, metrics in report.items():
        if isinstance(metrics, dict):
            for metric_name, value in metrics.items():
                run.log(f"{label}_{metric_name}", value)

    joblib.dump(rf_model, "random_forest_model.pkl")
    run.upload_file(name="outputs/random_forest_model.pkl", path_or_stream="random_forest_model.pkl")

if __name__ == "__main__":
    main()
"""
    with open("train.py", "w") as script_file:
        script_file.write(script_content)
    run.upload_file(name="outputs/train.py", path_or_stream="train.py")
    print("Script saved and uploaded successfully.")

# Entry point for the script
if __name__ == "__main__":
    main()

Loading data...
Data successfully loaded.
Preprocessing data...
Training the model...
Evaluating the model...
Accuracy: 0.9950248756218906
Attempted to log scalar metric accuracy:
0.9950248756218906
Attempted to log scalar metric 0_precision:
1.0
Attempted to log scalar metric 0_recall:
0.9900990099009901
Attempted to log scalar metric 0_f1-score:
0.9950248756218906
Attempted to log scalar metric 0_support:
202.0
Attempted to log scalar metric 1_precision:
0.9900990099009901
Attempted to log scalar metric 1_recall:
1.0
Attempted to log scalar metric 1_f1-score:
0.9950248756218906
Attempted to log scalar metric 1_support:
200.0
Attempted to log scalar metric macro avg_precision:
0.995049504950495
Attempted to log scalar metric macro avg_recall:
0.995049504950495
Attempted to log scalar metric macro avg_f1-score:
0.9950248756218906
Attempted to log scalar metric macro avg_support:
402.0
Attempted to log scalar metric weighted avg_precision:
0.9950741342790996
Attempted to log scalar metr
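
The logged values are consistent with a simple set of confusion counts. The sketch below recomputes them by hand; the counts are inferred from the logged support and recall values and are an assumption, not something printed by the run.

```python
# Counts inferred from the log: class 0 has support 202 and recall 200/202,
# class 1 has support 200 and recall 1.0.
benign_correct, benign_as_malignant = 200, 2
malignant_correct, malignant_as_benign = 200, 0

total = benign_correct + benign_as_malignant + malignant_correct + malignant_as_benign

accuracy = (benign_correct + malignant_correct) / total

# Class 0 (benign)
precision_0 = benign_correct / (benign_correct + malignant_as_benign)
recall_0 = benign_correct / (benign_correct + benign_as_malignant)

# Class 1 (malignant)
precision_1 = malignant_correct / (malignant_correct + benign_as_malignant)
recall_1 = malignant_correct / (malignant_correct + malignant_as_benign)

# Macro average: unweighted mean of the per-class precisions.
macro_precision = (precision_0 + precision_1) / 2

print(accuracy, precision_0, recall_0, precision_1, recall_1, macro_precision)
```

Under these counts, the two errors are benign images predicted as malignant, and no malignant image is missed.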

In [12]:
import sys
sys.argv = [
    'train.py',
    '--train_data', 'train_data.npy',
    '--train_labels', 'train_labels.npy',
    '--test_data', 'test_data.npy',
    '--test_labels', 'test_labels.npy'
]
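
Overwriting sys.argv works in a notebook, but argparse can also be given the argument list directly, which avoids changing global state. A minimal sketch, assuming the same four arguments as the training script; the parse_args wrapper is a hypothetical helper.

```python
import argparse


def parse_args(argv=None):
    """Parse the training arguments; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data", type=str, required=True)
    parser.add_argument("--train_labels", type=str, required=True)
    parser.add_argument("--test_data", type=str, required=True)
    parser.add_argument("--test_labels", type=str, required=True)
    return parser.parse_args(argv)


# In a notebook, pass the list explicitly instead of editing sys.argv.
args = parse_args([
    "--train_data", "train_data.npy",
    "--train_labels", "train_labels.npy",
    "--test_data", "test_data.npy",
    "--test_labels", "test_labels.npy",
])
print(args.train_data, args.test_labels)
```

When the same function runs as a script inside a pipeline step, calling parse_args() with no argument reads the real command line.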

In [13]:
import argparse
import joblib
import numpy as np
from azureml.core import Run
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Get the Azure ML run context
run = Run.get_context()

def main():
    # Parse input arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data", type=str, required=True)
    parser.add_argument("--train_labels", type=str, required=True)
    parser.add_argument("--test_data", type=str, required=True)
    parser.add_argument("--test_labels", type=str, required=True)
    args = parser.parse_args()

    # Load the data
    print("Loading data from datastore...")
    train_data = np.load(args.train_data)
    train_labels = np.load(args.train_labels)
    test_data = np.load(args.test_data)
    test_labels = np.load(args.test_labels)
    print("Data successfully loaded.")

    # Flatten the data for Random Forest
    print("Preprocessing data...")
    train_data_flatten = train_data.reshape(len(train_data), -1)
    test_data_flatten = test_data.reshape(len(test_data), -1)

    # Train the Random Forest model
    print("Training the model...")
    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_model.fit(train_data_flatten, train_labels)

    # Predict on the test set
    print("Evaluating the model...")
    test_predictions = rf_model.predict(test_data_flatten)

    # Log metrics
    accuracy = accuracy_score(test_labels, test_predictions)
    print(f"Accuracy: {accuracy}")
    run.log("accuracy", accuracy)

    # Log classification report
    report = classification_report(test_labels, test_predictions, output_dict=True)
    for label, metrics in report.items():
        if isinstance(metrics, dict):
            for metric_name, value in metrics.items():
                run.log(f"{label}_{metric_name}", value)

    # Save the model
    print("Saving the model...")
    joblib.dump(rf_model, "random_forest_model.pkl")
    print("Model saved as random_forest_model.pkl.")

    # Upload the model to Azure ML
    print("Uploading the model to Azure ML...")
    run.upload_file(name="outputs/random_forest_model.pkl", path_or_stream="random_forest_model.pkl")
    print("Model uploaded successfully.")

if __name__ == "__main__":
    main()

Loading data from datastore...
Data successfully loaded.
Preprocessing data...
Training the model...
Evaluating the model...
Accuracy: 0.9950248756218906
Attempted to log scalar metric accuracy:
0.9950248756218906
Attempted to log scalar metric 0_precision:
1.0
Attempted to log scalar metric 0_recall:
0.9900990099009901
Attempted to log scalar metric 0_f1-score:
0.9950248756218906
Attempted to log scalar metric 0_support:
202.0
Attempted to log scalar metric 1_precision:
0.9900990099009901
Attempted to log scalar metric 1_recall:
1.0
Attempted to log scalar metric 1_f1-score:
0.9950248756218906
Attempted to log scalar metric 1_support:
200.0
Attempted to log scalar metric macro avg_precision:
0.995049504950495
Attempted to log scalar metric macro avg_recall:
0.995049504950495
Attempted to log scalar metric macro avg_f1-score:
0.9950248756218906
Attempted to log scalar metric macro avg_support:
402.0
Attempted to log scalar metric weighted avg_precision:
0.9950741342790996
Attempted to 
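
The metric names in this output come from flattening the nested dictionary that classification_report returns with output_dict=True. The sketch below reproduces that flattening on a hand-written dictionary. The sample values are copied from the output above and cover only the two classes and the accuracy entry; the helper flatten_report is hypothetical.

```python
def flatten_report(report):
    """Turn a nested report into flat '<label>_<metric>' keys."""
    flat = {}
    for label, metrics in report.items():
        if isinstance(metrics, dict):
            for metric_name, value in metrics.items():
                flat[f"{label}_{metric_name}"] = value
        else:
            # 'accuracy' is a plain float, not a dict, so it keeps its own key.
            flat[label] = metrics
    return flat


# Hand-written sample in the shape of the report, using values from the log.
sample_report = {
    "0": {"precision": 1.0, "recall": 0.9900990099009901,
          "f1-score": 0.9950248756218906, "support": 202.0},
    "1": {"precision": 0.9900990099009901, "recall": 1.0,
          "f1-score": 0.9950248756218906, "support": 200.0},
    "accuracy": 0.9950248756218906,
}

flat = flatten_report(sample_report)
for key, value in flat.items():
    print(key, value)
```

The training script skips non-dict entries in its loop, which is why accuracy is logged separately with its own run.log call.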

In [14]:
from azureml.core.compute import ComputeTarget

compute_name = "naiks011"  # Replace with your actual compute instance name

# Retrieve the existing compute instance
compute_target = ComputeTarget(workspace=ws, name=compute_name)
print(f"Using compute target: {compute_name}")

Using compute target: naiks011


In [15]:
from azureml.core import Datastore

datastore = Datastore.get(ws, datastore_name="datalake_breastcancer")
datastore.upload_files(
    files=["train_data.npy", "train_labels.npy", "test_data.npy", "test_labels.npy"],
    target_path="training_data",
    overwrite=True,
)
print("Files uploaded to datastore.")

"datastore.upload_files" is deprecated after version 1.0.69. Please use "FileDatasetFactory.upload_directory" instead. See Dataset API change notice at https://aka.ms/dataset-deprecation.


Uploading an estimated of 4 files
Uploading train_labels.npy
Uploaded train_labels.npy, 1 files out of an estimated total of 4
Uploading test_labels.npy
Uploaded test_labels.npy, 2 files out of an estimated total of 4
Uploading test_data.npy
Uploaded test_data.npy, 3 files out of an estimated total of 4
Uploading train_data.npy
Uploaded train_data.npy, 4 files out of an estimated total of 4
Uploaded 4 files
Files uploaded to datastore.
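
The deprecation warning above points to FileDatasetFactory.upload_directory, which uploads a whole directory rather than a list of files. A minimal sketch of that route follows: copy the four files into a staging folder, then upload the folder. The helper names stage_files and upload_staged and the staging folder are hypothetical, and the upload call is written from the SDK v1 documentation as I understand it, so treat it as unverified against this workspace.

```python
import shutil
from pathlib import Path


def stage_files(files, staging_dir):
    """Copy the given files into one directory so it can be uploaded as a unit."""
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    for file_path in files:
        shutil.copy2(file_path, staging / Path(file_path).name)
    return staging


def upload_staged(datastore, staging_dir, target_path="training_data"):
    """Upload the staged directory to the datastore (requires azureml-core)."""
    # Imported lazily so stage_files works without the Azure ML SDK installed.
    from azureml.core import Dataset
    from azureml.data.datapath import DataPath

    return Dataset.File.upload_directory(
        src_dir=str(staging_dir),
        target=DataPath(datastore, target_path),
        overwrite=True,
    )
```

Usage in this notebook would look like upload_staged(datastore, stage_files(["train_data.npy", "train_labels.npy", "test_data.npy", "test_labels.npy"], "staging")).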


In [19]:
pip install azureml-sdk

Collecting azureml-sdk
  Downloading azureml_sdk-1.59.0-py3-none-any.whl.metadata (3.6 kB)
Collecting azureml-core~=1.59.0 (from azureml-sdk)
  Downloading azureml_core-1.59.0-py3-none-any.whl.metadata (3.2 kB)
Collecting azureml-dataset-runtime~=1.59.0 (from azureml-dataset-runtime[fuse]~=1.59.0->azureml-sdk)
  Downloading azureml_dataset_runtime-1.59.0-py3-none-any.whl.metadata (1.2 kB)
Collecting azureml-train-core~=1.59.0 (from azureml-sdk)
  Downloading azureml_train_core-1.59.0-py3-none-any.whl.metadata (1.8 kB)
Collecting azureml-train-automl-client~=1.59.0 (from azureml-sdk)
  Downloading azureml_train_automl_client-1.59.0-py3-none-any.whl.metadata (1.4 kB)
Collecting azureml-pipeline~=1.59.0 (from azureml-sdk)
  Downloading azureml_pipeline-1.59.0-py3-none-any.whl.metadata (1.8 kB)
Collecting fusepy<4.0.0,>=3.0.1 (from azureml-dataset-runtime[fuse]~=1.59.0->azureml-sdk)
  Downloading fusepy-3.0.1.tar.gz (11 kB)
  Preparing metadata (setup.py) ... done
Collecting az

In [20]:
from azureml.pipeline.steps import PythonScriptStep
from azureml.data.datapath import DataPath

In [42]:
train_step = PythonScriptStep(
    name="Train Random Forest Model",
    script_name="train.py",
    arguments=[
        "--train_data", DataPath(datastore, "training_data/train_data.npy"),
        "--train_labels", DataPath(datastore, "training_data/train_labels.npy"),
        "--test_data", DataPath(datastore, "training_data/test_data.npy"),
        "--test_labels", DataPath(datastore, "training_data/test_labels.npy"),
    ],
    compute_target=compute_target,
    source_directory="./pipeline_scripts",  # Use the new directory
)
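
The step reads train.py from ./pipeline_scripts, while the earlier cell wrote train.py into the working directory. The script therefore has to be placed in that folder before the pipeline is built. A minimal sketch of doing so is below; the helper name write_step_script is hypothetical, and the one-line script text is a placeholder for the real training script.

```python
from pathlib import Path


def write_step_script(script_text, source_directory="./pipeline_scripts",
                      script_name="train.py"):
    """Write the step script into the directory the pipeline step will snapshot."""
    directory = Path(source_directory)
    directory.mkdir(parents=True, exist_ok=True)
    script_path = directory / script_name
    script_path.write_text(script_text, encoding="utf-8")
    return script_path


if __name__ == "__main__":
    # Placeholder content; in the notebook this would be the training script text.
    path = write_step_script("print('training step placeholder')\n")
    print(f"Wrote {path}")
```

Keeping only the script in source_directory also keeps the snapshot uploaded with each step small, since the large .npy files stay outside it.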

In [43]:
# Create and submit the pipeline
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[train_step])
print("Pipeline created successfully.")

pipeline_run = pipeline.submit("breast-cancer-detection-pipeline")
pipeline_run.wait_for_completion(show_output=True)

Pipeline created successfully.
Created step Train Random Forest Model [943e9702][ab52994c-bb1e-4370-8155-702910c22924], (This step will run and generate new outputs)
Submitted PipelineRun 53520b44-8bee-4ddd-9905-74618e78e4fd
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/53520b44-8bee-4ddd-9905-74618e78e4fd?wsid=/subscriptions/1eec3e0f-7d92-4d23-a1ec-35283850f6c3/resourcegroups/naiks01-rg/workspaces/Breast_cancer_detection&tid=b7dc318e-8abb-4c84-9a6a-3ae9fff0999f
PipelineRunId: 53520b44-8bee-4ddd-9905-74618e78e4fd
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/53520b44-8bee-4ddd-9905-74618e78e4fd?wsid=/subscriptions/1eec3e0f-7d92-4d23-a1ec-35283850f6c3/resourcegroups/naiks01-rg/workspaces/Breast_cancer_detection&tid=b7dc318e-8abb-4c84-9a6a-3ae9fff0999f
PipelineRun Status: NotStarted
PipelineRun Status: Running


Expected a StepRun object but received <class 'azureml.core.run.Run'> instead.
This usually indicates a package conflict with one of the dependencies of azureml-core or azureml-pipeline-core.
Please check for package conflicts in your python environment






PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': '53520b44-8bee-4ddd-9905-74618e78e4fd', 'status': 'Completed', 'startTimeUtc': '2024-12-12T02:28:17.312188Z', 'endTimeUtc': '2024-12-12T02:29:20.999056Z', 'services': {}, 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{}', 'azureml.continue_on_step_failure': 'False', 'azureml.continue_on_failed_optional_input': 'True', 'azureml.pipelineComponent': 'pipelinerun', 'azureml.pipelines.stages': '{"Initialization":null,"Execution":{"StartTime":"2024-12-12T02:28:17.5765695+00:00","EndTime":"2024-12-12T02:29:20.9007625+00:00","Status":"Finished"}}'}, 'inputDatasets': [], 'outputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://breastcancerde4596660225.blob.core.windows.net/azureml/ExperimentRun/dcid.53520b44-8bee-4ddd-9905-74618e78e4fd/logs/azureml/executionlogs.txt?sv=2019-07-07&sr=b&sig=BAjD17WmesR9O8nhiZ%2FUt%2F9IUhFe2DiNs8MfG

'Finished'

In [None]:
from azureml.data.datapath import DataPath

train_step = PythonScriptStep(
    name="Train Random Forest Model",
    script_name="train.py",
    arguments=[
        "--train_data", DataPath(datastore, "training_data/train_data.npy"),
        "--train_labels", DataPath(datastore, "training_data/train_labels.npy"),
        "--test_data", DataPath(datastore, "training_data/test_data.npy"),
        "--test_labels", DataPath(datastore, "training_data/test_labels.npy"),
    ],
    compute_target=compute_target,
    source_directory="./pipeline_scripts",  # Ensure this matches where train.py is located
)

# Create and submit the pipeline
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[train_step])
print("Pipeline created successfully.")

pipeline_run = pipeline.submit("breast-cancer-detection-pipeline")
pipeline_run.wait_for_completion(show_output=True)

In [None]:
from azureml.pipeline.core import Pipeline

# Create the pipeline using the defined step
pipeline = Pipeline(workspace=ws, steps=[train_step])
print("Pipeline created successfully.")

In [None]:
# Submit the pipeline to Azure ML
pipeline_run = pipeline.submit("breast-cancer-detection-pipeline")
print("Pipeline submitted. Waiting for completion...")

In [None]:
pipeline_run.wait_for_completion(show_output=True)