
	•	azureml-sdk: Installs the Azure Machine Learning SDK for Python (v1, the azureml.core packages), which is used for building and managing machine learning workflows on Azure. It includes tools for training models, managing datasets, deploying models, and more.
	•	pillow: Installs the Pillow library, a Python Imaging Library (PIL) fork. This is typically used for image processing tasks, such as loading and manipulating image data.
	•	matplotlib: Installs Matplotlib, a plotting library used for creating static, interactive, and animated visualizations in Python. This can be helpful for data visualization in machine learning tasks.
	•	azure-ai-ml: Installs the Azure Machine Learning SDK for Python v2 (the azure.ai.ml package). It provides MLClient, which is used below to retrieve a data asset and can also manage jobs, models, and other workspace resources.
	•	azure-identity: Installs the Azure Identity library, which authenticates to Azure services using credentials such as managed identities, service principals, or an Azure CLI login.




In [None]:
!pip install azureml-sdk pillow matplotlib

In [None]:
!pip install azure-ai-ml
!pip install azure-identity


	•	from azureml.core import Workspace: This imports the Workspace class from the Azure ML SDK. The Workspace object represents a centralized environment where you can store datasets, models, experiments, and compute targets.
	•	ws = Workspace(...): This creates a Workspace object by passing the following parameters:
		•	subscription_id: The Azure subscription ID where your resources (like the workspace) are located.
		•	resource_group: The resource group under which the workspace is organized. Resource groups are containers for managing related Azure resources.
		•	workspace_name: The name of the specific Azure ML workspace you’re connecting to. In this case, it’s "Breast_cancer_detection".
	•	print(ws.name, ws.location, ws.resource_group): This line prints the workspace’s:
		•	name: The name of the workspace (in this case, "Breast_cancer_detection").
		•	location: The geographic location where the workspace is hosted (e.g., "eastus2").
		•	resource_group: The resource group name ("naiks01-rg").

This code effectively connects to your Azure Machine Learning workspace, allowing you to perform tasks like data management, model training, and deployment on Azure. It also confirms the connection by printing workspace details.
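
To keep these identifiers out of the notebook itself, they can live in a small JSON file. The sketch below assumes a file containing the keys subscription_id, resource_group, and workspace_name (the same three values passed to Workspace above). The helper name load_workspace_settings and the file name are hypothetical.

```python
import json
from pathlib import Path

# The three identifiers that the Workspace constructor is given above.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_settings(config_path):
    """Read the workspace identifiers from a JSON file and validate them."""
    settings = json.loads(Path(config_path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise KeyError(f"Missing workspace settings: {', '.join(missing)}")
    # Return only the keys the constructor needs, ignoring anything else.
    return {key: settings[key] for key in REQUIRED_KEYS}
```

The returned dictionary could then be unpacked into the constructor, for example Workspace(**load_workspace_settings("config.json")).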


In [None]:
from azureml.core import Workspace

ws = Workspace(subscription_id="1eec3e0f-7d92-4d23-a1ec-35283850f6c3",
               resource_group="naiks01-rg",
               workspace_name="Breast_cancer_detection")
print(ws.name, ws.location, ws.resource_group)



1. Importing Required Libraries:

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

	•	MLClient: This class is part of the azure-ai-ml SDK. It allows you to interact with Azure Machine Learning services and manage various resources like datasets, models, and experiments. It provides methods to work with resources stored in the Azure ML workspace.
	•	DefaultAzureCredential: This class is from the azure-identity library. It provides a simplified way to authenticate to Azure services by trying several credential sources, such as:
		•	Environment variables
		•	Managed identities (for Azure services)
		•	Azure CLI credentials
		•	Visual Studio Code credentials
		•	And more
	•	It tries these sources in a fixed order and uses the first one that can supply a token, so the same code can run locally and on Azure compute.
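
The "first source that works" behaviour can be illustrated with a toy model. This is only a sketch of the idea of a chained credential, not the azure-identity implementation: the exception class, the helper first_available_token, and the two fake sources are all hypothetical.

```python
class CredentialUnavailable(Exception):
    """Raised by a source that cannot provide a token in this environment."""


def first_available_token(sources):
    """Try each (name, get_token) pair in order and return the first success."""
    errors = []
    for name, get_token in sources:
        try:
            return name, get_token()
        except CredentialUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise CredentialUnavailable("; ".join(errors) or "no credential sources configured")


def from_environment():
    # Simulates a machine where no credential environment variables are set.
    raise CredentialUnavailable("environment variables not set")


def from_cli():
    # Simulates a developer machine with an active CLI login.
    return "token-from-cli"  # placeholder value, not a real token


source_name, token = first_available_token([
    ("environment", from_environment),
    ("azure-cli", from_cli),
])
print(source_name)
```

Because the environment source fails and the CLI source succeeds, the chain settles on the second entry. If every source fails, the collected error messages are raised together.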

2. Initializing the MLClient:

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="1eec3e0f-7d92-4d23-a1ec-35283850f6c3",
    resource_group_name="naiks01-rg",
    workspace_name="Breast_cancer_detection"
)

	•	This initializes the MLClient object, which is used to interact with Azure ML resources (such as datasets, models, experiments).
	•	Parameters:
		•	DefaultAzureCredential(): Provides the authentication needed to access Azure resources. It uses the first credential source that is available in the current environment.
		•	subscription_id: Specifies the Azure subscription ID where the Azure ML workspace is located.
		•	resource_group_name: Specifies the resource group containing the workspace.
		•	workspace_name: Specifies the name of the Azure ML workspace.

By initializing the MLClient, you’re able to access and manage your resources within the Azure ML workspace.

3. Connecting to the Data Asset:

data_asset = ml_client.data.get(name="Mammograms", version="1")

	•	This line retrieves a data asset from the Azure ML workspace using the ml_client.
	•	ml_client.data.get(): This method allows you to fetch a data asset from the workspace. You specify:
	•	name="Mammograms": The name of the data asset you want to access. In this case, the dataset is named “Mammograms,” which likely contains data related to breast cancer detection.
	•	version="1": The version of the dataset you want to access. In this case, the first version of the “Mammograms” dataset is being retrieved. If you need a different version, you can change this number.

4. Printing Data Asset Details:

print(f"Name: {data_asset.name}")
print(f"Path: {data_asset.path}")

	•	After retrieving the data asset, these lines print out:
	•	data_asset.name: The name of the data asset (e.g., “Mammograms”).
	•	data_asset.path: The path to the data asset, which provides the location of the data in Azure storage or data lake. This is important because you will use this path to access the dataset for further processing or training.

Summary:

	•	This code connects to Azure ML, authenticates using the DefaultAzureCredential, and retrieves a dataset named “Mammograms” from the workspace.
	•	It prints out the name and path of the dataset to confirm the connection and provide details about the dataset location.
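
The printed path is often easier to work with once it is split into its parts. The sketch below assumes the path is a datastore URI of the form azureml://.../datastores/<datastore>/paths/<relative path>. The notebook never shows the actual value, so this format, the helper name split_datastore_uri, and the example URI are assumptions.

```python
import re

# Matches both the long form (with subscription and workspace segments)
# and the short form that starts directly at "datastores/".
_DATASTORE_URI = re.compile(
    r"^azureml://(?:.*/)?datastores/(?P<datastore>[^/]+)/paths/(?P<path>.*)$"
)


def split_datastore_uri(uri):
    """Return (datastore_name, relative_path) for a datastore URI."""
    match = _DATASTORE_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a datastore URI: {uri}")
    return match.group("datastore"), match.group("path")


# Hypothetical URI, used only to show the expected shape.
example_uri = "azureml://datastores/example_store/paths/Mammograms/"
print(split_datastore_uri(example_uri))
```

If the asset instead points at an https:// or wasbs:// location, the helper raises ValueError rather than guessing.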



In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Initialize MLClient
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="1eec3e0f-7d92-4d23-a1ec-35283850f6c3",
    resource_group_name="naiks01-rg",
    workspace_name="Breast_cancer_detection"
)

# Connect to the data asset
data_asset = ml_client.data.get(name="Mammograms", version="1")  

# Print data asset details
print(f"Name: {data_asset.name}")
print(f"Path: {data_asset.path}")

In [None]:
data_path = data_asset.path
print(f"Data Path: {data_path}")

	•	Environment Creation: A new environment named "breast-cancer-env" is created (or an existing one is updated).
	•	Package Addition: The joblib Python package is added to the environment, ensuring that it is available for any scripts or models using this environment.
	•	Environment Registration: The environment is then registered in the Azure ML workspace, making it available for use in Azure ML experiments.

        

In [None]:
from azureml.core import Environment


env = Environment(name="breast-cancer-env")
env.python.conda_dependencies.add_pip_package("joblib")


env.register(workspace=ws)

The next cell does the following:

1. Joining Paths:

benign_folder = os.path.join(data_path, "Benign")
malignant_folder = os.path.join(data_path, "Malignant")

	•	os.path.join(): This function is used to concatenate paths in a platform-independent way. It ensures that the correct directory separator (/ or \, depending on the operating system) is used.
	•	data_path: This is the path to the directory or dataset location, which has been defined earlier in the code (likely as the path to the “Mammograms” dataset in your Azure ML workspace).
	•	"Benign" and "Malignant": These are folder names likely corresponding to two categories of mammogram images—benign (non-cancerous) and malignant (cancerous). These folders will contain images or data files for each type of breast cancer case.
	•	Result:
	•	benign_folder: This will store the full path to the “Benign” folder inside the dataset, which contains images of benign (non-cancerous) cases.
	•	malignant_folder: This will store the full path to the “Malignant” folder inside the dataset, which contains images of malignant (cancerous) cases.

Summary:

	•	These lines are constructing full paths to two subdirectories—“Benign” and “Malignant”—inside the data_path. These directories likely hold the relevant datasets for training a breast cancer detection model, where benign and malignant refer to the two classes (types of tumors) the model will classify.
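
One caveat: os.path.join uses a backslash on Windows, which is not what a storage URI expects. If data_path turns out to be a URI rather than a local directory (an assumption, since the notebook does not show its value), posixpath.join always uses forward slashes. The helper name class_folder and the example URI below are hypothetical.

```python
import posixpath


def class_folder(data_path, class_name):
    """Join a class sub-folder onto a folder path or URI using forward slashes."""
    return posixpath.join(data_path, class_name)


# Hypothetical data path, used only for illustration.
example_path = "azureml://datastores/example_store/paths/Mammograms/"
benign_example = class_folder(example_path, "Benign")
malignant_example = class_folder(example_path, "Malignant")
print(benign_example)
print(malignant_example)
```

The join behaves the same whether or not the base path ends with a slash, so no doubled separator appears in the result.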


In [None]:
import os

benign_folder = os.path.join(data_path, "Benign")
malignant_folder = os.path.join(data_path, "Malignant")

Although the next code cell loads the images directly from Azure Blob Storage, the Azure ML data asset (the “Mammograms” folder) still serves a purpose in the Azure ML workflow. Here is why it remains useful:

1. Seamless Integration with Azure ML Pipelines:

	•	When you create a data asset in Azure ML, you’re essentially registering a reference to your data (in Blob Storage or Data Lake) within the Azure ML environment. This data asset is integrated into Azure ML’s infrastructure and makes it easier to use in experiments, pipelines, and model training without manually dealing with storage connections each time.
	•	Azure ML Pipelines can directly use data assets, allowing you to refer to the data abstractly, making the entire workflow easier to manage.

2. Versioning:

	•	One key benefit of using Azure ML Data Assets is versioning. When you register a data asset, Azure ML tracks different versions of the dataset. This is helpful when you want to:
	•	Keep track of changes in the data (e.g., updates or new versions of the dataset).
	•	Reproduce experiments with the same version of the data, ensuring consistency in results.
	•	Without using the data asset, you’d have to manually track the version of the data you’re using (e.g., by naming your blobs or folders with version identifiers), which could become cumbersome.
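
The idea of pinning a version can be sketched with a toy in-memory registry. This is not how Azure ML stores data assets. The class DataAssetRegistry, its sequential version numbers, and the example paths are hypothetical and only illustrate why a (name, version) pair keeps pointing at the same data.

```python
class DataAssetRegistry:
    """Toy registry: each asset name maps to an ordered list of registered paths."""

    def __init__(self):
        self._assets = {}

    def register(self, name, path):
        """Store a new version of the asset and return its version string."""
        versions = self._assets.setdefault(name, [])
        versions.append(path)
        return str(len(versions))

    def get(self, name, version):
        """Look up the path recorded for an exact (name, version) pair."""
        return self._assets[name][int(version) - 1]


registry = DataAssetRegistry()
v1 = registry.register("Mammograms", "container/Mammograms_initial/")
v2 = registry.register("Mammograms", "container/Mammograms_updated/")

# An experiment that pinned version "1" still sees the original data.
print(registry.get("Mammograms", "1"))
```

Registering a second version does not change what version "1" resolves to, which is the property that makes an experiment reproducible.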

3. Centralized Data Management:

	•	Storing data as a data asset in Azure ML centralizes your data management. The Azure ML workspace becomes the place where you organize, track, and access all datasets. You don’t have to manually reference paths and connections every time you work with the data in different experiments or pipelines.
	•	This is especially helpful when working in teams, as everyone can easily access the same dataset through Azure ML, reducing errors from working with mismatched paths or versions.

4. Data Sharing and Collaboration:

	•	If you are working in a collaborative environment, data assets make it easier to share datasets with other team members or teams within Azure ML. When you register a dataset as a data asset, it becomes available for use by anyone with the appropriate access in the Azure ML workspace.
	•	If you didn’t use data assets and only referenced Blob Storage directly, sharing the data would typically mean distributing storage account keys or SAS tokens, which is less flexible and can lead to security concerns.

5. Security and Access Control:

	•	Azure ML Data Assets allow you to manage access control and security more easily than direct access to Blob Storage. Azure ML integrates with Microsoft Entra ID (formerly Azure Active Directory) for role-based access control (RBAC), so you can restrict access to the data based on user roles, making it more secure and easier to manage in a multi-user environment.

6. Simplified Workflow:

	•	Using data assets simplifies your workflow. Instead of manually handling the connection to Blob Storage (downloading blobs, managing credentials, paths, etc.), the data asset lets you treat the data as a managed resource within Azure ML.
	•	This is especially useful in automated machine learning pipelines, where you’d prefer not to deal with the complexities of file system management but rather refer to registered datasets in a uniform way.

7. Best Practices in Azure ML:

	•	Azure ML Data Assets are a best practice in Azure ML when working with datasets, as they provide a layer of abstraction that makes data management easier, more scalable, and more secure. Even if you’re directly accessing Blob Storage, it’s often preferable to register the dataset as a data asset to take full advantage of Azure ML’s features.

Conclusion:

Even though you’re directly accessing Azure Blob Storage for loading the images, the Azure ML Data Asset serves as an organizational and management tool within Azure ML:
	•	It simplifies and standardizes data access.
	•	It supports versioning, security, and access control.
	•	It integrates seamlessly with Azure ML pipelines, experiments, and workflows.

By creating the data asset in Azure ML, you’re aligning with best practices that make it easier to manage, share, and secure data across your machine learning projects.


In [None]:
import io
import os

import numpy as np
from azure.storage.blob import ContainerClient
from PIL import Image

# Read the storage connection string from an environment variable instead of
# hard-coding the account key in the notebook (the variable name is an assumption).
connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
container_name = "breastcancermammograms"

# Initialize the container client
container_client = ContainerClient.from_connection_string(connection_string, container_name)

# Function to load and preprocess images from a specific folder
def load_images_from_azure(subfolder_path, label):
    data = []
    labels = []
    blobs = container_client.list_blobs(name_starts_with=f"{subfolder_path}/")
    for blob in blobs:
        # Download the blob content
        blob_data = container_client.download_blob(blob.name).readall()
        # Open the image using PIL
        image = Image.open(io.BytesIO(blob_data)).resize((224, 224)).convert("RGB")
        data.append(np.array(image))
        labels.append(label)
    return data, labels

# Load benign and malignant images
benign_data, benign_labels = load_images_from_azure("Mammograms/Benign", 0)
malignant_data, malignant_labels = load_images_from_azure("Mammograms/Malignant", 1)

# Combine the data
data = np.array(benign_data + malignant_data)
labels = np.array(benign_labels + malignant_labels)

print(f"Loaded {len(data)} images.")
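
The loading loop above opens every blob under the prefix as an image, so a stray non-image blob (for example a notes file or a folder marker) would make Image.open fail. A small filter on the blob name is one way to guard against that. The helper name is_image_blob, the set of extensions, and the example names are assumptions.

```python
import posixpath

# File extensions treated as images; adjust to match the actual dataset.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def is_image_blob(blob_name):
    """Return True if the blob name ends with a known image extension."""
    extension = posixpath.splitext(blob_name)[1].lower()
    return extension in IMAGE_EXTENSIONS


# Hypothetical blob names, used only for illustration.
example_names = [
    "Mammograms/Benign/img_001.PNG",
    "Mammograms/Benign/",
    "Mammograms/Benign/notes.txt",
]
image_names = [name for name in example_names if is_image_blob(name)]
print(image_names)
```

Inside load_images_from_azure, the same check could be applied to blob.name before the download call so that non-image blobs are skipped.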

In [None]:
# Normalize pixel values
data = data.astype('float32') / 255.0

In [None]:
from sklearn.model_selection import train_test_split
import numpy as np

# Split the data into training and testing sets
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.2, random_state=42
)

# Save the datasets as .npy files
np.save("train_data.npy", train_data)
np.save("train_labels.npy", train_labels)
np.save("test_data.npy", test_data)
np.save("test_labels.npy", test_labels)


In [None]:
import joblib
import numpy as np
from azureml.core import Run
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Get the Azure ML run context
run = Run.get_context()

def main():
    # Define the file paths relative to the script
    train_data_path = "train_data.npy"
    train_labels_path = "train_labels.npy"
    test_data_path = "test_data.npy"
    test_labels_path = "test_labels.npy"

    # Load the data
    print("Loading data...")
    train_data = np.load(train_data_path)
    train_labels = np.load(train_labels_path)
    test_data = np.load(test_data_path)
    test_labels = np.load(test_labels_path)
    print("Data successfully loaded.")

    # Flatten the data for Random Forest
    print("Preprocessing data...")
    train_data_flatten = train_data.reshape(len(train_data), -1)
    test_data_flatten = test_data.reshape(len(test_data), -1)

    # Train the Random Forest model
    print("Training the model...")
    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_model.fit(train_data_flatten, train_labels)

    # Predict on the test set
    print("Evaluating the model...")
    test_predictions = rf_model.predict(test_data_flatten)

    # Log metrics
    accuracy = accuracy_score(test_labels, test_predictions)
    print(f"Accuracy: {accuracy}")
    run.log("accuracy", accuracy)

    # Log classification report
    report = classification_report(test_labels, test_predictions, output_dict=True)
    for label, metrics in report.items():
        if isinstance(metrics, dict):
            for metric_name, value in metrics.items():
                run.log(f"{label}_{metric_name}", value)

    # Save the model
    print("Saving the model...")
    joblib.dump(rf_model, "random_forest_model.pkl")
    print("Model saved as random_forest_model.pkl.")

    # Upload the model to Azure ML
    print("Uploading the model to Azure ML...")
    run.upload_file(name="outputs/random_forest_model.pkl", path_or_stream="random_forest_model.pkl")
    print("Model uploaded successfully.")

    # Save this script as a .py file
    print("Saving the script as train.py...")
    script_content = """
import argparse
import joblib
import numpy as np
from azureml.core import Run
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Get the Azure ML run context
run = Run.get_context()

def main():
    train_data_path = "train_data.npy"
    train_labels_path = "train_labels.npy"
    test_data_path = "test_data.npy"
    test_labels_path = "test_labels.npy"

    train_data = np.load(train_data_path)
    train_labels = np.load(train_labels_path)
    test_data = np.load(test_data_path)
    test_labels = np.load(test_labels_path)

    train_data_flatten = train_data.reshape(len(train_data), -1)
    test_data_flatten = test_data.reshape(len(test_data), -1)

    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_model.fit(train_data_flatten, train_labels)

    test_predictions = rf_model.predict(test_data_flatten)

    accuracy = accuracy_score(test_labels, test_predictions)
    run.log("accuracy", accuracy)

    report = classification_report(test_labels, test_predictions, output_dict=True)
    for label, metrics in report.items():
        if isinstance(metrics, dict):
            for metric_name, value in metrics.items():
                run.log(f"{label}_{metric_name}", value)

    joblib.dump(rf_model, "random_forest_model.pkl")
    run.upload_file(name="outputs/random_forest_model.pkl", path_or_stream="random_forest_model.pkl")

if __name__ == "__main__":
    main()
"""
    with open("train.py", "w") as script_file:
        script_file.write(script_content)
    run.upload_file(name="outputs/train.py", path_or_stream="train.py")
    print("Script saved and uploaded successfully.")

# Entry point for the script
if __name__ == "__main__":
    main()

In [None]:
import sys
sys.argv = [
    'train.py',
    '--train_data', 'train_data.npy',
    '--train_labels', 'train_labels.npy',
    '--test_data', 'test_data.npy',
    '--test_labels', 'test_labels.npy'
]
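
Overwriting sys.argv works in a notebook, but argparse can also be given the argument list directly, which leaves the global state untouched. The sketch below mirrors the four arguments used by the training script. The helper name build_parser is hypothetical.

```python
import argparse


def build_parser():
    """Create a parser with the four path arguments used by the training script."""
    parser = argparse.ArgumentParser()
    for name in ("train_data", "train_labels", "test_data", "test_labels"):
        parser.add_argument(f"--{name}", type=str, required=True)
    return parser


# Pass the arguments explicitly instead of modifying sys.argv.
notebook_args = [
    "--train_data", "train_data.npy",
    "--train_labels", "train_labels.npy",
    "--test_data", "test_data.npy",
    "--test_labels", "test_labels.npy",
]
args = build_parser().parse_args(notebook_args)
print(args.train_data, args.test_labels)
```

When the script runs from the command line, calling parse_args() with no list falls back to the real command-line arguments as usual.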

In [None]:
import argparse
import joblib
import numpy as np
from azureml.core import Run
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Get the Azure ML run context
run = Run.get_context()

def main():
    # Parse input arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data", type=str, required=True)
    parser.add_argument("--train_labels", type=str, required=True)
    parser.add_argument("--test_data", type=str, required=True)
    parser.add_argument("--test_labels", type=str, required=True)
    args = parser.parse_args()

    # Load the data
    print("Loading data from datastore...")
    train_data = np.load(args.train_data)
    train_labels = np.load(args.train_labels)
    test_data = np.load(args.test_data)
    test_labels = np.load(args.test_labels)
    print("Data successfully loaded.")

    # Flatten the data for Random Forest
    print("Preprocessing data...")
    train_data_flatten = train_data.reshape(len(train_data), -1)
    test_data_flatten = test_data.reshape(len(test_data), -1)

    # Train the Random Forest model
    print("Training the model...")
    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_model.fit(train_data_flatten, train_labels)

    # Predict on the test set
    print("Evaluating the model...")
    test_predictions = rf_model.predict(test_data_flatten)

    # Log metrics
    accuracy = accuracy_score(test_labels, test_predictions)
    print(f"Accuracy: {accuracy}")
    run.log("accuracy", accuracy)

    # Log classification report
    report = classification_report(test_labels, test_predictions, output_dict=True)
    for label, metrics in report.items():
        if isinstance(metrics, dict):
            for metric_name, value in metrics.items():
                run.log(f"{label}_{metric_name}", value)

    # Save the model
    print("Saving the model...")
    joblib.dump(rf_model, "random_forest_model.pkl")
    print("Model saved as random_forest_model.pkl.")

    # Upload the model to Azure ML
    print("Uploading the model to Azure ML...")
    run.upload_file(name="outputs/random_forest_model.pkl", path_or_stream="random_forest_model.pkl")
    print("Model uploaded successfully.")

if __name__ == "__main__":
    main()
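
The nested loop that logs the classification report can be hard to picture without seeing the metric names it produces. The sketch below applies the same flattening logic to a hand-written dictionary shaped like a report. The numbers are invented to show the structure and are not results from this model, and the helper name flatten_report is hypothetical.

```python
def flatten_report(report):
    """Turn a nested report into a flat {"<label>_<metric>": value} mapping."""
    flat = {}
    for label, metrics in report.items():
        # Top-level scalars such as "accuracy" are not dicts and are skipped,
        # exactly as in the logging loop above.
        if isinstance(metrics, dict):
            for metric_name, value in metrics.items():
                flat[f"{label}_{metric_name}"] = value
    return flat


# Invented values, used only to illustrate the shape of the dictionary.
example_report = {
    "0": {"precision": 0.5, "recall": 1.0},
    "accuracy": 0.75,
    "macro avg": {"precision": 0.6, "recall": 0.8},
}
flat_metrics = flatten_report(example_report)
for metric_name in flat_metrics:
    print(metric_name)
```

Each key in the flattened mapping corresponds to one run.log call in the training script, which is why accuracy is logged separately before the loop.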

In [None]:
from azureml.core.compute import ComputeTarget

compute_name = "naiks011"  # Replace with your actual compute instance name

# Retrieve the existing compute instance
compute_target = ComputeTarget(workspace=ws, name=compute_name)
print(f"Using compute target: {compute_name}")

In [None]:
from azureml.core import Datastore

datastore = Datastore.get(ws, datastore_name="datalake_breastcancer")
datastore.upload_files(
    files=["train_data.npy", "train_labels.npy", "test_data.npy", "test_labels.npy"],
    target_path="training_data",
    overwrite=True,
)
print("Files uploaded to datastore.")

In [None]:
pip install azureml-sdk

In [None]:
from azureml.pipeline.steps import PythonScriptStep
from azureml.data.datapath import DataPath

In [None]:
train_step = PythonScriptStep(
    name="Train Random Forest Model",
    script_name="train.py",
    arguments=[
        "--train_data", DataPath(datastore, "training_data/train_data.npy"),
        "--train_labels", DataPath(datastore, "training_data/train_labels.npy"),
        "--test_data", DataPath(datastore, "training_data/test_data.npy"),
        "--test_labels", DataPath(datastore, "training_data/test_labels.npy"),
    ],
    compute_target=compute_target,
    source_directory="./pipeline_scripts",  # Directory that contains train.py
)

In [None]:
# Create and submit the pipeline
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[train_step])
print("Pipeline created successfully.")

pipeline_run = pipeline.submit("breast-cancer-detection-pipeline")
pipeline_run.wait_for_completion(show_output=True)